Now in early access

Text extraction for AI agents.

One API call. Send a PDF, DOCX, email or message and get back the plain text your agent can actually read, with structure preserved.

Request access → See the output

02 The problem

Agents should reason over files, not fight them.

Most agentic workflows still waste time getting files into a usable shape. vena8 moves extraction out of the prompt and into a purpose-built API that exposes the internal structure of files before the agent decides what to read next.

01 · LLM TAX

Stop parsing with prompts.

Use models for judgement and reasoning, not routine text extraction. Every parsing prompt is latency, cost and another source of error.

02 · CONSISTENT

One shape across formats.

PDF, DOCX, EML and HTML all return the same response shape. Workflows stop branching on file type.

03 · STRUCTURED

Text, not text soup.

Headings, lists, tables and paragraphs come through as readable text. Page breaks are marked. No leftover formatting noise.

04 · FAIL CLEAN

Honest failure states.

If a file is encrypted, corrupted or unsupported, you get a structured error, not a hallucinated string of content.

03 The output

One response. Just the text.

Send a file, get a single JSON response with the extracted text and minimal metadata. No schemas to learn, no nested query language, no per-format branching in your code.

The text field is what most agent workflows want: clean, readable text with structure preserved through Markdown-style cues for headings, lists and tables.

Page boundaries are marked so a follow-up step can chunk by page if needed. Detected language, page count and source MIME type round out the response.
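As a rough sketch of that follow-up chunking step: the exact page-boundary marker is not stated on this page, so the example below assumes a form-feed character between pages. Both `PAGE_MARKER` and `split_pages` are hypothetical names; swap in whatever marker the API actually emits.

```python
# Assumption: pages are separated by a form feed. The real marker may differ.
PAGE_MARKER = "\f"


def split_pages(text: str, marker: str = PAGE_MARKER) -> list[str]:
    """Split extracted text into per-page chunks, dropping empty pages."""
    return [page.strip() for page in text.split(marker) if page.strip()]


if __name__ == "__main__":
    sample = "# Title\n\nFirst page.\fSecond page.\f"
    for number, page in enumerate(split_pages(sample), start=1):
        print(number, repr(page))
```

Keeping the marker as a parameter means the chunker does not need to change if the boundary format does.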

text · pages · language · mime · characters · warnings[]

extract.json 200 OK

{
  "file": {
    "name": "quarterly-report.docx",
    "mime": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "pages": 14
  },
  "text": "# Q3 2026 Financial Results\n\nReported revenue of $148.2M, up 22% year-over-year...",
  "meta": {
    "language": "en",
    "characters": 42081,
    "images": 7,
    "embedded_files": 3
  },
  "warnings": []
}

warnings[] is empty on clean files.
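One way a client might read this response, sketched with the standard library only. The field names are taken from the sample above; the `Extraction` dataclass, the `parse_extraction` helper and the choice of which fields to treat as optional are assumptions for illustration, not part of the API.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Extraction:
    name: str
    mime: str
    pages: int
    text: str
    language: str
    characters: int
    warnings: list = field(default_factory=list)


def parse_extraction(payload: str) -> Extraction:
    """Parse a JSON response body shaped like the sample above."""
    data = json.loads(payload)
    file_info = data["file"]
    meta = data.get("meta", {})  # assumption: meta may be absent
    return Extraction(
        name=file_info["name"],
        mime=file_info["mime"],
        pages=file_info["pages"],
        text=data["text"],
        language=meta.get("language", ""),
        characters=meta.get("characters", 0),
        warnings=list(data.get("warnings", [])),
    )


# A trimmed copy of the sample response, built with json.dumps to avoid
# hand-escaping the newlines in the text field.
SAMPLE = json.dumps({
    "file": {
        "name": "quarterly-report.docx",
        "mime": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "pages": 14,
    },
    "text": "# Q3 2026 Financial Results\n\nReported revenue of $148.2M...",
    "meta": {"language": "en", "characters": 42081},
    "warnings": [],
})
```

An agent step would typically pass `parse_extraction(body).text` to the model and keep the remaining fields for routing or logging.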

04 Supported formats

Every common business file, handled.

v1 focuses on the document types agents see most: native PDFs, the Office suite, plain-text variants and single-message email formats. One endpoint handles all of them.

PDF

Native PDFs

Text-layer extraction for born-digital PDFs. Headings, paragraphs and page boundaries preserved. Scanned PDFs are not supported in v1.

.pdf

OFFICE

Word, PowerPoint, Excel

DOCX bodies, PPTX slides (including speaker notes), XLSX sheets flattened into Markdown-style tables.

.docx .pptx .xlsx
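If a later step needs rows rather than text, a Markdown-style table can be read back into records. The exact table syntax is not specified here, so this sketch assumes pipe-delimited rows with a separator line under the header. `parse_markdown_table` is a hypothetical helper, and it does not handle escaped pipes inside cells.

```python
def parse_markdown_table(block: str) -> list[dict[str, str]]:
    """Turn a pipe-delimited Markdown table into a list of row dicts."""
    lines = [line.strip() for line in block.strip().splitlines() if line.strip()]
    if len(lines) < 2:
        return []

    def cells(line: str) -> list[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = cells(lines[0])
    # lines[1] is assumed to be the |---|---| separator row and is skipped.
    return [dict(zip(header, cells(line))) for line in lines[2:]]


if __name__ == "__main__":
    # Illustrative input only, not real API output.
    table = "| Region | Units |\n| --- | --- |\n| North | 10 |\n| South | 12 |"
    print(parse_markdown_table(table))
```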

EMAIL

Single messages

EML and MSG inputs. Headers, body and attachment names extracted as one clean transcript per message.

.eml .msg

PLAIN TEXT

Text formats & images

TXT, MD, HTML and RTF are passed through cleanly. Embedded images in any file are surfaced as referenced assets in the response.

.txt .md .html .rtf

05 Built for agents

Predictable enough to automate against.

The API is designed for machine consumption first: stable fields, repeatable extraction, selective expansion, composable exports and first-class errors.

POST /extract

Synchronous extraction. Upload a file and receive the response inline. Best for documents under 25 MB; typical latency is sub-second.

Extraction json
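A minimal sketch of what a synchronous upload could look like from Python, using only the standard library. The host (`api.example.com`), the bearer-token header and the multipart field name `file` are all placeholders, since none of them is documented on this page. The function builds the request without sending it.

```python
import urllib.request
import uuid

BASE_URL = "https://api.example.com"  # placeholder host, not the real one


def build_extract_request(
    filename: str, content: bytes, api_key: str
) -> urllib.request.Request:
    """Build a multipart/form-data POST to /extract (not sent here)."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # Assumption: the upload field is called "file".
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{BASE_URL}/extract",
        data=body,
        method="POST",
        headers={
            # Assumption: bearer-token auth.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned request to `urllib.request.urlopen` would perform the call; the response body would then be the JSON document shown under "The output".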

POST /extract/async

Background extraction for large files. Returns a job id immediately. Poll the job or register a webhook for completion. Handles files up to 1 GB.

Job json
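The two size limits suggest a simple routing rule on the client. The sketch below assumes decimal units (1 MB = 1,000,000 bytes) and treats the 25 MB figure as a cutoff, although the page only calls it a recommendation. `choose_endpoint` is a hypothetical helper.

```python
# Assumption: decimal units. The API may count in MiB/GiB instead.
SYNC_LIMIT_BYTES = 25 * 1000 * 1000        # "under 25 MB"
ASYNC_LIMIT_BYTES = 1000 * 1000 * 1000     # "up to 1 GB"


def choose_endpoint(size_bytes: int) -> str:
    """Pick the sync or async extraction path from the file size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < SYNC_LIMIT_BYTES:
        return "/extract"
    if size_bytes <= ASYNC_LIMIT_BYTES:
        return "/extract/async"
    raise ValueError("file exceeds the 1 GB limit for async extraction")
```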

GET /jobs/{id}

Check job status and retrieve results. Standard job lifecycle: queued → running → succeeded or failed. The succeeded response includes the same shape as synchronous extract.

Job · Extraction json
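A polling loop over that lifecycle might look like the sketch below. It assumes the job document carries a `status` field holding one of the four state names, which this page does not confirm. The HTTP call is injected as `fetch_job` so the loop stays independent of any particular client; `wait_for_job` is a hypothetical helper.

```python
import time
from typing import Callable

TERMINAL_STATES = {"succeeded", "failed"}


def wait_for_job(
    fetch_job: Callable[[str], dict],
    job_id: str,
    interval: float = 1.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_job(job_id) until the job reaches a terminal state."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(job_id)
        # Assumption: the lifecycle state lives in a "status" field.
        if job.get("status") in TERMINAL_STATES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        sleep(interval)
```

Where a webhook is registered instead, this loop is unnecessary; polling is the simpler option for scripts and tests.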

GET /formats

List currently supported types. Useful for client-side gating; returns the canonical list with status flags so you can fail fast on inputs we don't yet handle.

FormatList json
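Client-side gating could be as small as the check below. The FormatList shape is not shown on this page, so the sketch assumes a list of entries with `extension` and `status` keys and a status value of `"supported"`; all three are guesses to adjust against the real response. `is_supported` is a hypothetical helper.

```python
from pathlib import Path


def is_supported(filename: str, formats: list[dict]) -> bool:
    """Return True if the file's extension is flagged as supported."""
    extension = Path(filename).suffix.lower()
    return any(
        entry.get("extension") == extension and entry.get("status") == "supported"
        for entry in formats
    )


if __name__ == "__main__":
    # Illustrative data only; the real /formats response may look different.
    formats = [
        {"extension": ".pdf", "status": "supported"},
        {"extension": ".odt", "status": "planned"},
    ]
    print(is_supported("report.PDF", formats))  # extension match is case-insensitive
    print(is_supported("notes.odt", formats))
```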

06 Early access · v1

Ship the file-reading part of your agent in an afternoon.

v1 is one endpoint and one response shape. Drop it in, stop maintaining ten different parsers, and get on with the actual product.

v1 · onboarding now

Want plain text from every file your agent sees?

Request access → Talk to engineering