v1.0 · Text extraction for AI agents

Clean text from any document.

One API call. Send a PDF, DOCX, email, image or scan — get back the plain text your agent can actually read, with structure preserved.

40+
Document, email, image and slide formats supported on day one.
1 call
A single endpoint. Send a file, get back plain text — no chain to orchestrate.
0‑prompt
Extraction runs without an LLM in the loop — deterministic and repeatable.
~80 ms
Typical per-page latency for native documents; OCR available for scans.
02 The problem

Agents should reason over files, not fight them.

Most agentic workflows still waste time getting files into a usable shape. vena8 moves extraction out of the prompt and into a purpose-built API that exposes the internal structure of files before the agent decides what to read next.

01 · LLM TAX

Stop parsing with prompts.

Use models for judgement and reasoning, not routine text extraction. Every parsing prompt is latency, cost and another source of error.

02 · CONSISTENT

One shape across formats.

PDF, DOCX, EML, HTML, images — they all return the same response shape. Workflows stop branching on file type.

03 · STRUCTURED

Text, not text soup.

Headings, lists, tables and paragraphs come through as readable text. Page breaks are marked. No leftover formatting noise.

04 · FAIL CLEAN

Honest failure states.

If a file is encrypted, corrupted or unsupported, you get a structured error — not a hallucinated string of content.

03 The output

One response. Just the text.

Send a file, get a single JSON response with the extracted text and minimal metadata. No schemas to learn, no nested query language, no per-format branching in your code.

The text field is what most agent workflows want: clean, readable text with structure preserved through Markdown-style cues for headings, lists and tables.

Page boundaries are marked so a follow-up step can chunk by page if needed. Detected language, page count and source MIME type round out the response.

text · pages · language · mime · characters · warnings[]
extract.json 200 OK
{
  "file": {
    "name": "quarterly-report.docx",
    "mime": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "pages": 14
  },
  "text": "# Q3 2026 Financial Results\n\nReported revenue of $148.2M, up 22% year-over-year...",
  "meta": {
    "language": "en",
    "characters": 42081,
    "images": 7,
    "embedded_files": 3,
    "ocr_used": false
  },
  "warnings": [ /* empty on clean files */ ]
}
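
A response like the one above can be read with a few lines of client code. The following is a minimal Python sketch, assuming the field names shown in the sample response. The `Extraction` class and the `split_pages` helper are hypothetical names, not part of the API, and the page-boundary marker is passed in as a parameter because its exact form is not specified here.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Extraction:
    """Flat view of the sample response; field names mirror the JSON above."""
    name: str
    mime: str
    pages: int
    text: str
    language: str
    characters: int
    ocr_used: bool
    warnings: list = field(default_factory=list)

    @classmethod
    def from_json(cls, raw: str) -> "Extraction":
        doc = json.loads(raw)
        file_info, meta = doc["file"], doc["meta"]
        return cls(
            name=file_info["name"],
            mime=file_info["mime"],
            pages=file_info["pages"],
            text=doc["text"],
            language=meta["language"],
            characters=meta["characters"],
            ocr_used=meta["ocr_used"],
            warnings=doc.get("warnings", []),
        )


def split_pages(text: str, marker: str) -> list[str]:
    """Split extracted text on a page-boundary marker.

    The marker is an assumption supplied by the caller; inspect a real
    response to see how page breaks are actually marked.
    """
    return [page.strip() for page in text.split(marker) if page.strip()]


# Trimmed copy of the sample response, used only to exercise the parser.
sample = json.dumps({
    "file": {
        "name": "quarterly-report.docx",
        "mime": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "pages": 14,
    },
    "text": "# Q3 2026 Financial Results\n\nReported revenue of $148.2M",
    "meta": {"language": "en", "characters": 42081, "images": 7,
             "embedded_files": 3, "ocr_used": False},
    "warnings": [],
})
result = Extraction.from_json(sample)
```

An agent can then check `result.warnings` and `result.ocr_used` before deciding how much to trust `result.text`, and call `split_pages` with whatever marker the real output uses if it needs per-page chunks.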
04 Supported formats

Every common business file, handled.

v1 focuses on the document types agents see most: native and scanned PDFs, the Office suite, plain-text variants, emails and common images. One endpoint handles all of them.

PDF

Native & scanned PDFs

Text-layer extraction for born-digital PDFs; OCR pipeline kicks in automatically for scans and image-only pages.

.pdf · + ocr
OFFICE

Word, PowerPoint, Excel

DOCX bodies, PPTX slides (including speaker notes), XLSX sheets flattened into Markdown-style tables.

.docx · .pptx · .xlsx
EMAIL

Single messages

EML and MSG inputs — headers, body and attachment names extracted as one clean transcript per message.

.eml · .msg
TEXT & IMAGE

Plain text and images

TXT, MD, HTML, RTF passed through cleanly. PNG, JPG and TIFF routed through OCR with confidence scoring.

.txt · .html · .png · .jpg
05 Built for agents

Predictable enough to automate against.

The API is designed for machine consumption first: stable fields, repeatable extraction, selective expansion, composable exports and first-class errors.

POST /extract
Synchronous extraction. Upload a file and receive the response inline. Best for documents under 25 MB; typical latency is sub-second.
Extraction json
POST /extract/async
Background extraction for large files. Returns a job ID immediately. Poll the job or register a webhook for completion. Handles files up to 1 GB.
Job json
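
The two size limits suggest a simple client-side routing rule. Here is a minimal Python sketch of one way to pick between the synchronous and asynchronous endpoints. The `choose_endpoint` helper is a hypothetical name, and reading "MB" and "GB" as binary units is an assumption; adjust the constants if the service counts decimal bytes.

```python
SYNC_LIMIT_BYTES = 25 * 1024 * 1024       # "under 25 MB", read as MiB (assumption)
ASYNC_LIMIT_BYTES = 1024 * 1024 * 1024    # "up to 1 GB", read as GiB (assumption)


def choose_endpoint(size_bytes: int) -> str:
    """Return the endpoint path suited to a file of the given size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < SYNC_LIMIT_BYTES:
        return "/extract"
    if size_bytes <= ASYNC_LIMIT_BYTES:
        return "/extract/async"
    raise ValueError("file exceeds the 1 GB limit for async extraction")
```

Calling `choose_endpoint(os.path.getsize(path))` before uploading keeps large files off the synchronous path and rejects oversized ones before any bytes are sent.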
GET /jobs/{id}
Check job status and retrieve results. Standard job lifecycle: queued → running → succeeded or failed. A succeeded job returns the same response shape as synchronous extraction.
Job · Extraction json
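
The lifecycle above maps onto a short polling loop. This is a minimal Python sketch, assuming the job payload carries its state in a `status` field (an assumed name). `wait_for_job` is a hypothetical helper, and the HTTP call is injected as `fetch_job` so the loop stays independent of any particular client library.

```python
import time
from typing import Callable

TERMINAL_STATES = {"succeeded", "failed"}


def wait_for_job(
    fetch_job: Callable[[], dict],
    interval_s: float = 1.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal state, then return that payload.

    fetch_job is supplied by the caller; it should perform GET /jobs/{id}
    and return the decoded JSON. The "status" key is an assumed field name.
    """
    for _ in range(max_polls):
        job = fetch_job()
        if job.get("status") in TERMINAL_STATES:
            return job
        sleep(interval_s)
    raise TimeoutError(f"job not finished after {max_polls} polls")


# Usage with a stubbed fetcher that walks through the documented lifecycle.
_states = iter(["queued", "running", "succeeded"])
final = wait_for_job(lambda: {"status": next(_states)}, sleep=lambda _: None)
```

Capping the number of polls keeps an agent from waiting forever on a stuck job; where a webhook is registered, the loop is not needed at all.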
GET /formats
List currently supported types. Useful for client-side gating: it returns the canonical list with status flags, so you can fail fast on inputs we don't yet handle.
FormatList json
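
Client-side gating on the format list might look like the sketch below. The shape of a FormatList entry is not documented here, so the `extension` and `status` keys, the `"supported"` flag value and both helper names are assumptions, and the sample entries are illustrative data rather than the service's real list.

```python
from pathlib import PurePath


def supported_extensions(format_list: list[dict]) -> set[str]:
    """Collect extensions whose status flag marks them as usable.

    Assumed entry shape: {"extension": ".pdf", "status": "supported"}.
    """
    return {
        entry["extension"].lower()
        for entry in format_list
        if entry.get("status") == "supported"
    }


def can_extract(filename: str, allowed: set[str]) -> bool:
    """Return True when the file's extension is in the allowed set."""
    return PurePath(filename).suffix.lower() in allowed


# Illustrative entries only; a real client would load these from GET /formats.
formats = [
    {"extension": ".pdf", "status": "supported"},
    {"extension": ".docx", "status": "supported"},
    {"extension": ".odt", "status": "planned"},
]
allowed = supported_extensions(formats)
```

Fetching the list once at start-up and caching the resulting set lets an agent skip unsupported files without spending an upload on them.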
06 Early access · v1

Ship the file-reading part of your agent in an afternoon.

v1 is one endpoint and one response shape. Drop it in, stop maintaining ten different parsers, and get on with the actual product.

v1 · onboarding now
Want plain text from every file your agent sees?
Request access · Talk to engineering