Clean text from any document.
One API call. Send a PDF, DOCX, email, image or scan — get back the plain text your agent can actually read, with structure preserved.
Agents should reason over files, not fight them.
Most agentic workflows still waste time getting files into a usable shape. vena8 moves extraction out of the prompt and into a purpose-built API that exposes the internal structure of files before the agent decides what to read next.
Stop parsing with prompts.
Use models for judgement and reasoning, not routine text extraction. Every parsing prompt is latency, cost and another source of error.
One shape across formats.
PDF, DOCX, EML, HTML, images — they all return the same response shape. Workflows stop branching on file type.
Text, not text soup.
Headings, lists, tables and paragraphs come through as readable text. Page breaks are marked. No leftover formatting noise.
Honest failure states.
If a file is encrypted, corrupted or unsupported, you get a structured error — not a hallucinated string of content.
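A structured error is only useful if the calling code checks for it before touching the text. The sketch below shows one way an agent-side helper might do that. The error shape is an assumption: this page does not document it, so the `error` key and its `code` and `message` fields are hypothetical, as are the names `ExtractionError` and `text_or_raise`.

```python
class ExtractionError(Exception):
    """Raised when the response carries a structured error instead of text."""

    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def text_or_raise(payload: dict) -> str:
    """Return the extracted text, or raise if the payload is an error.

    ASSUMPTION: errors arrive as {"error": {"code": ..., "message": ...}}.
    The real field names may differ.
    """
    error = payload.get("error")
    if error is not None:
        raise ExtractionError(
            code=error.get("code", "unknown"),
            message=error.get("message", ""),
        )
    return payload["text"]
```

Raising instead of returning an empty string keeps a failed extraction from being passed to the model as if it were real content.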
One response. Just the text.
Send a file, get a single JSON response with the extracted text and minimal metadata. No schemas to learn, no nested query language, no per-format branching in your code.
The text field is what most agent workflows want: clean, readable text with structure preserved through Markdown-style cues for headings, lists and tables.
Page boundaries are marked so a follow-up step can chunk by page if needed. Detected language, page count and source MIME type round out the response.
{
  "file": {
    "name": "quarterly-report.docx",
    "mime": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "pages": 14
  },
  "text": "# Q3 2026 Financial Results\n\nReported revenue of $148.2M, up 22% year-over-year...",
  "meta": {
    "language": "en",
    "characters": 42081,
    "images": 7,
    "embedded_files": 3,
    "ocr_used": false
  },
  "warnings": [ /* empty on clean files */ ]
}
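To make the response shape concrete, here is a minimal sketch of how a follow-up step might read it. The field names come from the sample response above. Two things are assumptions: the page-break marker (the page says boundaries are marked but does not name the marker, so a form feed is used as a placeholder) and the `needs_review` rule, which is an illustrative policy and not part of the API.

```python
# Sample payload mirroring the example response (text truncated the same way).
SAMPLE = {
    "file": {
        "name": "quarterly-report.docx",
        "mime": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "pages": 14,
    },
    "text": "# Q3 2026 Financial Results\n\nReported revenue of $148.2M, up 22% year-over-year...",
    "meta": {
        "language": "en",
        "characters": 42081,
        "images": 7,
        "embedded_files": 3,
        "ocr_used": False,
    },
    "warnings": [],
}

# ASSUMPTION: placeholder marker; substitute whatever the API actually emits.
PAGE_BREAK = "\f"


def split_pages(text: str, marker: str = PAGE_BREAK) -> list[str]:
    """Chunk extracted text by page using the (assumed) page-break marker."""
    return [page.strip() for page in text.split(marker)]


def describe(response: dict) -> dict:
    """Pull out the fields a routing step is likely to care about."""
    meta = response["meta"]
    return {
        "name": response["file"]["name"],
        "pages": response["file"]["pages"],
        "language": meta["language"],
        # Illustrative policy: flag OCR output or any warning for a closer look.
        "needs_review": meta["ocr_used"] or bool(response["warnings"]),
    }
```

Calling `describe(SAMPLE)` gives a small summary the agent can use to decide whether to read the whole text or only selected pages from `split_pages(SAMPLE["text"])`.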
Every common business file, handled.
v1 focuses on the document types agents see most: native and scanned PDFs, the Office suite, plain-text variants, emails and common images. One endpoint handles all of them.
Native & scanned PDFs
Text-layer extraction for born-digital PDFs; the OCR pipeline kicks in automatically for scans and image-only pages.
Word, PowerPoint, Excel
DOCX bodies, PPTX slides (including speaker notes), XLSX sheets flattened into Markdown-style tables.
Single messages
EML and MSG inputs — headers, body and attachment names extracted as one clean transcript per message.
Plain text and images
TXT, MD, HTML and RTF pass through cleanly. PNG, JPG and TIFF are routed through OCR with confidence scoring.
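Because one endpoint handles every format listed above, the client side can be a single request builder with no per-format branches. The sketch below only constructs the request and does not send it. Everything about the wire format is an assumption: the endpoint URL is a placeholder, and the bearer-token header and raw-bytes body are guesses, since this page does not document how files are uploaded.

```python
import mimetypes
import urllib.request

# ASSUMPTION: placeholder URL; the real endpoint is not given on this page.
ENDPOINT = "https://api.example.invalid/v1/extract"


def build_request(filename: str, data: bytes, api_key: str) -> urllib.request.Request:
    """Build one POST request for any supported file type.

    The MIME type is guessed from the file name; unknown extensions fall
    back to a generic binary type. There are no per-format code paths.
    """
    mime, _ = mimetypes.guess_type(filename)
    return urllib.request.Request(
        ENDPOINT,
        data=data,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": mime or "application/octet-stream",
        },
    )
```

The same function serves a PDF, a text file or a PNG; only the guessed `Content-Type` differs.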
Predictable enough to automate against.
The API is designed for machine consumption first: stable fields, repeatable extraction, selective expansion, composable exports and first-class errors.
Ship the file-reading part of your agent in an afternoon.
v1 is one endpoint and one response shape. Drop it in, stop maintaining ten different parsers, and get on with the actual product.