Text extraction for AI agents.
One API call. Send a PDF, DOCX, email or message and get back the plain text your agent can actually read, with structure preserved.
Agents should reason over files, not fight them.
Most agentic workflows still waste time getting files into a usable shape. vena8 moves extraction out of the prompt and into a purpose-built API that exposes the internal structure of files before the agent decides what to read next.
Stop parsing with prompts.
Use models for judgement and reasoning, not routine text extraction. Every parsing prompt adds latency, cost and another source of error.
One shape across formats.
PDF, DOCX, EML and HTML all return the same response shape, so workflows stop branching on file type.
Text, not text soup.
Headings, lists, tables and paragraphs come through as readable text. Page breaks are marked. No leftover formatting noise.
Honest failure states.
If a file is encrypted, corrupted or unsupported, you get a structured error, not a hallucinated string of content.
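A minimal sketch of how a caller might act on a structured error, in Python. The error envelope is not shown on this page, so the "error" object and its "code" and "message" fields are assumptions made for illustration, as is the ExtractionError class.

```python
import json


class ExtractionError(Exception):
    """Raised when the response carries an error instead of text."""

    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def text_or_raise(raw_response: str) -> str:
    """Return the extracted text, or raise if the response reports a failure.

    Assumption: failures arrive as {"error": {"code": ..., "message": ...}}.
    The real field names may differ.
    """
    data = json.loads(raw_response)
    error = data.get("error")  # hypothetical field name
    if error is not None:
        raise ExtractionError(error.get("code", "unknown"), error.get("message", ""))
    return data["text"]
```

The point of the pattern is that the agent never receives placeholder text for a file that could not be read: it either gets real content or an exception it can branch on.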
One response. Just the text.
Send a file, get a single JSON response with the extracted text and minimal metadata. No schemas to learn, no nested query language, no per-format branching in your code.
The text field is what most agent workflows want: clean, readable text with structure preserved through Markdown-style cues for headings, lists and tables.
Page boundaries are marked so a follow-up step can chunk by page if needed. Detected language, page count and source MIME type round out the response.
{
  "file": {
    "name": "quarterly-report.docx",
    "mime": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "pages": 14
  },
  "text": "# Q3 2026 Financial Results\n\nReported revenue of $148.2M, up 22% year-over-year...",
  "meta": {
    "language": "en",
    "characters": 42081,
    "images": 7,
    "embedded_files": 3
  },
  "warnings": [ /* empty on clean files */ ]
}
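A short sketch of consuming a response shaped like the sample above, in Python. The field names come from the sample; the page marker does not. The page says boundaries are marked but not how, so the form-feed character used here is a placeholder assumption.

```python
import json
from dataclasses import dataclass, field

# Assumption: pages are separated by a form-feed character. Swap in the
# real marker once it is known.
PAGE_MARKER = "\f"


@dataclass
class Extraction:
    name: str
    mime: str
    pages: int
    text: str
    language: str
    warnings: list = field(default_factory=list)


def parse_extraction(raw_response: str) -> Extraction:
    """Map the JSON response onto a small typed record."""
    data = json.loads(raw_response)
    return Extraction(
        name=data["file"]["name"],
        mime=data["file"]["mime"],
        pages=data["file"]["pages"],
        text=data["text"],
        language=data["meta"]["language"],
        warnings=data.get("warnings", []),
    )


def chunk_by_page(text: str, marker: str = PAGE_MARKER) -> list[str]:
    """Split extracted text into per-page chunks, dropping empty pages."""
    return [page.strip() for page in text.split(marker) if page.strip()]
```

Because every format returns the same shape, the same two functions would serve a PDF, a DOCX or an EML input without any per-format branch.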
Every common business file, handled.
v1 focuses on the document types agents see most: native PDFs, the Office suite, plain-text variants and single-message email formats. One endpoint handles all of them.
Native PDFs
Text-layer extraction for born-digital PDFs. Headings, paragraphs and page boundaries preserved. Scanned PDFs are not supported in v1.
Word, PowerPoint, Excel
DOCX bodies, PPTX slides (including speaker notes) and XLSX sheets, with sheets flattened into Markdown-style tables.
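To make "flattened into Markdown-style tables" concrete, here is a sketch in Python of what such flattening could look like for one sheet. It illustrates the idea rather than vena8's actual output format; the function name and the sample rows are made up.

```python
def rows_to_markdown(rows: list[list[object]]) -> str:
    """Render sheet rows as a Markdown-style table, first row as header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def cell(value: object) -> str:
        # Empty cells become empty strings; pipes and newlines are
        # neutralised so they cannot break the table layout.
        text = "" if value is None else str(value)
        return text.replace("|", "\\|").replace("\n", " ")

    def line(row: list[object]) -> str:
        # Pad short rows so every line has the same number of columns.
        padded = list(row) + [None] * (width - len(row))
        return "| " + " | ".join(cell(value) for value in padded) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in range(width)) + " |"
    return "\n".join([line(header), separator, *(line(row) for row in body)])


# Illustrative rows only.
print(rows_to_markdown([["Region", "Revenue"], ["EMEA", 12], ["APAC", 9]]))
```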
Single messages
EML and MSG inputs. Headers, body and attachment names extracted as one clean transcript per message.
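For a sense of what "one clean transcript per message" means, the sketch below builds a comparable transcript from an EML file using only Python's standard email package. It is a local approximation under assumed choices (which headers to keep, plain-text body only), not the service's implementation, and it does not cover MSG.

```python
from email import message_from_bytes, policy

# Assumption: these four headers are the ones worth keeping.
HEADERS = ("From", "To", "Date", "Subject")


def eml_to_transcript(raw: bytes) -> str:
    """Flatten one EML message into headers, attachment names and body."""
    msg = message_from_bytes(raw, policy=policy.default)

    lines = [f"{name}: {msg[name]}" for name in HEADERS if msg[name]]

    # Only attachment names are listed; attachment contents are ignored.
    attachments = [
        part.get_filename() for part in msg.iter_attachments() if part.get_filename()
    ]
    if attachments:
        lines.append("Attachments: " + ", ".join(attachments))

    # Prefer the plain-text body; fall back to an empty body if none exists.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content().strip() if body_part is not None else ""

    return "\n".join(lines) + "\n\n" + body
```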
Text formats & images
TXT, MD, HTML and RTF are passed through cleanly. Embedded images in any file are surfaced as referenced assets in the response.
Predictable enough to automate against.
The API is designed for machine consumption first: stable fields, repeatable extraction, selective expansion, composable exports and first-class errors.
Ship the file-reading part of your agent in an afternoon.
v1 is one endpoint and one response shape. Drop it in, stop maintaining ten different parsers, and get on with the actual product.