Markdown Converter Guide
Turn PDF and DOCX documents into clean Markdown without ever uploading them. Everything runs in your browser.
1. Overview
Markdown Converter reads PDF and DOCX files directly in your browser and turns them into clean Markdown. PDF text is extracted with pdf.js, DOCX is parsed with mammoth.js, and both are translated into plain Markdown headings, lists, links, and tables.
The tool is built for the same workflows the rest of ASD123.ai targets: feeding documents into local AI chats, preparing snippets for the Optimizer or Anonymizer, or simply archiving content in a portable text format.
Best effort, not perfect: PDFs do not always carry semantic structure. Headings are detected by font size, so unusual layouts may need a quick review after conversion.
2. Basic Workflow
Pick a document
Drag a PDF or DOCX into the drop area, or use the Choose file button. With the OCR engine you can also drop image files (PNG, JPG, WebP, BMP). The maximum size is 25 MB.
Adjust the options
Toggle heading detection, formatting, tables, links, and whitespace before or after conversion. Changing an option re-runs the conversion automatically.
Review and export
Switch between the Raw Markdown and Preview tabs, then copy to clipboard or download a .md file.
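The checks in the first step can be pictured as a small gate in front of the converter. The sketch below is illustrative only: `checkFile`, `MAX_BYTES`, and the extension lists are hypothetical names, not the tool's actual code. It assumes that 25 MB means 25 × 1024 × 1024 bytes and that .jpeg is accepted alongside .jpg.

```typescript
// Hypothetical sketch of the file checks described above; not the tool's real code.
const MAX_BYTES = 25 * 1024 * 1024; // assumption: "25 MB" is counted in binary megabytes

const DOCUMENT_EXTENSIONS = ["pdf", "docx"];
// Assumption: "jpeg" is treated like "jpg"; the guide only lists PNG, JPG, WebP, BMP.
const IMAGE_EXTENSIONS = ["png", "jpg", "jpeg", "webp", "bmp"];

type Verdict = { ok: true } | { ok: false; reason: string };

function checkFile(name: string, sizeBytes: number, ocrSelected: boolean): Verdict {
  // Take everything after the last dot as the extension, case-insensitively.
  const dot = name.lastIndexOf(".");
  const ext = dot >= 0 ? name.slice(dot + 1).toLowerCase() : "";

  if (sizeBytes > MAX_BYTES) {
    return { ok: false, reason: "File is larger than 25 MB." };
  }
  if (DOCUMENT_EXTENSIONS.indexOf(ext) !== -1) {
    return { ok: true };
  }
  if (IMAGE_EXTENSIONS.indexOf(ext) !== -1) {
    // Images are only readable when the OCR engine is selected.
    return ocrSelected
      ? { ok: true }
      : { ok: false, reason: "Image files need the OCR engine." };
  }
  return { ok: false, reason: "Unsupported file type." };
}
```

With these assumptions, `checkFile("scan.png", 1024, false)` is rejected with a hint to switch to OCR, while the same file passes once OCR is selected.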
3. Supported Formats
PDF
Read with pdf.js. Text-based PDFs work best. Headings are inferred from larger font sizes, and bulleted and numbered lists are recognized by common markers.
DOCX
Parsed with mammoth.js. Word styles such as Title, Heading 1–6, and Quote are mapped to the matching Markdown elements. Bold, italic, links, lists, and tables are preserved.
Images & scanned PDFs (OCR engine)
Switch the parsing engine to OCR to read PNG, JPG, WebP, or BMP images and scanned PDFs that have no text layer. Recognition runs locally with PP-OCRv6; see the engines section below.
4. Parsing Engines
For PDFs, the PDF parsing engine selector offers four choices. The default EdgeParse is a Rust engine compiled to WebAssembly that handles multi-column pages and tables well and emits Markdown directly. LiteParse is a second high-accuracy WebAssembly engine, Standard (pdf.js) is the lightest option with no extra download, and OCR reads scanned PDFs and images that have no text layer. Every engine loads locally the first time you pick it; everything runs entirely in your browser.
High accuracy · EdgeParse (default)
A Rust engine that emits Markdown directly, with XY-cut reading order and native GitHub-flavoured tables. Strong on borderless and complex tables, and the recommended default for most documents. One-time ~2.7 MB module.
High accuracy · LiteParse
Exposes the exact position and font size of every line. The converter groups text into columns by real x-coordinates, so a left column is read fully before the right one. One-time ~4 MB module.
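To make the column idea concrete, here is a minimal sketch of grouping lines by x-coordinate. It is not LiteParse's or the converter's actual algorithm. `PositionedLine`, `readColumnsInOrder`, and the `gap` threshold are hypothetical, and the sketch assumes `y` is measured from the top of the page.

```typescript
// Hypothetical sketch: read a left column fully before the right one.
interface PositionedLine {
  text: string;
  x: number; // left edge of the line
  y: number; // assumption: distance from the top of the page
}

function readColumnsInOrder(lines: PositionedLine[], gap: number = 50): string[] {
  // Sort a copy by x so lines that share a left edge end up next to each other.
  const byX = lines.slice().sort((a, b) => a.x - b.x);

  // Start a new column whenever the left edge jumps by more than `gap`.
  const columns: PositionedLine[][] = [];
  let lastX = -Infinity;
  for (const line of byX) {
    if (columns.length === 0 || line.x - lastX > gap) {
      columns.push([]);
    }
    columns[columns.length - 1].push(line);
    lastX = line.x;
  }

  // Within each column, read top to bottom.
  const ordered: string[] = [];
  for (const column of columns) {
    column.sort((a, b) => a.y - b.y);
    for (const line of column) {
      ordered.push(line.text);
    }
  }
  return ordered;
}
```

A reader that only sorts by vertical position would interleave the two columns of a page. Grouping by x first keeps each column together.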
Standard · pdf.js
The built-in reader. The fastest option with no extra download, fine for simple single-column PDFs. Headings are detected from font sizes.
OCR · PP-OCRv6
Optical character recognition for scanned PDFs and image files (PNG, JPG, WebP, BMP) that have no selectable text. Runs the PP-OCRv6 Tiny model on onnxruntime-web. One-time ~18 MB download on first use.
Which one to pick
EdgeParse is the default and the best choice for table-heavy documents. Try LiteParse for multi-column text, or when a specific PDF renders better there. Standard is the quickest for plain text, and OCR is the choice for scans and images. If a high-accuracy engine fails to load or parse, the converter quietly falls back to pdf.js so you always get a result.
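The fallback behaviour can be sketched as a simple try-then-retry wrapper. This is only an illustration of the pattern. `Parser` and `convertWithFallback` are hypothetical names, and the two parser arguments stand in for a high-accuracy engine and the pdf.js reader.

```typescript
// Hypothetical sketch of "fall back to the standard reader if the preferred engine fails".
type Parser = (data: Uint8Array) => Promise<string>;

interface ConversionResult {
  markdown: string;
  usedFallback: boolean;
}

async function convertWithFallback(
  data: Uint8Array,
  preferred: Parser, // stands in for EdgeParse or LiteParse
  standard: Parser   // stands in for the pdf.js reader
): Promise<ConversionResult> {
  try {
    // Covers both a module that fails to load and a parse that throws.
    return { markdown: await preferred(data), usedFallback: false };
  } catch {
    // Any failure in the preferred engine leads to one attempt with the standard reader.
    return { markdown: await standard(data), usedFallback: true };
  }
}
```

Returning a `usedFallback` flag is a design choice of this sketch: it would let an interface mention the switch, even though the guide describes the real fallback as quiet.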
Always local: Every engine runs entirely in your browser, and your file never leaves your device. EdgeParse, LiteParse, and Standard apply to PDF only. OCR also handles image files, and DOCX always uses mammoth.js.
On some PDFs that use embedded subset fonts, extracted words can show extra spaces between letters (for example inside table cells). This comes from how the font stores character widths. The best workarounds are switching engines, using OCR, or cleaning up in the Raw view.
Engines & versions
| Engine | Used for | Version | Source |
|---|---|---|---|
| EdgeParse (default) | PDF — high accuracy | 0.2.5 | GitHub |
| LiteParse | PDF — high accuracy | 2.0.8 | GitHub |
| pdf.js | PDF — standard | 6.0.227 | GitHub |
| PP-OCRv6 Tiny | Scanned PDFs & images — OCR | v6 (onnxruntime-web 1.26) | GitHub |
| mammoth.js | DOCX | 1.12.0 | GitHub |
Engines are self-hosted at pinned versions and updated manually after testing — nothing is fetched from third-party CDNs at runtime.
5. Conversion Options
PDF heading detection
Auto groups text by font size and promotes the larger lines to Markdown headings. Off emits every line as a regular paragraph.
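As a rough illustration of font-size grouping, the sketch below treats the most common size as body text and promotes larger sizes to headings. It is an assumption about how such a heuristic could work, not the converter's actual code. `TextLine` and `detectHeadings` are hypothetical names.

```typescript
// Hypothetical sketch of heading detection by font size.
interface TextLine {
  text: string;
  fontSize: number;
}

function detectHeadings(lines: TextLine[]): string[] {
  // Weight each font size by how many characters use it; the heaviest is the body size.
  const weight = new Map<number, number>();
  for (const line of lines) {
    weight.set(line.fontSize, (weight.get(line.fontSize) ?? 0) + line.text.length);
  }
  let bodySize = 0;
  let heaviest = -1;
  weight.forEach((chars, size) => {
    if (chars > heaviest) {
      heaviest = chars;
      bodySize = size;
    }
  });

  // Distinct sizes above the body size, largest first: rank 0 becomes "#", rank 1 "##", and so on.
  const larger = Array.from(weight.keys())
    .filter((size) => size > bodySize)
    .sort((a, b) => b - a);

  return lines.map((line) => {
    const rank = larger.indexOf(line.fontSize);
    if (rank === -1) return line.text; // body text or smaller: plain paragraph
    const level = Math.min(rank + 1, 6); // Markdown has six heading levels
    return "#".repeat(level) + " " + line.text;
  });
}
```

A heuristic like this shows why unusual layouts need review: a large pull quote or a page number set in big type would be promoted just like a real heading.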
Preserve bold & italic
DOCX only. Turn this off if you want plain text without inline emphasis. PDF emphasis is not preserved because pdf.js does not expose reliable styling info.
Convert tables
DOCX tables are rendered as Markdown pipe tables. Complex layouts (merged cells, nested tables) may need a manual touch-up.
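For reference, a pipe table is just rows of cells joined with `|` plus a divider row under the header. The sketch below shows one way to build one from a grid of strings. `toPipeTable` is a hypothetical helper, and treating the first row as the header is an assumption of this example.

```typescript
// Hypothetical sketch: turn a grid of cell strings into a Markdown pipe table.
function toPipeTable(rows: string[][]): string {
  if (rows.length === 0) return "";
  const width = Math.max(...rows.map((row) => row.length));

  // Pipe tables are line-based, so line breaks become spaces and literal pipes are escaped.
  const clean = (cell: string): string =>
    cell.replace(/\s*\n\s*/g, " ").replace(/\|/g, "\\|").trim();

  const renderRow = (row: string[]): string => {
    const cells: string[] = [];
    for (let i = 0; i < width; i++) {
      // Short rows are padded with empty cells so every row has the same width.
      cells.push(i < row.length ? clean(row[i]) : "");
    }
    return "| " + cells.join(" | ") + " |";
  };

  const divider: string[] = [];
  for (let i = 0; i < width; i++) divider.push("---");

  // Assumption: the first row is the header.
  const lines = [renderRow(rows[0]), "| " + divider.join(" | ") + " |"];
  for (let i = 1; i < rows.length; i++) lines.push(renderRow(rows[i]));
  return lines.join("\n");
}
```

Because each row is a flat list of cells, there is no way to express a merged cell or a nested table, which is why those layouts need manual touch-up.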
Keep hyperlinks
When enabled, link text appears as [text](url). Turning it off keeps only the link text and drops the URL.
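The effect of turning the option off can be approximated on finished Markdown with a single replacement. This is a simplified sketch, not the converter's implementation. `dropLinkUrls` is a hypothetical name, and the pattern deliberately ignores URLs that contain spaces or parentheses.

```typescript
// Hypothetical sketch: keep link text, drop the URL.
function dropLinkUrls(markdown: string): string {
  // Group 1 is the character before "[" (or the start of the string). Requiring it
  // to be something other than "!" leaves image syntax such as ![alt](src) untouched.
  // Group 2 is the link text. An optional "title" after the URL is dropped too.
  return markdown.replace(
    /(^|[^!])\[([^\]]+)\]\([^)\s]+(?:\s+"[^"]*")?\)/g,
    "$1$2"
  );
}
```

For example, `See [the docs](https://example.com/docs).` becomes `See the docs.`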
Collapse extra whitespace
Removes repeated blank lines and stray spaces left behind by complex layouts. Most documents benefit from leaving this on.
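A cleanup pass of this kind can be sketched with a few regular expressions. The rules below are assumptions about what "extra whitespace" covers, and `collapseWhitespace` is a hypothetical name. Note that this sketch also removes trailing double spaces, which Markdown uses for hard line breaks.

```typescript
// Hypothetical sketch of a whitespace cleanup pass.
function collapseWhitespace(markdown: string): string {
  return markdown
    .replace(/\r\n?/g, "\n")               // normalize Windows and old Mac line endings
    .replace(/[ \t]+$/gm, "")              // strip trailing spaces on every line
    .replace(/([^\s])[ \t]{2,}/g, "$1 ")   // squeeze runs of spaces inside a line, keeping indentation
    .replace(/\n{3,}/g, "\n\n")            // at most one blank line in a row
    .trim();
}
```

Leading indentation is left alone because nested list items and indented code depend on it.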
6. Output and Export
The output panel offers two views. Raw Markdown shows the plain text exactly as it will be copied or downloaded. Preview renders the Markdown so you can sanity-check headings, lists, and tables before exporting.
- Copy places the Markdown on your clipboard.
- Download .md saves the Markdown next to your other files. The download name is derived from the original file name.
- Switching options after a conversion re-runs the conversion on the same file, so you do not need to select it again.
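Deriving the download name from the original file name usually means swapping the extension. The sketch below shows one plausible rule. `markdownFileName` and the `document` fallback name are hypothetical, not taken from the tool.

```typescript
// Hypothetical sketch: derive "report.md" from "report.pdf".
function markdownFileName(original: string): string {
  // Remove only the last extension, so "report.final.pdf" keeps "report.final".
  const base = original.replace(/\.[^./\\]+$/, "");
  // Assumption: fall back to a generic name if nothing is left.
  return (base || "document") + ".md";
}
```

Under this rule, `Q3 report.final.docx` would be saved as `Q3 report.final.md`.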
7. Limits and Tips
- Scanned or image-only PDFs cannot be converted by the text-based engines. Switch the parsing engine to OCR, or use a text-based copy. Password-protected PDFs cannot be converted; use an unprotected copy.
- Multi-column PDFs can interleave columns in the standard engine. Switch the PDF parsing engine to LiteParse or EdgeParse for proper column ordering, or copy from the Raw view and clean up manually.
- Complex DOCX templates with custom styles may fall back to plain paragraphs. Map them to standard Heading styles for the best result.
- For very large documents, split them first or paste sections instead of running the entire file through at once.
- Markdown rendered in the Preview is sanitized for display only. Use the Raw view if you need the exact text for downstream tools.
8. Privacy
PDFs, DOCX files, and images are loaded into memory and parsed locally by the engine in use: pdf.js, mammoth.js, the EdgeParse or LiteParse WebAssembly engines, or PP-OCRv6 for OCR. ASD123.ai does not receive your document, the extracted text, or the generated Markdown.
Nothing is stored in localStorage or IndexedDB by this tool. Closing the tab discards the loaded document and the Markdown output. Downloads and clipboard actions stay on your device.
9. Use Cases
Feed local AI chats
Convert a report to Markdown, then drop it into Local AI Chat or any Ollama or LM Studio session for grounded answers.
Prepare snippets for the Optimizer
Pull text out of a PDF, clean it with the Optimizer, and ship it into your CMS or wiki without leaving your browser.
Anonymize before sharing
Turn a DOCX into Markdown, then run it through the Anonymizer to redact PII before pasting into a remote model.
Estimate context first
Pair the converter with the Context Estimator to check whether a long document fits a 16K, 32K, or 128K window before sending it to a model.
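For a quick back-of-the-envelope check, character count gives a rough token estimate. The sketch below is not how the Context Estimator works. `fitsContext` is a hypothetical helper, and the ratio of about four characters per token is a common rule of thumb for English text that varies by model and language.

```typescript
// Hypothetical sketch: rough check of whether a document fits a context window.
interface ContextCheck {
  estimatedTokens: number;
  fits: boolean;
}

function fitsContext(
  markdown: string,
  windowTokens: number,
  charsPerToken: number = 4 // assumption: rule-of-thumb ratio, not a real tokenizer
): ContextCheck {
  const estimatedTokens = Math.ceil(markdown.length / charsPerToken);
  return { estimatedTokens, fits: estimatedTokens <= windowTokens };
}
```

Under these assumptions, a 40,000-character document comes to about 10,000 tokens and fits a 16K window (taken here as 16 × 1024 tokens), while an 80,000-character one does not. Leave headroom for the prompt and the model's answer.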