pdf2md/skills/pdf-to-markdown-mineru/SKILL.md
qz 22165a3c26 Import pdf-to-markdown converter and shorten hosted image suffixes.
Bring the local project into the remote repository and reduce generated image object suffixes to six characters for shorter URLs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 14:37:42 +08:00

---
name: pdf-to-markdown-mineru
description: Convert local PDF files, especially academic papers, into Markdown via MinerU and rewrite extracted local image references to hosted URLs on an R2-compatible object store.
---
# PDF to Markdown via MinerU
Use this skill when the user wants a local PDF converted into Markdown and the final Markdown should keep working across machines by replacing extracted local image paths with hosted URLs.
## Included files
- `scripts/convert_pdf_to_markdown.py`: standalone CLI for MinerU submission, polling, download, unzip, image upload, and Markdown rewrite.
- `scripts/requirements.txt`: minimal Python dependencies for the CLI.
- `.env`: bundled MinerU and R2 configuration so the skill can run directly in this workspace.
## Workflow
1. Confirm the source PDF path and choose an output `.md` path.
2. Ensure Python dependencies are installed. Prefer `uv pip install -r <skill-dir>/scripts/requirements.txt` or `python -m pip install -r <skill-dir>/scripts/requirements.txt`.
3. This skill first loads `.env` from the skill root, then falls back to the current working directory or an explicit `--env-file`.
4. Ensure these environment variables are available before running:
   - Required: `MINERU_API_TOKEN`, `R2_BASE_URL`, `R2_BEARER_TOKEN`
   - Optional: `R2_PREFIX`, `R2_PUBLIC_BASE_URL`, `POLL_INTERVAL_SECONDS`, `TIMEOUT_SECONDS`
5. Run the converter:
   ```bash
   python scripts/convert_pdf_to_markdown.py /path/to/paper.pdf -o /path/to/paper.md
   ```
6. For scanned PDFs, add `--ocr`. To turn off table or formula extraction, pass `--disable-table` or `--disable-formula`.
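
Before step 5, it can help to confirm that the required variables from step 4 are actually set. The following is a minimal sketch of such a pre-flight check, not part of the bundled script: the helper name `missing_required_vars` is hypothetical, and it only inspects a mapping such as `os.environ`, so it does not reproduce the script's own `.env` loading.

```python
import os

# Variable names come from step 4 of the workflow; the helper is hypothetical.
REQUIRED_VARS = ("MINERU_API_TOKEN", "R2_BASE_URL", "R2_BEARER_TOKEN")


def missing_required_vars(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_required_vars({"MINERU_API_TOKEN": "example-token"}))
# prints ['R2_BASE_URL', 'R2_BEARER_TOKEN']
```

Calling `missing_required_vars()` with no argument checks the current process environment. Values that exist only in a `.env` file will be reported as missing unless they have been exported first.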
## Operational notes
- The script requires outbound network access to MinerU and the R2-compatible object store.
- Progress messages are written to stderr. The final Markdown path is written to stdout.
- Only local image references are uploaded and rewritten. Existing `http`, `https`, and `data:` image URLs are left unchanged.
- If the caller wants Markdown without any image hosting step, this skill is the wrong default; adjust the script first instead of running it as-is.
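
The rule that only local image references are rewritten can be illustrated with a short sketch. This is an assumption about how such a rewrite could work, not the bundled script's actual implementation: the names `is_local_reference` and `rewrite_local_images` are hypothetical, the `hosted_urls` mapping stands in for the upload step, and the regular expression handles only the plain `![alt](target)` form.

```python
import re

# Simplified pattern for inline Markdown images: ![alt](target).
# It ignores titles, angle-bracket targets, and reference-style images.
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")

# Targets with these prefixes are treated as already hosted or embedded.
REMOTE_PREFIXES = ("http://", "https://", "data:")


def is_local_reference(target):
    """Return True when the image target is not an http, https, or data URL."""
    return not target.lower().startswith(REMOTE_PREFIXES)


def rewrite_local_images(markdown, hosted_urls):
    """Replace local image targets found in hosted_urls; leave the rest as-is."""

    def replace(match):
        alt, target = match.group(1), match.group(2)
        if is_local_reference(target) and target in hosted_urls:
            return f"![{alt}]({hosted_urls[target]})"
        return match.group(0)

    return IMAGE_PATTERN.sub(replace, markdown)
```

In this sketch, `hosted_urls` maps each local path (for example `images/fig1.png`) to the URL returned after upload. Any local reference without an entry in the mapping is left unchanged.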