mixedbread-skills

Agent skills for search, RAG, and document parsing with Mixedbread.

3 skills

# Mixedbread Parsing

Parse documents, extract structured content, and run OCR using the Parsing API. Supports PDFs, Word documents, PowerPoint presentations, and images.

Docs: https://www.mixedbread.com/docs/parsing/overview.md
Agent-readable docs: https://www.mixedbread.com/docs/llms.txt
Latest docs search: https://www.mixedbread.com/question?q=parsing&section=docs

## Setup

```bash
pip install mixedbread          # Python
npm install @mixedbread/sdk     # TypeScript
```

```bash
export MXBAI_API_KEY=your_api_key
```

## Quick Start

**Python:**
```python
from mixedbread import Mixedbread

mxbai = Mixedbread()

# Upload and parse a document (waits for completion).
# The `with` block closes the file handle once the call returns.
with open("report.pdf", "rb") as f:
    job = mxbai.parsing.jobs.upload_and_poll(
        file=f,
        return_format="markdown",
    )

for chunk in job.result.chunks:
    print(chunk.content)
```

**TypeScript:**
```typescript
import Mixedbread from '@mixedbread/sdk';
import fs from 'fs';

const mxbai = new Mixedbread();

const job = await mxbai.parsing.jobs.uploadAndPoll(
    fs.createReadStream('report.pdf'),
    { return_format: 'markdown' },
);

for (const chunk of job.result.chunks) {
    console.log(chunk.content);
}
```

## Decision Tree

- **Which convenience method?**
  - File on disk → `upload_and_poll()` (uploads + creates job + polls)
  - File already uploaded via Files API → `create_and_poll()` (creates job + polls)
  - Need async control → `upload()` or `create()` then `poll()` separately
- **Which parsing mode?**
  - Born-digital PDF (selectable text) → `fast` mode. Fastest, lowest cost. Extracts text, structure, and layout.
  - Scanned document, image, or complex layout → `high_quality` mode. Uses OCR. Extracts text with confidence scores, handles rotated/skewed pages, multi-column layouts.
- **Need specific elements only?** → Set `element_types` to reduce processing time
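
The mode choice above can be sketched as a small helper. This is a minimal illustration, not part of the SDK: `choose_mode`, `has_selectable_text`, and `complex_layout` are hypothetical names, and the caller is assumed to know whether the file has a text layer. Treating every non-image file with selectable text as a `fast` candidate is also an assumption; the decision tree only names born-digital PDFs.

```python
from pathlib import Path

# Image extensions taken from the Supported File Types section.
IMAGE_EXTENSIONS = {".jpeg", ".png", ".webp", ".avif"}


def choose_mode(path, has_selectable_text, complex_layout=False):
    """Return "fast" or "high_quality" following the decision tree.

    `has_selectable_text` and `complex_layout` are caller-supplied flags
    (hypothetical); this sketch does not inspect the file contents.
    """
    suffix = Path(path).suffix.lower()
    # Images, complex layouts, and files without a text layer need OCR.
    if suffix in IMAGE_EXTENSIONS or complex_layout or not has_selectable_text:
        return "high_quality"
    # Born-digital document with selectable text: cheapest option.
    return "fast"
```

The returned string can then be passed as the `mode` argument used in the workflows below.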

## Supported File Types

PDF (`.pdf`), Word (`.doc`, `.docx`, `.dotx`, `.docm`, `.dotm`, `.odt`, `.rtf`), Slides (`.ppt`, `.pptx`, `.ppsx`, `.pptm`, `.potm`, `.ppsm`, `.odp`), Images (`.jpeg`, `.png`, `.webp`, `.avif`).

Element types: `text`, `title`, `section-header`, `header`, `footer`, `page-number`, `list-item`, `figure`, `picture`, `table`, `form`, `footnote`, `caption`, `formula`.
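
A local pre-flight check against the two lists above can catch unsupported files and misspelled element types before any API call. The sketch below is illustrative only: `is_supported_file` and `unknown_element_types` are hypothetical helpers, and the sets are copied from this section rather than fetched from the API.

```python
from pathlib import Path

# Extensions copied from the list above; the API remains the source of truth.
SUPPORTED_EXTENSIONS = {
    ".pdf",
    ".doc", ".docx", ".dotx", ".docm", ".dotm", ".odt", ".rtf",
    ".ppt", ".pptx", ".ppsx", ".pptm", ".potm", ".ppsm", ".odp",
    ".jpeg", ".png", ".webp", ".avif",
}

# Element type names copied from the list above.
ELEMENT_TYPES = {
    "text", "title", "section-header", "header", "footer", "page-number",
    "list-item", "figure", "picture", "table", "form", "footnote",
    "caption", "formula",
}


def is_supported_file(path):
    """True if the file extension appears in the supported list (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def unknown_element_types(requested):
    """Return the requested element types that are not in the documented list."""
    return [t for t in requested if t not in ELEMENT_TYPES]
```

Note that `.jpg` does not appear in the list above, so this sketch rejects it; whether the API accepts that extension is not stated here.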

## Workflows

### Extract Tables from Documents

Filter for table elements to pull structured data from reports.

**Python:**
```python
# Assumes `mxbai` from Quick Start; `with` closes the file after the call returns.
with open("financial-report.pdf", "rb") as f:
    job = mxbai.parsing.jobs.upload_and_poll(
        file=f,
        element_types=["table"],
        return_format="html",
        mode="high_quality",
    )
for chunk in job.result.chunks:
    for element in chunk.elements:
        if element.type == "table":
            print(f"Page {element.page}, confidence {element.confidence:.2f}")
            print(element.content)
```

**TypeScript:**
```typescript
const job = await mxbai.parsing.jobs.uploadAndPoll(
    fs.createReadStream('financial-report.pdf'),
    { element_types: ['table'], return_format: 'html', mode: 'high_quality' },
);
for (const chunk of job.result.chunks) {
    for (const element of chunk.elements) {
        if (element.type === 'table') {
            console.log(`Page ${element.page}, confidence ${element.confidence.toFixed(2)}`);
            console.log(element.content);
        }
    }
}
```

### Batch Parse Multiple Files

Upload multiple files without waiting for each one to finish parsing, then poll all jobs:

**Python:**
```python
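import os

# Assumes `mxbai` from Quick Start.
jobs = []
for filename in os.listdir("./documents"):
    if filename.endswith(".pdf"):
        # Close each file handle as soon as its upload call returns.
        with open(os.path.join("./documents", filename), "rb") as f:
            job = mxbai.parsing.jobs.upload(
                file=f,
                return_format="markdown",
            )
        jobs.append(job)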

# Poll all jobs
for job in jobs:
    completed = mxbai.parsing.jobs.poll(job_id=job.id)
    print(f"{completed.filename}: {len(completed.result.chunks)} chunks")
```

**TypeScript:**
```typescript
import { readdirSync, createReadStream } from 'fs';
import path from 'path';

const files = readdirSync('./documents').filter(f => f.endsWith('.pdf'));
const jobs = await Promise.all(
    files.map(f => mxbai.parsing.jobs.upload(
        createReadStream(path.join('./documents', f)),
        { return_format: 'markdown' },
    )),
);

// Poll all jobs
for (const job of jobs) {
    const completed = await mxbai.parsing.jobs.poll(job.id);
    console.log(`${completed.filename}: ${completed.result.chunks.length} chunks`);
}
```

## Rules

### CRITICAL
- **Don't double-parse.** Uploading a file to a store already parses it. Files uploaded with `parsing_strategy: "high_quality"` automatically get OCR text (images), summaries (images), and transcriptions (audio & video) extracted, and these are available as fields on search result chunks. Running the Parsing API on the same file adds no benefit. Use the Parsing API only for standalone document extraction outside of stores.
- **Use `upload_and_poll()` / `create_and_poll()` instead of manual polling loops.** These methods handle backoff automatically. Manual `while` loops with `retrieve()` are fragile and waste API calls.

### HIGH
- **Specify `element_types` when you only need certain elements.** Requesting all types increases processing time and response size. If you only need tables, set `element_types` to `["table"]`.
- **Use `fast` mode for born-digital PDFs.** The `high_quality` mode adds OCR overhead that provides no benefit when text is already selectable.
- **Check `confidence` scores on OCR output.** Low-confidence elements (< 0.5) may contain errors. Filter or flag them.
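
The confidence rule above can be sketched as a simple partition. This is a minimal sketch under stated assumptions: `Element` is a stand-in dataclass that mirrors only the fields used in the workflows (`type`, `page`, `content`, `confidence`), `split_by_confidence` is a hypothetical helper, and keeping elements that carry no confidence score is a choice made here, not documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Stand-in for a parsed element; not the SDK's own type."""
    type: str
    page: int
    content: str
    confidence: Optional[float] = None


def split_by_confidence(elements, threshold=0.5):
    """Partition elements into (trusted, flagged) using the 0.5 rule above."""
    trusted, flagged = [], []
    for element in elements:
        score = getattr(element, "confidence", None)
        if score is not None and score < threshold:
            flagged.append(element)  # likely OCR errors: review or drop
        else:
            trusted.append(element)  # assumption: keep elements without a score
    return trusted, flagged
```

In practice the elements would come from `chunk.elements` on a completed job, and the flagged list could be logged for manual review rather than discarded.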

### MEDIUM
- **Check `job.error` before retrying failed jobs.** Common causes: unsupported file type, corrupt file, file too large. Blindly retrying wastes quota.
- **Use `content_to_embed` for embedding pipelines.** Each chunk provides both `content` (full text) and `content_to_embed` (optimized for embedding). Use the latter when feeding into vector stores outside Mixedbread.
- **Verify file format before parsing.** Only PDF, Word, PowerPoint, and images are supported. Convert other formats first.
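
For the `content_to_embed` rule, a sketch of collecting embedding inputs from parsed chunks might look like the following. `texts_for_embedding` is a hypothetical helper, and falling back to `content` when `content_to_embed` is missing or empty is an assumption made for robustness, not documented behavior.

```python
def texts_for_embedding(chunks):
    """Collect one string per chunk, preferring `content_to_embed`.

    Falls back to `content` if `content_to_embed` is absent or empty
    (an assumption of this sketch) and skips chunks with no text at all.
    """
    texts = []
    for chunk in chunks:
        text = getattr(chunk, "content_to_embed", None) or getattr(chunk, "content", None)
        if text and text.strip():
            texts.append(text)
    return texts
```

The resulting list can be handed to whichever embedding model feeds the external vector store, for example `texts_for_embedding(job.result.chunks)`.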

## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| Job stuck in `pending` | Queue is busy | Use `poll()` with a longer `poll_timeout_ms`. Check job status with `retrieve()`. |
| Job status `failed` | Unsupported file type, corrupt file, or file too large | Check `job.error` for details. Verify file format is supported. |
| Empty chunks in result | File has no extractable content (blank pages) | Verify the file has content. Try `high_quality` mode for scanned documents. |
| Low confidence scores | Scanned or low-resolution source | Use `high_quality` mode for better OCR accuracy. |
| Missing tables or figures | Element types not requested | Set `element_types` to include `table` and `figure` explicitly. |
| `upload_and_poll()` timeout | Very large document or slow processing | Increase `poll_timeout_ms`, or use `upload()` + `poll()` separately for more control. |
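
The last row of the table can be sketched as a separate upload and poll with a longer timeout. This is an illustration under assumptions: `parse_with_timeout` is a hypothetical wrapper, the ten-minute default is arbitrary, and passing `poll_timeout_ms` as a keyword argument to `poll()` follows the table above rather than a verified signature. The client is passed in so the function can be exercised with a stub.

```python
def parse_with_timeout(client, path, poll_timeout_ms=600_000):
    """Upload a file, then poll separately with an explicit timeout.

    `client` is expected to be a `Mixedbread()` instance. The default
    timeout of 600_000 ms is an arbitrary value chosen for this sketch.
    """
    # Start the job without waiting; close the file once the upload returns.
    with open(path, "rb") as f:
        job = client.parsing.jobs.upload(file=f, return_format="markdown")

    # Wait for completion with a longer timeout than the default.
    completed = client.parsing.jobs.poll(
        job_id=job.id,
        poll_timeout_ms=poll_timeout_ms,
    )

    # Surface the failure reason instead of retrying blindly.
    if completed.status == "failed":
        raise RuntimeError(f"Parsing failed: {completed.error}")
    return completed
```

Called as `parse_with_timeout(mxbai, "large-report.pdf")`, this keeps the job ID available between the two steps, so a timed-out job can still be inspected later with `retrieve()`.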
