Olostep - web search, scraping and crawling for AI
Turn the entire web into LLM-ready context. Olostep integrates web search, scraping, crawling, and AI answers directly into Cursor. Essential for agents workflows: debug obscure errors by scraping GitHub issues, automate framework migrations by scraping breaking changes, turn websites into typed JSON schemas, or crawl entire API docs into context. Handles JS rendering, anti-bot, and proxies automatically.
olostep web
MCP server: olostep web search, scraping and crawling
{
"command": "npx",
"args": [
"-y",
"olostep-mcp"
],
"env": {
"OLOSTEP_API_KEY": "${OLOSTEP_API_KEY}"
}
}answers
Get AI-powered answers with citations from live web data using Olostep. Use when the user needs up-to-date facts, competitive intelligence, pricing comparisons, structured data from the web, or wants web-sourced answers in a specific JSON shape they can use directly in code or docs.
# Olostep AI Answers
Ask a question, get an AI-synthesised answer from live web sources — with citations and optional structured JSON output you can paste straight into code.
## Why this matters
This is the most powerful tool in the Olostep toolkit. Instead of scraping individual pages and parsing them yourself, you describe what you want to know and the shape you want the answer in. Olostep searches the web, reads multiple sources, and returns a synthesised, cited answer — optionally as structured JSON.
## When to use
- User needs **current** information (pricing, versions, status, rankings, team, funding)
- User wants structured data from the web — prices, specs, features — in a JSON shape they define
- User is comparing tools, libraries, or services and needs a fair, cited comparison
- User wants to fact-check a claim with current sources
- User asks anything that starts with "what's the current...", "is X still...", "compare...", "find me..."
## Workflow
1. Use `answers` with the user's question as `task`.
2. **Always use the `json` parameter** when the user needs structured, comparable, or code-ready output.
3. Return the answer with source URLs.
4. Offer to `/scrape` any cited source for deeper detail.
## The power of `json`
The `json` parameter transforms `answers` from a simple Q&A tool into a **structured data extraction engine**. You define the output shape, and Olostep fills it with live web data.
```
// Get an array of competitors with structured fields
task: "Top 5 alternatives to Segment for product analytics"
json: [{"name": "", "pricing": "", "key_feature": "", "best_for": ""}]
// Get a single object with specific data points
task: "Company info for Linear"
json: {"ceo": "", "founded": "", "funding_total": "", "employees": "", "tech_stack": []}
// Get a plain string description
task: "What changed in Next.js 15?"
json: "Return a bullet list of the top 5 changes with one-line descriptions"
```
## Real developer workflows
**"Populate a comparison table in my app"**
> "Get the pricing for Vercel, Netlify, and Render — return as JSON I can use in my React component"
→ `answers` with `json: [{"provider": "", "free_tier": "", "pro_price": "", "key_limits": ""}]`
→ Returns a JSON array you can import directly into your code.
**"Is this library still maintained?"**
> "Is Zustand still actively maintained? What's the latest version and when was it released?"
→ `answers` with `json: {"maintained": true, "latest_version": "", "released": "", "weekly_downloads": ""}` → live npm/GitHub data with citations.
**"Get company data for my CRM"**
> "Find the CEO, founding year, employee count, and last funding round for Linear.app"
→ `answers` with structured schema → returns verified data with sources.
**"Compare before choosing a tool"**
> "Compare Prisma, Drizzle, and Kysely for TypeScript ORM — which should I pick for a serverless project?"
→ `answers` with `json: [{"orm": "", "pros": [], "cons": [], "serverless_fit": ""}]` → structured comparison you can reason about.
**"Research for a blog post"**
> "What are the most common mistakes developers make with React Server Components in 2026?"
→ `answers` returns a cited, synthesised answer aggregated from multiple blog posts and discussions.
**"Get live stats for a feature decision"**
> "How many developers use TypeScript vs JavaScript in 2026? What's the adoption rate?"
→ `answers` with `json: {"stat": "", "value": "", "source": "", "year": ""}` → cited data points.
**"Tech stack research"**
> "What tech stack does Figma use? I want to understand how they handle real-time collaboration."
→ `answers` aggregates from engineering blog posts, job postings, and conference talks → detailed technical breakdown.
## Chain with other skills
- **answers → scrape**: Get a quick answer with citations, then `/scrape` a cited source for the full page content.
- **answers → batch**: Research a topic, then `/batch` scrape all cited sources for comprehensive detail.
- **answers → code**: Get structured technical data, then use it to write code (e.g. comparison component, config file, feature matrix).
## Parameters
- **task**: Question or research task (required)
- **json**: Output shape for structured results (optional but highly recommended)
- **Object**: `{"name": "", "price": "", "features": []}` — returns a filled object
- **Array**: `[{"name": "", "price": ""}]` — returns a filled array
- **String**: `"return a bullet list of the top 5 changes"` — returns formatted text
## Tips
- **Always use `json`** when the output needs to be structured, compared, or used in code — it's the killer feature
- Share citation URLs with the user so they can verify and explore further
- For deeper detail on any cited source, offer to `/scrape` it
- This replaces dozens of manual Google searches → use it liberally for any "current state of X" question
- Works great for: pricing, versions, status, team info, tech stacks, market data, tool comparisonsbatch
Scrape up to 10,000 URLs in parallel using Olostep. Use when the user has a list of URLs to extract content from at once — product pages, job listings, pricing pages, changelogs, docs, or any set of known pages. Much faster than scraping one-by-one.
# Olostep Batch Scrape
Scrape up to 10,000 URLs in parallel. All pages are scraped concurrently with full browser rendering, anti-bot bypass, and residential proxies — no rate limiting, no blocking, no setup.
## Why this matters
Scraping 50 pages sequentially takes minutes and often fails halfway through (rate limits, CAPTCHAs, IP bans). Olostep batch processes them all in parallel through its browser infrastructure — you get every page back, fully rendered, in seconds.
## When to use
- User has 3+ URLs they want content from (don't scrape them one-by-one)
- User has a list from `/map` and wants to pull content from matching pages
- User wants to compare multiple competitor pages, product pages, or pricing pages side-by-side
- User needs to build a dataset from a list of known URLs
- User wants to audit, analyse, or extract patterns across many pages
## Workflow
1. Collect URLs from the user (or from a previous `/map` result).
2. Build the array: `[{"url": "...", "custom_id": "competitor-1"}, ...]`.
3. Call `batch_scrape_urls` with the array.
4. Use `custom_id` on each URL so you can match results back to the source.
5. Default to `markdown`. Use `json` + `parser` for structured product data.
6. After batch completes, **synthesise across all results** — compare, extract patterns, build tables, generate code.
## Real developer workflows
**"Compare competitor pricing pages"**
> "Batch scrape the pricing pages of Vercel, Netlify, Render, Railway, and Fly.io — extract each plan name, price, and key limits"
→ Batch 5 URLs with `custom_id` per competitor → extract pricing into a comparison table.
**"Build a product dataset from Amazon"**
> "Here are 30 Amazon product URLs — scrape them all and extract name, price, rating, and review count"
→ Batch with `parser: "@olostep/amazon-product"` → get clean structured JSON for every product.
**"Audit all docs pages for completeness"**
> "I mapped our docs site and got 45 URLs. Batch scrape all of them and tell me which pages are missing code examples"
→ `/map` first → feed URLs into `batch` → scan each page's content for code blocks → report gaps.
**"Extract every job listing"**
> "Scrape all 20 of these job posting URLs and give me a table of: title, company, salary range, required skills, location"
→ Batch all listings → extract fields from each → return as a markdown table or JSON array.
**"Dependency changelog audit"**
> "Scrape the changelogs for these 12 npm packages and flag any breaking changes in their latest major releases"
→ Batch all changelog URLs → scan for "BREAKING", "removed", "deprecated" → report per package.
**"Conference speaker research"**
> "Batch scrape these 20 speaker profile pages and extract name, company, talk title, and bio"
→ Batch with `custom_id` per speaker → extract structured profiles → generate a speakers.json file.
**"Geo-targeted pricing comparison"**
> "Scrape this SaaS pricing page from US, UK, and Germany to see if prices differ by region"
→ Three batch calls with `country: "US"`, `country: "GB"`, `country: "DE"` → compare results.
## Chain with other skills
- **map → batch**: The most powerful combo. `/map` discovers all URLs on a site, you filter to the ones you need, then `/batch` scrapes them all. Perfect for docs sites, blogs, product catalogs.
- **search → batch**: Use `/search` to find relevant URLs, then `/batch` scrape all results for full content.
- **batch → code**: Batch scrape API docs pages, then generate typed clients or integration code from the combined content.
## Parameters
- **urls_to_scrape**: Array of `{url, custom_id?}` objects, 1–10,000 (required)
- **output_format**: `markdown` (default), `html`, `json`, `text`
- **wait_before_scraping**: ms to wait per URL for JS rendering, 0–10000 (optional)
- **country**: Country code for geo-targeted content — e.g. `US`, `GB`, `DE` (optional)
- **parser**: Specialised parser e.g. `@olostep/amazon-product` (optional)
## Tips
- **Always use `custom_id`** — it makes matching results to sources trivial (e.g. `"custom_id": "vercel-pricing"`)
- For 1–2 URLs, use `/scrape`. For 3+, always use `/batch` — it's parallel and faster
- Combine with `/map` for the most powerful pattern: discover → filter → batch scrape → synthesise
- The `country` param routes through residential proxies in that country — great for comparing regional content
- Always synthesise across results (tables, comparisons, summaries) — don't dump raw contentcrawl
Autonomously crawl an entire website by following links from a starting URL using Olostep. Use when the user wants to ingest an entire docs site, blog, knowledge base, or any multi-page site as context for coding, writing, analysis, or building a knowledge base.
# Olostep Crawl
Autonomously discover and scrape an entire website by following links from a start URL. Every page is rendered in a full browser with anti-bot bypass — perfect for ingesting docs, blogs, and knowledge bases.
## Why this matters
Manually collecting pages from a docs site is tedious and incomplete. Crawl does it automatically — it starts at a URL, follows links, renders every page (even JS-heavy ones), and returns clean content from each. You get an entire site's knowledge in one call.
## When to use
- User wants to learn a library by ingesting its complete docs
- User wants all blog posts or articles from a site as context
- User is preparing content for a RAG pipeline or knowledge base
- User wants to audit all pages across a site (content completeness, consistency, SEO)
- User wants a full picture of a competitor's documentation or marketing
## Workflow
1. Use `create_crawl` with the starting URL.
2. Set `max_pages: 10` by default — **always ask the user before going above 20**.
3. Set `follow_links: true` (default) to discover pages automatically.
4. Default to `markdown` — best for AI reasoning.
5. After crawling, **synthesise the collected content** — don't dump it raw.
## Real developer workflows
**"Learn this library from its docs"**
> "Crawl https://tanstack.com/query/latest/docs/overview and give me a cheat sheet of the key hooks, patterns, and gotchas"
→ Crawl 15–20 pages → synthesise into a concise reference with code snippets.
**"Generate working examples from docs"**
> "Crawl the Resend docs and write me 5 practical TypeScript examples: send email, send with template, add attachment, send to list, check delivery status"
→ Crawl → extract each endpoint's signature and parameters → write copy-paste-ready examples.
**"Build a knowledge base for my team"**
> "Crawl our internal docs at https://docs.internal.co and create a structured summary of every service, its owner, and its API endpoints"
→ Crawl 30–50 pages → extract service names, team ownership, endpoint lists → generate a services.json or markdown index.
**"Competitive docs analysis"**
> "Crawl https://docs.competitor.com/api and compare their API capabilities with ours — what do they have that we don't?"
→ Crawl → list all endpoints and features → compare against the user's API → identify gaps.
**"Content audit before a redesign"**
> "Crawl our marketing site and give me every page URL, its title, word count, and whether it has a CTA"
→ Crawl up to 50 pages → extract metadata from each → return as a structured table.
**"Prepare context for a refactor"**
> "Crawl our architecture docs and summarise the auth system before I refactor it"
→ Crawl → filter pages related to auth → extract design decisions, data flow, dependencies.
## Chain with other skills
- **map → crawl**: Use `/map` first to see all URLs, then `/crawl` a specific section. Avoids wasting crawl budget on irrelevant pages.
- **crawl → code**: Crawl the docs, then write complete integration code from the collected knowledge (see `/docs-to-code`).
- **crawl → batch**: If the crawl is too broad, use `/map` + `/batch` for more targeted extraction.
## Parameters
- **start_url**: Starting URL for the crawl (required)
- **max_pages**: Maximum pages to crawl — default 10, increase carefully (optional)
- **follow_links**: Whether to follow links on each page — default `true` (optional)
- **output_format**: `markdown` (default), `html`, `json`, `text`
- **country**: Country code for geo-targeted crawling — e.g. `US`, `GB` (optional)
- **parser**: Specialised parser ID for structured extraction (optional)
## Tips
- **Start small**: 10 pages is usually enough for a focused docs section. Ask before going to 50+.
- Always **synthesise** after crawling — generate a summary, cheat sheet, comparison, or structured data.
- Use `/map` first on large sites to understand the structure, then `/crawl` the relevant section.
- For a **known list** of URLs, `/batch` is better — it's parallel and you control exactly which pages to scrape.
- For a **single page**, just use `/scrape`.debug-error
Search the web for a specific error message or bug using Olostep. Use when the user pastes an error stack trace, says their code is failing, or asks how to fix a bug that requires looking up recent GitHub issues, StackOverflow threads, or framework documentation.
# Olostep Debug Error
Turn obscure error messages into working fixes by searching the live web and scraping the actual solutions from GitHub issues, StackOverflow, or official docs.
## When to use
- User pastes a stack trace or terminal error and asks "how do I fix this?"
- User is facing a bug with a specific library version that the AI wasn't trained on
- User asks if there is an open GitHub issue for a bug they found
- User is stuck and standard debugging hasn't worked
## Workflow
1. Extract the core error message, framework, and version (if applicable) from the user's context.
2. Use `answers` with the query to search the live web for the error.
*Example: "How to fix [exact error message] in [Framework name] v[Version]?"*
3. If the answer cites a specific GitHub issue or StackOverflow thread that looks promising, use `scrape_website` on that exact URL to read the full discussion and find the accepted solution or workaround.
4. Explain the root cause of the error based on the scraped context.
5. Provide the exact code fix or terminal commands the user needs to run to resolve the issue.
## Real developer workflows
**"Fix a weird build error"**
> "I'm getting 'Next.js 14 Error: x is not defined' when running npm run build. Help me fix it."
→ `answers` searches for the error. Finds a GitHub issue. → `scrape_website` reads the issue to find the maintainer's workaround. → Applies fix to user's code.
**"Find out if a bug is a known issue"**
> "My Supabase auth listener is firing twice on login. Is this a known bug?"
→ `answers` searches the Supabase repo for the bug. Returns the status of the issue and any community workarounds.
**"Resolve peer dependency conflicts"**
> "npm ERR! ERESOLVE unable to resolve dependency tree for React 19 and Framer Motion"
→ Searches for the specific compatibility issue. Scrapes the relevant PR or discussion. Advises on the correct version bump or `--legacy-peer-deps` flag.
## Parameters to use
### `answers`
- **task**: The exact error message and framework context.
### `scrape_website`
- **url_to_scrape**: The URL of the GitHub issue, forum thread, or docs page found via `answers`.
- **output_format**: `markdown`
## Tips
- Always prioritize scraping GitHub issues or official framework forums when debugging.
- Don't just paste the scraped text; translate the solution directly into a fix for the user's specific codebase.
- If an issue is marked as "open" with no fix, tell the user and suggest the most upvoted workaround.docs-to-code
Scrape API documentation or library docs and use them to write working, up-to-date code. Use when the user wants to integrate a third-party API, learn a new library, generate SDK usage examples, write code based on a docs URL, or update code after a library version change. This is the flagship workflow — never hallucinate API parameters when you can scrape the real docs.
# Olostep Docs-to-Code
Turn any documentation URL into working, tested code. Scrape the real docs → understand the API surface → write the integration. Never guess parameters when the actual docs are one scrape away.
## Why this matters
AI models hallucinate API parameters, use deprecated methods, and miss breaking changes from new versions. This skill solves that — it scrapes the **actual, current docs** and writes code based on what's really there. The result is code that works on the first try.
## When to use
- User wants to integrate a third-party API and shares (or mentions) a docs URL
- User says "write me code using [library]" — especially if it's new, niche, or recently updated
- User wants examples for a library that shipped a new major version
- User wants to generate a typed client, SDK wrapper, or helper from an API reference
- User says "read the docs and write the code" or "use the latest docs"
- User wants to migrate code after a breaking change in a dependency
## Workflow
1. **Scrape the docs page**: `scrape_website` with `output_format: markdown`, `wait_before_scraping: 2000`.
2. **If docs span multiple pages**: Use `get_website_urls` or `create_map` to find sub-pages (auth, endpoints, types, errors), then scrape the relevant ones.
3. **For entire doc sites**: Use `create_crawl` with `max_pages: 15` to ingest a full docs section.
4. **Parse the API surface**: Extract endpoints, parameters (required vs optional), auth patterns, request/response shapes, error codes.
5. **Write the code**: Use **exactly** what's in the docs — don't guess, don't hallucinate. Include:
- Correct imports and auth setup
- TypeScript types derived from the docs' response shapes
- Error handling for documented error codes
- Working, copy-paste-ready examples
6. **Cite the source**: Link to the doc page you scraped so the user can verify.
## Real developer workflows
**"Integrate this API"**
> "Scrape https://docs.resend.com/api-reference/emails/send-email and write a Next.js API route that sends a welcome email"
→ Scrape the endpoint docs → extract auth header, payload shape (`from`, `to`, `subject`, `html`), response → write a typed `POST /api/send-welcome` handler.
**"Generate a typed client"**
> "Scrape the Anthropic messages API docs and write me a fully typed TypeScript client with streaming support and error handling"
→ Scrape API reference → extract endpoints, input/output types, error types → generate a `AnthropicClient` class with methods for each endpoint.
**"Learn a library and write examples"**
> "Crawl https://docs.upstash.com/redis and generate 5 practical examples for a Next.js app: cache a DB query, rate limit an API, session storage, pub/sub, and leaderboard"
→ Crawl 10–15 docs pages → extract commands and patterns → write 5 focused, working examples.
**"Implement a webhook handler"**
> "Scrape the Stripe webhook verification docs and write me a Next.js webhook handler for payment_intent.succeeded"
→ Scrape → extract signature verification logic → implement with `stripe.webhooks.constructEvent()` and proper error handling.
**"Migrate to a new version"**
> "Scrape the v4 migration guide at https://sdk.vercel.ai/docs/migration and update my code from v3"
→ Scrape migration guide → identify breaking changes → apply each change to the user's existing code → explain what changed and why.
**"Generate an OpenAPI-style reference"**
> "Scrape the Hono framework docs and generate a quick reference of all route methods, middleware, and context helpers"
→ Map docs → scrape relevant pages → extract the full API surface → present as a structured reference.
**"Write the auth flow"**
> "Scrape the Clerk docs for Next.js App Router and write the complete auth setup: middleware, sign-in page, protected routes, and user object access"
→ Scrape 3–4 relevant docs pages → write each piece with correct imports and configuration.
## Chain with other skills
- **map → scrape → code**: Map a docs site to find the right pages → scrape the specific reference pages → write code from them.
- **search → scrape → code**: Don't have the docs URL? Use `/search` to find it, then scrape and code.
- **crawl → code**: For full library learning, crawl the entire docs section → synthesise into working examples.
## Tips
- **Always scrape before writing** — never assume your training data has the current API surface
- If a user mentions a library version ("v5", "latest"), the docs have likely changed — scrape them
- For large APIs, use `/map` first to find the exact reference pages, then scrape those specifically
- Prefer specific **reference pages** over overview/guide pages — they have the exact parameters and types
- If auth is unclear from the endpoint docs, scrape the authentication/quickstart page separately
- Always include error handling based on documented error codes
- Link to the scraped doc page so the user can verify the code matchesextract-schema
Scrape a webpage with Olostep and extract specific structured data matching a TypeScript interface, JSON schema, or database model. Use when the user wants to turn a website (like a product list, directory, or article) into clean, structured JSON or seed data.
# Olostep Extract Schema
Turn any unstructured webpage into perfectly formatted JSON or TypeScript objects. Scrape the site using Olostep's clean markdown extraction, then parse the data to match the user's exact schema.
## When to use
- User provides a URL and a TypeScript interface and says "Extract all the products on this page to match this interface."
- User wants to populate a database or create seed data from a real website.
- User is building a directory or aggregator and needs to scrape profiles, pricing, or features into a JSON file.
- User wants to convert an unstructured blog post or news article into a structured metadata object.
## Workflow
1. Review the TypeScript interface, JSON schema, or data shape requested by the user.
2. Scrape the target URL using `scrape_website` (always use `output_format: markdown` because it is highly optimized for LLM parsing).
3. If scraping multiple pages (like a list of profiles), use `batch_scrape_urls`.
4. Parse the extracted markdown content and map it strictly to the requested schema.
5. Output the result as a raw JSON code block, or write it directly to a `.json` or `.ts` file in the user's workspace if requested.
## Real developer workflows
**"Generate database seed data"**
> "Scrape this YC startup directory page and generate a JSON array matching my `Startup` Prisma schema: `{ name: string, description: string, batch: string, website: string }`. Write it to `seed.json`."
→ Scrapes the URL. Parses the clean markdown into the exact JSON shape. Creates the file.
**"Extract e-commerce products"**
> "Batch scrape these 5 Amazon product URLs and give me an array of objects with `title`, `price`, `rating`, and `inStock` boolean."
→ Runs a `batch` scrape. Extracts the fields for each product. Returns the structured array.
**"Parse article metadata"**
> "Scrape this news article and extract the `author`, `publishDate`, `mainTopics` (array of strings), and a 2-sentence `summary`."
→ Scrapes the article. Uses LLM reasoning to extract the fields. Returns the JSON.
## Parameters to use
### `scrape_website` or `batch_scrape_urls`
- **url_to_scrape** / **urls_to_scrape**: The target URL(s)
- **output_format**: `markdown` (Markdown is the best format for the LLM to read and extract schemas from)
- **wait_before_scraping**: `3000` (Crucial for e-commerce or directory sites that load data via client-side fetching)
## Tips
- Olostep's markdown format strips out all the HTML noise, making it incredibly easy and cheap for you (the AI) to extract fields accurately.
- If data is missing from the page, use `null` or omit the field as defined by the user's schema—do not hallucinate data.
- If the user wants to extract from many URLs, always route them to use `batch_scrape_urls` for speed.integrate
Automatically integrate the Olostep SDK into the user's codebase. Analyzes the project, detects the language and framework, chooses the right integration pattern, installs the SDK, writes all the code, and verifies it works — with minimal prompting. Use when the user wants to add Olostep web scraping, search, or AI answers to their project.
# Olostep Self-Integrating Setup
Analyze the user's codebase, detect their stack, and write a complete, working Olostep integration — SDK installation, client setup, tool/route wiring, and verification — with minimal user input.
## When to use
- User says "integrate Olostep", "add Olostep to my project", "set up Olostep SDK", or "I want web scraping in my app"
- User runs `/olostep:integrate` with or without an API key
- User wants to add web search, scraping, crawling, or AI answers to an existing codebase
- User is migrating from another scraping provider (Firecrawl, Apify, Browserbase, ScrapingBee, etc.)
## Instructions
You are an expert Olostep integration engineer. You know every endpoint, every SDK method, every parameter, and every framework pattern. Your job is to get the user from zero to a working integration as fast as possible. Be opinionated — pick the best approach and run with it. Only ask the user a question when there is genuine ambiguity that affects the integration (e.g., two equally valid patterns).
### Phase 1 — Detect the codebase
Analyze the project to understand:
1. **Language**: Check for `package.json` (Node/TypeScript), `pyproject.toml` / `requirements.txt` / `setup.py` (Python), or both.
2. **Framework**: Detect from dependencies:
- **Node/TS**: Next.js (`next`), Express (`express`), Fastify (`fastify`), Hono (`hono`), NestJS (`@nestjs/core`), Nuxt (`nuxt`), SvelteKit (`@sveltejs/kit`), Remix (`@remix-run/node`), Astro (`astro`)
- **Python**: FastAPI (`fastapi`), Django (`django`), Flask (`flask`), Starlette (`starlette`), Litestar (`litestar`)
3. **AI/Agent frameworks**: LangChain (`langchain`), LangGraph (`langgraph`), CrewAI (`crewai`), Vercel AI SDK (`ai`), OpenAI SDK (`openai`), Anthropic SDK (`@anthropic-ai/sdk`), Google ADK (`google-adk`), Mastra (`mastra`), Haystack, AutoGen
4. **Existing Olostep usage**: Look for `olostep` in dependencies, `OLOSTEP_API_KEY` in `.env` files, existing imports of `olostep` or `Olostep`
5. **Project structure**: Where are API routes? Where are utilities/lib files? Where is the agent loop? Where are environment variables managed?
### Phase 2 — Choose the integration pattern
Based on detection, pick the best pattern. **Default to the most useful pattern for their stack** — don't ask unless two patterns are equally valid.
#### Decision tree
```
Is this a LangChain project?
→ Install `langchain-olostep`, wire tools into agent
Is this a CrewAI project?
→ Install `crewai-olostep`, add tools to agents
Is this a Mastra project?
→ Install `@olostep/mastra`, add tools to mastra config
Is this an AI agent app (Vercel AI SDK, OpenAI function calling, etc.)?
→ Install native SDK (`olostep`), create tools that wrap Olostep endpoints
Is this a web app with API routes (Next.js, Express, FastAPI, etc.)?
→ Install native SDK, create API route(s) for scraping/search
Is this a data pipeline / script?
→ Install native SDK, write the pipeline script
Is this a CLI or automation tool?
→ Install native SDK, add commands
Else (plain project, unclear):
→ Install native SDK, create a utility module, show usage examples
```
If the user's intent is unclear AND the codebase could go multiple ways, ask ONE question:
> "I see you're using [framework]. How do you want to use Olostep?"
> 1. **As an AI tool** — Give your AI agent the ability to search/scrape the web
> 2. **As API routes** — Expose scraping/search as API endpoints in your app
> 3. **As a data pipeline** — Batch scrape/crawl websites for data ingestion
### Phase 3 — API key setup
1. If the user passed an API key via `$ARGUMENTS`, use it.
2. If `OLOSTEP_API_KEY` already exists in `.env` or environment, use it.
3. Otherwise, ask: "What's your Olostep API key? Get one free at https://olostep.com/auth"
Store the key in the project's `.env` file as `OLOSTEP_API_KEY=<key>`. If there's no `.env`, create one (and add `.env` to `.gitignore` if not already there).
### Phase 4 — Install the SDK
Run the appropriate install command:
**Node.js / TypeScript:**
```bash
npm install olostep
# or for LangChain:
npm install langchain-olostep
# or for Mastra:
npm install @olostep/mastra
```
**Python:**
```bash
pip install olostep
# or for LangChain:
pip install langchain-olostep
# or for CrewAI:
pip install crewai-olostep
```
### Phase 5 — Write the integration code
Write idiomatic, production-ready code for the detected stack. Every integration must include:
- Client initialization with proper API key loading
- Error handling
- TypeScript types (if TS project)
- Comments linking to the relevant docs page
Below are the complete SDK references and framework-specific patterns to use.
---
## Olostep SDK Reference — Node.js / TypeScript
**Package**: `olostep` (npm) — https://docs.olostep.com/sdks/node-js
### Client initialization
```ts
import Olostep from 'olostep';
const client = new Olostep({ apiKey: process.env.OLOSTEP_API_KEY });
```
### Scrapes — extract content from any URL
```ts
import Olostep, { Format } from 'olostep';
// Simple scrape
const scrape = await client.scrapes.create('https://example.com');
console.log(scrape.markdown_content);
// With options
const scrape = await client.scrapes.create({
url: 'https://example.com',
formats: [Format.HTML, Format.MARKDOWN, Format.TEXT],
waitBeforeScraping: 1000,
removeImages: true,
});
console.log(scrape.html_content);
console.log(scrape.markdown_content);
// Get scrape by ID
const fetched = await client.scrapes.get(scrape.id);
```
**Parameters for `scrapes.create()`:**
- `url` (string, required) — URL to scrape
- `formats` (Format[], optional) — `Format.HTML`, `Format.MARKDOWN`, `Format.TEXT`, `Format.JSON`
- `waitBeforeScraping` (number, optional) — ms to wait for JS rendering (0–10000)
- `removeImages` (boolean, optional) — strip images
- `removeCssSelectors` (string | string[], optional) — CSS selectors to remove
- `country` (string | Country, optional) — geo-target: `'us'`, `'gb'`, `Country.DE`, etc.
- `parser` (object, optional) — `{ id: '@olostep/google-search' }` for structured extraction
- `llmExtract` (object, optional) — `{ schema: {...} }` or `{ prompt: '...' }` for LLM-powered extraction
- `actions` (array, optional) — browser actions: `{ type: 'wait', milliseconds: 2000 }`, `{ type: 'click', selector: '#btn' }`, `{ type: 'scroll', distance: 1000 }`, `{ type: 'fill_input', selector: '#search', value: 'query' }`
- `context` (object, optional) — `{ id: 'ctx_...' }` for custom cookies/auth
**Response fields:** `id`, `html_content`, `markdown_content`, `text_content`, `json_content`, `screenshot_hosted_url`, `html_hosted_url`, `markdown_hosted_url`, `json_hosted_url`, `text_hosted_url`, `links_on_page`, `page_metadata`
### Batches — scrape up to 10k URLs in parallel
```ts
// Simple
const batch = await client.batches.create([
'https://example.com',
'https://example.org',
]);
// With custom IDs
const batch = await client.batches.create([
{ url: 'https://example.com', customId: 'site-1' },
{ url: 'https://example.org', customId: 'site-2' },
]);
// Wait for completion
await batch.waitTillDone({ checkEveryNSecs: 5, timeoutSeconds: 120 });
// Get info
const info = await batch.info();
// Stream results
for await (const item of batch.items()) {
console.log(item.customId, item.url);
}
```
### Crawls — crawl entire websites
```ts
const crawl = await client.crawls.create({
url: 'https://example.com',
maxPages: 100,
maxDepth: 3,
includeUrls: ['*/blog/*'],
excludeUrls: ['*/admin/*'],
});
await crawl.waitTillDone({ checkEveryNSecs: 10, timeoutSeconds: 300 });
const info = await crawl.info();
for await (const page of crawl.pages()) {
console.log(page.url, page.status_code);
}
```
### Maps — discover all URLs on a site
```ts
const map = await client.maps.create({
url: 'https://example.com',
topN: 100,
includeSubdomain: true,
searchQuery: 'blog posts',
});
for await (const url of map.urls()) {
console.log(url);
}
```
### Answers — AI-powered web search with structured output
> **Note**: The `answers` namespace is not yet in the Node.js SDK (`0.1.2`). Call the REST API directly:
```ts
// Simple question
const response = await fetch('https://api.olostep.com/v1/answers', {
method: 'POST',
headers: {
Authorization: `Bearer ${process.env.OLOSTEP_API_KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({ task: 'What is the pricing of Stripe?' }),
});
const answer = await response.json();
console.log(answer.result); // string when no json param
// With JSON schema for structured output
const response2 = await fetch('https://api.olostep.com/v1/answers', {
method: 'POST',
headers: {
Authorization: `Bearer ${process.env.OLOSTEP_API_KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
task: 'What is the latest book by J.K. Rowling?',
json: { book_title: '', author: '', release_date: '' },
}),
});
const answer2 = await response2.json();
// answer2.result.json_content → '{"book_title":"...","author":"...","release_date":"..."}'
// answer2.result.sources → ['https://...', ...]
```
### Content retrieval
```ts
import { Format } from 'olostep';
const content = await client.retrieve(retrieveId, Format.MARKDOWN);
console.log(content.markdown_content);
```
### Advanced — geographic scraping
```ts
import Olostep, { Country } from 'olostep';
const scrape = await client.scrapes.create({
url: 'https://example.com',
country: Country.DE, // or any string like 'jp'
});
```
### Advanced — browser actions
```ts
const scrape = await client.scrapes.create({
url: 'https://example.com',
actions: [
{ type: 'wait', milliseconds: 2000 },
{ type: 'click', selector: '#load-more' },
{ type: 'scroll', distance: 1000 },
{ type: 'fill_input', selector: '#search', value: 'query' },
],
});
```
### Advanced — LLM extraction
```ts
import Olostep, { Format } from 'olostep';
// With a JSON schema
const scrape = await client.scrapes.create({
url: 'https://example.com/product',
formats: [Format.JSON],
llmExtract: {
schema: {
title: { type: 'string' },
price: { type: 'number' },
description: { type: 'string' },
},
},
});
console.log(scrape.json_content);
// With a natural language prompt
const scrape2 = await client.scrapes.create({
url: 'https://example.com/events',
formats: [Format.JSON],
llmExtract: {
prompt: 'Extract all event names, dates, and venues from this page',
},
});
```
### Client configuration
```ts
const client = new Olostep({
apiKey: process.env.OLOSTEP_API_KEY!,
timeoutMs: 150000, // request timeout (default 150s)
retry: {
maxRetries: 3,
initialDelayMs: 1000,
},
});
```
---
## Olostep SDK Reference — Python
**Package**: `olostep` (PyPI, Python 3.11+) — https://docs.olostep.com/sdks/python
### Sync client
```python
from olostep import Olostep
client = Olostep(api_key="your-api-key") # or uses OLOSTEP_API_KEY env var
```
### Async client (recommended for production)
```python
from olostep import AsyncOlostep
async with AsyncOlostep(api_key="your-api-key") as client:
result = await client.scrapes.create(url_to_scrape="https://example.com")
```
### Scrapes
```python
# Simple
result = client.scrapes.create(url_to_scrape="https://example.com")
print(result.html_content)
print(result.markdown_content)
# With options (strings)
result = client.scrapes.create(
url_to_scrape="https://example.com",
formats=["html", "markdown"],
wait_before_scraping=2000,
country="us",
)
# With typed enums (better IDE support)
from olostep import Format, Country
result = client.scrapes.create(
url_to_scrape="https://example.com",
formats=[Format.HTML, Format.MARKDOWN],
wait_before_scraping=2000,
country=Country.US,
)
```
### Batches
```python
batch = client.batches.create(
urls=["https://example.com", "https://example.org"]
)
for item in batch.items():
content = item.retrieve(["html"])
print(f"Processed {item.url}: {len(content.html_content)} bytes")
```
### Crawls
```python
crawl = client.crawls.create(
start_url="https://example.com",
max_pages=100,
include_urls=["/blog/**"],
exclude_urls=["/admin/**"],
include_external=False,
include_subdomain=True,
)
for page in crawl.pages():
content = page.retrieve(["html"])
print(f"Crawled: {page.url}")
```
### Maps
```python
maps = client.maps.create(url="https://example.com")
for url in maps.urls():
print(url)
```
### Answers
```python
# Simple question
answer = client.answers.create(task="What is the capital of France?")
print(answer.answer) # parsed dict: {'result': 'Paris'}
print(answer.sources) # list of source URLs
# Structured output with JSON format
answer = client.answers.create(
task="What is the latest book by J.K. Rowling?",
json_format={"book_title": "", "author": "", "release_date": ""},
)
parsed = answer.answer # {'book_title': '...', 'author': '...', 'release_date': '...'}
print(answer.sources) # ['https://...', ...]
# Retrieve a previous answer by ID
previous = client.answers.get(answer_id=answer.id)
```
### Advanced — browser actions
The Python SDK exports typed action classes for better IDE support and validation:
```python
from olostep import WaitAction, ClickAction, FillInputAction, ScrollAction
result = client.scrapes.create(
url_to_scrape="https://example.com",
actions=[
WaitAction(milliseconds=2000),
ClickAction(selector="#load-more"),
ScrollAction(direction="down", amount=1000),
FillInputAction(selector="#search", value="query"),
],
)
```
You can also use raw dicts (same as the REST API):
```python
result = client.scrapes.create(
url_to_scrape="https://example.com",
actions=[
{"type": "wait", "milliseconds": 2000},
{"type": "click", "selector": "#load-more"},
{"type": "scroll", "direction": "down", "amount": 1000},
{"type": "fill_input", "selector": "#search", "value": "query"},
],
)
```
### Advanced — LLM extraction
The Python SDK exports a typed `LLMExtract` class for schema-based extraction:
```python
from olostep import LLMExtract
# With a JSON schema (typed class)
result = client.scrapes.create(
url_to_scrape="https://example.com/product",
formats=["json"],
llm_extract=LLMExtract(schema={
"title": {"type": "string"},
"price": {"type": "number"},
"description": {"type": "string"},
}),
)
print(result.json_content)
# With a natural language prompt (raw dict — LLMExtract class only supports schema)
result = client.scrapes.create(
url_to_scrape="https://example.com/events",
formats=["json"],
llm_extract={
"prompt": "Extract all event names, dates, and venues from this page"
},
)
```
### Error handling
```python
from olostep import Olostep, Olostep_BaseError
try:
result = client.scrapes.create(url_to_scrape="https://example.com")
except Olostep_BaseError as e:
print(f"Olostep error: {type(e).__name__}: {e}")
```
### Retry strategy
```python
from olostep import Olostep, RetryStrategy
client = Olostep(
api_key="your-api-key",
retry_strategy=RetryStrategy(max_retries=3, initial_delay=1.0),
)
```
---
## Olostep REST API Reference (for raw HTTP integrations)
**Base URL**: `https://api.olostep.com/v1`
**Auth**: `Authorization: Bearer <API_KEY>`
| Method | Endpoint | Purpose |
|--------|----------|---------|
| POST | `/v1/scrapes` | Scrape a single URL |
| GET | `/v1/scrapes/:id` | Get scrape result |
| POST | `/v1/batches` | Start a batch (up to 10k URLs) |
| GET | `/v1/batches` | List all batches |
| GET | `/v1/batches/:id` | Get batch status |
| GET | `/v1/batches/:id/items` | Get batch results |
| POST | `/v1/crawls` | Start a crawl |
| GET | `/v1/crawls/:id` | Get crawl status |
| GET | `/v1/crawls/:id/pages` | Get crawled pages |
| POST | `/v1/maps` | Map a website's URLs |
| POST | `/v1/answers` | Get AI-powered answers |
| GET | `/v1/answers/:id` | Get answer result |
| GET | `/v1/retrieve` | Retrieve content by ID |
---
## Pre-built parsers
For structured JSON extraction from popular sites, use the `parser` parameter:
- `@olostep/google-search` — Google SERP results
- `@olostep/amazon-it-product` — Amazon product data
- `@olostep/extract-emails` — Extract email addresses
- `@olostep/extract-calendars` — Extract calendar events
- `@olostep/extract-socials` — Extract social media links
More parsers available via the dashboard: https://www.olostep.com/dashboard/parsers
---
## Framework-specific integration patterns
### Next.js (App Router) + Vercel AI SDK
Create `lib/olostep.ts`:
```ts
import Olostep from 'olostep';
export const olostep = new Olostep({
apiKey: process.env.OLOSTEP_API_KEY!,
});
```
Create a tool for the AI SDK in the chat route (e.g., `app/api/chat/route.ts`):
```ts
import { olostep } from '@/lib/olostep';
import { tool } from 'ai';
import { z } from 'zod';
export const webSearchTool = tool({
description: 'Search the web and get AI-powered answers with sources',
parameters: z.object({
query: z.string().describe('The search query'),
}),
execute: async ({ query }) => {
// answers is not yet in the Node SDK — call REST API directly
const res = await fetch('https://api.olostep.com/v1/answers', {
method: 'POST',
headers: {
Authorization: `Bearer ${process.env.OLOSTEP_API_KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({ task: query }),
});
return res.json();
},
});
export const scrapeTool = tool({
description: 'Scrape a webpage and get its content as markdown',
parameters: z.object({
url: z.string().url().describe('The URL to scrape'),
}),
execute: async ({ url }) => {
const result = await olostep.scrapes.create(url);
return result.markdown_content;
},
});
```
### Express.js API routes
```ts
import Olostep from 'olostep';
import express from 'express';
const olostep = new Olostep({ apiKey: process.env.OLOSTEP_API_KEY! });
const router = express.Router();
router.post('/scrape', async (req, res) => {
try {
const { url, format = 'markdown' } = req.body;
const result = await olostep.scrapes.create({ url, formats: [format] });
res.json(result);
} catch (error) {
res.status(500).json({ error: 'Scrape failed' });
}
});
router.post('/search', async (req, res) => {
try {
const { query, json } = req.body;
const response = await fetch('https://api.olostep.com/v1/answers', {
method: 'POST',
headers: {
Authorization: `Bearer ${process.env.OLOSTEP_API_KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({ task: query, ...(json && { json }) }),
});
const result = await response.json();
res.json(result);
} catch (error) {
res.status(500).json({ error: 'Search failed' });
}
});
export default router;
```
### FastAPI (Python)
```python
from fastapi import FastAPI, HTTPException
from olostep import AsyncOlostep
from contextlib import asynccontextmanager
olostep_client: AsyncOlostep | None = None
@asynccontextmanager
async def lifespan(app: FastAPI):
global olostep_client
olostep_client = AsyncOlostep() # reads OLOSTEP_API_KEY from env
yield
if olostep_client:
await olostep_client.close()
app = FastAPI(lifespan=lifespan)
@app.post("/scrape")
async def scrape(url: str, format: str = "markdown"):
result = await olostep_client.scrapes.create(
url_to_scrape=url, formats=[format]
)
return {"content": result.markdown_content}
@app.post("/search")
async def search(query: str):
result = await olostep_client.answers.create(task=query)
return result
```
### LangChain (Python)
```python
from langchain_olostep import (
scrape_website,
answer_question,
scrape_batch,
crawl_website,
map_website,
)
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_openai import ChatOpenAI
tools = [scrape_website, answer_question, scrape_batch, crawl_website, map_website]
llm = ChatOpenAI(model="gpt-4o")
agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
```
### LangChain (TypeScript)
```ts
import { OlostepScrape, OlostepAnswer, OlostepBatch, OlostepCrawl, OlostepMap } from 'langchain-olostep';
const tools = [
new OlostepScrape(),
new OlostepAnswer(),
new OlostepBatch(),
new OlostepCrawl(),
new OlostepMap(),
];
```
### CrewAI
```python
from crewai import Agent, Task, Crew
from crewai_olostep import (
olostep_scrape_tool,
olostep_answer_tool,
olostep_batch_tool,
olostep_crawl_tool,
olostep_map_tool,
)
researcher = Agent(
role="Web Researcher",
goal="Find accurate, current information from the web",
backstory="Expert researcher with web scraping capabilities.",
tools=[olostep_scrape_tool, olostep_answer_tool],
verbose=True,
)
task = Task(
description="Research the pricing of Stripe vs Square vs PayPal",
expected_output="Comparison table with pricing tiers",
agent=researcher,
)
crew = Crew(agents=[researcher], tasks=[task])
result = crew.kickoff()
```
### OpenAI function calling (direct)
```ts
import Olostep from 'olostep';
import OpenAI from 'openai';
const olostep = new Olostep({ apiKey: process.env.OLOSTEP_API_KEY! });
const openai = new OpenAI();
const tools: OpenAI.ChatCompletionTool[] = [
{
type: 'function',
function: {
name: 'web_search',
description: 'Search the web and return answers with sources',
parameters: {
type: 'object',
properties: {
query: { type: 'string', description: 'Search query' },
},
required: ['query'],
},
},
},
{
type: 'function',
function: {
name: 'scrape_url',
description: 'Extract content from a webpage as markdown',
parameters: {
type: 'object',
properties: {
url: { type: 'string', description: 'URL to scrape' },
},
required: ['url'],
},
},
},
];
// In your tool handler:
async function handleToolCall(name: string, args: Record<string, string>) {
if (name === 'web_search') {
// answers is not yet in the Node SDK — call REST API directly
const res = await fetch('https://api.olostep.com/v1/answers', {
method: 'POST',
headers: {
Authorization: `Bearer ${process.env.OLOSTEP_API_KEY!}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({ task: args.query }),
});
return res.json();
}
if (name === 'scrape_url') {
const result = await olostep.scrapes.create(args.url);
return result.markdown_content;
}
}
```
### Data pipeline / script (Python)
```python
from olostep import Olostep
client = Olostep() # reads OLOSTEP_API_KEY from env
# Step 1: Map the site to find URLs
site_map = client.maps.create(url="https://example.com", top_n=500)
urls = [url for url in site_map.urls()]
# Step 2: Batch scrape all URLs
batch = client.batches.create(urls=urls)
for item in batch.items():
content = item.retrieve(["markdown"])
print(f"Scraped: {item.url} — {len(content.markdown_content)} chars")
# Step 3: Get AI answers from the data
answer = client.answers.create(
task="Summarize the key products from this website",
json={"products": [{"name": "", "description": "", "price": ""}]},
)
print(answer)
```
---
## MCP Server integration
If the user is also using Cursor, set up the MCP server config alongside the SDK integration.
Add to `.cursor/mcp.json`:
```json
{
"mcpServers": {
"olostep": {
"command": "npx",
"args": ["-y", "olostep-mcp"],
"env": {
"OLOSTEP_API_KEY": "<API_KEY>"
}
}
}
}
```
MCP tools available: `scrape_website`, `get_webpage_content`, `search_web`, `google_search`, `answers`, `batch_scrape_urls`, `create_crawl`, `create_map`, `get_website_urls`.
---
## Phase 6 — Verify the integration
After writing all code:
1. Run the install command (`npm install` / `pip install`) if not already done.
2. Run a quick test — use the Olostep MCP tools to scrape a simple URL (like `https://example.com`) to verify the API key works.
3. Show the user a summary of what was created:
- Files created or modified
- SDK installed
- Environment variable set
- What they can do next
4. Suggest next steps:
- "Try calling the Olostep scrape endpoint to test"
- "Run your app and test the new API route / tool"
- "Check out `/scrape`, `/search`, and `/research` for ad-hoc web data"
## Migration from other providers
If the user is migrating from another scraping API:
| From | Migration notes |
|------|----------------|
| **Firecrawl** | `FirecrawlApp.scrape_url()` → `client.scrapes.create()`, `FirecrawlApp.crawl_url()` → `client.crawls.create()`, `FirecrawlApp.map_url()` → `client.maps.create()` |
| **Apify** | Replace Actor runs with Olostep endpoints. `ApifyClient.actor().call()` → `client.scrapes.create()` or `client.batches.create()` |
| **ScrapingBee** | Replace `ScrapingBeeClient.get()` → `client.scrapes.create()` with `waitBeforeScraping` for JS rendering |
| **Browserbase** | Replace session-based scraping → `client.scrapes.create()` with `actions` for interaction |
## Tips
- For JS-heavy sites (SPAs, React apps), always use `waitBeforeScraping: 2000` or higher
- Use `answers` for quick web search — it's 1 call instead of search + scrape + parse
- Use `batches` instead of looping `scrapes` when you have >5 URLs — it's faster and cheaper
- Use `maps` before `crawls` to understand site structure first
- For e-commerce/SERP, use pre-built parsers for clean JSON instead of markdown
- Always handle errors — the SDK throws `Olostep_BaseError` (Python) or rejects the promise (Node)
- Set up retry strategy for production: `RetryStrategy(max_retries=3)` in Python, or wrap in try/catch with backoff in Node
## Links
- Get your API key: https://olostep.com/auth
- Node.js SDK docs: https://docs.olostep.com/sdks/node-js
- Python SDK docs: https://docs.olostep.com/sdks/python
- API Reference: https://docs.olostep.com/api-reference/scrapes/create
- Features overview: https://docs.olostep.com/get-started/welcome
- MCP Server: https://github.com/olostep/olostep-mcp-server
- LangChain integration: https://docs.olostep.com/integrations/langchain
- CrewAI integration: https://docs.olostep.com/integrations/crewai
- Parsers list: https://docs.olostep.com/features/structured-content/parsersmap
Discover and list all URLs on a website using Olostep. Use when the user wants to see what pages exist on a site, find specific sections (docs, blog, API reference), audit a site's structure, or get a URL list to feed into batch or crawl.
# Olostep Map
Discover every URL on a website instantly. The essential reconnaissance step before scraping, crawling, or batch processing.
## Why this matters
You can't scrape what you can't find. Map gives you a complete inventory of every page on a site — filtered by keyword, narrowed by URL pattern — so you know exactly what's there before spending crawl or batch credits.
## When to use
- User wants to know what pages exist on a site before committing to a scrape or crawl
- User wants to find specific sections: all blog posts, all API reference pages, all pricing pages
- User wants to understand a site's structure and content organisation
- User needs a URL list to feed into `/batch` or `/crawl`
- User says "find all the pages about X on this site"
## Workflow
1. Use `create_map` with the website URL.
2. Use `search_query` to rank/filter URLs by relevance to a keyword.
3. Use `include_url_patterns` / `exclude_url_patterns` to narrow by path structure.
4. Use `top_n` to limit results if the site is very large.
5. Group URLs by section (e.g. `/docs/`, `/blog/`, `/api/`) when presenting results.
6. **Always suggest a next step**: scrape one URL, batch scrape a subset, or crawl a section.
## Real developer workflows
**"Find the right docs page"**
> "Map https://docs.stripe.com and find all pages about webhooks"
→ `create_map` with `search_query: "webhooks"` → returns webhook-related URLs ranked by relevance → user picks the best one → `/scrape` it.
**"Discover all blog posts"**
> "Map https://vercel.com/blog and list all blog posts"
→ Map with `include_url_patterns: ["/blog/**"]` → returns all blog post URLs → offer to `/batch` scrape for content analysis.
**"Understand a competitor's site"**
> "Map https://competitor.com and tell me what sections they have"
→ Map the site → group URLs by path segment (`/docs/`, `/pricing/`, `/blog/`, `/changelog/`) → summarise content focus areas.
**"Find all API reference pages"**
> "Map https://docs.openai.com and list only the API reference pages, not the guides"
→ `include_url_patterns: ["/api-reference/**"]`, `exclude_url_patterns: ["/guides/**"]` → clean list of reference pages only.
**"Plan before crawling"**
> "I want to scrape the entire Tailwind CSS docs. Map it first so I know what I'm dealing with."
→ Map → see 120 URLs across `/docs/`, `/blog/`, `/showcase/` → user picks just `/docs/` → feed those into `/batch` or `/crawl`.
**"Audit our own site"**
> "Map our marketing site and count how many pages we have per section"
→ Map → group by path → count per section → surface any orphaned or unexpected pages.
## Chain with other skills
- **map → batch**: The most powerful combo. Map a site → filter to the URLs you care about → batch scrape them all. Best for known sites with lots of pages.
- **map → crawl**: Map first to understand structure, then crawl a specific section with `start_url` set to that section's root.
- **map → scrape**: Map to find the exact right page, then scrape just that one.
## Parameters
- **website_url**: URL to map (required)
- **search_query**: Keyword to rank/filter URLs by relevance (optional but very useful)
- **top_n**: Max number of URLs to return (optional — useful for large sites)
- **include_url_patterns**: Glob patterns to include — e.g. `["/blog/**", "/docs/**"]` (optional)
- **exclude_url_patterns**: Glob patterns to exclude — e.g. `["/admin/**", "/internal/**"]` (optional)
## Tips
- **Always suggest a next step**: map alone isn't useful — the value is in what you do with the URLs (scrape, batch, crawl)
- `search_query` is powerful — it ranks all URLs by how relevant they are to your keyword, so the best pages float to the top
- Use `include_url_patterns` / `exclude_url_patterns` to be surgical — e.g. include only `/docs/api/**` and exclude `/docs/legacy/**`
- For very large sites (1000+ pages), use `top_n: 50` to keep results manageable
- This is faster than crawling — it discovers URLs without downloading page contentmigrate-code
Scrape a migration guide, changelog, or breaking changes document using Olostep and automatically update the user's local code. Use when the user is upgrading a framework, moving to a new API version, or says "help me migrate this code based on these docs".
# Olostep Migrate Code
Take the pain out of major version upgrades. Scrape the official migration guide, understand the breaking changes, and refactor the user's codebase automatically.
## When to use
- User provides a URL to a migration guide (e.g., React 18 to 19, Pages to App Router, API v2 to v3)
- User asks to update their deprecated code to use the latest patterns
- User wants to audit a file for deprecated methods based on a new release
## Workflow
1. Scrape the provided migration guide URL using `scrape_website` (use `output_format: markdown`).
2. If the guide spans multiple pages, use `get_website_urls` or `create_map` to find the specific pages relevant to the user's code, and `batch_scrape_urls` to read them all.
3. Extract the specific "Before" and "After" patterns, breaking changes, and deprecated methods from the scraped markdown.
4. Analyze the user's open or selected code files.
5. Apply the necessary changes, explaining exactly what was updated based on the migration guide.
## Real developer workflows
**"Migrate to a new SDK version"**
> "Scrape https://stripe.com/docs/upgrades and update my webhook handler from API version 2022 to 2026."
→ Scrapes the guide, learns the new event payload shapes, and rewrites the TypeScript interfaces and handler logic.
**"Framework upgrades"**
> "Read the Next.js 15 migration guide at https://nextjs.org/docs/app/building-your-application/upgrading/version-15 and update my layout.tsx file."
→ Scrapes the URL, identifies that `params` are now asynchronous, and refactors the user's layout component to `await params`.
**"Replacing a deprecated library"**
> "We are moving from Moment.js to date-fns. Scrape the date-fns docs and rewrite this component's date logic."
→ Scrapes the `date-fns` API reference for the equivalent functions, then swaps out all the Moment.js imports and method calls.
## Parameters to use
### `scrape_website`
- **url_to_scrape**: URL of the migration guide or changelog
- **output_format**: `markdown`
- **wait_before_scraping**: `2000` (many modern docs sites are SPAs)
## Tips
- Always ensure you scrape the docs *before* writing any code. Do not rely on your internal training data for new framework versions.
- If the migration is complex, break it down step-by-step for the user.
- Add code comments linking back to the scraped documentation URL for the user's future reference.research
Do thorough, cited web research to help the user make a decision. Compare tools, libraries, or services. Evaluate competitors. Research tech stacks, market data, or architecture patterns. Use when the user needs a well-sourced answer before making a technical, product, or business decision.
# Olostep Research
Do thorough, multi-source web research and present findings as structured, actionable intelligence. This is the skill for when the user needs to make a decision and wants real data — not opinions from stale training data.
## Why this matters
Developers and product teams make dozens of "which tool should we use?" and "should we build or buy?" decisions. These decisions are expensive to get wrong. This skill uses live web data to give cited, structured, current answers — not guesses from training data that may be 6+ months old.
## When to use
- User is choosing between tools, libraries, frameworks, or services
- User wants to understand competitors' features, pricing, positioning, or tech stack
- User is evaluating build vs buy for a component
- User needs market data, adoption stats, or trend information
- User wants to fact-check a technical claim or marketing statement
- User says "help me decide", "what should I use", "compare X vs Y", "is X worth it"
## Research methodology
1. **Quick answer**: Use `answers` with a `json` schema for structured, cited results.
2. **Deep dive**: Scrape the most relevant sources (`scrape_website`) for full detail.
3. **Multi-site comparison**: Batch scrape competitors' pricing/features/docs pages.
4. **Site-specific research**: Use `get_website_urls` to find relevant pages within a specific site.
5. **Synthesise and recommend**: Present findings as a table, structured JSON, or clear recommendation — **always end with a recommendation**, don't just list facts.
## Real developer workflows
**"Help me pick a tool"**
> "Compare Prisma, Drizzle, and Kysely for a serverless TypeScript API. Which should I use?"
→ `answers` with `json: [{"orm": "", "pros": [], "cons": [], "serverless_fit": "", "bundle_size": "", "learning_curve": ""}]` → structured comparison → clear recommendation based on the user's serverless constraint.
**"Competitive intelligence"**
> "I'm building a Firecrawl alternative. What do Firecrawl, Apify, and Browserbase offer, what do they charge, and where are the gaps?"
→ `answers` for broad overview → `batch_scrape_urls` on each competitor's pricing + features pages → detailed comparison with gap analysis → actionable insights.
**"Build vs buy"**
> "Should I build my own auth system or use Clerk/Auth0/Supabase Auth? What are the trade-offs for a B2B SaaS with SSO?"
→ `answers` with schema covering cost, SSO support, migration difficulty, vendor lock-in → scrape each service's SSO docs for detail → recommendation.
**"Is this production-ready?"**
> "Is Bun production-ready in 2026? I want to replace Node in our API."
→ `answers` aggregates: GitHub activity, open issues, community sentiment, production case studies → returns a structured risk assessment.
**"Tech stack deep dive"**
> "What stack does Figma use for real-time collaboration? I'm building something similar."
→ `answers` → finds relevant engineering blog posts → `scrape_website` on the blog posts → detailed technical breakdown of their architecture.
**"Market sizing"**
> "How many developers use Tailwind CSS? I'm deciding whether to build a Tailwind component library."
→ `answers` with `json: [{"stat": "", "value": "", "source": "", "date": ""}]` → cited data points for npm downloads, GitHub stars, survey results.
**"Vendor evaluation"**
> "We're choosing a vector database. Compare Pinecone, Weaviate, Qdrant, and Milvus for a RAG pipeline with 10M documents."
→ `answers` for capabilities overview → `batch_scrape_urls` on each vendor's pricing page → synthesised comparison with our scale requirements.
**"Architecture research"**
> "How do companies like Vercel and Netlify handle serverless function cold starts? What techniques should I use?"
→ `answers` + scrape relevant engineering blog posts → extract patterns, techniques, benchmarks → summarise with code-applicable recommendations.
## Chain with other skills
- **research → scrape**: Get the quick answer, then scrape cited sources for full detail.
- **research → batch**: Compare multiple competitors by batch scraping their key pages.
- **research → docs-to-code**: Research which library to use, then use `/docs-to-code` to integrate it.
- **research → crawl**: For deep technical research, crawl an entire engineering blog or docs site.
## Parameters
### Core tool: `answers`
- **task**: Research question — be specific and include constraints (required)
- **json**: Output schema — structure the answer so it's directly usable (strongly recommended)
### Supporting tools (for deep dives)
- `scrape_website`: Read a specific source in full
- `batch_scrape_urls`: Compare multiple sites side-by-side (pricing, features, docs)
- `get_website_urls`: Find specific pages within a site (e.g. competitor's pricing page)
- `google_search`: Get raw SERP results when you need to discover sources first
## Tips
- **Always use a `json` schema** — structured output makes comparisons fair and useful
- **Always end with a recommendation** — users want a decision, not just a list of facts
- For competitive research, **scrape actual pricing/features pages** — `answers` gives good overviews but may miss fine print
- Share citation URLs so users can verify and explore deeper
- For "is X ready" questions, look for: GitHub commit activity, open issue count, production usage testimonials, latest release date
- Use `country` on scrape/batch when comparing regional pricing or availabilityscrape
Scrape a single webpage and extract its content as clean markdown, HTML, JSON, or text using Olostep. Use when the user shares a URL and wants to read it, extract content from it, understand its structure, use it as context for coding, or turn a page into data. Handles JavaScript-rendered pages, SPAs, and bot-protected sites automatically.
# Olostep Scrape
Extract clean, LLM-ready content from any URL - including JavaScript-heavy SPAs, bot-protected sites, and geo-restricted pages. No Puppeteer, no proxy config, no CAPTCHA solving - Olostep handles it all.
## Why this matters
Most scraping fails on modern sites: JavaScript doesn't render, CAPTCHAs block you, content varies by country. Olostep runs a full browser with residential proxies and anti-bot bypass built in. You get the exact content a real user sees, in the format you need.
## When to use
- User pastes a URL and says "read this", "summarise this", "use this as context"
- User wants to pull content from a docs page to write code against that API
- User wants to extract structured data from a product page, job listing, or pricing page
- User needs page content from a JS-heavy SPA (React, Next.js, Vue) that raw HTTP can't reach
- User wants geo-specific content (e.g. UK pricing vs US pricing)
## Workflow
1. Use `scrape_website` with the URL.
2. Default to `output_format: markdown` - cleanest for AI reasoning and code generation.
3. For JS-heavy pages (React, Next.js, Vue, Svelte apps), set `wait_before_scraping: 3000` to let the page fully render.
4. For structured product data, use `output_format: json` with a `parser` (e.g. `@olostep/amazon-product`).
5. For geo-targeted content, set `country` (e.g. `GB` for UK pricing, `DE` for German content).
6. After scraping, **act on the content** - write code from it, extract fields, summarise, compare.
## Real developer workflows
**"Read the docs and write the integration"**
> "Scrape https://docs.stripe.com/api/payment_intents and write me a TypeScript function to create a payment intent"
-> Scrape with `markdown` -> parse the endpoint spec -> write a typed function with error handling.
**"What changed in this release?"**
> "Scrape https://github.com/vercel/next.js/releases and tell me what breaking changes are in the latest version"
-> Scrape the releases page -> extract version notes -> flag breaking changes relevant to the user's stack.
**"Get product data from Amazon"**
> "Scrape this Amazon product page and extract the price, rating, and feature bullets"
-> Use `parser: "@olostep/amazon-product"` with `output_format: json` -> get clean structured data instantly.
**"See what users in the UK see"**
> "Scrape https://example.com/pricing with country GB and tell me the UK pricing"
-> Set `country: "GB"` -> request goes through a UK residential proxy -> get the exact page UK users see.
**"Read a JS-heavy SPA"**
> "Scrape this React dashboard at https://app.example.com/public/stats"
-> Set `wait_before_scraping: 5000` -> Olostep's browser waits for React to hydrate -> returns the fully rendered content.
**"Pull a job listing for my cover letter"**
> "Scrape this job posting and write a cover letter paragraph highlighting the matching skills from my resume"
-> Scrape the listing -> extract requirements and qualifications -> generate tailored content.
## Chain with other skills
- **scrape -> code**: Scrape docs, then write an integration (see `/docs-to-code`).
- **map -> scrape**: Use `/map` to find the right page, then scrape it.
- **search -> scrape**: Use `/search` to find relevant URLs, then scrape the best one for full content.
- **scrape -> batch**: Scrape one page to confirm format, then `/batch` the rest.
## Parameters
- **url_to_scrape**: URL to scrape (required)
- **output_format**: `markdown` (default), `html`, `json`, `text`
- **wait_before_scraping**: ms to wait for JS rendering, 0-10000 (use 3000-5000 for SPAs)
- **country**: Country code for geo-targeted content - e.g. `US`, `GB`, `DE`, `JP`, `BR`
- **parser**: Specialised parser for structured extraction - e.g. `@olostep/amazon-product`
## Tips
- Always **act on** the scraped content - don't just return raw markdown to the user
- For JS-heavy pages, `wait_before_scraping: 3000` is usually enough; use 5000 for very slow SPAs
- The `country` parameter routes through residential proxies - use it for pricing, localised content, or region-locked pages
- If you need 3+ URLs, switch to `/batch` - it's parallel and faster
- If you need an entire site, use `/crawl` insteadsearch
Search the web using Olostep for live, up-to-date information. Three tools available — answers (AI-synthesised answers with citations), google_search (structured Google results), and get_website_urls (find pages within a specific site). Use when the user needs current data, comparisons, or facts the AI's training data may not have.
# Olostep Search
Search the live web and get answers, structured results, or find specific pages within a site. Three search tools for different needs.
## Why this matters
AI training data goes stale. When a user asks "which library is best" or "what does X cost" — they need **today's answer**, not last year's. Olostep searches the live web, synthesises answers with source citations, and can return structured JSON you can use directly in code.
## When to use
- User asks "what is the best X for Y" — needs live rankings, not stale training data
- User wants to compare tools, libraries, or services with current pricing
- User needs a fact that may have changed since the AI's cutoff (versions, status, pricing, team)
- User wants to find specific pages within a documentation site or company site
- User asks a question that starts with "is X still...", "what's the latest...", "has Y changed..."
## Which tool to use
| Need | Tool | Why |
|------|------|-----|
| Broad research question, comparison, fact-check | `answers` | AI-synthesised answer from multiple sources, with citations |
| Raw Google results (organic, PAA, knowledge graph) | `google_search` | Structured SERP data when you need the actual search results |
| Find pages within a specific site | `get_website_urls` | Ranked URLs from one domain, sorted by relevance to your query |
## Workflow
**For open research / comparisons / fact-checking:**
1. Use `answers` with the user's question as `task`.
2. If the result needs to be used in code, a table, or a comparison, pass a `json` schema.
3. Present the answer with source URLs — offer to `/scrape` any source for deeper detail.
**For raw Google search results:**
1. Use `google_search` with the query.
2. Returns structured organic results, knowledge graph, People Also Ask, related searches.
3. Use this when you need the actual URLs and snippets, not a synthesised answer.
**For finding pages within a specific site:**
1. Use `get_website_urls` with the site URL and search query.
2. Returns up to 100 URLs ranked by relevance.
3. Offer to `/scrape` the most relevant result for full content.
## Real developer workflows
**"Find the best library for this"**
> "What's the best React drag-and-drop library in 2026?"
→ `answers` with schema `[{"library": "", "stars": "", "bundle_size": "", "last_release": "", "verdict": ""}]` → structured comparison with live data.
**"What's the current price?"**
> "What do Vercel, Netlify, and Render charge for a Pro plan?"
→ `answers` with schema `[{"provider": "", "pro_price": "", "included": ""}]` → live pricing you can paste into a doc or comparison page.
**"Is this still maintained?"**
> "Is moment.js still maintained? What should I use instead?"
→ `answers` returns current status + recommended alternatives with sources.
**"Find the right docs page"**
> "Search the Supabase docs for row-level security examples"
→ `get_website_urls` on `https://supabase.com/docs` with query `"row level security"` → ranked URLs → scrape the top one.
**"Get raw search results for a competitive analysis"**
> "Google 'best headless CMS 2026' and give me the top 10 organic results"
→ `google_search` → structured list of titles, URLs, snippets, and any knowledge graph data.
**"Check current version / status"**
> "What's the latest version of Next.js and when was it released?"
→ `answers` → returns current version, release date, key changes, with source link.
## Chain with other skills
- **search → scrape**: Use `answers` or `google_search` to find URLs, then `/scrape` the best ones for full content.
- **search → docs-to-code**: Search for the docs URL, then use `/docs-to-code` workflow to write the integration.
- **search → batch**: Get a list of URLs from search results, then `/batch` scrape all of them.
## Parameters
### `answers` — AI-synthesised answers with citations
- **task**: Question or research task (required)
- **json**: Output schema for structured results (optional but powerful)
- String: `"return an array with name, price, and pros for each option"`
- Object: `{"name": "", "price": "", "pros": [], "cons": []}`
### `google_search` — Structured Google SERP results
- **query**: Search query (required)
- **country**: Country code for localised results, e.g. `US`, `GB` (default: `US`)
### `get_website_urls` — Find pages within a specific site
- **url**: Site to search within (required)
- **search_query**: What to find (required)
## Tips
- `answers` is the most powerful — it searches the web, reads sources, and returns a synthesised answer with citations
- The `json` param on `answers` lets you get data in a shape you can paste straight into code or a markdown table
- Use `google_search` when you need the raw SERP (organic results, PAA, knowledge graph) rather than a synthesised answer
- Use `get_website_urls` when the user already knows which site to search — it's faster and more precise than broad web search
- Always offer to `/scrape` cited URLs if the user wants deeper detail on any resultsetup
Configure the Olostep API key so the MCP server and all Olostep skills work correctly. Use when the user is setting up Olostep for the first time, has no API key configured, or is getting authentication errors.
# Olostep Setup
Get the Olostep MCP server running in Cursor so all Olostep skills work.
## Steps
1. **Get an API key**: If the user doesn't have one, direct them to https://olostep.com/auth — sign up is free and takes 30 seconds. The free tier includes 100 credits/month.
2. **Configure the MCP server**: Add the following to `.cursor/mcp.json` in the project root (or go to Cursor Settings → Features → MCP Servers → Add New):
```json
{
"mcpServers": {
"olostep": {
"command": "npx",
"args": ["-y", "olostep-mcp"],
"env": {
"OLOSTEP_API_KEY": "<their-api-key>"
}
}
}
}
```
3. **Verify it works**: Suggest the user tries a simple test:
- `/scrape` a URL to confirm scraping works
- `/search` a question to confirm answers works
## Troubleshooting
- **"401 Unauthorized"**: The API key is missing or invalid. Check it at https://olostep.com/auth.
- **"npx not found"**: Node.js is not installed. Install it from https://nodejs.org.
- **Tools not appearing**: Restart Cursor after adding the MCP config. Check Settings → Features → MCP to confirm the server is listed and running.
- **Stale npx cache**: If tools seem outdated, run `npx clear-npx-cache` in a terminal, then restart Cursor.
## What you get
Once configured, all Olostep skills are available:
- `/scrape` — Extract content from any URL (JS rendering + anti-bot bypass built in)
- `/search` — AI-powered web search with structured JSON output
- `/batch` — Scrape up to 10,000 URLs in parallel
- `/crawl` — Autonomously ingest entire websites
- `/map` — Discover all URLs on a site
- `/answers` — Get cited, structured answers from live web data
- `/docs-to-code` — Turn documentation into working code
- `/research` — Deep web research for technical and product decisions
## Links
- Get your API key: https://olostep.com/auth
- Documentation: https://docs.olostep.com
- MCP server on npm: https://www.npmjs.com/package/olostep-mcp
- GitHub: https://github.com/olostep/olostep-mcp-server