webclaw

Fast, local-first web content extraction for LLMs. Scrape, crawl, extract structured data — all from Rust. CLI, REST API, and MCP server.

Publisher: 0xMassi
Repository: webclaw
Language: Rust
Forks: 133
Stars: 1.1K
Available tools: 10
Transport type: stdio
License: AGPL-3.0

Most web scraping tools give your agent one of two bad outputs:

  • a blocked page, login wall, or empty app shell
  • raw HTML full of nav, scripts, styling, ads, and duplicated boilerplate

webclaw.io is the hosted web extraction API for webclaw. This repo contains the open-source CLI, MCP server, extraction engine, and self-hostable server.

webclaw turns a URL into clean content your tools can actually use.

bash
webclaw https://example.com --format markdown
md
# Example Domain

This domain is for use in illustrative examples in documents.

You may use this domain in literature without prior coordination or asking for permission.

Use it from the terminal, wire it into Claude/Cursor through MCP, call the hosted API from your app, or self-host the OSS server.


Install

Agent setup

The fastest way to connect webclaw to Claude Code, Claude Desktop, Cursor, Windsurf, OpenCode, Codex CLI, and other MCP-compatible tools:

bash
npx create-webclaw

The installer detects supported clients and configures the MCP server for you.

Homebrew

bash
brew tap 0xMassi/webclaw
brew install webclaw

Prebuilt binaries

Download macOS and Linux binaries from GitHub Releases.

Docker

bash
docker run --rm ghcr.io/0xmassi/webclaw https://example.com

Cargo

bash
cargo install --git https://github.com/0xMassi/webclaw.git webclaw-cli
cargo install --git https://github.com/0xMassi/webclaw.git webclaw-mcp

If building from source fails because native build tools are missing, install the platform prerequisites:

| OS | Command |
| --- | --- |
| Debian / Ubuntu | `sudo apt install -y pkg-config libssl-dev cmake clang git build-essential` |
| Fedora / RHEL | `sudo dnf install -y pkg-config openssl-devel cmake clang git make gcc` |
| Arch | `sudo pacman -S pkg-config openssl cmake clang git base-devel` |
| macOS | `xcode-select --install` |

Quick Start

Scrape one page

bash
webclaw https://stripe.com --format markdown

Return LLM-optimized text

bash
webclaw https://docs.anthropic.com --format llm

Keep only the main content

bash
webclaw https://example.com/blog/post --only-main-content

Include or exclude selectors

bash
webclaw https://example.com \
  --include "article, main, .content" \
  --exclude "nav, footer, .sidebar, .ad"
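
The flags shown so far can also be assembled from a script. The following is a minimal Python sketch, not part of webclaw: `build_args` and `scrape` are hypothetical helper names, only flags that appear in this README are used, and it assumes the binary is on `PATH`, prints its result to stdout, and accepts selector lists joined with ", " as in the example above.

```python
import shutil
import subprocess

# Format names are the ones listed in this README.
FORMATS = {"markdown", "llm", "text", "json", "html"}


def build_args(url, fmt=None, only_main_content=False, include=None, exclude=None):
    """Return the argv list for one webclaw scrape (hypothetical helper)."""
    args = ["webclaw", url]
    if fmt is not None:
        if fmt not in FORMATS:
            raise ValueError(f"unknown format: {fmt}")
        args += ["--format", fmt]
    if only_main_content:
        args.append("--only-main-content")
    if include:
        # Assumption: selectors are passed as one comma-separated string.
        args += ["--include", ", ".join(include)]
    if exclude:
        args += ["--exclude", ", ".join(exclude)]
    return args


def scrape(url, **options):
    """Run webclaw and return its stdout; assumes the binary is on PATH."""
    if shutil.which("webclaw") is None:
        raise FileNotFoundError("webclaw binary not found on PATH")
    result = subprocess.run(
        build_args(url, **options), capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    print(build_args(
        "https://example.com",
        fmt="markdown",
        include=["article", "main", ".content"],
        exclude=["nav", "footer", ".sidebar", ".ad"],
    ))
```

Keeping the argument builder separate from the subprocess call makes the command easy to inspect or log before anything is executed.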

Crawl a documentation site

bash
webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
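
To make the two limits concrete, here is a toy Python sketch of a same-origin, breadth-first crawl over an in-memory link graph. It illustrates the general idea only and is not webclaw's crawler: the traversal order, the meaning of depth (link hops from the start URL), and the `crawl_order` helper are all assumptions, and the URLs are made up.

```python
from collections import deque
from urllib.parse import urlsplit


def same_origin(a, b):
    """True when two URLs share scheme and host[:port]."""
    pa, pb = urlsplit(a), urlsplit(b)
    return (pa.scheme, pa.netloc) == (pb.scheme, pb.netloc)


def crawl_order(start, links, depth, max_pages):
    """Breadth-first walk over `links` (dict: URL -> list of outgoing URLs)."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, level = queue.popleft()
        visited.append(url)
        if level == depth:
            continue  # do not follow links past the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen and same_origin(start, nxt):
                seen.add(nxt)
                queue.append((nxt, level + 1))
    return visited


# Made-up link graph standing in for a real site.
site = {
    "https://docs.example.com/": ["https://docs.example.com/a", "https://other.example.org/x"],
    "https://docs.example.com/a": ["https://docs.example.com/b"],
    "https://docs.example.com/b": ["https://docs.example.com/c"],
}
print(crawl_order("https://docs.example.com/", site, depth=2, max_pages=50))
```

In this toy graph the off-origin link is skipped and `/c` is never reached, because it sits three hops from the start page.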

Extract brand assets

bash
webclaw https://github.com --brand

Compare a page over time

bash
webclaw https://example.com/pricing --format json > pricing-old.json
webclaw https://example.com/pricing --diff-with pricing-old.json
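
For intuition about what a content comparison reports, the sketch below diffs two text snapshots with Python's standard `difflib`. It is a generic illustration, not webclaw's diff implementation or output format; `text_diff` is a hypothetical helper and the pricing text is invented sample data.

```python
import difflib


def text_diff(old, new, label="page"):
    """Return a unified diff between two text snapshots."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=f"{label} (old)",
        tofile=f"{label} (new)",
        lineterm="",
    )
    return "\n".join(lines)


# Invented sample snapshots, not real data.
old = "# Pricing\nStarter: $10/month\nTeam: $40/month"
new = "# Pricing\nStarter: $12/month\nTeam: $40/month"
print(text_diff(old, new, label="pricing"))
```

Diffing cleaned text rather than raw HTML keeps markup churn out of the result, so only content changes show up.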

MCP Server

webclaw ships with an MCP server for AI agents.

bash
npx create-webclaw

Manual config (if your client does not expand `~`, use the absolute path to the binary):

json
{
  "mcpServers": {
    "webclaw": {
      "command": "~/.webclaw/webclaw-mcp"
    }
  }
}
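
If you maintain the client config by script, the entry can be merged in without disturbing other servers. This is a hedged Python sketch: `add_webclaw_server` is a hypothetical helper, the binary location is the one shown above, and where your client stores its config file is left to you. It expands `~` up front on the assumption that some clients launch the command without a shell.

```python
import json
import os


def add_webclaw_server(config, binary="~/.webclaw/webclaw-mcp"):
    """Return a copy of an MCP client config with a webclaw entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    # Expand "~" here in case the client does not do it when spawning the process.
    servers["webclaw"] = {"command": os.path.expanduser(binary)}
    merged["mcpServers"] = servers
    return merged


# "other" stands in for any server you already have configured.
existing = {"mcpServers": {"other": {"command": "other-mcp"}}}
print(json.dumps(add_webclaw_server(existing), indent=2))
```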

Then ask your agent things like:

text
Scrape these competitor pricing pages and summarize the differences.
text
Crawl this documentation site and prepare clean context for a RAG index.
text
Extract the brand colors, fonts, and logos from this company website.

Tools

| Tool | What it does | Local |
| --- | --- | --- |
| scrape | Extract one URL as markdown, text, JSON, LLM format, or HTML | Yes |
| crawl | Follow same-origin links and extract discovered pages | Yes |
| map | Discover URLs without extracting every page | Yes |
| batch | Scrape multiple URLs in parallel | Yes |
| extract | Convert page content into structured data | Yes, with local or configured LLM |
| summarize | Summarize a page | Yes, with local or configured LLM |
| diff | Compare page content snapshots | Yes |
| brand | Extract colors, fonts, logos, and metadata | Yes |
| search | Search the web and scrape results | Hosted API |
| research | Multi-source research workflow | Hosted API |
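
MCP clients invoke tools like the ones listed above with JSON-RPC 2.0 `tools/call` requests; over the stdio transport each message is a single line of JSON. The Python sketch below only builds such a message. The argument name `url` is an assumption (the real input schema comes from the server's `tools/list` response), and a real client must complete the MCP `initialize` handshake before calling any tool.

```python
import json


def tool_call(request_id, name, arguments):
    """Build an MCP `tools/call` request as one newline-terminated JSON line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message) + "\n"


# Assumption: the scrape tool takes a "url" argument; verify via `tools/list`.
print(tool_call(1, "scrape", {"url": "https://example.com"}), end="")
```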

SDKs

bash
npm install @webclaw/sdk
pip install webclaw
go get github.com/0xMassi/webclaw-go
ts
import { Webclaw } from "@webclaw/sdk";

const client = new Webclaw({ apiKey: process.env.WEBCLAW_API_KEY! });

const page = await client.scrape({
  url: "https://example.com",
  formats: ["markdown"],
  only_main_content: true,
});

console.log(page.markdown);
python
from webclaw import Webclaw

client = Webclaw(api_key="wc_your_key")

page = client.scrape(
    "https://example.com",
    formats=["markdown"],
    only_main_content=True,
)

print(page.markdown)
bash
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'
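
If you would rather not install an SDK, the same request can be made with the Python standard library. This sketch mirrors the curl call above (same endpoint, headers, and body fields); `build_scrape_request` is a hypothetical helper, and because the response schema is not described here, the raw body is printed as-is.

```python
import json
import os
import urllib.request


def build_scrape_request(url, api_key, formats=("markdown",), only_main_content=True):
    """Build (but do not send) the POST request from the curl example."""
    body = {
        "url": url,
        "formats": list(formats),
        "only_main_content": only_main_content,
    }
    return urllib.request.Request(
        "https://api.webclaw.io/v1/scrape",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    key = os.environ.get("WEBCLAW_API_KEY")
    if key:
        req = build_scrape_request("https://example.com", key)
        with urllib.request.urlopen(req, timeout=30) as resp:
            print(resp.read().decode("utf-8"))
    else:
        print("WEBCLAW_API_KEY is not set; skipping the request")
```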

Output Formats

| Format | Use it when you need |
| --- | --- |
| markdown | Clean page content with structure preserved |
| llm | Compact context for agents and RAG pipelines |
| text | Plain text with minimal formatting |
| json | Structured metadata, links, images, and extracted fields |
| html | Cleaned HTML for custom processing |

Local First, Hosted When Needed

The CLI and MCP server work locally without an account for the core extraction path.

Use the hosted API at webclaw.io when you need:

  • protected-site access without managing infrastructure
  • JavaScript rendering
  • async crawl and research jobs
  • web search
  • watches and production usage tracking
  • SDKs for application code
bash
export WEBCLAW_API_KEY=wc_your_key

webclaw https://example.com --cloud

What You Can Build

| Use case | Example |
| --- | --- |
| AI agent web access | Give Claude, Cursor, or another MCP client clean page context |
| RAG ingestion | Crawl docs, help centers, blogs, and knowledge bases |
| Competitor monitoring | Track pricing pages, changelogs, docs, and product pages |
| Structured extraction | Turn messy pages into typed JSON for automations |
| Research workflows | Search, scrape, summarize, and cite multiple sources |
| Brand intelligence | Extract logos, colors, fonts, and social metadata |

Architecture

text
webclaw/
  crates/
    webclaw-core     HTML to markdown, text, JSON, and LLM-ready output
    webclaw-fetch    Fetching, crawling, batching, and mapping
    webclaw-llm      Local and hosted LLM provider support
    webclaw-pdf      PDF text extraction
    webclaw-mcp      MCP server for AI agents
    webclaw-cli      Command-line interface

webclaw-core is pure extraction logic: no network I/O, small surface area, and usable independently from the fetching layer.
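
The value of keeping extraction free of network I/O is that it becomes a plain function from an HTML string to text, which is easy to test and reuse. The toy Python sketch below illustrates that separation only; it is not webclaw-core's API or algorithm (webclaw is written in Rust), and `MainText`, `extract_text`, and the list of skipped tags are illustrative assumptions.

```python
from html.parser import HTMLParser

# Assumption: a small, fixed set of elements treated as boilerplate.
SKIPPED = {"nav", "script", "style", "footer", "aside"}


class MainText(HTMLParser):
    """Collect visible text, ignoring anything inside SKIPPED elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # how many SKIPPED elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Pure function: HTML string in, text out, no network I/O."""
    parser = MainText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = "<nav>Home | Docs</nav><main><h1>Title</h1><p>Body text.</p></main><script>track()</script>"
print(extract_text(page))
```

Because `extract_text` never fetches anything, the same function works on HTML from a crawler, a file on disk, or a test fixture.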


Configuration

| Variable | Description |
| --- | --- |
| WEBCLAW_API_KEY | Hosted API key |
| OLLAMA_HOST | Ollama URL for local LLM features |
| OPENAI_API_KEY | OpenAI-compatible LLM provider key |
| OPENAI_BASE_URL | OpenAI-compatible base URL |
| ANTHROPIC_API_KEY | Anthropic-compatible LLM provider key |
| ANTHROPIC_BASE_URL | Anthropic-compatible base URL |
| WEBCLAW_PROXY | Single proxy URL |
| WEBCLAW_PROXY_FILE | Proxy pool file |
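
Before debugging a missing feature, it can help to check which of these variables are set in the current shell. The Python sketch below is a convenience script, not part of webclaw: the grouping into hosted, LLM, and proxy is only for display, it makes no claim about which provider webclaw prefers when several are set, and it prints variable names only so that no secret values end up in logs.

```python
import os

# Variable names come from the configuration table; the grouping is an assumption.
VARIABLES = {
    "hosted": ["WEBCLAW_API_KEY"],
    "llm": [
        "OLLAMA_HOST",
        "OPENAI_API_KEY",
        "OPENAI_BASE_URL",
        "ANTHROPIC_API_KEY",
        "ANTHROPIC_BASE_URL",
    ],
    "proxy": ["WEBCLAW_PROXY", "WEBCLAW_PROXY_FILE"],
}


def configured(env=None):
    """Map each group to the names of its variables that are set and non-empty."""
    env = os.environ if env is None else env
    return {
        group: [name for name in names if env.get(name)]
        for group, names in VARIABLES.items()
    }


if __name__ == "__main__":
    for group, names in configured().items():
        print(f"{group}: {', '.join(names) if names else 'none set'}")
```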

Contributing

The most useful contributions right now are practical and small:

  • add examples for real agent and RAG workflows
  • improve SDK snippets
  • report pages that extract poorly
  • add failing fixtures for messy HTML
  • improve docs for MCP clients and local setup
  • test the CLI on more Linux/macOS environments

If a page extracts badly, include:

text
URL:
Command or API request:
Expected output:
Actual output:
Format used: markdown / llm / text / json / html
CLI, MCP, SDK, or API:

Please remove secrets, cookies, private tokens, and customer data from logs before posting.


Contributors

Thanks to everyone improving webclaw through issues, examples, docs, bug reports, and pull requests.


License

AGPL-3.0

Installation

TypingMind
Prerequisites: Node.js 18+

{
  "mcpServers": {
    "webclaw": {
      "command": "npx",
      "args": [
        "-y",
        "create-webclaw"
      ]
    }
  }
}

Available Tools

  • scrape

    Extract content from any URL

  • crawl

    Recursive site crawl

  • map

    Discover URLs from sitemaps

  • batch

    Parallel multi-URL extraction

  • extract

    LLM-powered structured extraction

  • summarize

    Page summarization

  • diff

    Content change detection

  • brand

    Brand identity extraction

  • search

    Web search + scrape results

  • research

    Deep multi-source research

Use webclaw MCP with multiple AI models

TypingMind connects MCP tools at the workspace level, so once webclaw is connected, you can use it with different AI models in TypingMind instead of setting it up separately for each model. This MCP runs locally through the TypingMind MCP connector on your device.

Setup guide to use the local connector

Use this when the MCP server needs access to local files, apps, or private resources on your computer.

Step 1: Open the MCP settings

In TypingMind, go to Settings, Advanced Settings, then Model Context Protocol and choose Setup Connector.

  1. Open TypingMind in your browser.
  2. Click the Settings icon.
  3. Go to Advanced Settings.
  4. Open the Model Context Protocol section.
  5. Click Setup Connector and choose This Device.

Step 2: Run the connector command

Choose This Device, copy the command from TypingMind, and run it in Terminal. Keep the process running while you use MCP.

  1. Copy the setup command shown by TypingMind.
  2. Open Terminal on macOS or Windows Terminal on Windows.
  3. Paste and run the command.
  4. Approve the package install if Terminal asks you to proceed.
  5. Keep the Terminal window running while using MCP tools.

Step 3: Add webclaw as a server

When the connector status is Ready, click Edit Servers and paste the MCP server configuration.

  1. Wait until the connector status shows Ready.
  2. Click Edit Servers.
  3. Paste the webclaw MCP server configuration.
  4. Save the server list.
  5. Refresh if you want to confirm the connector is still ready.
{
  "mcpServers": {
    "webclaw": {
      "command": "npx",
      "args": [
        "-y",
        "create-webclaw"
      ]
    }
  }
}

Step 4: Use it across models

Save the server list, open Plugins, enable the webclaw MCP tools, then select any supported AI model in TypingMind and use the tools in chat or assign them to an AI agent.

  1. Open the Plugins page in TypingMind.
  2. Enable the webclaw MCP tools.
  3. Start a chat and choose the AI model you want to use.
  4. Use the MCP tools in chat or assign them to an AI agent.
  5. Switch to another AI model whenever needed without reconnecting MCP.

Frequently asked questions

What is the webclaw MCP server used for?

webclaw is an MCP server for web content extraction: it lets compatible AI clients scrape, crawl, and extract structured data from web pages. In TypingMind, you can add this MCP server once and make its tools available in your AI workspace.

Can I use webclaw MCP with multiple AI models in TypingMind?

Yes. TypingMind connects MCP tools at the workspace level, so you can use webclaw with different AI models such as Claude, ChatGPT, Gemini, or other models you have configured in TypingMind without setting up the MCP server separately for each model.

Why use webclaw MCP with TypingMind?

TypingMind brings multiple AI models, prompts, plugins, AI agents, API keys, and MCP tools into one workspace. With webclaw connected, you can use its MCP tools across your preferred models while keeping your chat workflow organized in TypingMind.

How do I connect webclaw MCP to TypingMind?

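webclaw runs through the TypingMind local MCP connector: set up the connector for this device, run the connector command, paste the webclaw server configuration under Edit Servers, then enable the tools on the Plugins page. The local connector is best when the MCP server needs access to local files, desktop apps, command-line tools, or private resources on your computer.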

What tools does webclaw MCP provide in TypingMind?

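webclaw exposes 10 MCP tools: scrape, crawl, map, batch, extract, summarize, diff, brand, search, and research. They can be enabled from the TypingMind Plugins page and used in chat or assigned to AI agents. According to the tools table, search and research rely on the hosted API.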

Do I need to share my API keys with TypingMind to use webclaw MCP?

No. TypingMind is local-first and lets you keep your model providers, API keys, prompts, and MCP configuration under your control. If webclaw requires authentication, add the required headers, OAuth settings, or local configuration for that MCP server when you create the connection.
