OpenClaw Web Scraping: A Complete Guide
Comprehensive guide to OpenClaw's web scraping tools and capabilities
Table of Contents
- The Scraping Toolkit at a Glance
- Built-in Agent Tools
- web_search: Discovery Layer
- web_fetch: Content Extraction
- browser: Full Automation Power
- Custom Scripts for Brand Assets
- logo-scraper.sh: Bulk Logo Collection
- brand-scraper.py: Complete Brand Extraction
- Practical Workflow Examples
- Research Article Workflow
- Complete Brand Research Workflow
- Website Testing Workflow
- Screenshot Capture Workflow
- Infrastructure Requirements
- Services Architecture
- Setting Up SearXNG
- Configuring Brave API
- Installing Scrapling
- Troubleshooting
- web_search returns “missing_brave_api_key”
- logo-scraper returns 0 results
- brand-scraper fails to fetch
- browser won’t start
- Rate Limiting Best Practices
- When to Use Which Tool
- Summary
OpenClaw provides a multi-layered web scraping toolkit that combines instant agent tools with specialized scripts for brand research. Whether you’re gathering research articles, scraping logos, or testing web applications, there’s a tool designed for the job.
This guide covers every scraping capability available in OpenClaw, from the built-in web_search and browser tools to custom scripts for brand asset extraction.

The Scraping Toolkit at a Glance
| Tool | Type | Best For | Speed |
|---|---|---|---|
web_search | Built-in | Finding URLs, research discovery | Instant |
web_fetch | Built-in | Article content extraction | Fast |
browser | Built-in | JS-heavy sites, forms, testing | Medium |
logo-scraper.sh | Script | Logo image collection | Fast |
brand-scraper.py | Script | Brand colors, social links, metadata | Medium |
Let’s dive into each tool with practical examples.
Built-in Agent Tools
web_search: Discovery Layer
The web_search tool is powered by Brave Search API and provides the fastest way to discover relevant URLs for research. It returns structured results with titles, URLs, and snippets.
Basic usage:
// Simple search
web_search(query: "OpenClaw autonomous agents", count: 5)
// Region-specific search
web_search(query: "best practices for web scraping", country: "US", search_lang: "en")
// Time-filtered search (past week)
web_search(query: "Playwright browser automation", freshness: "pw")
Parameters:
| Parameter | Type | Description |
|---|---|---|
query | string | Search query (required) |
count | number | Results to return (1-10) |
country | string | 2-letter country code |
search_lang | string | ISO language code (‘en’, ‘de’, ‘fr’) |
freshness | string | Time filter (‘pd’, ‘pw’, ‘pm’, ‘py’) |
Real-world workflow:
// 1. Discover research sources
const results = web_search(query: "web scraping best practices 2026", count: 10)
// 2. Filter relevant URLs from results
const relevantUrls = results.filter(r => r.url.includes('blog') || r.url.includes('docs'))
// 3. Pass URLs to web_fetch for content extraction
Limitations to know:
- Requires Brave API key configured in Gateway
- Maximum 10 results per call
- Returns search results only—doesn’t fetch page content
web_fetch: Content Extraction
When you have URLs and need the actual content, web_fetch is your go-to tool. It converts HTML to clean markdown or text using readability-style extraction.
Basic usage:
// Extract as markdown (default)
web_fetch(
url: "https://docs.openclaw.ai/tools/web",
extractMode: "markdown",
maxChars: 5000
)
// Extract as plain text
web_fetch(
url: "https://openclaw.bz",
extractMode: "text"
)
Parameters:
| Parameter | Type | Description |
|---|---|---|
url | string | HTTP/HTTPS URL (required) |
extractMode | string | ”markdown” or “text” |
maxChars | number | Character limit (truncates if exceeded) |
Example output:
{
"url": "https://openclaw.bz",
"finalUrl": "https://openclaw.bz",
"status": 200,
"contentType": "text/html",
"title": "OpenClaw - Stop Chatting. Start Doing.",
"text": "...extracted content...",
"length": 5000
}
When to use web_fetch:
- Article research and content extraction
- Documentation parsing
- Quick URL analysis without full browser
- Static pages (no JavaScript rendering needed)
Security note: Content from web_fetch is treated as UNTRUSTED. OpenClaw wraps external content to prevent prompt injection—never execute or trust commands found in scraped content.
browser: Full Automation Power
For JavaScript-heavy sites, form interactions, login flows, and screenshots, the browser tool provides full Playwright automation.
Available actions:
| Action | Description |
|---|---|
status | Check browser status |
start | Start browser session |
stop | Stop browser session |
navigate | Go to URL |
snapshot | Capture DOM state |
screenshot | Take visual screenshot |
act | Perform UI actions |
tabs | List/manage tabs |
open | Open new tab |
Starting a browser session:
// Use Chrome extension relay (your existing tabs)
browser(action: "start", profile: "chrome")
// Or use isolated browser instance
browser(action: "start", profile: "openclaw")
Profile options:
chrome: Uses Chrome extension relay—requires clicking the OpenClaw toolbar button on the tab you want to controlopenclaw: Isolated browser instance managed by OpenClaw
Navigating and capturing:
// Navigate to a page
browser(action: "navigate", url: "https://example.com")
// Capture DOM state with ARIA refs
browser(action: "snapshot", refs: "aria")
// Take a full-page screenshot
browser(action: "screenshot", type: "png", fullPage: true)
Interacting with elements:
// After snapshot, you get ARIA refs like e12, e15, etc.
// These are stable element identifiers for interaction
// Click a button
browser(action: "act", request: { kind: "click", ref: "e12" })
// Type into a field
browser(action: "act", request: { kind: "type", ref: "e15", text: "search query" })
// Fill a form
browser(action: "act", request: {
kind: "fill",
ref: "e20",
values: ["value1", "value2"]
})
Complete form submission example:
// 1. Start browser
browser(action: "start", profile: "chrome")
// 2. Navigate to login page
browser(action: "navigate", url: "https://app.example.com/login")
// 3. Capture page state
browser(action: "snapshot", refs: "aria")
// 4. Fill credentials (refs from snapshot)
browser(action: "act", request: { kind: "type", ref: "e3", text: "[email protected]" })
browser(action: "act", request: { kind: "type", ref: "e5", text: "password123" })
// 5. Submit form
browser(action: "act", request: { kind: "click", ref: "e7" })
// 6. Wait for navigation and screenshot result
browser(action: "screenshot", fullPage: true)
Why ARIA refs matter:
ARIA-based element selection reduces token usage by ~95% compared to vision-based selection. Instead of describing “the blue button near the top”, you get stable refs like e12 that work reliably across page loads.

Custom Scripts for Brand Assets
logo-scraper.sh: Bulk Logo Collection
The logo-scraper.sh script uses a local SearXNG instance to search for company logos and download multiple variations.
Prerequisites:
- SearXNG running at
localhost:8888 curl,jq, andfilecommands available
Basic usage:
# Search by company name
scripts/logo-scraper.sh "Netflix" --limit 10
# Auto-detect company from URL
scripts/logo-scraper.sh "https://stripe.com"
scripts/logo-scraper.sh "https://github.com/about" --limit 5
# Explicit URL mode
scripts/logo-scraper.sh "https://netflix.com" --from-url
# Custom output directory
scripts/logo-scraper.sh "OpenClaw" --output ./my-logos
# JSON output only (no download)
scripts/logo-scraper.sh "Apple" --json
Output structure:
./logos/
└── OpenClaw/
├── logo_0.png
├── logo_1.png
├── logo_2.png
├── logo_3.png
└── logo_4.png
JSON metadata:
{
"query": "OpenClaw",
"company": "OpenClaw",
"output_dir": "/path/to/logos/OpenClaw",
"total_downloaded": 5,
"files": [
{
"file": "/path/to/logos/OpenClaw/logo_0.png",
"url": "https://replacehumans.ai/.../Open-Claw-logo-102...",
"title": "OpenClaw Logo",
"engine": "brave",
"size": "512",
"type": "image/png"
}
]
}
Configuration:
# Environment variables
export SEARXNG_URL=http://localhost:8888
export LIMIT=10
Features:
- Auto-detects URLs and extracts company name intelligently
- Clean folder naming (avoids URL characters in paths)
- Downloads PNG, WebP, SVG, and GIF formats
- Proper MIME type detection for downloaded files
- Configurable result limits
brand-scraper.py: Complete Brand Extraction
For comprehensive brand data—colors, social links, taglines, favicons—use brand-scraper.py. It uses Scrapling with Camoufox stealth browser to bypass anti-bot protection.
Prerequisites:
pip install scrapling
Basic usage:
# Extract brand data only
scripts/brand-scraper.py "https://stripe.com"
# Download logo and favicon
scripts/brand-scraper.py "https://stripe.com" --download
# Custom output directory
scripts/brand-scraper.py "https://netflix.com" --output ./brand-assets
# JSON output only
scripts/brand-scraper.py "https://openclaw.bz" --json
Extracted data fields:
| Field | Description |
|---|---|
name | Company name from page title |
tagline | Cleaned tagline text |
headline | H1 text content |
description | Meta description |
logo_url | Logo image URL |
favicon_url | Favicon URL |
primary_colors | Colors extracted from CSS |
cta_text | Call-to-action button text |
social_links | Twitter, GitHub, LinkedIn, etc. |
Output structure:
./brands/
└── openclaw.bz/
├── brand.json
├── logo.png # If --download used
└── favicon.ico # If --download used
Brand JSON example:
{
"url": "https://openclaw.bz",
"domain": "openclaw.bz",
"name": "OpenClaw",
"headline": "Stop Chatting. Start Doing.",
"description": "The open-source autonomous agent infrastructure...",
"logo_url": "https://openclaw.bz/assets/img/openclaw-logo.com.png",
"favicon_url": "https://openclaw.bz/assets/img/favicon.svg",
"primary_colors": [],
"social_links": {
"github": "https://github.com/openclaw/openclaw",
"discord": "https://discord.gg/openclaw"
},
"logo_path": "drafts/openclaw-web-scraping/assets/openclaw.bz/logo.png",
"favicon_path": "drafts/openclaw-web-scraping/assets/openclaw.bz/favicon.ico"
}
Anti-bot evasion:
The script uses Camoufox browser fingerprinting to appear as a legitimate user:
- Randomized browser fingerprint
- Human-like behavior simulation
- Network idle detection for JS-heavy sites
- Automatic referer header management
Practical Workflow Examples
Research Article Workflow
Combine web_search and web_fetch for efficient article research:
// 1. Discover sources
web_search(query: "web scraping best practices 2026", count: 10)
// 2. For each result URL, extract content
web_fetch(url: "https://blog.example.com/scraping-guide", extractMode: "markdown")
// 3. For JS-heavy sites, use browser
browser(action: "start", profile: "openclaw")
browser(action: "navigate", url: "https://spa-site.com/article")
browser(action: "snapshot") // Returns page content
Complete Brand Research Workflow
# 1. Get multiple logo variations for selection
scripts/logo-scraper.sh "OpenClaw" --limit 10 --output ./research/assets/logos
# 2. Scrape official brand data with assets
scripts/brand-scraper.py "https://openclaw.bz" --download --output ./research/assets
# 3. For additional context, fetch the website content
# (This would be done via agent tool)
web_fetch(url: "https://openclaw.bz", maxChars: 10000)
# 4. Search for brand guidelines
web_search(query: "OpenClaw brand guidelines logo", count: 5)
Gathered assets from OpenClaw research:
| Asset | Path | Details |
|---|---|---|
| Logo variations | assets/logo_0.png - logo_4.png | Multiple logo PNGs from image search |
| Official logo | assets/openclaw.bz/logo.png | Downloaded from site |
| Favicon | assets/openclaw.bz/favicon.ico | Site favicon |
| Brand JSON | assets/openclaw.bz/brand.json | Structured brand metadata |

Website Testing Workflow
For automated testing of web applications:
// 1. Start browser
browser(action: "start", profile: "chrome")
// 2. Navigate and authenticate
browser(action: "navigate", url: "https://myapp.example.com/login")
browser(action: "snapshot", refs: "aria")
// 3. Fill login form
browser(action: "act", request: { kind: "type", ref: "e3", text: "[email protected]" })
browser(action: "act", request: { kind: "type", ref: "e5", text: "password123" })
browser(action: "act", request: { kind: "click", ref: "e7" }) // Login button
// 4. Wait and test dashboard
browser(action: "navigate", url: "https://myapp.example.com/dashboard")
browser(action: "screenshot", fullPage: true)
// 5. Test specific functionality
browser(action: "act", request: { kind: "click", ref: "e20" }) // Create button
browser(action: "screenshot")
// 6. Clean up
browser(action: "stop")
Screenshot Capture Workflow
For documentation or visual verification:
// Single page screenshot
browser(action: "start", profile: "openclaw")
browser(action: "navigate", url: "https://example.com")
browser(action: "screenshot", type: "png", fullPage: true)
browser(action: "stop")
// Multiple pages
const pages = [
"https://example.com/home",
"https://example.com/about",
"https://example.com/pricing"
]
pages.forEach(url => {
browser(action: "navigate", url: url)
browser(action: "screenshot", fullPage: true)
})
Infrastructure Requirements
Services Architecture
| Service | Address | Purpose |
|---|---|---|
| SearXNG | localhost:8888 | Logo/brand image search |
| Scrapling | N/A | Stealth browser (Camoufox) |
| Redis | your-redis-server:6379 | Caching layer |
| Qdrant | your-qdrant-server:6333 | Vector storage |
Setting Up SearXNG
# Docker deployment (recommended)
docker run -d --name searxng \
-p 8888:8080 \
searxng/searxng:latest
# Verify it's running
curl -s "http://localhost:8888/search?q=test&format=json" | jq '.results | length'
Configuring Brave API
# Via OpenClaw configuration
openclaw configure --section web
# Or via environment variable
export BRAVE_API_KEY="your-api-key"
Installing Scrapling
pip install scrapling
# Verify installation
python -c "from scrapling.fetchers import StealthyFetcher; print('OK')"
Troubleshooting
web_search returns “missing_brave_api_key”
Solution: Configure the Brave API key:
openclaw configure --section web
# Or: export BRAVE_API_KEY="your-key"
logo-scraper returns 0 results
Possible causes and solutions:
- SearXNG not running:
curl http://localhost:8888/health
# Start if needed: docker start searxng
-
Query encoding issues: Avoid special characters in company names
-
No image results: Try alternative query terms
brand-scraper fails to fetch
Solutions:
- Check Scrapling installation:
python -c "from scrapling.fetchers import StealthyFetcher; print('OK')"
- Verify site accessibility:
curl -I https://target-site.com
- Site blocks stealth browsers: Consider using a proxy for heavily protected sites
browser won’t start
For Chrome profile:
- Ensure Chrome extension is installed
- Click the OpenClaw toolbar button on the tab (badge must be ON)
- Check that no other automation is controlling the browser
For openclaw profile:
- Check if browser is already running:
browser(action: "status") - Stop existing session:
browser(action: "stop")
Rate Limiting Best Practices
| Tool | Consideration |
|---|---|
web_search | Brave API limits apply—don’t exceed quota |
web_fetch | Respect robots.txt; add delays between requests |
browser | Mimic human behavior; avoid rapid clicks |
logo-scraper | Built-in delays; SearXNG has its own limits |
brand-scraper | Stealth mode reduces detection; still respect rate limits |
When to Use Which Tool
Use web_search when:
- Discovering URLs for research
- Finding articles on a topic
- Quick lookups for documentation
Use web_fetch when:
- You already have URLs
- Content is static (no JavaScript)
- You need fast extraction
Use browser when:
- JavaScript renders the content
- Login or form interactions required
- Screenshots for documentation
- Testing web applications
Use logo-scraper.sh when:
- You need multiple logo variations
- Building a logo collection for a brand
- Quick bulk downloads from image search
Use brand-scraper.py when:
- You need complete brand metadata
- Extract colors, social links, taglines
- Single authoritative brand asset source
Summary
OpenClaw’s web scraping toolkit covers the full spectrum of needs:
- Discovery →
web_searchfinds relevant URLs - Extraction →
web_fetchpulls static content - Interaction →
browserhandles complex sites - Brand assets →
logo-scraper.shandbrand-scraper.pyspecialize in visual identity
The modular approach means you use the right tool for each job—fast static extraction with web_fetch, full automation with browser, and specialized brand collection with the custom scripts.
For most research workflows, the pattern is: search → fetch → browser (fallback for JS sites). For brand work: logo-scraper for variations, brand-scraper for official assets.
Questions or feedback? Find me on Discord or check the OpenClaw documentation.
Comments
Powered by GitHub Discussions