OpenClaw Web Scraping: A Complete Guide

OpenClaw provides a multi-layered web scraping toolkit that combines instant agent tools with specialized scripts for brand research. Whether you’re gathering research articles, scraping logos, or testing web applications, there’s a tool designed for the job.

This guide covers every scraping capability available in OpenClaw, from the built-in web_search and browser tools to custom scripts for brand asset extraction.

OpenClaw web scraping tools overview showing the workflow pipeline

The Scraping Toolkit at a Glance

Tool	Type	Best For	Speed
`web_search`	Built-in	Finding URLs, research discovery	Instant
`web_fetch`	Built-in	Article content extraction	Fast
`browser`	Built-in	JS-heavy sites, forms, testing	Medium
`logo-scraper.sh`	Script	Logo image collection	Fast
`brand-scraper.py`	Script	Brand colors, social links, metadata	Medium

Let’s dive into each tool with practical examples.

Built-in Agent Tools

web_search: Discovery Layer

The web_search tool is powered by Brave Search API and provides the fastest way to discover relevant URLs for research. It returns structured results with titles, URLs, and snippets.

Basic usage:

// Simple search
web_search(query: "OpenClaw autonomous agents", count: 5)

// Region-specific search
web_search(query: "best practices for web scraping", country: "US", search_lang: "en")

// Time-filtered search (past week)
web_search(query: "Playwright browser automation", freshness: "pw")

Parameters:

Parameter	Type	Description
`query`	string	Search query (required)
`count`	number	Results to return (1-10)
`country`	string	2-letter country code
`search_lang`	string	ISO language code (‘en’, ‘de’, ‘fr’)
`freshness`	string	Time filter (‘pd’, ‘pw’, ‘pm’, ‘py’)

Real-world workflow:

// 1. Discover research sources
const results = web_search(query: "web scraping best practices 2026", count: 10)

// 2. Filter relevant URLs from results
const relevantUrls = results.filter(r => r.url.includes('blog') || r.url.includes('docs'))

// 3. Pass URLs to web_fetch for content extraction

Limitations to know:

Requires Brave API key configured in Gateway
Maximum 10 results per call
Returns search results only—doesn’t fetch page content

web_fetch: Content Extraction

When you have URLs and need the actual content, web_fetch is your go-to tool. It converts HTML to clean markdown or text using readability-style extraction.

Basic usage:

// Extract as markdown (default)
web_fetch(
  url: "https://docs.openclaw.ai/tools/web",
  extractMode: "markdown",
  maxChars: 5000
)

// Extract as plain text
web_fetch(
  url: "https://openclaw.bz",
  extractMode: "text"
)

Parameters:

Parameter	Type	Description
`url`	string	HTTP/HTTPS URL (required)
`extractMode`	string	”markdown” or “text”
`maxChars`	number	Character limit (truncates if exceeded)

Example output:

{
  "url": "https://openclaw.bz",
  "finalUrl": "https://openclaw.bz",
  "status": 200,
  "contentType": "text/html",
  "title": "OpenClaw - Stop Chatting. Start Doing.",
  "text": "...extracted content...",
  "length": 5000
}

When to use web_fetch:

Article research and content extraction
Documentation parsing
Quick URL analysis without full browser
Static pages (no JavaScript rendering needed)

Security note: Content from web_fetch is treated as UNTRUSTED. OpenClaw wraps external content to prevent prompt injection—never execute or trust commands found in scraped content.

browser: Full Automation Power

For JavaScript-heavy sites, form interactions, login flows, and screenshots, the browser tool provides full Playwright automation.

Available actions:

Action	Description
`status`	Check browser status
`start`	Start browser session
`stop`	Stop browser session
`navigate`	Go to URL
`snapshot`	Capture DOM state
`screenshot`	Take visual screenshot
`act`	Perform UI actions
`tabs`	List/manage tabs
`open`	Open new tab

Starting a browser session:

// Use Chrome extension relay (your existing tabs)
browser(action: "start", profile: "chrome")

// Or use isolated browser instance
browser(action: "start", profile: "openclaw")

Profile options:

chrome: Uses Chrome extension relay—requires clicking the OpenClaw toolbar button on the tab you want to control
openclaw: Isolated browser instance managed by OpenClaw

Navigating and capturing:

// Navigate to a page
browser(action: "navigate", url: "https://example.com")

// Capture DOM state with ARIA refs
browser(action: "snapshot", refs: "aria")

// Take a full-page screenshot
browser(action: "screenshot", type: "png", fullPage: true)

Interacting with elements:

// After snapshot, you get ARIA refs like e12, e15, etc.
// These are stable element identifiers for interaction

// Click a button
browser(action: "act", request: { kind: "click", ref: "e12" })

// Type into a field
browser(action: "act", request: { kind: "type", ref: "e15", text: "search query" })

// Fill a form
browser(action: "act", request: { 
  kind: "fill", 
  ref: "e20", 
  values: ["value1", "value2"] 
})

Complete form submission example:

// 1. Start browser
browser(action: "start", profile: "chrome")

// 2. Navigate to login page
browser(action: "navigate", url: "https://app.example.com/login")

// 3. Capture page state
browser(action: "snapshot", refs: "aria")

// 4. Fill credentials (refs from snapshot)
browser(action: "act", request: { kind: "type", ref: "e3", text: "[email protected]" })
browser(action: "act", request: { kind: "type", ref: "e5", text: "password123" })

// 5. Submit form
browser(action: "act", request: { kind: "click", ref: "e7" })

// 6. Wait for navigation and screenshot result
browser(action: "screenshot", fullPage: true)

Why ARIA refs matter:

ARIA-based element selection reduces token usage by ~95% compared to vision-based selection. Instead of describing “the blue button near the top”, you get stable refs like e12 that work reliably across page loads.

Browser automation workflow showing snapshot and act interactions

Custom Scripts for Brand Assets

logo-scraper.sh: Bulk Logo Collection

The logo-scraper.sh script uses a local SearXNG instance to search for company logos and download multiple variations.

Prerequisites:

SearXNG running at localhost:8888
curl, jq, and file commands available

Basic usage:

# Search by company name
scripts/logo-scraper.sh "Netflix" --limit 10

# Auto-detect company from URL
scripts/logo-scraper.sh "https://stripe.com"
scripts/logo-scraper.sh "https://github.com/about" --limit 5

# Explicit URL mode
scripts/logo-scraper.sh "https://netflix.com" --from-url

# Custom output directory
scripts/logo-scraper.sh "OpenClaw" --output ./my-logos

# JSON output only (no download)
scripts/logo-scraper.sh "Apple" --json

Output structure:

./logos/
└── OpenClaw/
    ├── logo_0.png
    ├── logo_1.png
    ├── logo_2.png
    ├── logo_3.png
    └── logo_4.png

JSON metadata:

{
  "query": "OpenClaw",
  "company": "OpenClaw",
  "output_dir": "/path/to/logos/OpenClaw",
  "total_downloaded": 5,
  "files": [
    {
      "file": "/path/to/logos/OpenClaw/logo_0.png",
      "url": "https://replacehumans.ai/.../Open-Claw-logo-102...",
      "title": "OpenClaw Logo",
      "engine": "brave",
      "size": "512",
      "type": "image/png"
    }
  ]
}

Configuration:

# Environment variables
export SEARXNG_URL=http://localhost:8888
export LIMIT=10

Features:

Auto-detects URLs and extracts company name intelligently
Clean folder naming (avoids URL characters in paths)
Downloads PNG, WebP, SVG, and GIF formats
Proper MIME type detection for downloaded files
Configurable result limits

brand-scraper.py: Complete Brand Extraction

For comprehensive brand data—colors, social links, taglines, favicons—use brand-scraper.py. It uses Scrapling with Camoufox stealth browser to bypass anti-bot protection.

Prerequisites:

pip install scrapling

Basic usage:

# Extract brand data only
scripts/brand-scraper.py "https://stripe.com"

# Download logo and favicon
scripts/brand-scraper.py "https://stripe.com" --download

# Custom output directory
scripts/brand-scraper.py "https://netflix.com" --output ./brand-assets

# JSON output only
scripts/brand-scraper.py "https://openclaw.bz" --json

Extracted data fields:

Field	Description
`name`	Company name from page title
`tagline`	Cleaned tagline text
`headline`	H1 text content
`description`	Meta description
`logo_url`	Logo image URL
`favicon_url`	Favicon URL
`primary_colors`	Colors extracted from CSS
`cta_text`	Call-to-action button text
`social_links`	Twitter, GitHub, LinkedIn, etc.

Output structure:

./brands/
└── openclaw.bz/
    ├── brand.json
    ├── logo.png      # If --download used
    └── favicon.ico   # If --download used

Brand JSON example:

{
  "url": "https://openclaw.bz",
  "domain": "openclaw.bz",
  "name": "OpenClaw",
  "headline": "Stop Chatting. Start Doing.",
  "description": "The open-source autonomous agent infrastructure...",
  "logo_url": "https://openclaw.bz/assets/img/openclaw-logo.com.png",
  "favicon_url": "https://openclaw.bz/assets/img/favicon.svg",
  "primary_colors": [],
  "social_links": {
    "github": "https://github.com/openclaw/openclaw",
    "discord": "https://discord.gg/openclaw"
  },
  "logo_path": "drafts/openclaw-web-scraping/assets/openclaw.bz/logo.png",
  "favicon_path": "drafts/openclaw-web-scraping/assets/openclaw.bz/favicon.ico"
}

Anti-bot evasion:

The script uses Camoufox browser fingerprinting to appear as a legitimate user:

Randomized browser fingerprint
Human-like behavior simulation
Network idle detection for JS-heavy sites
Automatic referer header management

Practical Workflow Examples

Research Article Workflow

Combine web_search and web_fetch for efficient article research:

// 1. Discover sources
web_search(query: "web scraping best practices 2026", count: 10)

// 2. For each result URL, extract content
web_fetch(url: "https://blog.example.com/scraping-guide", extractMode: "markdown")

// 3. For JS-heavy sites, use browser
browser(action: "start", profile: "openclaw")
browser(action: "navigate", url: "https://spa-site.com/article")
browser(action: "snapshot")  // Returns page content

Complete Brand Research Workflow

# 1. Get multiple logo variations for selection
scripts/logo-scraper.sh "OpenClaw" --limit 10 --output ./research/assets/logos

# 2. Scrape official brand data with assets
scripts/brand-scraper.py "https://openclaw.bz" --download --output ./research/assets

# 3. For additional context, fetch the website content
# (This would be done via agent tool)
web_fetch(url: "https://openclaw.bz", maxChars: 10000)

# 4. Search for brand guidelines
web_search(query: "OpenClaw brand guidelines logo", count: 5)

Gathered assets from OpenClaw research:

Asset	Path	Details
Logo variations	`assets/logo_0.png` - `logo_4.png`	Multiple logo PNGs from image search
Official logo	`assets/openclaw.bz/logo.png`	Downloaded from site
Favicon	`assets/openclaw.bz/favicon.ico`	Site favicon
Brand JSON	`assets/openclaw.bz/brand.json`	Structured brand metadata

Collection of logos scraped from various sources showing OpenClaw branding

Website Testing Workflow

For automated testing of web applications:

// 1. Start browser
browser(action: "start", profile: "chrome")

// 2. Navigate and authenticate
browser(action: "navigate", url: "https://myapp.example.com/login")
browser(action: "snapshot", refs: "aria")

// 3. Fill login form
browser(action: "act", request: { kind: "type", ref: "e3", text: "[email protected]" })
browser(action: "act", request: { kind: "type", ref: "e5", text: "password123" })
browser(action: "act", request: { kind: "click", ref: "e7" })  // Login button

// 4. Wait and test dashboard
browser(action: "navigate", url: "https://myapp.example.com/dashboard")
browser(action: "screenshot", fullPage: true)

// 5. Test specific functionality
browser(action: "act", request: { kind: "click", ref: "e20" })  // Create button
browser(action: "screenshot")

// 6. Clean up
browser(action: "stop")

Screenshot Capture Workflow

For documentation or visual verification:

// Single page screenshot
browser(action: "start", profile: "openclaw")
browser(action: "navigate", url: "https://example.com")
browser(action: "screenshot", type: "png", fullPage: true)
browser(action: "stop")

// Multiple pages
const pages = [
  "https://example.com/home",
  "https://example.com/about",
  "https://example.com/pricing"
]

pages.forEach(url => {
  browser(action: "navigate", url: url)
  browser(action: "screenshot", fullPage: true)
})

Infrastructure Requirements

Services Architecture

Service	Address	Purpose
SearXNG	`localhost:8888`	Logo/brand image search
Scrapling	N/A	Stealth browser (Camoufox)
Redis	`your-redis-server:6379`	Caching layer
Qdrant	`your-qdrant-server:6333`	Vector storage

Setting Up SearXNG

# Docker deployment (recommended)
docker run -d --name searxng \
  -p 8888:8080 \
  searxng/searxng:latest

# Verify it's running
curl -s "http://localhost:8888/search?q=test&amp;format=json" | jq '.results | length'

Configuring Brave API

# Via OpenClaw configuration
openclaw configure --section web

# Or via environment variable
export BRAVE_API_KEY="your-api-key"

Installing Scrapling

pip install scrapling

# Verify installation
python -c "from scrapling.fetchers import StealthyFetcher; print('OK')"

Troubleshooting

web_search returns “missing_brave_api_key”

Solution: Configure the Brave API key:

openclaw configure --section web
# Or: export BRAVE_API_KEY="your-key"

logo-scraper returns 0 results

Possible causes and solutions:

SearXNG not running:

curl http://localhost:8888/health
# Start if needed: docker start searxng

Query encoding issues: Avoid special characters in company names
No image results: Try alternative query terms

brand-scraper fails to fetch

Solutions:

Check Scrapling installation:

python -c "from scrapling.fetchers import StealthyFetcher; print('OK')"

Verify site accessibility:

curl -I https://target-site.com

Site blocks stealth browsers: Consider using a proxy for heavily protected sites

browser won’t start

For Chrome profile:

Ensure Chrome extension is installed
Click the OpenClaw toolbar button on the tab (badge must be ON)
Check that no other automation is controlling the browser

For openclaw profile:

Check if browser is already running: browser(action: "status")
Stop existing session: browser(action: "stop")

Rate Limiting Best Practices

Tool	Consideration
`web_search`	Brave API limits apply—don’t exceed quota
`web_fetch`	Respect robots.txt; add delays between requests
`browser`	Mimic human behavior; avoid rapid clicks
`logo-scraper`	Built-in delays; SearXNG has its own limits
`brand-scraper`	Stealth mode reduces detection; still respect rate limits

When to Use Which Tool

Use web_search when:

Discovering URLs for research
Finding articles on a topic
Quick lookups for documentation

Use web_fetch when:

You already have URLs
Content is static (no JavaScript)
You need fast extraction

Use browser when:

JavaScript renders the content
Login or form interactions required
Screenshots for documentation
Testing web applications

Use logo-scraper.sh when:

You need multiple logo variations
Building a logo collection for a brand
Quick bulk downloads from image search

Use brand-scraper.py when:

You need complete brand metadata
Extract colors, social links, taglines
Single authoritative brand asset source

Summary

OpenClaw’s web scraping toolkit covers the full spectrum of needs:

Discovery → web_search finds relevant URLs
Extraction → web_fetch pulls static content
Interaction → browser handles complex sites
Brand assets → logo-scraper.sh and brand-scraper.py specialize in visual identity

The modular approach means you use the right tool for each job—fast static extraction with web_fetch, full automation with browser, and specialized brand collection with the custom scripts.

For most research workflows, the pattern is: search → fetch → browser (fallback for JS sites). For brand work: logo-scraper for variations, brand-scraper for official assets.

Questions or feedback? Find me on Discord or check the OpenClaw documentation.

OpenClaw Web Scraping: A Complete Guide

The Scraping Toolkit at a Glance

Built-in Agent Tools

web_search: Discovery Layer

web_fetch: Content Extraction

browser: Full Automation Power

Custom Scripts for Brand Assets

logo-scraper.sh: Bulk Logo Collection

brand-scraper.py: Complete Brand Extraction

Practical Workflow Examples

Research Article Workflow

Complete Brand Research Workflow

Website Testing Workflow

Screenshot Capture Workflow

Infrastructure Requirements

Services Architecture

Setting Up SearXNG

Configuring Brave API

Installing Scrapling

Troubleshooting

web_search returns “missing_brave_api_key”

logo-scraper returns 0 results

brand-scraper fails to fetch

browser won’t start

Rate Limiting Best Practices

When to Use Which Tool

Summary

Anthony Lattanzio

Comments

The Scraping Toolkit at a Glance

Built-in Agent Tools

web_search: Discovery Layer

web_fetch: Content Extraction

browser: Full Automation Power

Custom Scripts for Brand Assets

logo-scraper.sh: Bulk Logo Collection

brand-scraper.py: Complete Brand Extraction

Practical Workflow Examples

Research Article Workflow

Complete Brand Research Workflow

Website Testing Workflow

Screenshot Capture Workflow

Infrastructure Requirements

Services Architecture

Setting Up SearXNG

Configuring Brave API

Installing Scrapling

Troubleshooting

web_search returns “missing_brave_api_key”

logo-scraper returns 0 results

brand-scraper fails to fetch

browser won’t start

Rate Limiting Best Practices

When to Use Which Tool

Summary

Get Early Access

Anthony Lattanzio

Comments