OpenClaw Web Scraping: A Complete Guide

Comprehensive guide to OpenClaw's web scraping tools and capabilities

• 10 min read
web-scrapingopenclawautomationtools
OpenClaw Web Scraping: A Complete Guide

OpenClaw provides a multi-layered web scraping toolkit that combines instant agent tools with specialized scripts for brand research. Whether you’re gathering research articles, scraping logos, or testing web applications, there’s a tool designed for the job.

This guide covers every scraping capability available in OpenClaw, from the built-in web_search and browser tools to custom scripts for brand asset extraction.

OpenClaw web scraping tools overview showing the workflow pipeline

The Scraping Toolkit at a Glance

ToolTypeBest ForSpeed
web_searchBuilt-inFinding URLs, research discoveryInstant
web_fetchBuilt-inArticle content extractionFast
browserBuilt-inJS-heavy sites, forms, testingMedium
logo-scraper.shScriptLogo image collectionFast
brand-scraper.pyScriptBrand colors, social links, metadataMedium

Let’s dive into each tool with practical examples.

Built-in Agent Tools

web_search: Discovery Layer

The web_search tool is powered by Brave Search API and provides the fastest way to discover relevant URLs for research. It returns structured results with titles, URLs, and snippets.

Basic usage:

// Simple search
web_search(query: "OpenClaw autonomous agents", count: 5)

// Region-specific search
web_search(query: "best practices for web scraping", country: "US", search_lang: "en")

// Time-filtered search (past week)
web_search(query: "Playwright browser automation", freshness: "pw")

Parameters:

ParameterTypeDescription
querystringSearch query (required)
countnumberResults to return (1-10)
countrystring2-letter country code
search_langstringISO language code (‘en’, ‘de’, ‘fr’)
freshnessstringTime filter (‘pd’, ‘pw’, ‘pm’, ‘py’)

Real-world workflow:

// 1. Discover research sources
const results = web_search(query: "web scraping best practices 2026", count: 10)

// 2. Filter relevant URLs from results
const relevantUrls = results.filter(r => r.url.includes('blog') || r.url.includes('docs'))

// 3. Pass URLs to web_fetch for content extraction

Limitations to know:

  • Requires Brave API key configured in Gateway
  • Maximum 10 results per call
  • Returns search results only—doesn’t fetch page content

web_fetch: Content Extraction

When you have URLs and need the actual content, web_fetch is your go-to tool. It converts HTML to clean markdown or text using readability-style extraction.

Basic usage:

// Extract as markdown (default)
web_fetch(
  url: "https://docs.openclaw.ai/tools/web",
  extractMode: "markdown",
  maxChars: 5000
)

// Extract as plain text
web_fetch(
  url: "https://openclaw.bz",
  extractMode: "text"
)

Parameters:

ParameterTypeDescription
urlstringHTTP/HTTPS URL (required)
extractModestring”markdown” or “text”
maxCharsnumberCharacter limit (truncates if exceeded)

Example output:

{
  "url": "https://openclaw.bz",
  "finalUrl": "https://openclaw.bz",
  "status": 200,
  "contentType": "text/html",
  "title": "OpenClaw - Stop Chatting. Start Doing.",
  "text": "...extracted content...",
  "length": 5000
}

When to use web_fetch:

  • Article research and content extraction
  • Documentation parsing
  • Quick URL analysis without full browser
  • Static pages (no JavaScript rendering needed)

Security note: Content from web_fetch is treated as UNTRUSTED. OpenClaw wraps external content to prevent prompt injection—never execute or trust commands found in scraped content.

browser: Full Automation Power

For JavaScript-heavy sites, form interactions, login flows, and screenshots, the browser tool provides full Playwright automation.

Available actions:

ActionDescription
statusCheck browser status
startStart browser session
stopStop browser session
navigateGo to URL
snapshotCapture DOM state
screenshotTake visual screenshot
actPerform UI actions
tabsList/manage tabs
openOpen new tab

Starting a browser session:

// Use Chrome extension relay (your existing tabs)
browser(action: "start", profile: "chrome")

// Or use isolated browser instance
browser(action: "start", profile: "openclaw")

Profile options:

  • chrome: Uses Chrome extension relay—requires clicking the OpenClaw toolbar button on the tab you want to control
  • openclaw: Isolated browser instance managed by OpenClaw

Navigating and capturing:

// Navigate to a page
browser(action: "navigate", url: "https://example.com")

// Capture DOM state with ARIA refs
browser(action: "snapshot", refs: "aria")

// Take a full-page screenshot
browser(action: "screenshot", type: "png", fullPage: true)

Interacting with elements:

// After snapshot, you get ARIA refs like e12, e15, etc.
// These are stable element identifiers for interaction

// Click a button
browser(action: "act", request: { kind: "click", ref: "e12" })

// Type into a field
browser(action: "act", request: { kind: "type", ref: "e15", text: "search query" })

// Fill a form
browser(action: "act", request: { 
  kind: "fill", 
  ref: "e20", 
  values: ["value1", "value2"] 
})

Complete form submission example:

// 1. Start browser
browser(action: "start", profile: "chrome")

// 2. Navigate to login page
browser(action: "navigate", url: "https://app.example.com/login")

// 3. Capture page state
browser(action: "snapshot", refs: "aria")

// 4. Fill credentials (refs from snapshot)
browser(action: "act", request: { kind: "type", ref: "e3", text: "[email protected]" })
browser(action: "act", request: { kind: "type", ref: "e5", text: "password123" })

// 5. Submit form
browser(action: "act", request: { kind: "click", ref: "e7" })

// 6. Wait for navigation and screenshot result
browser(action: "screenshot", fullPage: true)

Why ARIA refs matter:

ARIA-based element selection reduces token usage by ~95% compared to vision-based selection. Instead of describing “the blue button near the top”, you get stable refs like e12 that work reliably across page loads.

Browser automation workflow showing snapshot and act interactions

Custom Scripts for Brand Assets

logo-scraper.sh: Bulk Logo Collection

The logo-scraper.sh script uses a local SearXNG instance to search for company logos and download multiple variations.

Prerequisites:

  • SearXNG running at localhost:8888
  • curl, jq, and file commands available

Basic usage:

# Search by company name
scripts/logo-scraper.sh "Netflix" --limit 10

# Auto-detect company from URL
scripts/logo-scraper.sh "https://stripe.com"
scripts/logo-scraper.sh "https://github.com/about" --limit 5

# Explicit URL mode
scripts/logo-scraper.sh "https://netflix.com" --from-url

# Custom output directory
scripts/logo-scraper.sh "OpenClaw" --output ./my-logos

# JSON output only (no download)
scripts/logo-scraper.sh "Apple" --json

Output structure:

./logos/
└── OpenClaw/
    ├── logo_0.png
    ├── logo_1.png
    ├── logo_2.png
    ├── logo_3.png
    └── logo_4.png

JSON metadata:

{
  "query": "OpenClaw",
  "company": "OpenClaw",
  "output_dir": "/path/to/logos/OpenClaw",
  "total_downloaded": 5,
  "files": [
    {
      "file": "/path/to/logos/OpenClaw/logo_0.png",
      "url": "https://replacehumans.ai/.../Open-Claw-logo-102...",
      "title": "OpenClaw Logo",
      "engine": "brave",
      "size": "512",
      "type": "image/png"
    }
  ]
}

Configuration:

# Environment variables
export SEARXNG_URL=http://localhost:8888
export LIMIT=10

Features:

  • Auto-detects URLs and extracts company name intelligently
  • Clean folder naming (avoids URL characters in paths)
  • Downloads PNG, WebP, SVG, and GIF formats
  • Proper MIME type detection for downloaded files
  • Configurable result limits

brand-scraper.py: Complete Brand Extraction

For comprehensive brand data—colors, social links, taglines, favicons—use brand-scraper.py. It uses Scrapling with Camoufox stealth browser to bypass anti-bot protection.

Prerequisites:

pip install scrapling

Basic usage:

# Extract brand data only
scripts/brand-scraper.py "https://stripe.com"

# Download logo and favicon
scripts/brand-scraper.py "https://stripe.com" --download

# Custom output directory
scripts/brand-scraper.py "https://netflix.com" --output ./brand-assets

# JSON output only
scripts/brand-scraper.py "https://openclaw.bz" --json

Extracted data fields:

FieldDescription
nameCompany name from page title
taglineCleaned tagline text
headlineH1 text content
descriptionMeta description
logo_urlLogo image URL
favicon_urlFavicon URL
primary_colorsColors extracted from CSS
cta_textCall-to-action button text
social_linksTwitter, GitHub, LinkedIn, etc.

Output structure:

./brands/
└── openclaw.bz/
    ├── brand.json
    ├── logo.png      # If --download used
    └── favicon.ico   # If --download used

Brand JSON example:

{
  "url": "https://openclaw.bz",
  "domain": "openclaw.bz",
  "name": "OpenClaw",
  "headline": "Stop Chatting. Start Doing.",
  "description": "The open-source autonomous agent infrastructure...",
  "logo_url": "https://openclaw.bz/assets/img/openclaw-logo.com.png",
  "favicon_url": "https://openclaw.bz/assets/img/favicon.svg",
  "primary_colors": [],
  "social_links": {
    "github": "https://github.com/openclaw/openclaw",
    "discord": "https://discord.gg/openclaw"
  },
  "logo_path": "drafts/openclaw-web-scraping/assets/openclaw.bz/logo.png",
  "favicon_path": "drafts/openclaw-web-scraping/assets/openclaw.bz/favicon.ico"
}

Anti-bot evasion:

The script uses Camoufox browser fingerprinting to appear as a legitimate user:

  • Randomized browser fingerprint
  • Human-like behavior simulation
  • Network idle detection for JS-heavy sites
  • Automatic referer header management

Practical Workflow Examples

Research Article Workflow

Combine web_search and web_fetch for efficient article research:

// 1. Discover sources
web_search(query: "web scraping best practices 2026", count: 10)

// 2. For each result URL, extract content
web_fetch(url: "https://blog.example.com/scraping-guide", extractMode: "markdown")

// 3. For JS-heavy sites, use browser
browser(action: "start", profile: "openclaw")
browser(action: "navigate", url: "https://spa-site.com/article")
browser(action: "snapshot")  // Returns page content

Complete Brand Research Workflow

# 1. Get multiple logo variations for selection
scripts/logo-scraper.sh "OpenClaw" --limit 10 --output ./research/assets/logos

# 2. Scrape official brand data with assets
scripts/brand-scraper.py "https://openclaw.bz" --download --output ./research/assets

# 3. For additional context, fetch the website content
# (This would be done via agent tool)
web_fetch(url: "https://openclaw.bz", maxChars: 10000)

# 4. Search for brand guidelines
web_search(query: "OpenClaw brand guidelines logo", count: 5)

Gathered assets from OpenClaw research:

AssetPathDetails
Logo variationsassets/logo_0.png - logo_4.pngMultiple logo PNGs from image search
Official logoassets/openclaw.bz/logo.pngDownloaded from site
Faviconassets/openclaw.bz/favicon.icoSite favicon
Brand JSONassets/openclaw.bz/brand.jsonStructured brand metadata

Collection of logos scraped from various sources showing OpenClaw branding

Website Testing Workflow

For automated testing of web applications:

// 1. Start browser
browser(action: "start", profile: "chrome")

// 2. Navigate and authenticate
browser(action: "navigate", url: "https://myapp.example.com/login")
browser(action: "snapshot", refs: "aria")

// 3. Fill login form
browser(action: "act", request: { kind: "type", ref: "e3", text: "[email protected]" })
browser(action: "act", request: { kind: "type", ref: "e5", text: "password123" })
browser(action: "act", request: { kind: "click", ref: "e7" })  // Login button

// 4. Wait and test dashboard
browser(action: "navigate", url: "https://myapp.example.com/dashboard")
browser(action: "screenshot", fullPage: true)

// 5. Test specific functionality
browser(action: "act", request: { kind: "click", ref: "e20" })  // Create button
browser(action: "screenshot")

// 6. Clean up
browser(action: "stop")

Screenshot Capture Workflow

For documentation or visual verification:

// Single page screenshot
browser(action: "start", profile: "openclaw")
browser(action: "navigate", url: "https://example.com")
browser(action: "screenshot", type: "png", fullPage: true)
browser(action: "stop")

// Multiple pages
const pages = [
  "https://example.com/home",
  "https://example.com/about",
  "https://example.com/pricing"
]

pages.forEach(url => {
  browser(action: "navigate", url: url)
  browser(action: "screenshot", fullPage: true)
})

Infrastructure Requirements

Services Architecture

ServiceAddressPurpose
SearXNGlocalhost:8888Logo/brand image search
ScraplingN/AStealth browser (Camoufox)
Redisyour-redis-server:6379Caching layer
Qdrantyour-qdrant-server:6333Vector storage

Setting Up SearXNG

# Docker deployment (recommended)
docker run -d --name searxng \
  -p 8888:8080 \
  searxng/searxng:latest

# Verify it's running
curl -s "http://localhost:8888/search?q=test&format=json" | jq '.results | length'

Configuring Brave API

# Via OpenClaw configuration
openclaw configure --section web

# Or via environment variable
export BRAVE_API_KEY="your-api-key"

Installing Scrapling

pip install scrapling

# Verify installation
python -c "from scrapling.fetchers import StealthyFetcher; print('OK')"

Troubleshooting

web_search returns “missing_brave_api_key”

Solution: Configure the Brave API key:

openclaw configure --section web
# Or: export BRAVE_API_KEY="your-key"

logo-scraper returns 0 results

Possible causes and solutions:

  1. SearXNG not running:
curl http://localhost:8888/health
# Start if needed: docker start searxng
  1. Query encoding issues: Avoid special characters in company names

  2. No image results: Try alternative query terms

brand-scraper fails to fetch

Solutions:

  1. Check Scrapling installation:
python -c "from scrapling.fetchers import StealthyFetcher; print('OK')"
  1. Verify site accessibility:
curl -I https://target-site.com
  1. Site blocks stealth browsers: Consider using a proxy for heavily protected sites

browser won’t start

For Chrome profile:

  • Ensure Chrome extension is installed
  • Click the OpenClaw toolbar button on the tab (badge must be ON)
  • Check that no other automation is controlling the browser

For openclaw profile:

  • Check if browser is already running: browser(action: "status")
  • Stop existing session: browser(action: "stop")

Rate Limiting Best Practices

ToolConsideration
web_searchBrave API limits apply—don’t exceed quota
web_fetchRespect robots.txt; add delays between requests
browserMimic human behavior; avoid rapid clicks
logo-scraperBuilt-in delays; SearXNG has its own limits
brand-scraperStealth mode reduces detection; still respect rate limits

When to Use Which Tool

Use web_search when:

  • Discovering URLs for research
  • Finding articles on a topic
  • Quick lookups for documentation

Use web_fetch when:

  • You already have URLs
  • Content is static (no JavaScript)
  • You need fast extraction

Use browser when:

  • JavaScript renders the content
  • Login or form interactions required
  • Screenshots for documentation
  • Testing web applications

Use logo-scraper.sh when:

  • You need multiple logo variations
  • Building a logo collection for a brand
  • Quick bulk downloads from image search

Use brand-scraper.py when:

  • You need complete brand metadata
  • Extract colors, social links, taglines
  • Single authoritative brand asset source

Summary

OpenClaw’s web scraping toolkit covers the full spectrum of needs:

  1. Discoveryweb_search finds relevant URLs
  2. Extractionweb_fetch pulls static content
  3. Interactionbrowser handles complex sites
  4. Brand assetslogo-scraper.sh and brand-scraper.py specialize in visual identity

The modular approach means you use the right tool for each job—fast static extraction with web_fetch, full automation with browser, and specialized brand collection with the custom scripts.

For most research workflows, the pattern is: search → fetch → browser (fallback for JS sites). For brand work: logo-scraper for variations, brand-scraper for official assets.


Questions or feedback? Find me on Discord or check the OpenClaw documentation.

Anthony Lattanzio

Anthony Lattanzio

Tech Enthusiast & Builder

I'm a tech enthusiast who loves building things with hardware and software. By night, I run a homelab that's grown way beyond what any reasonable person needs. Check out about me for more.

Comments

Powered by GitHub Discussions