Site Crawling Guide

Learn how to discover URLs on a website and scrape them in bulk using Meter’s site crawling feature.

Overview

Site crawling automates the process of finding URLs to scrape. Instead of manually collecting URLs, you configure how to discover them and Meter does the rest. Use site crawling when:
  • You need to scrape many pages from the same site
  • URLs follow a pattern (sitemaps, pagination, link structure)
  • You want to create recurring scrapes of discovered URLs

Prerequisites

Before you start:
  • Create a Meter account at meter.sh
  • Have a strategy ready for the pages you want to scrape
  • Know how URLs are organized on your target site

Choosing a discovery method

Method        Best for                     Example
Sitemap       Sites with sitemap.xml       E-commerce product catalogs
Pagination    Predictable page URLs        Search results, listings
Link Pattern  Crawling by following links  News articles, blog posts

Method 1: Sitemap discovery

Sitemaps are the fastest and most reliable discovery method.

Step 1: Find the sitemap

Most sites have a sitemap at /sitemap.xml or listed in robots.txt:
# Check common locations
curl https://example.com/sitemap.xml
curl https://example.com/robots.txt | grep -i sitemap

Step 2: Start discovery

  1. Go to the Dashboard and click Discover URLs
  2. Select Sitemap as the discovery method
  3. Enter the sitemap URL (e.g., https://shop.com/sitemap.xml)
  4. Optionally add a URL pattern to filter results
  5. Click Start Discovery
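You can also start the same discovery from the API. A minimal sketch, mirroring the request shape used in the complete example at the end of this guide (url_pattern is optional):
curl -X POST https://api.meter.sh/discover \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "discovery": {
      "method": "sitemap",
      "sitemap_url": "https://shop.com/sitemap.xml",
      "url_pattern": "products/*/"
    }
  }'
The response includes a discovery_id, which you poll in the next step.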

Step 3: Poll for results

Discovery runs asynchronously. Poll until status is completed:
curl https://api.meter.sh/discover/550e8400-e29b-41d4-a716-446655440000 \
  -H "Authorization: Bearer sk_live_..."
Response when complete:
{
  "discovery_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "total_urls": 847,
  "sample_urls": [
    "https://shop.com/products/widget-a",
    "https://shop.com/products/widget-b",
    "..."
  ]
}
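Rather than sleeping for a fixed interval, you can poll in a loop. A minimal bash sketch (requires jq; the failed status value is an assumption, since only completed appears in this guide):
DISCOVERY_ID=550e8400-e29b-41d4-a716-446655440000
while true; do
  STATUS=$(curl -s https://api.meter.sh/discover/$DISCOVERY_ID \
    -H "Authorization: Bearer sk_live_..." | jq -r '.status')
  echo "status: $STATUS"
  [ "$STATUS" = "completed" ] && break
  [ "$STATUS" = "failed" ] && { echo "discovery failed" >&2; exit 1; }  # assumed status value
  sleep 5  # brief pause between checks
done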

Method 2: Pagination discovery

Use this when URLs follow a numbered pattern.

Configuration

Parameter     Description                Example
url_template  URL with {n} placeholder   https://shop.com/products?page={n}
start_index   First page number          1
step          Increment between pages    1
max_pages     Maximum pages to generate  100

Example

  1. Select Pagination as the discovery method
  2. Enter URL template: https://shop.com/search?page={n}
  3. Set start index, step, and max pages
  4. Click Start Discovery
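The equivalent API call is sketched below, assuming the pagination parameters from the table above sit inside the same discovery object the sitemap method uses (this guide only shows the request shape for sitemaps, so the field placement is an assumption):
curl -X POST https://api.meter.sh/discover \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "discovery": {
      "method": "pagination",
      "url_template": "https://shop.com/search?page={n}",
      "start_index": 1,
      "step": 1,
      "max_pages": 100
    }
  }'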
Method 3: Link pattern discovery

Use this to crawl a site and collect URLs matching a pattern.

Configuration

Parameter           Description                  Example
seed_url            Starting URL                 https://news.com
link_pattern        Pattern to match (glob)      /article/*/
navigation_pattern  Pages to visit during crawl  /category/
max_depth           How deep to crawl            2
max_urls            Maximum URLs to collect      500

Example

  1. Select Link Pattern as the discovery method
  2. Enter seed URL: https://news.com
  3. Enter link pattern: /article/*/
  4. Set max depth and max URLs
  5. Click Start Discovery
The navigation pattern defines which pages to visit during the crawl. The link pattern defines which URLs to collect. They work together: Meter visits navigation pages to find links matching your collection pattern.
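To make that interplay concrete, here is a hedged API sketch using the configuration values from the table above; the method name link_pattern and the field placement mirror the sitemap request and are assumptions:
curl -X POST https://api.meter.sh/discover \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "discovery": {
      "method": "link_pattern",
      "seed_url": "https://news.com",
      "link_pattern": "/article/*/",
      "navigation_pattern": "/category/",
      "max_depth": 2,
      "max_urls": 500
    }
  }'
In this sketch, Meter would visit pages under /category/ and collect any links that match /article/*/.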

Executing discovered URLs

Once discovery completes, you can execute immediately or create a schedule.

One-time execution

Scrape all discovered URLs immediately:
  1. Review the discovered URLs
  2. Select your extraction strategy
  3. Set maximum URLs to process
  4. Click Execute Now
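From the API, this corresponds to the execute endpoint shown in the complete example below; replace YOUR_STRATEGY_ID with the ID of a previously created strategy:
curl -X POST https://api.meter.sh/discover/550e8400-e29b-41d4-a716-446655440000/execute \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "strategy_id": "YOUR_STRATEGY_ID",
    "max_urls": 100
  }'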

Create a schedule

Set up recurring scrapes:
  1. Click Create Schedule
  2. Choose interval (e.g., every 24 hours) or cron expression
  3. Optionally add a webhook URL
  4. Click Create Scheduled Job
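Cron expressions use the standard five fields (minute, hour, day of month, month, day of week). A few common examples:
0 */6 * * *   # every 6 hours, on the hour
30 2 * * *    # daily at 02:30
0 9 * * 1     # Mondays at 09:00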

Filtering URLs

Use regex patterns to filter discovered URLs:
# Only URLs containing "widget"
"url_filter": ".*widget.*"

# Only product pages with numeric IDs
"url_filter": "/products/\\d+"

# Exclude certain paths
"url_filter": "^(?!.*/archive/).*$"

Complete example

Here’s a full workflow for scraping a product catalog:
# 1. Create a strategy for product pages
STRATEGY=$(curl -X POST https://api.meter.sh/strategies \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://shop.com/products/sample",
    "description": "Extract product name, price, and description",
    "name": "Shop Products"
  }' | jq -r '.strategy_id')

# 2. Start sitemap discovery
DISCOVERY=$(curl -X POST https://api.meter.sh/discover \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "discovery": {
      "method": "sitemap",
      "sitemap_url": "https://shop.com/sitemap.xml",
      "url_pattern": "products/*/"
    }
  }' | jq -r '.discovery_id')

# 3. Wait for discovery to complete (fixed wait for brevity; polling the status endpoint is more robust)
sleep 30

# 4. Check status
curl https://api.meter.sh/discover/$DISCOVERY \
  -H "Authorization: Bearer sk_live_..."

# 5. Execute with the strategy
curl -X POST https://api.meter.sh/discover/$DISCOVERY/execute \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d "{
    \"strategy_id\": \"$STRATEGY\",
    \"max_urls\": 100
  }"

Troubleshooting

Sitemap not found

Solutions:
  • Check robots.txt for the sitemap location
  • Try common paths: /sitemap.xml, /sitemap_index.xml, /sitemap/sitemap.xml
  • Some sites use dynamic sitemaps; check the page source for sitemap links

No URLs discovered

Causes:
  • URL pattern too restrictive
  • Sitemap is empty or blocked
  • Link pattern doesn’t match any URLs
Solutions:
  • Remove or broaden the URL pattern
  • Verify the sitemap loads in a browser
  • Test your link pattern against sample URLs

Discovery takes too long

Cause: Large sitemaps or deep crawls take time.

Solutions:
  • Reduce max_urls or max_depth
  • Use URL patterns to target specific sections
  • For very large sites, run multiple smaller discoveries

Too many irrelevant URLs

Solutions:
  • Add a URL pattern to filter during discovery
  • Use url_filter regex when executing
  • For link pattern crawls, be more specific with patterns

Need help?

Email me at [email protected]