Site Crawling Guide

Learn how to discover URLs on a website and scrape them in bulk using Meter’s site crawling feature.

Overview

Site crawling automates the process of finding URLs to scrape. Instead of manually collecting URLs, you configure how to discover them and Meter does the rest. Use site crawling when:
  • You need to scrape many pages from the same site
  • URLs follow a pattern (sitemaps, pagination, link structure)
  • You want to create recurring scrapes of discovered URLs

Prerequisites

Before you start:
  • Create a Meter account at meter.sh
  • Have a strategy ready for the pages you want to scrape
  • Know how URLs are organized on your target site

Choosing a discovery method

Method        Best for                     Example
Sitemap       Sites with sitemap.xml       E-commerce product catalogs
Pagination    Predictable page URLs        Search results, listings
Link Pattern  Crawling by following links  News articles, blog posts

Method 1: Sitemap discovery

Sitemaps are the fastest and most reliable discovery method.

Step 1: Find the sitemap

Most sites have a sitemap at /sitemap.xml or listed in robots.txt:
# Check common locations
curl https://example.com/sitemap.xml
curl https://example.com/robots.txt | grep -i sitemap

Step 2: Start discovery

  1. Go to the Dashboard and click Discover URLs
  2. Select Sitemap as the discovery method
  3. Enter the sitemap URL (e.g., https://shop.com/sitemap.xml)
  4. Optionally add a URL pattern to filter results
  5. Click Start Discovery
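You can also start the same discovery from the API. A minimal sketch, mirroring the request shape used in the complete example at the end of this guide (url_pattern is optional):
curl -X POST https://api.meter.sh/discover \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "discovery": {
      "method": "sitemap",
      "sitemap_url": "https://shop.com/sitemap.xml",
      "url_pattern": "products/*/"
    }
  }'
The response includes a discovery_id, which you poll in the next step.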

Step 3: Poll for results

Discovery runs asynchronously. Poll until status is completed:
curl https://api.meter.sh/discover/550e8400-e29b-41d4-a716-446655440000 \
  -H "Authorization: Bearer sk_live_..."
Response when complete:
{
  "discovery_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "total_urls": 847,
  "sample_urls": [
    "https://shop.com/products/widget-a",
    "https://shop.com/products/widget-b",
    "..."
  ]
}
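Rather than sleeping for a fixed interval, you can poll in a loop. A minimal bash sketch (requires jq; the failed status value is an assumption, since only completed appears in this guide):
DISCOVERY_ID=550e8400-e29b-41d4-a716-446655440000
while true; do
  STATUS=$(curl -s https://api.meter.sh/discover/$DISCOVERY_ID \
    -H "Authorization: Bearer sk_live_..." | jq -r '.status')
  echo "status: $STATUS"
  [ "$STATUS" = "completed" ] && break
  [ "$STATUS" = "failed" ] && { echo "discovery failed" >&2; exit 1; }  # assumed status value
  sleep 5  # brief pause between checks
done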

Method 2: Pagination discovery

Use this when URLs follow a numbered pattern.

Configuration

Parameter     Description                Example
url_template  URL with {n} placeholder   https://shop.com/products?page={n}
start_index   First page number          1
step          Increment between pages    1
max_pages     Maximum pages to generate  100

Example

  1. Select Pagination as the discovery method
  2. Enter URL template: https://shop.com/search?page={n}
  3. Set start index, step, and max pages
  4. Click Start Discovery
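The equivalent API call is sketched below, assuming the pagination parameters from the table above sit inside the same discovery object the sitemap method uses (this guide only shows the request shape for sitemaps, so the field placement is an assumption):
curl -X POST https://api.meter.sh/discover \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "discovery": {
      "method": "pagination",
      "url_template": "https://shop.com/search?page={n}",
      "start_index": 1,
      "step": 1,
      "max_pages": 100
    }
  }'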
Method 3: Link pattern discovery

Use this to crawl a site and collect URLs matching a pattern.

Configuration

Parameter           Description                  Example
seed_url            Starting URL                 https://news.com
link_pattern        Pattern to match (glob)      /article/*/
navigation_pattern  Pages to visit during crawl  /category/
max_depth           How deep to crawl            2
max_urls            Maximum URLs to collect      500

Example

  1. Select Link Pattern as the discovery method
  2. Enter seed URL: https://news.com
  3. Enter link pattern: /article/*/
  4. Set max depth and max URLs
  5. Click Start Discovery
The navigation pattern defines which pages to visit during the crawl. The link pattern defines which URLs to collect. They work together: Meter visits navigation pages to find links matching your collection pattern.
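To make that interplay concrete, here is a hedged API sketch using the configuration values from the table above; the method name link_pattern and the field placement mirror the sitemap request and are assumptions:
curl -X POST https://api.meter.sh/discover \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "discovery": {
      "method": "link_pattern",
      "seed_url": "https://news.com",
      "link_pattern": "/article/*/",
      "navigation_pattern": "/category/",
      "max_depth": 2,
      "max_urls": 500
    }
  }'
In this sketch, Meter would visit pages under /category/ and collect any links that match /article/*/.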

Executing discovered URLs

Once discovery completes, you can execute immediately or create a schedule.

One-time execution

Scrape all discovered URLs immediately:
  1. Review the discovered URLs
  2. Select your extraction strategy
  3. Set maximum URLs to process
  4. Click Execute Now
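From the API, this corresponds to the execute endpoint shown in the complete example below; replace YOUR_STRATEGY_ID with the ID of a previously created strategy:
curl -X POST https://api.meter.sh/discover/550e8400-e29b-41d4-a716-446655440000/execute \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "strategy_id": "YOUR_STRATEGY_ID",
    "max_urls": 100
  }'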

Create a schedule

Set up recurring scrapes:
  1. Click Create Schedule
  2. Choose interval (e.g., every 24 hours) or cron expression
  3. Optionally add a webhook URL
  4. Click Create Scheduled Job
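Cron expressions use the standard five fields (minute, hour, day of month, month, day of week). A few common examples:
0 */6 * * *   # every 6 hours, on the hour
30 2 * * *    # daily at 02:30
0 9 * * 1     # Mondays at 09:00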

Filtering URLs

Use regex patterns to filter discovered URLs:
# Only URLs containing "widget"
"url_filter": ".*widget.*"

# Only product pages with numeric IDs
"url_filter": "/products/\\d+"

# Exclude certain paths
"url_filter": "^(?!.*/archive/).*$"

Complete example

Here’s a full workflow for scraping a product catalog:
# 1. Create a strategy for product pages
STRATEGY=$(curl -X POST https://api.meter.sh/strategies \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://shop.com/products/sample",
    "description": "Extract product name, price, and description",
    "name": "Shop Products"
  }' | jq -r '.strategy_id')

# 2. Start sitemap discovery
DISCOVERY=$(curl -X POST https://api.meter.sh/discover \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "discovery": {
      "method": "sitemap",
      "sitemap_url": "https://shop.com/sitemap.xml",
      "url_pattern": "products/*/"
    }
  }' | jq -r '.discovery_id')

# 3. Wait for discovery to complete (fixed wait for brevity; polling the status endpoint is more robust)
sleep 30

# 4. Check status
curl https://api.meter.sh/discover/$DISCOVERY \
  -H "Authorization: Bearer sk_live_..."

# 5. Execute with the strategy
curl -X POST https://api.meter.sh/discover/$DISCOVERY/execute \
  -H "Authorization: Bearer sk_live_..." \
  -H "Content-Type: application/json" \
  -d "{
    \"strategy_id\": \"$STRATEGY\",
    \"max_urls\": 100
  }"

Troubleshooting

Sitemap not found

Solutions:
  • Check robots.txt for the sitemap location
  • Try common paths: /sitemap.xml, /sitemap_index.xml, /sitemap/sitemap.xml
  • Some sites use dynamic sitemaps; check the page source for sitemap links

No URLs discovered

Causes:
  • URL pattern too restrictive
  • Sitemap is empty or blocked
  • Link pattern doesn’t match any URLs
Solutions:
  • Remove or broaden the URL pattern
  • Verify the sitemap loads in a browser
  • Test your link pattern against sample URLs

Discovery takes too long

Cause: Large sitemaps or deep crawls take time.

Solutions:
  • Reduce max_urls or max_depth
  • Use URL patterns to target specific sections
  • For very large sites, run multiple smaller discoveries

Too many irrelevant URLs

Solutions:
  • Add a URL pattern to filter during discovery
  • Use url_filter regex when executing
  • For link pattern crawls, be more specific with patterns

Need help?

Email me at [email protected]