Site Crawling

Site crawling lets you automatically discover URLs on a website so you can scrape them in bulk. Instead of manually collecting URLs, you configure a discovery method and Meter finds all matching pages.

Why use site crawling?

Traditional scraping requires you to know every URL upfront. Site crawling solves this by:
  • Discovering URLs automatically from sitemaps, pagination, or link patterns
  • Filtering URLs to target only the pages you need
  • Scraping hundreds or thousands of pages at once with batch execution
  • Creating schedules to re-crawl and scrape on a recurring basis

Discovery methods

Meter supports three methods for discovering URLs:

  • Sitemap: Parse sitemap.xml files to extract all indexed URLs
  • Pagination: Generate URLs by incrementing a page number in a template
  • Link Pattern: Crawl a site and collect URLs matching a pattern

Sitemap discovery

Best for sites with a sitemap.xml file. Meter parses the sitemap (including nested sitemaps) and extracts all URLs. Configuration:
  • Sitemap URL: The URL to the sitemap.xml file
  • URL Pattern (optional): Glob pattern to filter URLs (e.g., products/*/)
  • Max URLs: Maximum number of URLs to discover (1-10,000)
Example use case: Scraping all product pages from an e-commerce site that maintains a sitemap.
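
To make the parameters concrete, here is a standalone Python sketch of roughly what sitemap discovery does. It is illustrative only: Meter also follows nested sitemaps, which this sketch does not, and the helper name and shop.example.com URL are made up.

```python
import fnmatch
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def discover_from_sitemap(sitemap_url, url_pattern=None, max_urls=500):
    """Fetch a sitemap.xml and return its URLs, optionally glob-filtered."""
    with urllib.request.urlopen(sitemap_url) as resp:
        root = ET.fromstring(resp.read())
    urls = [loc.text for loc in root.findall(".//sm:url/sm:loc", SITEMAP_NS)]
    if url_pattern:
        # fnmatch's * also matches "/", so the pattern can sit anywhere in the URL.
        urls = [u for u in urls if fnmatch.fnmatch(u, f"*{url_pattern}")]
    return urls[:max_urls]

# Hypothetical site; arguments mirror the configuration fields above.
urls = discover_from_sitemap(
    "https://shop.example.com/sitemap.xml",
    url_pattern="products/*/",
    max_urls=50,
)
```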

Pagination discovery

Best for sites with predictable paginated URLs. You provide a URL template with a page number placeholder. Configuration (a code sketch follows the list):
  • URL Template: URL with {n} placeholder for page number (e.g., https://shop.com/products?page={n})
  • Start Index: First page number (default: 1)
  • Step: Increment between pages (default: 1)
  • Max Pages: Maximum pages to generate
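
Pagination discovery is simple template expansion. The sketch below shows the equivalent logic in Python, with defaults mirroring the parameters above; the paginate helper is illustrative, not part of Meter's API.

```python
def paginate(url_template, start_index=1, step=1, max_pages=10):
    """Expand a {n} template into a list of page URLs."""
    return [url_template.format(n=start_index + i * step) for i in range(max_pages)]

urls = paginate("https://shop.com/products?page={n}", max_pages=3)
# ['https://shop.com/products?page=1',
#  'https://shop.com/products?page=2',
#  'https://shop.com/products?page=3']
```
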
Example use case: Scraping all pages of search results or product listings where URLs follow a pattern like ?page=1, ?page=2, etc.

Link Pattern discovery

Best for sites without sitemaps or predictable pagination. Meter crawls from a seed URL and collects links matching your pattern. Configuration:
  • Seed URL: Starting URL for the crawl
  • Link Pattern: Glob pattern for URLs to collect (e.g., /product/*/)
  • Navigation Pattern (optional): Pattern for pages to visit during crawl (e.g., /category/)
  • Max Depth: How many links deep to crawl (1-10)
  • Max URLs: Maximum URLs to discover
Example use case: Discovering all article pages on a news site by crawling category pages and collecting article links.
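
For intuition, here is a deliberately simplified breadth-first crawler in standard-library Python that mirrors the collect-versus-navigate split described above. It is a toy sketch, not Meter's implementation: a real crawler also needs rate limiting, robots.txt handling, and retries.

```python
import fnmatch
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(seed_url, link_pattern, nav_pattern=None, max_depth=2, max_urls=100):
    """Breadth-first crawl: visit navigation pages, collect matching links."""
    found, seen = [], {seed_url}
    frontier = [(seed_url, 0)]
    while frontier and len(found) < max_urls:
        page, depth = frontier.pop(0)
        try:
            with urllib.request.urlopen(page) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue  # skip pages that fail to load
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            url = urljoin(page, href)
            if url in seen:
                continue
            seen.add(url)
            if fnmatch.fnmatch(url, f"*{link_pattern}*"):
                found.append(url)                  # target page: collect it
            elif depth < max_depth and (
                nav_pattern is None or fnmatch.fnmatch(url, f"*{nav_pattern}*")
            ):
                frontier.append((url, depth + 1))  # navigation page: keep crawling
    return found[:max_urls]

article_urls = crawl(
    "https://news.example.com/",   # hypothetical seed URL
    link_pattern="/article/*/",
    nav_pattern="/category/",
    max_depth=2,
)
```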

How site crawling works

  1. Configure: Choose a discovery method and set parameters
  2. Discover: Meter runs the chosen discovery method and collects matching URLs
  3. Review: Check the discovered URLs and adjust if needed
  4. Execute: Run a one-time batch scrape or create a recurring schedule
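
In code, the four steps might look like the sketch below. Every name in it (the meter package, Client, discoveries.create, batches.create) is an assumption for illustration, not a documented interface.

```python
# Purely hypothetical client code -- every name below is an assumption.
import meter  # assumed SDK package name

client = meter.Client(api_key="YOUR_API_KEY")

# 1. Configure: choose a discovery method and set parameters.
discovery = client.discoveries.create(
    method="sitemap",
    sitemap_url="https://shop.example.com/sitemap.xml",
    url_pattern="products/*/",
    max_urls=50,
)

# 2. Discover: wait for the matching URLs.
urls = discovery.result()

# 3. Review: spot-check before committing to a full run.
for url in urls[:10]:
    print(url)

# 4. Execute: one-time batch scrape (or create a schedule instead).
batch = client.batches.create(urls=urls, strategy="my-product-strategy")
```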

Execution options

After discovering URLs, you can:

One-time execution

Create scrape jobs for all discovered URLs immediately. Each URL becomes a separate job that runs through your chosen strategy.
  • Jobs are created with a shared batch_id for tracking
  • An optional regex filters which URLs become jobs
  • An optional limit caps the number of URLs processed
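
The regex filter and URL cap are easy to reason about locally before executing. A small sketch with made-up URLs and a made-up pattern:

```python
import re
import uuid

# Illustrative URLs from a discovery run.
urls = [
    "https://shop.example.com/product/101",
    "https://shop.example.com/product/202",
    "https://shop.example.com/about",
]

# Mirror the execution-time regex filter and URL cap locally.
pattern = re.compile(r"/product/\d+")
filtered = [u for u in urls if pattern.search(u)][:100]

# All jobs from one execution share a batch_id; one is minted here
# only to illustrate the grouping.
batch_id = str(uuid.uuid4())
jobs = [{"url": u, "batch_id": batch_id} for u in filtered]
print(len(jobs), "jobs in batch", batch_id)
```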

Scheduled execution

Create a recurring schedule that re-runs the scrape on a regular basis.
  • Interval-based: Run every N hours/days (e.g., every 24 hours)
  • Cron-based: Run on a cron schedule (e.g., 0 9 * * * for 9 AM daily)
  • Webhook notifications: Get notified when scrapes complete
Schedules store a copy of the discovered URLs. If you need to update the URL list, create a new schedule from a fresh discovery.
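
As a sketch, the two schedule flavors might be expressed like this; the field names are illustrative assumptions, not Meter's actual schema.

```python
# Illustrative payloads only -- field names are assumptions.
interval_schedule = {
    "type": "interval",
    "every_hours": 24,            # re-run every 24 hours
    "webhook_url": "https://example.com/hooks/scrape-done",
}

cron_schedule = {
    "type": "cron",
    "expression": "0 9 * * *",    # minute=0, hour=9 -> 9:00 AM daily
    "webhook_url": "https://example.com/hooks/scrape-done",
}
```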

Best practices

Sitemaps are the fastest and most reliable discovery method. Check if your target site has one at /sitemap.xml or in robots.txt before trying other methods.
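
You can check for a sitemap programmatically. This standalone snippet reads the Sitemap: lines from a site's robots.txt (the site URL is hypothetical):

```python
import urllib.request

def find_sitemaps(site):
    """Return sitemap URLs declared in a site's robots.txt, if any."""
    robots_url = site.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(robots_url) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        return []
    return [
        line.split(":", 1)[1].strip()
        for line in body.splitlines()
        if line.lower().startswith("sitemap:")
    ]

print(find_sitemaps("https://shop.example.com"))  # hypothetical site
```
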
Most sitemaps include URLs you don’t need (about pages, terms of service, etc.). Use URL patterns to filter down to just the pages you want to scrape. For example, products/*/ matches product pages while excluding other site content.
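
For a feel of the glob semantics, here Python's fnmatch stands in for Meter's matcher (which may differ in details):

```python
import fnmatch

urls = [
    "https://shop.example.com/products/blue-widget/",
    "https://shop.example.com/terms-of-service/",
]
wanted = [u for u in urls if fnmatch.fnmatch(u, "*products/*/")]
# wanted == ["https://shop.example.com/products/blue-widget/"]
```
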
Start with a low max_urls limit (e.g., 10-50) to verify your configuration before running a full crawl. This saves time and resources.

Make sure your extraction strategy works with the pages you’re discovering. If you’re crawling product pages, use a strategy created from a product page.

Limits

  • Max URLs per discovery: 10,000
  • Max crawl depth (link pattern): 10
  • Max pages (pagination): 1,000

Next steps