Site Crawling

Site crawling lets you automatically discover URLs on a website so you can scrape them in bulk. Instead of manually collecting URLs, you configure a discovery method and Meter finds all matching pages.

Why use site crawling?

Traditional scraping requires you to know every URL upfront. Site crawling solves this by:
  • Discovering URLs automatically from sitemaps, pagination, or link patterns
  • Filtering URLs to target only the pages you need
  • Scraping hundreds or thousands of pages at once with batch execution
  • Creating schedules to re-crawl and scrape on a recurring basis

Discovery methods

Meter supports three methods for discovering URLs:

  • Sitemap: Parse sitemap.xml files to extract all indexed URLs
  • Pagination: Generate URLs by incrementing a page number in a template
  • Link Pattern: Crawl a site and collect URLs matching a pattern

Sitemap discovery

Best for sites with a sitemap.xml file. Meter parses the sitemap (including nested sitemaps) and extracts all URLs. Configuration:
  • Sitemap URL: The URL to the sitemap.xml file
  • URL Pattern (optional): Glob pattern to filter URLs (e.g., products/*/)
  • Max URLs: Maximum number of URLs to discover (1-10,000)
Example use case: Scraping all product pages from an e-commerce site that maintains a sitemap.
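
To make the parameters concrete, here is a standalone Python sketch of roughly what sitemap discovery does. It is illustrative only: Meter also follows nested sitemaps, which this sketch does not, and the helper name and shop.example.com URL are made up.

```python
import fnmatch
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def discover_from_sitemap(sitemap_url, url_pattern=None, max_urls=500):
    """Fetch a sitemap.xml and return its URLs, optionally glob-filtered."""
    with urllib.request.urlopen(sitemap_url) as resp:
        root = ET.fromstring(resp.read())
    urls = [loc.text for loc in root.findall(".//sm:url/sm:loc", SITEMAP_NS)]
    if url_pattern:
        # fnmatch's * also matches "/", so the pattern can sit anywhere in the URL.
        urls = [u for u in urls if fnmatch.fnmatch(u, f"*{url_pattern}")]
    return urls[:max_urls]

# Hypothetical site; arguments mirror the configuration fields above.
urls = discover_from_sitemap(
    "https://shop.example.com/sitemap.xml",
    url_pattern="products/*/",
    max_urls=50,
)
```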

Pagination discovery

Best for sites with predictable paginated URLs. You provide a URL template with a page number placeholder. Configuration (a code sketch follows the list):
  • URL Template: URL with {n} placeholder for page number (e.g., https://shop.com/products?page={n})
  • Start Index: First page number (default: 1)
  • Step: Increment between pages (default: 1)
  • Max Pages: Maximum pages to generate
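
Pagination discovery is simple template expansion. The sketch below shows the equivalent logic in Python, with defaults mirroring the parameters above; the paginate helper is illustrative, not part of Meter's API.

```python
def paginate(url_template, start_index=1, step=1, max_pages=10):
    """Expand a {n} template into a list of page URLs."""
    return [url_template.format(n=start_index + i * step) for i in range(max_pages)]

urls = paginate("https://shop.com/products?page={n}", max_pages=3)
# ['https://shop.com/products?page=1',
#  'https://shop.com/products?page=2',
#  'https://shop.com/products?page=3']
```
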
Example use case: Scraping all pages of search results or product listings where URLs follow a pattern like ?page=1, ?page=2, etc.

Link Pattern discovery

Best for sites without sitemaps or predictable pagination. Meter crawls from a seed URL and collects links matching your pattern. Configuration:
  • Seed URL: Starting URL for the crawl
  • Link Pattern: Glob pattern for URLs to collect (e.g., /product/*/)
  • Navigation Pattern (optional): Pattern for pages to visit during crawl (e.g., /category/)
  • Max Depth: How many links deep to crawl (1-10)
  • Max URLs: Maximum URLs to discover
Example use case: Discovering all article pages on a news site by crawling category pages and collecting article links.
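
For intuition, here is a deliberately simplified breadth-first crawler in standard-library Python that mirrors the collect-versus-navigate split described above. It is a toy sketch, not Meter's implementation: a real crawler also needs rate limiting, robots.txt handling, and retries.

```python
import fnmatch
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(seed_url, link_pattern, nav_pattern=None, max_depth=2, max_urls=100):
    """Breadth-first crawl: visit navigation pages, collect matching links."""
    found, seen = [], {seed_url}
    frontier = [(seed_url, 0)]
    while frontier and len(found) < max_urls:
        page, depth = frontier.pop(0)
        try:
            with urllib.request.urlopen(page) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue  # skip pages that fail to load
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            url = urljoin(page, href)
            if url in seen:
                continue
            seen.add(url)
            if fnmatch.fnmatch(url, f"*{link_pattern}*"):
                found.append(url)                  # target page: collect it
            elif depth < max_depth and (
                nav_pattern is None or fnmatch.fnmatch(url, f"*{nav_pattern}*")
            ):
                frontier.append((url, depth + 1))  # navigation page: keep crawling
    return found[:max_urls]

article_urls = crawl(
    "https://news.example.com/",   # hypothetical seed URL
    link_pattern="/article/*/",
    nav_pattern="/category/",
    max_depth=2,
)
```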

How site crawling works

  1. Configure: Choose a discovery method and set parameters
  2. Discover: Meter runs the chosen discovery method and collects matching URLs
  3. Review: Check the discovered URLs and adjust if needed
  4. Execute: Run a one-time batch scrape or create a recurring schedule
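
In code, the four steps might look like the sketch below. Every name in it (the meter package, Client, discoveries.create, batches.create) is an assumption for illustration, not a documented interface.

```python
# Purely hypothetical client code -- every name below is an assumption.
import meter  # assumed SDK package name

client = meter.Client(api_key="YOUR_API_KEY")

# 1. Configure: choose a discovery method and set parameters.
discovery = client.discoveries.create(
    method="sitemap",
    sitemap_url="https://shop.example.com/sitemap.xml",
    url_pattern="products/*/",
    max_urls=50,
)

# 2. Discover: wait for the matching URLs.
urls = discovery.result()

# 3. Review: spot-check before committing to a full run.
for url in urls[:10]:
    print(url)

# 4. Execute: one-time batch scrape (or create a schedule instead).
batch = client.batches.create(urls=urls, strategy="my-product-strategy")
```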

Execution options

After discovering URLs, you can:

One-time execution

Create scrape jobs for all discovered URLs immediately. Each URL becomes a separate job that runs through your chosen strategy.
  • Jobs are created with a shared batch_id for tracking
  • An optional regex filters which URLs become jobs
  • An optional limit caps the number of URLs processed
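
The regex filter and URL cap are easy to reason about locally before executing. A small sketch with made-up URLs and a made-up pattern:

```python
import re
import uuid

# Illustrative URLs from a discovery run.
urls = [
    "https://shop.example.com/product/101",
    "https://shop.example.com/product/202",
    "https://shop.example.com/about",
]

# Mirror the execution-time regex filter and URL cap locally.
pattern = re.compile(r"/product/\d+")
filtered = [u for u in urls if pattern.search(u)][:100]

# All jobs from one execution share a batch_id; one is minted here
# only to illustrate the grouping.
batch_id = str(uuid.uuid4())
jobs = [{"url": u, "batch_id": batch_id} for u in filtered]
print(len(jobs), "jobs in batch", batch_id)
```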

Scheduled execution

Create a recurring schedule that re-runs the scrape on a regular basis.
  • Interval-based: Run every N hours/days (e.g., every 24 hours)
  • Cron-based: Run on a cron schedule (e.g., 0 9 * * * for 9 AM daily)
  • Webhook notifications: Get notified when scrapes complete
Schedules store a copy of the discovered URLs. If you need to update the URL list, create a new schedule from a fresh discovery.
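
As a sketch, the two schedule flavors might be expressed like this; the field names are illustrative assumptions, not Meter's actual schema.

```python
# Illustrative payloads only -- field names are assumptions.
interval_schedule = {
    "type": "interval",
    "every_hours": 24,            # re-run every 24 hours
    "webhook_url": "https://example.com/hooks/scrape-done",
}

cron_schedule = {
    "type": "cron",
    "expression": "0 9 * * *",    # minute=0, hour=9 -> 9:00 AM daily
    "webhook_url": "https://example.com/hooks/scrape-done",
}
```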

Best practices

Sitemaps are the fastest and most reliable discovery method. Check if your target site has one at /sitemap.xml or in robots.txt before trying other methods.
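
You can check for a sitemap programmatically. This standalone snippet reads the Sitemap: lines from a site's robots.txt (the site URL is hypothetical):

```python
import urllib.request

def find_sitemaps(site):
    """Return sitemap URLs declared in a site's robots.txt, if any."""
    robots_url = site.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(robots_url) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        return []
    return [
        line.split(":", 1)[1].strip()
        for line in body.splitlines()
        if line.lower().startswith("sitemap:")
    ]

print(find_sitemaps("https://shop.example.com"))  # hypothetical site
```
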
Most sitemaps include URLs you don’t need (about pages, terms of service, etc.). Use URL patterns to filter down to just the pages you want to scrape. For example, products/*/ matches product pages while excluding other site content.
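
For a feel of the glob semantics, here Python's fnmatch stands in for Meter's matcher (which may differ in details):

```python
import fnmatch

urls = [
    "https://shop.example.com/products/blue-widget/",
    "https://shop.example.com/terms-of-service/",
]
wanted = [u for u in urls if fnmatch.fnmatch(u, "*products/*/")]
# wanted == ["https://shop.example.com/products/blue-widget/"]
```
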
Start with a low max_urls limit (e.g., 10-50) to verify your configuration before running a full crawl. This saves time and resources.

Make sure your extraction strategy works with the pages you’re discovering. If you’re crawling product pages, use a strategy created from a product page.

Limits

  • Max URLs per discovery: 10,000
  • Max crawl depth (link pattern): 10
  • Max pages (pagination): 1,000

Next steps