Site Crawling
Site crawling lets you automatically discover URLs on a website so you can scrape them in bulk. Instead of collecting URLs manually, you configure a discovery method and Meter finds all matching pages.
Why use site crawling?
Traditional scraping requires you to know every URL upfront. Site crawling solves this by:
- Discovering URLs automatically from sitemaps, pagination, or link patterns
- Filtering URLs to target only the pages you need
- Batch execution to scrape hundreds or thousands of pages at once
- Creating schedules to re-crawl and scrape on a recurring basis
Discovery methods
Meter supports three methods for discovering URLs:
- Sitemap: Parse sitemap.xml files to extract all indexed URLs
- Pagination: Generate URLs by incrementing a page number in a template
- Link Pattern: Crawl a site and collect URLs matching a pattern
Sitemap discovery
Best for sites with a sitemap.xml file. Meter parses the sitemap (including nested sitemaps) and extracts all URLs.
Configuration:
- Sitemap URL: The URL to the sitemap.xml file
- URL Pattern (optional): Glob pattern to filter URLs (e.g., products/*/)
- Max URLs: Maximum number of URLs to discover (1-10,000)
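For intuition, here is a minimal sketch of what sitemap discovery amounts to: fetch the sitemap, collect the loc URLs, filter them with a glob pattern, and stop at the limit. It is an illustration only, not Meter's implementation; the requests dependency, the prefix-wildcard glob matching, and the lack of nested-sitemap handling are assumptions of the sketch.

```python
import fnmatch
import xml.etree.ElementTree as ET

import requests


def discover_from_sitemap(sitemap_url, url_pattern=None, max_urls=100):
    """Sketch of sitemap discovery: fetch sitemap.xml, collect <loc> URLs,
    optionally filter them with a glob pattern, and stop at max_urls."""
    response = requests.get(sitemap_url, timeout=30)
    response.raise_for_status()
    root = ET.fromstring(response.content)

    # Sitemap namespaces vary, so match any element whose tag ends in "loc".
    # Note: nested sitemap indexes are not followed in this sketch.
    urls = [el.text.strip() for el in root.iter() if el.tag.endswith("loc") and el.text]

    if url_pattern:
        # Glob filter; the leading "*" lets a pattern like "products/*/"
        # match anywhere in the full URL (an assumption of this sketch).
        urls = [u for u in urls if fnmatch.fnmatch(u, "*" + url_pattern)]

    return urls[:max_urls]


urls = discover_from_sitemap("https://shop.example.com/sitemap.xml",
                             url_pattern="products/*/", max_urls=50)
```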
Pagination discovery
Best for sites with predictable paginated URLs. You provide a URL template with a page number placeholder. Configuration:
- URL Template: URL with {n} placeholder for the page number (e.g., https://shop.com/products?page={n})
- Start Index: First page number (default: 1)
- Step: Increment between pages (default: 1)
- Max Pages: Maximum pages to generate
With the default start index and step, the example template above generates ?page=1, ?page=2, and so on.
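To see what the pagination parameters produce, the short sketch below expands a template locally using the same inputs (template with {n}, start index, step, max pages). It is a conceptual illustration rather than Meter's code.

```python
def generate_paginated_urls(url_template, start_index=1, step=1, max_pages=5):
    """Expand a URL template by substituting page numbers for the {n} placeholder."""
    return [url_template.format(n=start_index + i * step) for i in range(max_pages)]


urls = generate_paginated_urls("https://shop.com/products?page={n}", max_pages=3)
# ['https://shop.com/products?page=1',
#  'https://shop.com/products?page=2',
#  'https://shop.com/products?page=3']
```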
Link pattern discovery
Best for sites without sitemaps or predictable pagination. Meter crawls from a seed URL and collects links matching your pattern. Configuration:
- Seed URL: Starting URL for the crawl
- Link Pattern: Glob pattern for URLs to collect (e.g., /product/*/)
- Navigation Pattern (optional): Pattern for pages to visit during the crawl (e.g., /category/)
- Max Depth: How many links deep to crawl (1-10)
- Max URLs: Maximum URLs to discover
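Conceptually, link pattern discovery is a bounded breadth-first crawl: collect links that match the link pattern, follow links that match the navigation pattern, and stop at the depth and URL limits. The sketch below illustrates that logic only; the requests dependency, the naive regex href extraction, and the glob semantics are assumptions, not Meter's implementation.

```python
import fnmatch
import re
from collections import deque
from urllib.parse import urljoin

import requests


def discover_by_link_pattern(seed_url, link_pattern, nav_pattern=None,
                             max_depth=2, max_urls=100):
    """Sketch of a bounded breadth-first crawl that collects matching links."""
    found, seen = [], {seed_url}
    queue = deque([(seed_url, 0)])

    while queue and len(found) < max_urls:
        page_url, depth = queue.popleft()
        try:
            html = requests.get(page_url, timeout=30).text
        except requests.RequestException:
            continue

        # Naive href extraction; a real crawler would use an HTML parser.
        for href in re.findall(r'href="([^"#]+)"', html):
            url = urljoin(page_url, href)
            if url in seen:
                continue
            seen.add(url)
            if fnmatch.fnmatch(url, "*" + link_pattern):
                found.append(url)               # matches the link pattern: collect it
            elif depth < max_depth and (nav_pattern is None
                                        or fnmatch.fnmatch(url, "*" + nav_pattern + "*")):
                queue.append((url, depth + 1))  # navigation page: keep crawling

    return found[:max_urls]
```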
How site crawling works
1. Configure: Choose a discovery method and set parameters
2. Discover: Meter crawls and finds matching URLs
3. Review: Check the discovered URLs and adjust if needed
4. Execute: Run a one-time batch scrape or create a recurring schedule
Execution options
After discovering URLs, you can:
One-time execution
Create scrape jobs for all discovered URLs immediately. Each URL becomes a separate job that runs through your chosen strategy.
- Jobs are created with a shared batch_id for tracking
- URL filtering with regex is supported
- Set a maximum number of URLs to process
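In plain terms, these options amount to a small pre-processing step before jobs are created: apply the regex filter, cap the URL count, and tag every job with a shared batch_id. The sketch below illustrates that logic; the job dictionary shape is hypothetical, not Meter's actual job format.

```python
import re
import uuid


def build_batch_jobs(discovered_urls, url_regex=None, max_urls=None):
    """Illustrative pre-processing: regex filter, URL cap, and shared batch_id."""
    urls = discovered_urls
    if url_regex:
        urls = [u for u in urls if re.search(url_regex, u)]
    if max_urls is not None:
        urls = urls[:max_urls]

    batch_id = str(uuid.uuid4())  # one id shared by every job in the batch
    # The job shape below is hypothetical; it only shows the grouping idea.
    return [{"url": u, "batch_id": batch_id} for u in urls]


jobs = build_batch_jobs(
    ["https://shop.com/products/a/", "https://shop.com/about/"],
    url_regex=r"/products/",
    max_urls=100,
)
```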
Scheduled execution
Create a recurring schedule that re-runs the scrape on a regular basis.
- Interval-based: Run every N hours/days (e.g., every 24 hours)
- Cron-based: Run on a cron schedule (e.g., 0 9 * * * for 9 AM daily)
- Webhook notifications: Get notified when scrapes complete
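As a rough illustration, an interval schedule simply means "last run plus N hours", while the cron example 0 9 * * * reads as minute 0, hour 9, every day. The schedule dictionaries below are hypothetical shapes for illustration, not Meter's request format.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical schedule shapes; field names are illustrative, not Meter's API.
interval_schedule = {"every_hours": 24,
                     "webhook_url": "https://example.com/hooks/scrape-done"}
cron_schedule = {"cron": "0 9 * * *"}  # minute 0, hour 9, every day: 9 AM daily


def next_interval_run(last_run: datetime, every_hours: int) -> datetime:
    """Next run for an interval-based schedule: last run plus N hours."""
    return last_run + timedelta(hours=every_hours)


print(next_interval_run(datetime.now(timezone.utc), interval_schedule["every_hours"]))
```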
Best practices
Start with sitemaps when available
Sitemaps are the fastest and most reliable discovery method. Check if your target site has one at /sitemap.xml or in robots.txt before trying other methods.
Use URL patterns to filter results
Most sitemaps include URLs you don’t need (about pages, terms of service, etc.). Use URL patterns to filter down to just the pages you want to scrape. For example, products/*/ matches product pages while excluding other site content.
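One quick way to sanity-check a pattern before a full crawl is to test it against a few URLs you already know. The snippet below uses Python's fnmatch for glob matching; Meter's exact pattern semantics may differ, so treat it as an approximation.

```python
from fnmatch import fnmatch

pattern = "*products/*/"  # leading * so the glob can match against the full URL
for url in [
    "https://shop.example.com/products/blue-widget/",
    "https://shop.example.com/terms-of-service/",
]:
    print(url, "->", "keep" if fnmatch(url, pattern) else "skip")
```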
Test with small limits first
Start with a low max_urls limit (e.g., 10-50) to verify your configuration before running a full crawl. This saves time and resources.
Match strategies to discovered URLs
Make sure your extraction strategy works with the pages you’re discovering. If you’re crawling product pages, use a strategy created from a product page.
Limits
| Parameter | Limit |
|---|---|
| Max URLs per discovery | 10,000 |
| Max crawl depth (link pattern) | 10 |
| Max pages (pagination) | 1,000 |