Site Crawling Guide
Learn how to discover URLs on a website and scrape them in bulk using Meter's site crawling feature.

Overview
Site crawling automates the process of finding URLs to scrape. Instead of manually collecting URLs, you configure how to discover them and Meter does the rest. Use site crawling when:
- You need to scrape many pages from the same site
- URLs follow a pattern (sitemaps, pagination, link structure)
- You want to create recurring scrapes of discovered URLs
Prerequisites
Before you start:
- Create a Meter account at meter.sh
- Have a strategy ready for the pages you want to scrape
- Know how URLs are organized on your target site
Choosing a discovery method
| Method | Best for | Example |
|---|---|---|
| Sitemap | Sites with sitemap.xml | E-commerce product catalogs |
| Pagination | Predictable page URLs | Search results, listings |
| Link Pattern | Crawling by following links | News articles, blog posts |
Method 1: Sitemap discovery
Sitemaps are the fastest and most reliable discovery method.

Step 1: Find the sitemap
Most sites have a sitemap at /sitemap.xml or list its location in robots.txt:
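For example, a robots.txt body can be scanned for Sitemap directives, which is the standard way sites advertise sitemap locations:

```python
# Parse a robots.txt body for "Sitemap:" directives (case-insensitive).
def find_sitemaps(robots_txt):
    sitemaps = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so URLs (which contain ":") stay intact.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps

robots = "User-agent: *\nDisallow: /admin/\nSitemap: https://shop.com/sitemap.xml"
find_sitemaps(robots)  # ['https://shop.com/sitemap.xml']
```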
Step 2: Start discovery
- Go to the Dashboard and click Discover URLs
- Select Sitemap as the discovery method
- Enter the sitemap URL (e.g., https://shop.com/sitemap.xml)
- Optionally add a URL pattern to filter results
- Click Start Discovery
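Over the REST API, the same job can be started with a single POST. A minimal Python sketch: the base URL, endpoint path, and field names (method, sitemap_url, url_pattern) are assumptions here, so check the Discovery API reference for the real request shape.

```python
import json
import urllib.request

API_BASE = "https://api.meter.sh/v1"  # hypothetical base URL

def sitemap_discovery_request(sitemap_url, url_pattern=None):
    # Build (but do not send) a discovery request with hypothetical field names.
    payload = {"method": "sitemap", "sitemap_url": sitemap_url}
    if url_pattern:
        payload["url_pattern"] = url_pattern  # optional filter on discovered URLs
    return urllib.request.Request(
        f"{API_BASE}/discoveries",
        data=json.dumps(payload).encode(),
        headers={"Authorization": "Bearer YOUR_API_KEY",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = sitemap_discovery_request("https://shop.com/sitemap.xml", url_pattern="/products/")
# urllib.request.urlopen(req) would submit the job
```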
Step 3: Poll for results
Discovery runs asynchronously. Poll until the job status is completed:
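A polling sketch, assuming the job record exposes a status field with in-progress, completed, and failed states. The fetch_status callable stands in for a GET against the discovery's (hypothetical) status endpoint:

```python
import time

def wait_for_discovery(fetch_status, interval=2.0, timeout=300.0):
    # Poll until the job reports "completed", raising on failure or timeout.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_status()
        if job["status"] == "completed":
            return job
        if job["status"] == "failed":
            raise RuntimeError(job.get("error", "discovery failed"))
        time.sleep(interval)
    raise TimeoutError("discovery did not complete in time")
```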
Method 2: Pagination discovery
Use this when URLs follow a numbered pattern.

Configuration
| Parameter | Description | Example |
|---|---|---|
| url_template | URL with {n} placeholder | https://shop.com/products?page={n} |
| start_index | First page number | 1 |
| step | Increment between pages | 1 |
| max_pages | Maximum pages to generate | 100 |
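The parameters above expand into a URL list. A local sketch of what pagination discovery generates:

```python
# Expand a pagination template into concrete page URLs.
def paginate(url_template, start_index=1, step=1, max_pages=100):
    return [url_template.format(n=start_index + i * step) for i in range(max_pages)]

paginate("https://shop.com/products?page={n}", start_index=1, step=1, max_pages=3)
# ['https://shop.com/products?page=1', 'https://shop.com/products?page=2', 'https://shop.com/products?page=3']
```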
Example
In the Dashboard:
- Select Pagination as the discovery method
- Enter the URL template: https://shop.com/search?page={n}
- Set the start index, step, and max pages
- Click Start Discovery
Method 3: Link pattern discovery
Use this to crawl a site and collect URLs matching a pattern.

Configuration
| Parameter | Description | Example |
|---|---|---|
| seed_url | Starting URL | https://news.com |
| link_pattern | Pattern to match (glob) | /article/*/ |
| navigation_pattern | Pages to visit during crawl | /category/ |
| max_depth | How deep to crawl | 2 |
| max_urls | Maximum URLs to collect | 500 |
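A local sketch of how the two glob patterns divide the crawl's work, using Python's fnmatch: URLs matching link_pattern are collected, URLs matching navigation_pattern are only followed. The /category/* navigation pattern is a slight widening of the /category/ example above, since fnmatch needs a wildcard to match sub-pages.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def classify(url, link_pattern="/article/*/", navigation_pattern="/category/*"):
    # Match globs against the URL path only.  Note that fnmatch's "*"
    # also crosses "/" separators, unlike shell path globbing.
    path = urlparse(url).path
    if fnmatch(path, link_pattern):
        return "collect"    # scrape this page
    if fnmatch(path, navigation_pattern):
        return "navigate"   # visit to find more links, but don't scrape
    return "skip"
```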
Example
In the Dashboard:
- Select Link Pattern as the discovery method
- Enter the seed URL: https://news.com
- Enter the link pattern: /article/*/
- Set the max depth and max URLs
- Click Start Discovery
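The crawl itself can be pictured as a depth-limited breadth-first walk. A sketch over an in-memory link graph, honoring max_depth and max_urls; the links dict stands in for fetching pages and extracting their links:

```python
from collections import deque

def crawl(seed_url, links, matches, max_depth=2, max_urls=500):
    # Breadth-first walk from seed_url; `matches` decides which URLs to collect.
    collected, seen = [], {seed_url}
    queue = deque([(seed_url, 0)])
    while queue and len(collected) < max_urls:
        url, depth = queue.popleft()
        if matches(url):
            collected.append(url)
        if depth < max_depth:          # stop following links past max_depth
            for nxt in links.get(url, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return collected
```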
Executing discovered URLs
Once discovery completes, you can execute immediately or create a schedule.

One-time execution
Scrape all discovered URLs immediately. In the Dashboard:
- Review the discovered URLs
- Select your extraction strategy
- Set maximum URLs to process
- Click Execute Now
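Over the REST API this maps to a request body like the following sketch. The discovery_id and strategy_id field names are assumptions; max_urls and url_filter mirror the options above:

```python
def execution_payload(discovery_id, strategy_id, max_urls=None, url_filter=None):
    # Build a hypothetical "execute now" request body.
    payload = {"discovery_id": discovery_id, "strategy_id": strategy_id}
    if max_urls is not None:
        payload["max_urls"] = max_urls      # cap how many URLs are scraped
    if url_filter is not None:
        payload["url_filter"] = url_filter  # regex applied before execution
    return payload

execution_payload("disc_123", "strat_456", max_urls=100)
# {'discovery_id': 'disc_123', 'strategy_id': 'strat_456', 'max_urls': 100}
```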
Create a schedule
Set up recurring scrapes. In the Dashboard:
- Click Create Schedule
- Choose interval (e.g., every 24 hours) or cron expression
- Optionally add a webhook URL
- Click Create Scheduled Job
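A sketch of the equivalent request body. All field names here are assumptions; per the steps above, exactly one of the interval or cron options should be set:

```python
def schedule_payload(discovery_id, strategy_id, interval_hours=None,
                     cron=None, webhook_url=None):
    # Require exactly one scheduling mode: a fixed interval or a cron expression.
    if (interval_hours is None) == (cron is None):
        raise ValueError("set exactly one of interval_hours or cron")
    payload = {"discovery_id": discovery_id, "strategy_id": strategy_id}
    if interval_hours is not None:
        payload["interval_hours"] = interval_hours
    else:
        payload["cron"] = cron              # e.g. "0 6 * * *" = daily at 06:00
    if webhook_url:
        payload["webhook_url"] = webhook_url
    return payload
```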
Filtering URLs
Use regex patterns to filter discovered URLs:
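A local sketch of what a regex filter keeps; the url_filter option mentioned under Troubleshooting applies the same idea at execution time:

```python
import re

def filter_urls(urls, pattern):
    # Keep only URLs where the regex matches anywhere in the string.
    rx = re.compile(pattern)
    return [u for u in urls if rx.search(u)]

filter_urls(
    ["https://shop.com/products/42", "https://shop.com/cart", "https://shop.com/products/7"],
    r"/products/\d+",
)
# ['https://shop.com/products/42', 'https://shop.com/products/7']
```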
Complete example
Here's a full workflow for scraping a product catalog:
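A condensed sketch of that workflow: discover product URLs from a sitemap, poll until the job finishes, then execute with a strategy. The client object stands in for the HTTP layer, and the endpoint paths and field names are illustrative, not the documented API:

```python
import time

def scrape_catalog(client, sitemap_url, strategy_id, max_urls=200):
    # 1. Start sitemap discovery, filtered to product pages.
    job = client.post("/discoveries", {
        "method": "sitemap",
        "sitemap_url": sitemap_url,
        "url_pattern": "/products/",
    })
    # 2. Poll until the discovery job finishes.
    while job["status"] not in ("completed", "failed"):
        time.sleep(2)
        job = client.get(f"/discoveries/{job['id']}")
    if job["status"] == "failed":
        raise RuntimeError("discovery failed")
    # 3. Execute the discovered URLs with an extraction strategy.
    return client.post(f"/discoveries/{job['id']}/execute", {
        "strategy_id": strategy_id,
        "max_urls": max_urls,
    })
```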
Troubleshooting

Sitemap not found
Solutions:
- Check robots.txt for the sitemap location
- Try common paths: /sitemap.xml, /sitemap_index.xml, /sitemap/sitemap.xml
- Some sites use dynamic sitemaps - check the page source for sitemap links
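The common paths above can be built into a candidate list to probe. A small sketch:

```python
from urllib.parse import urljoin

COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap/sitemap.xml"]

def sitemap_candidates(site):
    # Resolve each common path against the site's origin.
    return [urljoin(site, p) for p in COMMON_PATHS]

sitemap_candidates("https://shop.com")
# ['https://shop.com/sitemap.xml', 'https://shop.com/sitemap_index.xml', 'https://shop.com/sitemap/sitemap.xml']
```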
No URLs discovered
Causes:
- URL pattern too restrictive
- Sitemap is empty or blocked
- Link pattern doesn’t match any URLs
Solutions:
- Remove or broaden the URL pattern
- Verify the sitemap loads in a browser
- Test your link pattern against sample URLs
Discovery times out
Cause: Large sitemaps or deep crawls take time.

Solutions:
- Reduce max_urls or max_depth
- Use URL patterns to target specific sections
- For very large sites, run multiple smaller discoveries
Too many unwanted URLs
Solutions:
- Add a URL pattern to filter during discovery
- Use a url_filter regex when executing
- For link pattern crawls, be more specific with patterns
Next steps
Site Crawling Concepts
Understand how site crawling works
Discovery API Reference
View all discovery endpoints
Webhooks
Get notified when scrapes complete
Change Detection
Track changes between scrapes