Workflows
A workflow chains multiple scraping strategies into a directed acyclic graph (DAG), where the output of one scraper feeds into the next. This lets you build multi-step pipelines like “scrape an index page for links, then scrape each detail page.”

When to use workflows
Use workflows when:

- You need to scrape pages discovered by a previous scrape
- Data extraction requires multiple stages (index -> detail -> sub-detail)
- You want to filter results between stages
- Different pages need different extraction strategies
Use a simple scheduled scrape instead when:

- You’re scraping a single URL or static list of URLs
- All pages use the same extraction strategy
- There’s no dependency between scrapes
Key concepts
Nodes
Each node in a workflow represents a scraping operation using a specific strategy. Nodes have one of three input types:

| Input Type | Description | Use Case |
|---|---|---|
| `static_urls` | Fixed list of URLs | Starting nodes (index pages, sitemaps) |
| `upstream_urls` | Extract URLs from a field in upstream results | Following links to detail pages |
| `upstream_data` | Pass full upstream results as context | Parameter mapping between stages |
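As illustration only, the three input types could be expressed as shapes like these; the field names below are assumptions for the sketch, not the SDK's documented schema:

```python
# Illustrative input shapes for the three node input types
# (field names are assumptions, not the SDK's documented schema).
static = {"type": "static_urls", "urls": ["https://example.com/sitemap.xml"]}
from_links = {"type": "upstream_urls", "field": "url"}  # follow a URL field in upstream results
from_data = {"type": "upstream_data"}                   # receive full upstream results as context
```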
Edges
Edges connect nodes and define data flow. Each edge goes from a source node to a target node. Edges can optionally include filters that control which upstream results are passed downstream.

Filters
Filters let you selectively pass data between nodes. For example, only follow links that contain “article”, or skip items where the price is below a threshold.

| Method | Description |
|---|---|
| `Filter.contains(field, value)` | Field contains substring |
| `Filter.not_contains(field, value)` | Field does not contain substring |
| `Filter.equals(field, value)` | Exact match |
| `Filter.not_equals(field, value)` | Not exact match |
| `Filter.regex_match(field, pattern)` | Regex match |
| `Filter.exists(field)` | Field exists and is non-empty |
| `Filter.not_exists(field)` | Field is missing or empty |
| `Filter.gt(field, value)` | Greater than |
| `Filter.lt(field, value)` | Less than |
| `Filter.all(*conditions)` | AND — all conditions must match |
| `Filter.any(*conditions)` | OR — at least one condition must match |
Text-matching filters accept a case_sensitive parameter (default: False).
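To make the semantics concrete, here is a minimal reimplementation of a few of the documented filter methods applied to sample records. This is for illustration only; the SDK ships its own Filter class, and only the method names and behaviors in the table above are taken from the documentation:

```python
# Minimal reimplementation of a few documented filter semantics,
# for illustration only -- the SDK provides its own Filter class.
class Filter:
    @staticmethod
    def contains(field, value, case_sensitive=False):
        def check(item):
            text = str(item.get(field, ""))
            if case_sensitive:
                return value in text
            return value.lower() in text.lower()
        return check

    @staticmethod
    def gt(field, value):
        # True only if the field exists and compares greater than value.
        return lambda item: field in item and item[field] > value

    @staticmethod
    def all(*conditions):
        # AND: every condition must match.
        return lambda item: all(c(item) for c in conditions)

# Keep only article links priced above a threshold.
keep = Filter.all(
    Filter.contains("url", "article"),
    Filter.gt("price", 10),
)

items = [
    {"url": "https://example.com/article/1", "price": 25},
    {"url": "https://example.com/about", "price": 99},
    {"url": "https://example.com/article/2", "price": 5},
]
passed = [i for i in items if keep(i)]  # only the first item survives
```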
How workflows execute
1. Root nodes execute first using their static URLs
2. Results flow through edges, optionally filtered
3. Downstream nodes receive URLs or data from upstream
4. This continues until all leaf nodes complete
5. Final results are collected from leaf nodes, grouped by URL
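The steps above amount to a standard topological traversal of the DAG: a node becomes runnable once every node upstream of it has completed. A sketch of that ordering (not the SDK's actual scheduler):

```python
from collections import deque

# Topological traversal sketch: roots run first, then nodes whose
# upstream dependencies have all completed.
nodes = ["index", "detail", "pricing"]
edges = [("index", "detail"), ("detail", "pricing")]

indegree = {n: 0 for n in nodes}
children = {n: [] for n in nodes}
for src, dst in edges:
    indegree[dst] += 1
    children[src].append(dst)

order = []
ready = deque(n for n in nodes if indegree[n] == 0)  # root nodes
while ready:
    node = ready.popleft()
    order.append(node)  # "execute" this node
    for child in children[node]:
        indegree[child] -= 1
        if indegree[child] == 0:  # all upstream nodes finished
            ready.append(child)
# order == ["index", "detail", "pricing"]
```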
Building workflows
Basic chain (A -> B)
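A minimal sketch of this pattern, written as plain dicts because the SDK's exact constructors aren't shown here; the strategy names and field names are illustrative assumptions:

```python
# Two-node chain (illustrative shapes, not the SDK's documented API):
# "index" scrapes a fixed URL; "detail" follows each link "index" extracts.
workflow = {
    "nodes": {
        "index": {
            "strategy": "link_list",  # assumed strategy name
            "input": {"type": "static_urls",
                      "urls": ["https://example.com/articles"]},
        },
        "detail": {
            "strategy": "article",    # assumed strategy name
            "input": {"type": "upstream_urls", "field": "url"},
        },
    },
    "edges": [{"source": "index", "target": "detail"}],
}
```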
Scrape an index page, then follow each link to a detail page.

Fan-out (A -> B, C, D)
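A fan-out can be sketched with the same illustrative shapes: one upstream node, three downstream nodes, three edges (node and field names are assumptions):

```python
# One index node feeding three downstream scrapers (illustrative shapes).
workflow = {
    "nodes": {
        "index":   {"input": {"type": "static_urls",
                              "urls": ["https://example.com/products"]}},
        "reviews": {"input": {"type": "upstream_urls", "field": "reviews_url"}},
        "pricing": {"input": {"type": "upstream_urls", "field": "pricing_url"}},
        "specs":   {"input": {"type": "upstream_urls", "field": "specs_url"}},
    },
    # One edge per downstream target, all sharing the same source.
    "edges": [{"source": "index", "target": t}
              for t in ("reviews", "pricing", "specs")],
}
```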
One source feeding multiple downstream scrapers.

Filtered pipeline
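The predicate below stands in for an edge filter equivalent to the documented `Filter.contains("url", "article")`, to show the effect: only matching upstream results reach the downstream node.

```python
# Stand-in for an edge filter equivalent to Filter.contains("url", "article"):
# only results whose URL contains "article" flow downstream.
def article_only(item):
    return "article" in item.get("url", "").lower()

upstream_results = [
    {"url": "https://example.com/article/ai-news"},
    {"url": "https://example.com/jobs"},
]
passed_downstream = [r for r in upstream_results if article_only(r)]
```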
Only follow links that match a condition.

Multi-stage chain (A -> B -> C)
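A three-stage chain extends the basic pattern with one more hop, for example category index to listing pages to product pages (shapes and field names are illustrative assumptions):

```python
# Three-stage chain: categories -> listings -> products (illustrative shapes).
workflow = {
    "nodes": {
        "categories": {"input": {"type": "static_urls",
                                 "urls": ["https://example.com/categories"]}},
        "listings":   {"input": {"type": "upstream_urls", "field": "category_url"}},
        "products":   {"input": {"type": "upstream_urls", "field": "product_url"}},
    },
    "edges": [
        {"source": "categories", "target": "listings"},
        {"source": "listings", "target": "products"},
    ],
}
```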
Change detection
Workflows support change detection through two mechanisms:

- trigger_on_change_only: When set on an edge, downstream nodes only execute if the upstream results have changed since the last run
- force: When running a workflow with force=True, change detection is skipped and all nodes re-execute
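As a sketch, the edge option and run flag might be used together like this; the dict field names are assumptions, only `trigger_on_change_only` and `force` come from the documentation above:

```python
# Illustrative edge and run options (dict shapes are assumptions).
edge = {
    "source": "index",
    "target": "detail",
    "trigger_on_change_only": True,  # "detail" runs only if "index" results changed
}
run_options = {"force": False}       # force=True would skip change detection entirely
```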
Scheduling workflows
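A schedule attached to a workflow might carry fields like these; the identifier format and the use of cron syntax are assumptions for the sketch, not confirmed SDK behavior:

```python
# Hypothetical schedule record for a workflow (all field names assumed).
schedule = {
    "workflow_id": "wf_products",  # assumed identifier format
    "cron": "0 6 * * *",           # standard five-field cron: daily at 06:00
    "enabled": True,
}
```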
Workflows can be scheduled to run automatically, just like single-strategy schedules.

Parameters
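A sketch of a node handing parameters to its strategy; the parameter names and the `params` key are illustrative assumptions:

```python
# Hypothetical node passing parameters through to its strategy
# (the "params" key and parameter names are illustrative assumptions).
node = {
    "strategy": "article",
    "input": {"type": "upstream_urls", "field": "url"},
    "params": {"max_pages": 3, "timeout_s": 30},
}
```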
Nodes can pass parameters to their strategies.

Next steps
Python SDK Reference
Complete workflow class and method documentation
REST API Reference
Workflow endpoints in the REST API
Strategies
Learn about the extraction strategies workflows use
Schedules
Compare with simple scheduled scrapes