Change Detection

Meter’s change detection system identifies when scraped content has actually changed, filtering out layout updates, ads, and timestamps that don’t represent meaningful updates.

Why change detection matters

Traditional scraping wastes resources by re-processing unchanged data. For RAG systems, this means:
  • Wasted embeddings: Re-embedding identical content
  • Timestamp noise: Updates triggered by irrelevant date changes
  • Layout noise: Reacting to CSS class or ad changes
  • Higher costs: Unnecessary API calls and storage
Meter addresses this by comparing the extracted content and its structure rather than the raw page, so only meaningful changes are detected.

How it works

Meter generates multiple signatures for each scrape job:

Content Hash

A hash of the extracted data itself. Changes only if the actual content changes.

Structural Signature

A fingerprint of the content structure and patterns. Detects additions, removals, and reordering.

Content hash

The content hash is a cryptographic hash of the extracted data:
job = client.get_job(job_id)
print(job['content_hash'])  # e.g., "7f3d9a2b4c1e..."
The hash changes when:
  • Text content is different
  • Prices, numbers, or values change
  • New items appear or old ones disappear
  • Item order changes significantly
It does not change for:
  • CSS classes or styling
  • Ad content (if not part of extraction)
  • Timestamps (if not extracted)
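
To make the idea concrete, here is a minimal sketch of a hash computed over extracted data only. This page does not specify Meter's actual algorithm, so SHA-256 over canonical JSON and the `content_hash` helper name are assumptions chosen for illustration.

```python
import hashlib
import json


def content_hash(results):
    """Hash extracted data only; markup and styling never enter the hash."""
    # Canonical JSON: sorted keys and fixed separators, so dict key order
    # and whitespace cannot change the digest.
    canonical = json.dumps(
        results, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


before = [{"name": "Widget", "price": "$19.99"}]
same_data = [{"price": "$19.99", "name": "Widget"}]  # same data, different key order
new_price = [{"name": "Widget", "price": "$17.99"}]

print(content_hash(before) == content_hash(same_data))  # True
print(content_hash(before) == content_hash(new_price))  # False
```

Because only extracted fields are serialized, a field left out of the extraction (an ad slot or a timestamp, say) cannot affect the result.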

Structural signature

The structural signature captures patterns in the data:
job = client.get_job(job_id)
print(job['structural_signature'])  # Structural fingerprint dict
This detects:
  • Number of items changing
  • Field presence/absence
  • Data type changes
  • List length changes
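
The following sketch shows one way a structural fingerprint could be built from item counts, field presence, and value types. It is an illustration under assumptions, not Meter's implementation: the `structural_signature` function and the keys of the returned dict are hypothetical, and this simplified version ignores item order.

```python
from collections import Counter


def structural_signature(results):
    """Summarize the shape of extracted items, ignoring their values."""
    field_counts = Counter()
    field_types = {}
    for item in results:
        for field, value in item.items():
            field_counts[field] += 1
            field_types.setdefault(field, set()).add(type(value).__name__)
    return {
        "item_count": len(results),
        "field_counts": dict(field_counts),
        "field_types": {f: sorted(t) for f, t in field_types.items()},
    }


old = [{"name": "Widget", "price": 19.99}, {"name": "Gadget", "price": 5.0}]
new_values = [{"name": "Widget", "price": 17.99}, {"name": "Gadget", "price": 5.0}]
new_shape = [{"name": "Widget", "price": "17.99"}]  # one item fewer, price is now a string

print(structural_signature(old) == structural_signature(new_values))  # True
print(structural_signature(old) == structural_signature(new_shape))   # False
```

A price change alone leaves this fingerprint untouched, while a dropped item or a changed data type alters it.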

Comparing jobs

Automatic comparison (schedules)

Schedules automatically compare new jobs with previous ones:
# Create schedule
schedule = client.create_schedule(
    strategy_id=strategy_id,
    url="https://example.com/products",
    interval_seconds=3600
)

# Check for changes
changes = client.get_schedule_changes(schedule['id'])

if changes['count'] > 0:
    print(f"Detected {changes['count']} jobs with changes")
Only jobs where content actually changed are returned.

Manual comparison

Compare two specific jobs:
comparison = client.compare_jobs(job_id_1, job_id_2)

print(f"Content hash match: {comparison['content_hash_match']}")
print(f"Structural match: {comparison['structural_match']}")

if not comparison['content_hash_match']:
    print("Content has changed!")
Use manual comparison to build custom change detection logic or investigate specific changes.
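
As one example of custom logic, the two match flags can be combined into a coarse label. This is a sketch: the `classify_change` helper and its label names are hypothetical, and reading "same structure, different hash" as a value-only change is an interpretation rather than documented behavior.

```python
def classify_change(comparison):
    """Map the two match flags from a job comparison to a coarse label."""
    content_same = comparison["content_hash_match"]
    structure_same = comparison["structural_match"]
    if content_same and structure_same:
        return "unchanged"
    if structure_same:
        # Same shape, different hash: values changed inside existing items.
        return "values_changed"
    # Items or fields were added, removed, or changed type.
    return "structure_changed"


# Stand-in for the dict returned by client.compare_jobs(job_id_1, job_id_2)
comparison = {"content_hash_match": False, "structural_match": True}
print(classify_change(comparison))  # values_changed
```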

Change detection strategies

Pull-based monitoring

Poll for changes periodically:
# Check every hour for changes
import time

while True:
    changes = client.get_schedule_changes(
        schedule_id,
        mark_seen=True
    )

    if changes['count'] > 0:
        print(f"Processing {changes['count']} changes")
        for change in changes['changes']:
            process_change(change['results'])

    time.sleep(3600)  # Wait 1 hour
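
A long-running poller usually needs to survive a failed request. The sketch below wraps a single polling pass in a function that takes the fetch and handling steps as callables, so it can be exercised without a live client. `poll_once` and its parameters are hypothetical names, not part of the Meter client.

```python
def poll_once(fetch_changes, handle_results):
    """Run one polling pass and return the number of changed jobs handled."""
    try:
        changes = fetch_changes()
    except Exception as exc:
        # A failed request should not end the monitoring loop.
        print(f"Polling failed, retrying next cycle: {exc}")
        return 0

    handled = 0
    for change in changes["changes"]:
        handle_results(change["results"])
        handled += 1
    return handled


# Stand-in for client.get_schedule_changes(schedule_id, mark_seen=True)
fake_response = {"count": 1, "changes": [{"results": [{"name": "Widget"}]}]}
seen = []
handled = poll_once(lambda: fake_response, seen.append)
print(handled)  # 1
```

In a real loop, `fetch_changes` would be a small lambda around `client.get_schedule_changes`, and `poll_once` would be called once per sleep cycle.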

Webhook-based monitoring

Receive immediate notifications:
# Create schedule with webhook
schedule = client.create_schedule(
    strategy_id=strategy_id,
    url="https://example.com/products",
    interval_seconds=3600,
    webhook_url="https://your-app.com/webhooks/meter"
)

# Webhook endpoint (FastAPI example)
from fastapi import FastAPI

app = FastAPI()

@app.post("/webhooks/meter")
async def handle_webhook(payload: dict):
    if payload['has_changes']:
        # Process only changed content
        results = payload['results']
        await update_vector_db(results)

    return {"status": "ok"}

Use cases

Only re-embed when content changes:
changes = client.get_schedule_changes(schedule_id)

for change in changes['changes']:
    # Get current embeddings for this URL
    existing_vectors = vector_db.get(url=change['url'])

    # Delete old vectors
    vector_db.delete(existing_vectors.ids)

    # Generate new embeddings only for changed content
    new_vectors = embed(change['results'])
    vector_db.upsert(new_vectors)
Savings: Up to 95% reduction in embedding costs when most scheduled scrapes return unchanged content
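
When a changed page contains many items and only a few of them differ, the work can be narrowed further by diffing at the item level. The sketch below assumes you keep the previous job's results yourself. `item_key` and `diff_items` are hypothetical helpers, and per-item SHA-256 hashing is an illustrative choice.

```python
import hashlib
import json


def item_key(item):
    """Stable hash of a single extracted item."""
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_items(old_results, new_results):
    """Return (to_embed, to_delete): new or changed items, and hashes of vanished items."""
    old_keys = {item_key(item) for item in old_results}
    new_by_key = {item_key(item): item for item in new_results}
    to_embed = [item for key, item in new_by_key.items() if key not in old_keys]
    to_delete = sorted(old_keys - new_by_key.keys())
    return to_embed, to_delete


old = [{"name": "Widget", "price": "$19.99"}, {"name": "Gadget", "price": "$5.00"}]
new = [{"name": "Widget", "price": "$19.99"}, {"name": "Gadget", "price": "$4.50"}]

to_embed, to_delete = diff_items(old, new)
print(to_embed)        # [{'name': 'Gadget', 'price': '$4.50'}]
print(len(to_delete))  # 1
```

Storing each vector under its item hash would let `to_delete` double as the list of vector IDs to remove. That storage layout is an assumption about your vector database, not something Meter provides.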

Alert only on actual price changes:
changes = client.get_schedule_changes(schedule_id)

for change in changes['changes']:
    for product in change['results']:
        current_price = float(product['price'].replace('$', ''))

        if current_price < price_threshold:
            send_alert(f"{product['name']} dropped to ${current_price}!")
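
Scraped price strings often carry thousands separators, or contain no number at all (for example "Call for price"). A slightly more defensive parser is sketched below. The `parse_price` helper is hypothetical and assumes US-style formatting, with a comma as the thousands separator and a dot as the decimal point.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(text):
    """Return the price as a Decimal, or None if no number can be parsed."""
    # Drop currency symbols, thousands separators, and whitespace.
    cleaned = re.sub(r"[^\d.]", "", text)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


print(parse_price("$1,299.00"))       # 1299.00
print(parse_price("Call for price"))  # None
```

Using `Decimal` instead of `float` avoids binary rounding surprises when prices are compared against a threshold.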

Track when content was last updated:
changes = client.get_schedule_changes(schedule_id)

for change in changes['changes']:
    # Update "last modified" timestamp
    db.update(
        url=change['url'],
        last_modified=change['completed_at']
    )

Filtering noise

Meter’s change detection automatically filters:
  • Layout changes: CSS classes, div structure changes
  • Ad rotations: If ads aren’t part of your extraction strategy
  • Timestamps: If not included in extraction fields
  • Order changes: Minor reordering that doesn’t affect content
To further filter noise in your extraction:

Focus extractions

Be specific about what you extract:
# Don't extract dynamic timestamps
result = client.generate_strategy(
    url="https://example.com",
    description="Extract article title and content only, ignore publish date"
)

# Extract only static product info
result = client.generate_strategy(
    url="https://shop.com/product/123",
    description="Extract product name, price, and description. Ignore related products and ads."
)

Compare strategically

Only compare the fields that matter:
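def meaningful_change(job1, job2):
    """Check if price or availability changed; ignore descriptions."""
    # Items are paired by position, so a different item count is itself a change
    if len(job1['results']) != len(job2['results']):
        return True
    for item1, item2 in zip(job1['results'], job2['results']):
        if item1['price'] != item2['price']:
            return True
        if item1['in_stock'] != item2['in_stock']:
            return True
    return False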
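
Pairing items by position treats a pure reordering as a change. If each item carries a unique identifier, matching on that identifier avoids the problem. The sketch below assumes `name` is unique per item. The `meaningful_change_by_key` function and its `key` and `fields` parameters are hypothetical, and it takes result lists rather than whole job dicts.

```python
def meaningful_change_by_key(old_results, new_results, key="name",
                             fields=("price", "in_stock")):
    """Compare selected fields of items matched by a unique key, ignoring order."""
    old_by_key = {item[key]: item for item in old_results}
    new_by_key = {item[key]: item for item in new_results}
    if old_by_key.keys() != new_by_key.keys():
        return True  # an item was added or removed
    return any(
        old_by_key[k].get(f) != new_by_key[k].get(f)
        for k in old_by_key
        for f in fields
    )


old = [
    {"name": "Widget", "price": "$19.99", "in_stock": True},
    {"name": "Gadget", "price": "$5.00", "in_stock": True},
]
reordered = list(reversed(old))
repriced = [dict(old[0], price="$17.99"), old[1]]

print(meaningful_change_by_key(old, reordered))  # False
print(meaningful_change_by_key(old, repriced))   # True
```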

Roadmap: Semantic similarity

Coming soon: Semantic similarity detection using embeddings to detect meaning-level changes even when wording differs.
Future versions will include:
  • Semantic comparison of text content
  • Paraphrase detection
  • Meaning-level change scoring
This will enable even smarter filtering: “Product is now on sale” vs. “Item currently discounted” would be detected as semantically identical.

Best practices

Avoid duplicate processing by marking changes as seen:
# Always use mark_seen=True in production
changes = client.get_schedule_changes(
    schedule_id,
    mark_seen=True
)

# Only use mark_seen=False for previewing
preview = client.get_schedule_changes(
    schedule_id,
    mark_seen=False
)

Handle the case where a scrape detects no changes:
changes = client.get_schedule_changes(schedule_id)

if changes['count'] == 0:
    print("No changes detected - content is up to date")
else:
    process_changes(changes['changes'])

Log each check so you can track when changes are detected:
import logging

logger = logging.getLogger(__name__)

changes = client.get_schedule_changes(schedule_id)

logger.info(f"Checked schedule {schedule_id}: {changes['count']} changes")

for change in changes['changes']:
    logger.info(
        f"Job {change['job_id']}: "
        f"{change['item_count']} items, "
        f"content_hash={change['content_hash']}"
    )

Troubleshooting

Problem: Changes detected for minor updates
Solutions:
  • Make extraction more specific (exclude dynamic elements)
  • Regenerate strategy with clearer description
  • Implement custom filtering logic on top of Meter’s detection

Problem: Actual changes aren’t detected
Possible causes:
  • Changes already marked as seen
  • Looking at wrong schedule
  • Strategy extraction failing
Solutions:
  • Use mark_seen=False to check without affecting state
  • Verify schedule ID
  • Check recent jobs for failures: client.list_jobs(status='failed')

Problem: You want to understand why a change was detected
Solution: Compare jobs manually:
# Get last two jobs
jobs = client.list_jobs(strategy_id=strategy_id, limit=2)

if len(jobs) >= 2:
    comparison = client.compare_jobs(jobs[0]['id'], jobs[1]['id'])

    print(f"Content hash match: {comparison['content_hash_match']}")
    print(f"Structural match: {comparison['structural_match']}")

    if 'changes' in comparison:
        for change in comparison['changes']:
            print(f"  - {change}")

Need help?

Email me at [email protected]