Change Detection

Meter’s change detection system identifies when scraped content has actually changed, filtering out layout changes, ad rotations, and timestamp churn that don’t represent meaningful updates.

Why change detection matters

Traditional scraping wastes resources by re-processing unchanged data. For RAG systems, this means:

Wasted embeddings: Re-embedding identical content
Timestamp noise: Re-processing triggered by irrelevant date changes
Layout noise: Reacting to CSS class or ad changes
Higher costs: Unnecessary API calls and storage

Meter solves this by comparing content structurally, detecting only meaningful changes.

How it works

Meter generates multiple signatures for each scrape job:

Content Hash

A hash of the extracted data itself. Changes only if the actual content changes.

Structural Signature

A fingerprint of the content structure and patterns. Detects additions, removals, and reordering.

Content hash

The content hash is a cryptographic hash of the extracted data:

job = client.get_job(job_id)
print(job['content_hash'])  # e.g., "7f3d9a2b4c1e..."

Changes trigger when:

Text content is different
Prices, numbers, or values change
New items appear or old ones disappear
Item order changes significantly

Doesn’t change for:

CSS classes or styling
Ad content (if not part of extraction)
Timestamps (if not extracted)
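
This page does not describe how Meter computes the content hash, so the following is only a minimal sketch of the general idea, assuming the extracted data is JSON-serializable. The `content_hash` helper and the choice of SHA-256 over canonical JSON are assumptions made for illustration, not Meter's actual implementation.

```python
import hashlib
import json


def content_hash(items):
    """Hash extracted data only (hypothetical helper, not Meter's algorithm)."""
    # Canonical JSON: sorted keys and fixed separators, so key order and
    # whitespace in the serialized form cannot affect the hash.
    canonical = json.dumps(
        items, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


before = [{"name": "Widget", "price": "$10.00"}]
same_data = [{"price": "$10.00", "name": "Widget"}]  # same fields, different key order
new_price = [{"name": "Widget", "price": "$12.00"}]

print(content_hash(before) == content_hash(same_data))  # True
print(content_hash(before) == content_hash(new_price))  # False
```

Because only extracted fields are hashed, markup, styling, and anything outside the extraction never reach the hash. Note that this sketch treats any reordering of the list as a change, which is stricter than the behavior described above.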

Structural signature

The structural signature captures patterns in the data:

job = client.get_job(job_id)
print(job['structural_signature'])  # Structural fingerprint dict

This detects:

Number of items changing
Field presence/absence
Data type changes
List length changes
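
To make the idea concrete, here is a rough sketch of what a structural fingerprint could look like. The `structural_signature` function and the keys of the returned dict (`item_count`, `fields`) are hypothetical; the real signature returned by Meter may be shaped differently.

```python
def structural_signature(items):
    """Summarize the shape of extracted data (hypothetical, for illustration)."""
    fields = {}
    for item in items:
        for key, value in item.items():
            # Record every type name seen for each field across all items.
            fields.setdefault(key, set()).add(type(value).__name__)
    return {
        "item_count": len(items),
        "fields": {key: sorted(types) for key, types in sorted(fields.items())},
    }


old = [{"name": "Widget", "price": 10.0}, {"name": "Gadget", "price": 25.0}]
new = [{"name": "Widget", "price": "10.00"}]  # one item removed, price became a string

print(structural_signature(old))
print(structural_signature(new))
```

A fingerprint like this ignores the values themselves, so two scrapes with different prices but the same fields, types, and item count would produce the same signature.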

Comparing jobs

Automatic comparison (schedules)

Schedules automatically compare new jobs with previous ones:

# Create schedule
schedule = client.create_schedule(
    strategy_id=strategy_id,
    url="https://example.com/products",
    interval_seconds=3600
)

# Check for changes
changes = client.get_schedule_changes(schedule['schedule_id'])

if changes['count'] > 0:
    print(f"Detected {changes['count']} jobs with changes")

Only jobs where content actually changed are returned.

Manual comparison

Compare two specific jobs:

comparison = client.compare_jobs(job_id_1, job_id_2)

print(f"Content hash match: {comparison['content_hash_match']}")
print(f"Structural match: {comparison['structural_match']}")

if not comparison['content_hash_match']:
    print("Content has changed!")

Use manual comparison to build custom change detection logic or investigate specific changes.
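
As one possible starting point for custom logic, the two flags can be combined into a coarse label. This is a sketch that assumes `content_hash_match` and `structural_match` are booleans; the `classify_comparison` helper and its label names are hypothetical.

```python
def classify_comparison(comparison):
    """Map the two match flags to a coarse label (labels are hypothetical)."""
    content_same = comparison["content_hash_match"]
    structure_same = comparison["structural_match"]

    if content_same and structure_same:
        return "unchanged"
    if structure_same:
        # Same shape, different values: e.g. a price or a headline was edited.
        return "values_changed"
    # Shape differs: items or fields were added or removed, or types changed.
    return "structure_changed"


print(classify_comparison({"content_hash_match": False, "structural_match": True}))
# values_changed
```

A label like this can drive different handling, such as re-embedding on `values_changed` but reviewing the extraction strategy on `structure_changed`.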

Change detection strategies

Pull-based monitoring

Poll for changes periodically:

# Check every hour for changes
import time

while True:
    changes = client.get_schedule_changes(
        schedule_id,
        mark_seen=True
    )

    if changes['count'] > 0:
        print(f"Processing {changes['count']} changes")
        for change in changes['changes']:
            process_change(change['results'])

    time.sleep(3600)  # Wait 1 hour

Webhook-based monitoring

Receive immediate notifications:

# Create schedule with webhook
schedule = client.create_schedule(
    strategy_id=strategy_id,
    url="https://example.com/products",
    interval_seconds=3600,
    webhook_url="https://your-app.com/webhooks/meter"
)

# Webhook endpoint (FastAPI example)
from fastapi import FastAPI

app = FastAPI()

@app.post("/webhooks/meter")
async def handle_webhook(payload: dict):
    if payload['has_changes']:
        # Process only changed content
        results = payload['results']
        await update_vector_db(results)

    return {"status": "ok"}

Use cases

RAG system updates

Only re-embed when content changes:

changes = client.get_schedule_changes(schedule_id)

for change in changes['changes']:
    # Get current embeddings for this URL
    existing_vectors = vector_db.get(url=change['url'])

    # Delete old vectors
    vector_db.delete(existing_vectors.ids)

    # Generate new embeddings only for changed content
    new_vectors = embed(change['results'])
    vector_db.upsert(new_vectors)

Savings: Up to 95% reduction in embedding costs
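
When a changed job contains many items and only a few of them differ, re-embedding can be narrowed further by hashing each item. The sketch below assumes you store one hash per embedded item alongside your vectors; `item_hash` and `plan_reembedding` are hypothetical helpers, not part of the Meter client.

```python
import hashlib
import json


def item_hash(item):
    """Stable hash of one extracted item (hypothetical helper)."""
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_reembedding(stored_hashes, results):
    """Return (items_to_embed, hashes_to_delete) for one changed job.

    stored_hashes: set of item hashes already present in the vector store.
    results: list of extracted items from the changed job.
    """
    current = {item_hash(item): item for item in results}
    # New or edited items have hashes the store has not seen yet.
    to_embed = [item for h, item in current.items() if h not in stored_hashes]
    # Hashes that no longer appear belong to removed or edited items.
    to_delete = stored_hashes - set(current)
    return to_embed, to_delete


old_items = [{"title": "A", "body": "one"}, {"title": "B", "body": "two"}]
stored = {item_hash(item) for item in old_items}
new_items = [{"title": "A", "body": "one"}, {"title": "B", "body": "two, revised"}]

to_embed, to_delete = plan_reembedding(stored, new_items)
print(to_embed)  # only the revised item needs a new embedding
```

Here only item "B" is embedded again, and the hash of its old version is returned so the matching vector can be deleted.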

Price monitoring

Alert only on actual price changes:

changes = client.get_schedule_changes(schedule_id)

for change in changes['changes']:
    for product in change['results']:
        # Strip the currency symbol and thousands separators before parsing
        current_price = float(product['price'].replace('$', '').replace(',', ''))

        if current_price < price_threshold:
            send_alert(f"{product['name']} dropped to ${current_price}!")
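
The snippet above alerts whenever a price is below a fixed threshold. To alert only when a price actually moved since the last run, you can keep the previous price per product. This is a sketch under stated assumptions: prices are US-style strings such as "$1,299.00", product names are unique, and `parse_price` and `price_drops` are hypothetical helpers.

```python
from decimal import Decimal


def parse_price(text):
    """Parse a price string like '$1,299.00' into a Decimal."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def price_drops(last_seen, results):
    """Yield (name, old_price, new_price) for products whose price fell.

    last_seen: dict mapping product name -> Decimal price from the previous
    run; it is updated in place with the prices found in `results`.
    """
    for product in results:
        name = product["name"]
        new_price = parse_price(product["price"])
        old_price = last_seen.get(name)
        if old_price is not None and new_price < old_price:
            yield name, old_price, new_price
        last_seen[name] = new_price


last_seen = {"Laptop": Decimal("1299.00"), "Mouse": Decimal("20.00")}
results = [
    {"name": "Laptop", "price": "$1,199.00"},
    {"name": "Mouse", "price": "$20.00"},
    {"name": "Desk", "price": "$300.00"},  # first sighting, no alert
]
drops = list(price_drops(last_seen, results))
print(drops)
```

Using `Decimal` instead of `float` avoids rounding surprises when comparing currency amounts.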

Content freshness tracking

Track when content was last updated:

changes = client.get_schedule_changes(schedule_id)

for change in changes['changes']:
    # Update "last modified" timestamp
    db.update(
        url=change['url'],
        last_modified=change['completed_at']
    )

Filtering noise

Meter’s change detection automatically filters:

Layout changes: CSS classes, div structure changes
Ad rotations: If ads aren’t part of your extraction strategy
Timestamps: If not included in extraction fields
Order changes: Minor reordering that doesn’t affect content

To further filter noise in your extraction:

Focus extractions

Be specific about what you extract:

# Don't extract dynamic timestamps
result = client.generate_strategy(
    url="https://example.com",
    description="Extract article title and content only, ignore publish date"
)

# Extract only static product info
result = client.generate_strategy(
    url="https://shop.com/product/123",
    description="Extract product name, price, and description. Ignore related products and ads."
)

Compare strategically

Only compare the fields that matter:

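def meaningful_change(job1, job2):
    """Check if price or availability changed, ignore descriptions"""
    # zip() stops at the shorter list, so compare lengths first to catch
    # added or removed items.
    if len(job1['results']) != len(job2['results']):
        return True
    for item1, item2 in zip(job1['results'], job2['results']):
        if item1['price'] != item2['price']:
            return True
        if item1['in_stock'] != item2['in_stock']:
            return True
    return False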
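
The positional comparison above pairs items by index, so a harmless reordering can look like a change. A variant that matches items by a key avoids this. The sketch assumes each item has a unique `name` field; `changed_keys` is a hypothetical helper.

```python
def changed_keys(old_results, new_results, key="name", fields=("price", "in_stock")):
    """Return the keys of items that were added, removed, or changed in `fields`."""
    old_by_key = {item[key]: item for item in old_results}
    new_by_key = {item[key]: item for item in new_results}

    # Keys present on only one side were added or removed.
    changed = set(old_by_key) ^ set(new_by_key)
    for k in set(old_by_key) & set(new_by_key):
        if any(old_by_key[k].get(f) != new_by_key[k].get(f) for f in fields):
            changed.add(k)
    return sorted(changed)


old = [
    {"name": "A", "price": "$1", "in_stock": True, "description": "x"},
    {"name": "B", "price": "$2", "in_stock": True, "description": "y"},
]
new = [
    {"name": "B", "price": "$2", "in_stock": False, "description": "y"},
    {"name": "A", "price": "$1", "in_stock": True, "description": "edited"},
    {"name": "C", "price": "$3", "in_stock": True, "description": "z"},
]
print(changed_keys(old, new))  # ['B', 'C']
```

Item "A" is ignored because only its description changed, "B" is reported because its availability changed, and "C" is reported as newly added.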

Roadmap: Semantic similarity

Coming soon: Semantic similarity detection using embeddings to detect meaning-level changes even when wording differs.

Future versions will include:

Semantic comparison of text content
Paraphrase detection
Meaning-level change scoring

This will enable even smarter filtering: “Product is now on sale” vs. “Item currently discounted” would be detected as semantically identical.

Best practices

Mark changes as seen promptly

Avoid duplicate processing by marking changes as seen:

# Always use mark_seen=True in production
changes = client.get_schedule_changes(
    schedule_id,
    mark_seen=True
)

# Only use mark_seen=False for previewing
preview = client.get_schedule_changes(
    schedule_id,
    mark_seen=False
)
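
As an extra safeguard, for example when the same changes are fetched once with mark_seen=False and again later, you can track which job IDs your application has already handled. This sketch assumes each change carries a `job_id`; the `process_new_changes` helper and the in-memory set are illustrative, and a real deployment would likely persist the IDs.

```python
def process_new_changes(changes, seen_job_ids, handler):
    """Call handler once per job, skipping job IDs already in seen_job_ids."""
    processed = 0
    for change in changes["changes"]:
        job_id = change["job_id"]
        if job_id in seen_job_ids:
            continue  # already handled in an earlier run
        handler(change)
        # Record the ID only after the handler succeeds, so a failure is retried.
        seen_job_ids.add(job_id)
        processed += 1
    return processed


seen = set()
handled = []
batch = {"count": 2, "changes": [{"job_id": "job-1"}, {"job_id": "job-2"}]}

first_run = process_new_changes(batch, seen, handled.append)
second_run = process_new_changes(batch, seen, handled.append)  # same batch again
print(first_run, second_run)  # 2 0
```

The second call processes nothing because both job IDs were recorded during the first run.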

Handle empty changes gracefully

Not all scrapes will detect changes:

changes = client.get_schedule_changes(schedule_id)

if changes['count'] == 0:
    print("No changes detected - content is fresh")
else:
    process_changes(changes['changes'])

Log change detection for debugging

Track when changes are detected:

changes = client.get_schedule_changes(schedule_id)

logger.info(f"Checked schedule {schedule_id}: {changes['count']} changes")

for change in changes['changes']:
    logger.info(
        f"Job {change['job_id']}: "
        f"{change['item_count']} items, "
        f"content_hash={change['content_hash']}"
    )

Troubleshooting

Too many false positives

Problem: Changes detected for minor updates

Solutions:

Make extraction more specific (exclude dynamic elements)
Regenerate strategy with clearer description
Implement custom filtering logic on top of Meter’s detection
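
One form such custom filtering could take is to drop known-volatile fields before comparing two sets of results. The field names below (`updated_at`, `view_count`) and both helper functions are hypothetical; substitute whichever fields churn in your own extraction.

```python
def strip_volatile(results, volatile_fields=("updated_at", "view_count")):
    """Return copies of the items without fields that change on every scrape."""
    return [
        {k: v for k, v in item.items() if k not in volatile_fields}
        for item in results
    ]


def is_meaningful_change(old_results, new_results,
                         volatile_fields=("updated_at", "view_count")):
    """True if the results still differ after volatile fields are removed."""
    return (strip_volatile(old_results, volatile_fields)
            != strip_volatile(new_results, volatile_fields))


old = [{"title": "Guide", "view_count": 10, "updated_at": "2024-01-01"}]
noisy = [{"title": "Guide", "view_count": 11, "updated_at": "2024-01-02"}]
edited = [{"title": "Guide v2", "view_count": 11, "updated_at": "2024-01-02"}]

print(is_meaningful_change(old, noisy))   # False
print(is_meaningful_change(old, edited))  # True
```

Excluding these fields in the extraction strategy itself is usually cleaner; a filter like this is a fallback when the fields are needed downstream but should not trigger processing.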

Missing real changes

Problem: Actual changes aren’t detected

Possible causes:

Changes already marked as seen
Looking at wrong schedule
Strategy extraction failing

Solutions:

Use mark_seen=False to check without affecting state
Verify schedule ID
Check recent jobs for failures: client.list_jobs(status='failed')

Understanding change signatures

Problem: You want to understand why a change was detected

Solution: Compare jobs manually:

# Get last two jobs
jobs = client.list_jobs(strategy_id=strategy_id, limit=2)

if len(jobs) >= 2:
    comparison = client.compare_jobs(jobs[0]['job_id'], jobs[1]['job_id'])

    print(f"Content hash match: {comparison['content_hash_match']}")
    print(f"Structural match: {comparison['structural_match']}")

    if 'changes' in comparison:
        for change in comparison['changes']:
            print(f"  - {change}")

Next steps

Pull-Based Monitoring

Implement change polling in your application

Webhooks

Set up real-time change notifications

RAG Integration

Connect change detection to your vector database

Jobs API Reference

Explore job comparison methods

Need help?

Email me at mckinnon@meter.sh
