PRD Web Scraping

Module: Automated Scheduler

Technical Specification for the Scraper Modules

Overview

This module is responsible for managing the execution of scraping jobs on a scheduled or on-demand basis. It ensures that scrapers run consistently (e.g., daily or weekly), do not overlap, and produce traceable metadata logs for observability. It supports both production automation and developer-triggered runs.

Technical Components

| Component | Technology | Purpose |
|---|---|---|
| Scheduler Engine | Cron (Linux) / AWS EventBridge / Azure Scheduler | Automates job triggering at fixed intervals |
| Job Executor | Python CLI / container entrypoint | Executes the scraping pipeline when triggered |
| Lock Mechanism | File lock / Redis / cloud-native mutex | Prevents overlapping executions |
| Trigger API | Flask / FastAPI (optional) | Allows manual triggering from a web panel or CLI |
| Metadata Logger | Python logging / CloudWatch / Log Analytics | Records timestamps, durations, and execution results |
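
As an illustration of how the Scheduler Engine and Job Executor fit together, the sketch below uses APScheduler as an in-process alternative to system cron or a cloud scheduler; the `run_scrape_job` function is a hypothetical placeholder for the real container entrypoint, not the final implementation.

```python
# Illustrative in-process alternative to cron/EventBridge using APScheduler.
# Assumption: run_scrape_job is a placeholder for the real Job Executor entrypoint.
from apscheduler.schedulers.blocking import BlockingScheduler


def run_scrape_job() -> None:
    """Placeholder for the Job Executor (Python CLI / container entrypoint)."""
    print("scrape job triggered")


scheduler = BlockingScheduler()
scheduler.add_job(run_scrape_job, "cron", hour=2, minute=0, id="daily_scrape")
scheduler.start()  # blocks and fires the job every day at 02:00
```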

Scheduler Behaviors

| Behavior | Logic |
|---|---|
| Recurring Job Scheduling | Define cron-style schedules in config; each job is triggered by time. |
| Conflict Prevention | Use a locking mechanism (file lock or cloud-native) to prevent overlap; see the sketch after this table. |
| Manual Triggering | Expose an internal CLI or API so QA/dev can initiate ad hoc runs. |
| Metadata Collection | On each run, log the start time, end time, runtime duration, and job status. |
| Configurable Intervals | The job interval and target scraper modules are configurable via .env or YAML. |
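
A minimal sketch of the Conflict Prevention behavior using a POSIX file lock on a single host (Redis or a cloud-native mutex would play the same role across machines); the lock path matches the `LOCK_FILE_PATH` example in the configuration table below, and the pipeline call is a placeholder.

```python
# Minimal sketch of the Conflict Prevention behavior using a POSIX file lock.
# Assumptions: single Linux host; LOCK_FILE_PATH matches the example config value.
import fcntl
import sys

LOCK_FILE_PATH = "/tmp/scrape.lock"


def acquire_lock(path: str):
    """Return an open, locked file handle, or None if another run holds the lock."""
    lock_file = open(path, "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking exclusive lock
        return lock_file
    except BlockingIOError:
        lock_file.close()
        return None


if __name__ == "__main__":
    lock = acquire_lock(LOCK_FILE_PATH)
    if lock is None:
        print("Another scrape job is already running; skipping this trigger.")
        sys.exit(0)
    try:
        print("running scraping pipeline...")  # placeholder for the real pipeline call
    finally:
        fcntl.flock(lock, fcntl.LOCK_UN)  # release the lock so the next run can start
        lock.close()
```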

Input/Output Specification

| Type | Format | Description |
|---|---|---|
| Input | Schedule rule (cron/timer) | Triggers an automated job (e.g., every day at 02:00) |
| Input | Manual trigger (optional) | Triggered manually by a user or developer; see the API sketch after this table |
| Output | Execution log / job metadata | Contains job ID, timestamp, site name, duration, and result status |
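
The optional manual trigger could be exposed as a small HTTP endpoint. The sketch below shows one possible shape using FastAPI; the route path and the hard-coded site list are illustrative assumptions.

```python
# Minimal sketch of the optional manual trigger as an HTTP endpoint (FastAPI).
# Assumptions: the route path and the hard-coded site list are illustrative only.
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()


def run_scrape_job(sites: list[str]) -> None:
    """Placeholder for the real scraping pipeline call."""
    print(f"scraping {sites}")


@app.post("/jobs/trigger")
def trigger_job(background_tasks: BackgroundTasks) -> dict:
    sites = ["JustLease", "123Lease"]
    # Run the job in the background so the HTTP request returns immediately.
    background_tasks.add_task(run_scrape_job, sites)
    return {"status": "accepted", "sites": sites}
```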

Developer Configuration Options

| Config Key | Example | Description |
|---|---|---|
| SCRAPE_INTERVAL | daily, cron | Frequency of the scheduled job |
| SCRAPER_LIST | ["JustLease", "123Lease"] | Sites to include in this schedule |
| RETRY_ON_FAIL | true | Whether to retry failed jobs automatically |
| LOCK_FILE_PATH | /tmp/scrape.lock | Lock file path used to prevent concurrent runs |
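
A minimal sketch of reading these keys in Python, assuming they are supplied as environment variables (e.g., via a .env file) and that SCRAPER_LIST is stored as a JSON array string; the defaults are illustrative only.

```python
# Minimal sketch of reading the keys above from environment variables (e.g., a .env file).
# Assumptions: SCRAPER_LIST is stored as a JSON array string; defaults are illustrative.
import json
import os

SCRAPE_INTERVAL = os.environ.get("SCRAPE_INTERVAL", "daily")
SCRAPER_LIST = json.loads(os.environ.get("SCRAPER_LIST", '["JustLease", "123Lease"]'))
RETRY_ON_FAIL = os.environ.get("RETRY_ON_FAIL", "true").lower() == "true"
LOCK_FILE_PATH = os.environ.get("LOCK_FILE_PATH", "/tmp/scrape.lock")

print(SCRAPE_INTERVAL, SCRAPER_LIST, RETRY_ON_FAIL, LOCK_FILE_PATH)
```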

Workflow Logic (End-to-End)

1. Initialize Scheduler: based on a cron definition or cloud schedule (e.g., 02:00 daily).
2. Check for Active Job:
   - If a lock exists → skip the job or queue it.
   - If not → continue to the next step.
3. Execute Job:
   - Call the scraping module(s) with the appropriate configs.
   - Record the start and end times, along with success/failure.
4. Log Metadata:
   - Record execution info to a central logging system.
   - Report anomalies (e.g., if scraping fails or the duration is abnormal).
5. Unlock:
   - Release the job lock and exit cleanly.
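
The sketch below ties these steps into a single run function, assuming a file-based lock; `run_scraper`, the logger setup, and the job ID format are hypothetical placeholders.

```python
# Minimal sketch of the end-to-end workflow (lock check -> execute -> log -> unlock).
# Assumptions: file-based lock; run_scraper() and the job ID format are placeholders.
import datetime
import fcntl
import logging

LOCK_FILE_PATH = "/tmp/scrape.lock"
logger = logging.getLogger("scheduler")


def run_scraper(site: str) -> None:
    """Placeholder for the real scraping module call."""


def run_scheduled_job(sites: list[str]) -> dict:
    lock_file = open(LOCK_FILE_PATH, "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)  # step 2: check for an active job
    except BlockingIOError:
        lock_file.close()
        logger.warning("Previous job still running; skipping this trigger.")
        return {"status": "skipped"}

    start = datetime.datetime.now(datetime.timezone.utc)
    errors: list[str] = []
    try:
        for site in sites:  # step 3: execute the scraping module(s)
            try:
                run_scraper(site)
            except Exception as exc:  # collect per-site failures instead of aborting the run
                errors.append(f"{site}: {exc}")
    finally:
        end = datetime.datetime.now(datetime.timezone.utc)
        metadata = {  # step 4: record execution metadata for the central log
            "job_id": f"job_{start:%Y-%m-%d_%H:%M}",
            "status": "success" if not errors else "failed",
            "duration_sec": int((end - start).total_seconds()),
            "scraped_sites": sites,
            "errors": errors,
        }
        logger.info("job finished: %s", metadata)
        fcntl.flock(lock_file, fcntl.LOCK_UN)  # step 5: release the lock and exit cleanly
        lock_file.close()
    return metadata
```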

Monitoring & Observability

Job Summary Logs:
- Timestamp
- Duration
- Sites scraped
- Number of listings extracted
- Errors (if any)

Anomaly Detection:
- If a site returns zero results → raise a warning
- If job duration exceeds a threshold → flag as delayed
- If a job fails multiple times → notify the team

Sample Log Output:
```json
{
  "job_id": "job_2025-04-07_02:00",
  "status": "success",
  "duration_sec": 312,
  "scraped_sites": ["JustLease", "DirectLease"],
  "errors": []
}
```
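
A minimal sketch of the anomaly checks above; the thresholds, function name, and listing counts are assumptions for illustration.

```python
# Minimal sketch of the anomaly checks above; thresholds and counts are illustrative.
MAX_DURATION_SEC = 1800       # flag jobs slower than 30 minutes as delayed
MAX_CONSECUTIVE_FAILURES = 3  # notify the team after three failed runs in a row


def detect_anomalies(metadata: dict, listings_per_site: dict, consecutive_failures: int) -> list[str]:
    warnings = []
    for site, count in listings_per_site.items():
        if count == 0:
            warnings.append(f"warning: {site} returned zero results")
    if metadata["duration_sec"] > MAX_DURATION_SEC:
        warnings.append("delayed: job duration exceeded threshold")
    if consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
        warnings.append("alert: job failed repeatedly, notify the team")
    return warnings


# Example against the sample log output above (listing counts are made up):
print(detect_anomalies({"duration_sec": 312}, {"JustLease": 42, "DirectLease": 0}, consecutive_failures=0))
```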

Extensibility

- Multiple Schedules: support per-site custom schedules.
- Parallelization: a future version can support concurrent site scraping with resource limiters.
- UI Trigger: optional frontend button or API endpoint for triggering jobs manually.
- Failover: re-attempt failed jobs automatically or alert for retry.