PRD Web Scraping

Module: Automated Scheduler

Technical Specification for the Scraper Modules

Overview

This module is responsible for managing the execution of scraping jobs on a scheduled or on-demand basis. It ensures that scrapers run consistently (e.g., daily or weekly), do not overlap, and produce traceable metadata logs for observability. It supports both production automation and developer-triggered runs.

Technical Components

| Component | Technology | Purpose |
|---|---|---|
| Scheduler Engine | Cron (Linux) / AWS EventBridge / Azure Scheduler | Automates job triggering at fixed intervals |
| Job Executor | Python CLI / container entrypoint | Executes the scraping pipeline when triggered |
| Lock Mechanism | File lock / Redis / cloud-native mutex | Prevents overlapping executions |
| Trigger API | Flask / FastAPI (optional) | Allows manual triggering from a web panel or CLI |
| Metadata Logger | Python logging / CloudWatch / Log Analytics | Records timestamps, durations, and execution results |
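
As an illustration of how the Scheduler Engine and Job Executor fit together, the sketch below uses APScheduler as an in-process alternative to system cron or a cloud scheduler; the `run_scrape_job` function is a hypothetical placeholder for the real container entrypoint, not the final implementation.

```python
# Illustrative in-process alternative to cron/EventBridge using APScheduler.
# Assumption: run_scrape_job is a placeholder for the real Job Executor entrypoint.
from apscheduler.schedulers.blocking import BlockingScheduler


def run_scrape_job() -> None:
    """Placeholder for the Job Executor (Python CLI / container entrypoint)."""
    print("scrape job triggered")


scheduler = BlockingScheduler()
scheduler.add_job(run_scrape_job, "cron", hour=2, minute=0, id="daily_scrape")
scheduler.start()  # blocks and fires the job every day at 02:00
```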

Scheduler Behaviors

| Behavior | Logic |
|---|---|
| Recurring Job Scheduling | Define cron-style schedules in config; each job is triggered by time. |
| Conflict Prevention | Use a locking mechanism (file lock or cloud-native) to prevent overlap; see the sketch after this table. |
| Manual Triggering | Expose an internal CLI or API so QA/dev can initiate ad hoc runs. |
| Metadata Collection | On each run, log the start time, end time, runtime duration, and job status. |
| Configurable Intervals | The job interval and target scraper modules are configurable via .env or YAML. |
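
A minimal sketch of the Conflict Prevention behavior using a POSIX file lock on a single host (Redis or a cloud-native mutex would play the same role across machines); the lock path matches the `LOCK_FILE_PATH` example in the configuration table below, and the pipeline call is a placeholder.

```python
# Minimal sketch of the Conflict Prevention behavior using a POSIX file lock.
# Assumptions: single Linux host; LOCK_FILE_PATH matches the example config value.
import fcntl
import sys

LOCK_FILE_PATH = "/tmp/scrape.lock"


def acquire_lock(path: str):
    """Return an open, locked file handle, or None if another run holds the lock."""
    lock_file = open(path, "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking exclusive lock
        return lock_file
    except BlockingIOError:
        lock_file.close()
        return None


if __name__ == "__main__":
    lock = acquire_lock(LOCK_FILE_PATH)
    if lock is None:
        print("Another scrape job is already running; skipping this trigger.")
        sys.exit(0)
    try:
        print("running scraping pipeline...")  # placeholder for the real pipeline call
    finally:
        fcntl.flock(lock, fcntl.LOCK_UN)  # release the lock so the next run can start
        lock.close()
```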

Input/Output Specification

| Type | Format | Description |
|---|---|---|
| Input | Schedule rule (cron/timer) | Triggers an automated job (e.g., every day at 02:00) |
| Input | Manual trigger (optional) | Triggered manually by a user or developer; see the API sketch after this table |
| Output | Execution log / job metadata | Contains job ID, timestamp, site name, duration, and result status |
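
The optional manual trigger could be exposed as a small HTTP endpoint. The sketch below shows one possible shape using FastAPI; the route path and the hard-coded site list are illustrative assumptions.

```python
# Minimal sketch of the optional manual trigger as an HTTP endpoint (FastAPI).
# Assumptions: the route path and the hard-coded site list are illustrative only.
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()


def run_scrape_job(sites: list[str]) -> None:
    """Placeholder for the real scraping pipeline call."""
    print(f"scraping {sites}")


@app.post("/jobs/trigger")
def trigger_job(background_tasks: BackgroundTasks) -> dict:
    sites = ["JustLease", "123Lease"]
    # Run the job in the background so the HTTP request returns immediately.
    background_tasks.add_task(run_scrape_job, sites)
    return {"status": "accepted", "sites": sites}
```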

Developer Configuration Options

| Config Key | Example | Description |
|---|---|---|
| SCRAPE_INTERVAL | daily, cron | Frequency of the scheduled job |
| SCRAPER_LIST | ["JustLease", "123Lease"] | Sites to include in this schedule |
| RETRY_ON_FAIL | true | Whether to retry failed jobs automatically |
| LOCK_FILE_PATH | /tmp/scrape.lock | Lock file path used to prevent concurrent runs |
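
A minimal sketch of reading these keys in Python, assuming they are supplied as environment variables (e.g., via a .env file) and that SCRAPER_LIST is stored as a JSON array string; the defaults are illustrative only.

```python
# Minimal sketch of reading the keys above from environment variables (e.g., a .env file).
# Assumptions: SCRAPER_LIST is stored as a JSON array string; defaults are illustrative.
import json
import os

SCRAPE_INTERVAL = os.environ.get("SCRAPE_INTERVAL", "daily")
SCRAPER_LIST = json.loads(os.environ.get("SCRAPER_LIST", '["JustLease", "123Lease"]'))
RETRY_ON_FAIL = os.environ.get("RETRY_ON_FAIL", "true").lower() == "true"
LOCK_FILE_PATH = os.environ.get("LOCK_FILE_PATH", "/tmp/scrape.lock")

print(SCRAPE_INTERVAL, SCRAPER_LIST, RETRY_ON_FAIL, LOCK_FILE_PATH)
```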

Workflow Logic (End-to-End)

1. Initialize Scheduler: based on a cron definition or cloud schedule (e.g., 02:00 daily).
2. Check for Active Job:
   - If a lock exists → skip the job or queue it.
   - If not → continue to the next step.
3. Execute Job:
   - Call the scraping module(s) with the appropriate configs.
   - Record the start and end times, along with success/failure.
4. Log Metadata:
   - Record execution info to a central logging system.
   - Report anomalies (e.g., if scraping fails or the duration is abnormal).
5. Unlock:
   - Release the job lock and exit cleanly.
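
The sketch below ties these steps into a single run function, assuming a file-based lock; `run_scraper`, the logger setup, and the job ID format are hypothetical placeholders.

```python
# Minimal sketch of the end-to-end workflow (lock check -> execute -> log -> unlock).
# Assumptions: file-based lock; run_scraper() and the job ID format are placeholders.
import datetime
import fcntl
import logging

LOCK_FILE_PATH = "/tmp/scrape.lock"
logger = logging.getLogger("scheduler")


def run_scraper(site: str) -> None:
    """Placeholder for the real scraping module call."""


def run_scheduled_job(sites: list[str]) -> dict:
    lock_file = open(LOCK_FILE_PATH, "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)  # step 2: check for an active job
    except BlockingIOError:
        lock_file.close()
        logger.warning("Previous job still running; skipping this trigger.")
        return {"status": "skipped"}

    start = datetime.datetime.now(datetime.timezone.utc)
    errors: list[str] = []
    try:
        for site in sites:  # step 3: execute the scraping module(s)
            try:
                run_scraper(site)
            except Exception as exc:  # collect per-site failures instead of aborting the run
                errors.append(f"{site}: {exc}")
    finally:
        end = datetime.datetime.now(datetime.timezone.utc)
        metadata = {  # step 4: record execution metadata for the central log
            "job_id": f"job_{start:%Y-%m-%d_%H:%M}",
            "status": "success" if not errors else "failed",
            "duration_sec": int((end - start).total_seconds()),
            "scraped_sites": sites,
            "errors": errors,
        }
        logger.info("job finished: %s", metadata)
        fcntl.flock(lock_file, fcntl.LOCK_UN)  # step 5: release the lock and exit cleanly
        lock_file.close()
    return metadata
```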

Monitoring & Observability

Job Summary Logs:
- Timestamp
- Duration
- Sites scraped
- Number of listings extracted
- Errors (if any)

Anomaly Detection:
- If a site returns zero results → raise a warning
- If job duration exceeds a threshold → flag as delayed
- If a job fails multiple times → notify the team

Sample Log Output:
```json
{
  "job_id": "job_2025-04-07_02:00",
  "status": "success",
  "duration_sec": 312,
  "scraped_sites": ["JustLease", "DirectLease"],
  "errors": []
}
```
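
A minimal sketch of the anomaly checks above; the thresholds, function name, and listing counts are assumptions for illustration.

```python
# Minimal sketch of the anomaly checks above; thresholds and counts are illustrative.
MAX_DURATION_SEC = 1800       # flag jobs slower than 30 minutes as delayed
MAX_CONSECUTIVE_FAILURES = 3  # notify the team after three failed runs in a row


def detect_anomalies(metadata: dict, listings_per_site: dict, consecutive_failures: int) -> list[str]:
    warnings = []
    for site, count in listings_per_site.items():
        if count == 0:
            warnings.append(f"warning: {site} returned zero results")
    if metadata["duration_sec"] > MAX_DURATION_SEC:
        warnings.append("delayed: job duration exceeded threshold")
    if consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
        warnings.append("alert: job failed repeatedly, notify the team")
    return warnings


# Example against the sample log output above (listing counts are made up):
print(detect_anomalies({"duration_sec": 312}, {"JustLease": 42, "DirectLease": 0}, consecutive_failures=0))
```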

Extensibility

- Multiple Schedules: support per-site custom schedules.
- Parallelization: a future version can support concurrent site scraping with resource limiters.
- UI Trigger: optional frontend button or API endpoint for triggering jobs manually.
- Failover: re-attempt failed jobs automatically or alert for retry.