
Module: Automated Scheduler

Technical Specification for the Scraper Modules

Overview

This module is responsible for managing the execution of scraping jobs on a scheduled or on-demand basis. It ensures that scrapers run consistently (e.g., daily or weekly), do not overlap, and produce traceable metadata logs for observability. It supports both production automation and developer-triggered runs.

Technical Components

| Component | Technology | Purpose |
| --- | --- | --- |
| Scheduler Engine | Cron (Linux) / AWS EventBridge / Azure Scheduler | Automates job triggering at fixed intervals |
| Job Executor | Python CLI / Container Entrypoint | Executes the scraping pipeline when triggered |
| Lock Mechanism | File lock / Redis / Cloud-native mutex | Prevents overlapping executions |
| Trigger API | Flask / FastAPI (optional) | Allows manual triggering from a web panel or CLI |
| Metadata Logger | Python Logging / CloudWatch / Log Analytics | Records timestamps, durations, and execution results |
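As one hedged illustration of the Job Executor row, the sketch below shows a minimal Python CLI entrypoint. The flag names and the run_scraper helper are assumptions for illustration, not the module's actual interface.

```python
# Minimal sketch of the Job Executor entrypoint. Flag names and the
# run_scraper helper are hypothetical, not the module's real interface.
import argparse
import sys

def run_scraper(site: str, config_path: str) -> None:
    """Placeholder for the real scraping pipeline."""
    print(f"Scraping {site} with config {config_path}")

def main() -> None:
    parser = argparse.ArgumentParser(description="Run a scraping job")
    parser.add_argument("--site", action="append", required=True,
                        help="target site module; repeat for multiple sites")
    parser.add_argument("--config", default="config.yaml",
                        help="path to the job configuration file")
    args = parser.parse_args()

    failures = 0
    for site in args.site:
        try:
            run_scraper(site, args.config)
        except Exception as exc:  # one failing site should not stop the rest
            print(f"{site} failed: {exc}", file=sys.stderr)
            failures += 1
    sys.exit(1 if failures else 0)

if __name__ == "__main__":
    main()
```

The same entrypoint serves both cron and a container: a cron line or container command would invoke it as, e.g., `python run_job.py --site JustLease --site DirectLease` (script name hypothetical).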

Scheduler Behaviors

| Behavior | Logic |
| --- | --- |
| Recurring Job Scheduling | Define cron-style schedules in config. Each job is triggered by time. |
| Conflict Prevention | Use a locking mechanism (file lock or cloud-native) to prevent overlap. |
| Manual Triggering | Expose an internal CLI or API to allow QA/dev to initiate ad hoc runs. |
| Metadata Collection | On each run, log: start time, end time, runtime duration, job status. |
| Configurable Intervals | Job interval and target scraper modules are configurable via .env or YAML. |
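For the file-lock variant of Conflict Prevention, a minimal sketch (Linux-only, since fcntl is unavailable on Windows; the lock path is an assumed location):

```python
# File-lock sketch for Conflict Prevention (Linux-only; fcntl is not
# available on Windows). The lock path is an assumed location.
import fcntl
import sys

LOCK_PATH = "/tmp/scraper_job.lock"  # hypothetical lock file

def acquire_lock():
    handle = open(LOCK_PATH, "w")
    try:
        # Non-blocking exclusive lock: fails fast if another run holds it.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle

lock = acquire_lock()
if lock is None:
    print("Job already running; skipping this trigger.")
    sys.exit(0)
try:
    pass  # ... run the scraping pipeline here ...
finally:
    fcntl.flock(lock, fcntl.LOCK_UN)
    lock.close()
```

Swapping in Redis or a cloud-native mutex only changes acquire_lock; the skip-or-queue decision around it stays the same.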

Input/Output Specification

| Type | Item | Description |
| --- | --- | --- |
| Input | Schedule rule (cron/timer) | Triggers an automated job (e.g., every day at 02:00) |
| Input | Manual trigger (optional) | Triggered manually by a user or developer |
| Output | Execution log / job metadata | Contains job ID, timestamp, site name, duration, and result status |

Developer Configuration Options

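As a sketch of the .env/YAML configurability noted under Scheduler Behaviors, the snippet below parses a hypothetical YAML config with PyYAML; every key name here is an assumption, not a finalized option.

```python
# Hypothetical YAML configuration for the scheduler. Every key name is
# an assumption, not the module's actual schema. Requires PyYAML.
import yaml

SAMPLE_CONFIG = """
schedule: "0 2 * * *"    # cron expression: daily at 02:00
sites:                   # target scraper modules
  - JustLease
  - DirectLease
lock_backend: file       # file | redis | cloud
max_runtime_sec: 3600    # flag the job as delayed beyond this
retries: 2               # failover re-attempts
"""

config = yaml.safe_load(SAMPLE_CONFIG)
print(config["schedule"], config["sites"])
```

An .env-based setup would carry the same values as flat environment variables; YAML is shown only because it holds the per-site list more naturally.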

Workflow Logic (End-to-End)

1. Initialize Scheduler: based on a cron definition or cloud schedule (e.g., daily at 02:00).
2. Check for Active Job:
   - If a lock exists → skip the job or queue it.
   - If not → continue to the next step.
3. Execute Job:
   - Call the scraping module(s) with the appropriate configs.
   - Record the start and end times, along with success/failure.
4. Log Metadata:
   - Record execution info to a central logging system.
   - Report anomalies (if scraping fails or the duration is abnormal).
5. Unlock:
   - Release the job lock and exit cleanly.
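Putting the five steps together, a minimal end-to-end sketch; the helpers (try_acquire_lock, release_lock, run_scrapers) are illustrative placeholders, not the module's real API:

```python
# End-to-end sketch of one run, following steps 1-5 above. The helpers
# (try_acquire_lock, release_lock, run_scrapers) are illustrative
# placeholders, not the module's real API.
import json
import time
from datetime import datetime, timezone

def try_acquire_lock() -> bool:
    return True  # placeholder; see the file-lock sketch earlier

def release_lock() -> None:
    pass  # placeholder

def run_scrapers(sites: list[str]) -> None:
    pass  # placeholder: call the scraping modules with their configs

def run_job(sites: list[str]) -> None:
    if not try_acquire_lock():            # step 2: active job detected
        print("Lock held; skipping this run.")
        return
    started = time.monotonic()
    record = {
        "job_id": f"job_{datetime.now(timezone.utc):%Y-%m-%d_%H:%M}",
        "status": "success",
        "scraped_sites": sites,
        "errors": [],
    }
    try:
        run_scrapers(sites)               # step 3: execute the job
    except Exception as exc:
        record["status"] = "failure"
        record["errors"].append(str(exc))
    finally:
        record["duration_sec"] = round(time.monotonic() - started)
        print(json.dumps(record))         # step 4: log metadata centrally
        release_lock()                    # step 5: unlock, exit cleanly

run_job(["JustLease", "DirectLease"])
```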

Monitoring & Observability

- Job Summary Logs:
  - Timestamp
  - Duration
  - Sites scraped
  - Number of listings extracted
  - Errors (if any)
- Anomaly Detection:
  - If a site returns zero results → raise a warning
  - If job duration exceeds threshold → flag as delayed
  - If job fails multiple times → notify team
- Sample Log Output:

```json
{
  "job_id": "job_2025-04-07_02:00",
  "status": "success",
  "duration_sec": 312,
  "scraped_sites": ["JustLease", "DirectLease"],
  "errors": []
}
```
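A minimal sketch of the three anomaly rules above, applied to a metadata record shaped like the sample output; the 600-second threshold and the per-site listing counts are assumed values:

```python
# Sketch of the three anomaly rules above. The 600-second threshold and
# the per-site listing counts are assumed values for illustration.
MAX_DURATION_SEC = 600

def detect_anomalies(record: dict, listings_per_site: dict) -> list[str]:
    warnings = []
    for site, count in listings_per_site.items():
        if count == 0:
            warnings.append(f"{site} returned zero results")
    if record["duration_sec"] > MAX_DURATION_SEC:
        warnings.append("job delayed: duration exceeds threshold")
    if record["status"] != "success":
        warnings.append("job failed; notify team after repeated failures")
    return warnings

sample = {"job_id": "job_2025-04-07_02:00", "status": "success",
          "duration_sec": 312, "scraped_sites": ["JustLease", "DirectLease"],
          "errors": []}
print(detect_anomalies(sample, {"JustLease": 42, "DirectLease": 0}))
# -> ['DirectLease returned zero results']
```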

Extensibility

- Multiple Schedules: support per-site custom schedules.
- Parallelization: a future version can support concurrent site scraping with resource limiters.
- UI Trigger: optional frontend button or API endpoint for triggering jobs manually (see the sketch after this list).
- Failover: re-attempt failed jobs automatically, or alert for manual retry.
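For the UI Trigger item, a minimal sketch of a manual-trigger endpoint using FastAPI (one of the optional technologies in the components table); the route path, payload shape, and run_job placeholder are assumptions:

```python
# Sketch of the optional manual-trigger endpoint using FastAPI. The
# route path, payload shape, and run_job placeholder are assumptions.
from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI()

class TriggerRequest(BaseModel):
    sites: list[str]

def run_job(sites: list[str]) -> None:
    """Placeholder: acquire the lock and run the scraping pipeline."""
    print(f"Ad hoc run requested for {sites}")

@app.post("/jobs/trigger")
def trigger(req: TriggerRequest, background: BackgroundTasks):
    # Respond immediately; the job itself runs in the background.
    background.add_task(run_job, req.sites)
    return {"queued": req.sites}
```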