Purpose Of This Document
This document outlines the project's objectives, requirements, architecture, and timeline for building the web scraping platform. It details what the system should do (functional requirements such as scraping logic and data matching) and how it should perform (non-functional requirements like scalability, frequency, and error recovery). A high-level architecture is proposed for how data flows from the source websites through scrapers into a unified database, with a sample data schema provided. Finally, the document identifies key milestones for development and open questions that need clarification from stakeholders.
Additionally, I have created this document in Coda as a form of living documentation to ensure it is easy to manage, update, and search for relevant information over time.
Executive Summary
Purpose: Provide a high-level overview of the product or solution.
Content: Summarizes the current situation, the problem or complication faced, the key question that needs answering, and the proposed solution or strategy.
Situation
AutoCompare Inc. aims to enhance its online car lease comparison platform by providing users with the most current and comprehensive lease offers in the Dutch market. This requires continuously updated data on key attributes such as car make, model, pricing, seller details, specifications, and images from leading lease websites like JustLease.nl, DirectLease.nl, and 123Lease.nl.
The Need for Real-Time and Structured Lease Data
- Users demand accurate, up-to-date lease comparisons in a single, centralized platform.
- Lease data is fragmented across various third-party websites with inconsistent formats.
- Manual collection is inefficient and unsustainable as listing volume increases.
Complication
Barriers to Manual Data Collection and Integration
- Issue #1: Scattered Data Across Multiple Sources. Lease information is fragmented across various websites (JustLease.nl, DirectLease.nl, 123Lease.nl).
- Issue #2: Inconsistent Formats and Inefficient Collection. Each platform presents data differently, and manual collection leads to inefficiencies, errors, and slow updates.
- Issue #3: No Automation, Outdated Listings. Without an automated system, listings risk becoming stale, hurting user trust and experience on the platform.

Question
How can AutoCompare collect, clean, and keep lease data current across multiple third-party sites, reliably and at scale?

Answer
Build an automated scraping platform with the following features:
- Scrapers for each target website.
- A data pipeline for cleaning, standardizing, and matching car listings.
- Centralized storage to deliver structured JSON or database output.
- Automated refresh schedules to keep listings up to date.
- Robust error handling to detect and fix scraping issues quickly.

This platform ensures AutoCompare can serve users with accurate, real-time lease comparisons that are automated, scalable, and low-maintenance.

Raw Requirement
Purpose: Capture all initial inputs from stakeholders or clients.
Content: Often unstructured or loosely defined (e.g., from meetings, PDFs, or notes), this section acts as the source for detailed refinement later.
Objectives
Purpose: Break the goal down into achievable and measurable sub-goals.
Content: Key results that indicate success, e.g., the objectives listed below.
Goal
To enable AutoCompare to provide users with accurate, up-to-date, and structured car lease data from multiple Dutch providers through an automated, scalable web scraping platform.
- Comprehensive Data Collection: Scrape car make, model, price/month, provider, lease terms, specs, images, etc. Start with JustLease.nl, DirectLease.nl, and 123Lease.nl, with future site expansion in mind.
- Data Standardization and Integration: Normalize data into a unified schema, ready for platform integration. Detect and link equivalent listings across sites using smart matching logic.
- Schedule automated scrapes (e.g., daily) to keep listings current.
- Detect broken scrapers, missing fields, or anomalies; trigger alerts and fail gracefully.
- Scalability & Maintainability: Easily update existing scrapers or add new ones with minimal dev effort. Support cloud-based deployment for stability and performance.
- Deliver cleaned, matched, and structured data in JSON (or via API) to AutoCompare's systems.

Scope of Work & Deliverables
The following deliverables are included in the initial project scope and directly support the above goals:
- Python-based scrapers for JustLease.nl, DirectLease.nl, and 123Lease.nl; handle pagination, detail pages, and structured field extraction.
- Data Normalization & Cleaning Pipeline: Convert scraped data into a consistent, platform-ready format; standardize terminology, data types, and categories.
- Cross-Platform Matching Engine: Fuzzy matching logic to group the same car models from different sites.
- Scripted jobs for periodic (e.g., daily) scraping runs.
- Detection of scraper failures or missing fields, triggering alerts (e.g., via email or a monitoring tool).
- Deployment of scraping and processing pipelines to a cloud provider (AWS or Azure), supporting scalable execution and centralized storage.
- Export of final data in JSON or via a basic API endpoint; schema to be aligned with AutoCompare's integration format.
- Basic Monitoring Dashboard or Logs: Show job status, number of records scraped, and errors per run.

Out-of-Scope
These items are explicitly not included in this initial delivery:
- No development or UI design for AutoCompare's user-facing comparison platform.
- Advanced Analytics or Reporting: No business intelligence dashboards or trend analytics beyond basic logs.
- Legal or Regulatory Compliance Work: Assumes the client has approved web scraping and addressed any legal concerns.
- Ongoing monitoring, bug fixes, or site changes post-launch are not included unless a separate SLA is agreed.
Requirements
Purpose: Define what the system must do and how it should behave.
Content: Split into Functional and Non-Functional Requirements.
Functional Requirements
Purpose: Detail specific system behaviors or functions.
Content: Covers the specific features each module must provide, detailed per module below.
Scraper Modules
Description: Python-based scrapers for JustLease.nl, DirectLease.nl, and 123Lease.nl to collect lease car data.
Functional Requirements:
- The system must initialize a separate scraper for each source website, with site-specific navigation logic.
- It must iterate over paginated results to ensure complete data capture across all available listings.
- For listings that link to a detail page, the system must visit those pages and extract extended fields, including specifications, image URLs, and lease conditions.
- Each scraper must extract the following fields:
  - Make & Model: including variant names.
  - Price per Month: in Euros, extracted cleanly from various formats.
  - Lease Terms: including duration (months) and mileage (km/year).
  - Seller/Provider Info: capturing either the main site name or the third-party provider.
  - Car Specifications: fuel type, engine/battery, transmission, body type, condition (new/used), and other visible specifications.
  - Image URLs: must include at least one image per car.
  - Listing URL: the direct page of the offer.
- Scrapers must respect website crawling rules (robots.txt).
- They must apply throttling/delays between requests.
- They must include user-agent headers to simulate human behavior.
- If a site uses JavaScript for content, the system must use a headless browser (e.g., Selenium), while preferring lightweight libraries (e.g., Scrapy, BeautifulSoup) where possible.
For technical details, see here.
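To make the pagination, throttling, and field-extraction requirements concrete, below is a minimal sketch of one scraper module using requests and BeautifulSoup. The listing URL, CSS selectors, and user-agent string are placeholders to be replaced after inspecting each site's real markup; a Selenium-based variant would replace the HTTP fetch where JavaScript rendering is required.

```python
# Minimal scraper sketch (illustrative only). BASE_URL and all selectors are
# hypothetical placeholders, not the real JustLease.nl structure.
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "AutoCompareBot/0.1 (+https://autocompare.example)"}  # identify the crawler
BASE_URL = "https://www.justlease.nl/private-lease/aanbod"  # assumed listing URL, verify before use


def scrape_listing_pages(max_pages: int = 50, delay_s: float = 2.0) -> list[dict]:
    """Iterate over paginated results and collect raw listing records."""
    records: list[dict] = []
    for page in range(1, max_pages + 1):
        resp = requests.get(BASE_URL, params={"page": page}, headers=HEADERS, timeout=30)
        if resp.status_code != 200:
            break
        soup = BeautifulSoup(resp.text, "html.parser")
        cards = soup.select("div.lease-card")  # hypothetical selector
        if not cards:  # no more results -> stop paginating
            break
        for card in cards:
            title = card.select_one("h3.title")   # hypothetical selectors
            price = card.select_one("span.price")
            link = card.select_one("a")
            records.append({
                "make_model": title.get_text(strip=True) if title else None,
                "price_per_month_raw": price.get_text(strip=True) if price else None,
                "detail_url": link["href"] if link else None,
            })
        time.sleep(delay_s)  # throttle between requests
    return records
```

Detail-page extraction would follow the same pattern, fetching each `detail_url` and pulling the extended specification fields.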
Data Normalization & Cleaning Pipeline
Description: Converts raw scraped data into a unified, structured, and standardized format ready for integration.
Functional Requirements:
- The system must clean and convert all numeric fields (e.g., monthly prices, durations, mileage) to standardized formats.
- It must clean inconsistent representations across sites (e.g., whitespace, punctuation, currency symbols).
- It must standardize categorical values (e.g., "Electric" vs "EV", "SUV" vs "Crossover").
- The normalization logic must:
  - Ensure field values match the schema data type.
  - Allow extension for new fields (e.g., new spec types introduced later).
  - Validate data and flag inconsistencies.
For technical details, see here.
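A minimal sketch of the cleaning and standardization logic described above. The field names, the category mapping, and the validation approach are illustrative assumptions; the definitive schema comes from AutoCompare's integration format.

```python
# Illustrative normalization helpers. Field names and FUEL_TYPE_MAP entries are
# assumptions for this sketch, not the final schema.
import re

FUEL_TYPE_MAP = {"ev": "Electric", "elektrisch": "Electric", "benzine": "Petrol", "diesel": "Diesel"}


def parse_price(raw: str) -> float | None:
    """Convert strings like '€ 329,-' or '1.234,56' to a float (Euros per month)."""
    cleaned = re.sub(r"[^\d,.]", "", raw).replace(".", "").replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


def normalize_record(raw: dict) -> dict:
    """Map one raw scraped record onto the unified schema and flag missing values."""
    record = {
        "make": raw.get("make", "").strip().title(),
        "model": raw.get("model", "").strip(),
        "price_per_month": parse_price(raw.get("price_per_month_raw", "")),
        "fuel_type": FUEL_TYPE_MAP.get(raw.get("fuel_type", "").strip().lower(),
                                       raw.get("fuel_type", "").strip()),
        "duration_months": int(re.sub(r"\D", "", raw.get("duration", "") or "") or 0),
    }
    # Validation: flag fields that came back empty so they can be reported downstream.
    record["issues"] = [field for field, value in record.items() if value in (None, "", 0)]
    return record
```

New categories or spec fields can then be supported by extending the mapping dictionaries rather than rewriting the pipeline.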
Cross-Platform Matching Engine
Description: Links equivalent car listings across websites using a hybrid semantic-matching algorithm that combines text embeddings (SBERT), fuzzy string matching, and canonical dictionary mapping. This approach enables accurate model grouping despite textual inconsistencies, abbreviations, or variant naming differences across sites.
Functional Requirements:
- The system must generate a semantic embedding for each listing using key fields: make, model, variant, and fuel_type.
- It must compute cosine similarity between embeddings and group listings above a configured threshold (e.g., ≥ 0.92) under the same model_id.
- It must apply fuzzy string matching (e.g., Levenshtein distance or FuzzyWuzzy) as a fallback when embedding confidence is borderline (e.g., 0.85–0.91).
- It must use a canonical dictionary to normalize known synonyms (e.g., "VW" → "Volkswagen", "LR" → "Long Range") prior to embedding or comparison.
- It must assign a unique model_id to all listings determined to represent the same vehicle, preserving their individual lease terms and pricing.
- It must support configurable thresholds and logging of ambiguous matches for QA review or future training refinement.
- It must maintain record uniqueness, ensuring different variants or unmatched entries remain distinct.
For technical details, see here.
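A minimal sketch of the pairwise matching decision, assuming the sentence-transformers library for SBERT embeddings and rapidfuzz (a maintained alternative to FuzzyWuzzy) for the fuzzy fallback. The model name, thresholds, and canonical dictionary entries are examples taken from this document, not tuned values; grouping into a shared model_id would cluster on top of this pairwise check.

```python
# Hybrid matching sketch: canonical mapping -> SBERT embedding -> cosine similarity,
# with fuzzy string matching as a fallback in the borderline band.
from rapidfuzz import fuzz
from sentence_transformers import SentenceTransformer, util

CANONICAL = {"VW": "Volkswagen", "LR": "Long Range"}  # example synonym dictionary
model = SentenceTransformer("all-MiniLM-L6-v2")       # assumed general-purpose SBERT model


def canonicalize(text: str) -> str:
    """Replace known abbreviations before embedding or comparison."""
    for abbr, full in CANONICAL.items():
        text = text.replace(abbr, full)
    return text


def same_model(a_listing: dict, b_listing: dict, hi: float = 0.92, lo: float = 0.85) -> bool:
    """Return True if two listings should share a model_id."""
    a = canonicalize(" ".join(str(a_listing.get(k, "")) for k in ("make", "model", "variant", "fuel_type")))
    b = canonicalize(" ".join(str(b_listing.get(k, "")) for k in ("make", "model", "variant", "fuel_type")))
    emb = model.encode([a, b], normalize_embeddings=True)
    sim = float(util.cos_sim(emb[0], emb[1]))
    if sim >= hi:
        return True
    if lo <= sim < hi:  # borderline band -> fuzzy string fallback
        return fuzz.token_sort_ratio(a, b) >= 90
    return False
```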
Automated Scheduler
Description: Enables time-based and manual execution of scraping jobs.
Functional Requirements:
- The system must support configurable recurring schedules (e.g., daily, weekly).
- It must queue or block overlapping runs to avoid conflicts.
- It must allow manual triggering of scrapes for QA or hotfixes.
- It must generate execution metadata (e.g., timestamps, duration).
For technical details, see here.
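One way to satisfy these requirements is APScheduler, sketched below; a plain cron entry or a cloud scheduler would serve equally well. `run_all_scrapers` is a placeholder for the real pipeline entry point, and the 02:00 nightly time is only an example.

```python
# Scheduling sketch using APScheduler. The schedule, job id, and pipeline entry
# point are illustrative placeholders.
from datetime import datetime, timezone

from apscheduler.schedulers.blocking import BlockingScheduler


def run_all_scrapers():
    # Execution metadata: timestamp at start; duration can be logged on completion.
    print(f"[{datetime.now(timezone.utc).isoformat()}] scraping job started")
    # ... launch per-site scrapers, normalization, matching, export ...


scheduler = BlockingScheduler()
scheduler.add_job(
    run_all_scrapers,
    trigger="cron", hour=2, minute=0,   # nightly at 02:00 (configurable)
    max_instances=1, coalesce=True,     # block overlapping runs and collapse missed ones
    id="nightly-scrape",
)

if __name__ == "__main__":
    # A manual QA/hotfix run can simply call run_all_scrapers() directly.
    scheduler.start()
```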
Structured Data Output
Description: Aggregated data is exported to a JSON format or exposed through a basic API.
Functional Requirements:
- The system must output a JSON array containing all normalized, matched listings.
- Each JSON object must include fields for make, model, price, terms, specs, provider, and image URLs.
- It must include metadata such as source website and scrape timestamp.
- It must support exposing the data through an API endpoint (e.g., /api/leaseoffers).
- It must provide filtering capabilities on the API (e.g., by brand, fuel type).
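A minimal sketch of the output endpoint using FastAPI (one possible framework choice for a "basic API", not a mandated one). The sample record and the in-memory LISTINGS list are hypothetical stand-ins for the real data store; field names follow the schema described in this document.

```python
# API sketch: /api/leaseoffers with optional brand and fuel-type filters.
# LISTINGS and its single sample record are hypothetical placeholders.
from typing import Optional

from fastapi import FastAPI

app = FastAPI()

LISTINGS = [
    {
        "make": "Volkswagen", "model": "Golf", "variant": "1.5 TSI",
        "price_per_month": 399, "duration_months": 48, "mileage_per_year": 10000,
        "fuel_type": "Petrol", "provider": "JustLease.nl",
        "image_urls": ["https://example.com/golf.jpg"],
        "source_site": "justlease.nl", "scraped_at": "2024-01-01T02:00:00Z",
        "model_id": "vw-golf-15tsi",
    },
]


@app.get("/api/leaseoffers")
def lease_offers(make: Optional[str] = None, fuel_type: Optional[str] = None):
    """Return normalized, matched listings, optionally filtered by brand and fuel type."""
    result = LISTINGS
    if make:
        result = [r for r in result if r["make"].lower() == make.lower()]
    if fuel_type:
        result = [r for r in result if r["fuel_type"].lower() == fuel_type.lower()]
    return result
```

Example query: `GET /api/leaseoffers?make=Volkswagen&fuel_type=Petrol` returns only the matching listings.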
For technical details, see here.

Error Logging & Alerting
Description: Identifies, logs, and reports issues during scraping and processing.
Functional Requirements:
- The system must log job status (success/failure) and capture all encountered errors.
- It must retry network-related errors up to a maximum retry threshold.
- It must detect data field extraction failures and log missing or null fields.
- It must send email/Slack alerts for scraper failure, zero listings, or excessive null values.
- It must produce daily summaries of scraper health and performance.
For technical details, see here.
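A minimal sketch of the retry-and-alert behavior described above. The retry count, backoff, and `send_alert` hook are illustrative; in practice the alert would be wired to email or a Slack webhook.

```python
# Retry-with-logging sketch. Parameters and the alert hook are assumptions.
import logging
import time

import requests

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scraper")


def send_alert(message: str) -> None:
    # Placeholder: post to a Slack webhook or send an email here.
    log.error("ALERT: %s", message)


def fetch_with_retry(url: str, max_retries: int = 3, backoff_s: float = 5.0) -> str | None:
    """Fetch a page, retrying network-related errors up to max_retries."""
    for attempt in range(1, max_retries + 1):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException as exc:
            log.warning("attempt %d/%d failed for %s: %s", attempt, max_retries, url, exc)
            time.sleep(backoff_s * attempt)  # simple linear backoff
    log.error("giving up on %s after %d attempts", url, max_retries)
    send_alert(f"Scraper failure: {url} unreachable")
    return None
```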
Basic Monitoring Dashboard or Logs
Description: Provides visual and programmatic insight into scraping and pipeline performance.
Functional Requirements:
- The system must track job status, data volume (number of cars scraped), and error rate per run.
- It must display time series data for daily/monthly trends.
- It must allow download/export of logs for investigation.
- It may optionally provide a browser-accessible UI for job status.
- It must highlight anomalies (e.g., large drops in record count) for operational visibility.
For technical details, see here.
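A small sketch of the anomaly highlighting mentioned above: comparing the current run's record count against the previous run. The 50% drop threshold is an assumed default to be tuned with the client.

```python
# Anomaly-check sketch: flag a run whose record count drops sharply versus the
# previous run (e.g., 200 -> 0). The 0.5 threshold is an assumed default.
def record_count_anomaly(previous_count: int, current_count: int, max_drop: float = 0.5) -> bool:
    """Return True when the count fell by more than `max_drop` of the previous run."""
    if previous_count <= 0:
        return False  # no baseline to compare against
    drop = (previous_count - current_count) / previous_count
    return drop > max_drop


# Example: record_count_anomaly(200, 0) -> True; record_count_anomaly(200, 180) -> False
```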
Deployment
This module defines how the AutoCompare scraping platform is deployed and operated. While cloud deployment (AWS or Azure) is the most recommended approach due to its scalability, operational efficiency, and managed services, the final deployment environment (cloud, on-premise, or hybrid) must be confirmed by the client, as no definitive infrastructure requirement has been provided yet.
Deployment Considerations:
- Cloud (Recommended): Supports elasticity, auto-scaling, and low DevOps overhead via managed services (e.g., AWS Fargate, Azure App Service).
- On-Premise: Can be considered if the client requires strict control over data residency or infrastructure.
- Hybrid: Possible if certain components (e.g., API exposure, storage) need to reside on-prem while others run in the cloud.

Non-Functional Requirements
Purpose: Specify the qualities the system must have.
Content: Includes scalability, reliability, performance (e.g., scrape duration < 1 hour), maintainability, and observability.
Scalability
The scraping platform should handle increasing load and additional sources over time. This means:
- Ability to add new websites with minimal changes to the overall system (modular scraper design).
- If the volume of data grows (for example, if each site adds many more listings or if AutoCompare wants to scrape 10+ websites in the future), the system's architecture (using cloud resources, databases, etc.) should be able to scale up.
- The database or storage solution must handle more records and possibly concurrent writes/reads if scaled out. Using a cloud database service can ensure we can scale storage and throughput as needed.

Performance
The end-to-end scraping and data processing pipeline should be reasonably efficient:
- Timeliness: A full scraping cycle (all sites) should complete within an acceptable time window (e.g., within a few hours at most, ideally under 1 hour if run nightly). This ensures data freshness. If one site is particularly large, consider multithreading or concurrency in that scraper to fetch pages faster, as long as it doesn't overwhelm the site.
- Responsiveness: If an API is provided for the data, its responses should be quick (sub-second for queries) so that the AutoCompare platform can load comparisons without delay. This implies using indices or efficient queries in the data store. Note that extremely real-time performance is not required (we are not expecting changes minute-by-minute), but the system should not be sluggish in retrieving or updating data.

Reliability & Robustness
The system should be highly reliable in obtaining and delivering data:
- Scrapers must be robust to minor changes in HTML structure (using stable selectors, or having fallback strategies).
- When failures occur, the system should fail gracefully: for example, if one site is down, it should not prevent others from being scraped and the data pipeline completing for those.
- Automate recovery where possible: if a transient network error happens, the scraper could retry a few times. If a parsing error occurs, it might skip that item and continue.
- Utilize logging to record normal operations and errors. This will help quickly diagnose issues.
- Consider implementing a watchdog or monitor that ensures the scraping jobs actually run on schedule. For example, if a scheduled run did not happen or froze, it should be detected (perhaps via a timeout or a missing "heartbeat" file) so that corrective action can be taken.

Maintainability
The codebase and system design should be maintainable over time:
- Use a modular architecture, for example one module or script per source site. This way, if DirectLease.nl changes its layout, a developer can go directly to the DirectLease scraper module and update the selectors or logic without impacting the others.
- Clear documentation of the scraping logic for each site (what URLs are hit, what data is expected) should be provided, perhaps in comments or a wiki. This helps future developers quickly understand and modify scrapers as needed.
- The matching and normalization rules might evolve (e.g., new car models, new fields). These should be configured in a way that's easy to update (for instance, a config file or easily editable code section for the model name mappings, rather than hard-coding everywhere).
- If using cloud infrastructure, infrastructure-as-code (like scripts or Terraform) can be used to document how the system is deployed, making it easier to recreate or modify the environment.

Error Handling and Monitoring
Closely related to reliability, the system should have strong monitoring:
- Set up monitoring dashboards or alerts (for example, using CloudWatch on AWS or Application Insights on Azure) to track the scraping jobs. Key metrics: success/failure of each job, runtime, number of records scraped, etc.
- Send notifications (email, Slack, etc.) to the development/operations team if a job fails or if the scraped data deviates significantly (as mentioned in the functional requirements).
- Maintain logs of each run. Ideally, logs can be centralized (e.g., stored in CloudWatch Logs or Azure Log Analytics) so that developers can inspect what happened on each run. This helps troubleshoot issues like selectors not finding elements (which would appear as missing data in logs).
- Error recovery: If a scraper fails mid-run, the system could either restart that scraper or mark it for manual intervention, depending on the error type. Ensuring that one failed site doesn't block the others is important (e.g., use independent processes or threads per site).

Frequency and Scheduling
The system must support the required update frequency in a reliable way:
- We expect to run the scraping at regular intervals (to be determined with the client, e.g., nightly at 2 AM, or twice a week). This schedule should be configurable. Using a scheduler (cron jobs on a server or scheduled functions in the cloud) is necessary to automate this.
- The scheduling mechanism should also avoid overlapping runs if a previous run hasn't finished. For example, if using a cloud scheduler, ensure the job cannot start again if it's still running.
- The platform should also allow on-demand runs (for instance, a developer can trigger a scrape manually if needed for testing or if an urgent update is needed outside the schedule).

Security and Compliance
While the data being scraped is public, we should still consider security:
- Data Security: Store the scraped data in a secure manner. If using cloud storage or a DB, restrict access so that only authorized systems (like AutoCompare's servers) can read it. Although the data isn't sensitive personal info, it's still proprietary aggregate data for AutoCompare.
- Credentials: If any target site in the future requires login or API keys, those credentials must be stored securely (e.g., in AWS Secrets Manager or Azure Key Vault) and not hard-coded (see the sketch below).
- Legal & Ethical Compliance: Ensure that our scraping abides by the websites' terms of service and robots.txt rules. Many sites allow scraping of public data, but if a site explicitly forbids it, we need to discuss with the client how to proceed (possibly get permission or use an API if offered). In our current scope (the given Dutch lease sites), these are public listings, but this should be verified. Rate limiting and identifying as AutoCompare's bot via User-Agent can help maintain a good relationship with the data sources.
- No PII: We are not scraping personal data, so privacy compliance (GDPR, etc.) is not a major concern here, but we should still handle data responsibly.
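A sketch of the credential-handling point above, assuming AWS Secrets Manager is chosen (Azure Key Vault offers an equivalent client). The secret name is a placeholder.

```python
# Credential-retrieval sketch: fetch secrets at runtime instead of hard-coding them.
# The secret name is a hypothetical placeholder.
import json

import boto3


def get_site_credentials(secret_name: str = "autocompare/scraper-credentials") -> dict:
    """Return the credential payload stored in AWS Secrets Manager."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])
```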
Extensibility

The solution should be built with future expansion in mind:
- Adding a new website should be as simple as writing a new scraper module and plugging it into the pipeline (see the sketch at the end of this section). The core pipeline (normalization, matching, storage) remains the same. This requires a flexible design where new sources can be registered/configured easily.
- The matching logic might need to evolve to cover more edge cases or new types of matching (for example, matching specific trim levels or adding VIN matching if ever available). The code should be written to allow such extension without a complete rewrite.
- We might also consider internationalization if AutoCompare expands beyond the Netherlands. While not in scope now, keeping the code adaptable (not hardcoding strings in Dutch, etc.) could be beneficial.

Tech Stack Constraints

As per assumptions, the tech stack is Python-based and likely to be deployed on AWS or Azure. Non-functional requirements here include:
- Compatibility: Use libraries and versions that are stable and widely supported in the deployment environment (for example, Python 3.x, the latest Scrapy version, etc.). Avoid very cutting-edge libraries that might have bugs.
- Cloud Deployability: The code should run reliably on cloud services. For instance, if using AWS Lambda for scraping, ensure each scrape can complete within Lambda's time limits and memory (or use AWS Batch/ECS for longer tasks). If using Azure Functions, similarly ensure compatibility.
- Resource Management: Use resources efficiently to keep cloud costs in check. For example, if using an EC2 instance or VM, schedule it to run only when needed, or use serverless approaches so we pay only per use. This is more of a cost consideration but ties into how we design for scalability and efficiency.
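A minimal sketch of the modular, extensible design referenced under Extensibility: new scrapers register themselves with the core pipeline, so adding a site does not require touching the rest of the system. Class and site names are illustrative, not a mandated design.

```python
# Plug-in registry sketch: each site scraper registers itself; the core pipeline
# only iterates the registry. Names here are illustrative.
SCRAPERS: dict[str, type] = {}


def register_scraper(site_name: str):
    """Class decorator that adds a scraper class to the pipeline's registry."""
    def decorator(cls):
        SCRAPERS[site_name] = cls
        return cls
    return decorator


@register_scraper("justlease.nl")
class JustLeaseScraper:
    def run(self) -> list[dict]:
        # Site-specific navigation and extraction would go here.
        return []


def run_all() -> dict[str, list[dict]]:
    """Adding a new site means adding one decorated class; this loop never changes."""
    return {site: scraper_cls().run() for site, scraper_cls in SCRAPERS.items()}
```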
Data Flow / Architecture Overview
Purpose: Visualize how data moves through the system.
Content: Step-by-step flow from scraper trigger → data extraction → normalization → matching → storage → frontend/API integration → monitoring.
Data Flow Detail
1. Trigger Stage: Job Initialization
Source:
- Cloud scheduler (AWS CloudWatch, Azure Logic App), or a manual/on-demand trigger.

Action:
- Triggers the scraping job by launching scraper tasks for all configured websites.
- Logs job ID, timestamp, and source metadata.

Output:
- Initiates parallel scraper modules for each source.

2. Web Scrapers (Multi-Site)
Source:
- Target websites (e.g., JustLease.nl, DirectLease.nl, 123Lease.nl)

Action:
- Extracts listings and follows links to detail pages.
- Uses Selenium when JavaScript rendering is required.
- Captures key fields such as make, model, price_per_month, fuel_type, etc.

Output:
- Raw JSON files per site with unstructured or semi-structured listing data.
- Initial scrape logs (number of records, crawl time, errors).

3. Processing Stage: Normalization & Matching
Source:
- All raw JSONs from scrapers.

Action:
- Cleans and standardizes fields (e.g., converts "EV" → "Electric", strips "€" symbols).
- Validates data types and fills missing values with nulls or placeholders.
- Uses hybrid logic: canonical mappings + SBERT-based semantic embeddings.
- Identifies similar vehicles across providers (e.g., "VW Golf TSI" = "Volkswagen Golf 1.5 TSI").
- Assigns a model_id for grouping.

Matching flow: [Scraped Raw Text] → SBERT → [Top-k Semantic Matches] → Canonical Validator → [Final Unified Output] → [Storage / API]
Output:
- Unified JSON or structured data object with metadata (source site, scrape timestamp, model_id).

4. Persistence Stage: Data Storage
Source:
- Unified dataset from the processing stage.

Action:
- Saves cleaned and matched data to:
  - Cloud-native (recommended): AWS S3, DynamoDB, or Azure Blob/SQL.
  - On-prem database/file system, if required.
- Supports both document-based (JSON) and structured (SQL/NoSQL) formats.
- All data entries are tagged with metadata: source_site, record_id, created_at, and job_id.
- Ensures atomic overwrite per scraping job to avoid partial/inconsistent reads (see the sketch below).
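A sketch of the cloud-native persistence option, writing one versioned, job-tagged object per run and overwriting a single "latest" object that readers query. The bucket name and key layout are assumptions.

```python
# Persistence sketch for the S3 option. Bucket name and key layout are placeholders;
# an on-prem file system or database write would follow the same shape.
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "autocompare-lease-data"  # placeholder bucket name


def store_job_output(job_id: str, records: list[dict]) -> str:
    """Write one complete JSON object per scraping job, plus an overwritten 'latest' copy."""
    payload = json.dumps({
        "job_id": job_id,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "records": records,
    })
    versioned_key = f"leaseoffers/{job_id}.json"  # keeps historical datasets per job
    s3.put_object(Bucket=BUCKET, Key=versioned_key, Body=payload)
    s3.put_object(Bucket=BUCKET, Key="leaseoffers/latest.json", Body=payload)  # single object readers query
    return versioned_key
```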
Output:
- Ready-to-query database or JSON feed.
- Versioned historical datasets (optional).

5. Delivery Stage: API & Frontend Integration
Source:
- Central storage (S3, DB, Blob)

Action:
- Exposed through a lightweight REST API (/api/leaseoffers) or via a secure file path (e.g., an S3 URL).
- Supports query filters (e.g., make=Tesla, fuel_type=Electric).
- The AutoCompare platform fetches and renders grouped comparisons to users.

Output:
- JSON payloads or a visual comparison UI on the frontend.
- Real-time or interval-based updates.

6. Oversight Stage: Monitoring & Alerting
- Record count changes (anomalies, e.g., a drop from 200 → 0).
- Field-level nulls or extraction issues.
- Alerts via email, Slack, or other configured endpoints.
- Logs stored in the cloud (AWS CloudWatch, Azure Log Analytics), or centralized file logs for on-premise deployments.
- Optional: generate daily scraping reports for the ops dashboard.

Tools and Technology Recommendations
Tech Stack Table (Summary)
More Detailed Version
Glossary / Definitions
- AutoCompare Platform: The end-user facing application/website of AutoCompare Inc. where consumers can compare car lease deals.
- Scraper/Spider: A script or program that extracts information from a website. In this context, one per source website.
- Normalization: The process of converting data to a standard format (e.g., uniform units, naming, types).
- Matching (Deduplication): Identifying when two records refer to the same real-world item (here, the same car model offer) and linking them.
- Cron: A time-based job scheduler in Unix-like systems (used as a generic term for scheduling jobs).
- API: Application Programming Interface. Here, it refers to a web service that could provide the data to the front-end in JSON format.
- JSON: JavaScript Object Notation, a lightweight data-interchange format, used here for the structured output of scraped data.
- Lease Terms: Conditions of the lease such as duration and mileage allowance.
- Private Lease vs Operational Lease: Private lease is for individual consumers; operational (or business) lease is for companies. Our context includes both if available, but primarily focused on consumer/private lease deals as indicated by the sites.
- Occasion: A Dutch term for a used car. "Private Lease occasion" refers to leasing a used car (usually at a lower price or shorter availability). Our scrapers will note whether a listing is an occasion or new.
- KM: Kilometer, used for mileage (distance).
- RWD: Rear-Wheel Drive (an example of a variant detail in a model name, relevant to the variant field).