Purpose Of This Document
This document outlines the project's objectives, requirements, architecture, and timeline for building the web scraping platform. It details what the system should do (functional requirements such as scraping logic and data matching) and how it should perform (non-functional requirements like scalability, frequency, and error recovery). A high-level architecture is proposed for how data flows from the source websites through scrapers into a unified database, with a sample data schema provided. Finally, the document identifies key milestones for development and open questions that need clarification from stakeholders.
Additionally, I have created this document in Coda as a form of living documentation to ensure it is easy to manage, update, and search for relevant information over time.
Executive Summary
Purpose: Provide a high-level overview of the product or solution.
Content: Summarizes the current situation, the problem or complication faced, the key question that needs answering, and the proposed solution or strategy.
Situation
AutoCompare Inc. aims to enhance its online car lease comparison platform by providing users with the most current and comprehensive lease offers in the Dutch market. This requires continuously updated data on key attributes such as car make, model, pricing, seller details, specifications, and images from leading lease websites like JustLease.nl, DirectLease.nl, and 123Lease.nl.
The Need for Real-Time and Structured Lease Data
- Users demand accurate, up-to-date lease comparisons in a single, centralized platform.
- Lease data is fragmented across various third-party websites with inconsistent formats.
- Manual collection is inefficient and unsustainable as listing volume increases.
Complication
Barriers to Manual Data Collection and Integration
- Issue #1: Scattered Data Across Multiple Sources. Lease information is fragmented across various websites (JustLease.nl, DirectLease.nl, 123Lease.nl).
- Issue #2: Inconsistent Formats and Inefficient Collection. Each platform presents data differently, and manual collection leads to inefficiencies, errors, and slow updates.
- Issue #3: No Automation, Outdated Listings. Without an automated system, listings risk becoming stale, hurting user trust and experience on the platform.

Question
How can AutoCompare collect, clean, and keep lease data current across multiple third-party sites, reliably and at scale?

Answer
Build an automated scraping platform with the following features:
- Scrapers for each target website.
- A data pipeline for cleaning, standardizing, and matching car listings.
- Centralized storage to deliver structured JSON or database output.
- Automated refresh schedules to keep listings up to date.
- Robust error handling to detect and fix scraping issues quickly.

This platform ensures AutoCompare can serve users with accurate, real-time lease comparisons that are automated, scalable, and low-maintenance.

Raw Requirement
Purpose: Capture all initial inputs from stakeholders or clients.
Content: Often unstructured or loosely defined (e.g., from meetings, PDFs, or notes), this section acts as the source for detailed refinement later.
Objectives
Purpose: Break the goal down into achievable and measurable sub-goals.
Content: Key results that indicate success, e.g., the objectives listed below.
Goal
To enable AutoCompare to provide users with accurate, up-to-date, and structured car lease data from multiple Dutch providers through an automated, scalable web scraping platform.
- Comprehensive Data Collection: Scrape car make, model, price/month, provider, lease terms, specs, images, etc. Start with JustLease.nl, DirectLease.nl, and 123Lease.nl, with future site expansion in mind.
- Data Standardization and Integration: Normalize data into a unified schema, ready for platform integration. Detect and link equivalent listings across sites using smart matching logic.
- Schedule automated scrapes (e.g., daily) to keep listings current.
- Detect broken scrapers, missing fields, or anomalies; trigger alerts and fail gracefully.
- Scalability & Maintainability: Easily update existing scrapers or add new ones with minimal dev effort. Support cloud-based deployment for stability and performance.
- Deliver cleaned, matched, and structured data in JSON (or via API) to AutoCompare's systems.

Scope of Work & Deliverables
The following deliverables are included in the initial project scope and directly support the above goals:
- Python-based scrapers for JustLease.nl, DirectLease.nl, and 123Lease.nl; handle pagination, detail pages, and structured field extraction.
- Data Normalization & Cleaning Pipeline: Convert scraped data into a consistent, platform-ready format; standardize terminology, data types, and categories.
- Cross-Platform Matching Engine: Fuzzy matching logic to group the same car models from different sites.
- Scripted jobs for periodic (e.g., daily) scraping runs.
- Detection of scraper failures or missing fields, triggering alerts (e.g., via email or a monitoring tool).
- Deployment of scraping and processing pipelines to a cloud provider (AWS or Azure), supporting scalable execution and centralized storage.
- Export of final data in JSON or via a basic API endpoint; schema to be aligned with AutoCompare's integration format.
- Basic Monitoring Dashboard or Logs: Show job status, number of records scraped, and errors per run.

Out-of-Scope
These items are explicitly not included in this initial delivery:
- No development or UI design for AutoCompare's user-facing comparison platform.
- Advanced Analytics or Reporting: No business intelligence dashboards or trend analytics beyond basic logs.
- Legal or Regulatory Compliance Work: Assumes the client has approved web scraping and addressed any legal concerns.
- Ongoing monitoring, bug fixes, or site changes post-launch are not included unless a separate SLA is agreed.
Requirements
Purpose: Define what the system must do and how it should behave.
Content: Split into Functional and Non-Functional Requirements.
Functional Requirements
Purpose: Detail specific system behaviors or functions.
Content: Covers the specific features each module must provide, detailed per module below.
Scraper Modules
Description: Python-based scrapers for JustLease.nl, DirectLease.nl, and 123Lease.nl to collect lease car data.
Functional Requirements:
- The system must initialize a separate scraper for each source website, with site-specific navigation logic.
- It must iterate over paginated results to ensure complete data capture across all available listings.
- For listings that link to a detail page, the system must visit those pages and extract extended fields, including specifications, image URLs, and lease conditions.
- Each scraper must extract the following fields:
  - Make & Model: including variant names.
  - Price per Month: in Euros, extracted cleanly from various formats.
  - Lease Terms: including duration (months) and mileage (km/year).
  - Seller/Provider Info: capturing either the main site name or the third-party provider.
  - Car Specifications: fuel type, engine/battery, transmission, body type, condition (new/used), and other visible specifications.
  - Image URLs: must include at least one image per car.
  - Listing URL: the direct page of the offer.
- Scrapers must respect website crawling rules (robots.txt).
- They must apply throttling/delays between requests.
- They must include user-agent headers to simulate human behavior.
- If a site uses JavaScript for content, the system must use a headless browser (e.g., Selenium), while preferring lightweight libraries (e.g., Scrapy, BeautifulSoup) where possible.
For technical details, see here.
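To make the pagination, throttling, and field-extraction requirements concrete, below is a minimal sketch of one scraper module using requests and BeautifulSoup. The listing URL, CSS selectors, and user-agent string are placeholders to be replaced after inspecting each site's real markup; a Selenium-based variant would replace the HTTP fetch where JavaScript rendering is required.

```python
# Minimal scraper sketch (illustrative only). BASE_URL and all selectors are
# hypothetical placeholders, not the real JustLease.nl structure.
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "AutoCompareBot/0.1 (+https://autocompare.example)"}  # identify the crawler
BASE_URL = "https://www.justlease.nl/private-lease/aanbod"  # assumed listing URL, verify before use


def scrape_listing_pages(max_pages: int = 50, delay_s: float = 2.0) -> list[dict]:
    """Iterate over paginated results and collect raw listing records."""
    records: list[dict] = []
    for page in range(1, max_pages + 1):
        resp = requests.get(BASE_URL, params={"page": page}, headers=HEADERS, timeout=30)
        if resp.status_code != 200:
            break
        soup = BeautifulSoup(resp.text, "html.parser")
        cards = soup.select("div.lease-card")  # hypothetical selector
        if not cards:  # no more results -> stop paginating
            break
        for card in cards:
            title = card.select_one("h3.title")   # hypothetical selectors
            price = card.select_one("span.price")
            link = card.select_one("a")
            records.append({
                "make_model": title.get_text(strip=True) if title else None,
                "price_per_month_raw": price.get_text(strip=True) if price else None,
                "detail_url": link["href"] if link else None,
            })
        time.sleep(delay_s)  # throttle between requests
    return records
```

Detail-page extraction would follow the same pattern, fetching each `detail_url` and pulling the extended specification fields.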
Data Normalization & Cleaning Pipeline
Description: Converts raw scraped data into a unified, structured, and standardized format ready for integration.
Functional Requirements:
- The system must clean and convert all numeric fields (e.g., monthly prices, durations, mileage) to standardized formats.
- It must clean inconsistent representations across sites (e.g., whitespace, punctuation, currency symbols).
- It must standardize categorical values (e.g., "Electric" vs "EV", "SUV" vs "Crossover").
- The normalization logic must:
  - Ensure field values match the schema data type.
  - Allow extension for new fields (e.g., new spec types introduced later).
  - Validate data and flag inconsistencies.
For technical details, see here.
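A minimal sketch of the cleaning and standardization logic described above. The field names, the category mapping, and the validation approach are illustrative assumptions; the definitive schema comes from AutoCompare's integration format.

```python
# Illustrative normalization helpers. Field names and FUEL_TYPE_MAP entries are
# assumptions for this sketch, not the final schema.
import re

FUEL_TYPE_MAP = {"ev": "Electric", "elektrisch": "Electric", "benzine": "Petrol", "diesel": "Diesel"}


def parse_price(raw: str) -> float | None:
    """Convert strings like '€ 329,-' or '1.234,56' to a float (Euros per month)."""
    cleaned = re.sub(r"[^\d,.]", "", raw).replace(".", "").replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


def normalize_record(raw: dict) -> dict:
    """Map one raw scraped record onto the unified schema and flag missing values."""
    record = {
        "make": raw.get("make", "").strip().title(),
        "model": raw.get("model", "").strip(),
        "price_per_month": parse_price(raw.get("price_per_month_raw", "")),
        "fuel_type": FUEL_TYPE_MAP.get(raw.get("fuel_type", "").strip().lower(),
                                       raw.get("fuel_type", "").strip()),
        "duration_months": int(re.sub(r"\D", "", raw.get("duration", "") or "") or 0),
    }
    # Validation: flag fields that came back empty so they can be reported downstream.
    record["issues"] = [field for field, value in record.items() if value in (None, "", 0)]
    return record
```

New categories or spec fields can then be supported by extending the mapping dictionaries rather than rewriting the pipeline.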
Cross-Platform Matching Engine
Description: Links equivalent car listings across websites using a hybrid semantic-matching algorithm that combines text embeddings (SBERT), fuzzy string matching, and canonical dictionary mapping. This approach enables accurate model grouping despite textual inconsistencies, abbreviations, or variant naming differences across sites.
Functional Requirements:
- The system must generate a semantic embedding for each listing using key fields: make, model, variant, and fuel_type.
- It must compute cosine similarity between embeddings and group listings above a configured threshold (e.g., ≥ 0.92) under the same model_id.
- It must apply fuzzy string matching (e.g., Levenshtein distance or FuzzyWuzzy) as a fallback when embedding confidence is borderline (e.g., 0.85–0.91).
- It must use a canonical dictionary to normalize known synonyms (e.g., "VW" → "Volkswagen", "LR" → "Long Range") prior to embedding or comparison.
- It must assign a unique model_id to all listings determined to represent the same vehicle, preserving their individual lease terms and pricing.
- It must support configurable thresholds and logging of ambiguous matches for QA review or future training refinement.
- It must maintain record uniqueness, ensuring different variants or unmatched entries remain distinct.
For technical details, see here.
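A minimal sketch of the pairwise matching decision, assuming the sentence-transformers library for SBERT embeddings and rapidfuzz (a maintained alternative to FuzzyWuzzy) for the fuzzy fallback. The model name, thresholds, and canonical dictionary entries are examples taken from this document, not tuned values; grouping into a shared model_id would cluster on top of this pairwise check.

```python
# Hybrid matching sketch: canonical mapping -> SBERT embedding -> cosine similarity,
# with fuzzy string matching as a fallback in the borderline band.
from rapidfuzz import fuzz
from sentence_transformers import SentenceTransformer, util

CANONICAL = {"VW": "Volkswagen", "LR": "Long Range"}  # example synonym dictionary
model = SentenceTransformer("all-MiniLM-L6-v2")       # assumed general-purpose SBERT model


def canonicalize(text: str) -> str:
    """Replace known abbreviations before embedding or comparison."""
    for abbr, full in CANONICAL.items():
        text = text.replace(abbr, full)
    return text


def same_model(a_listing: dict, b_listing: dict, hi: float = 0.92, lo: float = 0.85) -> bool:
    """Return True if two listings should share a model_id."""
    a = canonicalize(" ".join(str(a_listing.get(k, "")) for k in ("make", "model", "variant", "fuel_type")))
    b = canonicalize(" ".join(str(b_listing.get(k, "")) for k in ("make", "model", "variant", "fuel_type")))
    emb = model.encode([a, b], normalize_embeddings=True)
    sim = float(util.cos_sim(emb[0], emb[1]))
    if sim >= hi:
        return True
    if lo <= sim < hi:  # borderline band -> fuzzy string fallback
        return fuzz.token_sort_ratio(a, b) >= 90
    return False
```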
Automated Scheduler
Description: Enables time-based and manual execution of scraping jobs.
Functional Requirements:
- The system must support configurable recurring schedules (e.g., daily, weekly).
- It must queue or block overlapping runs to avoid conflicts.
- It must allow manual triggering of scrapes for QA or hotfixes.
- It must generate execution metadata (e.g., timestamps, duration).
For technical details, see here.
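One way to satisfy these requirements is APScheduler, sketched below; a plain cron entry or a cloud scheduler would serve equally well. `run_all_scrapers` is a placeholder for the real pipeline entry point, and the 02:00 nightly time is only an example.

```python
# Scheduling sketch using APScheduler. The schedule, job id, and pipeline entry
# point are illustrative placeholders.
from datetime import datetime, timezone

from apscheduler.schedulers.blocking import BlockingScheduler


def run_all_scrapers():
    # Execution metadata: timestamp at start; duration can be logged on completion.
    print(f"[{datetime.now(timezone.utc).isoformat()}] scraping job started")
    # ... launch per-site scrapers, normalization, matching, export ...


scheduler = BlockingScheduler()
scheduler.add_job(
    run_all_scrapers,
    trigger="cron", hour=2, minute=0,   # nightly at 02:00 (configurable)
    max_instances=1, coalesce=True,     # block overlapping runs and collapse missed ones
    id="nightly-scrape",
)

if __name__ == "__main__":
    # A manual QA/hotfix run can simply call run_all_scrapers() directly.
    scheduler.start()
```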
Structured Data Output
Description: Aggregated data is exported to a JSON format or exposed through a basic API.
Functional Requirements:
- The system must output a JSON array containing all normalized, matched listings.
- Each JSON object must include fields for make, model, price, terms, specs, provider, and image URLs.
- It must include metadata such as source website and scrape timestamp.
- It must support exposing the data through an API endpoint (e.g., /api/leaseoffers).
- It must provide filtering capabilities on the API (e.g., by brand, fuel type).
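A minimal sketch of the output endpoint using FastAPI (one possible framework choice for a "basic API", not a mandated one). The sample record and the in-memory LISTINGS list are hypothetical stand-ins for the real data store; field names follow the schema described in this document.

```python
# API sketch: /api/leaseoffers with optional brand and fuel-type filters.
# LISTINGS and its single sample record are hypothetical placeholders.
from typing import Optional

from fastapi import FastAPI

app = FastAPI()

LISTINGS = [
    {
        "make": "Volkswagen", "model": "Golf", "variant": "1.5 TSI",
        "price_per_month": 399, "duration_months": 48, "mileage_per_year": 10000,
        "fuel_type": "Petrol", "provider": "JustLease.nl",
        "image_urls": ["https://example.com/golf.jpg"],
        "source_site": "justlease.nl", "scraped_at": "2024-01-01T02:00:00Z",
        "model_id": "vw-golf-15tsi",
    },
]


@app.get("/api/leaseoffers")
def lease_offers(make: Optional[str] = None, fuel_type: Optional[str] = None):
    """Return normalized, matched listings, optionally filtered by brand and fuel type."""
    result = LISTINGS
    if make:
        result = [r for r in result if r["make"].lower() == make.lower()]
    if fuel_type:
        result = [r for r in result if r["fuel_type"].lower() == fuel_type.lower()]
    return result
```

Example query: `GET /api/leaseoffers?make=Volkswagen&fuel_type=Petrol` returns only the matching listings.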
For technical details, see here.

Error Logging & Alerting
Description: Identifies, logs, and reports issues during scraping and processing.
Functional Requirements:
- The system must log job status (success/failure) and capture all encountered errors.
- It must retry network-related errors up to a maximum retry threshold.
- It must detect data field extraction failures and log missing or null fields.
- It must send email/Slack alerts for scraper failure, zero listings, or excessive null values.
- It must produce daily summaries of scraper health and performance.
For technical details, see here.
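A minimal sketch of the retry-and-alert behavior described above. The retry count, backoff, and `send_alert` hook are illustrative; in practice the alert would be wired to email or a Slack webhook.

```python
# Retry-with-logging sketch. Parameters and the alert hook are assumptions.
import logging
import time

import requests

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scraper")


def send_alert(message: str) -> None:
    # Placeholder: post to a Slack webhook or send an email here.
    log.error("ALERT: %s", message)


def fetch_with_retry(url: str, max_retries: int = 3, backoff_s: float = 5.0) -> str | None:
    """Fetch a page, retrying network-related errors up to max_retries."""
    for attempt in range(1, max_retries + 1):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException as exc:
            log.warning("attempt %d/%d failed for %s: %s", attempt, max_retries, url, exc)
            time.sleep(backoff_s * attempt)  # simple linear backoff
    log.error("giving up on %s after %d attempts", url, max_retries)
    send_alert(f"Scraper failure: {url} unreachable")
    return None
```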
Basic Monitoring Dashboard or Logs
Description: Provides visual and programmatic insight into scraping and pipeline performance.
Functional Requirements:
- The system must track job status, data volume (number of cars scraped), and error rate per run.
- It must display time series data for daily/monthly trends.
- It must allow download/export of logs for investigation.
- It may optionally provide a browser-accessible UI for job status.
- It must highlight anomalies (e.g., large drops in record count) for operational visibility.
For technical details, see here.
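A small sketch of the anomaly highlighting mentioned above: comparing the current run's record count against the previous run. The 50% drop threshold is an assumed default to be tuned with the client.

```python
# Anomaly-check sketch: flag a run whose record count drops sharply versus the
# previous run (e.g., 200 -> 0). The 0.5 threshold is an assumed default.
def record_count_anomaly(previous_count: int, current_count: int, max_drop: float = 0.5) -> bool:
    """Return True when the count fell by more than `max_drop` of the previous run."""
    if previous_count <= 0:
        return False  # no baseline to compare against
    drop = (previous_count - current_count) / previous_count
    return drop > max_drop


# Example: record_count_anomaly(200, 0) -> True; record_count_anomaly(200, 180) -> False
```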
Deployment
This module defines how the AutoCompare scraping platform is deployed and operated. While cloud deployment (AWS or Azure) is the most recommended approach due to its scalability, operational efficiency, and managed services, the final deployment environment (cloud, on-premise, or hybrid) must be confirmed by the client, as no definitive infrastructure requirement has been provided yet.
Deployment Considerations:
- Cloud (Recommended): Supports elasticity, auto-scaling, and low DevOps overhead via managed services (e.g., AWS Fargate, Azure App Service).
- On-Premise: Can be considered if the client requires strict control over data residency or infrastructure.
- Hybrid: Possible if certain components (e.g., API exposure, storage) need to reside on-prem while others run in the cloud.

Non-Functional Requirements
Purpose: Specify the qualities the system must have.
Content: Includes scalability, reliability, performance (e.g., scrape duration < 1 hour), maintainability, and observability.
Scalability
The scraping platform should handle increasing load and additional sources over time. This means:
- Ability to add new websites with minimal changes to the overall system (modular scraper design).
- If the volume of data grows (for example, if each site adds many more listings or if AutoCompare wants to scrape 10+ websites in the future), the system's architecture (using cloud resources, databases, etc.) should be able to scale up.
- The database or storage solution must handle more records and possibly concurrent writes/reads if scaled out. Using a cloud database service can ensure we can scale storage and throughput as needed.

Performance
The end-to-end scraping and data processing pipeline should be reasonably efficient:
- Timeliness: A full scraping cycle (all sites) should complete within an acceptable time window (e.g., within a few hours at most, ideally under 1 hour if run nightly). This ensures data freshness. If one site is particularly large, consider multithreading or concurrency in that scraper to fetch pages faster, as long as it doesn't overwhelm the site.
- Responsiveness: If an API is provided for the data, its responses should be quick (sub-second for queries) so that the AutoCompare platform can load comparisons without delay. This implies using indices or efficient queries in the data store. Note that extremely real-time performance is not required (we are not expecting changes minute-by-minute), but the system should not be sluggish in retrieving or updating data.

Reliability & Robustness
The system should be highly reliable in obtaining and delivering data:
- Scrapers must be robust to minor changes in HTML structure (using stable selectors, or having fallback strategies).
- When failures occur, the system should fail gracefully: for example, if one site is down, it should not prevent others from being scraped and the data pipeline completing for those.
- Automate recovery where possible: if a transient network error happens, the scraper could retry a few times. If a parsing error occurs, it might skip that item and continue.
- Utilize logging to record normal operations and errors. This will help quickly diagnose issues.
- Consider implementing a watchdog or monitor that ensures the scraping jobs actually run on schedule. For example, if a scheduled run did not happen or froze, it should be detected (perhaps via a timeout or a missing "heartbeat" file) so that corrective action can be taken.

Maintainability
The codebase and system design should be maintainable over time:
- Use a modular architecture, for example one module or script per source site. This way, if DirectLease.nl changes its layout, a developer can go directly to the DirectLease scraper module and update the selectors or logic without impacting the others.
- Clear documentation of the scraping logic for each site (what URLs are hit, what data is expected) should be provided, perhaps in comments or a wiki. This helps future developers quickly understand and modify scrapers as needed.
- The matching and normalization rules might evolve (e.g., new car models, new fields). These should be configured in a way that's easy to update (for instance, a config file or easily editable code section for the model name mappings, rather than hard-coding everywhere).
- If using cloud infrastructure, infrastructure-as-code (like scripts or Terraform) can be used to document how the system is deployed, making it easier to recreate or modify the environment.

Error Handling and Monitoring
Closely related to reliability, the system should have strong monitoring:
- Set up monitoring dashboards or alerts (for example, using CloudWatch on AWS or Application Insights on Azure) to track the scraping jobs. Key metrics: success/failure of each job, runtime, number of records scraped, etc.
- Send notifications (email, Slack, etc.) to the development/operations team if a job fails or if the scraped data deviates significantly (as mentioned in the functional requirements).
- Maintain logs of each run. Ideally, logs can be centralized (e.g., stored in CloudWatch Logs or Azure Log Analytics) so that developers can inspect what happened on each run. This helps troubleshoot issues like selectors not finding elements (which would appear as missing data in logs).
- Error recovery: If a scraper fails mid-run, the system could either restart that scraper or mark it for manual intervention, depending on the error type. Ensuring that one failed site doesn't block the others is important (e.g., use independent processes or threads per site).

Frequency and Scheduling
The system must support the required update frequency in a reliable way:
- We expect to run the scraping at regular intervals (to be determined with the client, e.g., nightly at 2 AM, or twice a week). This schedule should be configurable. Using a scheduler (cron jobs on a server or scheduled functions in the cloud) is necessary to automate this.
- The scheduling mechanism should also avoid overlapping runs if a previous run hasn't finished. For example, if using a cloud scheduler, ensure the job cannot start again if it's still running.
- The platform should also allow on-demand runs (for instance, a developer can trigger a scrape manually if needed for testing or if an urgent update is needed outside the schedule).

Security and Compliance
While the data being scraped is public, we should still consider security:
- Data Security: Store the scraped data in a secure manner. If using cloud storage or a DB, restrict access so that only authorized systems (like AutoCompare's servers) can read it. Although the data isn't sensitive personal info, it's still proprietary aggregate data for AutoCompare.
- Credentials: If any target site in the future requires login or API keys, those credentials must be stored securely (e.g., in AWS Secrets Manager or Azure Key Vault) and not hard-coded (see the sketch below).
- Legal & Ethical Compliance: Ensure that our scraping abides by the websites' terms of service and robots.txt rules. Many sites allow scraping of public data, but if a site explicitly forbids it, we need to discuss with the client how to proceed (possibly get permission or use an API if offered). In our current scope (the given Dutch lease sites), these are public listings, but this should be verified. Rate limiting and identifying as AutoCompare's bot via User-Agent can help maintain a good relationship with the data sources.
- No PII: We are not scraping personal data, so privacy compliance (GDPR, etc.) is not a major concern here, but we should still handle data responsibly.
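A sketch of the credential-handling point above, assuming AWS Secrets Manager is chosen (Azure Key Vault offers an equivalent client). The secret name is a placeholder.

```python
# Credential-retrieval sketch: fetch secrets at runtime instead of hard-coding them.
# The secret name is a hypothetical placeholder.
import json

import boto3


def get_site_credentials(secret_name: str = "autocompare/scraper-credentials") -> dict:
    """Return the credential payload stored in AWS Secrets Manager."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])
```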
Extensibility

The solution should be built with future expansion in mind:
- Adding a new website should be as simple as writing a new scraper module and plugging it into the pipeline (see the sketch at the end of this section). The core pipeline (normalization, matching, storage) remains the same. This requires a flexible design where new sources can be registered/configured easily.
- The matching logic might need to evolve to cover more edge cases or new types of matching (for example, matching specific trim levels or adding VIN matching if ever available). The code should be written to allow such extension without a complete rewrite.
- We might also consider internationalization if AutoCompare expands beyond the Netherlands. While not in scope now, keeping the code adaptable (not hardcoding strings in Dutch, etc.) could be beneficial.

Tech Stack Constraints

As per assumptions, the tech stack is Python-based and likely to be deployed on AWS or Azure. Non-functional requirements here include:
- Compatibility: Use libraries and versions that are stable and widely supported in the deployment environment (for example, Python 3.x, the latest Scrapy version, etc.). Avoid very cutting-edge libraries that might have bugs.
- Cloud Deployability: The code should run reliably on cloud services. For instance, if using AWS Lambda for scraping, ensure each scrape can complete within Lambda's time limits and memory (or use AWS Batch/ECS for longer tasks). If using Azure Functions, similarly ensure compatibility.
- Resource Management: Use resources efficiently to keep cloud costs in check. For example, if using an EC2 instance or VM, schedule it to run only when needed, or use serverless approaches so we pay only per use. This is more of a cost consideration but ties into how we design for scalability and efficiency.
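A minimal sketch of the modular, extensible design referenced under Extensibility: new scrapers register themselves with the core pipeline, so adding a site does not require touching the rest of the system. Class and site names are illustrative, not a mandated design.

```python
# Plug-in registry sketch: each site scraper registers itself; the core pipeline
# only iterates the registry. Names here are illustrative.
SCRAPERS: dict[str, type] = {}


def register_scraper(site_name: str):
    """Class decorator that adds a scraper class to the pipeline's registry."""
    def decorator(cls):
        SCRAPERS[site_name] = cls
        return cls
    return decorator


@register_scraper("justlease.nl")
class JustLeaseScraper:
    def run(self) -> list[dict]:
        # Site-specific navigation and extraction would go here.
        return []


def run_all() -> dict[str, list[dict]]:
    """Adding a new site means adding one decorated class; this loop never changes."""
    return {site: scraper_cls().run() for site, scraper_cls in SCRAPERS.items()}
```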
Data Flow / Architecture Overview
Purpose: Visualize how data moves through the system.
Content: Step-by-step flow from scraper trigger → data extraction → normalization → matching → storage → frontend/API integration → monitoring.
Data Flow Detail
1. Trigger Stage: Job Initialization
Source:
- Cloud scheduler (AWS CloudWatch, Azure Logic App), or a manual/on-demand trigger.

Action:
- Triggers the scraping job by launching scraper tasks for all configured websites.
- Logs job ID, timestamp, and source metadata.

Output:
- Initiates parallel scraper modules for each source.

2. Web Scrapers (Multi-Site)
Source:
- Target websites (e.g., JustLease.nl, DirectLease.nl, 123Lease.nl)

Action:
- Extracts listings and follows links to detail pages.
- Uses Selenium when JavaScript rendering is required.
- Captures key fields such as make, model, price_per_month, fuel_type, etc.

Output:
- Raw JSON files per site with unstructured or semi-structured listing data.
- Initial scrape logs (number of records, crawl time, errors).

3. Processing Stage: Normalization & Matching
Source:
- All raw JSONs from scrapers.

Action:
- Cleans and standardizes fields (e.g., converts "EV" → "Electric", strips "€" symbols).
- Validates data types and fills missing values with nulls or placeholders.
- Uses hybrid logic: canonical mappings + SBERT-based semantic embeddings.
- Identifies similar vehicles across providers (e.g., "VW Golf TSI" = "Volkswagen Golf 1.5 TSI").
- Assigns a model_id for grouping.

Matching flow: [Scraped Raw Text] → SBERT → [Top-k Semantic Matches] → Canonical Validator → [Final Unified Output] → [Storage / API]
Output:
- Unified JSON or structured data object with metadata (source site, scrape timestamp, model_id).

4. Persistence Stage: Data Storage
Source:
- Unified dataset from the processing stage.

Action:
- Saves cleaned and matched data to:
  - Cloud-native (recommended): AWS S3, DynamoDB, or Azure Blob/SQL.
  - On-prem database/file system, if required.
- Supports both document-based (JSON) and structured (SQL/NoSQL) formats.
- All data entries are tagged with metadata: source_site, record_id, created_at, and job_id.
- Ensures atomic overwrite per scraping job to avoid partial/inconsistent reads (see the sketch below).
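A sketch of the cloud-native persistence option, writing one versioned, job-tagged object per run and overwriting a single "latest" object that readers query. The bucket name and key layout are assumptions.

```python
# Persistence sketch for the S3 option. Bucket name and key layout are placeholders;
# an on-prem file system or database write would follow the same shape.
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "autocompare-lease-data"  # placeholder bucket name


def store_job_output(job_id: str, records: list[dict]) -> str:
    """Write one complete JSON object per scraping job, plus an overwritten 'latest' copy."""
    payload = json.dumps({
        "job_id": job_id,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "records": records,
    })
    versioned_key = f"leaseoffers/{job_id}.json"  # keeps historical datasets per job
    s3.put_object(Bucket=BUCKET, Key=versioned_key, Body=payload)
    s3.put_object(Bucket=BUCKET, Key="leaseoffers/latest.json", Body=payload)  # single object readers query
    return versioned_key
```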
Output:
- Ready-to-query database or JSON feed.
- Versioned historical datasets (optional).

5. Delivery Stage: API & Frontend Integration
Source:
- Central storage (S3, DB, Blob)

Action:
- Exposed through a lightweight REST API (/api/leaseoffers) or via a secure file path (e.g., an S3 URL).
- Supports query filters (e.g., make=Tesla, fuel_type=Electric).
- The AutoCompare platform fetches and renders grouped comparisons to users.

Output:
- JSON payloads or a visual comparison UI on the frontend.
- Real-time or interval-based updates.

6. Oversight Stage: Monitoring & Alerting
- Record count changes (anomalies, e.g., a drop from 200 → 0).
- Field-level nulls or extraction issues.
- Alerts via email, Slack, or other configured endpoints.
- Logs stored in the cloud (AWS CloudWatch, Azure Log Analytics), or centralized file logs for on-premise deployments.
- Optional: generate daily scraping reports for the ops dashboard.

Tools and Technology Recommendations
Tech Stack Table (Summary)
More Detailed Version
Glossary / Definitions
- AutoCompare Platform: The end-user facing application/website of AutoCompare Inc. where consumers can compare car lease deals.
- Scraper/Spider: A script or program that extracts information from a website. In this context, one per source website.
- Normalization: The process of converting data to a standard format (e.g., uniform units, naming, types).
- Matching (Deduplication): Identifying when two records refer to the same real-world item (here, the same car model offer) and linking them.
- Cron: A time-based job scheduler in Unix-like systems (used as a generic term for scheduling jobs).
- API: Application Programming Interface. Here, it refers to a web service that could provide the data to the front-end in JSON format.
- JSON: JavaScript Object Notation, a lightweight data-interchange format, used here for the structured output of scraped data.
- Lease Terms: Conditions of the lease such as duration and mileage allowance.
- Private Lease vs Operational Lease: Private lease is for individual consumers; operational (or business) lease is for companies. Our context includes both if available, but primarily focused on consumer/private lease deals as indicated by the sites.
- Occasion: A Dutch term for a used car. "Private Lease occasion" refers to leasing a used car (usually at a lower price or shorter availability). Our scrapers will note whether a listing is an occasion or new.
- KM: Kilometer, used for mileage (distance).
- RWD: Rear-Wheel Drive (an example of a variant detail in a model name, relevant to the variant field).