PRD Web Scraping

Module: Structured Data Output

Technical Specification for the Scraper Modules

Overview

This module consolidates and delivers the final dataset from the scraping pipeline. It ensures that all cleaned and matched listings share one consistent, platform-consumable structure, delivered either as a JSON export or through an API. AutoCompare can then ingest the lease car data into its frontend comparison tool and other downstream services.
Because each listing carries metadata and the API supports filter queries (e.g., brand, fuel type), the module serves both static exports and dynamic consumption via RESTful interfaces.

Technical Components

Component | Technology / Tool | Purpose
Output Formatter | Python (dict → JSON) | Converts cleaned objects into a JSON-serializable format
API Server (optional) | FastAPI / Flask | Exposes an endpoint for on-demand access to structured data
Schema Validator | Pydantic / JSON Schema | Validates the output format before writing or serving
Metadata Inserter | Python datetime / hash utils | Adds timestamps and site metadata to each listing
File Writer / Uploader | Python / boto3 / Azure SDK | Saves the final file to S3 / Azure Blob or a local directory
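
As a rough illustration of the Output Formatter and Metadata Inserter rows, the sketch below attaches scrape metadata to a cleaned listing and serializes the result. The function names and the content_hash field are hypothetical; source_site and scrape_timestamp follow the output schema in this document.

```python
from __future__ import annotations

import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def add_metadata(listing: dict, source_site: str,
                 scraped_at: Optional[datetime] = None) -> dict:
    """Return a copy of `listing` with scrape metadata attached."""
    scraped_at = scraped_at or datetime.now(timezone.utc)
    record = dict(listing)  # shallow copy so the caller's dict is untouched
    # Hypothetical field: hash of the listing content, computed before the
    # metadata is added so it stays stable across scrape runs.
    payload = json.dumps(listing, sort_keys=True, ensure_ascii=False)
    record["content_hash"] = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    record["source_site"] = source_site
    # ISO 8601 timestamp in UTC with a trailing "Z"
    record["scrape_timestamp"] = scraped_at.astimezone(timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%SZ"
    )
    return record


def to_json(records: list[dict]) -> str:
    """Serialize the listings as a JSON array of objects."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

Hashing the listing before the timestamp is added means two scrapes of an unchanged listing produce the same hash, which could help detect changes between runs.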

Input/Output Specification

Type | Format | Description
Input | JSON (cleaned, matched) | Validated dataset with unified fields from multiple sources
Output | JSON (export) / REST API | Final data exposed for front-end ingestion or integration
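
A minimal sketch of the JSON export path for local development, using only the standard library. The function name and the default file name leaseoffers.json are assumptions; uploading to S3 or Azure Blob is not shown.

```python
from __future__ import annotations

import json
import os
import tempfile
from pathlib import Path


def write_export(records: list[dict], out_dir: str | Path,
                 filename: str = "leaseoffers.json") -> Path:
    """Write the listings to `out_dir/filename` as a JSON array."""
    out_path = Path(out_dir) / filename
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it into
    # place, so readers never see a half-written export.
    fd, tmp_name = tempfile.mkstemp(dir=out_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
    os.replace(tmp_name, out_path)
    return out_path
```

The write-then-rename step is a design choice for consumers that poll the file: they either read the previous export or the new one, never a partial file.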

JSON Output Schema (Required Fields)

Each JSON object in the array must contain the following fields, except those marked optional:
Field | Type | Description
make | string | Car brand (e.g., "Volkswagen")
model | string | Car model (e.g., "Golf")
variant | string | Version or trim (optional)
price_per_month | number | Lease cost per month (in euros)
lease_duration | number | Duration in months
mileage_per_year | number | Included mileage in km/year
fuel_type | string | e.g., "Electric", "Gasoline"
transmission | string | "Automatic" or "Manual"
body_type | string | e.g., "SUV", "Hatchback"
car_condition | string | "New" or "Used"
engine_specs | string | Engine or battery detail (optional)
image_url | string | Link to the listing image
listing_url | string | Direct link to the listing on the source site
provider | string | Name of the leasing company
source_site | string | Website the data was scraped from
scrape_timestamp | string | When the data was scraped (e.g., "2025-04-07T02:30:00Z")
Notes:
The actual set of fields depends on what each site exposes. The list above covers the common fields; if a site provides additional specs (e.g., color, horsepower, CO₂ emissions), those can be included as well, possibly under an additional nested structure such as specs.
listing_id is optional and only useful if a unique ID can be extracted from the site (not all sites expose an obvious ID, but some do in their HTML or URLs).
All string fields are normalized to a consistent format (e.g., capitalization and wording). Numeric fields are stored as numbers (no currency symbols or units).
scrape_timestamp indicates how current the data is. If historical entries are kept, it can be used to select the latest ones.
The schema is represented in JSON as an array of objects. For instance, a simplified output with two entries could look like:
```json
[
  {
    "make": "Tesla",
    "model": "Model 3",
    "variant": "Long Range RWD",
    "year": 2024,
    "fuel_type": "Electric",
    "transmission": "Automatic",
    "body_type": "Sedan",
    "price_per_month": 499,
    "lease_duration": 60,
    "mileage_per_year": 15000,
    "provider": "JustLease",
    "listing_url": "https://justlease.nl/tesla-model-3-long-range",
    "image_url": "https://justlease.nl/images/tesla-model3.jpg",
    "scrape_timestamp": "2025-04-07T02:30:00Z",
    "group_id": "TESLA_MODEL_3"
  },
  {
    "make": "Tesla",
    "model": "Model 3",
    "variant": "Long Range RWD",
    "year": 2024,
    "fuel_type": "Electric",
    "transmission": "Automatic",
    "body_type": "Sedan",
    "price_per_month": 510,
    "lease_duration": 48,
    "mileage_per_year": 20000,
    "provider": "DirectLease",
    "listing_url": "https://directlease.nl/private-lease/tesla-model-3",
    "image_url": "https://directlease.nl/img/tesla-model3.png",
    "scrape_timestamp": "2025-04-07T02:31:00Z",
    "group_id": "TESLA_MODEL_3"
  }
]
```
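
To make the required-fields list concrete, here is a dependency-free sketch of a per-listing check. It is a plain-Python stand-in for the Pydantic or JSON Schema validator named in this spec, not that validator itself. The field types are inferred from the example values and notes in this section, and the function and constant names are hypothetical.

```python
from __future__ import annotations

# Field name -> accepted Python type(s), following the required-fields list.
REQUIRED_FIELDS = {
    "make": str,
    "model": str,
    "price_per_month": (int, float),
    "lease_duration": int,
    "mileage_per_year": int,
    "fuel_type": str,
    "transmission": str,
    "body_type": str,
    "car_condition": str,
    "image_url": str,
    "listing_url": str,
    "provider": str,
    "source_site": str,
    "scrape_timestamp": str,
}
OPTIONAL_FIELDS = {"variant": str, "engine_specs": str}
# Fields whose description lists a closed set of values.
ALLOWED_VALUES = {
    "transmission": {"Automatic", "Manual"},
    "car_condition": {"New", "Used"},
}


def validate_listing(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        value = record.get(name)
        if value is None:
            errors.append(f"{name}: missing")
        elif isinstance(value, bool) or not isinstance(value, expected):
            # bool is rejected explicitly because it is a subclass of int
            errors.append(f"{name}: wrong type")
    for name, expected in OPTIONAL_FIELDS.items():
        value = record.get(name)
        if value is not None and not isinstance(value, expected):
            errors.append(f"{name}: wrong type")
    for name, allowed in ALLOWED_VALUES.items():
        value = record.get(name)
        if isinstance(value, str) and value not in allowed:
            errors.append(f"{name}: unexpected value {value!r}")
    return errors
```

Extra keys such as year or group_id are deliberately ignored here, since the notes allow site-specific fields beyond the common set.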


API Access Specification (Optional)

Endpoint | Method | Description
/api/leaseoffers | GET | Returns the full dataset
/api/leaseoffers | GET + query | Filters results by brand, fuel_type, etc.
/api/metadata | GET | Returns schema version and last scrape time
Supported Filters:
?brand=Tesla (matched against the make field)
?fuel_type=Electric
?price_lte=500 (price_per_month less than or equal to 500)
?body_type=SUV
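
The filters above could be applied to the in-memory dataset roughly as follows. This is a framework-independent sketch: filter_offers is a hypothetical helper, and it assumes that brand maps to the make field, that string comparisons are case-insensitive, and that price_lte compares against price_per_month. In a FastAPI or Flask handler the same logic would run on the parsed query parameters.

```python
from __future__ import annotations

from urllib.parse import parse_qs


def filter_offers(offers: list[dict], query_string: str) -> list[dict]:
    """Apply the supported filters from a raw query string."""
    # parse_qs returns lists of values; keep the last value given per key.
    params = {k: v[-1] for k, v in parse_qs(query_string.lstrip("?")).items()}
    result = offers

    # Query parameter -> listing field (assumption: brand matches make).
    exact_filters = {"brand": "make", "fuel_type": "fuel_type", "body_type": "body_type"}
    for param, field in exact_filters.items():
        if param in params:
            wanted = params[param].lower()
            result = [o for o in result if str(o.get(field) or "").lower() == wanted]

    if "price_lte" in params:
        limit = float(params["price_lte"])  # raises ValueError on bad input
        result = [
            o for o in result
            if o.get("price_per_month") is not None and o["price_per_month"] <= limit
        ]
    return result
```

For example, filter_offers(offers, "?brand=Tesla&price_lte=500") keeps only Tesla listings priced at 500 or less per month. Filters combine with AND semantics, which is also an assumption.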

Validation & Quality Checks

Schema conformity is validated with a Pydantic model.
Listings with null or missing critical fields (e.g., price_per_month, make) are excluded before export.
Metadata (timestamp, site ID) is injected at write time.
Optionally, a record count per brand/provider is included in a footer.
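
The exclusion and count steps above might look like the sketch below. The function names are hypothetical, and the set of critical fields is an assumption: the text names price and make as examples, and model is added here for illustration. How the counts are attached to the export (a footer object, a separate file, or the metadata endpoint) is left open.

```python
from __future__ import annotations

from collections import Counter

# Assumed set of critical fields; adjust to whatever the pipeline requires.
CRITICAL_FIELDS = ("make", "model", "price_per_month")


def drop_incomplete(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into (kept, dropped) based on the critical fields."""
    kept, dropped = [], []
    for record in records:
        # A field counts as missing if it is absent, None, or an empty string.
        if any(record.get(f) in (None, "") for f in CRITICAL_FIELDS):
            dropped.append(record)
        else:
            kept.append(record)
    return kept, dropped


def summarize(records: list[dict]) -> dict:
    """Record counts per brand and per provider for the exported listings."""
    return {
        "record_count": len(records),
        "by_make": dict(Counter(r["make"] for r in records)),
        "by_provider": dict(Counter(r.get("provider", "unknown") for r in records)),
    }
```

Returning the dropped records alongside the kept ones makes it easy to log what was excluded and why.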

Access Control & Storage

Export location:
- Local (during development)
- AWS S3 / Azure Blob for production
If using the API:
- Basic Auth or API key required
- Rate limits configurable via middleware
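
As one possible shape for the API key check and the rate limit, the sketch below uses only the standard library. The names api_key_is_valid and FixedWindowLimiter are hypothetical, and a fixed-window counter is just one of several common rate-limiting strategies; a real deployment would more likely rely on middleware from the chosen web framework.

```python
from __future__ import annotations

import hmac
import time


def api_key_is_valid(presented: str, expected: str) -> bool:
    """Compare keys in constant time to avoid leaking information via timing."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each time window."""

    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._windows = {}  # client id -> (window start, request count)

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        start, count = self._windows.get(client_id, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # window expired: start a new one
        if count >= self.limit:
            self._windows[client_id] = (start, count)
            return False
        self._windows[client_id] = (start, count + 1)
        return True
```

A request handler would call api_key_is_valid first and then limiter.allow(client_id), returning an HTTP 401 or 429 response respectively when a check fails.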

Extensibility & Integration

Front-end consumers (AutoCompare site) can:
- Pull the latest JSON from cloud storage
- Fetch from the /api/leaseoffers endpoint
Compatible with:
- Power BI / Google Data Studio via connector
- Static site generation tools or backend ingestion


