PRD Web Scraping

Module: Data Normalization & Cleaning Pipeline

Technical Specification for the Scraper Modules

Overview

This module processes raw, unstructured scraped data and converts it into a standardized, platform-ready format. It ensures schema consistency, cross-source unification, and data integrity across all lease offers before the data is pushed to storage or exposed via API.

Technical Components

| Component | Technology | Purpose |
| --- | --- | --- |
| Data Processor Script | Python | Core cleaning logic; executed after the scraper finishes |
| Schema Definition | JSON Schema / Python dict | Defines structure and constraints for normalized data |
| Data Validator | Cerberus / Pydantic | Enforces schema adherence and type checking |
| Dictionary Mapping | YAML / Python dict | Converts non-standard terms into canonical values |
| Log Handler | Python logging | Tracks anomalies, invalid formats, and rejected records |

Input/Output Specification

| Type | Format | Description |
| --- | --- | --- |
| Input | JSON (raw) | One JSON file per source, unstructured |
| Output | JSON (cleaned) | Single, unified JSON dataset for all sites (see the example record below) |
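For illustration, a single normalized record in the unified output could look like the sketch below. The field names follow the cleanup rules in the next section; the values are invented, and the source key is an assumed provenance field, not mandated by this spec.

```python
# Hypothetical cleaned record in the unified output dataset (all values invented)
example_offer = {
    "source": "example-provider.nl",  # assumed provenance field, not part of this spec
    "make": "Kia",
    "model": "Niro",
    "price_per_month": 499,           # integer, euros
    "lease_duration": 60,             # months
    "mileage_per_year": 10000,
    "fuel_type": "Electric",
    "transmission": "Automatic",
    "body_type": "SUV",
    "condition": "New",
    "engine_specs": "64 kWh battery",
}
```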

Input Field Cleanup Rules

| Field Name | Normalization Rules |
| --- | --- |
| price_per_month | Remove currency symbols (€), thousand separators, and whitespace; convert to integer. |
| lease_duration | Extract the numeric value in months only; clean "60 maanden", "5 jaar", etc. |
| mileage_per_year | Remove "km" and separators; parse "10.000 km/jr" → 10000. |
| make / model | Title case; remove excess whitespace and special characters. |
| fuel_type | Map "EV", "Electrisch", "Electric" → "Electric" (via dictionary). |
| transmission | Normalize to "Automatic" or "Manual" only. |
| body_type | Map variants such as "SUV", "Crossover SUV" → "SUV". |
| condition | Standardize to "New" or "Used". |
| engine_specs | Parse numbers where possible (e.g., "82 kWh battery"). |
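The rules above can be implemented as small, single-purpose helpers. A minimal sketch, assuming the raw values arrive as strings; the function names are illustrative, not part of this spec:

```python
import re

def clean_price_per_month(raw: str) -> int:
    """Remove currency symbols, thousand separators, and whitespace; return an integer."""
    digits = re.sub(r"[^\d]", "", raw.split(",")[0])  # "€ 1.299,-" -> "1299"
    if not digits:
        raise ValueError(f"Cannot parse price: {raw!r}")
    return int(digits)

def clean_mileage_per_year(raw: str) -> int:
    """Strip 'km', '/jr', and separators; '10.000 km/jr' -> 10000."""
    digits = re.sub(r"[^\d]", "", raw)
    if not digits:
        raise ValueError(f"Cannot parse mileage: {raw!r}")
    return int(digits)

def clean_lease_duration(raw: str) -> int:
    """Return the duration in months; accepts '60 maanden' as well as '5 jaar'."""
    match = re.search(r"\d+", raw)
    if not match:
        raise ValueError(f"Cannot parse lease duration: {raw!r}")
    value = int(match.group())
    return value * 12 if "jaar" in raw.lower() else value
```

Each helper raises ValueError on unrecognized input so the caller can log the full field path and invalid value, in line with the error-handling rules below.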

Canonical Dictionary Mapping

Sample mappings used for field unification:
```python
FUEL_TYPE_MAP = {
    "EV": "Electric",
    "Electrisch": "Electric",
    "Benzine": "Gasoline",
    "Hybride": "Hybrid",
}

BODY_TYPE_MAP = {
    "SUV": "SUV",
    "Crossover SUV": "SUV",
    "Sedan 4-deurs": "Sedan",
    "Hatchback 5dr": "Hatchback",
}
```

This mapping structure is extensible through external YAML or JSON files and supports dynamic reloading.
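A minimal loader sketch, assuming the dictionaries live in a mappings.yaml file with top-level fuel_type and body_type keys; the file name, layout, and the use of PyYAML are assumptions. Re-running the loader (on a schedule or a file-change event) is what enables reloading without code changes:

```python
from pathlib import Path

import yaml  # PyYAML

def load_mappings(path: str = "mappings.yaml") -> dict[str, dict[str, str]]:
    """Read the canonical dictionaries from an external YAML file."""
    with Path(path).open(encoding="utf-8") as fh:
        return yaml.safe_load(fh)

# Call again at any time to pick up newly added mappings.
MAPPINGS = load_mappings()
FUEL_TYPE_MAP = MAPPINGS["fuel_type"]
BODY_TYPE_MAP = MAPPINGS["body_type"]
```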

Validation & Error Handling

- Type Validation: Use Pydantic or Cerberus to enforce correct field types.
- Missing Field Handling:
  - If critical fields (e.g., make, price) are missing → reject the record and log the reason.
  - If optional fields (e.g., engine_specs) are missing → retain as null.
- Value Cleaning: If normalization fails (e.g., unrecognized units) → log an entry with the full field path and invalid value.
- Schema Enforcement: Use a pre-defined schema to reject malformed or unexpected structures (a Pydantic sketch follows).
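If Pydantic is the chosen validator, the schema could take the shape sketched below; the field set mirrors the cleanup table above, while the model name and the exact critical/optional split are assumptions:

```python
import logging
from typing import Optional

from pydantic import BaseModel, ValidationError

class LeaseOffer(BaseModel):
    # Critical fields: a missing value raises ValidationError and the record is rejected.
    make: str
    model: str
    price_per_month: int
    # Optional fields are retained as null (None) when absent.
    lease_duration: Optional[int] = None
    mileage_per_year: Optional[int] = None
    fuel_type: Optional[str] = None
    transmission: Optional[str] = None
    body_type: Optional[str] = None
    condition: Optional[str] = None
    engine_specs: Optional[str] = None

def validate_record(raw: dict) -> Optional[LeaseOffer]:
    """Return a validated offer, or None after logging the rejection reason."""
    try:
        return LeaseOffer(**raw)
    except ValidationError as exc:
        logging.warning("Rejected record: %s", exc)
        return None
```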

Data Flow

1. Input: Raw scraped JSON files from multiple scraper modules.
2. Parser: Cleans and strips unwanted symbols from all fields.
3. Mapper: Converts non-standard terminology into unified values.
4. Validator: Ensures each entry matches the schema.
5. Output: Validated, cleaned dataset for downstream processing.
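A glue sketch of this flow, reusing the canonical maps and the validate_record helper sketched earlier; the directory layout, file naming, and helper names are assumptions:

```python
import json
import logging
from pathlib import Path

def apply_mappings(record: dict) -> dict:
    """Mapper step: convert non-standard terms to canonical values."""
    record["fuel_type"] = FUEL_TYPE_MAP.get(record.get("fuel_type"), record.get("fuel_type"))
    record["body_type"] = BODY_TYPE_MAP.get(record.get("body_type"), record.get("body_type"))
    return record

def run_pipeline(input_dir: str, output_path: str) -> None:
    """Parse, map, and validate every raw record, then write one unified dataset."""
    cleaned, rejected = [], []
    for source_file in Path(input_dir).glob("*.json"):
        for raw in json.loads(source_file.read_text(encoding="utf-8")):
            # Parser step: basic whitespace cleanup (field-specific helpers plug in here).
            record = {k: v.strip() if isinstance(v, str) else v for k, v in raw.items()}
            record = apply_mappings(record)
            if validate_record(record) is not None:  # Validator step (see sketch above)
                cleaned.append(record)
            else:
                rejected.append(record)
    Path(output_path).write_text(
        json.dumps(cleaned, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    logging.info("Processed %d records; rejected %d", len(cleaned) + len(rejected), len(rejected))
```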

Monitoring & Reporting

- Log Reports:
  - Total records processed
  - Number of records rejected
  - Most common normalization failures
- Anomaly Detection: field value deviation (e.g., a lease price over €10,000/month triggers a warning).
- Optional: store rejected records in a separate file (rejected_<timestamp>.json) for review.
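A reporting sketch along these lines, assuming the cleaned and rejected lists produced by the pipeline above; the €10,000/month threshold comes from the anomaly rule, and the remaining names are illustrative:

```python
import json
import logging
from datetime import datetime, timezone

PRICE_WARNING_THRESHOLD = 10_000  # €/month, per the anomaly-detection rule above

def report(cleaned: list[dict], rejected: list[dict]) -> None:
    """Log run totals, flag price anomalies, and persist rejected records for review."""
    logging.info("Total records processed: %d", len(cleaned) + len(rejected))
    logging.info("Records rejected: %d", len(rejected))
    for record in cleaned:
        if record.get("price_per_month", 0) > PRICE_WARNING_THRESHOLD:
            logging.warning("Anomalous lease price: %s", record["price_per_month"])
    if rejected:
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        with open(f"rejected_{stamp}.json", "w", encoding="utf-8") as fh:
            json.dump(rejected, fh, ensure_ascii=False, indent=2)
```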

Extensibility Strategy

- New fields: The schema is modular; new keys can be added with rules in a config file.
- Multilingual support: Mappings can be extended for other countries if the platform expands.
- Custom transformer hooks: Each field normalization can use plugin logic for advanced use cases (e.g., NLP to parse "variant"); see the sketch below.
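One possible shape for the hook mechanism is a per-field registry, sketched below; the decorator name and the engine_specs example hook are illustrative, not part of this spec:

```python
import re
from typing import Callable, Dict

# Registry of per-field transformer hooks; fields without a hook pass through unchanged.
TRANSFORMERS: Dict[str, Callable[[str], object]] = {}

def transformer(field_name: str):
    """Decorator that registers a custom transformer for a single field."""
    def register(func: Callable[[str], object]) -> Callable[[str], object]:
        TRANSFORMERS[field_name] = func
        return func
    return register

@transformer("engine_specs")
def parse_engine_specs(raw: str) -> dict:
    # Example hook: pull the battery capacity out of strings like "82 kWh battery".
    match = re.search(r"(\d+(?:\.\d+)?)\s*kWh", raw)
    return {"battery_kwh": float(match.group(1))} if match else {"raw": raw}
```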

