PRD Web Scraping

Module: Data Normalization & Cleaning Pipeline

Technical Specification for the Scraper Modules

Overview

This module processes raw, unstructured scraped data and converts it into a standardized, platform-ready format. It ensures schema consistency, cross-source unification, and data integrity across all lease offers before the data is pushed to storage or exposed via API.

Technical Components

| Component | Technology | Purpose |
| --- | --- | --- |
| Data Processor Script | Python | Core cleaning logic; executed after the scraper finishes |
| Schema Definition | JSON Schema / Python dict | Defines structure and constraints for normalized data |
| Data Validator | Cerberus / Pydantic | Enforces schema adherence and type checking |
| Dictionary Mapping | YAML / Python dict | Converts non-standard terms into canonical values |
| Log Handler | Python logging | Tracks anomalies, invalid formats, and rejected records |

Input/Output Specification

| Type | Format | Description |
| --- | --- | --- |
| Input | JSON (raw) | One JSON file per source, unstructured |
| Output | JSON (cleaned) | Single, unified JSON dataset for all sites (see the example record below) |
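For illustration, a single normalized record in the unified output could look like the sketch below. The field names follow the cleanup rules in the next section; the values are invented, and the source key is an assumed provenance field, not mandated by this spec.

```python
# Hypothetical cleaned record in the unified output dataset (all values invented)
example_offer = {
    "source": "example-provider.nl",  # assumed provenance field, not part of this spec
    "make": "Kia",
    "model": "Niro",
    "price_per_month": 499,           # integer, euros
    "lease_duration": 60,             # months
    "mileage_per_year": 10000,
    "fuel_type": "Electric",
    "transmission": "Automatic",
    "body_type": "SUV",
    "condition": "New",
    "engine_specs": "64 kWh battery",
}
```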

Input Field Cleanup Rules

| Field Name | Normalization Rules |
| --- | --- |
| price_per_month | Remove currency symbols (€), thousand separators, and whitespace; convert to integer. |
| lease_duration | Extract the numeric value in months only; clean "60 maanden", "5 jaar", etc. |
| mileage_per_year | Remove "km" and separators; parse "10.000 km/jr" → 10000. |
| make / model | Title case; remove excess whitespace and special characters. |
| fuel_type | Map "EV", "Electrisch", "Electric" → "Electric" (via dictionary). |
| transmission | Normalize to "Automatic" or "Manual" only. |
| body_type | Map variants such as "SUV", "Crossover SUV" → "SUV". |
| condition | Standardize to "New" or "Used". |
| engine_specs | Parse numbers where possible (e.g., "82 kWh battery"). |
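The rules above can be implemented as small, single-purpose helpers. A minimal sketch, assuming the raw values arrive as strings; the function names are illustrative, not part of this spec:

```python
import re

def clean_price_per_month(raw: str) -> int:
    """Remove currency symbols, thousand separators, and whitespace; return an integer."""
    digits = re.sub(r"[^\d]", "", raw.split(",")[0])  # "€ 1.299,-" -> "1299"
    if not digits:
        raise ValueError(f"Cannot parse price: {raw!r}")
    return int(digits)

def clean_mileage_per_year(raw: str) -> int:
    """Strip 'km', '/jr', and separators; '10.000 km/jr' -> 10000."""
    digits = re.sub(r"[^\d]", "", raw)
    if not digits:
        raise ValueError(f"Cannot parse mileage: {raw!r}")
    return int(digits)

def clean_lease_duration(raw: str) -> int:
    """Return the duration in months; accepts '60 maanden' as well as '5 jaar'."""
    match = re.search(r"\d+", raw)
    if not match:
        raise ValueError(f"Cannot parse lease duration: {raw!r}")
    value = int(match.group())
    return value * 12 if "jaar" in raw.lower() else value
```

Each helper raises ValueError on unrecognized input so the caller can log the full field path and invalid value, in line with the error-handling rules below.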

Canonical Dictionary Mapping

Sample mappings used for field unification:
```python
FUEL_TYPE_MAP = {
    "EV": "Electric",
    "Electrisch": "Electric",
    "Benzine": "Gasoline",
    "Hybride": "Hybrid",
}

BODY_TYPE_MAP = {
    "SUV": "SUV",
    "Crossover SUV": "SUV",
    "Sedan 4-deurs": "Sedan",
    "Hatchback 5dr": "Hatchback",
}
```

This mapping structure is extensible through external YAML or JSON files and supports dynamic reloading.
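A minimal loader sketch, assuming the dictionaries live in a mappings.yaml file with top-level fuel_type and body_type keys; the file name, layout, and the use of PyYAML are assumptions. Re-running the loader (on a schedule or a file-change event) is what enables reloading without code changes:

```python
from pathlib import Path

import yaml  # PyYAML

def load_mappings(path: str = "mappings.yaml") -> dict[str, dict[str, str]]:
    """Read the canonical dictionaries from an external YAML file."""
    with Path(path).open(encoding="utf-8") as fh:
        return yaml.safe_load(fh)

# Call again at any time to pick up newly added mappings.
MAPPINGS = load_mappings()
FUEL_TYPE_MAP = MAPPINGS["fuel_type"]
BODY_TYPE_MAP = MAPPINGS["body_type"]
```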

Validation & Error Handling

- Type Validation: Use Pydantic or Cerberus to enforce correct field types.
- Missing Field Handling:
  - If critical fields (e.g., make, price) are missing → reject the record and log the reason.
  - If optional fields (e.g., engine_specs) are missing → retain as null.
- Value Cleaning: If normalization fails (e.g., unrecognized units) → log an entry with the full field path and invalid value.
- Schema Enforcement: Use a pre-defined schema to reject malformed or unexpected structures (a Pydantic sketch follows).
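If Pydantic is the chosen validator, the schema could take the shape sketched below; the field set mirrors the cleanup table above, while the model name and the exact critical/optional split are assumptions:

```python
import logging
from typing import Optional

from pydantic import BaseModel, ValidationError

class LeaseOffer(BaseModel):
    # Critical fields: a missing value raises ValidationError and the record is rejected.
    make: str
    model: str
    price_per_month: int
    # Optional fields are retained as null (None) when absent.
    lease_duration: Optional[int] = None
    mileage_per_year: Optional[int] = None
    fuel_type: Optional[str] = None
    transmission: Optional[str] = None
    body_type: Optional[str] = None
    condition: Optional[str] = None
    engine_specs: Optional[str] = None

def validate_record(raw: dict) -> Optional[LeaseOffer]:
    """Return a validated offer, or None after logging the rejection reason."""
    try:
        return LeaseOffer(**raw)
    except ValidationError as exc:
        logging.warning("Rejected record: %s", exc)
        return None
```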

Data Flow

1. Input: Raw scraped JSON files from multiple scraper modules.
2. Parser: Cleans and strips unwanted symbols from all fields.
3. Mapper: Converts non-standard terminology into unified values.
4. Validator: Ensures each entry matches the schema.
5. Output: Validated, cleaned dataset for downstream processing.
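A glue sketch of this flow, reusing the canonical maps and the validate_record helper sketched earlier; the directory layout, file naming, and helper names are assumptions:

```python
import json
import logging
from pathlib import Path

def apply_mappings(record: dict) -> dict:
    """Mapper step: convert non-standard terms to canonical values."""
    record["fuel_type"] = FUEL_TYPE_MAP.get(record.get("fuel_type"), record.get("fuel_type"))
    record["body_type"] = BODY_TYPE_MAP.get(record.get("body_type"), record.get("body_type"))
    return record

def run_pipeline(input_dir: str, output_path: str) -> None:
    """Parse, map, and validate every raw record, then write one unified dataset."""
    cleaned, rejected = [], []
    for source_file in Path(input_dir).glob("*.json"):
        for raw in json.loads(source_file.read_text(encoding="utf-8")):
            # Parser step: basic whitespace cleanup (field-specific helpers plug in here).
            record = {k: v.strip() if isinstance(v, str) else v for k, v in raw.items()}
            record = apply_mappings(record)
            if validate_record(record) is not None:  # Validator step (see sketch above)
                cleaned.append(record)
            else:
                rejected.append(record)
    Path(output_path).write_text(
        json.dumps(cleaned, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    logging.info("Processed %d records; rejected %d", len(cleaned) + len(rejected), len(rejected))
```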

Monitoring & Reporting

- Log Reports:
  - Total records processed
  - Number of records rejected
  - Most common normalization failures
- Anomaly Detection: field value deviation (e.g., a lease price over €10,000/month triggers a warning).
- Optional: store rejected records in a separate file (rejected_<timestamp>.json) for review.
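A reporting sketch along these lines, assuming the cleaned and rejected lists produced by the pipeline above; the €10,000/month threshold comes from the anomaly rule, and the remaining names are illustrative:

```python
import json
import logging
from datetime import datetime, timezone

PRICE_WARNING_THRESHOLD = 10_000  # €/month, per the anomaly-detection rule above

def report(cleaned: list[dict], rejected: list[dict]) -> None:
    """Log run totals, flag price anomalies, and persist rejected records for review."""
    logging.info("Total records processed: %d", len(cleaned) + len(rejected))
    logging.info("Records rejected: %d", len(rejected))
    for record in cleaned:
        if record.get("price_per_month", 0) > PRICE_WARNING_THRESHOLD:
            logging.warning("Anomalous lease price: %s", record["price_per_month"])
    if rejected:
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        with open(f"rejected_{stamp}.json", "w", encoding="utf-8") as fh:
            json.dump(rejected, fh, ensure_ascii=False, indent=2)
```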

Extensibility Strategy

- New fields: The schema is modular; new keys can be added with rules in a config file.
- Multilingual support: Mappings can be extended for other countries if the platform expands.
- Custom transformer hooks: Each field normalization can use plugin logic for advanced use cases (e.g., NLP to parse "variant"); see the sketch below.
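One possible shape for the hook mechanism is a per-field registry, sketched below; the decorator name and the engine_specs example hook are illustrative, not part of this spec:

```python
import re
from typing import Callable, Dict

# Registry of per-field transformer hooks; fields without a hook pass through unchanged.
TRANSFORMERS: Dict[str, Callable[[str], object]] = {}

def transformer(field_name: str):
    """Decorator that registers a custom transformer for a single field."""
    def register(func: Callable[[str], object]) -> Callable[[str], object]:
        TRANSFORMERS[field_name] = func
        return func
    return register

@transformer("engine_specs")
def parse_engine_specs(raw: str) -> dict:
    # Example hook: pull the battery capacity out of strings like "82 kWh battery".
    match = re.search(r"(\d+(?:\.\d+)?)\s*kWh", raw)
    return {"battery_kwh": float(match.group(1))} if match else {"raw": raw}
```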

