Overview
This module processes raw, unstructured scraped data and converts it into a standardized, platform-ready format. It ensures schema consistency, cross-source unification, and data integrity across all lease offers before the data is pushed to storage or exposed via API.
Technical Component
Input/Output Specification
Input Field Cleanup Rules
Canonical Dictionary Mapping
Sample mappings used for field unification:
FUEL_TYPE_MAP = {
    "EV": "Electric",
    "Electrisch": "Electric",
    "Benzine": "Gasoline",
    "Hybride": "Hybrid"
}

BODY_TYPE_MAP = {
    "SUV": "SUV",
    "Crossover SUV": "SUV",
    "Sedan 4-deurs": "Sedan",
    "Hatchback 5dr": "Hatchback"
}
This mapping structure is extensible through external YAML or JSON files and supports dynamic reloading.
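A minimal sketch of how externally stored mappings might be loaded and applied, assuming a JSON file keyed by field name. The class name MappingStore, its methods, the file layout, the case-insensitive lookup, and the modification-time check used for reloading are all illustrative assumptions, not part of the module's specified interface.

```python
import json
from pathlib import Path


class MappingStore:
    """Holds canonical mappings loaded from a JSON file (hypothetical helper)."""

    def __init__(self, path):
        self.path = Path(path)
        self._mtime = None
        self._maps = {}
        self.reload()

    def reload(self):
        """Re-read the file and rebuild case-insensitive lookup tables."""
        with self.path.open(encoding="utf-8") as fh:
            raw = json.load(fh)
        # Assumed layout: {"fuel_type": {"EV": "Electric", ...}, ...}
        self._maps = {
            field: {key.strip().lower(): value for key, value in table.items()}
            for field, table in raw.items()
        }
        self._mtime = self.path.stat().st_mtime_ns

    def reload_if_changed(self):
        """Reload only when the file's modification time has changed."""
        if self.path.stat().st_mtime_ns != self._mtime:
            self.reload()
            return True
        return False

    def normalize(self, field, value):
        """Return the canonical value, or None if the raw value is unknown."""
        return self._maps.get(field, {}).get(str(value).strip().lower())
```

With a file containing the fuel-type table above under a "fuel_type" key, MappingStore(path).normalize("fuel_type", "Electrisch") would return "Electric". Returning None for unknown values leaves the decision to log or reject with the caller.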
Validation & Error Handling
Type Validation: Use Pydantic or Cerberus to enforce correct field types.
If critical fields (e.g., make, price) are missing → reject record and log reason.
If optional fields (e.g., engine_specs) are missing → retain as null.
If normalization fails (e.g., unrecognized units) → log entry with full field path and invalid value.
Schema Enforcement: Use a pre-defined schema to reject malformed or unexpected structures.
Data Flow
Input: Raw scraped JSONs from multiple scraper modules.
Parser: Cleans and strips unwanted symbols from all fields.
Mapper: Converts non-standard terminology into unified values.
Validator: Ensures each entry matches the schema.
Output: Validated, cleaned dataset for downstream processing.
Monitoring & Reporting
Number of records rejected
Most common normalization failures
Field value deviation (e.g., lease price over €10,000/month triggers warning)
Optional: Store rejected records in a separate file (rejected_<timestamp>.json) for review.
Extensibility Strategy
New fields: Schema is modular; new keys can be added with rules in a config file.
Multilingual support: Mappings can be extended for other countries if platform expands.
Custom transformer hooks: Each field normalization can use plugin logic for advanced use cases (e.g., NLP to parse “variant”).
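The stages and rules described in the sections above (parse, map, validate, report) can be sketched end to end as follows. This is an illustration under stated assumptions rather than the module's actual implementation. It uses plain Python in place of Pydantic or Cerberus, and the fuel_type field name, the HOOKS registry, and all function names are hypothetical. It also assumes that a record whose normalization fails is rejected, which the specification leaves open. The 10,000 threshold is the example figure from Monitoring & Reporting.

```python
from collections import Counter

# Sample mapping from this document; storing the keys in lowercase is an assumption.
FUEL_TYPE_MAP = {"ev": "Electric", "electrisch": "Electric",
                 "benzine": "Gasoline", "hybride": "Hybrid"}

CRITICAL_FIELDS = ("make", "price")   # missing -> reject the record
OPTIONAL_FIELDS = ("engine_specs",)   # missing -> keep as None
PRICE_WARNING_EUR = 10_000            # monthly price above this triggers a warning


def clean_price(value):
    """Hypothetical hook: strip the euro sign and thousands commas, then convert."""
    return float(str(value).replace("€", "").replace(",", "").strip())


def map_fuel_type(value):
    """Hypothetical hook: map a raw fuel label to its canonical value."""
    return FUEL_TYPE_MAP[str(value).strip().lower()]


# Per-field transformer hooks; a new field would register its own callable here.
HOOKS = {"price": clean_price, "fuel_type": map_fuel_type}


def process(records):
    """Return (accepted, rejected, report) for a list of raw record dicts."""
    accepted, rejected, warnings = [], [], []
    failures = Counter()  # normalization failures counted per field
    for index, raw in enumerate(records):
        missing = [f for f in CRITICAL_FIELDS if raw.get(f) in (None, "")]
        if missing:
            rejected.append({"record": raw,
                             "reason": f"missing critical fields: {missing}"})
            continue
        record = dict(raw)
        for field in OPTIONAL_FIELDS:
            record.setdefault(field, None)
        error = None
        for field, hook in HOOKS.items():
            if record.get(field) is None:
                continue
            try:
                record[field] = hook(record[field])
            except (KeyError, ValueError):
                failures[field] += 1
                # Field path plus the invalid value, as the error rules require.
                error = f"normalization failed at [{index}].{field}: {record[field]!r}"
                break
        if error:
            rejected.append({"record": raw, "reason": error})  # assumed behavior
            continue
        if record["price"] > PRICE_WARNING_EUR:
            warnings.append(f"[{index}].price = {record['price']} exceeds {PRICE_WARNING_EUR}")
        accepted.append(record)
    report = {"rejected": len(rejected),
              "failures": failures.most_common(),
              "warnings": warnings}
    return accepted, rejected, report
```

Keeping the hooks in a dictionary means a new field needs only one extra entry, and the rejected list could be serialized to the rejected_<timestamp>.json file mentioned under Monitoring & Reporting.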