Overview
This module processes raw, unstructured scraped data and converts it into a standardized, platform-ready format. It ensures schema consistency, cross-source unification, and data integrity across all lease offers before the data is pushed to storage or exposed via API.
Given the large dataset, the high variability in text, and the semantic differences in car listings (such as "Long Range" vs. "LR" or "VW" vs. "Volkswagen"), the most suitable method is an embedding-based matching approach using SBERT or a similar transformer model.
This approach allows the platform to:
- Go beyond string similarity, capturing the true semantic meaning of model names and trims.
- Reduce manual rule maintenance as new data or formats emerge.
- Provide a scalable, future-ready foundation for semantic search, advanced filtering, and intelligent grouping in the comparison UI.

Technical Component
Data Flow
1. Input: Raw scraped data in JSON format.
2. Data Processor: Cleans and formats the data (removes symbols, standardizes names).
3. Dictionary Mapping: Maps non-standard terms (like abbreviations) to canonical values (see the sketch below).
4. Semantic Matcher: Applies SBERT or similar embeddings to ensure semantic equivalence.
5. Field Validator: Ensures data meets the predefined schema.
6. Output: Validated and cleaned JSON dataset for storage or API exposure.
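A minimal sketch of the cleanup and dictionary-mapping steps (steps 2–3 above); the ABBREVIATIONS table here is a hypothetical stand-in for the maintained mapping dictionary:

python

import re

# Hypothetical stand-in for the maintained mapping dictionary.
ABBREVIATIONS = {"vw": "Volkswagen", "lr": "Long Range", "ev": "Electric Vehicle"}

def clean_and_map(raw_name: str) -> str:
    """Strip stray symbols, collapse whitespace, and expand known abbreviations."""
    cleaned = re.sub(r"[^\w\s.\-]", " ", raw_name)   # drop symbols, keep dots/hyphens
    cleaned = re.sub(r"\s+", " ", cleaned).strip()   # normalize whitespace
    tokens = [ABBREVIATIONS.get(t.lower(), t) for t in cleaned.split()]
    return " ".join(tokens)

print(clean_and_map("VW  ID.4 LR!"))  # -> "Volkswagen ID.4 Long Range"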
Input/Output Specification

Input Field Cleanup Rules
Semantic Matching Using SBERT
To resolve semantic differences in car listings, SBERT (or other transformer models) is used to convert model names, trims, and other key attributes into embeddings. These embeddings are compared to determine semantic similarity, even when text differs in form (e.g., "Long Range" vs. "LR").
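As a quick illustration (a sketch, not the full matcher), two surface forms can be compared directly via the cosine similarity of their embeddings. Note that how close an abbreviation like "LR" lands to "Long Range" depends on the chosen model, which is why the dictionary mapping still backs up hard cases:

python

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed both surface forms and compare them directly.
emb = model.encode(["Long Range", "LR"], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()
print(f"cosine similarity: {score:.2f}")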
The key features of this approach include:
- Capturing true semantic meaning of terms (e.g., recognizing that "EV" means "Electric Vehicle").
- Reducing manual maintenance of rules as new synonyms or variations appear.
- Supporting scalable, advanced search features, such as semantic search and intelligent grouping in the UI.

Validation & Error Handling
Validation Behavior
- Missing Critical Fields: Records missing critical fields (e.g., price or make) are flagged and rejected from the output dataset.
- Incorrect Formats: Records with unrecognized formats (e.g., non-numeric prices) are logged for review.
- Schema Enforcement: Pydantic or Cerberus ensures all fields match the predefined schema before data is pushed to the next processing stage (see the sketch below).
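A minimal Pydantic (v2-style) sketch of the schema-enforcement step; the field names and types are illustrative assumptions, not the final schema:

python

from pydantic import BaseModel, ValidationError, field_validator

class LeaseOffer(BaseModel):
    # Illustrative subset of the schema.
    make: str
    model: str
    monthly_price: float

    @field_validator("monthly_price")
    @classmethod
    def price_must_be_positive(cls, v: float) -> float:
        if v <= 0:
            raise ValueError("price must be positive")
        return v

try:
    offer = LeaseOffer(make="Volkswagen", model="ID.4", monthly_price="499")  # "499" coerces to 499.0
except ValidationError as exc:
    print(exc)  # the rejection reason feeds the logging described below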
Logging Behavior

- Failed Record Logs: All rejected records, along with the reason for rejection, are logged for transparency.
- Success Metrics: The number of successfully processed records, normalization failures, and average processing time are logged.
- Cloud Integration: Logs are saved to a cloud storage solution (e.g., AWS S3, Azure Log Analytics) for further auditing and analysis.
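A minimal sketch of the rejection logging, assuming JSON-lines records that a separate job ships to the cloud store; the field names are illustrative:

python

import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("normalizer")

def log_rejection(record: dict, reason: str) -> None:
    """Emit one JSON line per rejected record; a separate job ships these to S3."""
    logger.info(json.dumps({
        "event": "record_rejected",
        "reason": reason,
        "source_url": record.get("source_url"),  # illustrative field
    }))

log_rejection({"source_url": "https://example.com/offer/1"}, "missing critical field: make")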
Simulation: Web Scraping Platform – SBERT Matching Logic

Problem:
After scraping data from various car leasing websites like JustLease.nl, DirectLease.nl, and 123Lease.nl, the platform receives inconsistent naming for the same car models. For example:
"Volkswagen ID.4", "VW ID4", and "Volkswagen ID4 Pro Performance" These entries all refer to the same vehicle, but they differ in wording. Using basic string matching or fuzzy logic often fails due to these variations.
Solution: SBERT for Semantic Matching
How SBERT Works in This Context:
1. Sentence Embedding:
All scraped texts (e.g., car model names, trims, etc.) are converted into dense vector representations using SBERT (Sentence-BERT).
Example:
python

from sentence_transformers import SentenceTransformer

# Load a compact general-purpose sentence-embedding model.
model = SentenceTransformer('all-MiniLM-L6-v2')

texts = ["Volkswagen ID.4", "VW ID4", "Volkswagen ID4 Pro Performance"]

# Encode all scraped names into dense vectors.
embeddings = model.encode(texts)
Each string becomes a numerical vector that captures its semantic meaning.
2. Similarity Calculation:
The system calculates cosine similarity between these vectors to determine how semantically close the entries are.
python

from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarities between all embeddings from the previous step.
similarity_matrix = cosine_similarity(embeddings)
If the similarity score exceeds 0.85, the entries are treated as referring to the same vehicle (the threshold is tunable).
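Continuing the snippet above, a minimal sketch of applying that threshold:

python

import itertools

THRESHOLD = 0.85  # from the rule above; tune per dataset

# Index pairs whose embeddings are close enough to be merged.
matches = [
    (i, j)
    for i, j in itertools.combinations(range(len(texts)), 2)
    if similarity_matrix[i, j] > THRESHOLD
]
print(matches)  # e.g. [(0, 1), (0, 2), (1, 2)] when all three strings match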
3. Clustering & Normalization:
Similar entries are grouped using clustering methods such as DBSCAN (or nearest-neighbour search with Faiss). Each cluster is then mapped to a single, unified model label, as in the sketch and example output below.
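A minimal DBSCAN sketch over the embeddings from the running example; eps=0.15 mirrors the 0.85 similarity rule (cosine distance = 1 − similarity), and the representative-label choice is a placeholder assumption:

python

from collections import defaultdict
from sklearn.cluster import DBSCAN

# Cosine distance = 1 - cosine similarity, so eps=0.15 ~ similarity > 0.85.
labels = DBSCAN(eps=0.15, min_samples=1, metric="cosine").fit_predict(embeddings)

clusters = defaultdict(list)
for text, label in zip(texts, labels):
    clusters[label].append(text)

for label, members in sorted(clusters.items()):
    unified = members[0]  # placeholder; the pipeline maps clusters to canonical labels
    print(f"cluster_{label:02d}: {members} -> {unified}")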
json

{
  "cluster_01": {
    "raw_texts": ["Volkswagen ID.4", "VW ID4", "Volkswagen ID4 Pro Performance"],
    "unified_model": "Volkswagen ID.4"
  }
}
Benefits of Using SBERT:
- Semantic Awareness: Understands meaning beyond exact words or spelling.
- Low Maintenance: No need for manually written rules or regex for every variation.
- Scalable: Works efficiently with thousands of car listings from multiple websites.

System Flow Diagram:
[Raw Scraped Text]
|
v
[SBERT Embedding]
|
v
[Cosine Similarity Matrix]
|
v
[Clustering & Labeling]
|
v
[Unified Entity Output]