PRD Web Scraping

Module: Cross-Platform Matching Engine

Technical Specification for the Scraper Modules

Overview

This module processes raw, unstructured scraped data and converts it into a standardized, platform-ready format. It ensures schema consistency, cross-source unification, and data integrity across all lease offers before the data is pushed to storage or exposed via API.
Given the need to handle a large dataset, high variability in text, and semantic differences in car listings (such as "Long Range" vs. "LR" or "VW" vs. "Volkswagen"), the most suitable method is an embedding-based matching approach using SBERT or similar transformer models.
This approach allows the platform to:
Go beyond string similarity, capturing true semantic meaning of model names and trims.
Reduce manual rule maintenance as new data or formats emerge.
Provide a scalable and future-ready foundation for semantic search, advanced filtering, and intelligent grouping in the comparison UI.

Technical Components

Component | Technology | Purpose
--------- | ---------- | -------
Data Processor Script | Python (Pandas, Regex) | Core logic for cleaning and normalizing scraped data.
Schema Definition | JSON Schema / Python Dict | Defines the data structure and validation rules for output data.
Field Validator | Cerberus / Pydantic | Validates field types and ensures schema conformance.
Dictionary Mapper | YAML / Python Dict | Maps non-standard terms to unified, canonical values.
Semantic Matcher | SBERT (Hugging Face Transformers) | Embeds text for semantic matching across model names and trims.
Logging & Auditing | Python logging, cloud storage (S3/CloudWatch) | Logs errors, missing fields, and metrics for debugging.
Config Loader | YAML / JSON | Modular configuration for field-specific rules and mappings.
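To illustrate how the Dictionary Mapper and Config Loader fit together, the sketch below loads a YAML mapping and applies it to a raw value. The mapping contents and function name are illustrative, not the module's actual configuration:

import yaml  # PyYAML

# Hypothetical mapping config; in production this would live in a
# versioned YAML file loaded by the Config Loader.
MAPPING_YAML = """
fuel_type:
  EV: Electric
  Electrisch: Electric
make:
  VW: Volkswagen
"""

mappings = yaml.safe_load(MAPPING_YAML)

def map_term(field: str, raw_value: str) -> str:
    """Return the canonical value for a raw term, or the input unchanged."""
    return mappings.get(field, {}).get(raw_value, raw_value)

print(map_term("fuel_type", "EV"))  # -> Electric
print(map_term("make", "VW"))       # -> Volkswagen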

Data Flow

Input: Raw scraped data in JSON format.
Data Processor: Cleans and formats the data (removes symbols, standardizes names).
Dictionary Mapping: Maps non-standard terms (like abbreviations) to canonical values.
Semantic Matcher: Applies SBERT or similar embeddings to ensure semantic equivalence.
Field Validator: Ensures data meets the predefined schema.
Output: Validated and cleaned JSON dataset for storage or API exposure.
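A minimal end-to-end sketch of this flow, with illustrative stand-ins for each stage (the SBERT matching step is covered in its own section below; the real module wires these stages from configuration rather than hard-coding them):

import json
import re

def clean_fields(raw: dict) -> dict:
    """Data Processor: strip symbols and standardize values."""
    out = dict(raw)
    digits = re.sub(r"[^\d]", "", str(raw.get("price_per_month", "")))
    out["price_per_month"] = int(digits) if digits else None
    return out

FUEL_MAP = {"EV": "Electric", "Electrisch": "Electric"}

def apply_dictionary(rec: dict) -> dict:
    """Dictionary Mapping: non-standard terms -> canonical values."""
    rec["fuel_type"] = FUEL_MAP.get(rec.get("fuel_type"), rec.get("fuel_type"))
    return rec

def validate(rec: dict) -> bool:
    """Field Validator: reject records missing critical fields."""
    return rec.get("price_per_month") is not None

raw = {"price_per_month": "€ 399,-", "fuel_type": "EV"}
rec = apply_dictionary(clean_fields(raw))
print(json.dumps(rec) if validate(rec) else "rejected")
# -> {"price_per_month": 399, "fuel_type": "Electric"}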

Input/Output Specification

Type | Format | Description
---- | ------ | -----------
Input | JSON (raw) | One JSON file per source, unstructured.
Output | JSON (cleaned) | Single, unified JSON dataset covering all sites.
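For illustration, a single raw record and its cleaned counterpart might look as follows; the values are invented, but the field names follow the cleanup rules in the next section:

Raw input (per source):
{
  "price_per_month": "€ 499,- p/m",
  "lease_duration": "60 maanden",
  "mileage_per_year": "10.000 km/jr",
  "make": "VW",
  "model": "id4 pro",
  "fuel_type": "Electrisch"
}

Cleaned output (unified):
{
  "price_per_month": 499,
  "lease_duration": 60,
  "mileage_per_year": 10000,
  "make": "Volkswagen",
  "model": "ID.4 Pro",
  "fuel_type": "Electric"
}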

Input Field Cleanup Rules

Field Name | Normalization Rules
---------- | -------------------
price_per_month | Remove currency symbols (€), thousand separators, and whitespace. Convert to integer.
lease_duration | Extract the duration and convert to months ("60 maanden" → 60, "5 jaar" → 60).
mileage_per_year | Remove "km" and thousand separators. Parse "10.000 km/jr" → 10000.
make / model | Title case. Remove excess whitespace and special characters.
fuel_type | Map "EV", "Electrisch", "Electric" → "Electric" (via dictionary).
transmission | Normalize to "Automatic" or "Manual" only.
body_type | Map variants like "SUV", "Crossover SUV" → "SUV".
condition | Standardize to "New" or "Used".
engine_specs | Parse numeric values where possible (e.g., "82 kWh battery" → 82).
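A minimal sketch of a few of these rules, using only standard-library regex; the exact patterns are illustrative and would be driven by the Config Loader in practice:

import re

def parse_price(raw: str) -> int | None:
    """'€ 1.299,-' -> 1299: strip currency symbols and separators."""
    digits = re.sub(r"[^\d]", "", raw)
    return int(digits) if digits else None

def parse_duration_months(raw: str) -> int | None:
    """'60 maanden' -> 60, '5 jaar' -> 60: normalize to months."""
    m = re.search(r"(\d+)\s*(maanden|mnd|jaar|jr)?", raw.lower())
    if not m:
        return None
    value = int(m.group(1))
    return value * 12 if m.group(2) in ("jaar", "jr") else value

def parse_mileage(raw: str) -> int | None:
    """'10.000 km/jr' -> 10000: drop the unit and thousand separator."""
    digits = re.sub(r"[^\d]", "", raw.split("km")[0])
    return int(digits) if digits else None

print(parse_price("€ 499,-"), parse_duration_months("5 jaar"), parse_mileage("10.000 km/jr"))
# -> 499 60 10000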


Semantic Matching Using SBERT

To resolve semantic differences in car listings, SBERT (or other transformer models) is used to convert model names, trims, and other key attributes into embeddings. These embeddings are compared to determine semantic similarity, even when text differs in form (e.g., "Long Range" vs. "LR").
The key features of this approach include:
Capturing true semantic meaning of terms (e.g., recognizing that "EV" means "Electric Vehicle").
Reducing manual maintenance of rules as new synonyms or variations appear.
Supporting scalable, advanced search features, such as semantic search and intelligent grouping in the UI.

Validation & Error Handling

Validation Behavior

Missing Critical Fields: Records whose critical fields (e.g., price or make) are missing or null are flagged and rejected from the output dataset.
Incorrect Formats: Records with unrecognized formats (e.g., non-numeric prices) are logged for review.
Schema Enforcement: Pydantic or Cerberus ensures all fields match the predefined schema before data is pushed to the next processing stage.
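A minimal Pydantic sketch of such a schema, assuming the field names from the cleanup rules above; the authoritative schema lives in the Schema Definition component:

from typing import Optional
from pydantic import BaseModel, ValidationError

class LeaseOffer(BaseModel):
    # Critical fields: a record missing any of these fails validation.
    price_per_month: int
    make: str
    model: str
    # Non-critical fields may be absent in some sources.
    lease_duration: Optional[int] = None
    mileage_per_year: Optional[int] = None
    fuel_type: Optional[str] = None

try:
    LeaseOffer(price_per_month="not a number", make="Volkswagen", model="ID.4")
except ValidationError as err:
    print(err)  # rejected: the non-numeric price is logged for review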

Logging Behavior

Failed Record Logs: All rejected records, along with the reason for rejection, are logged for transparency.
Success Metrics: The number of successfully processed records, normalization failures, and average processing time are logged.
Cloud Integration: Logs are saved to a cloud storage solution (e.g., AWS S3, Azure Log Analytics) for further auditing and analysis.
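A sketch of the local logging side using the standard logging module; the logger name, record ID, and metric values are placeholders, and shipping logs to S3 or CloudWatch is deployment-specific and omitted here:

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("matching_engine")  # hypothetical logger name

# Failed record log: record identifier plus the reason for rejection.
logger.warning("rejected record id=%s reason=%s", "justlease-00123", "missing price_per_month")

# Success metrics for a batch run (placeholder values).
logger.info("records_processed=%d normalization_failures=%d avg_ms=%.1f", 1000, 12, 3.4)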

Simulation: Web Scraping Platform – SBERT Matching Logic

Problem:

After scraping data from various car leasing websites like JustLease.nl, DirectLease.nl, and 123Lease.nl, the platform receives inconsistent naming for the same car models. For example:
"Volkswagen ID.4", "VW ID4", and "Volkswagen ID4 Pro Performance"
These entries all refer to the same vehicle, but they differ in wording.
Basic string matching and fuzzy logic often fail on these variations.

Solution: SBERT for Semantic Matching

How SBERT Works in This Context:

1. Sentence Embedding:

All scraped texts (car model names, trims, etc.) are converted into dense vector representations using SBERT (Sentence-BERT).
Example:
from sentence_transformers import SentenceTransformer

# Load a pretrained SBERT encoder; all-MiniLM-L6-v2 is a small,
# fast general-purpose model.
model = SentenceTransformer('all-MiniLM-L6-v2')

texts = ["Volkswagen ID.4", "VW ID4", "Volkswagen ID4 Pro Performance"]
embeddings = model.encode(texts)  # one dense vector per string

Each string becomes a numerical vector that captures its semantic meaning.

2. Similarity Calculation:

The system calculates cosine similarity between these vectors to determine how semantically close the entries are.
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between all embeddings (higher = more similar).
similarity_matrix = cosine_similarity(embeddings)

If similarity > 0.85, the entries are treated as referring to the same object.
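Continuing the example above, pairs over the threshold can be extracted like this (0.85 is the tunable cutoff mentioned above):

import numpy as np

THRESHOLD = 0.85
# Upper-triangle indices skip the diagonal (self-similarity) and duplicate pairs.
rows, cols = np.triu_indices(len(texts), k=1)
for a, b in zip(rows, cols):
    score = similarity_matrix[a, b]
    if score > THRESHOLD:
        print(f"match ({score:.2f}): {texts[a]!r} <-> {texts[b]!r}")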

3. Clustering & Normalization:

Similar entries are grouped using clustering methods such as DBSCAN (with a library like Faiss for fast similarity search at larger scale). Each cluster is then mapped to a single, unified model label.
{
  "cluster_01": {
    "raw_texts": ["Volkswagen ID.4", "VW ID4", "Volkswagen ID4 Pro Performance"],
    "unified_model": "Volkswagen ID.4"
  }
}
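A minimal clustering sketch using scikit-learn's DBSCAN over the embeddings from the earlier steps; eps and min_samples are illustrative values that need tuning against real data:

from sklearn.cluster import DBSCAN

# Cosine distance = 1 - cosine similarity, so eps=0.15 roughly mirrors
# the 0.85 similarity threshold used above.
clustering = DBSCAN(eps=0.15, min_samples=1, metric="cosine").fit(embeddings)

clusters = {}
for label, text in zip(clustering.labels_, texts):
    clusters.setdefault(int(label), []).append(text)

for label, members in sorted(clusters.items()):
    # Each cluster is then mapped to one canonical label, e.g. via a
    # curated model list; that mapping step is omitted in this sketch.
    print(f"cluster_{label:02d}: {members}")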

Benefits of Using SBERT:

Semantic Awareness: Understands meaning beyond exact words or spelling.
Low Maintenance: No need for manually written rules or regex for every variation.
Scalable: Works efficiently with thousands of car listings from multiple websites.

System Flow Diagram:


[Raw Scraped Text]
|
v
[SBERT Embedding]
|
v
[Cosine Similarity Matrix]
|
v
[Clustering & Labeling]
|
v
[Unified Entity Output]
