Overview
This module processes raw, unstructured scraped data and converts it into a standardized, platform-ready format. It ensures schema consistency, cross-source unification, and data integrity across all lease offers before the data is pushed to storage or exposed via API.
Given the large dataset, the high variability in text, and the semantic differences in car listings (such as "Long Range" vs. "LR" or "VW" vs. "Volkswagen"), the most suitable method is an embedding-based matching approach using SBERT or a similar transformer model.
This approach allows the platform to:
- Go beyond string similarity, capturing the true semantic meaning of model names and trims.
- Reduce manual rule maintenance as new data or formats emerge.
- Provide a scalable, future-ready foundation for semantic search, advanced filtering, and intelligent grouping in the comparison UI.

Technical Component
Data Flow
1. Input: Raw scraped data in JSON format.
2. Data Processor: Cleans and formats the data (removes symbols, standardizes names).
3. Dictionary Mapping: Maps non-standard terms (like abbreviations) to canonical values (see the sketch below).
4. Semantic Matcher: Applies SBERT or similar embeddings to ensure semantic equivalence.
5. Field Validator: Ensures data meets the predefined schema.
6. Output: Validated and cleaned JSON dataset for storage or API exposure.
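A minimal sketch of the cleanup and dictionary-mapping steps (steps 2–3 above); the ABBREVIATIONS table here is a hypothetical stand-in for the maintained mapping dictionary:

python

import re

# Hypothetical stand-in for the maintained mapping dictionary.
ABBREVIATIONS = {"vw": "Volkswagen", "lr": "Long Range", "ev": "Electric Vehicle"}

def clean_and_map(raw_name: str) -> str:
    """Strip stray symbols, collapse whitespace, and expand known abbreviations."""
    cleaned = re.sub(r"[^\w\s.\-]", " ", raw_name)   # drop symbols, keep dots/hyphens
    cleaned = re.sub(r"\s+", " ", cleaned).strip()   # normalize whitespace
    tokens = [ABBREVIATIONS.get(t.lower(), t) for t in cleaned.split()]
    return " ".join(tokens)

print(clean_and_map("VW  ID.4 LR!"))  # -> "Volkswagen ID.4 Long Range"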
Input/Output Specification

Input Field Cleanup Rules
Semantic Matching Using SBERT
To resolve semantic differences in car listings, SBERT (or other transformer models) is used to convert model names, trims, and other key attributes into embeddings. These embeddings are compared to determine semantic similarity, even when text differs in form (e.g., "Long Range" vs. "LR").
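As a quick illustration (a sketch, not the full matcher), two surface forms can be compared directly via the cosine similarity of their embeddings. Note that how close an abbreviation like "LR" lands to "Long Range" depends on the chosen model, which is why the dictionary mapping still backs up hard cases:

python

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed both surface forms and compare them directly.
emb = model.encode(["Long Range", "LR"], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()
print(f"cosine similarity: {score:.2f}")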
The key features of this approach include:
- Capturing true semantic meaning of terms (e.g., recognizing that "EV" means "Electric Vehicle").
- Reducing manual maintenance of rules as new synonyms or variations appear.
- Supporting scalable, advanced search features, such as semantic search and intelligent grouping in the UI.

Validation & Error Handling
Validation Behavior
- Missing Critical Fields: Records missing critical fields (e.g., price or make) are flagged and rejected from the output dataset.
- Incorrect Formats: Records with unrecognized formats (e.g., non-numeric prices) are logged for review.
- Schema Enforcement: Pydantic or Cerberus ensures all fields match the predefined schema before data is pushed to the next processing stage (see the sketch below).
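A minimal Pydantic (v2-style) sketch of the schema-enforcement step; the field names and types are illustrative assumptions, not the final schema:

python

from pydantic import BaseModel, ValidationError, field_validator

class LeaseOffer(BaseModel):
    # Illustrative subset of the schema.
    make: str
    model: str
    monthly_price: float

    @field_validator("monthly_price")
    @classmethod
    def price_must_be_positive(cls, v: float) -> float:
        if v <= 0:
            raise ValueError("price must be positive")
        return v

try:
    offer = LeaseOffer(make="Volkswagen", model="ID.4", monthly_price="499")  # "499" coerces to 499.0
except ValidationError as exc:
    print(exc)  # the rejection reason feeds the logging described below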
Logging Behavior

- Failed Record Logs: All rejected records, along with the reason for rejection, are logged for transparency.
- Success Metrics: The number of successfully processed records, normalization failures, and average processing time are logged.
- Cloud Integration: Logs are saved to a cloud storage solution (e.g., AWS S3, Azure Log Analytics) for further auditing and analysis.
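A minimal sketch of the rejection logging, assuming JSON-lines records that a separate job ships to the cloud store; the field names are illustrative:

python

import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("normalizer")

def log_rejection(record: dict, reason: str) -> None:
    """Emit one JSON line per rejected record; a separate job ships these to S3."""
    logger.info(json.dumps({
        "event": "record_rejected",
        "reason": reason,
        "source_url": record.get("source_url"),  # illustrative field
    }))

log_rejection({"source_url": "https://example.com/offer/1"}, "missing critical field: make")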
Simulation: Web Scraping Platform – SBERT Matching Logic

Problem:
After scraping data from various car leasing websites like JustLease.nl, DirectLease.nl, and 123Lease.nl, the platform receives inconsistent naming for the same car models. For example:
"Volkswagen ID.4", "VW ID4", and "Volkswagen ID4 Pro Performance" These entries all refer to the same vehicle, but they differ in wording. Using basic string matching or fuzzy logic often fails due to these variations.
Solution: SBERT for Semantic Matching
How SBERT Works in This Context:
1. Sentence Embedding:
All scraped texts (e.g., car model names, trims, etc.) are converted into dense vector representations using SBERT (Sentence-BERT).
Example:
python

from sentence_transformers import SentenceTransformer

# Load a compact general-purpose sentence-embedding model.
model = SentenceTransformer('all-MiniLM-L6-v2')

texts = ["Volkswagen ID.4", "VW ID4", "Volkswagen ID4 Pro Performance"]

# Encode all scraped names into dense vectors.
embeddings = model.encode(texts)
Each string becomes a numerical vector that captures its semantic meaning.
2. Similarity Calculation:
The system calculates cosine similarity between these vectors to determine how semantically close the entries are.
python

from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarities between all embeddings from the previous step.
similarity_matrix = cosine_similarity(embeddings)
If the similarity score exceeds 0.85, the entries are treated as referring to the same vehicle (the threshold is tunable).
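Continuing the snippet above, a minimal sketch of applying that threshold:

python

import itertools

THRESHOLD = 0.85  # from the rule above; tune per dataset

# Index pairs whose embeddings are close enough to be merged.
matches = [
    (i, j)
    for i, j in itertools.combinations(range(len(texts)), 2)
    if similarity_matrix[i, j] > THRESHOLD
]
print(matches)  # e.g. [(0, 1), (0, 2), (1, 2)] when all three strings match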
3. Clustering & Normalization:
Similar entries are grouped using clustering methods such as DBSCAN (or nearest-neighbour search with Faiss). Each cluster is then mapped to a single, unified model label, as in the sketch and example output below.
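A minimal DBSCAN sketch over the embeddings from the running example; eps=0.15 mirrors the 0.85 similarity rule (cosine distance = 1 − similarity), and the representative-label choice is a placeholder assumption:

python

from collections import defaultdict
from sklearn.cluster import DBSCAN

# Cosine distance = 1 - cosine similarity, so eps=0.15 ~ similarity > 0.85.
labels = DBSCAN(eps=0.15, min_samples=1, metric="cosine").fit_predict(embeddings)

clusters = defaultdict(list)
for text, label in zip(texts, labels):
    clusters[label].append(text)

for label, members in sorted(clusters.items()):
    unified = members[0]  # placeholder; the pipeline maps clusters to canonical labels
    print(f"cluster_{label:02d}: {members} -> {unified}")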
json

{
  "cluster_01": {
    "raw_texts": ["Volkswagen ID.4", "VW ID4", "Volkswagen ID4 Pro Performance"],
    "unified_model": "Volkswagen ID.4"
  }
}
Benefits of Using SBERT:
- Semantic Awareness: Understands meaning beyond exact words or spelling.
- Low Maintenance: No need for manually written rules or regex for every variation.
- Scalable: Works efficiently with thousands of car listings from multiple websites.

System Flow Diagram:
[Raw Scraped Text]
|
v
[SBERT Embedding]
|
v
[Cosine Similarity Matrix]
|
v
[Clustering & Labeling]
|
v
[Unified Entity Output]