PRD Web Scraping

Module: Scraper

Technical specification for the Scraper modules

Overview

Scrapers for JustLease.nl, DirectLease.nl, and 123Lease.nl retrieve structured car leasing data by navigating paginated listings and extracting detail-level attributes from each offer.

Hybrid Scraping Architecture

Given the real-world diversity of the Dutch car leasing websites—ranging from fully server-rendered pages (e.g., JustLease and DirectLease) to JavaScript-dependent platforms like 123Lease—AutoCompare requires a hybrid scraping architecture to operate effectively at scale.
This approach strategically combines:
Scrapy + BeautifulSoup: To efficiently scrape structured data from HTML-based sites using fast, lightweight HTTP parsing.
Selenium (Headless Chrome): As a fallback for JS-heavy or dynamic content sites where standard scrapers cannot reach critical data.
This architecture ensures:
Scalability across thousands of listings per site.
Resilience against site structure differences and partial page loads.
Maintainability: roughly 90% of use cases remain fast and efficient via Scrapy, while Selenium is invoked only when necessary, minimizing overhead and complexity.
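As an illustration only, the JavaScript-detection heuristic behind this dispatch could be a simple check on the raw HTML before committing to a rendering path; the function name and the CSS selector below are placeholders, not part of the actual implementation.

```python
# Sketch of the JS-detection heuristic behind the hybrid dispatch.
# The selector and function name are illustrative assumptions.
import requests
from bs4 import BeautifulSoup

def needs_js_fallback(html: str, required_selector: str) -> bool:
    """Return True when the element expected to hold listings is missing or
    empty in the raw HTML, i.e. the page is likely rendered client-side."""
    soup = BeautifulSoup(html, "lxml")
    container = soup.select_one(required_selector)
    return container is None or not container.get_text(strip=True)

if __name__ == "__main__":
    html = requests.get("https://www.123lease.nl/", timeout=15).text
    if needs_js_fallback(html, "div.listing-card"):  # placeholder selector
        print("JS-heavy page: route this domain to the Selenium fallback")
    else:
        print("Server-rendered page: keep it on the Scrapy/BeautifulSoup path")
```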

Technical Components

Component | Technology/Tool | Purpose
Scraper Framework | Python 3.10+, Scrapy | Core scraper engine with site-specific spider classes
HTML Parser | BeautifulSoup / lxml | Clean and extract HTML elements
Headless Browser | Selenium + ChromeDriver | Render JavaScript-heavy content (fallback only)
Crawler Middleware | Scrapy Middleware | Controls headers, delay, and retries
Request Handler | Scrapy Downloader | Manages pagination and redirects
JavaScript Detector | Heuristic Logic / Tag Checks | Detects if JS fallback is needed

Input Configuration (per Website)

Config Parameter | Type | Description
base_url | String | URL of the main listing page
pagination_xpath | XPath/CSS | Selector to detect the "Next" page button
listing_link_xpath | XPath/CSS | Selector for links to individual car detail pages
detail_extract_rules | Dict | Per-field selectors for extracting structured data
site_id | String | Unique identifier for tracking logs, errors, and metadata
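To make the parameters concrete, a single site's configuration might look like the sketch below; the selectors, URL path, and rule names are placeholders rather than the real selectors used on the target sites.

```python
# Hypothetical per-site configuration entry; every selector is a placeholder.
JUSTLEASE_CONFIG = {
    "site_id": "justlease_nl",
    "base_url": "https://www.justlease.nl/private-lease",
    "pagination_xpath": "//a[@rel='next']/@href",
    "listing_link_xpath": "//article[contains(@class, 'offer')]//a/@href",
    "detail_extract_rules": {
        "make": "//h1/span[@class='brand']/text()",
        "model": "//h1/span[@class='model']/text()",
        "price_per_month": "//*[@data-testid='monthly-price']/text()",
        "lease_duration": "//*[@data-testid='duration']/text()",
        "mileage_per_year": "//*[@data-testid='mileage']/text()",
    },
}
```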

Data Fields Extracted

Field | Type | Description
make | String | Car brand (e.g., "Volkswagen")
model |  | Car model name (e.g., "Golf")
variant |  | Optional: version/trim (e.g., "1.5 TSI")
price_per_month |  | Monthly lease cost (cleaned of currency symbols)
lease_duration |  | Lease term in months
mileage_per_year |  | Included kilometers/year
provider |  | Site or leasing company name
fuel_type |  | Gasoline, Electric, Diesel, etc.
transmission |  | Manual/Automatic
car_condition |  | "New" or "Occasion" (used)
body_type |  | SUV, Sedan, Hatchback, etc.
engine_specs |  | Optional: engine size or battery specs
image_url |  | Primary image (only the link is stored)
listing_url |  | Direct link to source page
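One way to carry these fields through the pipeline is a Scrapy Item (a plain dataclass would work equally well); the sketch below simply mirrors the field list above and is not a committed schema.

```python
# Sketch of an item definition mirroring the extracted fields (assumes Scrapy).
import scrapy

class LeaseOfferItem(scrapy.Item):
    make = scrapy.Field()              # e.g. "Volkswagen"
    model = scrapy.Field()             # e.g. "Golf"
    variant = scrapy.Field()           # optional trim, e.g. "1.5 TSI"
    price_per_month = scrapy.Field()   # numeric, currency symbols stripped
    lease_duration = scrapy.Field()    # months
    mileage_per_year = scrapy.Field()  # included km per year
    provider = scrapy.Field()          # site or leasing company name
    fuel_type = scrapy.Field()
    transmission = scrapy.Field()
    car_condition = scrapy.Field()     # "New" or "Occasion"
    body_type = scrapy.Field()
    engine_specs = scrapy.Field()      # optional engine/battery specs
    image_url = scrapy.Field()         # link only; the image itself is not stored
    listing_url = scrapy.Field()
```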

Scraping Logic and Behavior

Pagination
System iterates through all paginated listing pages.
Stops when no "Next" page link is found.
Listing Processing
Each listing’s summary is processed for available inline data.
If needed, follows link to listing detail page.
Detail Extraction
Extracts additional data (e.g., full specs, lease terms, variant).
Applies fallback selectors if primary ones fail.
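Put together, the pagination, listing-processing, and detail-extraction steps above map onto a Scrapy spider roughly as sketched below; the spider name, selectors, and fallback selector are illustrative assumptions, not production values.

```python
# Illustrative spider showing pagination, detail-page follow-up, and a
# fallback selector. All names and selectors are placeholders.
import scrapy

class LeaseSpider(scrapy.Spider):
    name = "justlease_nl"  # hypothetical site_id
    start_urls = ["https://www.justlease.nl/private-lease"]

    def parse(self, response):
        # Follow each listing's detail page for the full specs.
        for href in response.xpath(
            "//article[contains(@class, 'offer')]//a/@href"
        ).getall():
            yield response.follow(href, callback=self.parse_detail)

        # Pagination: keep following "Next" until the link disappears.
        next_page = response.xpath("//a[@rel='next']/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        # Primary selector first, fallback selector if it returns nothing.
        price = (
            response.xpath("//*[@data-testid='monthly-price']/text()").get()
            or response.css("span.price::text").get()
        )
        yield {
            "price_per_month": price,
            "listing_url": response.url,
        }
```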
Polite Crawling
Delay: 1.5–3s randomized delay between requests.
Retries: up to 3 retries per failed request.
Headers: Dynamic User-Agent strings rotated from a pool.
robots.txt: Checked before scraping any site.
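In Scrapy these politeness rules largely map to built-in settings; the values below mirror the numbers above, the user-agent pool is a placeholder, and note that Scrapy's own randomization multiplies the base delay by 0.5–1.5, so an exact 1.5–3 s window would need a small custom middleware.

```python
# Scrapy settings sketch for the politeness rules above (values illustrative).
DOWNLOAD_DELAY = 2                # ~1-3 s per request once randomization applies
RANDOMIZE_DOWNLOAD_DELAY = True   # Scrapy multiplies the delay by 0.5-1.5
RETRY_ENABLED = True
RETRY_TIMES = 3                   # up to 3 retries per failed request
ROBOTSTXT_OBEY = True             # check robots.txt before scraping any site

# Placeholder User-Agent pool, rotated per request by a custom downloader
# middleware (not shown here).
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
```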
JavaScript Rendering (Fallback Mode)
If listings or content fail to load with standard Scrapy/requests:
Trigger Selenium for that domain only.
Wait for DOM-ready state or specific element load.
Extract page content and return to main pipeline.
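A minimal version of that fallback could render the page headlessly, wait for a known element, and hand the HTML back to the normal parsing pipeline; the wait selector and timeout below are assumptions.

```python
# Minimal Selenium fallback sketch: render, wait for a known element, return HTML.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def render_with_selenium(url: str, wait_selector: str = "div.listing-card") -> str:
    """Render a JS-heavy page in headless Chrome and return the final DOM.
    wait_selector is a placeholder for a site-specific element."""
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        return driver.page_source  # handed back to the main parsing pipeline
    finally:
        driver.quit()
```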

Validation & Logging

Missing fields (e.g., no mileage or price) are marked null and flagged in the logs.
Each scrape run generates:
Total listings found
Total listings with errors or missing fields
Average response time
Logs stored to cloud (e.g., S3, CloudWatch, Azure Log)
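The per-run metrics could be emitted as one structured summary record per site, as sketched below; the field names are illustrative, and the actual destination would be the cloud log store mentioned above (S3, CloudWatch, or Azure).

```python
# Sketch of a per-run scrape summary record (field names are illustrative).
import json
import logging
from datetime import datetime, timezone

def log_run_summary(site_id: str, total: int, errored: int, avg_response_ms: float) -> None:
    summary = {
        "site_id": site_id,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "total_listings_found": total,
        "listings_with_errors_or_missing_fields": errored,
        "avg_response_time_ms": round(avg_response_ms, 1),
    }
    # In production this record would be shipped to S3/CloudWatch/Azure Log;
    # here it is simply emitted as a structured log line.
    logging.getLogger("scrape_run").info(json.dumps(summary))
```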

Site-Specific Behavior Examples

Website | JS Needed | Pagination | Detail Page Required | Notes
JustLease.nl | No | Yes | Yes | Rich specs on detail page
DirectLease.nl | No | Yes | No | Most fields inline
123Lease.nl | Yes | Yes (JS-rendered) | Yes | Requires headless browser

