PRD Web Scraping

Module: Scraper

Technical specification for the Scraper modules

Overview

Scrapers for JustLease.nl, DirectLease.nl, and 123Lease.nl retrieve structured car leasing data by navigating paginated listings and extracting detail-level attributes from each offer.

Hybrid Scraping Architecture

Given the real-world diversity of the Dutch car leasing websites—ranging from fully server-rendered pages (e.g., JustLease and DirectLease) to JavaScript-dependent platforms like 123Lease—AutoCompare requires a hybrid scraping architecture to operate effectively at scale.
This approach strategically combines:
Scrapy + BeautifulSoup: To efficiently scrape structured data from HTML-based sites using fast, lightweight HTTP parsing.
Selenium (Headless Chrome): As a fallback for JS-heavy or dynamic content sites where standard scrapers cannot reach critical data.
This architecture ensures:
Scalability across thousands of listings per site.
Resilience against site structure differences and partial page loads.
Maintainability: roughly 90% of use cases remain fast and efficient via Scrapy, while Selenium is invoked only when necessary, minimizing overhead and complexity.
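As an illustration only, the JavaScript-detection heuristic behind this dispatch could be a simple check on the raw HTML before committing to a rendering path; the function name and the CSS selector below are placeholders, not part of the actual implementation.

```python
# Sketch of the JS-detection heuristic behind the hybrid dispatch.
# The selector and function name are illustrative assumptions.
import requests
from bs4 import BeautifulSoup

def needs_js_fallback(html: str, required_selector: str) -> bool:
    """Return True when the element expected to hold listings is missing or
    empty in the raw HTML, i.e. the page is likely rendered client-side."""
    soup = BeautifulSoup(html, "lxml")
    container = soup.select_one(required_selector)
    return container is None or not container.get_text(strip=True)

if __name__ == "__main__":
    html = requests.get("https://www.123lease.nl/", timeout=15).text
    if needs_js_fallback(html, "div.listing-card"):  # placeholder selector
        print("JS-heavy page: route this domain to the Selenium fallback")
    else:
        print("Server-rendered page: keep it on the Scrapy/BeautifulSoup path")
```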

Technical Components

Component | Technology/Tool | Purpose
Scraper Framework | Python 3.10+, Scrapy | Core scraper engine with site-specific spider classes
HTML Parser | BeautifulSoup / lxml | Clean and extract HTML elements
Headless Browser | Selenium + ChromeDriver | Render JavaScript-heavy content (fallback only)
Crawler Middleware | Scrapy Middleware | Controls headers, delay, and retries
Request Handler | Scrapy Downloader | Manages pagination and redirects
JavaScript Detector | Heuristic Logic / Tag Checks | Detects if JS fallback is needed

Input Configuration (per Website)

Config Parameter | Type | Description
base_url | String | URL of the main listing page
pagination_xpath | XPath/CSS | Selector to detect the "Next" page button
listing_link_xpath | XPath/CSS | Selector for links to individual car detail pages
detail_extract_rules | Dict | Per-field selectors for extracting structured data
site_id | String | Unique identifier for tracking logs, errors, and metadata
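To make the parameters concrete, a single site's configuration might look like the sketch below; the selectors, URL path, and rule names are placeholders rather than the real selectors used on the target sites.

```python
# Hypothetical per-site configuration entry; every selector is a placeholder.
JUSTLEASE_CONFIG = {
    "site_id": "justlease_nl",
    "base_url": "https://www.justlease.nl/private-lease",
    "pagination_xpath": "//a[@rel='next']/@href",
    "listing_link_xpath": "//article[contains(@class, 'offer')]//a/@href",
    "detail_extract_rules": {
        "make": "//h1/span[@class='brand']/text()",
        "model": "//h1/span[@class='model']/text()",
        "price_per_month": "//*[@data-testid='monthly-price']/text()",
        "lease_duration": "//*[@data-testid='duration']/text()",
        "mileage_per_year": "//*[@data-testid='mileage']/text()",
    },
}
```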

Data Fields Extracted

Field | Type | Description
make | String | Car brand (e.g., "Volkswagen")
model |  | Car model name (e.g., "Golf")
variant |  | Optional: version/trim (e.g., "1.5 TSI")
price_per_month |  | Monthly lease cost (cleaned of currency symbols)
lease_duration |  | Lease term in months
mileage_per_year |  | Included kilometers/year
provider |  | Site or leasing company name
fuel_type |  | Gasoline, Electric, Diesel, etc.
transmission |  | Manual/Automatic
car_condition |  | "New" or "Occasion" (used)
body_type |  | SUV, Sedan, Hatchback, etc.
engine_specs |  | Optional: engine size or battery specs
image_url |  | Primary image (only the link is stored)
listing_url |  | Direct link to source page
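One way to carry these fields through the pipeline is a Scrapy Item (a plain dataclass would work equally well); the sketch below simply mirrors the field list above and is not a committed schema.

```python
# Sketch of an item definition mirroring the extracted fields (assumes Scrapy).
import scrapy

class LeaseOfferItem(scrapy.Item):
    make = scrapy.Field()              # e.g. "Volkswagen"
    model = scrapy.Field()             # e.g. "Golf"
    variant = scrapy.Field()           # optional trim, e.g. "1.5 TSI"
    price_per_month = scrapy.Field()   # numeric, currency symbols stripped
    lease_duration = scrapy.Field()    # months
    mileage_per_year = scrapy.Field()  # included km per year
    provider = scrapy.Field()          # site or leasing company name
    fuel_type = scrapy.Field()
    transmission = scrapy.Field()
    car_condition = scrapy.Field()     # "New" or "Occasion"
    body_type = scrapy.Field()
    engine_specs = scrapy.Field()      # optional engine/battery specs
    image_url = scrapy.Field()         # link only; the image itself is not stored
    listing_url = scrapy.Field()
```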

Scraping Logic and Behavior

Pagination
System iterates through all paginated listing pages.
Stops when no "Next" page link is found.
Listing Processing
Each listing’s summary is processed for available inline data.
If needed, follows link to listing detail page.
Detail Extraction
Extracts additional data (e.g., full specs, lease terms, variant).
Applies fallback selectors if primary ones fail.
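Put together, the pagination, listing-processing, and detail-extraction steps above map onto a Scrapy spider roughly as sketched below; the spider name, selectors, and fallback selector are illustrative assumptions, not production values.

```python
# Illustrative spider showing pagination, detail-page follow-up, and a
# fallback selector. All names and selectors are placeholders.
import scrapy

class LeaseSpider(scrapy.Spider):
    name = "justlease_nl"  # hypothetical site_id
    start_urls = ["https://www.justlease.nl/private-lease"]

    def parse(self, response):
        # Follow each listing's detail page for the full specs.
        for href in response.xpath(
            "//article[contains(@class, 'offer')]//a/@href"
        ).getall():
            yield response.follow(href, callback=self.parse_detail)

        # Pagination: keep following "Next" until the link disappears.
        next_page = response.xpath("//a[@rel='next']/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        # Primary selector first, fallback selector if it returns nothing.
        price = (
            response.xpath("//*[@data-testid='monthly-price']/text()").get()
            or response.css("span.price::text").get()
        )
        yield {
            "price_per_month": price,
            "listing_url": response.url,
        }
```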
Polite Crawling
Delay: 1.5–3s randomized delay between requests.
Retries: up to 3 retries per failed request.
Headers: Dynamic User-Agent strings rotated from a pool.
robots.txt: Checked before scraping any site.
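In Scrapy these politeness rules largely map to built-in settings; the values below mirror the numbers above, the user-agent pool is a placeholder, and note that Scrapy's own randomization multiplies the base delay by 0.5–1.5, so an exact 1.5–3 s window would need a small custom middleware.

```python
# Scrapy settings sketch for the politeness rules above (values illustrative).
DOWNLOAD_DELAY = 2                # ~1-3 s per request once randomization applies
RANDOMIZE_DOWNLOAD_DELAY = True   # Scrapy multiplies the delay by 0.5-1.5
RETRY_ENABLED = True
RETRY_TIMES = 3                   # up to 3 retries per failed request
ROBOTSTXT_OBEY = True             # check robots.txt before scraping any site

# Placeholder User-Agent pool, rotated per request by a custom downloader
# middleware (not shown here).
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
```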
JavaScript Rendering (Fallback Mode)
If listings or content fail to load with standard Scrapy/requests:
Trigger Selenium for that domain only.
Wait for DOM-ready state or specific element load.
Extract page content and return to main pipeline.
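A minimal version of that fallback could render the page headlessly, wait for a known element, and hand the HTML back to the normal parsing pipeline; the wait selector and timeout below are assumptions.

```python
# Minimal Selenium fallback sketch: render, wait for a known element, return HTML.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def render_with_selenium(url: str, wait_selector: str = "div.listing-card") -> str:
    """Render a JS-heavy page in headless Chrome and return the final DOM.
    wait_selector is a placeholder for a site-specific element."""
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        return driver.page_source  # handed back to the main parsing pipeline
    finally:
        driver.quit()
```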

Validation & Logging

Missing fields (e.g., no mileage or price) are marked null and flagged in the logs.
Each scrape run generates:
Total listings found
Total listings with errors or missing fields
Average response time
Logs stored to cloud (e.g., S3, CloudWatch, Azure Log)
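The per-run metrics could be emitted as one structured summary record per site, as sketched below; the field names are illustrative, and the actual destination would be the cloud log store mentioned above (S3, CloudWatch, or Azure).

```python
# Sketch of a per-run scrape summary record (field names are illustrative).
import json
import logging
from datetime import datetime, timezone

def log_run_summary(site_id: str, total: int, errored: int, avg_response_ms: float) -> None:
    summary = {
        "site_id": site_id,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "total_listings_found": total,
        "listings_with_errors_or_missing_fields": errored,
        "avg_response_time_ms": round(avg_response_ms, 1),
    }
    # In production this record would be shipped to S3/CloudWatch/Azure Log;
    # here it is simply emitted as a structured log line.
    logging.getLogger("scrape_run").info(json.dumps(summary))
```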

Site-Specific Behavior Examples

Website | JS Needed | Pagination | Detail Page Required | Notes
JustLease.nl | No | Yes | Yes | Rich specs on detail page
DirectLease.nl | No | Yes | No | Most fields inline
123Lease.nl | Yes | Yes (JS-rendered) | Yes | Requires headless browser

