Problem Statement
We aim to build an all-in-one chat interface where motorcycle enthusiasts can find information about vintage motorcycles. The chat should handle both images and text and guide the user on motorcycle specifics and part numbers (for easy replacement). The core problem Hisan will solve is the data extraction and ingestion pipeline.
Deliverables
1. Qdrant Cluster Setup
Set up a Qdrant cluster with all source data fully ingested.
Store all relevant documents and extracted content in Qdrant
Ensure each record is embedded and indexed for retrieval
Support scalable querying across the full dataset
2. Metadata on Every Data Point
Each data point should include structured metadata to improve filtering and retrieval quality.
Tag data based on its function or section.
Rough Examples:
2.2 Runtime Filter Metadata
Include metadata fields such as:
This allows runtime filtering based on the user’s question.
Optional design choice:
If we can assume a user only asks about one motorcycle at a time, we could create one collection per motorcycle instead of using a shared collection with metadata filters.
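As a rough illustration of runtime filtering, the sketch below keeps a few records with structured metadata in a plain Python list and filters them before any similarity search would run. The field names (`make`, `model`, `section`), the sample values, and the collection naming scheme are placeholders chosen for illustration, not the final schema. In the actual cluster the same idea would be expressed as a payload filter on the Qdrant collection.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One indexed data point: text plus structured metadata (hypothetical schema)."""
    text: str
    metadata: dict = field(default_factory=dict)


def filter_records(records, **conditions):
    """Keep only records whose metadata matches every key/value in `conditions`."""
    return [
        r for r in records
        if all(r.metadata.get(key) == value for key, value in conditions.items())
    ]


def collection_name(make, model):
    """Name for the one-collection-per-motorcycle option (naming scheme is an assumption)."""
    return f"{make}_{model}".lower().replace(" ", "_")


# Placeholder records; all values are invented for the example.
RECORDS = [
    Record("Carburetor jet sizes table",
           {"make": "ExampleMake", "model": "Model A", "section": "parts"}),
    Record("Valve clearance procedure",
           {"make": "ExampleMake", "model": "Model A", "section": "maintenance"}),
    Record("Carburetor jet sizes table",
           {"make": "ExampleMake", "model": "Model B", "section": "parts"}),
]

# Shared collection: narrow by metadata at query time.
hits = filter_records(RECORDS, model="Model A", section="parts")

# Per-motorcycle collections: route the query to one collection instead.
target = collection_name("ExampleMake", "Model A")
```

With a shared collection, a question that spans several motorcycles still works by relaxing the filter; with one collection per motorcycle, that kind of question would need a query against each collection.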
3. Hierarchical Data Structure
Implement hierarchy in the indexed data.
A good hierarchy is:
Hierarchical Search Flow
Use a staged retrieval process:
Find the most relevant pages
Search within those pages for relevant paragraphs
Retrieve the best matching chunks from those paragraphs
This helps improve precision and preserve document structure during retrieval.
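The staged flow can be sketched in a few lines. This is a toy version under stated assumptions: the nested `pages` structure, the `summary` field, and the token-overlap `score` function are stand-ins invented for the example; in the real pipeline each stage would be a vector (or hybrid) query restricted to the results of the previous stage.

```python
def tokens(text):
    """Lowercase whitespace tokens as a set."""
    return set(text.lower().split())


def score(query, text):
    """Toy relevance score: count of shared tokens (stand-in for vector similarity)."""
    return len(tokens(query) & tokens(text))


def top_n(query, items, text_of, n):
    """Return up to n items with a positive score, best first."""
    ranked = sorted(items, key=lambda it: score(query, text_of(it)), reverse=True)
    return [it for it in ranked if score(query, text_of(it)) > 0][:n]


def hierarchical_search(query, pages, n_pages=2, n_paragraphs=2, n_chunks=3):
    # Stage 1: find the most relevant pages.
    best_pages = top_n(query, pages, lambda p: p["summary"], n_pages)
    # Stage 2: search only within those pages for relevant paragraphs.
    paragraphs = [para for p in best_pages for para in p["paragraphs"]]
    best_paragraphs = top_n(
        query, paragraphs, lambda para: " ".join(para["chunks"]), n_paragraphs
    )
    # Stage 3: retrieve the best matching chunks from those paragraphs.
    chunks = [c for para in best_paragraphs for c in para["chunks"]]
    return top_n(query, chunks, lambda c: c, n_chunks)


# Invented sample data for the example.
PAGES = [
    {"summary": "front brake assembly and adjustment",
     "paragraphs": [
         {"chunks": ["adjust the front brake cable free play",
                     "inspect brake shoes for wear"]},
         {"chunks": ["torque the axle nut after refitting the wheel"]},
     ]},
    {"summary": "ignition timing and spark plug",
     "paragraphs": [
         {"chunks": ["set ignition timing with the points just opening",
                     "check spark plug gap"]},
     ]},
]

results = hierarchical_search("front brake cable", PAGES)
```

Because each stage only sees what the previous stage kept, a chunk on an unrelated page cannot surface even if it shares a word with the query.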
4. Hybrid Search
Implement hybrid search for stronger retrieval performance.
This should combine:
Vector search for semantic similarity
Keyword / lexical search for exact-match terms, part numbers, and terminology
Hybrid search is especially useful for technical documentation where exact wording often matters.
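One common way to combine the two result lists is reciprocal rank fusion (RRF). The proposal does not fix a fusion method, so treat this as one possible option rather than the chosen design. The sketch assumes each search has already returned a ranked list of chunk ids; the ids and the constant `k=60` are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: each list adds 1 / (k + rank) to an id's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of the two searches for the same query.
vector_ranking = ["c2", "c1", "c3"]   # semantic similarity
keyword_ranking = ["c1", "c4", "c2"]  # exact-match terms, e.g. a part number

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
```

RRF works on ranks only, so the vector and keyword scores do not need to be on the same scale. A chunk that appears in both lists, such as `c1` here, rises above chunks that appear in just one.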
5. RAG API layer for chatbot building
For the chatbot agent, I can provide an API layer with the following tools:
5.1 RAG Search Tool
A tool that performs retrieval-augmented search.
Input:
Optional parameters:
number of chunks to return
scope controls based on whether the user input is broad or specific
Output:
relevant retrieved chunks
ranking or confidence information as needed
Adaptive Retrieval Controls
The agent can pass parameters such as:
These can be adjusted dynamically depending on the nature of the query.
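To make the tool contract more concrete, here is a minimal sketch of what the request and response could look like. Every name in it (`SearchRequest`, `top_k`, `scope`, `BROAD_MULTIPLIER`, `rag_search`) is hypothetical, and the token-overlap scoring is only a stand-in for the real retrieval call. The point is the shape of the interface: the agent sends a query plus optional controls and gets back chunks with a score.

```python
from dataclasses import dataclass
from typing import List

BROAD_MULTIPLIER = 3  # assumption: broad questions get 3x as many chunks


@dataclass
class SearchRequest:
    query: str
    top_k: int = 5            # number of chunks to return
    scope: str = "specific"   # "broad" or "specific"

    def __post_init__(self):
        if self.scope not in ("broad", "specific"):
            raise ValueError("scope must be 'broad' or 'specific'")


@dataclass
class RetrievedChunk:
    text: str
    score: float  # ranking / confidence information


def effective_top_k(req: SearchRequest) -> int:
    """Widen the result set when the agent marks the question as broad."""
    return req.top_k * BROAD_MULTIPLIER if req.scope == "broad" else req.top_k


def rag_search(req: SearchRequest, corpus: List[str]) -> List[RetrievedChunk]:
    """Toy retrieval: score chunks by the fraction of query tokens they contain."""
    q = set(req.query.lower().split())
    scored = []
    for text in corpus:
        overlap = len(q & set(text.lower().split()))
        if overlap:
            scored.append(RetrievedChunk(text=text, score=overlap / len(q)))
    scored.sort(key=lambda c: c.score, reverse=True)
    return scored[: effective_top_k(req)]


# Invented corpus for the example.
CORPUS = [
    "clutch cable part number",
    "clutch plate replacement",
    "fuel tap gasket",
    "clutch lever pivot",
]

specific = rag_search(SearchRequest("clutch cable", top_k=1), CORPUS)
broad = rag_search(SearchRequest("clutch cable", top_k=1, scope="broad"), CORPUS)
```

Keeping the controls on the request object lets the agent change retrieval behavior per question without any change on the API side.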
5.2 Query Expansion / Rewriting Tool
A tool to improve retrieval by expanding or rewriting the original query before search.
Use cases:
resolve ambiguous phrasing
add synonyms or related terminology
convert user language into terminology that better matches the indexed corpus
improve recall for broad or underspecified questions
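A very small version of the synonym use case might look like the sketch below. The lookup table and the `expand_query` function are hypothetical; a production version could equally rely on an LLM rewrite or on a glossary built from the ingested manuals, which this proposal leaves open.

```python
# Hypothetical mapping from rider slang to terms likely to appear in manuals.
SYNONYMS = {
    "carb": ["carburetor", "carburettor"],
    "bike": ["motorcycle"],
    "tyre": ["tire"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Append known alternative terms to the query, keeping the original words first."""
    words = query.lower().split()
    expanded = list(words)
    for word in words:
        for alt in synonyms.get(word, []):
            if alt not in expanded:
                expanded.append(alt)
    return " ".join(expanded)


rewritten = expand_query("carb float height")
```

Appending terms rather than replacing them keeps the user's original wording available to the keyword side of the search while giving the query a chance to match the spelling used in the corpus.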
Budget: $1500
Timeline: 2-3 weeks
We will follow a milestone-based approach. We start with 3 docs, ingest them, and test. Once the results look good, we scale the approach to the remaining docs. The POC for 3 docs should take around 3-4 days. I expect considerable time to go into processing and verifying the long list of docs. We might need to tweak the pipeline to handle edge cases, as each doc will have a unique layout.