
ShopDog Proposal

Problem Statement

We aim to build an all-in-one chat interface where motorcycle enthusiasts can find information about vintage motorcycles in one place. The chat should handle both images and text, and guide the user through motorcycle specifics and part numbers (for easy replacement). The core problem Hisan will solve is the data extraction and ingestion pipeline.

Deliverables

1. Qdrant Cluster Setup

Set up a Qdrant cluster with all source data fully ingested.
Store all relevant documents and extracted content in Qdrant
Ensure each record is embedded and indexed for retrieval
Support scalable querying across the full dataset

2. Metadata on Every Data Point

Each data point should include structured metadata to improve filtering and retrieval quality.
2.1 Functional Tags
Tag each data point by its function or the section it comes from.
Rough examples:
part_diagram
part_reference_table
maintenance_procedure
specification
troubleshooting
2.2 Runtime Filter Metadata
Include metadata fields such as:
surrounding_context
model
year
system
content_type
This allows runtime filtering based on the user’s question.
Optional design choice: If we can assume a user only asks about one motorcycle at a time, we could create one collection per motorcycle instead of using a shared collection with metadata filters.
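
As a rough illustration of the metadata fields listed above, the sketch below models one record's payload and builds a filter in the JSON shape used by Qdrant's filtering API (`must` conditions with `key` and `match`). The `ChunkMetadata` class, the `build_filter` helper, and all sample values are hypothetical; the actual schema would be settled during the POC.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ChunkMetadata:
    """Payload stored next to each vector (field names from the list above)."""
    content_type: str         # e.g. "part_reference_table"
    model: str                # sample value only
    year: int
    system: str               # e.g. "brakes"
    surrounding_context: str  # short text around the chunk


def build_filter(model: Optional[str] = None,
                 year: Optional[int] = None,
                 content_type: Optional[str] = None) -> dict:
    """Build a Qdrant-style filter from whichever fields the question pins down."""
    conditions = []
    for key, value in (("model", model), ("year", year), ("content_type", content_type)):
        if value is not None:
            conditions.append({"key": key, "match": {"value": value}})
    return {"must": conditions}


# Hypothetical record and a filter for a question about one model's part tables.
payload = asdict(ChunkMetadata(
    content_type="part_reference_table",
    model="Example 750",
    year=1972,
    system="brakes",
    surrounding_context="Front brake assembly, exploded view legend",
))
query_filter = build_filter(model="Example 750", content_type="part_reference_table")
```

Fields the user's question does not mention are simply left out of the filter, so a broad question searches the whole shared collection.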

3. Hierarchical Data Structure

Implement hierarchy in the indexed data.
A good hierarchy is:
page > paragraph > chunk
Hierarchical Search Flow
Use a staged retrieval process:
Find the most relevant pages
Search within those pages for relevant paragraphs
Retrieve the best matching chunks from those paragraphs
This helps improve precision and preserve document structure during retrieval.
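
The staged flow above can be sketched on a small in-memory corpus. This is only an illustration of the page > paragraph > chunk narrowing: the token-overlap score is a stand-in for real vector similarity, and the corpus layout and function names are assumptions, not the final pipeline.

```python
from typing import Callable, List


def overlap_score(query: str, text: str) -> int:
    """Toy relevance score: count of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def top_n(query: str, items: list, text_of: Callable, n: int) -> list:
    """Return the n items whose text scores highest against the query."""
    return sorted(items, key=lambda item: overlap_score(query, text_of(item)),
                  reverse=True)[:n]


def paragraph_text(paragraph: dict) -> str:
    return " ".join(paragraph["chunks"])


def page_text(page: dict) -> str:
    return " ".join(paragraph_text(p) for p in page["paragraphs"])


def staged_search(query: str, pages: List[dict],
                  n_pages: int = 1, n_paragraphs: int = 1, n_chunks: int = 1) -> List[str]:
    # Stage 1: find the most relevant pages.
    best_pages = top_n(query, pages, page_text, n_pages)
    # Stage 2: search only the paragraphs inside those pages.
    paragraphs = [p for page in best_pages for p in page["paragraphs"]]
    best_paragraphs = top_n(query, paragraphs, paragraph_text, n_paragraphs)
    # Stage 3: pick the best chunks from the surviving paragraphs.
    chunks = [c for p in best_paragraphs for c in p["chunks"]]
    return top_n(query, chunks, lambda chunk: chunk, n_chunks)


# Hypothetical two-page corpus.
pages = [
    {"page": 1, "paragraphs": [
        {"chunks": ["front brake caliper removal steps",
                    "torque the caliper bolts evenly"]},
        {"chunks": ["rear brake shoe inspection"]},
    ]},
    {"page": 2, "paragraphs": [
        {"chunks": ["carburetor jet sizes and float height"]},
    ]},
]
result = staged_search("front brake caliper removal", pages)
```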

4. Hybrid Search

Implement hybrid search for stronger retrieval performance.
This should combine:
Vector search for semantic similarity
Keyword / lexical search for exact-match terms, part numbers, and terminology
Hybrid search is especially useful for technical documentation where exact wording often matters.
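
The proposal does not fix how the two result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below under that assumption: each chunk earns `1 / (k + rank)` from every list it appears in, so chunks ranked well by both searches rise to the top. The chunk IDs are made up, and `k = 60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked ID lists into one, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two searches for the same query.
vector_hits = ["chunk_a", "chunk_b", "chunk_c"]   # semantic similarity
keyword_hits = ["chunk_c", "chunk_a", "chunk_d"]  # exact match on a part number
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Because RRF uses ranks rather than raw scores, the vector and keyword scores do not need to be on the same scale.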

5. RAG API Layer for Chatbot Building

For the chatbot agent, I can provide an API layer with the following tools:

5.1 RAG Search Tool

A tool that performs retrieval-augmented search.
Input:
textual query
Optional parameters:
number of chunks to return
threshold values
scope controls based on whether the user input is broad or specific
Output:
relevant retrieved chunks
associated metadata
ranking or confidence information as needed
relevant image, if needed
Adaptive Retrieval Controls
The agent can pass parameters such as:
top_k
similarity thresholds
chunk limits
retrieval scope tuning
These can be adjusted dynamically depending on the nature of the query.
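
To make the inputs and outputs above concrete, here is a minimal sketch of what the tool's request and response shapes might look like, and how `top_k` and a similarity threshold would be applied to candidate results. The class names, field names, and default values are assumptions for illustration, not a committed API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SearchRequest:
    query: str
    top_k: int = 5                # number of chunks to return
    score_threshold: float = 0.0  # drop chunks scoring below this


@dataclass
class RetrievedChunk:
    text: str
    score: float                  # ranking / confidence information
    metadata: dict = field(default_factory=dict)
    image_url: Optional[str] = None  # set only when a relevant image exists


def apply_controls(request: SearchRequest,
                   candidates: List[RetrievedChunk]) -> List[RetrievedChunk]:
    """Filter by threshold, sort best first, and cut to top_k."""
    kept = [c for c in candidates if c.score >= request.score_threshold]
    kept.sort(key=lambda c: c.score, reverse=True)
    return kept[:request.top_k]


# Hypothetical candidates as they might come back from retrieval.
candidates = [
    RetrievedChunk("Caliper bolt torque table", 0.91, {"content_type": "specification"}),
    RetrievedChunk("General safety notice", 0.42),
    RetrievedChunk("Brake bleeding procedure", 0.77, {"content_type": "maintenance_procedure"}),
]
specific = apply_controls(SearchRequest("caliper bolt torque", top_k=2, score_threshold=0.5),
                          candidates)
```

An agent could send a small `top_k` with a high threshold for a specific question (a single part number) and loosen both for a broad one.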

5.2 Query Expansion / Rewriting Tool

A tool to improve retrieval by expanding or rewriting the original query before search.
Use cases:
resolve ambiguous phrasing
add synonyms or related terminology
convert user language into terminology that better matches the indexed corpus
improve recall for broad or underspecified questions
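
How the rewriting would be done is left open here (it could equally be an LLM call). As one simple illustration of the synonym use case, the sketch below expands a query with a small hand-written term table. The table contents and the `expand_query` name are hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical mapping from rider slang to manual terminology.
SYNONYMS: Dict[str, List[str]] = {
    "carb": ["carburetor"],
    "brake pads": ["brake shoes", "friction pads"],
}


def expand_query(query: str, synonyms: Dict[str, List[str]] = SYNONYMS) -> List[str]:
    """Return the lowercased query plus one variant per matching synonym."""
    normalized = query.lower()
    variants = [normalized]
    for term, alternatives in synonyms.items():
        # Whole-word match, so "carb" does not fire inside "carburetor".
        pattern = re.compile(r"\b" + re.escape(term) + r"\b")
        if pattern.search(normalized):
            for alternative in alternatives:
                variant = pattern.sub(alternative, normalized)
                if variant not in variants:
                    variants.append(variant)
    return variants


expanded = expand_query("How do I clean the carb?")
```

Each variant can be sent through the search tool and the results merged, which helps when the manuals use different wording than the user.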

Budget: $1500

Timeline: 2-3 weeks

We will take a milestone-based approach: start with 3 docs, ingest them, and test.
When the results look good, we will scale the approach to the remaining docs.
The POC for 3 docs should take around 3-4 days.
I expect considerable time to go into processing and verifying the long list of docs.
We might need to tweak the pipeline to handle edge cases, as each doc will have a unique layout.
