ShopDog Proposal

Problem Statement

We aim to build an all-in-one chat interface where motorcycle enthusiasts can find information about vintage motorcycles. The chat should handle both images and text, and guide the user on motorcycle specifics and part numbers (for easy replacement). The core problem Hisan will solve is the data extraction and ingestion pipeline.

Deliverables

1. Qdrant Cluster Setup

Set up a Qdrant cluster with all source data fully ingested.

Store all relevant documents and extracted content in Qdrant

Ensure each record is embedded and indexed for retrieval

Support scalable querying across the full dataset
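As a rough illustration of what an ingested record could look like, the sketch below builds point records with deterministic IDs, so re-running ingestion overwrites existing points instead of duplicating them. Qdrant accepts UUIDs as point IDs; everything else here (the `PointRecord` shape, the `SHOPDOG_NAMESPACE` value, the payload keys, and the `placeholder_embed` stand-in for a real embedding model) is an assumption for illustration, not a decided design.

```python
import hashlib
import uuid
from dataclasses import dataclass, field

# Hypothetical namespace: the same (doc, page, chunk) always maps to the same ID.
SHOPDOG_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "shopdog/ingestion")


@dataclass
class PointRecord:
    id: str                 # UUID string, usable as a Qdrant point ID
    vector: list[float]     # embedding of the chunk text
    payload: dict = field(default_factory=dict)


def placeholder_embed(text: str, dim: int = 8) -> list[float]:
    """Stand-in for a real embedding model (assumption): hashes text to floats."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def build_point(source_doc: str, page: int, chunk_index: int, text: str) -> PointRecord:
    """Build one upsert-ready record with a deterministic ID."""
    key = f"{source_doc}|{page}|{chunk_index}"
    return PointRecord(
        id=str(uuid.uuid5(SHOPDOG_NAMESPACE, key)),
        vector=placeholder_embed(text),
        payload={
            "source_doc": source_doc,
            "page": page,
            "chunk_index": chunk_index,
            "text": text,
        },
    )
```

Because the ID depends only on the document name, page, and chunk index, the same chunk can be re-ingested after a pipeline tweak without leaving stale copies behind.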

2. Metadata on Every Data Point

Each data point should include structured metadata to improve filtering and retrieval quality.

2.1 Function Tags

Tag each data point by its function or the section it comes from.

Rough examples:

part_diagram

part_reference_table

maintenance_procedure

specification

troubleshooting

2.2 Runtime Filter Metadata

Include metadata fields such as:

surrounding_context

model

year

system

content_type

This allows runtime filtering based on the user’s question.
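A minimal sketch of how these fields could drive runtime filtering is shown here in plain Python. The field names come from the list above; the `ChunkMetadata` class, the `filter_chunks` helper, and the sample values ("Model A", "Model B", the years) are hypothetical placeholders. In the real system the equivalent filter would be expressed as a Qdrant payload filter rather than a Python loop.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkMetadata:
    model: str
    year: int
    system: str
    content_type: str
    surrounding_context: str = ""


def matches(meta: ChunkMetadata, **conditions) -> bool:
    """True when every supplied field equals the requested value.

    A condition set to None means "no constraint on this field".
    """
    record = asdict(meta)
    return all(record[key] == value
               for key, value in conditions.items() if value is not None)


def filter_chunks(chunks, model=None, year=None, system=None, content_type=None):
    """Keep only chunks whose metadata satisfies all given constraints."""
    return [c for c in chunks
            if matches(c, model=model, year=year, system=system,
                       content_type=content_type)]


# Placeholder records for illustration only.
SAMPLE = [
    ChunkMetadata("Model A", 1972, "fuel", "part_reference_table"),
    ChunkMetadata("Model A", 1972, "fuel", "maintenance_procedure"),
    ChunkMetadata("Model B", 1968, "electrical", "troubleshooting"),
]

fuel_tables = filter_chunks(SAMPLE, model="Model A",
                            content_type="part_reference_table")
```

The agent would fill in only the constraints it can infer from the user's question and leave the rest as `None`.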

Optional design choice: if we can assume a user asks about only one motorcycle at a time, we could create one collection per motorcycle instead of a shared collection with metadata filters.

3. Hierarchical Data Structure

Implement hierarchy in the indexed data.

A good hierarchy is:

page > paragraph > chunk

Hierarchical Search Flow

Use a staged retrieval process:

Find the most relevant pages

Search within those pages for relevant paragraphs

Retrieve the best matching chunks from those paragraphs

This helps improve precision and preserve document structure during retrieval.
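The staged flow above can be sketched as follows. This is a toy version under stated assumptions: relevance is a simple token-overlap score standing in for vector similarity, and the `Page` and `Paragraph` classes and the `n_pages`, `n_paragraphs`, and `n_chunks` limits are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


def overlap_score(query: str, text: str) -> float:
    """Toy relevance: fraction of query tokens that appear in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


@dataclass
class Paragraph:
    chunks: list[str]

    @property
    def text(self) -> str:
        return " ".join(self.chunks)


@dataclass
class Page:
    number: int
    paragraphs: list[Paragraph]

    @property
    def text(self) -> str:
        return " ".join(p.text for p in self.paragraphs)


def top_n(items, query, n, text_of):
    """Return the n items whose text scores highest against the query."""
    return sorted(items, key=lambda it: overlap_score(query, text_of(it)),
                  reverse=True)[:n]


def hierarchical_search(pages, query, n_pages=2, n_paragraphs=2, n_chunks=3):
    # Stage 1: narrow to the most relevant pages.
    best_pages = top_n(pages, query, n_pages, lambda p: p.text)
    # Stage 2: search only the paragraphs on those pages.
    paragraphs = [para for page in best_pages for para in page.paragraphs]
    best_paragraphs = top_n(paragraphs, query, n_paragraphs, lambda p: p.text)
    # Stage 3: rank the chunks inside the surviving paragraphs.
    chunks = [c for para in best_paragraphs for c in para.chunks]
    return top_n(chunks, query, n_chunks, lambda c: c)
```

Each stage only sees candidates that survived the previous one, which is what keeps a chunk tied to the page and paragraph it came from.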

4. Hybrid Search

Implement hybrid search for stronger retrieval performance.

This should combine:

Vector search for semantic similarity

Keyword / lexical search for exact-match terms, part numbers, and terminology

Hybrid search is especially useful for technical documentation where exact wording often matters.
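One common way to combine the two result lists is reciprocal rank fusion (RRF), sketched below. The proposal does not commit to a fusion method, so RRF is an assumption here, as are the chunk IDs and the conventional constant `k = 60`.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into a single ranking.

    Each list contributes 1 / (k + rank) for every ID it contains, so an ID
    ranked well by both vector and keyword search rises to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical chunk IDs returned by each retriever, best match first.
vector_hits = ["c1", "c2", "c3"]
keyword_hits = ["c3", "c1", "c4"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF only needs ranks, not raw scores, which avoids having to normalize vector similarities against keyword scores.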

5. RAG API Layer for Chatbot Building

For the chatbot agent, I can provide an API layer with the following tools:

5.1 RAG Search Tool

A tool that performs retrieval-augmented search.

Input:

textual query

Optional parameters:

number of chunks to return

threshold values

scope controls based on whether the user input is broad or specific

Output:

relevant retrieved chunks

associated metadata

ranking or confidence information as needed

relevant image, if needed

Adaptive Retrieval Controls

The agent can pass parameters such as:

top_k

similarity thresholds

chunk limits

retrieval scope tuning

These can be adjusted dynamically depending on the nature of the query.
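To make the tool contract concrete, here is a minimal sketch of the request and response shapes and of how `top_k` and a similarity threshold could be applied to scored results. The class names, field names, and defaults are hypothetical; scope controls are left out because the proposal does not yet define how they behave.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SearchRequest:
    query: str
    top_k: int = 5                 # maximum number of chunks to return
    score_threshold: float = 0.0   # drop chunks scoring below this value


@dataclass
class RetrievedChunk:
    text: str
    score: float
    metadata: dict = field(default_factory=dict)
    image_ref: Optional[str] = None   # reference to a relevant image, if any


def apply_controls(request: SearchRequest,
                   candidates: list[RetrievedChunk]) -> list[RetrievedChunk]:
    """Apply the agent-supplied controls to already-scored candidates."""
    kept = [c for c in candidates if c.score >= request.score_threshold]
    kept.sort(key=lambda c: c.score, reverse=True)
    return kept[:request.top_k]
```

An agent handling a narrow part-number question might pass a small `top_k` with a high threshold, and loosen both for a broad question.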

5.2 Query Expansion / Rewriting Tool

A tool to improve retrieval by expanding or rewriting the original query before search.

Use cases:

resolve ambiguous phrasing

add synonyms or related terminology

convert user language into terminology that better matches the indexed corpus

improve recall for broad or underspecified questions
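The terminology-mapping use case can be illustrated with a small glossary-based expander. This is a simplified stand-in: the real tool may well use an LLM to rewrite queries, and the `GLOSSARY` entries below are hypothetical examples rather than terms taken from the corpus.

```python
import re

# Hypothetical glossary mapping rider slang to manual terminology.
GLOSSARY = {
    "carb": ["carburetor"],
    "blinker": ["turn signal", "indicator"],
}


def expand_query(query: str, glossary: dict = GLOSSARY) -> list[str]:
    """Return the original query plus variants with glossary terms swapped in."""
    tokens = re.findall(r"[a-z0-9\-]+", query.lower())
    variants = [query]
    for token in tokens:
        for synonym in glossary.get(token, []):
            # Replace whole-word matches only, ignoring case.
            variant = re.sub(rf"\b{re.escape(token)}\b", synonym, query,
                             flags=re.IGNORECASE)
            if variant not in variants:
                variants.append(variant)
    return variants
```

Each variant can be sent through the search tool and the results merged, which improves recall when the user's wording differs from the manuals'.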

Budget: $1500

Timeline: 2-3 weeks

We will take a milestone-based approach: start with 3 docs, ingest them, and test.

Once the results look good, we will scale the approach and extend it to the other docs.

The POC for 3 docs should take around 3-4 days.

I expect considerable time will go into processing and verifying the long list of docs.

We might need to tweak the pipeline to handle edge cases, as each doc will have a unique layout.
