Using LlamaIndex is a key building block
Provides toolkits for building applications; an orchestration layer over the various components of a RAG pipeline.
Using Ray Data as the core of the application.
Lazy (delayed) execution engine: transformations aren't actually executed until it's go time.
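The lazy-execution idea can be sketched in plain Python (this is an illustration of the concept, not the Ray Data API itself: transformations are recorded, and nothing runs until a terminal call):

```python
class LazyPipeline:
    """Toy lazy-execution pipeline: map() only records work; nothing
    executes until materialize() is called ("go time")."""

    def __init__(self, items):
        self.items = items
        self.ops = []  # recorded transformations, not yet executed

    def map(self, fn):
        self.ops.append(fn)  # record only; no computation happens here
        return self

    def materialize(self):
        # Apply every recorded op to every item, in order.
        out = list(self.items)
        for fn in self.ops:
            out = [fn(x) for x in out]
        return out

pipeline = LazyPipeline([1, 2, 3]).map(lambda x: x * 2).map(lambda x: x + 1)
# No computation has run yet; it executes only here:
result = pipeline.materialize()  # → [3, 5, 7]
```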
Create Vector DB
Need to split docs into domain-specific chunks before ingestion. The vector DB stores the chunks along with their embeddings and metadata.
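A minimal sketch of the splitting step (the character-based strategy and the chunk_size/overlap values are assumptions to tune, not from the workshop):

```python
def chunk_text(text, chunk_size=100, overlap=20):
    """Split text into fixed-size character chunks with overlap.
    Overlap helps preserve context across chunk boundaries."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "".join(str(i % 10) for i in range(250))
chunks = chunk_text(doc, chunk_size=100, overlap=20)
# 3 chunks; each chunk's tail repeats as the next chunk's head (the overlap).
```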
RAG Challenge — How to preserve context?
Workshop: Using an HTML tag as a way to group text together — Using the section tag. This will most likely change as we ingest different types of data.
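The group-by-`<section>` idea can be sketched with the standard-library parser (a minimal illustration; real documents may need nested-section and attribute handling):

```python
from html.parser import HTMLParser

class SectionGrouper(HTMLParser):
    """Collect the text inside each <section> tag as one chunk."""

    def __init__(self):
        super().__init__()
        self.sections = []
        self.in_section = False
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            self.in_section = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "section":
            self.sections.append(" ".join(self.buffer).strip())
            self.in_section = False

    def handle_data(self, data):
        if self.in_section and data.strip():
            self.buffer.append(data.strip())

parser = SectionGrouper()
parser.feed(
    "<section><h2>Intro</h2><p>Hello</p></section>"
    "<section><p>World</p></section>"
)
# parser.sections → ['Intro Hello', 'World']
```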
Chunk size is something you can tune over time. Embedding batch size also has a high impact on the embedding step.
This embed_dim (embedding_dimensions) needs to be updated based on the model you use (See data.py for more)
What is embedding dimension?
An embedding model allows you to convert a sequence of tokens into a series of numbers. The model is trained to map text into vectors, and each vector represents the semantic relationships of the text. Different models use different embeddings to capture this information. The larger the dimension, the more context it can capture. But a larger dimension doesn't always mean better performance: some smaller models can outperform larger ones on specific tasks.
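To make "embedding dimension" concrete: a model maps any text to a fixed-length vector, and that length is the dimension. The hash-based stand-in below only illustrates the shape (real models like 384-d or 1536-d ones learn semantically meaningful values):

```python
import hashlib

def toy_embed(text, dim=8):
    """Fake embedding model: maps text to a fixed-length float vector.
    Illustrative only; carries no semantic meaning."""
    digest = hashlib.sha256(text.encode()).digest()
    return [digest[i % len(digest)] / 255.0 for i in range(dim)]

short = toy_embed("RAG", dim=8)
long_ = toy_embed("a much longer sentence about retrieval", dim=8)
# Both have length 8 regardless of input length: that's the embedding dimension.
```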
Index Data Section
Need to drop the actor count in ActorPoolStrategy(...) to something lower than 8 for it to work.
Using top-k retrieval with the Postgres implementation. Need to determine what constraints we need and structure the data so retrieval can be filtered on those constraints. Strategy: do semantic search first, then do another ranking pass afterwards. Hybrid search (from experience) almost always works better when there are domain-specific keywords; in that case you're generally better off including keyword search.
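One common way to combine the semantic and keyword result lists is reciprocal rank fusion; a minimal sketch (the k=60 constant is a conventional default, not something from the workshop):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked doc-id lists into one: each list contributes
    1/(k + rank) per document, so docs ranked well everywhere win."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_a", "doc_b", "doc_c"]   # from vector search
keyword  = ["doc_c", "doc_a", "doc_d"]   # from keyword search
fused = reciprocal_rank_fusion([semantic, keyword])
# doc_a is near the top of both lists, so it ranks first.
```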
Non-Quality related — 👍 (Makes sense)
Quality-related ← THIS IS THE MEAT
This hits home 👀
Quick to get a prototype, but it takes a long time to productionize.
The dotted line → We need the confidence before going there.
Focus on quick iterations & vibe checks. This is NOT meant to be systematic, but it is useful.
Vibe check on a representative set of test queries, then move toward more systematic evaluation to gain confidence before the initial deployment.
Challenges of Systematic Evaluations
Human evaluation is not scalable. Labelled data is slow and costly to collect. It's easy to know an answer is bad, but not easy to understand how to improve it. There's also the question of end-to-end vs. component-specific (e.g. retrieval) evaluation.
LLM-as-a-judge (GPT-4, Claude 2)
High agreement with human labelers; scalable; interpretable (ask for a list of the reasoning behind the score). Pairwise / reference-guided comparison: compares against a golden answer. Single-answer grading: asks for a specific score.
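A sketch of the single-answer-grading setup from the notes (score plus a list of reasons); the template wording and function name are hypothetical illustrations, not an official prompt:

```python
# Hypothetical judge-prompt template: asks the LLM for a numeric score
# plus a list of reasons, which is what makes the grades interpretable.
JUDGE_TEMPLATE = """You are an impartial judge. Score the answer from 1-5.
Question: {question}
Reference (golden) answer: {reference}
Candidate answer: {candidate}
Respond with JSON: {{"score": <1-5>, "reasons": ["..."]}}"""

def build_judge_prompt(question, reference, candidate):
    """Fill the template; the result is sent to a judge LLM (e.g. GPT-4)."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate)

prompt = build_judge_prompt(
    "What does a vector DB store?",
    "Chunks, embeddings, and metadata.",
    "It stores embeddings.")
```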
Systematic Eval - Overview
Data for Systematic Evaluation - GOLD
Question: How do you deal with evaluations changing under you when the LLM itself is changing?
Question: Fine-tuning in Judge model?
Not much success yet: open-source models haven't worked super well for evaluation; no strong conviction. Fine-tuning: are you trying to distill a model into a smaller one? Fine-tuning GPT-4 for specific evaluation tasks, where you have a dataset to see what the evaluator looks like, does work. Distilling GPT-4 into a smaller model is not working very well (since evaluation requires more reasoning capability).
Question: How to account for bias in LLM?
Technique (might be good to adopt): track the output distribution (and constrain the allowed outputs).
Position bias (first vs. last). Verbosity bias (a known issue). Self-enhancement bias (e.g. a GPT judge prefers GPT answers).
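The track-output-distribution technique mentioned above can be sketched like this (the 1-5 label set is an assumed constraint for illustration):

```python
from collections import Counter

ALLOWED = {1, 2, 3, 4, 5}  # assumed constrained label set for the judge

def track_scores(scores):
    """Reject out-of-range judge outputs and return the score distribution.
    A skewed distribution over time can surface judge bias or drift."""
    invalid = [s for s in scores if s not in ALLOWED]
    if invalid:
        raise ValueError(f"out-of-range judge outputs: {invalid}")
    return Counter(scores)

dist = track_scores([5, 4, 5, 3, 5])
# dist → Counter({5: 3, 4: 1, 3: 1}); a heavy skew toward 5s may signal
# self-enhancement or verbosity bias worth investigating.
```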
Question: Cosine similarity
There are answers that are semantically similar but not correct. No single metric works well, so it's helpful to have multiple criteria to get a big picture of roughly how good they are. Semantic similarity is more scalable since it doesn't depend on an LLM. The embedding model matters depending on which one you picked: they are usually optimized for retrieval, not for correctness.
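For reference, the cosine-similarity computation itself is simple (a pure-Python sketch over embedding vectors):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: 1.0 for the
    same direction, 0.0 for orthogonal (unrelated) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

same = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # → 1.0 (same direction)
ortho = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # → 0.0 (orthogonal)
# Note: a semantically similar but factually wrong answer can still score
# high here, which is why the notes recommend multiple criteria.
```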
These are more information-retrieval metrics, not AI-specific.
Experimentation & Optimization
Idea: use an LLM to clean the data before feeding it into the retrieval/generation piece.
FOLLOWUP: Go through the lab around optimization; this piece is interesting.
Customizing Retrieval & Generation
Went through a couple of strategies; a few of them are pretty interesting to try out.