✅ Workshop: RAG


Build

LlamaIndex is a key building block.
Provides toolkits for building applications.
Acts as the orchestration layer tying together the various components of a RAG pipeline.
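A minimal sketch of what that orchestration layer looks like in code, assuming llama-index >= 0.10 (core imports live under llama_index.core) and the default OpenAI-backed LLM/embedding settings; the directory path and query are placeholders:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load raw documents, build a vector index, and expose a query engine.
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=5)

print(query_engine.query("How do I configure the cluster?"))
```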

Using Ray Data as the core of the application.
Ray link —
Delayed (lazy) execution engine — it doesn't actually execute until the data is consumed.
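A small sketch of that lazy behavior; the dataset and the uppercase transform are just illustrative:

```python
import ray

ds = ray.data.from_items([{"text": f"doc {i}"} for i in range(1000)])

# Nothing runs yet: map_batches only adds a step to the execution plan.
ds = ds.map_batches(
    lambda df: df.assign(text=df["text"].str.upper()),
    batch_format="pandas",
)

# Execution is triggered here, when the results are actually consumed.
print(ds.take(3))
```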

Create Vector DB
Need to split docs into domain-specific chunks before ingestion.
The vector DB stores the following for each chunk:
Text
Source
Associated embedding
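As a concrete illustration of that record, a hypothetical table in a Postgres/pgvector store (the retrieval discussion below is Postgres-based) could look like the following; the table name, connection string, and dimension are assumptions, and the embedding dimension must match the embedding model:

```python
import psycopg2

EMBED_DIM = 768  # assumed base-size embedding model; change to match yours

with psycopg2.connect("dbname=rag user=postgres") as conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute(f"""
        CREATE TABLE IF NOT EXISTS document (
            id SERIAL PRIMARY KEY,
            text TEXT,                    -- the chunk text
            source TEXT,                  -- where the chunk came from
            embedding vector({EMBED_DIM}) -- the associated embedding
        );
    """)
```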

RAG Challenge — How to preserve context?
Workshop approach: use an HTML tag to group related text together — specifically the <section> tag (see the sketch below).
This will most likely change as we ingest different types of data.
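A sketch of that section-based grouping using BeautifulSoup; the directory name is made up and the exact parsing used in the workshop code may differ:

```python
from pathlib import Path
from bs4 import BeautifulSoup

def sections_from_html(path: Path):
    """Yield one chunk per <section> tag so each chunk keeps its local context."""
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
    for section in soup.find_all("section"):
        text = section.get_text(separator=" ", strip=True)
        if text:
            yield {"text": text, "source": str(path)}

chunks = [chunk for page in Path("./html_docs").rglob("*.html")
          for chunk in sections_from_html(page)]
```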

Data Ingestion
Chunk size is something you can tune over time.
Embedding batch size — has a high impact on the embedding step (see the sketch below).
The embed_dim (embedding dimension) setting needs to be updated based on the model you use (see data.py for more).
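A sketch of those two knobs using sentence-transformers as a stand-in embedder; the model name, chunk size, batch size, and file path are illustrative rather than the workshop's values:

```python
from sentence_transformers import SentenceTransformer

CHUNK_SIZE = 512        # characters per chunk -- something to tune over time
EMBED_BATCH_SIZE = 100  # larger batches use the GPU better but need more memory

def chunk_text(text: str, size: int = CHUNK_SIZE):
    return [text[i:i + size] for i in range(0, len(text), size)]

model = SentenceTransformer("thenlper/gte-base")  # example embedding model
chunks = chunk_text(open("example_doc.txt").read())
embeddings = model.encode(chunks, batch_size=EMBED_BATCH_SIZE)
```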

What is embedding dimension?
An embedding model allows you to convert a sequence of tokens into a vector of numbers.
The model is trained to map text into vectors such that the vector captures the semantic meaning of the text.
Different models use different embedding dimensions to capture this information.
The larger the dimension, the more context it can capture.
A larger dimension doesn't always mean better performance — some smaller models can outperform larger ones on specific tasks.
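A quick way to see why embed_dim has to track the model: different embedding models produce vectors of different sizes. The two model names below are just well-known examples:

```python
from sentence_transformers import SentenceTransformer

for name in ["sentence-transformers/all-MiniLM-L6-v2", "thenlper/gte-large"]:
    model = SentenceTransformer(name)
    print(name, model.get_sentence_embedding_dimension())
    # all-MiniLM-L6-v2 -> 384, gte-large -> 1024
```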


Index Data Section
Needed to drop the size passed to ActorPoolStrategy(...) to something lower than 8 for it to work on this machine (see the sketch below).
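A sketch of where that setting lives, assuming the embedding step runs through Ray Data's map_batches; the Embedder class, model name, and batch sizes are illustrative, and newer Ray releases replace compute=ActorPoolStrategy(...) with a concurrency= argument:

```python
import ray
from ray.data import ActorPoolStrategy
from sentence_transformers import SentenceTransformer

class Embedder:
    def __init__(self):
        # One model instance per actor in the pool.
        self.model = SentenceTransformer("thenlper/gte-base")

    def __call__(self, batch):
        batch["embeddings"] = self.model.encode(list(batch["text"]), batch_size=100)
        return batch

ds = ray.data.from_items([{"text": "hello world"}] * 64)
ds = ds.map_batches(
    Embedder,
    # Dropped from 8 to 4 actors to fit the available resources.
    compute=ActorPoolStrategy(size=4),
    batch_size=32,
)
ds.materialize()  # triggers the (lazy) pipeline
```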

Retrieval Discussion
Using top-k retrieval — via the Postgres implementation (see the query sketch below).
Need to identify which constraints we have and structure the data so retrieval can be done against those constraints.
Strategy: do semantic search first, then apply another ranking step afterwards.
Hybrid search: (from experience) almost always works better where there are domain-specific keywords — in those cases you're generally better off including keyword search.
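A sketch of the top-k step against a Postgres/pgvector store like the hypothetical table above; the DSN, table name, and the choice of cosine distance are assumptions:

```python
import psycopg2

def semantic_search(query_embedding, k=5):
    # Format the embedding as a pgvector literal, e.g. "[0.1,0.2,...]".
    vec = "[" + ",".join(str(float(x)) for x in query_embedding) + "]"
    conn = psycopg2.connect("dbname=rag user=postgres")
    with conn, conn.cursor() as cur:
        # "<=>" is pgvector's cosine-distance operator; smallest distance first.
        cur.execute(
            """
            SELECT text, source
            FROM document
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec, k),
        )
        return cur.fetchall()
```

For the hybrid case, one option is to pair this with a keyword ranking (e.g., Postgres full-text search or a BM25 index) and merge or re-rank the two result lists.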


Production Challenges


Non-Quality related — 👍 (Makes sense)

Quality-related ← THIS IS THE MEAT


This hits home 👀
Quick to get a prototype — but it takes a long time to productionize.

The dotted line (on the slide) → we need confidence before going there.

Idea Validation

Focus on quick iterations & vibe check
This is NOT meant to be systematic, but it is useful.

Solution Validation

Vibe check on a representative set of test queries — moving towards more systematic evaluation to gain confidence before the initial deployment.

Challenges of Systematic Evaluations

Metrics
  No right answer
  Human evaluation is not scalable
Data availability
  Labelled data is slow & costly to collect
Actionable insight
  Easy to know it's bad, but not easy to understand how to improve
There's also the question of end-to-end evaluation vs component-specific evaluation (e.g., retrieval only).

LLM-as-a-judge (GPT-4, Claude 2)

High agreement with human labelers, scalable, interpretable — ask the judge to list its reasoning for the scores it gives.
Approaches
Pairwise / reference-guided comparison — compare against a golden answer
Single-answer grading — ask for a specific score (see the sketch below)
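A minimal sketch of single-answer grading with an LLM judge, using the OpenAI Python client; the prompt wording, score scale, and model name are illustrative rather than the workshop's exact setup:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an answer to a question about our docs.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Give a score from 1 (wrong) to 5 (fully correct), then list your reasoning.
Respond as: score: <n>, reasoning: <bullet points>."""

def judge(question, reference, candidate, model="gpt-4"):
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, candidate=candidate
            ),
        }],
    )
    return response.choices[0].message.content
```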

Systematic Eval - Overview


Data for Systematic Evaluation - GOLD





Question: How do you deal with the evaluation shifting under you when the LLM itself keeps changing?
No great solution

Question: Fine-tuning the judge model?
Not much success yet — open-source models haven't worked super well as evaluators so far. No strong conviction.
Follow-up: Why does it work poorly?
Fine-tuning — it depends on whether you are trying to distill the judge into a smaller model.
Fine-tuning GPT-4 for specific evaluation tasks, where you have a dataset of what the evaluator output should look like → this works.
Distilling GPT-4 into a smaller model — not working very well (since evaluation requires more reasoning capability).

Question: How do you account for bias in the LLM judge?
Technique (might be good to adopt): constrain the judge's allowed outputs and track the output distribution (see the sketch after the bias list below).

Known LLM-as-a-judge biases:
Position bias (first vs last)
Verbosity bias (known issue)
Self-enhancement bias (e.g., a GPT judge prefers GPT answers)
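A sketch of the two mitigations above: constrain and track the judge's verdicts, and run pairwise comparisons in both orders to cancel position bias. judge_pair() is a hypothetical helper that returns "A", "B", or "tie" for the two answers in the order they are shown:

```python
from collections import Counter

def debiased_pairwise(judge_pair, question, answer_a, answer_b):
    first = judge_pair(question, answer_a, answer_b)   # answer_a shown first
    second = judge_pair(question, answer_b, answer_a)  # order swapped
    # Only count a win if it survives the swap; otherwise call it a tie.
    if first == "A" and second == "B":
        return "A"
    if first == "B" and second == "A":
        return "B"
    return "tie"

verdicts = Counter()
# verdicts.update(debiased_pairwise(judge_pair, q, a, b) for q, a, b in eval_set)
# A heavily skewed distribution (e.g., "A" winning 95% of comparisons) is a
# signal to inspect the judge prompt and setup for bias.
```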

Question: What about cosine similarity as a metric?
There are answers that are semantically similar but not correct — no single metric works well, so it's helpful to have multiple criteria to get a big-picture view of roughly how good the answers are.
Semantic similarity is more scalable since it doesn't depend on an LLM.
It also depends on which embedding model you picked — they are usually optimized for retrieval, not for judging correctness.

Search-based Evaluation
These are classic information-retrieval metrics — not LLM-specific (see the sketch below).
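A minimal sketch of two such metrics, hit rate and MRR, over a labelled set where each query has one known gold chunk; the data below is made up:

```python
def hit_rate_and_mrr(results):
    """results: list of (gold_id, ranked_retrieved_ids) pairs."""
    hits, reciprocal_ranks = 0, 0.0
    for gold_id, retrieved_ids in results:
        if gold_id in retrieved_ids:
            hits += 1
            reciprocal_ranks += 1.0 / (retrieved_ids.index(gold_id) + 1)
    n = len(results)
    return hits / n, reciprocal_ranks / n

# Gold chunk "c1" retrieved at rank 2 for the first query; miss for the second.
print(hit_rate_and_mrr([("c1", ["c9", "c1", "c4"]), ("c2", ["c7", "c8"])]))
# -> (0.5, 0.25)
```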


Experimentation & Optimization


Idea: use an LLM to clean the data before feeding it into the retrieval/generation pieces.

FOLLOW-UP: Go through the lab around optimization — this piece is interesting.


Customizing Retrieval & Generation

Went through a couple of strategies — a few of them are pretty interesting to try out.




Followup
| # | Category        | Items                      |
|---|-----------------|----------------------------|
| 1 | LlamaIndex, Ray | Who uses it? Compliance?   |
| 2 | LlamaIndex      | Can it be hosted in-house? |
| 3 | LlamaIndex      | Product offering —         |
