✅ Workshop: RAG


Build

LlamaIndex is a key building block.
Provides toolkits for building applications.
Acts as the orchestration layer tying together the various components of a RAG pipeline.
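A minimal sketch of what that orchestration layer looks like in code, assuming llama-index >= 0.10 (core imports live under llama_index.core) and the default OpenAI-backed LLM/embedding settings; the directory path and query are placeholders:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load raw documents, build a vector index, and expose a query engine.
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=5)

print(query_engine.query("How do I configure the cluster?"))
```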

Using Ray Data as the core of the application.
Ray link —
Delayed (lazy) execution engine — it doesn't actually execute until the data is consumed.
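A small sketch of that lazy behavior; the dataset and the uppercase transform are just illustrative:

```python
import ray

ds = ray.data.from_items([{"text": f"doc {i}"} for i in range(1000)])

# Nothing runs yet: map_batches only adds a step to the execution plan.
ds = ds.map_batches(
    lambda df: df.assign(text=df["text"].str.upper()),
    batch_format="pandas",
)

# Execution is triggered here, when the results are actually consumed.
print(ds.take(3))
```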

Create Vector DB
Need to split docs into domain-specific chunks before ingestion.
The vector DB stores the following for each chunk:
Text
Source
Associated embedding
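As a concrete illustration of that record, a hypothetical table in a Postgres/pgvector store (the retrieval discussion below is Postgres-based) could look like the following; the table name, connection string, and dimension are assumptions, and the embedding dimension must match the embedding model:

```python
import psycopg2

EMBED_DIM = 768  # assumed base-size embedding model; change to match yours

with psycopg2.connect("dbname=rag user=postgres") as conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute(f"""
        CREATE TABLE IF NOT EXISTS document (
            id SERIAL PRIMARY KEY,
            text TEXT,                    -- the chunk text
            source TEXT,                  -- where the chunk came from
            embedding vector({EMBED_DIM}) -- the associated embedding
        );
    """)
```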

RAG Challenge — How to preserve context?
Workshop approach: use an HTML tag to group related text together — specifically the <section> tag (see the sketch below).
This will most likely change as we ingest different types of data.
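A sketch of that section-based grouping using BeautifulSoup; the directory name is made up and the exact parsing used in the workshop code may differ:

```python
from pathlib import Path
from bs4 import BeautifulSoup

def sections_from_html(path: Path):
    """Yield one chunk per <section> tag so each chunk keeps its local context."""
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
    for section in soup.find_all("section"):
        text = section.get_text(separator=" ", strip=True)
        if text:
            yield {"text": text, "source": str(path)}

chunks = [chunk for page in Path("./html_docs").rglob("*.html")
          for chunk in sections_from_html(page)]
```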

Data Ingestion
Chunk size is something you can tune over time.
Embedding batch size — has a high impact on the embedding step (see the sketch below).
The embed_dim (embedding dimension) setting needs to be updated based on the model you use (see data.py for more).
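A sketch of those two knobs using sentence-transformers as a stand-in embedder; the model name, chunk size, batch size, and file path are illustrative rather than the workshop's values:

```python
from sentence_transformers import SentenceTransformer

CHUNK_SIZE = 512        # characters per chunk -- something to tune over time
EMBED_BATCH_SIZE = 100  # larger batches use the GPU better but need more memory

def chunk_text(text: str, size: int = CHUNK_SIZE):
    return [text[i:i + size] for i in range(0, len(text), size)]

model = SentenceTransformer("thenlper/gte-base")  # example embedding model
chunks = chunk_text(open("example_doc.txt").read())
embeddings = model.encode(chunks, batch_size=EMBED_BATCH_SIZE)
```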

What is embedding dimension?
An embedding model allows you to convert a sequence of tokens into a vector of numbers.
The model is trained to map text into vectors such that the vector captures the semantic meaning of the text.
Different models use different embedding dimensions to capture this information.
The larger the dimension, the more context it can capture.
A larger dimension doesn't always mean better performance — some smaller models can outperform larger ones on specific tasks.
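A quick way to see why embed_dim has to track the model: different embedding models produce vectors of different sizes. The two model names below are just well-known examples:

```python
from sentence_transformers import SentenceTransformer

for name in ["sentence-transformers/all-MiniLM-L6-v2", "thenlper/gte-large"]:
    model = SentenceTransformer(name)
    print(name, model.get_sentence_embedding_dimension())
    # all-MiniLM-L6-v2 -> 384, gte-large -> 1024
```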


Index Data Section
Needed to drop the size passed to ActorPoolStrategy(...) to something lower than 8 for it to work on this machine (see the sketch below).
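A sketch of where that setting lives, assuming the embedding step runs through Ray Data's map_batches; the Embedder class, model name, and batch sizes are illustrative, and newer Ray releases replace compute=ActorPoolStrategy(...) with a concurrency= argument:

```python
import ray
from ray.data import ActorPoolStrategy
from sentence_transformers import SentenceTransformer

class Embedder:
    def __init__(self):
        # One model instance per actor in the pool.
        self.model = SentenceTransformer("thenlper/gte-base")

    def __call__(self, batch):
        batch["embeddings"] = self.model.encode(list(batch["text"]), batch_size=100)
        return batch

ds = ray.data.from_items([{"text": "hello world"}] * 64)
ds = ds.map_batches(
    Embedder,
    # Dropped from 8 to 4 actors to fit the available resources.
    compute=ActorPoolStrategy(size=4),
    batch_size=32,
)
ds.materialize()  # triggers the (lazy) pipeline
```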

Retrieval Discussion
Using top-k retrieval — via the Postgres implementation (see the query sketch below).
Need to identify which constraints we have and structure the data so retrieval can be done against those constraints.
Strategy: do semantic search first, then apply another ranking step afterwards.
Hybrid search: (from experience) almost always works better where there are domain-specific keywords — in those cases you're generally better off including keyword search.
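A sketch of the top-k step against a Postgres/pgvector store like the hypothetical table above; the DSN, table name, and the choice of cosine distance are assumptions:

```python
import psycopg2

def semantic_search(query_embedding, k=5):
    # Format the embedding as a pgvector literal, e.g. "[0.1,0.2,...]".
    vec = "[" + ",".join(str(float(x)) for x in query_embedding) + "]"
    conn = psycopg2.connect("dbname=rag user=postgres")
    with conn, conn.cursor() as cur:
        # "<=>" is pgvector's cosine-distance operator; smallest distance first.
        cur.execute(
            """
            SELECT text, source
            FROM document
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec, k),
        )
        return cur.fetchall()
```

For the hybrid case, one option is to pair this with a keyword ranking (e.g., Postgres full-text search or a BM25 index) and merge or re-rank the two result lists.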


Production Challenges


Non-Quality related — 👍 (Makes sense)

Quality-related ← THIS IS THE MEAT


This hits home 👀
Quick to get a prototype — but it takes a long time to productionize.

The dotted line (on the slide) → we need confidence before going there.

Idea Validation

Focus on quick iterations & vibe check
This is NOT meant to be systematic, but it is useful.

Solution Validation

Vibe check on a representative set of test queries — moving towards more systematic evaluation to gain confidence before the initial deployment.

Challenges of Systematic Evaluations

Metrics
  No right answer
  Human evaluation is not scalable
Data availability
  Labelled data is slow & costly to collect
Actionable insight
  Easy to know it's bad, but not easy to understand how to improve
There's also the question of end-to-end evaluation vs component-specific evaluation (e.g., retrieval only).

LLM-as-a-judge (GPT-4, Claude 2)

High agreement with human labelers, scalable, interpretable — ask the judge to list its reasoning for the scores it gives.
Approaches
Pairwise / reference-guided comparison — compare against a golden answer
Single-answer grading — ask for a specific score (see the sketch below)
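A minimal sketch of single-answer grading with an LLM judge, using the OpenAI Python client; the prompt wording, score scale, and model name are illustrative rather than the workshop's exact setup:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an answer to a question about our docs.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Give a score from 1 (wrong) to 5 (fully correct), then list your reasoning.
Respond as: score: <n>, reasoning: <bullet points>."""

def judge(question, reference, candidate, model="gpt-4"):
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, candidate=candidate
            ),
        }],
    )
    return response.choices[0].message.content
```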

Systematic Eval - Overview


Data for Systematic Evaluation - GOLD





Question: How do you deal with the evaluation shifting under you when the LLM itself keeps changing?
No great solution

Question: Fine-tuning the judge model?
Not much success yet — open-source models haven't worked super well as evaluators so far. No strong conviction.
Follow-up: Why does it work poorly?
Fine-tuning — it depends on whether you are trying to distill the judge into a smaller model.
Fine-tuning GPT-4 for specific evaluation tasks, where you have a dataset of what the evaluator output should look like → this works.
Distilling GPT-4 into a smaller model — not working very well (since evaluation requires more reasoning capability).

Question: How do you account for bias in the LLM judge?
Technique (might be good to adopt): constrain the judge's allowed outputs and track the output distribution (see the sketch after the bias list below).

Known LLM-as-a-judge biases:
Position bias (first vs last)
Verbosity bias (known issue)
Self-enhancement bias (e.g., a GPT judge prefers GPT answers)
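A sketch of the two mitigations above: constrain and track the judge's verdicts, and run pairwise comparisons in both orders to cancel position bias. judge_pair() is a hypothetical helper that returns "A", "B", or "tie" for the two answers in the order they are shown:

```python
from collections import Counter

def debiased_pairwise(judge_pair, question, answer_a, answer_b):
    first = judge_pair(question, answer_a, answer_b)   # answer_a shown first
    second = judge_pair(question, answer_b, answer_a)  # order swapped
    # Only count a win if it survives the swap; otherwise call it a tie.
    if first == "A" and second == "B":
        return "A"
    if first == "B" and second == "A":
        return "B"
    return "tie"

verdicts = Counter()
# verdicts.update(debiased_pairwise(judge_pair, q, a, b) for q, a, b in eval_set)
# A heavily skewed distribution (e.g., "A" winning 95% of comparisons) is a
# signal to inspect the judge prompt and setup for bias.
```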

Question: What about cosine similarity as a metric?
There are answers that are semantically similar but not correct — no single metric works well, so it's helpful to have multiple criteria to get a big-picture view of roughly how good the answers are.
Semantic similarity is more scalable since it doesn't depend on an LLM.
It also depends on which embedding model you picked — they are usually optimized for retrieval, not for judging correctness.

Search-based Evaluation
These are classic information-retrieval metrics — not LLM-specific (see the sketch below).
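A minimal sketch of two such metrics, hit rate and MRR, over a labelled set where each query has one known gold chunk; the data below is made up:

```python
def hit_rate_and_mrr(results):
    """results: list of (gold_id, ranked_retrieved_ids) pairs."""
    hits, reciprocal_ranks = 0, 0.0
    for gold_id, retrieved_ids in results:
        if gold_id in retrieved_ids:
            hits += 1
            reciprocal_ranks += 1.0 / (retrieved_ids.index(gold_id) + 1)
    n = len(results)
    return hits / n, reciprocal_ranks / n

# Gold chunk "c1" retrieved at rank 2 for the first query; miss for the second.
print(hit_rate_and_mrr([("c1", ["c9", "c1", "c4"]), ("c2", ["c7", "c8"])]))
# -> (0.5, 0.25)
```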


Experimentation & Optimization


Idea: use an LLM to clean the data before feeding it into the retrieval/generation pieces.

FOLLOW-UP: Go through the lab around optimization — this piece is interesting.


Customizing Retrieval & Generation

Went through a couple of strategies — a few of them are pretty interesting to try out.




Followup
| # | Category        | Items                      |
|---|-----------------|----------------------------|
| 1 | LlamaIndex, Ray | Who uses it? Compliance?   |
| 2 | LlamaIndex      | Can it be hosted in-house? |
| 3 | LlamaIndex      | Product offering —         |
