AI Engineer Summit: Scrappy Notes
✅ Main: Building Blocks
GOOD TALK - REWATCH
Evals: Foundation
The field doesn’t have a consistent eval process.
Traditional ML has standardized metrics.
LLM evaluation is a minefield.
Are we assessing the prompts? Or assessing the LLMs themselves?
Dialogue: Requires a strong LLM to do evaluation.
Automated evals: shouldn’t discount eyeballing completions (“vibe check”).
THIS IS THE TRANSFERABLE ASSET THAT YOU CAN USE.
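A minimal sketch of what an automated eval loop can look like; `call_model`, the prompts, and the graders below are hypothetical stand-ins, not anything from the talk:

```python
# Minimal automated-eval sketch. `call_model` is a stub standing in for
# whatever LLM client you actually use.
def call_model(prompt: str) -> str:
    # Hypothetical stub; replace with a real LLM call.
    return "Paris" if "capital of France" in prompt else "unknown"

def run_evals(cases):
    """Run (prompt, grader) cases and return the pass rate."""
    results = [grader(call_model(prompt)) for prompt, grader in cases]
    return sum(results) / len(results)

cases = [
    ("What is the capital of France?", lambda c: "paris" in c.lower()),
    ("What is the capital of Atlantis?", lambda c: "unknown" in c.lower()),
]
print(run_evals(cases))  # 1.0
```

The point is only the shape: a fixed case set, a grader per case, and a single pass-rate number you can track across prompt and model changes.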
RAG: Add knowledge to context so we don’t have to rely on model only.
Retrieving the right doc is hard (TOMORROW)
Fact: the LLM can’t see all the docs that are retrieved.
Reference information retrieval; lots we can learn from that field.
The LLM may not know if the doc is irrelevant.
e.g. Measure item distance, make sure it’s not too far.
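The distance guardrail above can be sketched as a similarity threshold on embeddings; the toy vectors and the 0.7 cutoff here are illustrative assumptions, not values from the talk:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def filter_relevant(query_emb, doc_embs, threshold=0.7):
    """Keep only docs whose embedding is close enough to the query."""
    return [i for i, emb in enumerate(doc_embs)
            if cosine_similarity(query_emb, emb) >= threshold]

query = [1.0, 0.0]
docs = [[0.9, 0.1], [0.0, 1.0]]   # doc 0 is close, doc 1 is orthogonal
print(filter_relevant(query, docs))  # [0]
```

Dropping far-away retrievals before they reach the prompt is one way to avoid the “LLM can’t tell the doc is irrelevant” failure mode.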
Guardrails: Make sure what we deploy is safe.
Hallucination — Detecting factual consistency
Summarization field has been focusing on this
Use NLI — GO INTO THIS
Sampling
Sample multiple completions; if they are similar, the answer is likely factual (not bulletproof but close enough).
Strong LLM
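The sampling idea above (a self-consistency check) can be sketched with a mean pairwise-similarity score; token-level Jaccard overlap here is a crude stand-in for a real semantic similarity measure:

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two completions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def consistency_score(samples):
    """Mean pairwise similarity across sampled completions.
    High score -> samples agree -> the claim is more likely factual."""
    pairs = list(combinations(samples, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

consistent = ["the sky is blue", "the sky is blue today", "sky is blue"]
print(consistency_score(consistent))  # ≈ 0.72
```

In practice you would sample at nonzero temperature and use embeddings or an NLI model for the pairwise comparison; the aggregation logic stays the same.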
Collecting Feedback
This helps to build out benchmarks.
Can’t always trust thumbs-up/down.
Implicit data flywheel
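One way to picture the implicit data flywheel: mine UI events for positive and negative signals instead of relying only on explicit thumbs. The event names below are hypothetical:

```python
from collections import Counter

# Hypothetical event names; a real product would log these from the UI.
IMPLICIT_POSITIVE = {"copied_answer", "accepted_suggestion"}
IMPLICIT_NEGATIVE = {"regenerated", "edited_heavily"}

def implicit_feedback_rate(events):
    """Fraction of implicit signals that are positive; None if no signals."""
    counts = Counter(e for e in events
                     if e in IMPLICIT_POSITIVE | IMPLICIT_NEGATIVE)
    pos = sum(counts[e] for e in IMPLICIT_POSITIVE)
    neg = sum(counts[e] for e in IMPLICIT_NEGATIVE)
    total = pos + neg
    return pos / total if total else None

events = ["copied_answer", "regenerated", "accepted_suggestion", "scrolled"]
print(implicit_feedback_rate(events))  # 2/3
```

Logged per completion, these signals become labels for the benchmark-building mentioned above, without asking the user for anything.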
Three Things
You need automated evals.
Reuse existing systems & techniques
e.g. Recommendation system (Two stage retrieval, filtering)
UX plays a large role in LLM products
Copilot UX allows them to collect user feedback a lot more frequently.
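The two-stage retrieval pattern mentioned above (cheap candidate generation, then expensive reranking/filtering) can be sketched like this; the keyword-overlap scorers are stand-ins for a real first-stage retriever and a cross-encoder reranker:

```python
def retrieve_candidates(query, corpus, k=3):
    """Stage 1: cheap recall over the whole corpus (keyword overlap here;
    BM25 or embedding search in practice)."""
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def rerank(query, candidates):
    """Stage 2: expensive precision on the short list (a stand-in scorer
    here; a cross-encoder or LLM judge in practice)."""
    q = set(query.lower().split())
    def score(d):
        words = set(d.lower().split())
        return len(q & words) / len(words)  # overlap normalized by length
    return sorted(candidates, key=score, reverse=True)

corpus = [
    "how to evaluate llm outputs",
    "evaluate retrieval quality with recall metrics",
    "cooking pasta at home",
    "llm evaluation needs automated evals",
]
top = rerank("evaluate llm", retrieve_candidates("evaluate llm", corpus))
print(top[0])  # "how to evaluate llm outputs"
```

Splitting the stages lets the cheap pass scan everything while the expensive pass only touches the top-k, which is exactly the recommendation-system trick the talk points at.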