✅ Main: Building Blocks

GOOD TALK - REWATCH

Evals: Foundation
Most teams don't have a consistent eval process.
Traditional ML has standardized metrics.
LLM evaluation is a minefield.
Are we assessing the prompts, or the LLMs themselves?
Dialogue: requires a strong LLM to act as the evaluator (LLM-as-judge). See the sketch below.
Automated evals: still shouldn't discount eyeballing completions ("vibe check").
EVALS ARE THE TRANSFERABLE ASSET YOU CAN REUSE.
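A minimal sketch of an automated eval loop with an LLM-as-judge, assuming the OpenAI Python client (v1+); the model names, judge prompt, and 1-5 rubric are assumptions, not from the talk:

```python
# Automated-eval sketch: grade completions against references with an
# LLM-as-judge. Model names here are assumptions; swap in your own provider.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """Rate how well the answer matches the reference on a 1-5 scale.
Question: {question}
Reference: {reference}
Answer: {answer}
Reply with a single integer only."""

def generate(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # model under test (assumption)
        messages=[{"role": "user", "content": question}],
        temperature=0,
    )
    return resp.choices[0].message.content

def judge(question: str, reference: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o",  # stronger judge model (assumption)
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())

test_cases = [
    ("What does RAG stand for?", "Retrieval-augmented generation"),
]

scores = [judge(q, ref, generate(q)) for q, ref in test_cases]
print(f"mean score: {sum(scores) / len(scores):.2f}")
```

Even with a loop like this, keep eyeballing a sample of completions; the judge itself needs a vibe check.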
RAG: Add knowledge to the context so we don't have to rely on the model's internal knowledge alone.
Retrieving the right doc is hard (TOMORROW)
Fact: The LLM can't tell whether all the relevant docs were retrieved.
Reference the information retrieval literature; there's a lot we can learn from it.
The LLM may not know when a retrieved doc is irrelevant.
e.g. Measure the query-document embedding distance and make sure it's not too far (see the sketch below).
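A minimal sketch of that distance check, assuming sentence-transformers; the model name and the 0.6 cutoff are assumptions to be tuned on your own data:

```python
# Relevance-threshold sketch: drop retrieved docs whose embedding is too far
# from the query, so obviously-irrelevant docs never reach the LLM.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model (assumption)

def filter_by_distance(query: str, docs: list[str], max_distance: float = 0.6) -> list[str]:
    query_emb = model.encode(query, convert_to_tensor=True)
    doc_embs = model.encode(docs, convert_to_tensor=True)
    sims = util.cos_sim(query_emb, doc_embs)[0]  # cosine similarity per doc
    # keep docs whose cosine distance (1 - similarity) is within the cutoff
    return [doc for doc, sim in zip(docs, sims) if (1 - sim.item()) <= max_distance]

docs = [
    "RAG adds retrieved documents to the prompt context.",
    "The 2014 World Cup final was played in Rio de Janeiro.",
]
print(filter_by_distance("How does retrieval-augmented generation work?", docs))
```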
Guardrails: Make sure what we deploy is safe.
Hallucination: detecting factual consistency
The summarization field has been focusing on this.
Use NLI (natural language inference) to check that the output is entailed by the source. GO INTO THIS. See the sketch after this list.
Sampling
Generate multiple samples; if they agree with each other, the output is more likely to be factual (not bulletproof, but close enough). See the sketch after this list.
Strong LLM: use a stronger LLM to judge factuality.
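A minimal sketch combining the sampling and NLI ideas above: sample several completions, then check that they do not contradict the answer you plan to ship. The NLI model name, label strings, and threshold are assumptions:

```python
# Factual-consistency sketch: use an off-the-shelf NLI model to check whether
# sampled completions contradict the candidate answer.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")  # assumption

def consistency_score(answer: str, samples: list[str]) -> float:
    """Fraction of sampled completions that do not contradict the answer."""
    results = nli([{"text": s, "text_pair": answer} for s in samples])
    return sum(r["label"] != "CONTRADICTION" for r in results) / len(samples)

samples = [
    "The Eiffel Tower is in Paris, France.",
    "You can find the Eiffel Tower in Paris.",
    "The Eiffel Tower is located in Berlin.",
]
score = consistency_score("The Eiffel Tower is in Paris.", samples)
print(f"consistency: {score:.2f}")  # flag the answer if this falls below ~0.7
```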
Collecting Feedback
This helps to build out benchmarks.
Can't always trust explicit thumbs-up/down signals.
Implicit feedback is what drives the data flywheel.

Three Things
You need automated evals.
Reuse existing systems & techniques
e.g. Recommendation systems (two-stage retrieval and ranking, filtering); see the sketch after this list.
UX plays a large role in LLM products
Copilot's UX lets them collect user feedback far more frequently.
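A minimal sketch of the two-stage retrieval pattern borrowed from recommender systems, assuming sentence-transformers; the bi-encoder/cross-encoder model names and the candidate cutoffs are assumptions:

```python
# Two-stage retrieval sketch: a cheap bi-encoder fetches a broad candidate set,
# then a more expensive cross-encoder reranks the top few before they go into
# the LLM context.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")                   # stage 1: recall
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # stage 2: precision

def retrieve(query: str, corpus: list[str], recall_k: int = 20, final_k: int = 3) -> list[str]:
    corpus_embs = bi_encoder.encode(corpus, convert_to_tensor=True)
    query_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_embs, top_k=recall_k)[0]
    candidates = [corpus[h["corpus_id"]] for h in hits]
    scores = cross_encoder.predict([(query, c) for c in candidates])
    reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
    return reranked[:final_k]
```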