✅ Main: Building Blocks

GOOD TALK - REWATCH

Evals: Foundation
Most teams don't have a consistent eval process.
Traditional ML has standardized metrics.
LLM evaluation is a minefield.
Are we assessing the prompts, or the LLMs themselves?
Dialogue: requires a strong LLM to act as the evaluator (LLM-as-judge). See the sketch below.
Automated evals: still shouldn't discount eyeballing completions ("vibe check").
EVALS ARE THE TRANSFERABLE ASSET YOU CAN REUSE.
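A minimal sketch of an automated eval loop with an LLM-as-judge, assuming the OpenAI Python client (v1+); the model names, judge prompt, and 1-5 rubric are assumptions, not from the talk:

```python
# Automated-eval sketch: grade completions against references with an
# LLM-as-judge. Model names here are assumptions; swap in your own provider.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """Rate how well the answer matches the reference on a 1-5 scale.
Question: {question}
Reference: {reference}
Answer: {answer}
Reply with a single integer only."""

def generate(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # model under test (assumption)
        messages=[{"role": "user", "content": question}],
        temperature=0,
    )
    return resp.choices[0].message.content

def judge(question: str, reference: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o",  # stronger judge model (assumption)
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())

test_cases = [
    ("What does RAG stand for?", "Retrieval-augmented generation"),
]

scores = [judge(q, ref, generate(q)) for q, ref in test_cases]
print(f"mean score: {sum(scores) / len(scores):.2f}")
```

Even with a loop like this, keep eyeballing a sample of completions; the judge itself needs a vibe check.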
RAG: Add knowledge to the context so we don't have to rely on the model's internal knowledge alone.
Retrieving the right doc is hard (TOMORROW)
Fact: The LLM can't tell whether all the relevant docs were retrieved.
Reference the information retrieval literature; there's a lot we can learn from it.
The LLM may not know when a retrieved doc is irrelevant.
e.g. Measure the query-document embedding distance and make sure it's not too far (see the sketch below).
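A minimal sketch of that distance check, assuming sentence-transformers; the model name and the 0.6 cutoff are assumptions to be tuned on your own data:

```python
# Relevance-threshold sketch: drop retrieved docs whose embedding is too far
# from the query, so obviously-irrelevant docs never reach the LLM.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model (assumption)

def filter_by_distance(query: str, docs: list[str], max_distance: float = 0.6) -> list[str]:
    query_emb = model.encode(query, convert_to_tensor=True)
    doc_embs = model.encode(docs, convert_to_tensor=True)
    sims = util.cos_sim(query_emb, doc_embs)[0]  # cosine similarity per doc
    # keep docs whose cosine distance (1 - similarity) is within the cutoff
    return [doc for doc, sim in zip(docs, sims) if (1 - sim.item()) <= max_distance]

docs = [
    "RAG adds retrieved documents to the prompt context.",
    "The 2014 World Cup final was played in Rio de Janeiro.",
]
print(filter_by_distance("How does retrieval-augmented generation work?", docs))
```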
Guardrails: Make sure what we deploy is safe.
Hallucination: detecting factual consistency
The summarization field has been focusing on this.
Use NLI (natural language inference) to check that the output is entailed by the source. GO INTO THIS. See the sketch after this list.
Sampling
Generate multiple samples; if they agree with each other, the output is more likely to be factual (not bulletproof, but close enough). See the sketch after this list.
Strong LLM: use a stronger LLM to judge factuality.
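A minimal sketch combining the sampling and NLI ideas above: sample several completions, then check that they do not contradict the answer you plan to ship. The NLI model name, label strings, and threshold are assumptions:

```python
# Factual-consistency sketch: use an off-the-shelf NLI model to check whether
# sampled completions contradict the candidate answer.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")  # assumption

def consistency_score(answer: str, samples: list[str]) -> float:
    """Fraction of sampled completions that do not contradict the answer."""
    results = nli([{"text": s, "text_pair": answer} for s in samples])
    return sum(r["label"] != "CONTRADICTION" for r in results) / len(samples)

samples = [
    "The Eiffel Tower is in Paris, France.",
    "You can find the Eiffel Tower in Paris.",
    "The Eiffel Tower is located in Berlin.",
]
score = consistency_score("The Eiffel Tower is in Paris.", samples)
print(f"consistency: {score:.2f}")  # flag the answer if this falls below ~0.7
```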
Collecting Feedback
This helps to build out benchmarks.
Can't always trust explicit thumbs-up/down signals.
Implicit feedback is what drives the data flywheel.

Three Things
You need automated evals.
Reuse existing systems & techniques
e.g. Recommendation systems (two-stage retrieval and ranking, filtering); see the sketch after this list.
UX plays a large role in LLM products
Copilot's UX lets them collect user feedback far more frequently.
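A minimal sketch of the two-stage retrieval pattern borrowed from recommender systems, assuming sentence-transformers; the bi-encoder/cross-encoder model names and the candidate cutoffs are assumptions:

```python
# Two-stage retrieval sketch: a cheap bi-encoder fetches a broad candidate set,
# then a more expensive cross-encoder reranks the top few before they go into
# the LLM context.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")                   # stage 1: recall
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # stage 2: precision

def retrieve(query: str, corpus: list[str], recall_k: int = 20, final_k: int = 3) -> list[str]:
    corpus_embs = bi_encoder.encode(corpus, convert_to_tensor=True)
    query_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_embs, top_k=recall_k)[0]
    candidates = [corpus[h["corpus_id"]] for h in hits]
    scores = cross_encoder.predict([(query, c) for c in candidates])
    reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
    return reranked[:final_k]
```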