Improving GSM8K Reasoning Accuracy in Small Language Models using Filtered Chain-of-Thought Knowledge Distillation
📊 Research Papers Table — Knowledge Distillation + GSM8K
📊 Simple Gap Summary Table
📘 Survey Paper Summary Table (Related to KD & Efficient LLMs)
🟢 Final Summary in One Sentence
This research aims to improve a small language model's reasoning accuracy on GSM8K by distilling chain-of-thought reasoning from a large teacher model, keeping only the teacher's correct and clear reasoning steps.
📘 Knowledge Distillation for GSM8K Reasoning (Simple View)
        GSM8K Dataset
     (Math Word Problems)
              │
              ▼
  ┌────────────────────────┐
  │   Large Teacher LLM    │
  │   (Strong Reasoning)   │
  │   Generates CoT Steps  │
  └───────────┬────────────┘
              │
    Filter Correct & Clear
    Chain-of-Thought Steps
              │
              ▼
  ┌────────────────────────┐
  │   Small Student LLM    │
  │   (Compressed Model)   │
  │   Learns via KD        │
  └───────────┬────────────┘
              │
              ▼
      Evaluate on GSM8K
       ─ Accuracy
       ─ Reasoning Quality
       ─ Inference Speed
🔹 Even More Simplified (One-Line Flow)
GSM8K → Teacher (Generate CoT) → Filter CoT → Distill to Small Model → Test on GSM8K
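The "Filter CoT" and "Test on GSM8K" steps of this flow can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method of any particular paper: it relies on the GSM8K convention that reference solutions end with "#### <number>", and it assumes the teacher was prompted to end its reasoning the same way. The field names (question, gold_solution, teacher_cot) and the function names are hypothetical. The sketch checks correctness only; filtering for clarity would need a separate criterion that these notes do not specify.

```python
import re
from typing import Dict, List, Optional

# GSM8K reference solutions end with a line of the form "#### <number>".
_FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_answer(text: str) -> Optional[str]:
    """Return the number after '####' with thousands separators removed, or None."""
    match = _FINAL_ANSWER.search(text)
    if match is None:
        return None
    return match.group(1).replace(",", "")


def answers_match(output: str, gold_solution: str) -> bool:
    """True when both texts contain a final answer and the two are numerically equal."""
    pred = extract_answer(output)
    gold = extract_answer(gold_solution)
    if pred is None or gold is None:
        return False
    return float(pred) == float(gold)


def filter_rationales(samples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only teacher rationales whose final answer agrees with the gold answer.

    Each sample is assumed (hypothetically) to have the keys
    'question', 'gold_solution' and 'teacher_cot'.
    """
    kept = []
    for sample in samples:
        if answers_match(sample["teacher_cot"], sample["gold_solution"]):
            # The student is later fine-tuned to produce 'target' given 'question'.
            kept.append({"question": sample["question"],
                         "target": sample["teacher_cot"]})
    return kept


def exact_match_accuracy(outputs: List[str], gold_solutions: List[str]) -> float:
    """Fraction of model outputs whose final answer matches the gold answer."""
    if not gold_solutions:
        return 0.0
    correct = sum(answers_match(o, g) for o, g in zip(outputs, gold_solutions))
    return correct / len(gold_solutions)
```

The kept question/target pairs would form the student's training set, and exact_match_accuracy applied to the student's outputs on the GSM8K test split gives the accuracy figure. Reasoning quality and inference speed need separate measurements.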
Here is the same figure explained in a very simple tabular format so that even someone without an AI background can understand it.
Dataset used in the experiment: GSM8K
Explanation of Figure in Simple Table Res Pri Prom:
Main Idea in Very Simple Words
Key Insight From the Paper
So the researchers discovered:
If the teacher AI explains its reasoning clearly, the student AI learns much better.
Simple Real-Life Analogy
The Ground Truth KD method is like a teacher who knows the correct answer and then explains the method carefully, which leads to the best learning.
✅ Final takeaway
The research shows that the way a teacher AI explains answers during training greatly affects how well a smaller AI learns.
Baseline Methods Explained CODI:
Implementation of Each Baseline:
Key Difference Between Them