
Knowledge Distillation for LLM Compression & Efficient Reasoning

- Improving GSM8K Reasoning Accuracy in Small Language Models using Filtered Chain-of-Thought Knowledge Distillation


📊 Research Papers Table — Knowledge Distillation + GSM8K

| Paper Title | Dataset Used | Teacher Model | Student Model | GSM8K Accuracy | Code Link | Paper Link | Research Gap | Published / Submission Date |
|---|---|---|---|---|---|---|---|---|
| Implicit Chain-of-Thought Reasoning via Knowledge Distillation | GSM8K | Explicit-CoT-trained teacher | Implicit-reasoning student | ≈ 22 % on GSM8K | — | — | Limited performance gains compared to explicit CoT; struggles to scale implicit reasoning to higher-capacity models; limited evaluation beyond GSM8K; lacks analysis of reasoning-interpretability trade-offs. | Submitted 2 Nov 2023 |
| CODI: Compressing Chain-of-Thought via Self-Distillation | GSM8K | Explicit CoT (joint teacher/student, self-distilled) | Same model as the teacher, in implicit representation | Matches explicit-CoT performance (28.2 % increase over the previous best in implicit distillation); interpretable GSM8K results shown | — | — | Focused mainly on compression rather than efficiency/speed trade-offs; unclear generalization to multi-hop or out-of-domain reasoning tasks; limited study of smaller student models. | Published November 2025 (EMNLP 2025 proceedings) |
| Teaching Small Language Models to Reason (related KD work) | GSM8K | Large teacher (e.g., PaLM-540B) | T5-XXL | ~21.99 % improvement over baseline | Not officially released | — | Heavy dependence on very large proprietary teachers; no open-source reproducibility; limited exploration of implicit reasoning vs. explicit-CoT compression; computational cost not fully addressed. | Submitted 16 Dec 2022 (latest version 1 Jun 2023) |
| Enhancing KD with Response-Priming Prompting (2024) | GSM8K | LLaMA-3.1 405B teacher | LLaMA-3.1 8B student | ~55 % performance increase vs. baseline distillation | — | — | Relies on extremely large teacher models; unclear robustness on datasets beyond GSM8K; lacks comparison with non-CoT compression methods; limited ablation on prompt sensitivity. | Submitted 18 Dec 2024 |

📊 Simple Gap Summary Table

| Gap | Current Limitation | Your Opportunity |
|---|---|---|
| Accuracy Gap | Student far below teacher | Improve reasoning transfer |
| Efficiency Balance | No joint evaluation | Optimize both accuracy + speed |
| Teacher Errors | No filtering | Confidence-based filtering |
| Multi-Teacher | Rarely explored | Weighted multi-teacher KD |
| Hidden-State KD | Low performance | Combine explicit + implicit |
| Statistical Rigor | Weak validation | Add proper statistical testing |
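
The "Teacher Errors" row above suggests filtering teacher rationales before distillation. Below is a minimal sketch of one way a confidence-based filter could work, assuming the teacher returns per-token log-probabilities; the record layout, the answer-extraction regex, and the threshold value are illustrative assumptions, not taken from any of the papers above.

```python
import re
from statistics import mean

# Hypothetical record format: each teacher sample carries the generated CoT,
# per-token log-probabilities, and the gold answer from GSM8K.
teacher_samples = [
    {"question": "Tom has 3 bags with 4 apples each. How many apples?",
     "cot": "3 bags x 4 apples = 12 apples. The answer is 12.",
     "token_logprobs": [-0.11, -0.32, -0.05, -0.21, -0.08],
     "gold_answer": "12"},
]

ANSWER_RE = re.compile(r"answer is\s*(-?\d+)")   # assumed answer format
CONFIDENCE_THRESHOLD = -0.5                      # assumed cutoff on mean log-prob

def extract_answer(cot: str) -> str | None:
    m = ANSWER_RE.search(cot.lower())
    return m.group(1) if m else None

def keep(sample: dict) -> bool:
    """Keep a rationale only if it is both confident and ends in the gold answer."""
    confident = mean(sample["token_logprobs"]) >= CONFIDENCE_THRESHOLD
    correct = extract_answer(sample["cot"]) == sample["gold_answer"]
    return confident and correct

filtered = [s for s in teacher_samples if keep(s)]
print(f"kept {len(filtered)} / {len(teacher_samples)} teacher rationales")
```

Filtering on both agreement with the gold answer and teacher confidence is one plausible way to address the "Teacher Errors" gap; the threshold would need tuning on a held-out split.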

📘 Survey Paper Summary Table (Related to KD & Efficient LLMs)

| Paper Title | Authors | Year | Core Focus | Relevance to KD + Reasoning/GSM8K | Key Highlights | Exact Link |
|---|---|---|---|---|---|---|
| A Survey on Knowledge Distillation of Large Language Models | Xiaohan Xu et al. | 2024 | KD methods for compressing and enhancing LLMs | High — covers general KD methods and model compression, applicable to reasoning-task transfer | Surveys KD algorithms and techniques tailored to LLMs; discusses KD for transferring capabilities from large proprietary LLMs to smaller models; highlights the role of KD in model compression and skill transfer; organized around algorithm, skill, and verticalization perspectives | — |
| Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application | Chuanpeng Yang et al. | 2024 | KD methods, evaluation, and applications in LLM compression | High — classifies KD techniques; discusses evaluation and practical applications | Categorizes KD into white-box and black-box approaches; discusses how KD affects inference speed and model scalability; covers evaluation tasks related to compressed-LLM performance | — |
| Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions | Luyang Fang et al. | 2025 | KD + dataset distillation for efficient LLMs | Medium-high — covers more advanced KD trends; mentions capability-preservation strategies | Combines KD and dataset distillation to address scalability; highlights challenges in preserving reasoning and linguistic abilities during compression; useful for future research directions | — |
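
The white-box vs. black-box split highlighted in the Yang et al. survey can be made concrete with the two loss signals below. This is a minimal PyTorch sketch with random tensors standing in for real model outputs; the shapes, the temperature value, and the variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Dummy shapes: batch of 2 sequences, 5 positions, vocabulary of 100 (illustrative only).
student_logits = torch.randn(2, 5, 100)
teacher_logits = torch.randn(2, 5, 100)            # white-box: full teacher distribution
teacher_token_ids = torch.randint(0, 100, (2, 5))  # black-box: only the teacher's text

T = 2.0  # softmax temperature, a common KD hyper-parameter

# White-box KD: KL divergence between temperature-softened distributions.
white_box_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T * T)

# Black-box KD: ordinary cross-entropy on the teacher-generated tokens,
# i.e. supervised fine-tuning of the student on teacher outputs.
black_box_loss = F.cross_entropy(
    student_logits.reshape(-1, 100), teacher_token_ids.reshape(-1)
)

print(white_box_loss.item(), black_box_loss.item())
```

White-box KD needs access to the teacher's logits (open-weight teachers), while black-box KD works with API-only teachers, which is one reason much CoT-distillation work relies on teacher-generated text rather than logits.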



Application Area: Knowledge Distillation for LLM Compression & Efficient Reasoning

Paper 1: Distilling Step-by-Step: Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes, Hsieh et al., 2023, arXiv
Paper 2: Self-Distillation Improves Chain-of-Thought Reasoning in Language Models, Magister et al., 2023, arXiv
Paper 3: MiniLLM: Knowledge Distillation of Large Language Models, Gu et al., 2023, arXiv
Paper 4: GKD: Generalized Knowledge Distillation for Large Language Models, Agarwal et al., 2023, arXiv
Paper 5: LLM-Pruner: On the Structural Pruning of Large Language Models, Ma et al., 2023, NeurIPS Workshop

Research Gap: Current KD approaches for LLMs focus mainly on output logits or reasoning traces but insufficiently address (1) alignment preservation during distillation, (2) multi-teacher ensemble distillation, (3) domain-specific LLM compression for low-resource languages, and (4) energy-efficient distillation metrics. Few works integrate reasoning fidelity, efficiency, and safety simultaneously.

Research Flow:
1. Conduct a detailed literature review of LLM distillation techniques (logit-based, feature-based, reasoning-based).
2. Define the problem statement (e.g., "Efficient domain-specific LLM via multi-teacher reasoning-aware KD").
3. Select a teacher model (e.g., a 7B–13B LLM).
4. Design the student model architecture (a smaller transformer).
5. Implement KD variants: vanilla KD, response-based KD, chain-of-thought KD, and multi-teacher KD (a weighted multi-teacher sketch follows below).
6. Evaluate on reasoning, efficiency, latency, memory, and alignment metrics.
7. Perform ablation studies.
8. Compare with pruning/quantization baselines.
9. Document reproducibility and statistical validation.

Recommended Datasets: GSM8K (math reasoning), MMLU (multitask reasoning), Alpaca Instruction Dataset, FLAN Collection, WikiText-103

Dataset Sources: —

Publication/Completion Advice: Clearly define the novelty (e.g., "Reasoning-aware multi-teacher KD for domain-adapted LLMs"). Include strong baselines. Report latency, FLOPs, and memory reduction. Use statistical significance tests. Open-source the code for credibility. Target venues: EMNLP, ACL, NeurIPS workshops, IEEE Access, Expert Systems with Applications. Maintain a structured thesis: Problem → Literature → Method → Experiments → Analysis → Conclusion. Ensure reproducibility and ethical considerations.
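
As referenced in step 5 of the research flow, here is a minimal sketch of a weighted multi-teacher soft-label loss. The logits are random placeholders and the per-teacher weights are an assumption (they could, for example, be set from each teacher's validation accuracy); this illustrates the idea rather than reproducing any published recipe.

```python
import torch
import torch.nn.functional as F

# Illustrative logits: 2 teachers, a batch of 4 examples, vocabulary of 50.
teacher_a = torch.randn(4, 50)
teacher_b = torch.randn(4, 50)
student = torch.randn(4, 50, requires_grad=True)

# Assumed per-teacher weights, e.g. derived from each teacher's GSM8K accuracy.
weights = torch.tensor([0.7, 0.3])

T = 2.0  # distillation temperature

# Mix the two teachers' softened distributions into a single soft target.
mixed_target = (
    weights[0] * F.softmax(teacher_a / T, dim=-1)
    + weights[1] * F.softmax(teacher_b / T, dim=-1)
)

loss = F.kl_div(F.log_softmax(student / T, dim=-1), mixed_target,
                reduction="batchmean") * (T * T)
loss.backward()  # gradients flow only into the student logits
print(loss.item())
```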

🟢 Final Summary in One Sentence

This research aims to make a small language model reason better on GSM8K by intelligently transferring high-quality reasoning knowledge from a large teacher model using filtered chain-of-thought distillation.

📘 Knowledge Distillation for GSM8K Reasoning (Simple View)

GSM8K Dataset
(Math Word Problems)
            │
            ▼
┌────────────────────────┐
│   Large Teacher LLM    │
│   (Strong Reasoning)   │
│   Generates CoT Steps  │
└───────────┬────────────┘
            │  Filter Correct & Clear
            │  Chain-of-Thought Steps
            ▼
┌────────────────────────┐
│   Small Student LLM    │
│   (Compressed Model)   │
│   Learns via KD        │
└───────────┬────────────┘
            │
            ▼
   Evaluate on GSM8K
   ─ Accuracy
   ─ Reasoning Quality
   ─ Inference Speed

🔹 Even More Simplified (One-Line Flow)

GSM8K → Teacher (Generate CoT) → Filter CoT → Distill to Small Model → Test on GSM8K
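
Since the flow ends with testing on GSM8K, the sketch below shows how final-answer accuracy is typically computed. GSM8K reference solutions end with "#### <number>"; the rule for extracting the model's prediction (take the last number it generates) is a common convention assumed here, not prescribed by any specific paper.

```python
import re

def gold_answer(solution: str) -> str:
    """GSM8K reference solutions end with '#### <number>'."""
    return solution.split("####")[-1].strip().replace(",", "")

def predicted_answer(generation: str) -> str | None:
    """Assumed convention: take the last number the model produces."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
    return numbers[-1] if numbers else None

def accuracy(generations: list[str], solutions: list[str]) -> float:
    correct = sum(
        predicted_answer(g) == gold_answer(s)
        for g, s in zip(generations, solutions)
    )
    return correct / len(solutions)

# Tiny illustrative check
gens = ["3 x 4 = 12. The answer is 12.", "I think it is 7."]
sols = ["3*4=12\n#### 12", "5+3=8\n#### 8"]
print(accuracy(gens, sols))  # 0.5
```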




| Paper Title | Dataset Used | Teacher Model | Student Model | GSM8K Accuracy | Code Link | Paper Link | Research Gap | Published / Submission Date |
|---|---|---|---|---|---|---|---|---|
| Distilling Reasoning Capabilities into Smaller Language Models | GSM8K | Large LLM teacher (e.g., PaLM-style models) | Smaller distilled LM | Strong GSM8K improvement via Socratic CoT distillation | Not officially released | — | Early-stage reasoning distillation; limited exploration of implicit-reasoning compression; dependence on strong teacher reasoning traces. | Submitted December 2022 |
| Teaching Small Language Models to Reason (related KD work) | GSM8K | Large teacher (e.g., PaLM-540B) | T5-XXL | ~21.99 % improvement over baseline | Not officially released | — | Heavy dependence on proprietary teachers; limited open-source reproducibility; computational cost not fully addressed. | Submitted 16 Dec 2022 (updated 1 Jun 2023) |
| Trace-of-Thought Prompting: Investigating Prompt-Based Knowledge Distillation Through Question Decomposition | GSM8K | Large CoT-capable LLM (prompt-based teacher) | Smaller fine-tuned LM | Significant improvement over baseline on GSM8K (exact % varies by model size) | Not officially released | — | Focused mainly on prompt-based KD; lacks comparison with parameter-level distillation; limited scaling analysis; evaluation restricted mostly to reasoning benchmarks. | Published August 2024 (ACL SRW 2024) |
| Implicit Chain-of-Thought Reasoning via Knowledge Distillation | GSM8K | Explicit-CoT-trained teacher | Implicit-reasoning student | ≈ 22 % on GSM8K (without generating reasoning steps) | — | — | Limited performance gains compared to explicit CoT; struggles to scale implicit reasoning to larger models; limited evaluation beyond GSM8K; interpretability trade-offs not deeply analyzed. | Submitted 2 Nov 2023 |
| Enhancing Knowledge Distillation for LLMs with Response-Priming Prompting | GSM8K | LLaMA-3.1 405B teacher | LLaMA-3.1 8B student | ~55 % performance increase vs. baseline distillation | — | — | Relies on extremely large teacher models; unclear robustness beyond GSM8K; limited ablation on prompt sensitivity. | Submitted 18 Dec 2024 |
| CODI: Compressing Chain-of-Thought via Self-Distillation | GSM8K | Explicit CoT (joint teacher/student, self-distilled) | Same model as the teacher, in implicit representation | Matches explicit-CoT performance (28.2 % increase over the previous best in implicit distillation) | — | — | Focused on compression rather than speed/efficiency trade-offs; unclear generalization to multi-hop reasoning; limited study of very small students. | Published November 2025 (EMNLP 2025 proceedings) |



Here the same figure is explained in a very simple tabular format, so that even someone without an AI background can understand it.
Dataset used in the experiment: GSM8K

Explanation of the figure (Response-Priming Prompting), as a simple table:

| Method | What They Did | Simple Example | Accuracy (approx.) | Easy Meaning |
|---|---|---|---|---|
| No KD | The small AI model was used without any special training from a bigger model. | Small AI tries solving math questions alone. | ~12% | Very weak performance. |
| No KD (Finetuned) | The small AI model was trained using regular training data, but without help from a bigger model. | Student studies using a textbook but no teacher guidance. | ~25% | Better than before but still not very good. |
| Base KD | The small model learns by copying answers from a large "teacher" AI model. | Teacher gives the answer, student memorizes it. | ~30% | Improvement, but the student doesn't learn reasoning. |
| Confidence KD | The teacher AI also gives a confidence score for its answers; low-confidence answers can be ignored. | Teacher says: "Answer = 8, I am 90% sure." | ~34% | Slight improvement because unreliable answers are filtered. |
| Teacher KD | The teacher AI explains the solution step by step before giving the answer. | Teacher shows the full math steps. | ~42% | Much better because the student learns the reasoning process. |
| Ground Truth KD | The correct answer is given first, then the teacher explains why it is correct. | Teacher says: "The answer is 8. Let me show why, step by step." | ~48–49% | Best result because the explanations are always correct. |
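
To make the difference between the "Teacher KD" and "Ground Truth KD" rows concrete, here is a rough sketch of how the two teacher prompts might be phrased. The wording below is a hypothetical illustration, not the actual templates from the Response-Priming Prompting paper.

```python
# Hypothetical prompt templates; the wording used in the paper may differ.
TEACHER_KD_PROMPT = (
    "Solve the following grade-school math problem. "
    "Explain your reasoning step by step, then state the final answer.\n\n"
    "Problem: {question}"
)

GROUND_TRUTH_KD_PROMPT = (
    "The correct answer to the following problem is {gold_answer}. "
    "Explain, step by step, why this answer is correct.\n\n"
    "Problem: {question}"
)

question = "Tom has 3 bags with 4 apples each. How many apples does he have?"
print(TEACHER_KD_PROMPT.format(question=question))
print(GROUND_TRUTH_KD_PROMPT.format(question=question, gold_answer=12))
```

The key difference is that the ground-truth variant primes the teacher with the correct answer, so its explanation cannot end at a wrong result.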

Main Idea in Very Simple Words

| Concept | Simple Meaning |
|---|---|
| Teacher Model | A very powerful AI that already knows a lot. |
| Student Model | A smaller AI that is being trained to become smarter. |
| Knowledge Distillation (KD) | The process where the big AI teaches the small AI. |

Key Insight From the Paper

| Situation | Result |
|---|---|
| Small AI learns alone | Very poor results |
| Small AI copies answers | Slight improvement |
| Small AI learns step-by-step reasoning | Huge improvement |
So the researchers discovered:
If the teacher AI explains its reasoning clearly, the student AI learns much better.

Simple Real-Life Analogy

| Teaching Style | Outcome |
|---|---|
| Teacher only gives the final answer | Student memorizes answers |
| Teacher explains the full steps | Student learns how to solve problems |
The Ground Truth KD method is like a teacher who knows the correct answer and then explains the method carefully, which leads to the best learning.
Final takeaway
The research shows that the way a teacher AI explains answers during training greatly affects how well a smaller AI learns.


Baseline Methods Explained (CODI paper):

| Baseline | Full Meaning | How It Works | Reasoning Used? |
|---|---|---|---|
| No-CoT-SFT | Supervised fine-tuning without chain-of-thought | The model is trained to output the final answer directly, without intermediate steps | ❌ No |
| CoT-SFT | Chain-of-thought supervised fine-tuning | The model is trained to produce step-by-step reasoning before giving the final answer | ✅ Yes |
| iCoT | Internalized chain-of-thought | Instead of generating reasoning text, the reasoning pattern is stored internally in model states | ⚙️ Internal reasoning |
| Coconut | Continuous chain-of-thought method | Generates continuous reasoning representations rather than explicit text steps | 🔄 Continuous reasoning |
| CODI (Ours) | Continuous reasoning method proposed in the paper | Uses continuous thought tokens to improve reasoning ability | ⭐ Proposed method |

Implementation of Each Baseline:

| Baseline | How It Is Implemented | Training Data Format |
|---|---|---|
| No-CoT-SFT | Perform supervised fine-tuning (SFT) on question–answer pairs; the model is trained to output the final answer directly. | Input: Question → Output: Final answer |
| CoT-SFT | Fine-tune the model on chain-of-thought (CoT) data; the model learns to generate reasoning steps before the answer. | Input: Question → Output: Step-by-step reasoning + answer |
| iCoT | Uses curriculum learning: first train with reasoning steps, then gradually internalize the reasoning into hidden states so the model can answer directly without printing the reasoning. | Training starts with CoT text; visible reasoning is later removed |
| Coconut | Extends iCoT by replacing textual reasoning with continuous thought representations; the model autoregressively generates continuous hidden tokens instead of text reasoning. | Question → Continuous reasoning tokens → Answer |
| CODI (Ours) | The proposed method. Similar to Coconut, but introduces continuous thought tokens explicitly in the model; these tokens represent reasoning steps and are trained to improve reasoning performance. | Question → Continuous thought tokens → Answer |
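
The "Training Data Format" column can be turned into code almost directly. The sketch below serializes one GSM8K-style example in the two supervised formats; the field names, separator text, and the example itself are assumptions for illustration only.

```python
def format_no_cot_sft(question: str, answer: str) -> dict:
    # No-CoT-SFT: the target is just the final answer.
    return {"input": question, "target": answer}

def format_cot_sft(question: str, reasoning: str, answer: str) -> dict:
    # CoT-SFT: the target contains the reasoning steps followed by the answer.
    return {"input": question, "target": f"{reasoning}\nThe answer is {answer}."}

q = "Tom has 3 bags with 4 apples each. How many apples does he have?"
r = "Each bag has 4 apples and there are 3 bags, so 3 x 4 = 12."
print(format_no_cot_sft(q, "12"))
print(format_cot_sft(q, r, "12"))
```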

Key Difference Between Them

| Method | Reasoning Representation |
|---|---|
| No-CoT-SFT | No reasoning |
| CoT-SFT | Text reasoning |
| iCoT | Hidden reasoning in model states |
| Coconut | Continuous reasoning vectors |
| CODI | Continuous thought tokens |
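
As a toy illustration of what "continuous reasoning vectors" or "continuous thought tokens" mean, the sketch below rolls a hidden vector forward for a fixed number of latent steps instead of decoding text. This is a deliberately simplified stand-in, not the actual Coconut or CODI architecture; all module and size choices are assumptions.

```python
import torch
import torch.nn as nn

# Toy stand-in for latent reasoning: roll a hidden vector forward for a few
# "thought" steps without ever emitting text, then decode an answer from it.
hidden_dim, num_thoughts, num_answers = 64, 4, 10   # illustrative sizes
step = nn.Linear(hidden_dim, hidden_dim)             # stands in for a transformer block
answer_head = nn.Linear(hidden_dim, num_answers)     # stands in for the answer decoder

state = torch.randn(1, hidden_dim)                   # pretend encoding of the question
for _ in range(num_thoughts):                        # latent reasoning steps (no text)
    state = torch.tanh(step(state))
answer_logits = answer_head(state)
print(answer_logits.shape)                           # torch.Size([1, 10])
```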
