
Knowledge Distillation for LLM Compression & Efficient Reasoning

- Improving GSM8K Reasoning Accuracy in Small Language Models using Filtered Chain-of-Thought Knowledge Distillation


📊 Research Papers Table — Knowledge Distillation + GSM8K

| Paper Title | Dataset Used | Teacher Model | Student Model | GSM8K Accuracy | Code Link | Paper Link | Research Gap | Published / Submission Date |
|---|---|---|---|---|---|---|---|---|
| Implicit Chain-of-Thought Reasoning via Knowledge Distillation | GSM8K | Explicit-CoT-trained teacher | Implicit-reasoning student | ≈ 22 % on GSM8K | — | — | Limited performance gains compared to explicit CoT; struggles to scale implicit reasoning to higher-capacity models; limited evaluation beyond GSM8K; lacks analysis of reasoning-interpretability trade-offs. | Submitted 2 Nov 2023 |
| CODI: Compressing Chain-of-Thought via Self-Distillation | GSM8K | Explicit CoT (joint teacher/student, self-distilled) | Same model as the teacher, in implicit representation | Matches explicit-CoT performance (28.2 % increase over the previous best in implicit distillation); interpretable GSM8K results shown | — | — | Focused mainly on compression rather than efficiency/speed trade-offs; unclear generalization to multi-hop or out-of-domain reasoning tasks; limited study of smaller student models. | Published November 2025 (EMNLP 2025 proceedings) |
| Teaching Small Language Models to Reason (related KD work) | GSM8K | Large teacher (e.g., PaLM-540B) | T5-XXL | ~21.99 % improvement over baseline | Not officially released | — | Heavy dependence on very large proprietary teachers; no open-source reproducibility; limited exploration of implicit reasoning vs. explicit-CoT compression; computational cost not fully addressed. | Submitted 16 Dec 2022 (latest version 1 Jun 2023) |
| Enhancing KD with Response-Priming Prompting (2024) | GSM8K | LLaMA-3.1 405B teacher | LLaMA-3.1 8B student | ~55 % performance increase vs. baseline distillation | — | — | Relies on extremely large teacher models; unclear robustness on datasets beyond GSM8K; lacks comparison with non-CoT compression methods; limited ablation on prompt sensitivity. | Submitted 18 Dec 2024 |

📊 Simple Gap Summary Table

| Gap | Current Limitation | Your Opportunity |
|---|---|---|
| Accuracy Gap | Student far below teacher | Improve reasoning transfer |
| Efficiency Balance | No joint evaluation | Optimize both accuracy + speed |
| Teacher Errors | No filtering | Confidence-based filtering |
| Multi-Teacher | Rarely explored | Weighted multi-teacher KD |
| Hidden-State KD | Low performance | Combine explicit + implicit |
| Statistical Rigor | Weak validation | Add proper statistical testing |
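
The "Teacher Errors" row above suggests filtering teacher rationales before distillation. Below is a minimal sketch of one way a confidence-based filter could work, assuming the teacher returns per-token log-probabilities; the record layout, the answer-extraction regex, and the threshold value are illustrative assumptions, not taken from any of the papers above.

```python
import re
from statistics import mean

# Hypothetical record format: each teacher sample carries the generated CoT,
# per-token log-probabilities, and the gold answer from GSM8K.
teacher_samples = [
    {"question": "Tom has 3 bags with 4 apples each. How many apples?",
     "cot": "3 bags x 4 apples = 12 apples. The answer is 12.",
     "token_logprobs": [-0.11, -0.32, -0.05, -0.21, -0.08],
     "gold_answer": "12"},
]

ANSWER_RE = re.compile(r"answer is\s*(-?\d+)")   # assumed answer format
CONFIDENCE_THRESHOLD = -0.5                      # assumed cutoff on mean log-prob

def extract_answer(cot: str) -> str | None:
    m = ANSWER_RE.search(cot.lower())
    return m.group(1) if m else None

def keep(sample: dict) -> bool:
    """Keep a rationale only if it is both confident and ends in the gold answer."""
    confident = mean(sample["token_logprobs"]) >= CONFIDENCE_THRESHOLD
    correct = extract_answer(sample["cot"]) == sample["gold_answer"]
    return confident and correct

filtered = [s for s in teacher_samples if keep(s)]
print(f"kept {len(filtered)} / {len(teacher_samples)} teacher rationales")
```

Filtering on both agreement with the gold answer and teacher confidence is one plausible way to address the "Teacher Errors" gap; the threshold would need tuning on a held-out split.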

📘 Survey Paper Summary Table (Related to KD & Efficient LLMs)

| Paper Title | Authors | Year | Core Focus | Relevance to KD + Reasoning/GSM8K | Key Highlights | Exact Link |
|---|---|---|---|---|---|---|
| A Survey on Knowledge Distillation of Large Language Models | Xiaohan Xu et al. | 2024 | KD methods for compressing and enhancing LLMs | High — covers general KD methods and model compression, applicable to reasoning-task transfer | Surveys KD algorithms and techniques tailored to LLMs; discusses KD for transferring capabilities from large proprietary LLMs to smaller models; highlights the role of KD in model compression and skill transfer; organized around algorithm, skill, and verticalization perspectives | — |
| Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application | Chuanpeng Yang et al. | 2024 | KD methods, evaluation, and applications in LLM compression | High — classifies KD techniques; discusses evaluation and practical applications | Categorizes KD into white-box and black-box approaches; discusses how KD affects inference speed and model scalability; covers evaluation tasks related to compressed-LLM performance | — |
| Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions | Luyang Fang et al. | 2025 | KD + dataset distillation for efficient LLMs | Medium-high — covers more advanced KD trends; mentions capability-preservation strategies | Combines KD and dataset distillation to address scalability; highlights challenges in preserving reasoning and linguistic abilities during compression; useful for future research directions | — |
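
The white-box vs. black-box split highlighted in the Yang et al. survey can be made concrete with the two loss signals below. This is a minimal PyTorch sketch with random tensors standing in for real model outputs; the shapes, the temperature value, and the variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Dummy shapes: batch of 2 sequences, 5 positions, vocabulary of 100 (illustrative only).
student_logits = torch.randn(2, 5, 100)
teacher_logits = torch.randn(2, 5, 100)            # white-box: full teacher distribution
teacher_token_ids = torch.randint(0, 100, (2, 5))  # black-box: only the teacher's text

T = 2.0  # softmax temperature, a common KD hyper-parameter

# White-box KD: KL divergence between temperature-softened distributions.
white_box_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T * T)

# Black-box KD: ordinary cross-entropy on the teacher-generated tokens,
# i.e. supervised fine-tuning of the student on teacher outputs.
black_box_loss = F.cross_entropy(
    student_logits.reshape(-1, 100), teacher_token_ids.reshape(-1)
)

print(white_box_loss.item(), black_box_loss.item())
```

White-box KD needs access to the teacher's logits (open-weight teachers), while black-box KD works with API-only teachers, which is one reason much CoT-distillation work relies on teacher-generated text rather than logits.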



Application Area: Knowledge Distillation for LLM Compression & Efficient Reasoning

Paper 1: Distilling Step-by-Step: Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes, Hsieh et al., 2023, arXiv
Paper 2: Self-Distillation Improves Chain-of-Thought Reasoning in Language Models, Magister et al., 2023, arXiv
Paper 3: MiniLLM: Knowledge Distillation of Large Language Models, Gu et al., 2023, arXiv
Paper 4: GKD: Generalized Knowledge Distillation for Large Language Models, Agarwal et al., 2023, arXiv
Paper 5: LLM-Pruner: On the Structural Pruning of Large Language Models, Ma et al., 2023, NeurIPS Workshop

Research Gap: Current KD approaches for LLMs focus mainly on output logits or reasoning traces but insufficiently address (1) alignment preservation during distillation, (2) multi-teacher ensemble distillation, (3) domain-specific LLM compression for low-resource languages, and (4) energy-efficient distillation metrics. Few works integrate reasoning fidelity, efficiency, and safety simultaneously.

Research Flow:
1. Conduct a detailed literature review of LLM distillation techniques (logit-based, feature-based, reasoning-based).
2. Define the problem statement (e.g., "Efficient domain-specific LLM via multi-teacher reasoning-aware KD").
3. Select a teacher model (e.g., a 7B–13B LLM).
4. Design the student model architecture (a smaller transformer).
5. Implement KD variants: vanilla KD, response-based KD, chain-of-thought KD, and multi-teacher KD (a weighted multi-teacher sketch follows below).
6. Evaluate on reasoning, efficiency, latency, memory, and alignment metrics.
7. Perform ablation studies.
8. Compare with pruning/quantization baselines.
9. Document reproducibility and statistical validation.

Recommended Datasets: GSM8K (math reasoning), MMLU (multitask reasoning), Alpaca Instruction Dataset, FLAN Collection, WikiText-103

Dataset Sources: —

Publication/Completion Advice: Clearly define the novelty (e.g., "Reasoning-aware multi-teacher KD for domain-adapted LLMs"). Include strong baselines. Report latency, FLOPs, and memory reduction. Use statistical significance tests. Open-source the code for credibility. Target venues: EMNLP, ACL, NeurIPS workshops, IEEE Access, Expert Systems with Applications. Maintain a structured thesis: Problem → Literature → Method → Experiments → Analysis → Conclusion. Ensure reproducibility and ethical considerations.
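
As referenced in step 5 of the research flow, here is a minimal sketch of a weighted multi-teacher soft-label loss. The logits are random placeholders and the per-teacher weights are an assumption (they could, for example, be set from each teacher's validation accuracy); this illustrates the idea rather than reproducing any published recipe.

```python
import torch
import torch.nn.functional as F

# Illustrative logits: 2 teachers, a batch of 4 examples, vocabulary of 50.
teacher_a = torch.randn(4, 50)
teacher_b = torch.randn(4, 50)
student = torch.randn(4, 50, requires_grad=True)

# Assumed per-teacher weights, e.g. derived from each teacher's GSM8K accuracy.
weights = torch.tensor([0.7, 0.3])

T = 2.0  # distillation temperature

# Mix the two teachers' softened distributions into a single soft target.
mixed_target = (
    weights[0] * F.softmax(teacher_a / T, dim=-1)
    + weights[1] * F.softmax(teacher_b / T, dim=-1)
)

loss = F.kl_div(F.log_softmax(student / T, dim=-1), mixed_target,
                reduction="batchmean") * (T * T)
loss.backward()  # gradients flow only into the student logits
print(loss.item())
```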

🟢 Final Summary in One Sentence

This research aims to make a small language model reason better on GSM8K by intelligently transferring high-quality reasoning knowledge from a large teacher model using filtered chain-of-thought distillation.

📘 Knowledge Distillation for GSM8K Reasoning (Simple View)

GSM8K Dataset
(Math Word Problems)
            │
            ▼
┌────────────────────────┐
│   Large Teacher LLM    │
│   (Strong Reasoning)   │
│   Generates CoT Steps  │
└───────────┬────────────┘
            │  Filter Correct & Clear
            │  Chain-of-Thought Steps
            ▼
┌────────────────────────┐
│   Small Student LLM    │
│   (Compressed Model)   │
│   Learns via KD        │
└───────────┬────────────┘
            │
            ▼
   Evaluate on GSM8K
   ─ Accuracy
   ─ Reasoning Quality
   ─ Inference Speed

🔹 Even More Simplified (One-Line Flow)

GSM8K → Teacher (Generate CoT) → Filter CoT → Distill to Small Model → Test on GSM8K
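
Since the flow ends with testing on GSM8K, the sketch below shows how final-answer accuracy is typically computed. GSM8K reference solutions end with "#### <number>"; the rule for extracting the model's prediction (take the last number it generates) is a common convention assumed here, not prescribed by any specific paper.

```python
import re

def gold_answer(solution: str) -> str:
    """GSM8K reference solutions end with '#### <number>'."""
    return solution.split("####")[-1].strip().replace(",", "")

def predicted_answer(generation: str) -> str | None:
    """Assumed convention: take the last number the model produces."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
    return numbers[-1] if numbers else None

def accuracy(generations: list[str], solutions: list[str]) -> float:
    correct = sum(
        predicted_answer(g) == gold_answer(s)
        for g, s in zip(generations, solutions)
    )
    return correct / len(solutions)

# Tiny illustrative check
gens = ["3 x 4 = 12. The answer is 12.", "I think it is 7."]
sols = ["3*4=12\n#### 12", "5+3=8\n#### 8"]
print(accuracy(gens, sols))  # 0.5
```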




| Paper Title | Dataset Used | Teacher Model | Student Model | GSM8K Accuracy | Code Link | Paper Link | Research Gap | Published / Submission Date |
|---|---|---|---|---|---|---|---|---|
| Distilling Reasoning Capabilities into Smaller Language Models | GSM8K | Large LLM teacher (e.g., PaLM-style models) | Smaller distilled LM | Strong GSM8K improvement via Socratic CoT distillation | Not officially released | — | Early-stage reasoning distillation; limited exploration of implicit-reasoning compression; dependence on strong teacher reasoning traces. | Submitted December 2022 |
| Teaching Small Language Models to Reason (related KD work) | GSM8K | Large teacher (e.g., PaLM-540B) | T5-XXL | ~21.99 % improvement over baseline | Not officially released | — | Heavy dependence on proprietary teachers; limited open-source reproducibility; computational cost not fully addressed. | Submitted 16 Dec 2022 (updated 1 Jun 2023) |
| Trace-of-Thought Prompting: Investigating Prompt-Based Knowledge Distillation Through Question Decomposition | GSM8K | Large CoT-capable LLM (prompt-based teacher) | Smaller fine-tuned LM | Significant improvement over baseline on GSM8K (exact % varies by model size) | Not officially released | — | Focused mainly on prompt-based KD; lacks comparison with parameter-level distillation; limited scaling analysis; evaluation restricted mostly to reasoning benchmarks. | Published August 2024 (ACL SRW 2024) |
| Implicit Chain-of-Thought Reasoning via Knowledge Distillation | GSM8K | Explicit-CoT-trained teacher | Implicit-reasoning student | ≈ 22 % on GSM8K (without generating reasoning steps) | — | — | Limited performance gains compared to explicit CoT; struggles to scale implicit reasoning to larger models; limited evaluation beyond GSM8K; interpretability trade-offs not deeply analyzed. | Submitted 2 Nov 2023 |
| Enhancing Knowledge Distillation for LLMs with Response-Priming Prompting | GSM8K | LLaMA-3.1 405B teacher | LLaMA-3.1 8B student | ~55 % performance increase vs. baseline distillation | — | — | Relies on extremely large teacher models; unclear robustness beyond GSM8K; limited ablation on prompt sensitivity. | Submitted 18 Dec 2024 |
| CODI: Compressing Chain-of-Thought via Self-Distillation | GSM8K | Explicit CoT (joint teacher/student, self-distilled) | Same model as the teacher, in implicit representation | Matches explicit-CoT performance (28.2 % increase over the previous best in implicit distillation) | — | — | Focused on compression rather than speed/efficiency trade-offs; unclear generalization to multi-hop reasoning; limited study of very small students. | Published November 2025 (EMNLP 2025 proceedings) |



Here the same figure is explained in a very simple tabular format, so that even someone without an AI background can understand it.
Dataset used in the experiment: GSM8K

Explanation of the figure (Response-Priming Prompting), as a simple table:

| Method | What They Did | Simple Example | Accuracy (approx.) | Easy Meaning |
|---|---|---|---|---|
| No KD | The small AI model was used without any special training from a bigger model. | Small AI tries solving math questions alone. | ~12% | Very weak performance. |
| No KD (Finetuned) | The small AI model was trained using regular training data, but without help from a bigger model. | Student studies using a textbook but no teacher guidance. | ~25% | Better than before but still not very good. |
| Base KD | The small model learns by copying answers from a large "teacher" AI model. | Teacher gives the answer, student memorizes it. | ~30% | Improvement, but the student doesn't learn reasoning. |
| Confidence KD | The teacher AI also gives a confidence score for its answers; low-confidence answers can be ignored. | Teacher says: "Answer = 8, I am 90% sure." | ~34% | Slight improvement because unreliable answers are filtered. |
| Teacher KD | The teacher AI explains the solution step by step before giving the answer. | Teacher shows the full math steps. | ~42% | Much better because the student learns the reasoning process. |
| Ground Truth KD | The correct answer is given first, then the teacher explains why it is correct. | Teacher says: "The answer is 8. Let me show why, step by step." | ~48–49% | Best result because the explanations are always correct. |
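
To make the difference between the "Teacher KD" and "Ground Truth KD" rows concrete, here is a rough sketch of how the two teacher prompts might be phrased. The wording below is a hypothetical illustration, not the actual templates from the Response-Priming Prompting paper.

```python
# Hypothetical prompt templates; the wording used in the paper may differ.
TEACHER_KD_PROMPT = (
    "Solve the following grade-school math problem. "
    "Explain your reasoning step by step, then state the final answer.\n\n"
    "Problem: {question}"
)

GROUND_TRUTH_KD_PROMPT = (
    "The correct answer to the following problem is {gold_answer}. "
    "Explain, step by step, why this answer is correct.\n\n"
    "Problem: {question}"
)

question = "Tom has 3 bags with 4 apples each. How many apples does he have?"
print(TEACHER_KD_PROMPT.format(question=question))
print(GROUND_TRUTH_KD_PROMPT.format(question=question, gold_answer=12))
```

The key difference is that the ground-truth variant primes the teacher with the correct answer, so its explanation cannot end at a wrong result.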

Main Idea in Very Simple Words

| Concept | Simple Meaning |
|---|---|
| Teacher Model | A very powerful AI that already knows a lot. |
| Student Model | A smaller AI that is being trained to become smarter. |
| Knowledge Distillation (KD) | The process where the big AI teaches the small AI. |

Key Insight From the Paper

| Situation | Result |
|---|---|
| Small AI learns alone | Very poor results |
| Small AI copies answers | Slight improvement |
| Small AI learns step-by-step reasoning | Huge improvement |
So the researchers discovered:
If the teacher AI explains its reasoning clearly, the student AI learns much better.

Simple Real-Life Analogy

| Teaching Style | Outcome |
|---|---|
| Teacher only gives the final answer | Student memorizes answers |
| Teacher explains the full steps | Student learns how to solve problems |
The Ground Truth KD method is like a teacher who knows the correct answer and then explains the method carefully, which leads to the best learning.
Final takeaway
The research shows that the way a teacher AI explains answers during training greatly affects how well a smaller AI learns.


Baseline Methods Explained (CODI paper):

| Baseline | Full Meaning | How It Works | Reasoning Used? |
|---|---|---|---|
| No-CoT-SFT | Supervised fine-tuning without chain-of-thought | The model is trained to output the final answer directly, without intermediate steps | ❌ No |
| CoT-SFT | Chain-of-thought supervised fine-tuning | The model is trained to produce step-by-step reasoning before giving the final answer | ✅ Yes |
| iCoT | Internalized chain-of-thought | Instead of generating reasoning text, the reasoning pattern is stored internally in model states | ⚙️ Internal reasoning |
| Coconut | Continuous chain-of-thought method | Generates continuous reasoning representations rather than explicit text steps | 🔄 Continuous reasoning |
| CODI (Ours) | Continuous reasoning method proposed in the paper | Uses continuous thought tokens to improve reasoning ability | ⭐ Proposed method |

Implementation of Each Baseline:

| Baseline | How It Is Implemented | Training Data Format |
|---|---|---|
| No-CoT-SFT | Perform supervised fine-tuning (SFT) on question–answer pairs; the model is trained to output the final answer directly. | Input: Question → Output: Final answer |
| CoT-SFT | Fine-tune the model on chain-of-thought (CoT) data; the model learns to generate reasoning steps before the answer. | Input: Question → Output: Step-by-step reasoning + answer |
| iCoT | Uses curriculum learning: first train with reasoning steps, then gradually internalize the reasoning into hidden states so the model can answer directly without printing the reasoning. | Training starts with CoT text; visible reasoning is later removed |
| Coconut | Extends iCoT by replacing textual reasoning with continuous thought representations; the model autoregressively generates continuous hidden tokens instead of text reasoning. | Question → Continuous reasoning tokens → Answer |
| CODI (Ours) | The proposed method. Similar to Coconut, but introduces continuous thought tokens explicitly in the model; these tokens represent reasoning steps and are trained to improve reasoning performance. | Question → Continuous thought tokens → Answer |
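
The "Training Data Format" column can be turned into code almost directly. The sketch below serializes one GSM8K-style example in the two supervised formats; the field names, separator text, and the example itself are assumptions for illustration only.

```python
def format_no_cot_sft(question: str, answer: str) -> dict:
    # No-CoT-SFT: the target is just the final answer.
    return {"input": question, "target": answer}

def format_cot_sft(question: str, reasoning: str, answer: str) -> dict:
    # CoT-SFT: the target contains the reasoning steps followed by the answer.
    return {"input": question, "target": f"{reasoning}\nThe answer is {answer}."}

q = "Tom has 3 bags with 4 apples each. How many apples does he have?"
r = "Each bag has 4 apples and there are 3 bags, so 3 x 4 = 12."
print(format_no_cot_sft(q, "12"))
print(format_cot_sft(q, r, "12"))
```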

Key Difference Between Them

| Method | Reasoning Representation |
|---|---|
| No-CoT-SFT | No reasoning |
| CoT-SFT | Text reasoning |
| iCoT | Hidden reasoning in model states |
| Coconut | Continuous reasoning vectors |
| CODI | Continuous thought tokens |
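
As a toy illustration of what "continuous reasoning vectors" or "continuous thought tokens" mean, the sketch below rolls a hidden vector forward for a fixed number of latent steps instead of decoding text. This is a deliberately simplified stand-in, not the actual Coconut or CODI architecture; all module and size choices are assumptions.

```python
import torch
import torch.nn as nn

# Toy stand-in for latent reasoning: roll a hidden vector forward for a few
# "thought" steps without ever emitting text, then decode an answer from it.
hidden_dim, num_thoughts, num_answers = 64, 4, 10   # illustrative sizes
step = nn.Linear(hidden_dim, hidden_dim)             # stands in for a transformer block
answer_head = nn.Linear(hidden_dim, num_answers)     # stands in for the answer decoder

state = torch.randn(1, hidden_dim)                   # pretend encoding of the question
for _ in range(num_thoughts):                        # latent reasoning steps (no text)
    state = torch.tanh(step(state))
answer_logits = answer_head(state)
print(answer_logits.shape)                           # torch.Size([1, 10])
```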
