Some insights on Finetuning small models

Authors: Kareem Ahmad, Zelin Pu

Personal Introduction:

I am Kareem, a fervent Computer Science student currently shaping my skills at , in Egypt. (BTW, Egyptian phone numbers are not supported for registration, so I had to borrow a phone number from my Canadian friend. Hope to see that fixed in the future!) My academic journey has been built on hands-on experience and certifications in Data Analysis and Machine Learning.
Zelin introduced me here. I met him on, a platform for AI hackathons. We have built two hackathon projects together, and I am very excited about our current project, Daydreaming. We are developing an AI-powered game engine designed to simulate storylines and the evolution of relationships among characters within virtual worlds.
I love Zelin’s idea of modularizing the engine and progressively open-sourcing components to foster a community. Over the recent weeks, I've been working on a fine-tuning pipeline tailored to one specific scenario. I'm eager to share my experience and ask for some insights from this great community.

Scenario and Task

Envision this scenario: think about how you decide what to do first when you wake up. Our goal is for the language model to replicate this decision-making process.
Of course, we could use the GPT-4 API to achieve this goal, but its cost and inference speed are not ideal, and it seems like overkill to use such a powerful model for a simple task.
We aim to develop a language model with reasoning capabilities that generates the motivations behind a character's next event, taking into account both the immediate context and the character's background.
We want to evaluate the performance of a finetuned small LLM in this scenario.

The base model

Qwen 1.5 series
I wanted to start finetuning with the smallest base model I could find. After some research, I was impressed by the Qwen 1.5 series, which was released two weeks ago, and especially by the performance of the 1.8B model.
The Qwen 1.5 series comes in 6 model sizes: 0.5B, 1.8B, 4B, 7B, 14B, and 72B. The minimum GPU memory usage for generating 2048 tokens is only 2.9GB for the 1.8B model, and I could only imagine how much faster the 0.5B will be!
Here is a showcase of the input and output of the 1.8B model:
system prompt: based on the sample input and output pair, You need to generate another pair. The data you generated will be used for finetuning. Your output should be in JSON format.\n\nsample: input: “Character name: Lin Yu\nCharacter description: Male, 19 years old, a freshman at Savia University in the United States, a Chinese international student. Ordinary appearance, ordinary status, meticulous thinking, brave and careful, smart, calm and composed, persistent.\nCurrent situation: “It's Monday morning at 7:30, and Lin Yu is tossing and turning in bed, unable to sleep”\nBackground knowledge: Lin Yu has been down for too long due to a recent breakup and wants to cheer himself up. Become a better person.\n\n output: {“purpose_a: “Lin Yu decides to go to the library early to study, to improve his academics, and to temporarily distract himself from thoughts of the breakup.”, purpose_b: “Lin Yu thinks about going to the gym for a workout to energize himself and also improve his physical fitness”, “purpose_c”: “Lin Yu wants to go to a café for breakfast, where he might have a chance to strike up a conversation with a beautiful woman.”}
input prompt: "Character name: Emma Green\nCharacter description: Female, 28 years old, a software engineer in a tech company in Silicon Valley. She has a creative mind, is highly skilled in coding, and is passionate about developing innovative technologies. She is ambitious, hard-working, but sometimes struggles with work-life balance.\nCurrent situation: “It’s Friday evening, and Emma just finished a major project. She’s feeling both relieved and exhausted.”\nBackground knowledge: Emma has been working overtime for weeks to meet the project deadline and has neglected her personal life and hobbies."
model response: { "purpose_a": "Emma decides to take some time off on Saturday to relax and recharge after completing her project. She plans to spend the day doing activities that she enjoys, such as reading books or watching movies.", "purpose_b": "On Sunday, Emma schedules an appointment with her therapist to discuss her stress levels and develop coping mechanisms for managing her workload and balancing her professional and personal life.", "purpose_c": "In addition to taking care of herself physically and emotionally, Emma also plans to volunteer at a local charity organization to give back to her community and make a positive impact." }
The result was impressive and got my hopes up. If the 1.8B base model already performs like this, I believe the 0.5B model can be promising after some finetuning!
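Since the model is asked to answer in JSON with the three keys shown in the showcase above, a quick check like the following can tell a usable response from a broken one. This is only a minimal sketch: the function name parse_purposes and the strict "exactly three keys" rule are my own assumptions for illustration, not part of our pipeline.

```python
import json

# Keys taken from the sample output shown in the showcase.
EXPECTED_KEYS = {"purpose_a", "purpose_b", "purpose_c"}


def parse_purposes(raw: str) -> dict:
    """Parse a model response and check it holds the three purpose fields.

    Hypothetical helper: raises ValueError when the response is not valid
    JSON, is not an object, or does not have exactly the expected keys.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("response is not a JSON object")
    if set(data) != EXPECTED_KEYS:
        raise ValueError(f"unexpected keys: {sorted(data)}")
    for key, value in data.items():
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{key} must be a non-empty string")
    return data
```

Running every generated response through a check like this also gives a simple pass rate to compare base and finetuned models.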


Data Preparation:
I prepared around 1789 pairs of input and output data in the ShareGPT format:
{
  "id": "333fee3c-ba98-44d3-94da-e34836a07fb6",
  "conversations": [
    {
      "from": "user",
      "value": "{\"Character_name\": \"Emma\",\"Character_description\": \"Female, 28 years old, a project manager at a tech startup, lives in a bustling city. She is ambitious, assertive, and highly organized, with a keen sense of style and a passion for clean eating and fitness.\",\"Context\": \"It's late Friday afternoon, and Emma just completed a major project milestone. She's feeling accomplished but also recognizes the need for some personal time.\",\"Background\": \"Emma has been working overtime for the past few weeks to meet her project deadlines. With the weekend ahead, she's looking forward to unwinding and having some fun, but also wants to maintain her healthy lifestyle.\"}"
    },
    {
      "from": "assistant",
      "value": "{\"purpose_a\": \"Emma decides to attend a yoga class to unwind and maintain her fitness routine.\",\"purpose_b\": \"Emma thinks about making dinner reservations at a new organic restaurant to treat herself and adhere to her clean eating habits.\",\"purpose_c\": \"Emma wants to call her friends and organize a casual evening out to socialize and relax after the intensive work period.\"}"
    }
  ]
}
I used a limited number of actual user data pairs generated in testing, plus data generated by GPT-4 (using hackathon credits), so I guess it's mostly synthetic anyway lol. However, I had some issues with the GPT-4-generated data: the character and context information was not diverse enough. (The following diagram shows the counts of different occupations in the initial data, which were not diverse at all.)
I had to manually generate random professions and personality types (MBTIs) to make the data a bit more diverse.
[Image: counts of different occupations in the initial generated data]
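The data preparation described above can be sketched roughly as follows: sample a profession and an MBTI type at random to seed each character, count occupations to check diversity, and wrap each input/output pair as a ShareGPT-style record. The profession list, the function names, and the "profession"/"mbti" fields are hypothetical and only for illustration; the actual lists and scripts we used are not shown here.

```python
import json
import random
import uuid
from collections import Counter

# Hypothetical seed list; the real professions used for the dataset differ.
PROFESSIONS = ["nurse", "chef", "farmer", "teacher", "electrician", "pilot"]
# All 16 MBTI codes, built from the four letter pairs.
MBTI_TYPES = [a + b + c + d for a in "EI" for b in "SN" for c in "TF" for d in "JP"]


def sample_profiles(n: int, seed: int = 0) -> list:
    """Draw n random (profession, MBTI) seeds to diversify character prompts."""
    rng = random.Random(seed)  # seeded so the sampling is reproducible
    return [
        {"profession": rng.choice(PROFESSIONS), "mbti": rng.choice(MBTI_TYPES)}
        for _ in range(n)
    ]


def occupation_counts(profiles: list) -> Counter:
    """Count how often each profession appears, like the diagram above."""
    return Counter(p["profession"] for p in profiles)


def to_sharegpt(character: dict, purposes: dict) -> dict:
    """Wrap one input/output pair in the record layout shown earlier."""
    return {
        "id": str(uuid.uuid4()),
        "conversations": [
            {"from": "user", "value": json.dumps(character, ensure_ascii=False)},
            {"from": "assistant", "value": json.dumps(purposes, ensure_ascii=False)},
        ],
    }
```

Note that the "value" fields are JSON strings, not nested objects, which matches the sample record above.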
I used LLaMA Factory as the finetuning framework (it is recommended in the official Qwen 1.5 repo).
I used QLoRA instead of LoRA to reduce peak GPU memory usage, considering I only have 12GB of GPU memory...
Btw, LLaMA Factory provides a web UI that can visualize the training process as well!
So I loaded the dataset and the base model (Qwen1.5 0.5B), and set the training epochs to 3, logging steps to 10, save steps to 1000, gradient_accumulation_steps to 4, and so on, just to get a taste of everything.
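To get a feel for how these settings interact, here is a rough back-of-the-envelope calculation of the number of optimizer steps. The per-device batch size of 4 and the single GPU are assumptions (I did not list them above), and training frameworks may round slightly differently, so treat the numbers as an estimate only.

```python
import math


def optimizer_steps(num_samples: int, per_device_batch_size: int,
                    grad_accum_steps: int, epochs: int,
                    num_devices: int = 1) -> int:
    """Estimate total optimizer steps for a training run."""
    # Samples consumed per optimizer update.
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return steps_per_epoch * epochs


# 1789 samples, gradient_accumulation_steps=4, 3 epochs (from the post);
# per_device_batch_size=4 is an assumed value.
total_steps = optimizer_steps(1789, 4, 4, 3)
save_steps = 1000
print(total_steps, total_steps // save_steps)
```

Under the assumed batch size this gives 336 steps in total, which is below the save interval of 1000, so no intermediate checkpoint would be written before training ends. With a smaller batch size the picture changes, so it is worth running this kind of check before picking save steps.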

The Result:

The whole process took around 1 hour, and the loss dropped from around 1.8 to 1.192 by the end of the 3 epochs.
After merging the adapters and quantizing to 4 bits, the final model is only 1.3GB in size.
Given the loss value after 3 epochs, you can probably guess that the evaluation was not ideal: the model's responses tend to be repetitive. But it ran so fast, even on a low-end GPU! It was still quite promising!
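One simple way to put a number on "repetitive" is the distinct-n ratio: the share of unique n-grams among all n-grams in a response. This is not the evaluation we ran, just a small sketch of a metric that could be used for it, and it assumes whitespace tokenization, which is crude.

```python
def distinct_n(text: str, n: int = 2) -> float:
    """Return unique n-grams / total n-grams for whitespace-split tokens.

    Values near 1.0 mean little repetition; values near 0 mean the same
    phrases occur again and again. Returns 0.0 if the text is too short.
    """
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)
```

Comparing the average distinct-2 of the finetuned model's outputs against the reference outputs in the dataset would show whether more epochs or better data actually reduce the repetition.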

Next Steps

Better custom data needed. We should collect more actual user data.
More training epochs needed.
Run small models client-side. This is quite a bold concept, but with the help of WebLLM (which uses the browser's WebGPU API), we could see a future where small models like Qwen 0.5B and 1.8B run in the browser.
Sam Altman has highlighted the importance of recognizing the impending arrival of Artificial General Intelligence (AGI), and the anticipation that future iterations like GPT-5 or GPT-6 could evolve into models with unparalleled knowledge and understanding.
However, at present:
A high-performance AGI remains in development.
Lossless long context for large language models is still an expensive task (things might change with the new release of Google Gemini 1.5 Pro, which can support up to 1 million tokens in context at reduced cost?).
We believe finetuning small models remains valuable, effectively bridging the gap until a comprehensive AGI solution becomes available.