Fine-Tune LLAMA-3.1 Efficiently with Unsloth: Optimize for Data, Speed & Cost

Optimize your LLAMA-3.1 model with Unsloth's efficient fine-tuning techniques. Learn how to leverage LoRA and QLoRA for faster training, lower VRAM requirements, and improved model performance. Uncover the impact of hyperparameters on your fine-tuned model. Explore Unsloth's chat UI for seamless interaction with your custom LLMs.

February 24, 2025


Unlock the power of fine-tuning with Llama 3.1 and adapt language models to your specific needs. Discover how to efficiently train your own Llama 3.1 model using Unsloth's cutting-edge techniques, including LoRA and QLoRA, to achieve remarkable results with minimal GPU resources. This blog post provides a step-by-step guide to help you maximize the potential of your data and create a tailored language model that meets your unique requirements.

Different Stages of Training: Pre-training, Supervised Fine-Tuning, and Preference Alignment

There are typically three different stages of training for large language models:

  1. Pre-training: In this stage, the model is trained on a large corpus of raw text data to learn how to predict the next token or word. The result is a base model that has acquired a lot of general knowledge from the text, but is not yet very useful for specific tasks.

  2. Supervised Fine-Tuning: To make the base model more useful, the second stage is supervised fine-tuning. In this stage, the model is trained on question-answer pairs or instruction-answer pairs. The input is a question or instruction, and the output is the desired response or answer. This stage allows the model to learn task-specific knowledge and capabilities.

  3. Preference Alignment: The optional third stage is preference alignment, where the model is trained to learn what the user prefers in terms of responses, or to align the model to certain principles. This is often done using techniques like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO).

The goal of this post is to focus on the supervised fine-tuning stage, specifically how to fine-tune the Llama 3.1 model using the Unsloth library. We'll cover different fine-tuning techniques, such as full fine-tuning, LoRA, and QLoRA, and discuss the trade-offs between them in terms of performance and memory requirements.

Supervised Fine-Tuning Techniques: Full Fine-Tuning, LoRA, and QLoRA

There are three popular options for supervised fine-tuning:

  1. Full Fine-Tuning: In this approach, you take the original model and update all of its weights on the instruction fine-tuning dataset. This gives the best performance, but the VRAM requirement is high.

  2. LoRA (Low-Rank Adaptation): Instead of directly updating the original weights, you add small external adapter matrices to the model, and only these are trained. The number of trainable parameters in the adapters is controlled through their rank. The adapter weights are kept in 16-bit precision and can later be merged back into the original model weights. This approach trains quickly, but is still relatively costly because of the 16-bit operations.

  3. QLoRA (Quantized LoRA): This is similar to LoRA, but the base model weights are quantized to and kept in 4-bit precision, which significantly reduces the VRAM requirement. The performance may not be quite as good as LoRA or full fine-tuning.

Unsloth supports both LoRA and QLoRA for fine-tuning. The rank of the LoRA adapters, LoRA alpha, and LoRA dropout are important hyperparameters that control the number of parameters and the contribution of the adapters to the final model weights.
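As a rough sketch of what this looks like in practice (the model identifier and parameter values below are illustrative placeholders, not the exact ones from the notebook), loading a 4-bit quantized Llama 3.1 base model with Unsloth might look like this:

# Load a 4-bit quantized Llama 3.1 base model with Unsloth (illustrative values)
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",  # example model identifier
    max_seq_length=2048,                     # adjust to your training data
    load_in_4bit=True,                       # keep base weights in 4-bit (QLoRA-style)
)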

Setting Up LoRA Adapters: Rank, LoRA Alpha, and LoRA Dropout

To set up the LoRA adapters, there are a few key parameters to consider:

  1. Rank: The rank of the LoRA adapters controls the number of parameters that will be updated during fine-tuning. A lower rank means fewer parameters, which reduces the VRAM requirements but may also limit the model's ability to adapt. Conversely, a higher rank allows for more flexibility but requires more VRAM.

  2. LoRA Alpha: This parameter controls the contribution of the LoRA adapters to the final model weights. A higher LoRA Alpha value means the LoRA adapters will have a stronger influence, while a lower value means they will have a weaker influence.

  3. LoRA Dropout: This parameter controls the dropout rate applied to the LoRA adapters during training. Increasing the dropout rate can help prevent overfitting, but it may also reduce the model's performance.

By adjusting these parameters, you can find the right balance between VRAM requirements, training speed, and model performance for your specific use case.
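As a minimal sketch (the specific values and the list of target modules below are illustrative, not prescriptive), attaching the LoRA adapters to the loaded model with Unsloth typically looks like this:

# Attach LoRA adapters to the loaded model (illustrative hyperparameter values)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,             # rank: controls the number of trainable adapter parameters
    lora_alpha=16,    # scales the adapters' contribution to the final weights
    lora_dropout=0,   # dropout on the adapters; a higher value can reduce overfitting
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)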

Data Preparation: Prompt Template and End-of-Sequence Token

To prepare the data for fine-tuning, we need to set up the prompt template and specify the end-of-sequence token.

The prompt template is crucial, as it defines the format of the input data that the model will be trained on. For the Llama 3.1 family, we'll be using the Alpaca prompt template, which includes an instruction and an input that provides further context. The model is then expected to generate an appropriate response.

# Alpaca prompt template (the response slot is filled during training and left empty at inference time)
alpaca_prompt = "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"

Additionally, we need to append an end-of-sequence token to each training example so the model learns when to stop generating. Skipping this step is a common cause of models that generate text indefinitely, an issue many people have run into when using the quantized versions of the models with llama.cpp.

# Use the tokenizer's end-of-sequence token (appended to every training example)
EOS_TOKEN = tokenizer.eos_token

By setting up the prompt template and the end-of-sequence token, we ensure that the data is properly formatted for the fine-tuning process, which is a critical step in achieving good results.
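Putting the two together, a minimal formatting function might look like the sketch below. It assumes the dataset has already been loaded with Hugging Face datasets and uses Alpaca-style column names ("instruction", "input", "output"), which are assumptions about your data layout:

# Format each example into the Alpaca prompt and append the EOS token
def formatting_prompts_func(examples):
    texts = []
    for instruction, inp, output in zip(examples["instruction"], examples["input"], examples["output"]):
        texts.append(alpaca_prompt.format(instruction, inp, output) + EOS_TOKEN)
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)

The resulting "text" column is what the trainer consumes in the next section.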

Training the Model with SFT Trainer

To train the model, we will be using the SFTTrainer (Supervised Fine-Tuning Trainer) from the TRL library, which is created and maintained by Hugging Face.

First, we provide our model, tokenizer, and the dataset to the SFT Trainer. In this case, we are using the text column from the dataset, as we have set up our prompt template to use this field.

We also set the maximum sequence length, which should be based on the examples in your training data. Keep in mind that a higher sequence length will increase the VRAM requirements.

Next, we configure the training arguments, such as the device (in this case, a T4 GPU on Google Colab with around 15GB of VRAM) and the number of training steps.

Finally, we run the trainer, and you can observe the decreasing loss, which is a good indication of the training progress.
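As a sketch of that setup (the batch size, step count, learning rate, and other values below are illustrative, and argument names may differ slightly depending on your TRL version):

# Configure and run the SFTTrainer (illustrative hyperparameter values)
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",   # the column produced by the formatting function
    max_seq_length=2048,         # higher values increase VRAM usage
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,
        learning_rate=2e-4,
        fp16=True,               # use bf16=True on GPUs that support it
        output_dir="outputs",
    ),
)

trainer.train()  # the loss should decrease as training progresses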

The training took about 8 minutes, and the peak reserved memory was around 8GB, which is about 53% of the available VRAM on the T4 GPU. This demonstrates the efficiency of the Unsloth approach, which allows for fine-tuning with a relatively low VRAM requirement.

Inference and Streaming

To do inference, we use the for_inference method of the FastLanguageModel class, which switches the trained model into Unsloth's faster inference mode. We then tokenize an input prompt in the Alpaca format, call generate, and set the maximum number of new tokens to produce.

# Perform inference on the fine-tuned model (instruction and input_text hold your prompt pieces)
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference mode
inputs = tokenizer([alpaca_prompt.format(instruction, input_text, "")], return_tensors="pt").to("cuda")  # leave the response slot empty
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

To enable streaming, we can create a TextStreamer from the transformers library and pass it to generate. This will display the response one token at a time as it is produced.

# Enable streaming output
from transformers import TextStreamer
streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, max_new_tokens=256, streamer=streamer)

With this, you can now perform inference on your fine-tuned model and even enable streaming for a more interactive experience.

Saving, Loading, and Fine-Tuning Models

To save the fine-tuned model and tokenizer, you can use the save_pretrained() function on the model and tokenizer:

model.save_pretrained("path/to/save/model")
tokenizer.save_pretrained("path/to/save/tokenizer")

This will save the fine-tuned LoRA adapter weights, their configuration, and the tokenizer files to the specified directories.

To load the saved model and tokenizer, you can use the same FastLanguageModel class and point it at the local directory:

# from_pretrained returns both the model and its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(model_name="path/to/save/model", load_in_4bit=True)

This will load the model and tokenizer from the saved files.

Another great feature of Unsloth is the ability to save the models in different formats, such as merged 16-bit floating-point weights for vLLM or directly in GGUF format for llama.cpp. This allows for easy deployment and integration with various platforms and frameworks.
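As a rough sketch (the method names follow the current Unsloth documentation and may change between versions; the output directories and the q4_k_m quantization method are just example choices):

# Export the fine-tuned model in different formats (illustrative)
model.save_pretrained_merged("model_16bit", tokenizer, save_method="merged_16bit")  # merged fp16 weights, e.g. for vLLM
model.save_pretrained_gguf("model_gguf", tokenizer, quantization_method="q4_k_m")   # GGUF export for llama.cpp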

Unsloth also introduces a new chat UI based on Gradio, which lets you run models trained with Unsloth and chat with them interactively; more on this below.

Unsloth's New Chat UI

Unsloth has introduced a new chat UI based on Gradio, which allows you to easily interact with the language models trained using Unsloth. This chat UI provides a user-friendly interface for chatting with the models and exploring their capabilities.

To use the chat UI, you can clone the Unsloth Studio repository and run the provided Google Colab notebook. This will set up the necessary environment and launch the chat UI, where you can start conversing with the language model.

The chat UI supports features like streaming responses, allowing you to see the model's output as it is generated. This can be useful for observing the model's thought process and how it generates responses.

Additionally, Unsloth's chat UI lets you save and load your fine-tuned models, making it easy to continue working with your customized language models. You can also export your models in various formats, such as 16-bit floating-point precision for vLLM or GGUF format for llama.cpp, providing flexibility in how you use and deploy your models.

Overall, Unsloth's new chat UI is a valuable tool for interacting with and exploring the capabilities of language models trained using the Unsloth framework.

FAQ