# đź‘‹Welcome
New to Unsloth? Start here!
[Unsloth](https://github.com/unslothai/unsloth) makes finetuning large language models like Llama-3, Mistral, Phi-3 and Gemma 2x faster, with 70% less memory usage, and with no degradation in accuracy! Our docs will help you navigate training your very own custom model, covering the essentials of [creating datasets](https://docs.unsloth.ai/tutorials/how-to-finetune-llama-3-and-export-to-ollama#id-6.-alpaca-dataset), and running and [deploying](https://docs.unsloth.ai/tutorials/how-to-finetune-llama-3-and-export-to-ollama#id-13.-exporting-to-ollama) your model. You'll also learn how to integrate third-party tools, use platforms like [Google Colab](https://docs.unsloth.ai/tutorials/how-to-finetune-llama-3-and-export-to-ollama#id-4.-selecting-a-model-to-finetune), and more!
## What is finetuning and why?
If we want a language model to learn a new skill, a new language, a new programming language, or simply to follow and answer instructions the way ChatGPT does, we finetune it!
Finetuning is the process of updating the actual "brains" of the language model through a process called back-propagation. However, finetuning can be very slow and very resource intensive.
## How to use Unsloth?
Our open-source version of [Unsloth](https://github.com/unslothai/unsloth) can be installed locally or on another GPU service like Google Colab. Most people use Unsloth through Google Colab, which provides a free GPU to train with. You can access all of our notebooks [here](https://github.com/unslothai/unsloth#-finetune-for-free).
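Once installed, the core workflow starts by loading a model through Unsloth's `FastLanguageModel` (the same class used throughout these docs). A minimal sketch, where the model name and sequence length are just example values:
```
from unsloth import FastLanguageModel

# Example only: any of the 4-bit models listed later in these docs works here.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",  # example model
    max_seq_length = 2048,   # example sequence length
    dtype = None,            # auto-detect; bfloat16 on newer GPUs
    load_in_4bit = True,     # 4-bit quantization to save memory
)
```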
# đź“’Unsloth Notebooks
See the list below for all our notebooks:
#### Google Colab
### Main notebooks:
- [Llama 3.1 (8B)](https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing)
- [Mistral NeMo (12B)](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)
- [Gemma 2 (9B)](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
- [_**Inference chat UI**_](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)
- [Phi-3.5 (mini)](https://colab.research.google.com/drive/1lN6hPQveB_mHSnTOYifygFcrO8C1bxq4?usp=sharing)
- [Llama 3 (8B)](https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing)
- [Mistral v0.3 (7B)](https://colab.research.google.com/drive/1_yNCks4BTD5zOnjozppphh5GzMFaMKq_?usp=sharing)
- [Phi-3 (medium)](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
- [Qwen2 (7B)](https://colab.research.google.com/drive/1mvwsIQWDs2EdZxZQF9pRGnnOvE86MVvR?usp=sharing)
- [Gemma (2B)](https://colab.research.google.com/drive/15gGm7x_jTm017_Ic8e317tdIpDG53Mtu?usp=sharing)
- [TinyLlama](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
### Other notebooks:
- [ORPO](https://colab.research.google.com/drive/11t4njE3c4Lxl-07OD8lJSMKkfyJml3Tn?usp=sharing)
- [Ollama](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
- [Text Classification](https://github.com/timothelaborie/text_classification_scripts/blob/main/unsloth_classification.ipynb) by Timotheeee
- [Multiple Datasets](https://colab.research.google.com/drive/1njCCbE1YVal9xC83hjdo2hiGItpY_D6t?usp=sharing) by Flail
- [DPO Zephyr](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
- [Conversational](https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing)
- [ChatML](https://colab.research.google.com/drive/15F1xyn8497_dUbxZP4zWmPZ3PJx1Oymv?usp=sharing)
- [Text Completion](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
- [Continued Pretraining](https://colab.research.google.com/drive/1tEd1FrOXWMnCU9UIvdYhs61tkxdMuKZu?usp=sharing)
- [Mistral v0.3 Instruct (7B)](https://colab.research.google.com/drive/15F1xyn8497_dUbxZP4zWmPZ3PJx1Oymv?usp=sharing)
- [CodeGemma (7B)](https://colab.research.google.com/drive/19lwcRk_ZQ_ZtX-qzFP3qZBBHZNcMD1hh?usp=sharing)
- [Inference only](https://colab.research.google.com/drive/1aqlNQi7MMJbynFDyOQteD2t0yVfjb9Zh?usp=sharing)
# 📚All Our Models
See the list below for all our 4-bit bitsandbytes (bnb) uploaded models.
You can also view all our uploaded models on [Hugging Face directly](https://huggingface.co/unsloth).
|Model|Base|Instruct|
|---|---|---|
|Llama 3.1|[8B](https://huggingface.co/unsloth/Meta-Llama-3.1-8B-bnb-4bit)<br>[70B](https://huggingface.co/unsloth/Meta-Llama-3.1-70B-bnb-4bit)<br>[405B](https://huggingface.co/unsloth/Meta-Llama-3.1-405B-bnb-4bit)|[8B](https://huggingface.co/unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit)<br>[70B](https://huggingface.co/unsloth/Meta-Llama-3.1-70B-Instruct-bnb-4bit)<br>[405B](https://huggingface.co/unsloth/Meta-Llama-3.1-405B-Instruct-bnb-4bit)|
|Phi-3.5||[mini](https://huggingface.co/unsloth/Phi-3.5-mini-instruct-bnb-4bit)|
|Mistral NeMo|[12B](https://huggingface.co/unsloth/Mistral-Nemo-Base-2407-bnb-4bit)|[12B](https://huggingface.co/unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit)|
|Gemma 2|[2B](https://huggingface.co/unsloth/gemma-2-2b-bnb-4bit)<br>[9B](https://huggingface.co/unsloth/gemma-2-9b-bnb-4bit)<br>[27B](https://huggingface.co/unsloth/gemma-2-27b-bnb-4bit)|[2B](https://huggingface.co/unsloth/gemma-2-2b-it-bnb-4bit)<br>[9B](https://huggingface.co/unsloth/gemma-2-9b-it-bnb-4bit)<br>[27B](https://huggingface.co/unsloth/gemma-2-27b-it-bnb-4bit)|
|Llama 3|[8B](https://huggingface.co/unsloth/llama-3-8b-bnb-4bit)<br>[70B](https://huggingface.co/unsloth/llama-3-70b-bnb-4bit)|[8B](https://huggingface.co/unsloth/llama-3-8b-Instruct-bnb-4bit)<br>[70B](https://huggingface.co/unsloth/llama-3-70b-Instruct-bnb-4bit)|
|Phi-3||[mini](https://huggingface.co/unsloth/Phi-3-mini-4k-instruct-bnb-4bit)<br>[medium](https://huggingface.co/unsloth/Phi-3-medium-4k-instruct-bnb-4bit)|
|Mistral|[7B (v0.3)](https://huggingface.co/unsloth/mistral-7b-v0.3-bnb-4bit)<br>[7B (v0.2)](https://huggingface.co/unsloth/mistral-7b-v0.2-bnb-4bit)|[Large](https://huggingface.co/unsloth/Mistral-Large-Instruct-2407-bnb-4bit)<br>[7B (v0.3)](https://huggingface.co/unsloth/mistral-7b-instruct-v0.3-bnb-4bit)<br>[7B (v0.2)](https://huggingface.co/unsloth/mistral-7b-instruct-v0.2-bnb-4bit)|
|Qwen2|[1.5B](https://huggingface.co/unsloth/Qwen2-1.5B-bnb-4bit)<br>[7B](https://huggingface.co/unsloth/Qwen2-7B-bnb-4bit)<br>[72B](https://huggingface.co/unsloth/Qwen2-72B-bnb-4bit)|[1.5B](https://huggingface.co/unsloth/Qwen2-1.5B-Instruct-bnb-4bit)<br>[7B](https://huggingface.co/unsloth/Qwen2-7B-Instruct-bnb-4bit)<br>[72B](https://huggingface.co/unsloth/Qwen2-72B-Instruct-bnb-4bit)|
|Llama 2|[7B](https://huggingface.co/unsloth/llama-2-7b-bnb-4bit)<br>[13B](https://huggingface.co/unsloth/llama-2-13b-bnb-4bit)|[7B](https://huggingface.co/unsloth/llama-2-7b-chat-bnb-4bit)|
|TinyLlama|[Base](https://huggingface.co/unsloth/tinyllama-bnb-4bit)|[Instruct](https://huggingface.co/unsloth/tinyllama-chat-bnb-4bit)|
|Zephyr SFT||[Instruct](https://huggingface.co/unsloth/zephyr-sft-bnb-4bit)|
|CodeLlama|[7B](https://huggingface.co/unsloth/codellama-7b-bnb-4bit)<br>[13B](https://huggingface.co/unsloth/codellama-13b-bnb-4bit)<br>[34B](https://huggingface.co/unsloth/codellama-34b-bnb-4bit)||
|Yi|[6B (v 1.5)](https://huggingface.co/unsloth/Yi-1.5-6B-bnb-4bit)<br>[6B](https://huggingface.co/unsloth/yi-6b-bnb-4bit)<br>[34B](https://huggingface.co/unsloth/yi-34b-bnb-4bit)|[34B](https://huggingface.co/unsloth/yi-34b-chat-bnb-4bit)|
# 📥Installation
Learn to install Unsloth locally or on Google Colab.
## Updating
To update Unsloth, follow the steps below:
### Updating without dependency updates
```
pip uninstall unsloth -y
pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
```
## Conda Install
To install Unsloth locally on Conda, follow the steps below:
Select either `pytorch-cuda=11.8` for CUDA 11.8 or `pytorch-cuda=12.1` for CUDA 12.1. If you have `mamba`, use `mamba` instead of `conda` for faster solving. See this [Github issue](https://github.com/unslothai/unsloth/issues/73) for help on debugging Conda installs.
```
conda create --name unsloth_env \
    python=3.10 \
    pytorch-cuda=<11.8/12.1> \
    pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers \
    -y
conda activate unsloth_env
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps "trl<0.9.0" peft accelerate bitsandbytes
```
## Pip Install
To install Unsloth locally via Pip, follow the steps below:
Do **NOT** use this if you have Anaconda. You must use the Conda install method, or else stuff will BREAK.
1. Find your CUDA version via
```
import torch; torch.version.cuda
```
2. For Pytorch 2.1.0: You can update Pytorch via Pip (interchange `cu121` / `cu118`). Go to https://pytorch.org/ to learn more. Select either `cu118` for CUDA 11.8 or `cu121` for CUDA 12.1. If you have an RTX 3060 or higher (A100, H100, etc.), use the `"ampere"` path. For Pytorch 2.1.1: go to step 3. For Pytorch 2.2.0: go to step 4.
```
pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.0 triton \
--index-url https://download.pytorch.org/whl/cu121
```
```
pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere] @ git+https://github.com/unslothai/unsloth.git"
```
3. For Pytorch 2.1.1: Use the `"ampere"` path for newer RTX 30xx GPUs or higher.
```
pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.1 triton \
--index-url https://download.pytorch.org/whl/cu121
```
```
pip install "unsloth[cu118-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch211] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch211] @ git+https://github.com/unslothai/unsloth.git"
```
4. For Pytorch 2.2.0: Use the `"ampere"` path for newer RTX 30xx GPUs or higher.
```
pip install --upgrade --force-reinstall --no-cache-dir torch==2.2.0 triton \
--index-url https://download.pytorch.org/whl/cu121
```
```
pip install "unsloth[cu118-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"
```
5. If you get errors, try the below first, then go back to step 1:
```
pip install --upgrade pip
```
6. For Pytorch 2.2.1:
```
# RTX 3090, 4090 Ampere GPUs:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
# Pre Ampere RTX 2080, T4, GTX 1080 GPUs:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes
```
7. For Pytorch 2.3.0: Use the `"ampere"` path for newer RTX 30xx GPUs or higher.
```
pip install "unsloth[cu118-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
```
8. To troubleshoot installs, try running the commands below (all must succeed). Xformers should be available on most setups.
```
nvcc
python -m xformers.info
python -m bitsandbytes
```
# Google Colab
To install and run Unsloth on Google Colab, follow the steps below:

If you have never used a Colab notebook, a quick primer on the notebook itself:
1. **Play Button at each "cell".** Click on this to run that cell's code. Do not skip any cells; run every cell in chronological order. If you encounter errors, simply rerun the cell that failed or was skipped. You can also press CTRL + ENTER instead of clicking the play button.
2. **Runtime Button in the top toolbar.** You can also use this button and hit "Run all" to run the entire notebook in 1 go. This will skip all the customization steps, but is a good first try.
3. **Connect / Reconnect T4 button.** T4 is the free GPU Google is providing. It's quite powerful!
The first installation cell looks like the one below. Remember to click the PLAY button in the brackets `[ ]`. We grab our open-source GitHub package and install some other packages.
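What that first cell contains is essentially the install commands from the Installation page; a minimal sketch of a Colab install cell (the exact package pins vary between notebook versions):
```
# Install Unsloth from GitHub, plus its training dependencies (example Colab cell).
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes
```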
----
# Basics
# đź“‚Saving Models
Learn how to save your finetuned model so you can run it in your favorite inference engine.
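Besides GGUF (covered next), the notebooks also expose helpers for saving merged 16-bit weights, for example for vLLM. A minimal sketch, assuming the `save_pretrained_merged` / `push_to_hub_merged` helpers from the Unsloth notebooks, with placeholder directory and username:
```
# Merge the LoRA weights into the base model and save as 16-bit:
model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit")

# Or push the merged model to the Hugging Face Hub ("hf_username" is a placeholder):
model.push_to_hub_merged("hf_username/model", tokenizer, save_method = "merged_16bit")
```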
--------
# Saving to GGUF
Saving models to 16-bit for GGUF so you can use them with Ollama, Jan AI, Open WebUI and more!
To save to GGUF, use the below to save locally:
```
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q4_k_m")
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q8_0")
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "f16")
```
To push to the Hugging Face Hub:
```
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "q8_0")
```
All supported quantization options for `quantization_method` are listed below:
```
# https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/quantize.cpp#L19
# From https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html
ALLOWED_QUANTS = \
{
"not_quantized" : "Recommended. Fast conversion. Slow inference, big files.",
"fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
"quantized" : "Recommended. Slow conversion. Fast inference, small files.",
"f32" : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
"f16" : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
"q8_0" : "Fast conversion. High resource use, but generally acceptable.",
"q4_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
"q5_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
"q2_k" : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.",
"q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
"q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
"q3_k_s" : "Uses Q3_K for all tensors",
"q4_0" : "Original quant method, 4-bit.",
"q4_1" : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
"q4_k_s" : "Uses Q4_K for all tensors",
"q4_k" : "alias for q4_k_m",
"q5_k" : "alias for q5_k_m",
"q5_0" : "Higher accuracy, higher resource usage and slower inference.",
"q5_1" : "Even higher accuracy, resource usage and slower inference.",
"q5_k_s" : "Uses Q5_K for all tensors",
"q6_k" : "Uses Q8_K for all tensors",
"iq2_xxs" : "2.06 bpw quantization",
"iq2_xs" : "2.31 bpw quantization",
"iq3_xxs" : "3.06 bpw quantization",
"q3_k_xs" : "3-bit extra small quantization",
}
```
----
# Saving to Ollama
## Saving on Google Colab
You can save the finetuned model as a small 100MB file called a LoRA adapter like below. You can also push it to the Hugging Face Hub if you want to upload your model! Remember to get a Hugging Face token via [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) and add your token!
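A minimal sketch of that saving cell, where `lora_model` is just an example directory name and `hf_username` is a placeholder:
```
# Save only the LoRA adapter locally (a small ~100MB folder):
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

# Or push the adapter to the Hugging Face Hub (requires your HF token):
model.push_to_hub("hf_username/lora_model", token = "hf_...")
tokenizer.push_to_hub("hf_username/lora_model", token = "hf_...")
```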

After saving the model, we can use Unsloth to run the model itself! Use `FastLanguageModel` again to call it for inference!
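A hedged sketch of reloading the adapter for inference, where `lora_model` matches the folder saved above and the prompt is only an example:
```
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",  # the LoRA adapter folder saved above
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference mode

inputs = tokenizer("Continue the fibonacci sequence: 1, 1, 2, 3, 5, 8,", return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 64)
print(tokenizer.batch_decode(outputs))
```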

## Exporting to Ollama
Finally we can export our finetuned model to Ollama itself! First we have to install Ollama in the Colab notebook:
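In the notebook this is just Ollama's official Linux install script run as a shell cell:
```
!curl -fsSL https://ollama.com/install.sh | sh
```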

Then we export our finetuned model to llama.cpp's GGUF format like below:

Remember to change `False` to `True` for one row only (not every row!), or else you'll be waiting for a very long time! We normally suggest setting the first row to `True`, so we can export the finetuned model quickly to the `Q8_0` format (8-bit quantization). We also allow you to export to a whole list of other quantization methods, a popular one being `q4_k_m`.
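A sketch of what that export cell looks like, with only the first row flipped to `True` (the directory name is an example; `model` and `tokenizer` come from the training cells above):
```
# Flip False -> True on exactly one row to pick the quantization you want:
if True:  model.save_pretrained_gguf("model", tokenizer, quantization_method = "q8_0")
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
```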
Head over to [https://github.com/ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp) to learn more about GGUF. We also have some manual instructions of how to export to GGUF if you want here: [https://github.com/unslothai/unsloth/wiki#manually-saving-to-gguf](https://github.com/unslothai/unsloth/wiki#manually-saving-to-gguf)
You will see a long list of text like below - please wait 5 to 10 minutes!!

And finally at the very end, it'll look like below:

Then, we have to run Ollama itself in the background. We use `subprocess` because Colab doesn't like asynchronous calls, but normally one just runs `ollama serve` in the terminal / command prompt.
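A minimal sketch of the background launch (in a terminal you would simply run `ollama serve`):
```
import subprocess, time

# Start the Ollama server in the background and give it a few seconds to come up.
subprocess.Popen(["ollama", "serve"])
time.sleep(3)
```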

## Automatic `Modelfile` creation
The trick Unsloth provides is that we automatically create a `Modelfile`, which Ollama requires! This is just a list of settings and includes the chat template which we used for the finetune process! You can also print the generated `Modelfile` like below:
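In the Ollama notebook the generated `Modelfile` is attached to the tokenizer; printing it looks roughly like this (the `_ollama_modelfile` attribute name is taken from the notebook and may differ between versions, so treat it as an assumption):
```
print(tokenizer._ollama_modelfile)
```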

We then ask Ollama to create an Ollama-compatible model by using the `Modelfile`:
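In the Colab notebook this is a shell cell along these lines, where `unsloth_model` and the `Modelfile` path are placeholders for your own names:
```
!ollama create unsloth_model -f ./model/Modelfile
```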

## Ollama Inference
And we can now call the model for inference by calling the Ollama server itself, which is running in the background on your own local machine or in the free Colab notebook. Remember you can edit the yellow underlined part.
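One way to do this is through Ollama's REST API on its default port; a hedged example, where the model name `unsloth_model` matches the `ollama create` step above (in Colab, prefix the command with `!`):
```
curl http://localhost:11434/api/chat -d '{
  "model": "unsloth_model",
  "messages": [{ "role": "user", "content": "Continue the fibonacci sequence: 1, 1, 2, 3, 5, 8," }]
}'
```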
# Troubleshooting
### Saving to `safetensors`, not `bin` format in Colab
We save to `.bin` in Colab since it's roughly 4x faster, but set `safe_serialization = None` to force saving to `.safetensors`: `model.save_pretrained(..., safe_serialization = None)` or `model.push_to_hub(..., safe_serialization = None)`.
### If saving to GGUF or vLLM 16bit crashes
You can try reducing the maximum GPU usage during saving by changing `maximum_memory_usage`.
The default is `model.save_pretrained(..., maximum_memory_usage = 0.75)`. Reduce it to say 0.5 to use 50% of GPU peak memory or lower. This can reduce OOM crashes during saving.
-----
# ♻️Continued Pretraining
Also known as Continued Finetuning. Unsloth allows you to continually pretrain a model so it can learn a new language.
The [text completion notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing) is for continued pretraining/raw text. The [continued pretraining notebook](https://colab.research.google.com/drive/1tEd1FrOXWMnCU9UIvdYhs61tkxdMuKZu?usp=sharing) is for learning another language.
You can read more about continued pretraining and our release in our [blog post](https://unsloth.ai/blog/contpretraining).
## What is Continued Pretraining?
Continued or continual pretraining (CPT) is necessary to “steer” the language model to understand new domains of knowledge, or out-of-distribution domains. Base models like Llama-3 8b or Mistral 7b are first pretrained on gigantic datasets of trillions of tokens (Llama-3, for example, was trained on 15 trillion tokens).
But sometimes these models have not been well trained on other languages or domain-specific text, like law, medicine or other areas. So continued pretraining (CPT) is necessary to make the language model learn new tokens or datasets.
## Advanced Features:
### Loading LoRA adapters for continued finetuning
If you saved a LoRA adapter through Unsloth, you can also continue training using your LoRA weights. Note that the optimizer state will be reset. To also load the optimizer state when continuing finetuning, see the next section.
```
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "LORA_MODEL_NAME",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
trainer = Trainer(...)
trainer.train()
```
### Continued Pretraining & Finetuning the `lm_head` and `embed_tokens` matrices
Add `lm_head` and `embed_tokens` to `target_modules`. On Colab, you will sometimes run out of memory for Llama-3 8b; if so, just add `lm_head`.
```
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "lm_head", "embed_tokens",],
    lora_alpha = 16,
)
```
Then use 2 different learning rates - a 2-10x smaller one for the `lm_head` or `embed_tokens` like so:
```
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    ....
    args = UnslothTrainingArguments(
        ....
        learning_rate = 5e-5,
        embedding_learning_rate = 5e-6, # 2-10x smaller than learning_rate
    ),
)
```
-----
# đź’¬Chat Templates
### List of Colab chat template notebooks:
- [Conversational](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
- [ChatML](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
- [Ollama](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
- [Text Classification](https://github.com/timothelaborie/text_classification_scripts/blob/main/unsloth_classification.ipynb) by Timotheeee
- [Multiple Datasets](https://colab.research.google.com/drive/1njCCbE1YVal9xC83hjdo2hiGItpY_D6t?usp=sharing) by Flail
### More Info
Assuming your dataset is a list of lists of dictionaries like the one below:
```
[
    [{'from': 'human', 'value': 'Hi there!'},
     {'from': 'gpt', 'value': 'Hi how can I help?'},
     {'from': 'human', 'value': 'What is 2+2?'}],

    [{'from': 'human', 'value': "What's your name?"},
     {'from': 'gpt', 'value': "I'm Daniel!"},
     {'from': 'human', 'value': 'Ok! Nice!'},
     {'from': 'gpt', 'value': 'What can I do for you?'},
     {'from': 'human', 'value': 'Oh nothing :)'},],
]
```
You can use our `get_chat_template` to format it. Select `chat_template` to be any of `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth`, and use `mapping` to map the dictionary values `from`, `value` etc. `map_eos_token` allows you to map `<|im_end|>` to EOS without any training.
```
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to the EOS token instead
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("philschmid/guanaco-sharegpt-style", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
```
You can also make your own custom chat templates! For example, our internal chat template is shown below. You must pass in a `tuple` of `(custom_template, eos_token)`, where the `eos_token` must be used inside the template.
```
unsloth_template = \
"{{ bos_token }}"\
"{{ 'You are a helpful assistant to the user\n' }}"\
""\
"