- **Description**: Used for down-projection, typically reducing the dimensionality of the input.
- **Impact**: Compresses the input to a lower-dimensional space, useful for reducing computational complexity and controlling the model size.

-----
# 🏆Reward Modelling - DPO & ORPO

To use DPO or ORPO with Unsloth, follow the steps below.

DPO (Direct Preference Optimization), ORPO (Odds Ratio Preference Optimization), PPO, and reward modelling all work with Unsloth.

We have Google Colab notebooks for reproducing ORPO and DPO Zephyr:

- [ORPO notebook](https://colab.research.google.com/drive/11t4njE3c4Lxl-07OD8lJSMKkfyJml3Tn?usp=sharing)
- [DPO Zephyr notebook](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)

A minimal ORPO code sketch is also included after the DPO example below.

We're also in 🤗Hugging Face's official docs! We're on the [SFT docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth) and the [DPO docs](https://huggingface.co/docs/trl/main/en/dpo_trainer#accelerate-dpo-fine-tuning-using-unsloth).

## DPO Code

```python
from unsloth import FastLanguageModel, PatchDPOTrainer
from unsloth import is_bfloat16_supported
PatchDPOTrainer()  # patch TRL's DPOTrainer for use with Unsloth
import torch
from transformers import TrainingArguments
from trl import DPOTrainer

max_seq_length = 2048  # choose any; must be defined before it is used below

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
)

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,  # no separate reference model is needed when training LoRA adapters
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = YOUR_DATASET_HERE,  # see the dataset format example below
    # eval_dataset = YOUR_DATASET_HERE,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)
dpo_trainer.train()
```
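
Both the DPO and ORPO trainers expect a preference dataset with `prompt`, `chosen`, and `rejected` text columns. The snippet below is a tiny, purely illustrative sketch of that format (the rows and the `preference_data` name are made up); in practice, load a real preference dataset such as the one used in the notebooks above.

```python
from datasets import Dataset

# Hypothetical toy rows, only to show the expected column names.
preference_data = Dataset.from_dict({
    "prompt":   ["What is 2 + 2?",  "Name a prime number."],
    "chosen":   ["2 + 2 equals 4.", "7 is a prime number."],
    "rejected": ["2 + 2 equals 5.", "9 is a prime number."],
})
# Pass it to the trainer, e.g. train_dataset = preference_data
```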
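
## ORPO Code (illustrative sketch)

The ORPO notebook linked above is the reference. For orientation only, here is a minimal sketch of how ORPO can be wired up with an Unsloth model and TRL's `ORPOTrainer`/`ORPOConfig`. It assumes a TRL version that ships ORPO support; the hyperparameters shown are illustrative rather than taken from the notebook, and newer TRL releases accept `processing_class=` instead of `tokenizer=`.

```python
from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import ORPOConfig, ORPOTrainer  # requires a TRL version with ORPO support

max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit",  # any Unsloth 4bit base should work
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Same fast LoRA patching as in the DPO example above
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    max_seq_length = max_seq_length,
)

orpo_trainer = ORPOTrainer(
    model = model,  # ORPO needs no separate reference model
    args = ORPOConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        beta = 0.1,              # weight of the odds-ratio (preference) term
        max_length = 1024,
        max_prompt_length = 512,
        num_train_epochs = 1,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        seed = 42,
        output_dir = "outputs",
    ),
    train_dataset = YOUR_DATASET_HERE,  # same "prompt"/"chosen"/"rejected" format as for DPO
    tokenizer = tokenizer,              # newer TRL versions use processing_class= instead
)
orpo_trainer.train()
```

Because ORPO adds the preference term directly to the standard fine-tuning loss, it needs no reference model and no separate SFT stage, which keeps the setup shorter than the DPO one.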

----------
# ⚠️Errors