# Fine-tuning TTS Models

A technical guide to fine-tuning Text-to-Speech (TTS) models with Unsloth.

## Fine-tuning TTS Models: A Practical Guide
This is a straight-to-the-point guide to fine-tuning Text-to-Speech (TTS) models. We'll use Unsloth to speed up the process. The goal is to customize a TTS model for a specific voice or style.
## Fine-tuning vs. Zero-shot

Fine-tuning bakes a target voice or speaking style into the model's weights, whereas zero-shot cloning only conditions on a short reference clip at inference time. Fine-tuning generally gives more consistent results when you have enough audio of the target speaker.
## Model Selection

Smaller models are generally better suited to TTS because of their lower inference latency; models under 3B parameters are a good starting point.

- **CSM-1B** is a base model: it needs audio context for each speaker to perform well. Fine-tuning from a base model like this requires more compute, but offers more control.
- **Orpheus-TTS** is a fine-tuned model that excels at generating realistic speech with built-in emotional cues. It's easier to get good results out of the box with this one.

Unsloth also supports other speech models, such as Spark, Llasa-TTS, Oute-TTS, and CrisperWhisper (a speech-to-text model).
## Loading the Model

Here's how to load a model using Unsloth. We're using LoRA 16-bit, but you have other options.

```python
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/orpheus-3b-0.1-pretrained",
    load_in_4bit = False,      # Use False for LoRA 16-bit
    # load_in_8bit = True,     # Or use 8-bit
    # full_finetuning = True,  # Or full fine-tuning
)
```

## Dataset Preparation
A good dataset is critical. You'll need audio clips and their corresponding transcripts.

1. **Choose your dataset.** You can use a pre-existing one from Hugging Face or create your own. The Elise dataset is a good example. It has two variants:
   - `MrDragonFox/Elise`: includes emotion tags like `<sigh>` and `<laughs>`.
   - `Jinsaryko/Elise`: a base version without emotion tags.

2. **Prepare a custom dataset (optional).** If you have your own audio, organize it in a folder and create a CSV or TSV file with `filename` and `text` columns:

   ```plaintext
   filename,text
   0001.wav,Hello there!
   0002.wav,<sigh> I am very tired.
   ```

3. **Load and process the data.** Here's an example using the `datasets` library to load and prepare the data:

   ```python
   from datasets import load_dataset, Audio

   # For a Hugging Face dataset
   dataset = load_dataset("MrDragonFox/Elise", split="train")

   # For a custom CSV
   # dataset = load_dataset("csv", data_files="mydata.csv", split="train")

   # Set the sampling rate (Orpheus expects 24kHz)
   dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
   ```
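The `filename,text` manifest from step 2 can also be generated with a short script. Below is a minimal sketch (the folder layout, helper name, and transcript mapping are hypothetical, not part of any library API) that scans a folder of `.wav` files and writes the manifest, keeping emotion tags inside the `text` column:

```python
import csv
from pathlib import Path

def write_manifest(audio_dir, transcripts, out_path):
    """Write a filename,text CSV for every transcribed .wav clip.

    transcripts maps a clip's filename to its transcript; emotion
    tags like <sigh> stay inside the text column untouched.
    """
    audio_dir = Path(audio_dir)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "text"])
        for wav in sorted(audio_dir.glob("*.wav")):
            # Skip clips that have no transcript yet
            if wav.name in transcripts:
                writer.writerow([wav.name, transcripts[wav.name]])
```

The resulting file can then be passed straight to `load_dataset("csv", data_files=...)` as shown in step 3.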
## The Fine-tuning Process

Now, let's fine-tune the model.

1. **Configure the trainer.**

   ```python
   from transformers import TrainingArguments, Trainer
   from unsloth import is_bfloat16_supported

   trainer = Trainer(
       model = model,
       train_dataset = dataset,
       args = TrainingArguments(
           per_device_train_batch_size = 1,
           gradient_accumulation_steps = 4,
           warmup_steps = 5,
           max_steps = 60,  # Set to a low number for testing
           learning_rate = 2e-4,
           fp16 = not is_bfloat16_supported(),
           bf16 = is_bfloat16_supported(),
           logging_steps = 1,
           optim = "adamw_8bit",
           weight_decay = 0.01,
           lr_scheduler_type = "linear",
           seed = 3407,
           output_dir = "outputs",
           report_to = "none",
       ),
   )
   ```

2. **Run the training.**

   ```python
   trainer.train()
   ```

3. **Save the LoRA adapters.**

   ```python
   model.save_pretrained("lora_model")
   tokenizer.save_pretrained("lora_model")
   ```
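The trainer settings above determine how much data the run actually sees, which is worth sanity-checking before a long training job. A quick sketch of the arithmetic, using the numbers from the config above:

```python
# Effective batch size = per-device batch size x gradient accumulation steps
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Each optimizer step consumes one effective batch, so with max_steps = 60
# the run sees 60 * 4 = 240 training examples in total.
max_steps = 60
examples_seen = max_steps * effective_batch_size

print(effective_batch_size, examples_seen)  # 4 240
```

For a real run you'd raise `max_steps` (or switch to `num_train_epochs`) so the model sees your full dataset several times.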