QLoRA
Quantized Low-Rank Adaptation — a fine-tuning technique that combines 4-bit quantization of the frozen base model with LoRA adapters trained in higher precision, enabling fine-tuning of large models on consumer-grade GPUs.
QLoRA (Dettmers et al., 2023) made fine-tuning a 65B-parameter model practical on a single 48GB GPU, a job that previously required a multi-GPU cluster. The key innovation is NF4 (NormalFloat4), a 4-bit quantization format optimized for normally distributed weights, combined with double quantization (quantizing the quantization constants themselves) and paged optimizers that spill optimizer state to CPU RAM during GPU memory spikes.
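In the Hugging Face stack, these pieces map onto `BitsAndBytesConfig` flags. A minimal loading sketch, assuming the transformers and bitsandbytes libraries are installed; the checkpoint name is a placeholder:

```python
# Sketch: load a frozen base model quantized to 4-bit NF4 with double quantization.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4, fitted to normally distributed weights
    bnb_4bit_use_double_quant=True,         # also quantize the per-block quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```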
The quality gap between QLoRA and full 16-bit fine-tuning is surprisingly small, often within 1–2% on downstream tasks: the frozen weights are dequantized to BFloat16 on the fly for each forward pass, and the LoRA adapters, trained in BFloat16, can compensate for the quantization noise. This makes QLoRA the go-to method for researchers and practitioners who want to fine-tune large models without access to A100/H100 clusters.
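Concretely, only the adapter weights receive gradients. A sketch of attaching BF16 LoRA adapters to the 4-bit model above with the peft library; the target module names are an assumption and vary by architecture:

```python
# Sketch: attach trainable BF16 LoRA adapters to the frozen 4-bit base.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)  # stabilizes k-bit bases for training

lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # illustrative; depends on the model architecture
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```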
Practical QLoRA recipes typically use rank 16–64 adapters, a learning rate around 2e-4, and the Alpaca or ShareGPT instruction format (see the training sketch below). Libraries like Axolotl, Unsloth, and LLaMA-Factory wrap QLoRA in user-friendly training interfaces. As of 2026, the open-source fine-tuning ecosystem is almost entirely built on LoRA/QLoRA variants.
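Wiring those hyperparameters into a run takes a few lines with transformers' `Trainer`. A minimal sketch, assuming `model` from above and a hypothetical pre-tokenized instruction dataset `train_dataset`:

```python
# Sketch: a common QLoRA training recipe (hyperparameters from the text above).
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="qlora-out",
    learning_rate=2e-4,             # common QLoRA default
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    bf16=True,                      # adapters and activations in BFloat16
    optim="paged_adamw_8bit",       # paged optimizer: state can spill to CPU RAM
    logging_steps=10,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```

The paged AdamW variant ties back to the paged-optimizer idea above: optimizer state lives in unified memory, so transient GPU memory spikes page to CPU RAM instead of causing out-of-memory failures.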