PARALLEL DATA LAB 

PDL Abstract

Practical Offloading for Fine-Tuning LLM on Commodity GPU via Learned Sparse Projectors

39th Annual AAAI Conference on Artificial Intelligence, February 25 – March 4, 2025. Philadelphia, Pennsylvania, USA.

Siyuan Chen, Zhuofeng Wang*, Zelong Guan, Yudong Liu, Phillip B. Gibbons

Carnegie Mellon University
*Peking University

http://www.pdl.cmu.edu/

Fine-tuning large language models (LLMs) requires significant memory, often exceeding the capacity of a single GPU. A common solution to this memory challenge is offloading compute and data from the GPU to the CPU. However, this approach is hampered by the limited bandwidth of commodity hardware, which constrains communication between the CPU and GPU, and by slower matrix multiplications on the CPU.

In this paper, we present an offloading framework, LSPOffload, that enables near-native-speed LLM fine-tuning on commodity hardware through learned sparse projectors. Our data-driven approach learns efficient sparse compressors that reduce communication volume with little loss in precision. Additionally, we introduce a novel layer-wise communication schedule that maximizes the overlap between communication and computation. As a result, our framework can fine-tune a 1.3-billion-parameter model on a 4GB laptop GPU and a 6.7-billion-parameter model on an NVIDIA RTX 4090 GPU with 24GB memory. Compared to state-of-the-art offloading frameworks, our approach reduces end-to-end fine-tuning time by 33.1%-62.5% when converging to the same accuracy.
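To illustrate the core idea, the sketch below compresses a GPU-side gradient with a column-sparse projector before sending it over the slow CPU-GPU link, then reconstructs an approximation on the CPU side. This is only a minimal sketch: the paper learns the projector from data, whereas here the sparse matrix is random, and all dimensions and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: one gradient block (m x n) and a projected
# dimension r << n, so the projected gradient is cheap to transfer.
m, n, r = 512, 512, 64

# Stand-in for a learned sparse projector: k nonzeros per column.
# (LSPOffload learns these entries; random values used here.)
k = 8
P = np.zeros((n, r))
for j in range(r):
    rows = rng.choice(n, size=k, replace=False)
    P[rows, j] = rng.standard_normal(k) / np.sqrt(k)

G = rng.standard_normal((m, n))   # gradient computed on the GPU

G_small = G @ P                   # compressed (m x r): sent to the CPU
G_approx = G_small @ P.T          # approximate reconstruction on the CPU

ratio = G.size / G_small.size     # communication reduction factor
print(ratio)                      # 8.0 for these sizes
```

Because the projector is sparse, both the projection and its transpose are cheap to apply, which is what keeps the compression step from eating the bandwidth savings.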

FULL TR: pdf