
Fast Large Language Model Serving with a Consumer-grade GPU

January 4, 2024



Objective: PowerInfer is introduced as a high-speed LLM inference engine optimized for personal computers equipped with a single consumer-grade GPU.


A minority of neurons, the hot neurons, are consistently activated across inputs.

The majority, the cold neurons, vary based on specific inputs.

Innovation: The core innovation of PowerInfer is the exploitation of the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation. This leads to a novel GPU-CPU hybrid inference engine design.

Experiment: Across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU, PowerInfer's token generation rate is only 18% lower than that achieved by a top-tier server-grade A100 GPU.


Layer-wise model offloading [14] is not a good approach, because it causes a locality mismatch between hardware architecture and the characteristics of LLM inference: the memory hierarchy is designed around data locality, but a layer-wise partition does not follow data locality.

LLM inference itself, however, does exhibit locality. For example, in the OPT model, less than 10% of the elements in the activation map are non-zero, and these can be predicted with more than 93% accuracy at runtime [1].

Based on this insight, PowerInfer proposes classifying neurons into hot and cold ones ahead of time and placing them on different tiers. Three challenges arise:

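  • The predictors themselves take up part of the GPU memory => use adaptive predictors (predictors for layers with high sparsity and skewness can be made smaller)
  • cuSPARSE is not suited to neuron-aware sparse computation
  • Choosing the initial neuron placement is hard; it is formulated as an integer linear programming (ILP) problem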


Activation Sparsity

Recent studies have revealed that LLM inference shows a notable sparsity in neuron activation [1][2][3].

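This sparsity appears in both the attention and the MLP blocks:

  • In attention, about half of the heads make only minimal contributions
  • In the MLP, ReLU filters out negative values, so the sparsity here is very high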
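
To make the MLP case concrete, here is a minimal sketch in plain Python. The toy sizes, random weights, and function names are assumptions for illustration, not taken from the paper. It shows that neurons zeroed by ReLU can be skipped in the second projection without changing the output.

```python
import random

def relu(v):
    return v if v > 0.0 else 0.0

def mlp_dense(x, w_up, w_down):
    """Standard MLP block: out = sum_i relu(w_up[i] . x) * w_down[i]."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_up]
    out = [0.0] * len(w_down[0])
    for h, row in zip(hidden, w_down):
        for j, w in enumerate(row):
            out[j] += h * w
    return out, hidden

def mlp_sparse(x, w_up, w_down):
    """Same block, but neurons whose ReLU output is zero are skipped."""
    out = [0.0] * len(w_down[0])
    active = 0
    for up_row, down_row in zip(w_up, w_down):
        h = relu(sum(w * xi for w, xi in zip(up_row, x)))
        if h == 0.0:
            continue  # inactive neuron: its row of w_down is never touched
        active += 1
        for j, w in enumerate(down_row):
            out[j] += h * w
    return out, active

# Toy data (hypothetical sizes; one row of weights per neuron).
random.seed(0)
d_in, n_neurons, d_out = 8, 64, 8
x = [random.gauss(0, 1) for _ in range(d_in)]
w_up = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(n_neurons)]
w_down = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(n_neurons)]

dense_out, hidden = mlp_dense(x, w_up, w_down)
sparse_out, active = mlp_sparse(x, w_up, w_down)
print(f"active neurons: {active}/{n_neurons}")
```

With random weights roughly half of the neurons are active; the notes above report far higher sparsity for real ReLU-based models such as OPT.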

Activations can therefore be predicted ahead of time: it is possible to predict neuron activations a few layers in advance within the ongoing model iteration.

Key prior work: DejaVu [1] utilizes MLP-based predictors during inference, achieving a remarkable accuracy rate of at least 93% in predicting neuron activation.




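Only the GPU is used as the processing unit, and the data it needs is transferred from the CPU. FlexGen [4] is the typical example, but this approach sacrifices latency for throughput. DejaVu [1] exploits sparsity but does not consider offloading.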

The UM technique is worth looking into: Since DejaVu only works for GPU, we modified it by using NVIDIA Unified Memory (UM) [5] to fetch parameters from CPU memory.



Layers are partitioned between the CPU and the GPU: The CPU processes its layers first, then sends intermediate results to the GPU for token generation.



Insight1: Power-law Activation

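
One way to check for this kind of skew in profiled data is to ask what fraction of neurons accounts for a given share of all activations. The sketch below is an illustration under assumptions: `hot_fraction`, the 80% coverage target, and the synthetic Zipf-like counts are hypothetical and not taken from the paper.

```python
def hot_fraction(counts, coverage=0.8):
    """Smallest fraction of neurons whose activation counts sum to
    at least `coverage` of all activations."""
    total = sum(counts)
    if total == 0:
        return 0.0
    running = 0
    for k, c in enumerate(sorted(counts, reverse=True), start=1):
        running += c
        if running >= coverage * total:
            return k / len(counts)
    return 1.0

# Synthetic per-neuron activation counts (hypothetical, not measured).
zipf_like = [1000 // rank for rank in range(1, 1001)]  # heavily skewed
uniform = [10] * 100                                   # no skew at all

print("skewed :", hot_fraction(zipf_like))
print("uniform:", hot_fraction(uniform))
```

A skewed profile yields a small fraction, meaning a small set of hot neurons covers most activations; a uniform profile needs roughly the same fraction of neurons as the coverage target.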


Insight2: Fast In-CPU Computation


Computing on the CPU can be faster than transferring the data to the GPU and computing there, especially with the small number of activated neurons and the small batch sizes.
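
A back-of-the-envelope model helps explain why. The sketch below assumes the computation is memory-bound at small batch sizes, so time is roughly bytes divided by bandwidth. That model and every number in it (weight size, bandwidths) are assumptions for illustration, not measurements from the paper.

```python
def cpu_compute_time_s(weight_bytes, cpu_mem_bw):
    """Compute in place on the CPU: weights are read once from CPU memory."""
    return weight_bytes / cpu_mem_bw

def gpu_after_transfer_time_s(weight_bytes, pcie_bw, gpu_mem_bw):
    """Ship the weights over PCIe first, then read them once on the GPU."""
    return weight_bytes / pcie_bw + weight_bytes / gpu_mem_bw

GB = 1e9
weight_bytes = 0.1 * GB  # hypothetical: weights of the activated cold neurons

# Hypothetical bandwidths in bytes per second.
cpu_t = cpu_compute_time_s(weight_bytes, cpu_mem_bw=60 * GB)
gpu_t = gpu_after_transfer_time_s(weight_bytes, pcie_bw=25 * GB,
                                  gpu_mem_bw=1000 * GB)

print(f"CPU in place     : {cpu_t * 1e3:.2f} ms")
print(f"transfer then GPU: {gpu_t * 1e3:.2f} ms")
```

Under these assumptions the PCIe transfer dominates the GPU path, so the CPU wins whenever its memory bandwidth exceeds the effective PCIe bandwidth.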



LLM Profiler and Policy Solver (Offline)

step1: Activation patterns differ from one LLM to another, so a profiling pass is needed; this step samples requests from datasets.

step2: Use ILP to solve the assignment of cold and hot neurons (maximize the GPU’s impact metric for neurons).
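
The placement problem in step2 can be illustrated with a much simpler stand-in. The sketch below is a greedy heuristic, not the ILP formulation used by PowerInfer: it ranks neurons by impact per byte and fills a GPU memory budget. The function name, the impact values, and the budget are hypothetical.

```python
def greedy_placement(impact, size_bytes, gpu_budget_bytes):
    """Pick neurons for the GPU by descending impact per byte until the
    memory budget is exhausted. Returns the set of neuron indices on the GPU;
    all remaining neurons stay on the CPU."""
    order = sorted(range(len(impact)),
                   key=lambda i: impact[i] / size_bytes[i],
                   reverse=True)
    on_gpu, used = set(), 0
    for i in order:
        if used + size_bytes[i] <= gpu_budget_bytes:
            on_gpu.add(i)
            used += size_bytes[i]
    return on_gpu

# Hypothetical profile: impact = activation frequency from profiling.
impact = [90, 5, 50, 1]
size_bytes = [4, 4, 4, 4]
gpu_neurons = greedy_placement(impact, size_bytes, gpu_budget_bytes=8)
cpu_neurons = set(range(len(impact))) - gpu_neurons
print("GPU:", sorted(gpu_neurons), "CPU:", sorted(cpu_neurons))
```

A greedy pass ignores constraints that an ILP can encode, such as communication overhead between devices, which is one reason to prefer a solver for the real problem.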

Neuron-aware LLM Inference Engine (Online)

step3: Preload hot neurons into GPU memory and cold neurons into CPU memory, following the placement from step2.

step4: Predict activations and execute on each device; only the activations computed on the CPU are transferred. The engine predicts neuron activation and skips non-activated ones. Activated neurons preloaded in GPU memory are processed there, while the CPU calculates and transfers results for its neurons to the GPU for integration.
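
The flow in step4 can be simulated in plain Python. This is a single-process sketch under assumptions: the "GPU" and "CPU" are just two index sets, the predicted-active set is supplied by hand instead of coming from a trained predictor, and all names are hypothetical.

```python
def neuron_output(x, up_row, down_row):
    """Contribution of one neuron: relu(up_row . x) * down_row."""
    h = max(0.0, sum(w * xi for w, xi in zip(up_row, x)))
    return [h * w for w in down_row]

def partial_sum(x, w_up, w_down, neuron_ids, predicted_active):
    """Sum the contributions of one device's neurons, skipping those
    the predictor marks as inactive."""
    acc = [0.0] * len(w_down[0])
    for i in neuron_ids:
        if i not in predicted_active:
            continue
        for j, v in enumerate(neuron_output(x, w_up[i], w_down[i])):
            acc[j] += v
    return acc

def hybrid_mlp(x, w_up, w_down, gpu_ids, cpu_ids, predicted_active):
    gpu_part = partial_sum(x, w_up, w_down, gpu_ids, predicted_active)
    # Only this small result vector would cross the CPU-GPU link,
    # not the weights of the cold neurons.
    cpu_part = partial_sum(x, w_up, w_down, cpu_ids, predicted_active)
    return [g + c for g, c in zip(gpu_part, cpu_part)]

# Toy example: 4 neurons, neurons 0-1 on the "GPU", 2-3 on the "CPU".
x = [1.0, -1.0]
w_up = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, -3.0]]
w_down = [[1.0, 0.0], [5.0, 5.0], [0.0, 1.0], [1.0, 1.0]]
result = hybrid_mlp(x, w_up, w_down, gpu_ids=[0, 1], cpu_ids=[2, 3],
                    predicted_active={0, 2, 3})
print(result)
```

Here the predicted set matches the neurons that ReLU actually leaves non-zero, so the hybrid result equals the dense one; a real predictor can miss active neurons, which trades a little accuracy for speed.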





