vLLM Source Code Walkthrough (4): LLM Model Weight Loading and KV-Cache Initialization

7 Model Initialization

[Figure: vLLM architecture diagram. Taken from a Bilibili video; the original video could no longer be found.]

Let's first look at how an LLM is plugged into vLLM.

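from transformers import AutoTokenizer
from vllm import LLM

llm = LLM(
    model=model_path,
    dtype='half',
    enable_prefix_caching=False,
    # dtype='float16'
    # Shard the model layers evenly across n GPUs instead of running n full copies of the model
    # tensor_parallel_size=1
    # Cap GPU memory utilization at 70%
    # gpu_memory_utilization=0.7,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)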

This is the entry point for the model. model_path points to the downloaded Hugging Face model files.

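class LLM:
    ...
    def __init__(self, ...) -> None:
        ...
        # Map the external arguments onto EngineArgs attributes without further changes,
        # which makes the arguments easier to manage later on.
        engine_args = EngineArgs(...)
        # Create the LLMEngine instance from the configured engine arguments.
        self.llm_engine = LLMEngine.from_engine_args(
            engine_args, usage_context=UsageContext.LLM_CLASS)
        # Globally unique ID: each prompt (one batch may contain several prompts)
        # is treated as one request and is assigned its own unique ID.
        self.request_counter = Counter()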

As shown, the engine is created through from_engine_args. Let's keep reading.

    @classmethod
    def from_engine_args(
        cls,
        engine_args: EngineArgs,
        usage_context: UsageContext = UsageContext.ENGINE_CONTEXT,
        stat_loggers: Optional[Dict[str, StatLoggerBase]] = None,
    ) -> "LLMEngine":
        """Creates an LLM engine from the engine arguments."""
        # Create the engine configs.
        engine_config = engine_args.create_engine_config()
        executor_class = cls._get_executor_cls(engine_config)
        # Create the LLM engine.
        engine = cls(
            **engine_config.to_dict(),
            executor_class=executor_class,
            log_stats=not engine_args.disable_log_stats,
            usage_context=usage_context,
            stat_loggers=stat_loggers,
        )
        return engine

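The input arguments of from_engine_args are as follows:
[Figure: screenshot of the arguments passed to from_engine_args]
cls(...) resolves to the following code: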

class LLMEngine:
    ...
    def __init__(self, ...) -> None:
        ...
        if not self.model_config.skip_tokenizer_init:
            self.tokenizer = self._init_tokenizer()
            self.detokenizer = Detokenizer(self.tokenizer)
        else:
            self.tokenizer = None
            self.detokenizer = None
        ...
        self.model_executor = executor_class(
            model_config=model_config,
            cache_config=cache_config,
            parallel_config=parallel_config,
            scheduler_config=scheduler_config,
            device_config=device_config,
            lora_config=lora_config,
            multimodal_config=multimodal_config,
            speculative_config=speculative_config,
            load_config=load_config,
            prompt_adapter_config=prompt_adapter_config,
        )
        if not self.model_config.embedding_mode:
            self._initialize_kv_caches()
        ...
        # pipeline_parallel_size: number of GPUs running in parallel; the available
        # physical blocks are split evenly across these GPUs.
        # Each GPU also maintains its own scheduler, so self.scheduler is a list of schedulers.
        self.scheduler = [
            Scheduler(scheduler_config, cache_config, lora_config,
                      parallel_config.pipeline_parallel_size)
            for _ in range(parallel_config.pipeline_parallel_size)
        ]
        ...

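When vLLM initializes, it mainly sets up four modules: the tokenizer, model_executor (which converts the transformers/HF model into a vLLM model), self._initialize_kv_caches (KV block initialization), and the scheduler. These are also clearly visible in the architecture diagram at the start of this chapter.

The tokenizer is simple and is skipped here; the scheduler was covered in the second article of this series.

Let's look at what model_executor and _initialize_kv_caches actually do. These two pieces of code are the core of manually adding a new model to vLLM (model_executor) and of tuning vLLM inference performance (_initialize_kv_caches).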

7.1 model_executor

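executor_class inherits from the base class ExecutorBase. Several executors are available (cpu_executor, gpu_executor, tpu_executor, and so on), and the one used is chosen by the current device type or by an explicitly specified executor. We use gpu_executor as the example; the other executors are largely the same.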

class GPUExecutor(ExecutorBase):

    uses_ray: bool = False

    def _init_executor(self) -> None:
        """Initialize the worker and load the model."""
        assert self.parallel_config.world_size == 1, (
            "GPUExecutor only supports single GPU.")
        self.driver_worker = self._create_worker()
        self.driver_worker.init_device()
        self.driver_worker.load_model()
    ...
    def execute_model(
        self, execute_model_req: ExecuteModelRequest
    ) -> Optional[List[Union[SamplerOutput, PoolerOutput]]]:
        output = self.driver_worker.execute_model(execute_model_req)
        return output
    ...

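self.driver_worker is an instance of Worker (vllm/worker/worker.py). Each GPU maintains its own Worker instance, which is responsible for maintaining the KV cache and executing the model on the GPU. In distributed inference, each Worker is assigned a part of the model (different heads are computed in parallel and the results are then combined).

self.driver_worker.load_model() is the method that loads the model, but the code that actually initializes the model is only reached after several layers of delegation: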

vllm/model_executor/model_loader/loader.py class DefaultModelLoader

    def load_model(self, *, model_config: ModelConfig,
                   device_config: DeviceConfig,
                   lora_config: Optional[LoRAConfig],
                   multimodal_config: Optional[MultiModalConfig],
                   parallel_config: ParallelConfig,
                   scheduler_config: SchedulerConfig,
                   cache_config: CacheConfig) -> nn.Module:
        target_device = torch.device(device_config.device)
        with set_default_torch_dtype(model_config.dtype):
            with target_device:
                model = _initialize_model(model_config, self.load_config,
                                          lora_config, multimodal_config,
                                          cache_config, scheduler_config)
            # Load the model.safetensors weight files.
            model.load_weights(
                self._get_weights_iterator(model_config.model,
                                           model_config.revision,
                                           fall_back_to_pt=getattr(
                                               model,
                                               "fall_back_to_pt_during_load",
                                               True)), )
            for _, module in model.named_modules():
                quant_method = getattr(module, "quant_method", None)
                if quant_method is not None:
                    # When quant methods need to process weights after loading
                    # (for repacking, quantizing, etc), they expect parameters
                    # to be on the global target device. This scope is for the
                    # case where cpu offloading is used, where we will move the
                    # parameters onto device for processing and back off after.
                    with device_loading_context(module, target_device):
                        quant_method.process_weights_after_loading(module)
        return model.eval()

Let's walk through the two main functions involved:

vllm/model_executor/model_loader/loader.py

def _initialize_model(
        model_config: ModelConfig,
        load_config: LoadConfig,
        lora_config: Optional[LoRAConfig],
        multimodal_config: Optional[MultiModalConfig],
        cache_config: CacheConfig,
        scheduler_config: Optional[SchedulerConfig] = None) -> nn.Module:
    """Initialize a model with the given configurations."""
    # Use the config shipped with the downloaded HF model: the
    # config['architectures'] field gives the name of the current model.
    model_class = get_model_architecture(model_config)[0]
    # Fetch the quantization parameters; they are not enabled in this version.
    quant_config = _get_quantization_config(model_config, load_config)
    # Load vllm/model_executor/models/llama.py to get the model structure
    # (this is the structure as re-implemented by vLLM).
    return model_class(
        config=model_config.hf_config,
        # cache_config=cache_config,
        # quant_config=quant_config,
        **_get_model_initialization_kwargs(
            model_class, lora_config, multimodal_config, scheduler_config))

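_initialize_model reads the model name from the HF model's config, then uses that name to load vLLM's re-implemented structure for that model.

We use llama as the example to show how the HF weights are loaded: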

vllm/model_executor/models/llama.py

    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
        # Name mapping between the vLLM and HF implementations of the model.
        stacked_params_mapping = [
            # (param_name, shard_name, shard_id)
            # i.e. (vLLM name, HF name, shard_id)
            (".qkv_proj", ".q_proj", "q"),
            (".qkv_proj", ".k_proj", "k"),
            (".qkv_proj", ".v_proj", "v"),
            (".gate_up_proj", ".gate_proj", 0),
            (".gate_up_proj", ".up_proj", 1),
        ]
        # Parameters and weights of vLLM's llama model
        # (at this point the weights should still be randomly initialized).
        params_dict = dict(self.named_parameters())
        # Iterate over the name and weight of every parameter in the HF model.
        for name, loaded_weight in weights:
            ...
            # (vLLM name, HF name, shard_id)
            for (param_name, weight_name, shard_id) in stacked_params_mapping:
                if weight_name not in name:
                    continue
                # Replace the HF layer name with the vLLM layer name.
                name = name.replace(weight_name, param_name)
                # Skip loading extra bias for GPTQ models.
                if name.endswith(".bias") and name not in params_dict:
                    continue
                if is_pp_missing_parameter(name, self):
                    continue
                # Fetch the corresponding parameter of vLLM's llama model.
                param = params_dict[name]
                weight_loader = param.weight_loader
                # Copy the HF weight into the matching vLLM parameter,
                # which completes the mapping for this weight.
                weight_loader(param, loaded_weight, shard_id)
                break
            else:
                ...

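The load_weights method of vLLM's llama shown above (from what I have seen, load_weights is almost identical for all decoder-only models) maps the differing parameter names between the vLLM model and the HF model, then assigns the HF weights to the vLLM model, matched by parameter name. That completes the model conversion.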
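To make the name mapping concrete, here is a minimal sketch in plain Python that applies the same substitution rule to a few parameter names. The helper map_hf_name is hypothetical (it is not a vLLM function), and the example names only follow the usual HF llama naming pattern; the mapping table is copied from the load_weights excerpt.

```python
from typing import Optional, Tuple, Union

# Same (vLLM name, HF name, shard_id) triples as in the load_weights excerpt.
STACKED_PARAMS_MAPPING = [
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
    (".gate_up_proj", ".gate_proj", 0),
    (".gate_up_proj", ".up_proj", 1),
]


def map_hf_name(hf_name: str) -> Tuple[str, Optional[Union[str, int]]]:
    """Hypothetical helper: return (vllm_param_name, shard_id) for an HF name."""
    for param_name, weight_name, shard_id in STACKED_PARAMS_MAPPING:
        if weight_name in hf_name:
            # Stacked parameter: rename it and remember which shard it fills.
            return hf_name.replace(weight_name, param_name), shard_id
    # Not a stacked parameter: the name is used unchanged and has no shard id.
    return hf_name, None


for name in (
    "model.layers.0.self_attn.k_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
):
    print(name, "->", map_hf_name(name))
```

The shard_id is what lets one fused vLLM parameter receive several HF tensors: each HF tensor is written into its own slice of the fused weight.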

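Note: a model contains different kinds of layers, so weight_loader (vllm/model_executor/layers/linear.py) also comes in several variants, spread across different classes.

We use the QKV conversion to illustrate what weight_loader does (the source is fairly complex, so only the logic is described here):
In llama3.1, q, k and v are computed separately, roughly like this: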

        self.q = nn.Linear(dim, dim_q, bias=False)
        self.k = nn.Linear(dim, dim_kv, bias=False)
        self.v = nn.Linear(dim, dim_kv, bias=False)

vLLM merges them into a single layer, roughly like this:

        self.qkv = nn.Linear(dim, dim_q + 2 * dim_kv, bias=False)
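The merge works because a bias-free linear layer is a matrix product, so stacking the three weight matrices along the output dimension gives the same q, k and v in one pass. The sketch below checks this with plain Python lists on toy sizes; all numbers are made up, and the real weight_loader additionally handles tensor-parallel sharding, which is not modeled here.

```python
from typing import List

Matrix = List[List[float]]  # row-major, shape [out_features, in_features]


def linear(weight: Matrix, x: List[float]) -> List[float]:
    """y = W @ x for a bias-free linear layer."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# Toy sizes: dim=2, dim_q=2, dim_kv=1 (values are arbitrary).
w_q: Matrix = [[1.0, 0.0], [0.0, 1.0]]
w_k: Matrix = [[2.0, 3.0]]
w_v: Matrix = [[4.0, 5.0]]

# Fusing = stacking rows, i.e. concatenating along the output dimension.
w_qkv: Matrix = w_q + w_k + w_v  # shape [dim_q + 2 * dim_kv, dim]

x = [1.0, 2.0]
fused = linear(w_qkv, x)

# Split the fused output back into q, k, v by their output sizes.
dim_q, dim_kv = len(w_q), len(w_k)
q = fused[:dim_q]
k = fused[dim_q:dim_q + dim_kv]
v = fused[dim_q + dim_kv:]

print(q, k, v)
```

One larger matrix product replaces three smaller ones, which is the usual motivation for fusing projections.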

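From this walkthrough we can see that a new, not yet supported model can also be used in vLLM by manually modifying the model-loading source code (load_model).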

7.2 _initialize_kv_caches

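Its job is to compute the total number of blocks and the number of available blocks.
In transformers, a typical k/v tensor has shape [batch_size, nums_head, len_k, head_dim] (in the inference stage, len_k=1).
In vLLM, kv_cache_shape=[2, num_blocks, block_size, num_kv_heads, head_size].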

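The number of values stored in one block is (the 2 stands for one K and one V, which always come in pairs): 2 * block_size * num_head * head_size * num_layers.
In other words, each token has 2 tensors (K and V), and each block can hold the K and V of block_size tokens. The K and V of one token take 2 * num_head * head_size * num_layers * dtype_size bytes, so one block takes block_size * 2 * num_head * head_size * num_layers * dtype_size bytes in total.
num_layers is the number of layers in the model. Each token has to keep the computed K and V of every layer; only then is its KV cache complete.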

Each K/V value takes dtype_size bytes (if the tensor dtype is float16, dtype_size is 2; if it is float32, dtype_size is 4).
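As a quick check of the formula, the following sketch computes the size of one block in bytes. The function name and the sample values are illustrative assumptions, not vLLM code; for models with grouped-query attention the head count to use is the number of KV heads.

```python
def cache_block_size_bytes(block_size: int, num_heads: int, head_size: int,
                           num_layers: int, dtype_size: int) -> int:
    """Bytes taken by one KV-cache block, following the formula in the text."""
    # K and V (factor 2) for every head of every layer, for a single token.
    values_per_token = 2 * num_heads * head_size * num_layers
    return block_size * values_per_token * dtype_size


# Illustrative values: 16 tokens per block, 8 KV heads of size 128,
# 28 layers, float16 (2 bytes per value).
size = cache_block_size_bytes(block_size=16, num_heads=8, head_size=128,
                              num_layers=28, dtype_size=2)
print(size, "bytes =", size / 2**20, "MiB")
```

With these values one block takes 1,835,008 bytes, that is 1.75 MiB.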

The code that computes the size of one block is in:

vllm/worker/cache_engine.py

_initialize_kv_caches itself is defined in:

vllm/engine/llm_engine.py

    def _initialize_kv_caches(self) -> None:
        """Initialize the KV cache in the worker(s).

        The workers will determine the number of blocks in both the GPU cache
        and the swap CPU cache.
        """
        num_gpu_blocks, num_cpu_blocks = (
            self.model_executor.determine_num_available_blocks())
        ...
        self.cache_config.num_gpu_blocks = num_gpu_blocks
        self.cache_config.num_cpu_blocks = num_cpu_blocks
        self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)

The purpose of _initialize_kv_caches is to compute the number of GPU and CPU blocks and then initialize those blocks.

The number of blocks is computed by self.model_executor.determine_num_available_blocks():

vllm/worker/worker.py

    def determine_num_available_blocks(self) -> Tuple[int, int]:
        # Profile the memory usage of the model and get the maximum number of
        # cache blocks that can be allocated with the remaining free memory.
        torch.cuda.empty_cache()
        # Execute a forward pass with dummy inputs to profile the memory usage
        # of the model.
        # Build dummy inputs from the maximum allowed number of sequences and
        # tokens, then run one forward pass without the KV cache.
        self.model_runner.profile_run()
        # Calculate the number of blocks that can be allocated with the
        # profiled peak memory.
        torch.cuda.synchronize()
        # Record the free and total GPU memory at this point; the memory used
        # by the model run has not been released yet.
        free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
        # peak_memory is the GPU memory currently occupied by the model.
        peak_memory = self.init_gpu_memory - free_gpu_memory
        ...
        # Size in bytes of one block.
        cache_block_size = self.get_cache_block_size_bytes()
        # Total number of GPU blocks that can be allocated.
        num_gpu_blocks = int(
            (total_gpu_memory * self.cache_config.gpu_memory_utilization -
             peak_memory) // cache_block_size)
        # Number of CPU blocks. No profiling is needed for the CPU because
        # the swap space has a fixed size.
        num_cpu_blocks = int(self.cache_config.swap_space_bytes //
                             cache_block_size)
        num_gpu_blocks = max(num_gpu_blocks, 0)
        num_cpu_blocks = max(num_cpu_blocks, 0)
        if self.model_runner.lora_manager:
            self.model_runner.remove_all_loras()
        gc.collect()
        torch.cuda.empty_cache()
        return num_gpu_blocks, num_cpu_blocks

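**self.model_runner.profile_run()** builds dummy data, runs one forward pass without the KV cache, and records the GPU memory usage at that point.
The profile_run flow is as follows (there is too much code to paste here; it is not difficult, so read the source if you want the details):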

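Building the dummy data
When the LLMEngine is initialized, two important parameters are provided (in the current version both are managed by the budget):
max_num_seqs: the maximum number of sequences that can be processed in one inference step
max_num_batched_tokens: the maximum number of tokens that can be processed in one inference step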

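Both values are specified by the caller; if they are not, the system assigns defaults. How is the data built from these two values?
Assume that during inference each sequence handles max_num_batched_tokens // max_num_seqs tokens on average, with the remainder placed in the first sequence by default.
For example, with max_num_batched_tokens=10 and max_num_seqs=3, three sequences are built with lengths 4, 3 and 3.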
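The length calculation can be sketched as below. The function profile_seq_lens is a hypothetical helper, and it assumes that leftover tokens are handed out one at a time starting from the first sequence, which reproduces the 4, 3, 3 example; check the profile_run source for the exact rule in your vLLM version.

```python
from typing import List


def profile_seq_lens(max_num_batched_tokens: int, max_num_seqs: int) -> List[int]:
    """Hypothetical helper: lengths of the dummy sequences used for profiling."""
    base, remainder = divmod(max_num_batched_tokens, max_num_seqs)
    # Assumption: the first `remainder` sequences each get one extra token.
    return [base + (1 if i < remainder else 0) for i in range(max_num_seqs)]


lens = profile_seq_lens(max_num_batched_tokens=10, max_num_seqs=3)
print(lens)       # [4, 3, 3]
print(sum(lens))  # 10, the full token budget is used
```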

Running one inference pass with this dummy data shows how much GPU memory the model uses.
(free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info())

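Computing how much GPU memory goes to the KV cache:

memory for the KV cache = total GPU memory * gpu_memory_utilization - memory used by one inference pass without the KV cache (the model itself plus the intermediate data produced during inference)

The code above is annotated in detail.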
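The arithmetic in determine_num_available_blocks can be reproduced in isolation. The sketch below mirrors the two formulas from that method; the function names are illustrative and every input number (24 GiB card, 12 GiB peak usage, 1.75 MiB blocks, 4 GiB swap) is a made-up example, not a measurement.

```python
GIB = 2**30


def count_gpu_blocks(total_gpu_memory: int, gpu_memory_utilization: float,
                     peak_memory: int, cache_block_size: int) -> int:
    """GPU blocks that fit in the budget left after the profiling run."""
    budget = total_gpu_memory * gpu_memory_utilization - peak_memory
    # Clamp at zero: the profiled peak may already exceed the allowed budget.
    return max(int(budget // cache_block_size), 0)


def count_cpu_blocks(swap_space_bytes: int, cache_block_size: int) -> int:
    """CPU (swap) blocks: the swap space is fixed, so no profiling is needed."""
    return max(int(swap_space_bytes // cache_block_size), 0)


block_bytes = 1_835_008  # 1.75 MiB, an example block size
gpu_blocks = count_gpu_blocks(24 * GIB, 0.7, 12 * GIB, block_bytes)
cpu_blocks = count_cpu_blocks(4 * GIB, block_bytes)
print(gpu_blocks, cpu_blocks)
```

Lowering gpu_memory_utilization or raising the profiled peak shrinks the budget, and with it the number of sequences whose KV cache fits on the GPU at once.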

Allocating the KV cache
Once the number of available blocks is known, initialize_cache can set up the KV cache used during vLLM inference.

vllm/worker/worker.py

    def _init_cache_engine(self):
        assert self.cache_config.num_gpu_blocks is not None
        self.cache_engine = [
            CacheEngine(self.cache_config, self.model_config,
                        self.parallel_config, self.device_config)
            for _ in range(self.parallel_config.pipeline_parallel_size)
        ]
        self.gpu_cache = [
            self.cache_engine[ve].gpu_cache
            for ve in range(self.parallel_config.pipeline_parallel_size)
        ]

The KV cache is ultimately initialized in the __init__() function of CacheEngine. With this much nesting, vLLM's architecture keeps getting more complex.

vllm/worker/cache_engine.py

    def _allocate_kv_cache(
        self,
        num_blocks: int,
        device: str,
    ) -> List[torch.Tensor]:
        """Allocates KV cache on the specified device."""
        # shape=[2, num_blocks, block_size, num_kv_heads, head_size]
        kv_cache_shape = self.attn_backend.get_kv_cache_shape(
            num_blocks, self.block_size, self.num_kv_heads, self.head_size)
        pin_memory = is_pin_memory_available() if device == "cpu" else False
        kv_cache: List[torch.Tensor] = []
        # Iterate over every layer: the complete KV cache of one token
        # consists of the per-layer K/V of all layers.
        for _ in range(self.num_attention_layers):
            # null block in CpuGpuBlockAllocator requires at least that
            # block to be zeroed-out.
            # We zero-out everything for simplicity.
            kv_cache.append(
                torch.zeros(kv_cache_shape,
                            dtype=self.dtype,
                            pin_memory=pin_memory,
                            device=device))
        return kv_cache

The resulting KV cache (GPU: RTX4090, model: Meta-Llama-3.1-8B-Instruct) looks like this:
[Figure: debugger screenshot of the allocated kv-cache, a list of tensors]

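The dimensions of kv-cache.shape mean the following:

list of 28: the model has 28 layers, and each layer has to keep the K and V it computed

Shape of each element in the list:

2: stores the computed K and V separately
2760: the GPU has 2760 blocks
16: each block has 16 slots, so it can hold the K or V of 16 tokens
8: number of KV heads (num_kv_heads) of the model
128: head_size of each head

This kv-cache is the container that stores the K/V values during inference. It is allocated once up front and filled with zeros. This is why, in practice, vLLM's GPU memory usage jumps to a large value right away: many KV-cache blocks have been preallocated.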
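To see how large this preallocation is, the sketch below multiplies out the shape listed above. It assumes float16 storage (2 bytes per value), matching the dtype='half' setting used when the LLM was created; the other numbers are the ones quoted in the text.

```python
from math import prod

num_layers = 28
kv_cache_shape = (2, 2760, 16, 8, 128)  # [2, num_blocks, block_size, num_kv_heads, head_size]
dtype_size = 2  # assumption: float16, as set by dtype='half'

bytes_per_layer = prod(kv_cache_shape) * dtype_size
total_bytes = bytes_per_layer * num_layers
# Each block holds block_size tokens, so this is the token capacity of the cache.
token_capacity = kv_cache_shape[1] * kv_cache_shape[2]

print(f"{total_bytes / 2**30:.2f} GiB reserved for up to {token_capacity} tokens")
```

Under these assumptions the cache reserves about 4.72 GiB and can hold the K/V of up to 44,160 tokens across all running sequences.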
