KylinChen | Blog

UNIFIED LANGUAGE-VISION PRETRAINING

DYNAMIC DISCRETE VISUAL TOKENIZATION

[TOC] Abs general image tokenizer Intro Three approaches to MLLMs: adapter architecture, Emu, ours (unified tokenizer) challenges: Adapter-style: is centered on predicting textual descriptions de...

Posted by Kylin on December 1, 2023

FLAT Attention

An Optimized Dataflow for Mitigating Attention Bottlenecks

[TOC] Abs The problem with attention: large memory requirements and computational complexity => This limitation is due to inherently limited data reuse opportunities and quadratic growth in memory footprints...

Posted by Kylin on November 9, 2023

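Coding on State-Machine DP

Example: LeetCode stock buy-and-sell problems

[TOC] Unlimited number of transactions, e.g. LeetCode 122: You are given an integer array prices, where prices[i] is the price of a given stock on day i. On each day, you may decide to buy and/or sell the stock. You can hold at most one share of the stock at any time. You may also buy it and then sell it on the same day. Return the maximum profit you can achieve. Solution: 1) First draw the state machine: 2) Give the transition equations for the two states...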
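The excerpt above is cut off before the transition equations, so the following is only a minimal sketch of the usual two-state formulation, not necessarily the one used in the post. It assumes the two states are "not holding a share" and "holding a share"; the names `max_profit`, `cash`, and `hold` are illustrative.

```python
from typing import List


def max_profit(prices: List[int]) -> int:
    """Maximum profit with unlimited transactions, at most one share held."""
    if not prices:
        return 0

    cash = 0            # best profit so far while NOT holding a share
    hold = -prices[0]   # best profit so far while holding a share

    for price in prices[1:]:
        # Not holding today: either stay out, or sell the share held yesterday.
        # Holding today: either keep the share, or buy using yesterday's cash.
        cash, hold = max(cash, hold + price), max(hold, cash - price)

    # Ending without a share is never worse than ending with one.
    return cash


if __name__ == "__main__":
    print(max_profit([7, 1, 5, 3, 6, 4]))
```

Only the previous day's two values are needed, so the DP table collapses to two variables and the whole pass runs in O(n) time with O(1) extra space.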

Posted by Kylin on October 31, 2023

EFFICIENT STREAMING LANGUAGE MODELS WITH ATTENTION SINKS

Lost in the middle in LLM serving

[TOC] https://arxiv.org/pdf/2309.17453.pdf Abs Two gaps: during the decoding stage, caching previous tokens’ Key and Value states (KV) consumes extensive memory; popular LLMs cannot genera...

Posted by Kylin on October 29, 2023

KV Cache Optimization

KV Cache Reading Sheet

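[TOC] Why KV cache? refer to https://kylinchen.cn/2023/08/21/KVCache/ Note that the KV cache is itself an optimization: it caches Keys and Values (space) to avoid recomputation (time). In practice, however, the KV cache turns out to be far too large, especially in serving systems, where it can easily reach 4x the memory occupied by the model parameters. So optimizations generally fall into...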

Posted by Kylin on October 29, 2023

FlashAttention

[TOC] Abs Goal: reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM Intro Why does IO-awareness matter in transformers? Profiling the PyTorch implementation of attention shows that using GP...

Posted by Kylin on October 29, 2023

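Coding on Prefix-Suffix Decomposition

Prefix-Suffix Decomposition and Problem List

[TOC] Prefix-suffix decomposition, LeetCode 8026: You are given a 0-indexed 2D integer matrix grid of size n * m. Define a 0-indexed 2D matrix p of size n * m. p is the product matrix of grid if the following condition holds: each element p[i][j] equals the product of all elements of grid except grid[i][j], taken modulo 12345. Return ...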
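The problem statement above can be sketched with one backward and one forward pass over the grid in row-major order. This is a minimal illustration of the prefix-suffix idea rather than the post's own solution; the function name `construct_product_matrix` and the constant name `MOD` are assumptions made for the example.

```python
from typing import List

MOD = 12345  # modulus given in the problem statement


def construct_product_matrix(grid: List[List[int]]) -> List[List[int]]:
    """p[i][j] = product of every grid element except grid[i][j], mod MOD."""
    n, m = len(grid), len(grid[0])
    p = [[0] * m for _ in range(n)]

    # Backward pass: store the product of all elements AFTER (i, j).
    suffix = 1
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            p[i][j] = suffix
            suffix = suffix * grid[i][j] % MOD

    # Forward pass: multiply in the product of all elements BEFORE (i, j).
    prefix = 1
    for i in range(n):
        for j in range(m):
            p[i][j] = p[i][j] * prefix % MOD
            prefix = prefix * grid[i][j] % MOD

    return p


if __name__ == "__main__":
    print(construct_product_matrix([[1, 2], [3, 4]]))
```

Dividing the total product by grid[i][j] is not an option here: 12345 = 3 * 5 * 823 is composite, so modular inverses do not exist for every element. Multiplying a prefix product by a suffix product avoids division entirely.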

Posted by Kylin on October 15, 2023

FastServe - A distributed Serving System

Fast Distributed Inference Serving for Large Language Models

[TOC] Abs Problem: Existing LLM serving systems use run-to-completion (even Orca serves requests FCFS) processing for inference jobs, which suffers from head-of-line blocking and long JCT. JCT is the time from when a job (or task, request, etc.) is submitted to job comp...

Posted by Kylin on October 13, 2023

SARATHI Piggybacking Decodes

Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

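[TOC] Abs challenge: the prefill and decode phases have different compute density. Specifically, prefill saturates the GPU even at small batch sizes, whereas decode leaves the GPU underutilized at those batch sizes. Problem this causes: bubbles arise in pipeline parallelism and micro-batches become imbalanced. Proposed: chunked-pref...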

Posted by Kylin on October 8, 2023

LLM Inference Optimization

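A Survey of LLM Inference Optimization Techniques

[TOC] ref: https://zhuanlan.zhihu.com/p/642412124 Subgraph Fusion: Graph fusion merges multiple OPs (operators) into a single OP to reduce the number of kernel launches. Each basic OP corresponds to one GPU kernel launch and several GPU memory reads and writes, all of which add substantial overhead. 1.1 FasterTransformer b...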

Posted by Kylin on October 5, 2023
