Kylin Page

A fool who dreams.

Coding on the Contribution Technique

The Contribution Technique and a Problem List

[TOC] The contribution technique, LeetCode 2681. Since the order of the elements does not affect the answer, sort first. Suppose there are five numbers $a, b, c, d, e$ in increasing order. Treat $d$ as the maximum: if $d$ is chosen alone, the power is $d^3$. If $a$ is the minimum, then since each of $b$ and $c$ in between may be chosen or not, there are $2^2$ schemes, so the total power is $d^2 \cdot a \c...
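A minimal Python sketch of this contribution-style count (the function name and test driver are mine; the recurrence $s \leftarrow 2s + x$ accumulates the weighted sum of candidate minimums for the current maximum):

```python
def sum_of_power(nums):
    # Hypothetical helper implementing the contribution idea above.
    MOD = 10**9 + 7
    nums.sort()  # order doesn't affect the answer, so sort first
    ans, s = 0, 0  # s = weighted sum of candidate minimums for the current max
    for x in nums:
        ans = (ans + x * x % MOD * ((x + s) % MOD)) % MOD  # x alone gives x^3
        s = (s * 2 + x) % MOD  # each earlier minimum now heads twice as many subsets
    return ans

print(sum_of_power([2, 1, 4]))  # 141, matching the LeetCode 2681 example
```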

DeepUM (DNN Models on Unified Memory) on ASPLOS 23

Tensor Migration and Prefetching in Unified Memory

[TOC] ASPLOS 23. Abstract: The proposed DeepUM allows memory oversubscription using a page-fault mechanism. DeepUM uses a new correlation prefetching technique to hide the page migration overhead. ou...
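DeepUM's real correlation tables live in the driver and operate on GPU page-fault streams; the toy Python sketch below (all names mine) only illustrates the general idea of correlation prefetching, i.e. prefetching pages that historically followed the faulting one:

```python
from collections import defaultdict, deque

class CorrelationPrefetcher:
    """Toy sketch, not DeepUM's actual design: remember which pages
    tended to follow each faulting page, and prefetch those successors
    the next time that page faults."""

    def __init__(self, depth=2):
        self.successors = defaultdict(lambda: deque(maxlen=depth))
        self.last_fault = None

    def on_fault(self, page):
        # Record that `page` followed the previous fault.
        if self.last_fault is not None:
            self.successors[self.last_fault].appendleft(page)
        self.last_fault = page
        # Prefetch pages that historically followed this one.
        return list(self.successors[page])

prefetcher = CorrelationPrefetcher()
for p in [1, 2, 3, 1, 2, 3, 1]:
    print(p, "->", prefetcher.on_fault(p))  # repeats become prefetch hits
```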

Speculative Decoding

Fast Inference from Transformers via Speculative Decoding

[TOC] ICML 23. An algorithm to sample from autoregressive models faster, without any changes to the outputs, by computing several tokens in parallel. Two insights: 1) hard language-modeling t...
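A self-contained Python sketch of one speculative-sampling step over a toy vocabulary (the function names and toy distributions are mine, but the accept/reject rule follows the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(probs):
    return rng.choice(len(probs), p=probs)

def speculative_step(target_probs, draft_probs, gamma=4):
    # 1) The cheap draft model proposes gamma tokens autoregressively.
    qs, guesses = [], []
    for _ in range(gamma):
        q = draft_probs(guesses)
        g = sample(q)
        qs.append(q)
        guesses.append(g)
    # 2) The target model scores all gamma+1 prefixes (one parallel pass
    #    in a real system; a plain loop here for clarity).
    ps = [target_probs(guesses[:i]) for i in range(gamma + 1)]
    # 3) Accept guess i with prob min(1, p/q); on rejection, resample from
    #    the residual distribution max(p - q, 0) and stop.
    out = []
    for i, g in enumerate(guesses):
        if rng.random() < min(1.0, ps[i][g] / qs[i][g]):
            out.append(g)
        else:
            residual = np.maximum(ps[i] - qs[i], 0)
            out.append(sample(residual / residual.sum()))
            return out
    out.append(sample(ps[gamma]))  # all accepted: one free extra token
    return out

# Toy demo: context-independent distributions over a 4-token vocabulary.
p = np.array([0.1, 0.2, 0.3, 0.4])      # "target" model
q = np.array([0.25, 0.25, 0.25, 0.25])  # "draft" model
print(speculative_step(lambda ctx: p, lambda ctx: q))
```

The accept/reject rule guarantees the output tokens are distributed exactly as if sampled from the target model alone, which is why the outputs are unchanged.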

PagedAttention: the paper behind vLLM

Efficient Memory Management for Large Language Model Serving with PagedAttention

[TOC] Challenge: Existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory...
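A toy Python sketch of the paged KV-cache idea (illustrative, not vLLM's implementation): blocks are allocated on demand and a per-sequence block table maps logical to physical blocks, so memory grows and shrinks with the sequence:

```python
class PagedKVCache:
    """Toy sketch of paged KV-cache management: KV memory is split into
    fixed-size blocks; each sequence holds a block table mapping logical
    blocks to physical ones, like virtual memory pages."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # current block full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free blocks: preempt or swap a sequence")
            table.append(self.free_blocks.pop())  # allocate on demand
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        # Sequence finished: return its blocks to the free pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("req0")  # 3 tokens occupy 2 blocks
cache.free("req0")              # all blocks returned to the pool
```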

Two papers about Balanced Pipeline

Memory-Balanced Pipeline Parallelism for Training Large Language Models

[TOC] ICML 23. A similar paper: BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training. BPipe targets only activation transfer: BPipe employs an activation balancing method to transfer intermedia...
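A back-of-the-envelope Python sketch (names mine) of why activation memory is imbalanced under a 1F1B pipeline schedule, which is the imbalance BPipe's activation balancing targets:

```python
def activations_in_flight(num_stages):
    """Under a 1F1B schedule, stage i must buffer up to (num_stages - i)
    micro-batches of activations before its first backward pass, so early
    stages hold far more activation memory than late ones."""
    return [num_stages - i for i in range(num_stages)]

print(activations_in_flight(8))  # [8, 7, 6, 5, 4, 3, 2, 1]: stage 0 buffers 8x stage 7
```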

Language Modeling

The Two Approaches to Language Modeling

[TOC] An excellent tutorial: https://huggingface.co/docs/transformers/tasks/language_modeling Causal Language Modeling. Causal language modeling predicts the next token in a sequence of tokens, and the mode...
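A minimal causal-LM example in the spirit of the linked tutorial; the distilgpt2 checkpoint and the prompt are just illustrative choices:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small causal LM; any decoder-only checkpoint works here.
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# The model autoregressively predicts the next token given the prefix.
inputs = tokenizer("Somatic hypermutation allows the immune system to", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```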

EnergonAI as a prototype of Alpa

An Inference System for 10-100 Billion Parameter Transformer Models

[TOC] Background: Deploying 10-100 billion parameter models remains difficult due to latency, throughput, and memory constraints. Contribution: EnergonAI adopts a hierarchy-controller sy...

C4 Quant with ML

Machine Learning Strategies in Quant

[TOC] WQ Pipeline: if the Sharpe ratio is poor, drop the idea immediately. Hedge funds have high turnover; domestic private funds are relatively lower. Decay-0 strategies cannot be run on A-shares. Outline. Scope of application: the factor side and the model side. The factor side covers factor mining and the analysis and mining of alternative factors; the model side covers improvements to model algorithms, factor synthesis, and so on. Machine learning strategies in quant put a premium on logic, because most problems are few-shot with small sample sizes. ...
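As a small illustration of the "drop the idea if the Sharpe is poor" filter, a hedged Python sketch (synthetic returns; all names mine) of the annualized Sharpe ratio:

```python
import numpy as np

def annualized_sharpe(daily_returns, risk_free_daily=0.0):
    """Annualized Sharpe ratio from daily returns, assuming 252 trading
    days; a quick go/no-go filter for a candidate alpha idea."""
    excess = np.asarray(daily_returns) - risk_free_daily
    return np.sqrt(252) * excess.mean() / excess.std(ddof=1)

rng = np.random.default_rng(42)
fake_returns = rng.normal(5e-4, 1e-2, size=252)  # synthetic data, illustration only
print(round(annualized_sharpe(fake_returns), 2))
```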

vLLM for distributed serving

Easy, Fast, and Cheap LLM Serving with PagedAttention

[TOC] vLLM documentation: https://vllm.readthedocs.io/en/latest/serving/distributed_serving.html vLLM utilizes PagedAttention, our new attention algorithm that effectively manages attention keys an...
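A usage sketch based on the linked docs: tensor_parallel_size shards the model across GPUs; the model name and parallel degree here are placeholders for your own setup:

```python
from vllm import LLM, SamplingParams

# Tensor-parallel serving across 4 GPUs (adjust to your hardware).
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=4)
sampling = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Hello, my name is"], sampling)
for out in outputs:
    print(out.outputs[0].text)
```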

Orca: the origin of continuous batching

A Distributed Serving System for Transformer-Based Generative Models

[TOC] OSDI 22. This is where the idea of continuous batching originated. The experiments include a latency-versus-throughput plot.
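A toy Python sketch (all names mine) of Orca-style iteration-level scheduling, where the batch is re-formed at every model iteration so finished sequences leave immediately and queued ones join without waiting for the whole batch to drain:

```python
from collections import deque

def continuous_batching(requests, max_batch=8):
    """Toy scheduler: each loop iteration is one model forward step;
    requests are (req_id, tokens_left) and generate one token per step."""
    waiting = deque(requests)
    running = []
    step = 0
    while waiting or running:
        # Admit new requests at every iteration, not per batch.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One model iteration: every running request emits one token.
        running = [(rid, left - 1) for rid, left in running]
        done = [rid for rid, left in running if left == 0]
        running = [(rid, left) for rid, left in running if left > 0]
        step += 1
        if done:
            print(f"iter {step}: finished {done}")

continuous_batching([("a", 3), ("b", 1), ("c", 5), ("d", 2)], max_batch=2)
```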