Kylin Page

A fool who dreams.

Coding on the Contribution Technique

The Contribution Technique and a Problem List

[TOC] The contribution technique, LeetCode 2681. Since the order of the elements does not affect the answer, sort first. Suppose there are five numbers $a, b, c, d, e$ in increasing order. Treat $d$ as the maximum: if $d$ is chosen alone, the power is $d^3$. If $a$ is the minimum, then since each of $b$ and $c$ in between may be chosen or not, there are $2^2$ schemes, so the total power is $d^2 \cdot a \c...
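A minimal Python sketch of this contribution-style count (the function name and test driver are mine; the recurrence $s \leftarrow 2s + x$ accumulates the weighted sum of candidate minimums for the current maximum):

```python
def sum_of_power(nums):
    # Hypothetical helper implementing the contribution idea above.
    MOD = 10**9 + 7
    nums.sort()  # order doesn't affect the answer, so sort first
    ans, s = 0, 0  # s = weighted sum of candidate minimums for the current max
    for x in nums:
        ans = (ans + x * x % MOD * ((x + s) % MOD)) % MOD  # x alone gives x^3
        s = (s * 2 + x) % MOD  # each earlier minimum now heads twice as many subsets
    return ans

print(sum_of_power([2, 1, 4]))  # 141, matching the LeetCode 2681 example
```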

DeepUM (DNN Models on Unified Memory) on ASPLOS 23

Tensor Migration and Prefetching in Unified Memory

[TOC] ASPLOS 23. Abstract: The proposed DeepUM allows memory oversubscription using a page-fault mechanism. DeepUM uses a new correlation prefetching technique to hide the page migration overhead. ou...
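DeepUM's real correlation tables live in the driver and operate on GPU page-fault streams; the toy Python sketch below (all names mine) only illustrates the general idea of correlation prefetching, i.e. prefetching pages that historically followed the faulting one:

```python
from collections import defaultdict, deque

class CorrelationPrefetcher:
    """Toy sketch, not DeepUM's actual design: remember which pages
    tended to follow each faulting page, and prefetch those successors
    the next time that page faults."""

    def __init__(self, depth=2):
        self.successors = defaultdict(lambda: deque(maxlen=depth))
        self.last_fault = None

    def on_fault(self, page):
        # Record that `page` followed the previous fault.
        if self.last_fault is not None:
            self.successors[self.last_fault].appendleft(page)
        self.last_fault = page
        # Prefetch pages that historically followed this one.
        return list(self.successors[page])

prefetcher = CorrelationPrefetcher()
for p in [1, 2, 3, 1, 2, 3, 1]:
    print(p, "->", prefetcher.on_fault(p))  # repeats become prefetch hits
```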

Speculative Decoding

Fast Inference from Transformers via Speculative Decoding

[TOC] ICML 23. An algorithm to sample from autoregressive models faster, without any changes to the outputs, by computing several tokens in parallel. Two insights: 1) hard language-modeling t...
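A self-contained Python sketch of one speculative-sampling step over a toy vocabulary (the function names and toy distributions are mine, but the accept/reject rule follows the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(probs):
    return rng.choice(len(probs), p=probs)

def speculative_step(target_probs, draft_probs, gamma=4):
    # 1) The cheap draft model proposes gamma tokens autoregressively.
    qs, guesses = [], []
    for _ in range(gamma):
        q = draft_probs(guesses)
        g = sample(q)
        qs.append(q)
        guesses.append(g)
    # 2) The target model scores all gamma+1 prefixes (one parallel pass
    #    in a real system; a plain loop here for clarity).
    ps = [target_probs(guesses[:i]) for i in range(gamma + 1)]
    # 3) Accept guess i with prob min(1, p/q); on rejection, resample from
    #    the residual distribution max(p - q, 0) and stop.
    out = []
    for i, g in enumerate(guesses):
        if rng.random() < min(1.0, ps[i][g] / qs[i][g]):
            out.append(g)
        else:
            residual = np.maximum(ps[i] - qs[i], 0)
            out.append(sample(residual / residual.sum()))
            return out
    out.append(sample(ps[gamma]))  # all accepted: one free extra token
    return out

# Toy demo: context-independent distributions over a 4-token vocabulary.
p = np.array([0.1, 0.2, 0.3, 0.4])      # "target" model
q = np.array([0.25, 0.25, 0.25, 0.25])  # "draft" model
print(speculative_step(lambda ctx: p, lambda ctx: q))
```

The accept/reject rule guarantees the output tokens are distributed exactly as if sampled from the target model alone, which is why the outputs are unchanged.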

PagedAttention: the paper behind vLLM

Efficient Memory Management for Large Language Model Serving with PagedAttention

[TOC] Challenge: Existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory...
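A toy Python sketch of the paged KV-cache idea (illustrative, not vLLM's implementation): blocks are allocated on demand and a per-sequence block table maps logical to physical blocks, so memory grows and shrinks with the sequence:

```python
class PagedKVCache:
    """Toy sketch of paged KV-cache management: KV memory is split into
    fixed-size blocks; each sequence holds a block table mapping logical
    blocks to physical ones, like virtual memory pages."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # current block full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free blocks: preempt or swap a sequence")
            table.append(self.free_blocks.pop())  # allocate on demand
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        # Sequence finished: return its blocks to the free pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("req0")  # 3 tokens occupy 2 blocks
cache.free("req0")              # all blocks returned to the pool
```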

Two papers about Balanced Pipeline

Memory-Balanced Pipeline Parallelism for Training Large Language Models

[TOC] ICML 23. A similar paper: BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training. BPipe targets only activation transfer: BPipe employs an activation balancing method to transfer intermedia...
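A back-of-the-envelope Python sketch (names mine) of why activation memory is imbalanced under a 1F1B pipeline schedule, which is the imbalance BPipe's activation balancing targets:

```python
def activations_in_flight(num_stages):
    """Under a 1F1B schedule, stage i must buffer up to (num_stages - i)
    micro-batches of activations before its first backward pass, so early
    stages hold far more activation memory than late ones."""
    return [num_stages - i for i in range(num_stages)]

print(activations_in_flight(8))  # [8, 7, 6, 5, 4, 3, 2, 1]: stage 0 buffers 8x stage 7
```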

Language Modeling

The Two Approaches to Language Modeling

[TOC] An excellent tutorial: https://huggingface.co/docs/transformers/tasks/language_modeling Causal Language Modeling. Causal language modeling predicts the next token in a sequence of tokens, and the mode...
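A minimal causal-LM example in the spirit of the linked tutorial; the distilgpt2 checkpoint and the prompt are just illustrative choices:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small causal LM; any decoder-only checkpoint works here.
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# The model autoregressively predicts the next token given the prefix.
inputs = tokenizer("Somatic hypermutation allows the immune system to", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```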

EnergonAI as a prototype of Alpa

An Inference System for 10-100 Billion Parameter Transformer Models

[TOC] Background: Deploying 10-100 billion parameter models remains difficult due to latency, throughput, and memory constraints. Contribution: EnergonAI adopts a hierarchy-controller sy...

C4 Quant with ML

Machine Learning Strategies in Quant

[TOC] WQ Pipeline: if the Sharpe ratio is poor, drop the idea immediately. Hedge funds have high turnover; domestic private funds are relatively lower. Decay-0 strategies cannot be run on A-shares. Outline. Scope of application: the factor side and the model side. The factor side covers factor mining and the analysis and mining of alternative factors; the model side covers improvements to model algorithms, factor synthesis, and so on. Machine learning strategies in quant put a premium on logic, because most problems are few-shot with small sample sizes. ...
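As a small illustration of the "drop the idea if the Sharpe is poor" filter, a hedged Python sketch (synthetic returns; all names mine) of the annualized Sharpe ratio:

```python
import numpy as np

def annualized_sharpe(daily_returns, risk_free_daily=0.0):
    """Annualized Sharpe ratio from daily returns, assuming 252 trading
    days; a quick go/no-go filter for a candidate alpha idea."""
    excess = np.asarray(daily_returns) - risk_free_daily
    return np.sqrt(252) * excess.mean() / excess.std(ddof=1)

rng = np.random.default_rng(42)
fake_returns = rng.normal(5e-4, 1e-2, size=252)  # synthetic data, illustration only
print(round(annualized_sharpe(fake_returns), 2))
```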

vLLM for distributed serving

Easy, Fast, and Cheap LLM Serving with PagedAttention

[TOC] vLLM documentation: https://vllm.readthedocs.io/en/latest/serving/distributed_serving.html vLLM utilizes PagedAttention, our new attention algorithm that effectively manages attention keys an...
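A usage sketch based on the linked docs: tensor_parallel_size shards the model across GPUs; the model name and parallel degree here are placeholders for your own setup:

```python
from vllm import LLM, SamplingParams

# Tensor-parallel serving across 4 GPUs (adjust to your hardware).
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=4)
sampling = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Hello, my name is"], sampling)
for out in outputs:
    print(out.outputs[0].text)
```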

Orca: the origin of continuous batching

A Distributed Serving System for Transformer-Based Generative Models

[TOC] OSDI 22. This is where the idea of continuous batching originated. The experiments include a latency-versus-throughput plot.
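A toy Python sketch (all names mine) of Orca-style iteration-level scheduling, where the batch is re-formed at every model iteration so finished sequences leave immediately and queued ones join without waiting for the whole batch to drain:

```python
from collections import deque

def continuous_batching(requests, max_batch=8):
    """Toy scheduler: each loop iteration is one model forward step;
    requests are (req_id, tokens_left) and generate one token per step."""
    waiting = deque(requests)
    running = []
    step = 0
    while waiting or running:
        # Admit new requests at every iteration, not per batch.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One model iteration: every running request emits one token.
        running = [(rid, left - 1) for rid, left in running]
        done = [rid for rid, left in running if left == 0]
        running = [(rid, left) for rid, left in running if left > 0]
        step += 1
        if done:
            print(f"iter {step}: finished {done}")

continuous_batching([("a", 3), ("b", 1), ("c", 5), ("d", 2)], max_batch=2)
```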