Tags

A fool who dreams.
paper

LLM Inference Optimization Review 202408

LLM Inference Paper Review 202408


Displaced Patch Pipeline Parallelism

Model Inference Optimization in the DiT Era


KnowLA

Enhancing Parameter-Efficient Fine-Tuning through Knowledge Adaptation


Survey on Graph and RAG

A Survey of GraphRAG


DeepSpeed from 0 to 1

DeepSpeed Cookbook


Label Words Are Anchors? New Ideas for Compression & ICL

An Information Flow Perspective for Understanding In Context Learning


OPERA (CVPR 24): Mitigating Hallucination in Multimodal Large Language Models with Over-Trust Penalty and Retrospection Allocation

Alleviating Hallucination in Multi Modal Large Language Models via Over Trust Penalty and Retrospection Allocation


Optimizers from 0 to 1

Optimizer Cookbook


February 2024 Survey on Hallucination in Multimodal Large Models

A Survey on Hallucination in Large Vision Language Models


High-Flyer (幻方) Class of 2025 Algorithm Interview Notes

Interview with DeepSeek


DeepSeekVL Paper Reading

Introduction to DeepSeekVL


DeepSeekLLM Paper Reading

Introduction to DeepSeekLLM


Memory-Augmented Video Understanding: MALLM Paper Reading

Memory Augmented Large Multimodal Model for LongTerm Video Understanding


Claude 3 Technical Report

The Claude 3 Model Family


Infini Attention: Detailed Explanation and Mathematical Derivation

Efficient Infinite Context Transformers with Infini Attention, Explained


Mini Gemini

Mining the Potential of Multimodality Vision Language Models


On Pretrain and the Art of Motorcycle Maintenance

Pretrain and How to Love


A Survey of Research on Diffusion Model Inference Optimization

MLSys for Diffusion Models


InternLM-XComposer2: Detailed Explanation and Code Review

Mastering Free form TextImage Composition and Comprehension in Vision Language Large Models


LLM Inference Optimization 2403 Review

Advances in LLM Optimization Techniques


SkipDecode

Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference


Survey on Decoding Algorithm

Optimizing Mainstream Decoding Algorithms


Emu2 Training Details

Generative Multimodal Models are In-Context Learners


H2O filtering KV cache

Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models


S3 Scheduling with predictable decoding

Increasing GPU Utilization during Generative Inference for Higher Throughput


Deja Vu

Contextual Sparsity for Efficient LLMs at Inference Time


PowerInfer

Fast Large Language Model Serving with a Consumer-grade GPU


Towards Efficient Generative Large Language Model Serving A Survey from Algorithms to Systems

A Survey of Efficient LLM Inference Algorithms & Systems


Maximum Height of a Red-Black Tree

A More Accurate Estimate of the Maximum Height of a Red-Black Tree


CoDi Any to Any Generation

Review of the CoDi Paper Series


Introduction to the A* Algorithm

The A* Algorithm Explained


Real Bottleneck of Transformer

Where Is the Real Optimization Bottleneck of the Transformer?


Gemini Technical Report

An Analysis of the Gemini Technical Report


FlashLLM

Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity


How Does Text Generate Images?

Tracing the Research Lineage of Autoregressive Text-to-Image Generation


A Survey on Unified Multi-Modal Models

A Research Survey of Unified Multimodal Models


UNIFIED LANGUAGE-VISION PRETRAINING

DYNAMIC DISCRETE VISUAL TOKENIZATION


FLAT Attention

An Optimized Dataflow for Mitigating Attention Bottlenecks


EFFICIENT STREAMING LANGUAGE MODELS WITH ATTENTION SINKS

Lost in the middle in LLM serving


FlashAttention

FlashAttention


FastServe - A distributed Serving System

Fast Distributed Inference Serving for Large Language Models


SARATHI Piggybacking Decodes

Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills


LLM Inference Optimization

A Survey of LLM Inference Optimization Techniques


DeepUM (DNN Models on Unified Memory) at ASPLOS 23

Tensor Migration and Prefetching in Unified Memory


Speculative Decoding

Fast Inference from Transformers via Speculative Decoding


PagedAttention the paper of vLLM

Efficient Memory Management for Large Language Model Serving with PagedAttention


Two Papers about Balanced Pipeline

Memory-Balanced Pipeline Parallelism for Training Large Language Models


EnergonAI as a prototype of Alpa

An Inference System for 10-100 Billion Parameter Transformer Models


vLLM for distributed serving

Easy, Fast, and Cheap LLM Serving with PagedAttention


Orca the origin of continuous batching

A Distributed Serving System for Transformer-Based Generative Models


Redio Optimization Towards Disk I/Os

Accelerating Disk-Based Graph Processing by Reducing Disk I/Os


Zero Offload

Democratizing Billion-Scale Model Training


DeepSpeed Inference

Enabling Efficient Inference of Transformer Models at Unprecedented Scale


Model Parallel Swapping of Computron

Serving Distributed Deep Learning Models with Model Parallel Swapping


PETALS Collaborative Inference and FT of LLMs

Collaborative Inference and Fine-tuning of Large Models


Introduction to SmartMOE

Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization


Introduction to FlexGen

High-Throughput Generative Inference of Large Language Models with a Single GPU


Introduction to Mobius

Fine Tuning Large-Scale Models on Commodity GPU Servers


GSPMD for ops partition across multiple devices

General and Scalable Parallelization for ML Computation Graphs


AlpaServe Distributed ML Serving

Statistical Multiplexing with Model Parallelism for Deep Learning Serving


Alpa Distributed ML Compiler

Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning


novel ideas for MLLM research

comprehensive survey for MLLM research


MacawLLM

MULTI-MODAL LANGUAGE MODELING WITH IMAGE, AUDIO, VIDEO


Notes for M3IT

A LargeScale Dataset towards MultiModal Instruction Tuning


Notes for BLIP2

VQA


Speedy Transformer Inference

Turbocharge NLP Inference at the Edge via Elastic Pipelining


Early-Exiting Framework with Parallel Decoding

Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding


coding

Blip2 Code Walkthrough

Detailed Explanation of the Blip2 Training Code


A Panorama of Multimodal Technology Development

An Overview of Multimodal Technology Development


LLaVA-NeXT: Improved Reasoning, OCR, and World Knowledge

LLaVA NeXT Improved reasoning, OCR, and world knowledge


Accelerating LLM Serving with Tree-Based Speculative Decoding and Verification

Accelerating Large Language Model Serving with Tree based Speculative Inference and Verification


Llama 3 Distillation in Practice

knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty


Coding on Permutations & Knapsack Problems

Example LeetCode Permutation & Knapsack Problems


Coding on Shortest Paths

D and F, with Problem List


Coding on All

LeetCode Problem List


Coding on Segment Trees

Segment Trees, with Problem List


Coding on Binary Indexed Trees

Binary Indexed Trees, with Problem List


Coding on Dijkstra

Dijkstra, with Problem List


Coding on Rerooting DP

Rerooting DP, with Problem List


Understanding the Euclidean Algorithm

How gcd Works


Gosper's Hack

How Gosper's Hack Works


Coding on Median Greedy

Median Greedy, with Problem List


Coding on 2D Prefix Sums (Difference Arrays)

2D Prefix Sums (Difference Arrays), with Problem List


Coding on Monotonic Stack

Monotonic Stack, with Problem List


Coding on State Machine DP

Example LeetCode Stock Trading Problems


Coding on Prefix-Suffix Decomposition

Prefix-Suffix Decomposition, with Problem List


Coding on the Contribution Method

The Contribution Method, with Problem List


Coding on Bit Manipulation

Bit Manipulation, with Problem List


Intro To SSD and Evaluation

SSD-Related Research and Evaluation Methods


Coding on Sliding Window

Sliding Window, with Problem List


Coding on Binary Search

Binary Search Techniques, with Problem List


Coding on Binary Lifting on Trees

Binary Lifting on Trees: Techniques, with Problem List


Coding on Grouped Loops

Grouped Loop Techniques, with Problem List


Python3 Cookbook

A fast introduction to Python 3


New Features in C++1X

New Features in C++1X