Kylin Page

A fool who dreams.

Blip2 Code Analysis

A Detailed Walkthrough of the Blip2 Training Code

Looking at the code analyses available online, most concentrate on the inference-side code and overlook the training code of blip2qformer, so let's analyze it here. Inference (or stage-2): the code analyzed online is mostly inference-side. When Blip2 runs inference, only the image-side block is active, so the pipeline is relatively simple: the learnable tokens output after passing through N Blip2QFormerLayer blocks serve as the image features, which then go through an mlp...
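Below is a minimal PyTorch sketch of the inference path this excerpt describes: learnable query tokens cross-attend to frozen image features through a stack of Q-Former-style layers, and the resulting tokens are projected by an MLP into the LLM embedding space. The module name, layer count, and dimensions (`TinyQFormer`, 32 queries, dim 768, llm_dim 4096) are illustrative assumptions, not the actual transformers API.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the real BLIP-2 code): learnable queries cross-attend
# to frozen image patch features, then an MLP projects them for the LLM.
class TinyQFormer(nn.Module):
    def __init__(self, num_queries=32, dim=768, num_layers=2, llm_dim=4096):
        super().__init__()
        self.query_tokens = nn.Parameter(torch.randn(1, num_queries, dim))
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            for _ in range(num_layers)
        ])
        self.proj = nn.Sequential(nn.Linear(dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, image_feats):                      # (B, P, dim) patch features
        q = self.query_tokens.expand(image_feats.size(0), -1, -1)
        for attn in self.layers:                         # queries attend to image patches
            q = q + attn(q, image_feats, image_feats)[0]
        return self.proj(q)                              # (B, num_queries, llm_dim)

image_feats = torch.randn(2, 257, 768)                   # e.g. ViT patch tokens
soft_prompt = TinyQFormer()(image_feats)                 # fed to the LLM as a visual prefix
print(soft_prompt.shape)                                 # torch.Size([2, 32, 4096])
```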

A Survey of Multimodal Technology Development

An Overview of Multimodal Technology Development

CLIP: building an image-text bridge with contrastive learning (2021, OpenAI). Contrastive Language Image Pre-train, detailed analysis 1. A typical dual-tower model with two encoders, one for images and one for text; after the image and the text pass through their respective encoders, a simple dot product represents the interaction (similarity) between the two modalities. During training, assume a batch contains N (image, text) pairs...
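A minimal sketch of the contrastive objective described here, assuming the two towers have already produced N paired embeddings: a dot-product similarity matrix and a symmetric cross-entropy where the i-th image matches the i-th text. The `clip_loss` helper and the temperature value are illustrative, not OpenAI's implementation.

```python
import torch
import torch.nn.functional as F

# Sketch of the symmetric contrastive loss over a batch of N (image, text) pairs.
def clip_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (N, N) similarity matrix
    targets = torch.arange(logits.size(0))               # diagonal = positive pairs
    loss_i = F.cross_entropy(logits, targets)            # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)        # text -> image direction
    return (loss_i + loss_t) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```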

LLaVA-NeXT: Improved Reasoning, OCR, and World Knowledge

LLaVA-NeXT: Improved reasoning, OCR, and world knowledge

Abs: Compared with LLaVA-1.5, the biggest update in LLaVA-NeXT is resolution support, accommodating three resolutions: 672x672, 336x1344, and 1344x336. The 34B model reportedly outperforms Gemini-Pro. Method: Dynamic High-Resolution (it feels quite brute-force). Data Mixture: High-quality User Instru...
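A rough sketch of the dynamic high-resolution idea under the resolutions listed above: pick the supported grid closest to the input aspect ratio, resize, and cut the image into 336x336 tiles that are encoded alongside a downscaled global view. The grid-selection rule and helper names are assumptions for illustration, not the exact LLaVA-NeXT code.

```python
from PIL import Image

TILE = 336
GRIDS = [(2, 2), (4, 1), (1, 4)]                 # (cols, rows): 672x672, 1344x336, 336x1344

def split_tiles(img: Image.Image):
    w, h = img.size
    # choose the supported grid whose aspect ratio is closest to the input's
    cols, rows = min(GRIDS, key=lambda g: abs((g[0] / g[1]) - w / h))
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    global_view = img.resize((TILE, TILE))       # low-res overview of the whole image
    return [global_view] + tiles                 # each view is encoded separately
```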

Accelerating LLM Serving with Tree-Based Speculative Decoding and Verification

Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification

Abs: Two things are done on top of speculative decoding: 1) the predictions are organized into a tree, where every node represents a candidate token sequence; 2) the tree-based predictions can be verified in parallel. Intro insight: speculate in parallel, organize as a tree, verify in parallel. But there are two challenges: 1...
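A simplified sketch of the tree-verification control flow summarized in the abstract: speculative candidates form a token tree, and we accept the longest root-to-leaf path that agrees with the target model, plus one bonus token from the target. The `Node`/`verify` helpers and the `target_next_token` callback are hypothetical stand-ins; the paper's tree-attention kernel, batching, and sampling details are omitted.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    token: int
    children: list = field(default_factory=list)

def verify(prefix, root, target_next_token):
    accepted = []
    node = root
    while True:
        # token the target model would emit after prefix + accepted tokens
        expected = target_next_token(prefix + accepted)
        match = next((c for c in node.children if c.token == expected), None)
        if match is None:
            return accepted + [expected]          # bonus token from the target model
        accepted.append(match.token)              # speculative token confirmed
        node = match

# toy usage: a "target model" that always continues with token 7
root = Node(-1, [Node(7, [Node(7), Node(3)]), Node(5)])
print(verify([1, 2], root, lambda seq: 7))        # -> [7, 7, 7]
```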

Llama 3 Distillation in Practice

knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty

Theory. Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty. Abstract: first, two teachers are SFT'd on the 10M-word BabyLM dataset: GPT-2 and small...
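A minimal sketch of the distillation objective this implies: the student matches a soft target built from the ensemble (here, the average) of the teachers' logits, mixed with the ordinary cross-entropy on the labels. The temperature and weighting values are placeholders, not the Baby Llama paper's exact settings.

```python
import torch
import torch.nn.functional as F

# Sketch of distilling from an ensemble of teachers into one student.
def ensemble_kd_loss(student_logits, teacher_logits_list, labels, T=2.0, alpha=0.5):
    teacher_probs = torch.stack(
        [F.softmax(t / T, dim=-1) for t in teacher_logits_list]).mean(dim=0)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  teacher_probs, reduction="batchmean") * T * T   # soft-target term
    ce = F.cross_entropy(student_logits, labels)                  # hard-label term
    return alpha * kd + (1 - alpha) * ce

loss = ensemble_kd_loss(torch.randn(4, 100),
                        [torch.randn(4, 100), torch.randn(4, 100)],
                        torch.randint(0, 100, (4,)))
```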

Coding on Permutation & Knapsack Problems

Example LeetCode Permutation & Knapsack Problems

Example: 377. Combination Sum IV asks us to count the sequences of numbers taken from a list that sum to target (repetition allowed, so a permutation problem). class Solution: def combinationSum4(self, nums: List[int], target: int) -> int: dp = [0]*(target+1) dp[0] = 1 ...
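The excerpt's code is cut off; below is a standard complete version of the permutation-style DP it starts, where dp[i] counts the sequences summing to i and the capacity loop is outermost so different orderings are counted separately.

```python
from typing import List

class Solution:
    def combinationSum4(self, nums: List[int], target: int) -> int:
        dp = [0] * (target + 1)
        dp[0] = 1                          # one way to reach sum 0: pick nothing
        for i in range(1, target + 1):     # capacity outermost -> orderings counted
            for num in nums:
                if num <= i:
                    dp[i] += dp[i - num]
        return dp[target]

print(Solution().combinationSum4([1, 2, 3], 4))   # 7
```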

DeepSpeed from 0 to 1

DeepSpeed Cookbook

Parameter Server · Reference

Are Label Words Anchors? A New Take on Compression & ICL

An Information Flow Perspective for Understanding In-Context Learning

EMNLP 23 best paper. Abstract: In-context learning (ICL): a promising capability of large language models (LLMs), enabled by providing them with demonstration examples to perform diverse tasks, i.e., ...

OPERA (CVPR 24): Alleviating Hallucination in Multimodal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Abstract: Existing remedies for hallucination: training on special data, or correction with an external knowledge base (e.g., an Agent). OPERA is a decoding method; since it only touches decoding, it is almost a free lunch. Insight: MLLMs tend to generate new tokens by focusing on a few summary tokens, but not all the previ...
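A rough illustration of that over-trust signal, assuming access to a recent self-attention map: within a local window, a column that keeps receiving high attention marks a "summary token" the model is over-relying on, and its scaled column-wise product can be used to down-weight the candidate's score during decoding. Window size, scaling, and the beam/rollback machinery are simplifications, not OPERA's reference implementation.

```python
import torch

def over_trust_penalty(attn, window=8, scale=50.0):
    # attn: (seq_len, seq_len) lower-triangular self-attention of recent tokens
    local = attn[-window:, -window:].tril()           # recent tokens attending to each other
    col_scores = (scale * local).prod(dim=0)          # high product -> knowledge-aggregation column
    return col_scores.max()                           # used to penalize the candidate's score

attn = torch.rand(32, 32).tril()
penalty = over_trust_penalty(attn)
```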

Optimizers from 0 to 1

Optimizer Cookbook

SGD. Naive SGD: in gradient descent, each step computes the Loss over the entire dataset and then uses backpropagation to obtain the gradients of all parameters. The drawback is that with a large dataset the computation becomes heavy, which makes it hard to scale up the training data. Stochastic Gradient Descent (SGD) instead randomly picks a mini-batch of data at each step to optimize the network's parameters. This can be approximately equivalent to...
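A minimal sketch of the contrast drawn here: plain SGD samples a random mini-batch per step and applies the update w ← w − lr · ∇w. The model, data, and hyperparameters below are toy placeholders.

```python
import torch

model = torch.nn.Linear(10, 1)
X, y = torch.randn(1000, 10), torch.randn(1000, 1)
lr, batch_size = 0.01, 32

for step in range(100):
    idx = torch.randint(0, X.size(0), (batch_size,))   # random mini-batch instead of the full set
    loss = torch.nn.functional.mse_loss(model(X[idx]), y[idx])
    model.zero_grad()
    loss.backward()
    with torch.no_grad():                              # plain SGD update: w <- w - lr * grad
        for p in model.parameters():
            p -= lr * p.grad
```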