Memory-Augmented Video Understanding: MALLM paper reading

Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Posted by Kylin on April 16, 2024

[TOC]

CVPR 2024

Abstract

Problem with previous MLLMs: they can only take in a limited number of frames, which restricts them to short video understanding.

Motivation: instead of trying to process more frames simultaneously like most existing work, the paper proposes to process videos in an online manner and store past video information in a memory bank.

In other words, frames are processed sequentially and the memory is updated as the video streams in.

Intro

Research Timeline

  • The earliest work directly inserts the query embeddings of every frame into the prompt, but this is constrained by the maximum text length and GPU memory (LLaVA: 256 tokens, BLIP-2: 32 tokens).
  • Average pooling: Video-ChatGPT, but this lacks temporal modeling.
  • Video-LLaMA: employs an extra video querying transformer (Q-Former) to obtain a video-level representation, but the design is overly complex and does not support online extraction.

MALLM Structure

(Screenshot: overall MALLM architecture figure from the paper)

  • Components shared with previous work:

    • a visual encoder to extract visual features,
    • a querying transformer (Q-Former) to align the visual and text embedding spaces,
    • a large language model.

  • Innovation: an online, sequential frame memory bank, i.e., an online processing approach that takes video frames sequentially and stores the video features in the proposed long-term memory bank.

    Two advantages:

    • sidesteps the LLM's maximum context length;
    • avoids running out of GPU memory.

There are also claimed innovations related to memory compression and frame selection.

MALLM

(Screenshot: detailed MALLM framework figure from the paper)

Visual Feature Extraction

Each frame is passed through a ViT to produce a sequence of visual features, and a positional embedding (PE) is then added to encode temporal order: \(V=\left[v_1, v_2, \ldots, v_T\right], v_t \in \mathbb{R}^{P \times C}\)

\[f_t=v_t+PE(t), \quad f_t \in \mathbb{R}^{P \times C} .\]
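
A minimal PyTorch sketch of this step, assuming the frozen ViT returns P patch tokens of dimension C per frame; the learned temporal embedding table (and `max_frames`) is my own guess at how PE(t) could be implemented:

```python
import torch
import torch.nn as nn

class FrameFeatureExtractor(nn.Module):
    def __init__(self, vit: nn.Module, dim: int, max_frames: int = 1024):
        super().__init__()
        self.vit = vit                                    # frozen image encoder (ViT)
        self.temporal_pe = nn.Embedding(max_frames, dim)  # PE(t): one learned vector per timestep

    def forward(self, frame: torch.Tensor, t: int) -> torch.Tensor:
        # frame: (1, 3, H, W) -> v_t: (1, P, C) patch tokens from the frozen ViT
        with torch.no_grad():
            v_t = self.vit(frame)
        # f_t = v_t + PE(t); the same temporal embedding is broadcast over the P tokens
        pe_t = self.temporal_pe(torch.tensor([t], device=v_t.device))  # (1, C)
        return v_t + pe_t[:, None, :]                                  # (1, P, C)
```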

Long-term Temporal Modeling

The Q-Former consists of multiple blocks and uses 32 learnable query tokens (consistent with BLIP-2). Each block mainly contains two attention layers:

  • (1) a cross-attention layer, which interacts with the raw visual embeddings extracted from the frozen visual encoder;
  • (2) a self-attention layer, which models interactions within the input queries.
1) Visual Memory Bank

The visual memory bank concatenates all past frame features: $F_t=\operatorname{Concat}\left[f_1, f_2, \ldots, f_t\right], F_t \in \mathbb{R}^{tP \times C}$. Given the input query $z_t$, the visual memory bank serves as the key and value: \(Q=z_t W_Q, K=F_t W_K, V=F_t W_V .\)

Then the cross-attention operation is applied: \(O=\operatorname{Attn}(Q, K, V)=\operatorname{Softmax}\left(\frac{Q K^T}{\sqrt{C}}\right) V .\) Note that all Q-Former blocks share a single visual memory bank.
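
A single-head, batch-free sketch of this cross-attention, with the concatenated visual memory bank supplying the keys and values; multi-head attention and the rest of the Q-Former block are omitted, and module names are illustrative:

```python
import math
import torch
import torch.nn as nn

class MemoryBankCrossAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)

    def forward(self, z_t: torch.Tensor, memory_bank: torch.Tensor) -> torch.Tensor:
        # z_t:         (N, C)   current query tokens
        # memory_bank: (t*P, C) Concat[f_1, ..., f_t], the accumulated visual features
        q = self.w_q(z_t)            # Q = z_t W_Q
        k = self.w_k(memory_bank)    # K = F_t W_K
        v = self.w_v(memory_bank)    # V = F_t W_V
        attn = torch.softmax(q @ k.T / math.sqrt(q.shape[-1]), dim=-1)  # (N, t*P)
        return attn @ v              # O = Attn(Q, K, V), shape (N, C)
```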

2) Query Memory Bank

The query memory bank stores the input queries from every past timestep, which then serve as key and value when attending, represented as $Z_t=\operatorname{Concat}\left[z_1, z_2, \ldots, z_t\right], Z_t \in \mathbb{R}^{t N \times C}$: \(Q=z_t W_Q, K=Z_t W_K, V=Z_t W_V .\) Unlike the visual memory bank, each Q-Former block maintains its own query memory bank.
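
Putting the two banks together, the online loop might look roughly like the sketch below. All names (`extract_feature`, `qformer_blocks`, `compress_bank`) are placeholders, and the exact point at which compression is triggered is my own assumption:

```python
def process_video_online(frames, extract_feature, qformer_blocks, learned_queries,
                         compress_bank, max_len):
    """Sketch of the online loop: one frame at a time, banks grow and are compressed."""
    visual_bank = []                            # shared by all Q-Former blocks
    query_banks = [[] for _ in qformer_blocks]  # one bank per Q-Former block
    z = learned_queries
    for t, frame in enumerate(frames):
        f_t = extract_feature(frame, t)         # f_t = v_t + PE(t)
        visual_bank.append(f_t)
        z = learned_queries                     # the 32 learnable tokens, re-fed each timestep
        for i, block in enumerate(qformer_blocks):
            query_banks[i].append(z)            # accumulate this block's input queries
            # each block attends over its own query bank and the shared visual bank
            z = block(z, visual_bank, query_banks[i])
        if len(visual_bank) > max_len:          # keep memory at a fixed budget
            visual_bank = compress_bank(visual_bank)
            query_banks = [compress_bank(qb) for qb in query_banks]
    return z                                    # final video-level queries fed to the LLM
```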

3) Memory Bank Compression

Compute similarities between stored features and merge the most similar entries.

For a visual memory bank containing a list of $\left[f_1, f_2, \ldots, f_M\right], f_t \in \mathbb{R}^{P \times C}$, when a new frame feature $f_{M+1}$ comes in, we need to compress the memory bank by reducing its length by 1. At each spatial location $i$, we first calculate the cosine similarity between all temporally adjacent tokens: \(s_t^i=\cos \left(f_t^i, f_{t+1}^i\right), t \in[1, M], i \in[1, P] .\)

Then we select the highest similarity across time, which can be interpreted as the most temporally redundant features: \(k=\operatorname{argmax}_t\left(s_t^i\right) .\)

Next, we simply average the selected token features at all the spatial locations to reduce the memory bank length by 1: \(\hat{f}_k^i=\left(f_k^i+f_{k+1}^i\right) / 2 .\)
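
A small PyTorch sketch of this compression rule: given a bank of M+1 frame features, merge the most temporally redundant adjacent pair at every spatial location, independently per location. Details beyond the formulas above are my own simplification:

```python
import torch
import torch.nn.functional as F

def compress_memory_bank(bank: torch.Tensor) -> torch.Tensor:
    # bank: (M+1, P, C) visual memory bank after appending the newest frame f_{M+1}
    # s_t^i: cosine similarity of temporally adjacent tokens at each spatial location i
    sim = F.cosine_similarity(bank[:-1], bank[1:], dim=-1)  # (M, P)
    # k^i = argmax_t s_t^i : the most temporally redundant adjacent pair per location
    k = sim.argmax(dim=0)                                   # (P,)
    out = []
    for i in range(bank.shape[1]):                          # handle each spatial location i
        ki = int(k[i])
        col = bank[:, i, :]                                 # (M+1, C) tokens over time
        merged = (col[ki] + col[ki + 1]) / 2                # average the selected pair
        # replace token ki with the average and drop token ki+1 -> length shrinks by 1
        out.append(torch.cat([col[:ki], merged[None], col[ki + 2:]], dim=0))
    return torch.stack(out, dim=1)                          # (M, P, C)
```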

Text Decoding

Note that the LLM in MALLM is kept frozen throughout. Training is supervised with the standard text cross-entropy loss: \(\mathcal{L}=-\frac{1}{S} \sum_{i=1}^S \log P\left(w_i \mid w_{<i}, V\right)\), in which $V$ represents the input video and $w_i$ is the $i$-th ground-truth text token.
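
A minimal sketch of this loss, assuming a HuggingFace-style causal LM that accepts `inputs_embeds` (the Q-Former output queries are prepended to the text embeddings); function names and shapes are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def caption_loss(llm, video_query_embeds, text_embeds, text_input_ids):
    # video_query_embeds: (B, N, C) Q-Former outputs projected into the LLM embedding space
    # text_embeds:        (B, S, C) embeddings of the ground-truth tokens w_1..w_S
    # text_input_ids:     (B, S)    the ground-truth token ids
    inputs = torch.cat([video_query_embeds, text_embeds], dim=1)
    logits = llm(inputs_embeds=inputs).logits               # (B, N+S, vocab)
    # logits at position N-1+i predict token w_{i+1}; drop the final position
    n = video_query_embeds.shape[1]
    text_logits = logits[:, n - 1:-1, :]                    # (B, S, vocab)
    return F.cross_entropy(text_logits.reshape(-1, text_logits.shape[-1]),
                           text_input_ids.reshape(-1))
```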

Exp

  • Evaluation mainly covers the following task settings:

    • Long-term Video Understanding
    • Video Question Answering
    • Video Captioning
    • Online Action Prediction

  • Each of these settings is evaluated on a corresponding set of benchmark datasets.

Training Details

No multi-stage training schedule; the model is trained directly on the three tasks. OPT variants used as the LLM:

OPT-1.3B, OPT-2.7B, OPT-6.7B

Limitations

  • The online processing approach is time-consuming; a possible solution is a hierarchical method: process the video in segments, then model the relationships between segments.
  • Replace the image-based visual encoder with a video- or clip-based encoder.
