[TOC]
Abstract
Existing remedies for hallucination: training on specially curated data, or correction with external knowledge bases (e.g., agents).
OPERA is a decoding-only method; since it touches nothing but decoding, it is almost a free lunch (no extra training, no external knowledge needed).
Insight: MLLMs tend to generate new tokens by focusing on a few summary tokens, but not all the previous tokens. Such partial over-trust inclination results in the neglecting of image tokens and describes the image content with hallucination.
(During decoding, the MLLM's attention matrix concentrates on a small subset of tokens; this neglect of the visual tokens is very likely to cause hallucination.)
Solution (penalize, roll back, re-select): OPERA introduces a penalty term on the model logits during beam-search decoding to mitigate the over-trust issue, along with a rollback strategy that retrospects the presence of summary tokens among the previously generated tokens and re-allocates the token selection if necessary.
Intro
A key observation: hallucinations tend to appear right after tokens exhibiting an aggregation pattern (global, columnar attention). These tokens carry little semantic content themselves (e.g., punctuation), yet strongly influence subsequent generation.
(1) The 'aggregation pattern' seems to be inherent to LLMs: the paper hypothesizes that such tokens act as summary tokens, analogous to the 'anchor token' observation in the NLP area [1], summarizing the preceding context and steering subsequent generation.
(2) The 'aggregation pattern' leads to hallucination in current MLLMs: existing MLLMs mostly place the vision tokens before the text. As the text grows longer and information is relayed between successive summary tokens, the visual information decays rapidly (a single summary token can hardly summarize all the visual content). In short: the more summary tokens the generation passes through, the stronger the hallucination.
Method:
1) Add a penalty term to beam search: the over-trust penalty introduces a weighted score for the candidate selection step in the Beam Search, so that the candidate with an over-trust pattern will have lower priority to be selected.
2) Add a decoding rollback mechanism, because the knowledge aggregation pattern is observed to emerge with a lag; see the sketch after this list.
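A minimal sketch (my own helper names, not the authors' code) of how the penalty enters candidate selection: each beam candidate's cumulative log-probability is reduced by \(\alpha \cdot \phi\) before the top-B beams are kept.

```python
import numpy as np

def select_beams(candidates, alpha=1.0, beam_size=2):
    """Hypothetical sketch: re-rank beam candidates with OPERA's over-trust penalty.

    `candidates` is a list of (cum_logprob, phi) pairs, where `phi` is the
    over-trust score phi(omega) computed from each candidate's attention map
    (see the Method section below). Candidates with a strong aggregation
    pattern (large phi) are down-weighted before the top-B are kept.
    """
    scored = [(lp - alpha * phi, i) for i, (lp, phi) in enumerate(candidates)]
    scored.sort(reverse=True)                  # higher penalized score first
    return [i for _, i in scored[:beam_size]]  # indices of surviving beams

# Toy usage: beam 0 has the best raw log-prob but a strong over-trust pattern.
print(select_beams([(-1.0, 3.0), (-1.2, 0.1), (-1.5, 0.2)], alpha=0.5))  # -> [1, 2]
```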
Related Work
DoLa [2] is a recently proposed decoding method that aims to mitigate hallucinations in LLMs; it contrasts the logits of the mature (final) layer against premature (earlier) layers and uses the rescaled difference as the output distribution.
Method
Over-Trust Logit Penalty
Start from the local window of the causal self-attention map: \(\mathbf{W}_{t-1}^k=\left\{\mathbf{w}^i\right\}_{i=t-k}^{t-1}, \quad \text{ s.t. } \mathbf{w}^i=\left\{\omega_{i, j}\right\}_{j=t-k}^{i}\). Then mask out the upper triangle and multiply by a scaling factor (scaling up, to keep the attention values from being too small): \(\mathbf{W}_{t-1}^k \triangleq\left\{\mathbf{w}^i\right\}_{i=t-k}^{t-1}, \quad \text{ s.t. } \mathbf{w}^i=\left\{\sigma \omega_{i, j}\right\}_{j=t-k}^{t-1}\), where \(\left\{\omega_{i, j}\right\}_{j=i+1}^{t-1}\) are zeros and \(\sigma\) is a configurable scaling factor.
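A minimal numpy sketch of this preprocessing, assuming `attn` holds one head's causal self-attention weights over the generated sequence (function and variable names are mine):

```python
import numpy as np

def local_attention_window(attn, t, k, sigma=50.0):
    """Extract the k x k local window of a (t x t) lower-triangular
    self-attention map, ending at position t-1, and scale it by sigma.

    attn[i, j] = attention weight omega_{i,j} of query token i on key token j.
    The upper triangle (j > i) is zeroed, matching the paper's definition;
    sigma amplifies the otherwise tiny attention values so the later
    column-wise products do not vanish numerically.
    """
    window = attn[t - k:t, t - k:t].copy()  # rows/cols i, j in [t-k, t-1]
    window = np.tril(window)                # keep only j <= i
    return sigma * window

# Toy usage on a random causal attention map with t=8 generated tokens.
t, k = 8, 5
attn = np.tril(np.random.rand(t, t))
attn /= attn.sum(axis=1, keepdims=True)     # rows sum to 1, like softmax attention
W = local_attention_window(attn, t, k)
print(W.shape)                              # (5, 5)
```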
Computing the penalty coefficient: \(\phi\left(\omega_{<t}\right)=\prod_{i=c}^{t-1} \sigma \omega_{i, c}, \quad \text{ s.t. } c=\underset{t-k \leq j \leq t-1}{\arg \max } \prod_{i=j}^{t-1} \sigma \omega_{i, j}\), i.e., pick the column \(c\) (a candidate summary-token position) whose column-wise product of scaled attention values is largest, and use that product as the penalty. Next-token prediction with the penalty: \(p\left(x_t \mid x_{<t}\right)=\operatorname{Softmax}\left[\mathcal{H}\left(h_t\right)-\alpha \phi\left(\omega_{\leq t}\right)\right]_{x_t}, \quad \text{ s.t. } x_t \in \mathcal{Y}\), where \(\mathcal{H}\) is the vocabulary head.
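A sketch of the penalty itself, continuing the assumptions above: \(\phi\) is the largest column-wise product of the scaled window. Note that subtracting a scalar from all logits cancels inside a single softmax; the penalty bites during beam search, where each candidate token carries its own \(\phi\) (computed after tentatively appending it) and the penalized scores are compared across candidates.

```python
import numpy as np

def over_trust_penalty(W):
    """phi(omega_{<t}): for each column j of the scaled lower-triangular
    window W (a candidate summary-token position), take the product of the
    attention values from the diagonal down, then return the maximum over
    columns. A tall 'columnar' pattern yields a large product, i.e. a large penalty.
    """
    k = W.shape[0]
    col_products = [np.prod(W[j:, j]) for j in range(k)]  # prod_{i=j}^{t-1} sigma*omega_{i,j}
    return max(col_products)

def penalized_distribution(logits, phi, alpha=1.0):
    """Softmax[H(h_t) - alpha * phi]. For one candidate this shift cancels in
    the softmax; it matters across beam candidates, each with its own phi."""
    z = logits - alpha * phi
    z = z - z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

# Toy usage with a dummy scaled window and dummy logits.
W = 50.0 * np.tril(np.random.rand(5, 5))
phi = over_trust_penalty(W)
p = penalized_distribution(np.random.randn(32000), phi, alpha=1.0)
```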
Retrospection-Allocation Strategy
One situation remains where every candidate is penalized and hallucination has already occurred ("there still exist a few cases where all of the candidates get penalized and the hallucination already occurred"), so we need to roll back.
The idea of rollback: check whether the maximum-pattern location is the same across the last r decoded positions; once the count passes the threshold, roll back to that summary token and exclude the original branch from the next-token prediction:
Candidate set of pattern locations over the last \(l\) steps: \(\mathcal{C}=\left\{c \mid c=\underset{t-k \leq j \leq z}{\arg \max } \prod_{i=j}^{z} \sigma \omega_{i, j},\ z \in[t-l, t-1]\right\}\). Overlap count for the modal location: \(N_{\text{overlap}}=\sum_{c \in \mathcal{C}} \mathbb{1}_{c=s}, \quad \text{ s.t. } s=\operatorname{Mode}(\mathcal{C})\). If \(N_{\text{overlap}} \ge r\), roll back to \(s=\operatorname{Mode}(\mathcal{C})\) and restrict the next-token selection to \(\mathcal{Y} \setminus \left\{x_{s+1}\right\}\).
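A sketch of the retrospection check under the same assumptions (names are mine; `argmax_column` reuses the column-product rule of the penalty, and `l <= k` is assumed so every window column is valid):

```python
import numpy as np
from collections import Counter

def argmax_column(attn, lo, z, sigma=50.0):
    """c = argmax_{lo <= j <= z} prod_{i=j}^{z} sigma * attn[i, j]:
    the strongest aggregation column seen up to position z."""
    products = [np.prod(sigma * attn[j:z + 1, j]) for j in range(lo, z + 1)]
    return lo + int(np.argmax(products))

def should_rollback(attn, t, k, l, r, sigma=50.0):
    """Collect the strongest-column location for each of the last l positions;
    if the modal location s recurs at least r times, signal a rollback to s.
    The caller then re-decodes from s, excluding the originally chosen x_{s+1}."""
    C = [argmax_column(attn, t - k, z, sigma) for z in range(t - l, t)]
    s, n_overlap = Counter(C).most_common(1)[0]
    return (s, n_overlap) if n_overlap >= r else None

# Toy usage: inspect l = 4 recent positions, roll back if one column recurs r = 3 times.
t, k = 12, 8
attn = np.tril(np.random.rand(t, t))
print(should_rollback(attn, t, k=k, l=4, r=3))
```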
EXP
Metrics
CHAIR: measures the rate of object hallucination (at the sentence level and the object-instance level).
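For reference, the standard CHAIR definitions (from the original CHAIR metric, not restated in this note): \(\mathrm{CHAIR}_S=\frac{|\{\text{sentences with a hallucinated object}\}|}{|\{\text{all sentences}\}|}, \quad \mathrm{CHAIR}_I=\frac{|\{\text{hallucinated objects}\}|}{|\{\text{all mentioned objects}\}|}\).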
POPE: randomly samples polling questions of the form "Is there a <object> in the image?" for binary (yes/no) VQA evaluation.
Eliminating Repetition
The authors observe that repetition is accompanied by periodic knowledge aggregation patterns, so the paper's rollback strategy can also be used to eliminate repetition.
Limitations
1) Limited coverage of the diversity of hallucinations: the authors attribute many failure cases to problems introduced during pretraining and to the MLLM's weak visual capability, which a decoding-only method cannot fix.
2) Weak on hallucinations in short sequences, because the method's signal is inherently lagging: an aggregation pattern only becomes detectable after enough subsequent tokens have been generated.
References
[1] Label words are anchors: An information flow perspective for understanding in-context learning. arXiv preprint arXiv:2305.14160, 2023. https://arxiv.org/abs/2305.14160
[2] Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. DoLa: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883, 2023.