OPERA (CVPR 24): Alleviating Hallucination in Multimodal Large Language Models via Over-Trust Penalty and Retrospection Allocation

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Posted by Kylin on April 20, 2024

[TOC]

Abstract

Existing approaches to hallucination mitigation: training on specially constructed data, or correction with external knowledge bases (e.g., agent-based pipelines).

OPERA is a decoding method; since it only modifies the decoding procedure, it is almost a free lunch (no extra training data or external knowledge is required).

Insight: MLLMs tend to generate new tokens by focusing on a few summary tokens, but not all the previous tokens. Such partial over-trust inclination results in the neglecting of image tokens and describes the image content with hallucination.

(During decoding, the MLLM's attention tends to rely on only a subset of tokens; this neglect of the visual tokens is likely to cause hallucination.)

Solution (penalize, roll back, re-select): OPERA introduces a penalty term on the model logits during the beam-search decoding to mitigate the over-trust issue, along with a rollback strategy that retrospects the presence of summary tokens in the previously generated tokens and re-allocates the token selection if necessary.

Intro

An observation: hallucination tends to appear after tokens with an aggregation pattern (whose attention is global and columnar). These tokens carry little semantic content themselves (e.g., punctuation marks), yet they strongly influence the subsequent generation.

[Figure: screenshot 2024-04-20 20.42.04]

(1) The 'aggregation pattern' seems to be inherent to LLMs: the paper hypothesizes that such tokens are summary tokens, analogous to the 'anchor token' observation in the NLP area [1], which summarize the preceding context and guide the subsequent generation.

(2) The 'aggregation pattern' leads to hallucination in current MLLMs: most MLLMs place the vision tokens before the text. As the text grows longer and information is relayed through summary tokens, the visual information decays rapidly (a single summary token can hardly summarize all the visual information). In short: the more summary tokens the generation passes through, the stronger the hallucination.

[Figure: screenshot 2024-04-20 21.06.23]

Method:

1) Add a penalty term to beam search: the over-trust penalty introduces a weighted score for the candidate selection step in beam search, so that a candidate with an over-trust pattern has lower priority to be selected.

2) Add a decoding rollback mechanism, because the knowledge aggregation pattern is observed to appear with a lag (it only becomes evident after several more tokens have been generated).

Related Work

DoLa [2] is a recently proposed decoding method that aims to mitigate hallucinations in LLMs: it contrasts the logits of the mature (final) layer with those of premature (earlier) layers and rescales the increments as the output distribution.
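As a rough illustration of this contrastive idea (my own simplification, not the official DoLa code: the real method also selects the premature layer dynamically and applies an adaptive plausibility constraint), assume we already have the logits of a mature and a premature layer:

```python
import torch
import torch.nn.functional as F

def dola_contrast(logits_mature: torch.Tensor,
                  logits_premature: torch.Tensor) -> torch.Tensor:
    """Contrast the final (mature) layer with an earlier (premature) layer:
    tokens whose log-probability grows the most between the two layers
    are up-weighted in the output distribution."""
    log_p_mature = F.log_softmax(logits_mature, dim=-1)
    log_p_premature = F.log_softmax(logits_premature, dim=-1)
    # Rescale the increments into a new next-token distribution.
    return F.softmax(log_p_mature - log_p_premature, dim=-1)
```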

Method

Over-Trust Logit Penalty

Start from the local window of self-attention weights over the most recent \(k\) tokens:

\(\mathbf{W}_{t-1}^k=\left\{\mathbf{w}^i\right\}_{i=t-k}^{t-1}, \quad \text { s.t. } \mathbf{w}^i=\left\{\omega_{i, j}\right\}_{j=t-k}^i.\)

Then pad the rows to the full window, setting the upper-triangular (non-causal) entries to zero, and multiply by a scaling-up factor (so that the attention values, which are individually small, do not vanish when multiplied together later):

\(\mathbf{W}_{t-1}^k \triangleq\left\{\mathbf{w}^i\right\}_{i=t-k}^{t-1}, \quad \text { s.t. } \mathbf{w}^i=\left\{\sigma \omega_{i, j}\right\}_{j=t-k}^{t-1},\)

where \(\left\{\omega_{i, j}\right\}_{j=i+1}^{t-1}\) are zeros and \(\sigma\) is a configurable scaling factor.
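A rough sketch of this windowing step, assuming `attn` is a (seq_len, seq_len) causal self-attention map (e.g., averaged over heads); the function name, the default value of `sigma`, and the head-averaging assumption are mine, not taken from the OPERA implementation:

```python
import torch

def local_attention_window(attn: torch.Tensor, k: int, sigma: float = 50.0) -> torch.Tensor:
    """Crop the last k x k block of a causal self-attention map,
    zero out the upper-triangular (non-causal) entries, and scale it up
    so that the column-wise products used below do not underflow.

    attn: (seq_len, seq_len) attention weights; sigma: illustrative default.
    """
    window = attn[-k:, -k:].clone()   # rows/columns for positions t-k ... t-1
    window = torch.tril(window)       # keep only omega_{i,j} with j <= i
    return sigma * window
```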

The penalty value is the column-wise product of the scaled attention along the strongest column:

\(\phi\left(\omega_{\leq t}\right)=\prod_{i=c}^{t-1} \sigma \omega_{i, c}, \quad \text { s.t. } c=\underset{t-k \leq j \leq t-1}{\arg \max } \prod_{i=j}^{t-1} \sigma \omega_{i, j}.\)

The penalty is then used in next-token prediction:

\(p\left(x_t \mid x_{<t}\right)=\operatorname{Softmax}\left[\mathcal{H}\left(h_t\right)-\alpha \phi\left(\omega_{\leq t}\right)\right]_{x_t}, \quad \text { s.t. } x_t \in \mathcal{Y},\)

where \(\mathcal{H}\) is the vocabulary head that maps the hidden state \(h_t\) to logits and \(\alpha\) controls the penalty strength.
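Note that \(\phi\) is a single scalar for a given candidate sequence, so its real effect is on how beam-search candidates are ranked against each other (as the quoted sentence in the Intro says) rather than within one softmax. Below is a rough PyTorch sketch of that candidate-scoring step; the names (`column_wise_penalty`, `score_candidates`, `windows`) are my own illustration, not the official OPERA code:

```python
import torch

def column_wise_penalty(window: torch.Tensor) -> torch.Tensor:
    """window: (k, k) scaled lower-triangular attention block (see above).
    For each column c, multiply the entries on and below the diagonal;
    the largest product is the over-trust penalty phi."""
    k = window.shape[0]
    products = torch.stack([window[c:, c].prod() for c in range(k)])
    return products.max()

def score_candidates(log_probs: torch.Tensor, windows: list, alpha: float = 1.0) -> torch.Tensor:
    """Beam-search candidate scores: each candidate's log-probability is
    reduced by alpha * phi computed from the attention window that candidate
    would induce, so candidates showing a strong columnar (over-trust)
    pattern get lower priority."""
    phis = torch.stack([column_wise_penalty(w) for w in windows])
    return log_probs - alpha * phis
```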

[Figure: screenshot 2024-04-20 23.55.27]

Retrospection-Allocation Strategy

There still exist a few cases where all of the candidates get penalized and the hallucination has already occurred, so the decoding needs a way to roll back.

The idea of the rollback is to check whether the location of the maximum aggregation pattern stays at the same position over the recent decoding steps; once the count reaches the threshold \(r\), decoding rolls back to that summary token and excludes the originally chosen branch from the next-token selection:

Candidate set of pattern locations:

\(\mathcal{C}=\left\{c \mid c=\underset{t-k \leq j \leq z}{\arg \max } \prod_{i=j}^z \sigma \omega_{i, j}, \; z \in[t-l, t-1]\right\}.\)

Location-overlap count:

\(N_{\text {overlap }}=\sum_{c \in \mathcal{C}} \mathbb{1}_{c=s}, \quad \text { s.t. } s=\operatorname{Mode}(\mathcal{C}).\)

If \(N_{\text{overlap}} \geq r\), decoding rolls back to \(s=\operatorname{Mode}(\mathcal{C})\) and the next token is selected from \(\mathcal{Y} \setminus\left\{x_{s+1}\right\}\).
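A rough sketch of this check, reusing the scaled window from the penalty sketch above; the helper names and the indexing of positions inside the window are my own assumptions, not OPERA's code:

```python
from collections import Counter
import torch

def argmax_pattern_location(window: torch.Tensor, z: int) -> int:
    """Column c <= z with the largest product of scaled attention values
    window[c..z, c] (the same columnar statistic used by the penalty)."""
    products = torch.stack([window[c:z + 1, c].prod() for c in range(z + 1)])
    return int(products.argmax())

def retrospection_check(window: torch.Tensor, l: int, r: int):
    """Collect the max-pattern location for the last l decoding steps;
    if the most frequent location s appears at least r times, signal a
    rollback to s (the token originally chosen at s+1 is then excluded
    from re-selection)."""
    k = window.shape[0]
    locations = [argmax_pattern_location(window, z) for z in range(k - l, k)]
    s, n_overlap = Counter(locations).most_common(1)[0]
    return n_overlap >= r, s
```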

[Figure: screenshot 2024-04-20 23.56.30]

EXP

Metrics

CHAIR: measures the proportion of hallucinated objects, at both the sentence level and the image (instance) level.
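For reference, the standard CHAIR definitions (from the original CHAIR metric, not restated in this post) are:

\(\mathrm{CHAIR}_S=\frac{\left|\left\{\text{sentences with a hallucinated object}\right\}\right|}{\left|\left\{\text{all sentences}\right\}\right|}, \quad \mathrm{CHAIR}_I=\frac{\left|\left\{\text{hallucinated objects}\right\}\right|}{\left|\left\{\text{all mentioned objects}\right\}\right|}.\)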

POPE: randomly samples questions of the form “Is there a <object> in the image?” and evaluates the yes/no answers, i.e., a VQA-style probe.
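A toy sketch of how such probes can be built; the object lists and the plain random negative sampling are placeholders (the actual POPE benchmark also uses popular and adversarial negative sampling):

```python
import random

def build_pope_questions(present_objects, candidate_objects, n_negative=3):
    """Yes-questions come from objects actually present in the image;
    no-questions come from sampled objects that are absent."""
    questions = [(f"Is there a {obj} in the image?", "yes") for obj in present_objects]
    absent = [obj for obj in candidate_objects if obj not in present_objects]
    for obj in random.sample(absent, min(n_negative, len(absent))):
        questions.append((f"Is there a {obj} in the image?", "no"))
    return questions
```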

Eliminating Repetition

The authors find that repetition is accompanied by periodic knowledge aggregation patterns, so the rollback strategy of this paper can also be used to eliminate repetition.

[Figure: screenshot 2024-04-20 23.36.24]

Limitations

1) Limited coverage of the diversity of hallucinations: the authors attribute many failure cases to issues introduced during pre-training and to the limited visual capability of current MLLMs.

2) Less effective on hallucinations in short sequences, because the statistics this work relies on appear with a lag (the aggregation pattern only emerges after enough tokens have been generated).

References

  1. Label words are anchors: An information flow perspective for understanding in-context learning. arXiv preprint arXiv:2305.14160, 2023. https://arxiv.org/abs/2305.14160

  2. Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. DoLa: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883, 2023.