Gemini Technical Report



Some technical details:

decoder-only architecture
multi-query attention
interleaved multi-modal input data
text and image output data
visual encoder: inspired by Flamingo, CoCa, and PaLI, with the important distinction that the models are multimodal from the beginning and can natively output images using discrete image tokens
visual decoder: discrete image tokens ¹ ²
video understanding: the video is encoded as a sequence of frames; video frames or images can be interleaved naturally with text or audio as part of the model input
audio understanding: the model can directly ingest audio signals as features from the Universal Speech Model (USM)
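
To make the multi-query attention item above concrete, here is a minimal pure-Python sketch. The report does not describe its implementation, so everything here is an illustrative assumption: the function name `multi_query_attention`, the nested-list tensor layout, and the causal mask (chosen because the model is decoder-only). The one property taken from the technique itself is that all query heads share a single set of keys and values.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def multi_query_attention(queries, keys, values):
    """Causal multi-query attention (hypothetical helper, not from the report).

    queries: [num_heads][seq_len][d]  one query projection per head
    keys:    [seq_len][d]             a single key head shared by all query heads
    values:  [seq_len][d_v]           a single value head shared by all query heads
    Returns: [num_heads][seq_len][d_v]
    """
    d = len(keys[0])
    d_v = len(values[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    for head_q in queries:
        head_out = []
        for t, q in enumerate(head_q):
            # Causal mask: position t only attends to positions 0..t.
            scores = [dot(q, keys[s]) * scale for s in range(t + 1)]
            weights = softmax(scores)
            head_out.append(
                [sum(w * values[s][j] for s, w in enumerate(weights))
                 for j in range(d_v)]
            )
        outputs.append(head_out)
    return outputs


# Usage: two query heads, sequence length 2, one shared key/value head.
queries = [
    [[1.0, 0.0], [0.0, 1.0]],  # head 0
    [[0.0, 0.0], [0.0, 0.0]],  # head 1 (all-zero queries give uniform weights)
]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [3.0, 2.0]]
out = multi_query_attention(queries, keys, values)
```

Because keys and values are stored once rather than once per head, the key/value cache kept during autoregressive decoding is smaller than in standard multi-head attention by a factor equal to the number of query heads, which is the usual motivation for this design.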

¹ Yu, Jiahui, et al. “Scaling autoregressive models for content-rich text-to-image generation.” arXiv preprint arXiv:2206.10789 (2022).
² Ramesh, Aditya, et al. “Zero-shot text-to-image generation.” International Conference on Machine Learning. PMLR, 2021.
