Gemini Technical Report

An Analysis of the Gemini Technical Report

Posted by Kylin on December 8, 2023

Model Architecture

Some technical details:

  • decoder-only architecture
  • multi-query attention
  • interleaved multimodal input data
  • text and image output data
  • visual encoder: inspired by Flamingo, CoCa, and PaLI (with the important distinction that the models are multimodal from the beginning and can natively output images using discrete image tokens)
  • visual decoder: discrete image tokens [1, 2]
  • video understanding: the video is encoded as a sequence of frames; video frames or images can be interleaved naturally with text or audio as part of the model input
  • audio understanding: the model takes audio features from the Universal Speech Model (USM) as input
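
Multi-query attention, listed above, keeps a separate set of queries per head but shares a single key head and a single value head across all of them, which shrinks the key/value cache during decoding. The sketch below is a toy illustration in plain Python, not code from the Gemini report: it assumes the queries, keys, and values have already been projected, and it applies a causal mask as a decoder-only model would. The function names and shapes are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def multi_query_attention(queries, keys, values):
    """Causal multi-query attention (illustrative sketch).

    queries: [num_heads][seq_len][d]  one set of queries per head
    keys:    [seq_len][d]             a single key head shared by all heads
    values:  [seq_len][d_v]           a single value head shared by all heads
    returns: [num_heads][seq_len][d_v]
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    d_v = len(values[0])
    outputs = []
    for head_queries in queries:
        head_out = []
        for t, q in enumerate(head_queries):
            # Causal mask: position t only attends to positions 0..t.
            scores = [dot(q, keys[s]) * scale for s in range(t + 1)]
            weights = softmax(scores)
            head_out.append(
                [sum(w * values[s][j] for s, w in enumerate(weights))
                 for j in range(d_v)]
            )
        outputs.append(head_out)
    return outputs
```

In standard multi-head attention, `keys` and `values` would carry an extra leading `num_heads` dimension; here every head reads the same two arrays, so only one key/value pair per position needs to be cached.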

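The interleaving of text, images, and video frames described in the list above can be pictured as flattening everything into one token sequence. The report does not say how Gemini does this internally, so the sketch below is only a toy illustration in the spirit of the discrete-image-token approach of [1, 2]: it assumes images arrive already quantized into codebook indices and that image codes are offset past the text vocabulary so both share one id space. `TEXT_VOCAB_SIZE`, `TextSegment`, `ImageSegment`, and `interleave` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Union

TEXT_VOCAB_SIZE = 1000  # hypothetical size of the text vocabulary


@dataclass
class TextSegment:
    token_ids: List[int]  # ids in [0, TEXT_VOCAB_SIZE)


@dataclass
class ImageSegment:
    code_ids: List[int]  # indices into a (hypothetical) image codebook


Segment = Union[TextSegment, ImageSegment]


def interleave(segments: List[Segment]) -> List[int]:
    """Flatten mixed segments into one sequence, preserving their order."""
    sequence: List[int] = []
    for segment in segments:
        if isinstance(segment, TextSegment):
            sequence.extend(segment.token_ids)
        elif isinstance(segment, ImageSegment):
            # Shift image codes past the text vocabulary (an assumption).
            sequence.extend(TEXT_VOCAB_SIZE + c for c in segment.code_ids)
        else:
            raise TypeError(f"unsupported segment: {segment!r}")
    return sequence


def is_image_token(token_id: int) -> bool:
    """True if the id falls in the image-code range of the shared id space."""
    return token_id >= TEXT_VOCAB_SIZE
```

Under these assumptions a video is simply several `ImageSegment` frames in a row, optionally with text between them, for example `interleave([TextSegment([1, 2]), ImageSegment([0, 5]), ImageSegment([7, 9]), TextSegment([3])])`.
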
References

  1. Yu, Jiahui, et al. “Scaling autoregressive models for content-rich text-to-image generation.” arXiv preprint arXiv:2206.10789 (2022).

  2. Ramesh, Aditya, et al. “Zero-shot text-to-image generation.” International Conference on Machine Learning. PMLR, 2021.