[TOC]
Non-autoregressive decoding
[97, 104, 108] are the earliest works, originally proposed for machine translation
[271]: a survey of the non-autoregressive translation direction
disadvantage: so far the conditional dependence between output tokens is not modeled, so the decoded results are unreliable
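To make the contrast concrete, here is a toy, framework-free sketch (not taken from any cited paper): the autoregressive decoder calls the model once per token and conditions on the growing prefix, while the non-autoregressive decoder picks every position independently from a single pass, which is exactly where the missing inter-token dependence comes from. `toy_scores` is a hypothetical stand-in for a real model.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "<eos>"]

def toy_scores(src, prefix):
    # hypothetical stand-in for a model: one distribution over VOCAB per target position
    random.seed(hash((tuple(src), tuple(prefix or []))) % (2 ** 32))
    return [{w: random.random() for w in VOCAB} for _ in range(6)]

def autoregressive_decode(src, max_len=6):
    out = []
    for _ in range(max_len):
        probs = toy_scores(src, prefix=out)[len(out)]   # P(y_t | src, y_<t): must wait for y_<t
        out.append(max(probs, key=probs.get))
    return out                                          # max_len sequential model calls

def non_autoregressive_decode(src, tgt_len=6):
    probs = toy_scores(src, prefix=None)                # one parallel pass: P(y_t | src) for all t
    # every position is chosen independently: fast, but the missing dependence
    # between output tokens is what makes the result unreliable (e.g. repetition)
    return [max(p, key=p.get) for p in probs[:tgt_len]]

print(autoregressive_decode(["le", "chat"]))
print(non_autoregressive_decode(["le", "chat"]))
```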
Speculative decoding
speculative execution [47] was first proposed in 1985 as a general parallel-computing technique
SpecInfer [177] introduces multiple small draft models coupled with a novel tree-based speculative inference and token verification mechanism (adopted by [48, 118, 168, 185, 229, 236, 274, 310])
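For reference, a minimal sketch of the basic single-draft, greedy-verification loop (the generic scheme, not SpecInfer's tree-based variant): the small model drafts k tokens, the large model checks them, and the longest agreeing prefix plus one large-model token is kept, so the result matches plain greedy decoding with the large model. `target_model` and `draft_model` below are toy stand-ins.

```python
WORDS = ["one", "two", "three", "four", "five", "six", "seven", "eight"]

def target_model(prefix):        # toy "large" model: deterministic greedy next word
    return WORDS[len(prefix) % len(WORDS)]

def draft_model(prefix):         # toy small draft model: agrees except at every 5th position
    return "oops" if len(prefix) % 5 == 0 else target_model(prefix)

def speculative_step(target_model, draft_model, prefix, k=4):
    # 1) the cheap draft model proposes k tokens autoregressively
    draft = []
    for _ in range(k):
        draft.append(draft_model(prefix + draft))
    # 2) the large model scores every draft position; a real system does this in
    #    ONE batched forward pass, emulated here call-by-call for clarity
    target_next = [target_model(prefix + draft[:i]) for i in range(k + 1)]
    # 3) keep the longest agreeing draft prefix, then the large model's own token,
    #    so the output is identical to plain greedy decoding with the large model
    accepted = []
    for i, tok in enumerate(draft):
        if tok != target_next[i]:
            break
        accepted.append(tok)
    accepted.append(target_next[len(accepted)])
    return prefix + accepted

seq = []
while len(seq) < 8:
    seq = speculative_step(target_model, draft_model, seq, k=4)
print(seq)   # identical to greedy decoding with target_model alone
```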
parallel decoding (self-checking decoding; its efficiency is guaranteed by results from linear algebra): https://arxiv.org/pdf/2305.10427.pdf
n-gram cache pool (lookahead decoding): https://lmsys.org/blog/2023-11-21-lookahead-decoding/
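A rough sketch of the parallel / "self-checking" idea, assuming the fixed-point (Jacobi-iteration) view these works take: greedy decoding is the fixed point of y_t = argmax P(y_t | y_<t, x), so one can guess all positions at once, refresh them all in parallel, and stop when nothing changes; convergence within n sweeps is guaranteed, which is the linear-algebra argument mentioned above. `greedy_next` is a hypothetical one-pass stand-in.

```python
def jacobi_decode(greedy_next, src, n, init="<pad>"):
    guess = [init] * n                       # arbitrary initial guess for all n positions
    for _ in range(n):                       # at most n sweeps are ever needed
        new = [greedy_next(src, guess[:t]) for t in range(n)]   # one parallel refresh
        if new == guess:                     # fixed point reached: equals greedy decoding
            break
        guess = new
    return guess

def greedy_next(src, prefix):                # toy stand-in: copy the source, then emit <eos>
    return src[len(prefix)] if len(prefix) < len(src) else "<eos>"

print(jacobi_decode(greedy_next, ["a", "b", "c", "d"], n=5))
# ['a', 'b', 'c', 'd', '<eos>'] — converges in one sweep here because the toy model
# ignores the prefix contents; real models need several sweeps per block
```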
advantage: the outputs are left unchanged, which is guaranteed because every speculative step is verified by the larger LLM
In terms of the effect they bring, decoding studies fall into two classes:
- improving decoding quality, e.g., greedy decoding, top-p sampling, and contrastive search (see the top-p sketch after this list)
- improving decoding speed, e.g., speculative decoding
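As one concrete quality-oriented example, a small sketch of top-p (nucleus) sampling: sample only from the smallest set of highest-probability tokens whose cumulative mass reaches p, so the long low-probability tail is never drawn.

```python
import random

def top_p_sample(probs, p=0.9):
    # nucleus (top-p) sampling: keep only the smallest set of highest-probability
    # tokens whose cumulative mass reaches p, then sample from that set
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, mass = [], 0.0
    for tok, pr in ranked:
        nucleus.append((tok, pr))
        mass += pr
        if mass >= p:
            break
    tokens, weights = zip(*nucleus)
    return random.choices(tokens, weights=weights)[0]   # choices renormalizes the weights

print(top_p_sample({"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.05}, p=0.9))
# only "the", "a" or "cat" can come out; the 0.05 tail is never sampled
```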
Early exiting
early exiting: ref. 1 below (BranchyNet)
[117, 147, 163, 167, 234, 272, 282, 291, 308] — these works (refs. 2–10 below) explore different termination conditions (refs. 7, 8, and 9 deserve a closer look)
adaptive computation (refs. 11, 12 below) adjusts the amount of computation dynamically per request (ref. 11 deserves a closer look; it is related to efficient inference)
disadvantage: accuracy is not high
idea: could this be combined with speculative decoding, with shallow layers drafting and deep layers verifying?
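A minimal sketch of the common confidence-threshold exit condition (just one of the termination conditions explored in refs. 2–10; the per-layer heads and toy layers below are hypothetical placeholders): the forward pass stops at the first layer whose exit classifier is confident enough.

```python
def early_exit_forward(layers, exit_heads, x, threshold=0.9):
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        x = layer(x)                          # run one more transformer layer
        probs = head(x)                       # cheap per-layer exit classifier
        if max(probs.values()) >= threshold:  # confident enough: skip the remaining layers
            return max(probs, key=probs.get), depth
    return max(probs, key=probs.get), depth   # fell through: used the full network

# toy stand-ins: each "layer" just accumulates evidence for class "A"
layers = [lambda x: x + 0.3 for _ in range(4)]
exit_heads = [lambda x: {"A": min(x, 1.0), "B": 1.0 - min(x, 1.0)} for _ in range(4)]

print(early_exit_forward(layers, exit_heads, x=0.2, threshold=0.9))
# ('A', 3): exits after 3 of the 4 layers once confidence reaches the threshold
```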
Cascade inference
Based on the varying complexities of inference requests, queries of different difficulty are dispatched to different LLMs to minimize response time.
CascadeBERT [157] uses BERT to classify query difficulty and then dispatches different queries to different LLMs.
A concurrent work [312] jointly optimizes model multiplexing and query caching and also analyzes the optimality of minimizing inference cost.
Mixture-of-thought [288] extends the cascade idea to LLM reasoning tasks for cost-saving, which samples answers from both Chain-of-Thought [258] and Program-of-Thought [57] prompts.
challenge: designing an accurate dispatching mechanism so that model quality is not compromised.
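One minimal sketch of the cascade pattern, assuming a simple confidence-based escalation rule rather than CascadeBERT's learned dispatcher: the small model answers first, and the query is escalated to the large model only when its confidence is low. All names below are toy placeholders.

```python
def cascade_answer(query, small_llm, large_llm, threshold=0.8):
    answer, confidence = small_llm(query)    # the cheap model always answers first
    if confidence >= threshold:              # confident enough: return the fast answer
        return answer
    return large_llm(query)                  # escalate hard queries to the large model

# toy stand-ins: confidence shrinks as the query gets longer
def small_llm(q):
    return f"[small] {q}", 1.0 / (1.0 + len(q.split()) / 32)

def large_llm(q):
    return f"[large] {q}"

print(cascade_answer("what is 2 + 2", small_llm, large_llm))
print(cascade_answer("derive the closed form of the recurrence and prove it converges for all inputs",
                     small_llm, large_llm))
```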
Reference
1. Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. 2016. BranchyNet: Fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE, 2464–2469.
2. Xuanli He, Iman Keivanloo, Yi Xu, Xiang He, Belinda Zeng, Santosh Rajagopalan, and Trishul Chilimbi. 2021. Magic Pyramid: Accelerating inference with early exiting and token pruning. arXiv preprint arXiv:2111.00230 (2021).
3. Jun Kong, Jin Wang, Liang-Chih Yu, and Xuejie Zhang. 2022. Accelerating Inference for Pretrained Language Models by Unified Multi-Perspective Early Exiting. In Proceedings of the 29th International Conference on Computational Linguistics. 4677–4686.
4. Kaiyuan Liao, Yi Zhang, Xuancheng Ren, Qi Su, Xu Sun, and Bin He. 2021. A global past-future early exit method for accelerating inference of pre-trained language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013–2023.
5. Weijie Liu, Peng Zhou, Zhiruo Wang, Zhe Zhao, Haotang Deng, and Qi Ju. 2020. FastBERT: a Self-distilling BERT with Adaptive Inference Time. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 6035–6044.
6. Tianxiang Sun, Xiangyang Liu, Wei Zhu, Zhichao Geng, Lingling Wu, Yilong He, Yuan Ni, Guotong Xie, Xuanjing Huang, and Xipeng Qiu. 2022. A simple hash-based early exiting approach for language understanding and generation. arXiv preprint arXiv:2203.01670 (2022).
7. Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. 2020. DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2246–2251.
8. Deming Ye, Yankai Lin, Yufei Huang, and Maosong Sun. 2021. TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 5798–5809.
9. Dewen Zeng, Nan Du, Tao Wang, Yuanzhong Xu, Tao Lei, Zhifeng Chen, and Claire Cui. 2023. Learning to Skip for Language Modeling. arXiv preprint arXiv:2311.15436 (2023).
10. Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, and Furu Wei. 2020. BERT Loses Patience: Fast and Robust Inference with Early Exit. Advances in Neural Information Processing Systems 33 (2020), 18330–18341.
11. Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, and Subhabrata Mukherjee. 2023. SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference. arXiv preprint arXiv:2307.02628 (2023).
12. Tal Schuster, Adam Fisch, Tommi Jaakkola, and Regina Barzilay. 2021. Consistent Accelerated Inference via Confident Adaptive Transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 4962–4979.