Optimal Use of Computational Resources when Using LLM

Authors

  • Prakhar Goel, Independent Researcher, Bangalore, 560066, India
  • Sukanya Sahoo, Independent Researcher, Bangalore, 560066, India

Keywords

Optimization, Quantization, Pruning, ONNX, Knowledge distillation, Approximation techniques

Abstract

Large Language Models (LLMs) are machine learning models used to understand and generate human language, and they have demonstrated outstanding performance on a variety of natural language processing tasks such as sentiment analysis, text generation, text completion, question answering, and language translation. LLMs are based on neural networks and rely on pre-training to learn representations of language that can then be fine-tuned for specific tasks. Language models like GPT have been incredibly successful at natural language processing tasks but come with high computational demands, making their deployment on resource-constrained devices challenging.

There are several ways in which the computational resources consumed by these models during inference can be optimized. The first approach is optimization of the model architecture, where the architecture is modified to reduce the number of parameters and the computation required for inference. The next technique is quantization, where the weights of the model are represented with fewer bits to reduce computational requirements and memory footprint. Pruning removes redundant parameters, which further reduces both computation and model size. Knowledge distillation trains a smaller, more compact model to mimic the behavior of the larger model, reducing computational requirements while maintaining accuracy. Approximation techniques, such as tensor decomposition and low-rank matrix factorization, reduce the computational complexity of the model.
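
As a concrete illustration of the techniques discussed above, the following minimal sketch applies dynamic quantization, magnitude pruning, and a low-rank (SVD-based) approximation to a toy feed-forward block using PyTorch's built-in utilities. It is not an implementation from the article: the layer sizes (768 and 3072), the 8-bit integer data type, the 40% sparsity level, and the rank of 128 are illustrative assumptions.

```python
# Minimal sketch (not from the article): three optimization techniques applied
# independently to a toy Transformer-style feed-forward block.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for one feed-forward sub-layer of a language model.
model = nn.Sequential(
    nn.Linear(768, 3072),   # illustrative hidden sizes, assumptions only
    nn.GELU(),
    nn.Linear(3072, 768),
)

# Quantization: represent Linear weights with 8-bit integers instead of
# 32-bit floats, shrinking the memory footprint for CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
y = quantized(torch.randn(1, 768))  # used exactly like the original model

# Pruning: zero out the 40% of weights with the smallest magnitude in the
# first Linear layer (an illustrative sparsity level), removing redundancy.
prune.l1_unstructured(model[0], name="weight", amount=0.4)
prune.remove(model[0], "weight")    # make the induced sparsity permanent

# Approximation via low-rank factorization: split the second Linear layer's
# weight with a truncated SVD, so one 3072 -> 768 layer becomes
# 3072 -> 128 -> 768, with fewer parameters and multiply-adds.
fc = model[2]
U, S, Vh = torch.linalg.svd(fc.weight.data, full_matrices=False)
rank = 128                                   # illustrative rank
first = nn.Linear(3072, rank, bias=False)
second = nn.Linear(rank, 768)
first.weight.data = Vh[:rank, :]             # (rank, 3072)
second.weight.data = U[:, :rank] * S[:rank]  # (768, rank)
second.bias.data = fc.bias.data
model[2] = nn.Sequential(first, second)      # drop-in low-rank replacement
```

Each technique is shown on its own here; in practice they are combined, evaluated for the resulting accuracy and latency trade-off, and the optimized model can then be exported (for example to ONNX) for deployment on resource-constrained devices.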

Published

2023-06-18

How to Cite

Goel, P., & Sahoo, S. (2023). Optimal Use of Computational Resources when Using LLM. International Journal of Computer (IJC), 48(1), 103–122. Retrieved from https://www.ijcjournal.org/index.php/InternationalJournalOfComputer/article/view/2076
