Intel researchers propose a new AI approach to deploy LLMs on CPUs more efficiently

Large Language Models (LLMs) have taken the world by storm thanks to their remarkable performance across a wide variety of tasks. They are known for their abilities in text generation, language understanding, text summarization, and much more. The downside to their widespread adoption is the astronomical size of their parameters, which demands large memory capacity and specialized hardware for inference. As a result, deploying these models has been very difficult.

One way to reduce the computational cost of inference is quantization, i.e., reducing the precision of a neural network's weights and activations. INT8 quantization and weight-only quantization are two approaches for lowering the cost of inference. However, these methods are generally optimized for CUDA GPUs and do not necessarily work well on central processing units (CPUs).
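To make the idea concrete, here is a minimal NumPy sketch of symmetric per-tensor INT8 quantization. It illustrates the general technique only, not the paper's implementation; the weight matrix `w` is made up:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: map FP32 weights into [-127, 127]."""
    scale = np.abs(w).max() / 127.0                 # one FP32 scale for the tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)                        # max error is about scale / 2
```

Storing `q` instead of `w` cuts memory by 4x; the price is the rounding error bounded by half a quantization step.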

The authors of this paper from Intel propose an effective method to deploy LLMs efficiently on central processing units (CPUs). Their approach supports automatic INT4 weight-only quantization, in which low precision is applied to the model weights alone while the activations are kept in higher precision. They also designed a dedicated LLM runtime with highly optimized kernels that speed up inference on CPUs.
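A hedged sketch of what group-wise INT4 weight-only quantization can look like in NumPy (the group size of 32 and the symmetric [-7, 7] range are illustrative assumptions, not the paper's exact recipe):

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=32):
    """Group-wise symmetric INT4 quantization: every `group_size` consecutive
    weights share one FP32 scale, and quantized values lie in [-7, 7]."""
    flat = w.reshape(-1, group_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scales = np.maximum(scales, 1e-12)              # guard against all-zero groups
    q = np.clip(np.round(flat / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_groupwise(q, scales, shape):
    """Multiply each group by its scale and restore the original shape."""
    return (q.astype(np.float32) * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 64)).astype(np.float32)
q, scales = quantize_int4_groupwise(w)
w_hat = dequantize_groupwise(q, scales, w.shape)    # activations would stay FP32
```

Giving each small group its own scale keeps outlier weights from inflating the quantization step for the whole tensor, which is what makes 4-bit storage tolerable accuracy-wise.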

The quantization flow is built on the Intel Neural Compressor and allows tuning over different quantization recipes, granularities, and group sizes to produce an INT4 model that meets the accuracy goal. The model is then passed to the LLM runtime, a specialized environment designed to evaluate the quantized model's performance and provide efficient inference for LLMs on central processing units (CPUs).
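The accuracy-driven tuning loop can be pictured as follows. This is a generic, self-contained sketch of recipe search, not the Intel Neural Compressor API; the toy `quantize` step, the cosine-similarity proxy metric, and the 0.95 target are all assumptions for illustration:

```python
import numpy as np

def quantize(w, group_size):
    """Toy group-wise INT4 round-trip used as a stand-in for a real recipe."""
    flat = w.reshape(-1, group_size)
    scales = np.maximum(np.abs(flat).max(axis=1, keepdims=True) / 7.0, 1e-12)
    return (np.clip(np.round(flat / scales), -7, 7) * scales).reshape(w.shape)

def accuracy(w, w_hat, x):
    """Proxy metric: cosine similarity between FP32 and quantized outputs."""
    y, y_hat = x @ w, x @ w_hat
    return float(np.dot(y.ravel(), y_hat.ravel())
                 / (np.linalg.norm(y) * np.linalg.norm(y_hat)))

def tune(w, x, goal, group_sizes=(128, 64, 32)):
    """Try coarser (cheaper) recipes first; keep the first that meets the goal."""
    for g in group_sizes:
        w_hat = quantize(w, g)
        if accuracy(w, w_hat, x) >= goal:
            return g, w_hat
    return None, None                               # no candidate met the goal

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128)).astype(np.float32)
x = rng.standard_normal((4, 128)).astype(np.float32)
group_size, w_int4 = tune(w, x, goal=0.95)
```

The real flow evaluates full-model accuracy on a dataset rather than a single matrix, but the shape of the search is the same: quantize under a candidate recipe, measure, and stop once the goal is met.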

For their experiments, the researchers chose popular LLMs spanning a range of parameter sizes (from 7B to 20B) and evaluated the FP32 and INT4 models on open-source datasets. They observed that the accuracy of the INT4 model on the selected datasets was nearly on par with that of the FP32 model. They also performed a comparative analysis of next-token generation latency and found that the LLM runtime outperforms the ggml-based solution by up to 1.6x.

In conclusion, this paper presents a solution to one of the biggest challenges associated with LLMs: inference on central processing units (CPUs). Traditionally, these models require specialized hardware such as GPUs, making them inaccessible to many organizations. This paper combines INT4 model quantization with a specialized LLM runtime to provide efficient LLM inference on CPUs. Evaluated on a set of popular LLMs, the method showed an advantage over ggml-based solutions and delivered accuracy on par with FP32 models. There is still room for further improvement, and the researchers plan to enable generative AI on personal computers to meet the increasing demand for AI-generated content.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don't forget to join our 32k+ ML SubReddit, 41k+ Facebook community, Discord channel, and Email newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you’ll love our newsletter.

We are also on Telegram and WhatsApp.

I am a graduate of Civil Engineering (2022) from Jamia Millia Islamia University, New Delhi, and I have a keen interest in data science, especially neural networks and their applications in various fields.

