Running Llama2 on CPU and GPU with OpenVINO

Raymond Lo, PhD
3 min read · Sep 6, 2023


With the new weight compression feature in OpenVINO, you can now run llama-2-7b with less than 16GB of RAM on CPUs!
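Under the hood, that compression comes from NNCF. As a rough sketch (not the notebook's exact code, and with hypothetical IR paths), INT8 weight compression of an exported OpenVINO model looks like this:

```python
import nncf
from openvino.runtime import Core, serialize

core = Core()
# Hypothetical path to an FP16/FP32 OpenVINO IR exported beforehand
model = core.read_model("llama-2-7b-chat/openvino_model.xml")

# compress_weights() quantizes the weights to INT8 by default,
# which is what brings the memory footprint under 16GB
compressed_model = nncf.compress_weights(model)
serialize(compressed_model, "llama-2-7b-chat-int8/openvino_model.xml")
```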

One of the most exciting AI developments of 2023 has been the emergence of open-source LLMs like Llama 2, RedPajama, and MPT. However, these models do not come cheap! For example, it took over 3.3 million GPU-hours to train the 7B, 13B, 34B (yes, it existed), and 70B Llama 2 models. See https://arxiv.org/abs/2307.09288.

With so much effort already invested in training these models, the next question people often ask is how to use them, and whether they would run at all on everyday hardware like laptops or desktops.

Today, we would like to introduce LLM support in the OpenVINO 2023.1 release. As a preview, we have created a sample tutorial notebook that can run not one but three different LLMs on the OpenVINO runtime. Additionally, we provide all the steps you need to run this chatbot either locally or remotely on your own server! :)

The available options are:

  • red-pajama-3b-chat — A 2.8B-parameter pre-trained language model based on the GPT-NeoX architecture, developed by Together Computer and leaders from the open-source AI community. The model is fine-tuned on the OASST1 and Dolly2 datasets to enhance its chat ability. More details about the model can be found in the HuggingFace model card.
  • llama-2-7b-chat — Llama 2 is the second generation of Llama models developed by Meta. Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. llama-2-7b-chat is the 7-billion-parameter version of Llama 2, fine-tuned and optimized for dialogue use cases. More details about the model can be found in the paper, repository, and HuggingFace model card.
  • mpt-7b-chat — MPT-7B is part of the family of MosaicPretrainedTransformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference. These architectural changes include performance-optimized layer implementations and the elimination of context-length limits by replacing positional embeddings with Attention with Linear Biases (ALiBi). Thanks to these modifications, MPT models can be trained with high throughput efficiency and stable convergence. MPT-7B-chat is a chatbot-like model for dialogue generation. It was built by fine-tuning MPT-7B on the ShareGPT-Vicuna, HC3, Alpaca, HH-RLHF, and Evol-Instruct datasets. More details about the model can be found in the blog post, repository, and HuggingFace model card.

The image below shows examples of user instructions and the model's answers.

So, how well does it run? Here are a few demos on a Raptor Lake i9-13900K, a Xeon Silver 4416+, and an Intel Arc A770M! With the NNCF optimization, we see really responsive results across these devices. Check them out.
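If you want to see which devices OpenVINO can target on your own machine before running the notebook, a quick query does it. A minimal sketch:

```python
from openvino.runtime import Core

# Lists the inference devices OpenVINO detects, e.g. ['CPU', 'GPU']
print(Core().available_devices)
```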

Code:

You can download the code below, and let us know what you think!
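To give a feel for the flow before you open the notebook, here is a minimal sketch of running one of these models through Hugging Face Optimum Intel on the OpenVINO runtime. The model id is the public (gated) Hugging Face checkpoint, which I am assuming here for illustration; the notebook itself handles conversion and compression for you.

```python
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

# Gated checkpoint: request access on Hugging Face first
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True converts the PyTorch weights to OpenVINO IR on the fly
model = OVModelForCausalLM.from_pretrained(model_id, export=True)

prompt = "What is OpenVINO?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```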

Update:

In the next release, we will have INT8 and GPU working together. Here is an example of Llama-7B-INT8 running on the iGPU of an Intel i9-12900H :). This will also run on Intel Core Ultra 7 and 9 processors.
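For a rough idea of what targeting the iGPU looks like with Optimum Intel (the local INT8 folder name below is a hypothetical placeholder):

```python
from optimum.intel import OVModelForCausalLM

# Hypothetical local folder holding the INT8-compressed OpenVINO IR
model = OVModelForCausalLM.from_pretrained("llama-2-7b-chat-int8")
model.to("GPU")   # OpenVINO device name; "CPU" is the default
model.compile()   # builds the compiled model on the selected device
```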

#iamintel

Raymond Lo, PhD

@Intel - OpenVINO AI Software Evangelist. Ex-Google, ex-Samsung, and ex-Meta (Augmented Reality) executive. Ph.D. in Computer Engineering — U of T.