Intel neural-chat-7b Model Achieves Top Ranking on LLM Leaderboard!

Jack_Erickson · ‎11-30-2023

The Intel neural-chat-7b model has achieved the #1 ranking for 7-billion-parameter models on the Hugging Face Open LLM Leaderboard. It first reached the top of the leaderboard with an average score of 59.06 on November 13, 2023, and was still on top as of this post’s publication.

Three versions of the model are listed on the leaderboard: half-precision floating point (float16), bfloat16, and 4-bit integer (INT4). Each one tops the leaderboard for its level of precision. A 32-bit floating-point version is also available.
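To give a feel for what these precision levels mean in practice, here is a back-of-the-envelope sketch of the memory needed just to hold the weights. It assumes exactly 7 billion parameters and counts weights only; activations, the KV cache, and any quantization metadata are ignored, so real footprints will be somewhat higher. The function name `weight_memory_gb` is made up for this illustration.

```python
# Rough weight-only memory estimate per precision level.
# Assumption: exactly 7e9 parameters; overheads are not counted.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int4": 0.5,  # 4 bits = half a byte
}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return the approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"{precision:>8}: {weight_memory_gb(7e9, precision):.1f} GB")
```

Under these assumptions, the INT4 version needs roughly a quarter of the weight memory of float16 or bfloat16, and an eighth of float32.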

This model is the foundation for the NeuralChat chatbot available within Intel® Extension for Transformers*. This library extends Hugging Face Transformers and uses Intel® Neural Compressor to provide transformer-specific model compression, along with the ability to implement and customize a chatbot with just a few commands.

At 7 billion parameters, neural-chat-7b is at the low end of today’s large language model (LLM) sizes. Yet it achieved accuracy scores comparable to those of models 2-3x larger (to check, filter the LLM leaderboard for model sizes of ~13B and ~35B and compare their scores to its 59.06). Although it was fine-tuned using Intel® Gaudi®2 AI accelerators, its small size means you can deploy it to a wide range of compute platforms.

This model is based on the Mistral-7B-v0.1 transformer model from Mistral AI, which achieved competitive benchmark results as a small LLM with a large context window. The Mistral AI model was also chosen because it is open source and uses the Apache License, Version 2.0, making it suitable for both academic and commercial use. The NeuralChat model uses the same licensing.

The team performed supervised fine-tuning of the Mistral foundation model using a pipeline available in Intel Extension for Transformers. Fine-tuning builds on the breadth of language skills of the foundation model while providing timely and specific knowledge updates. It also requires only a fraction of the compute time and effort of training a model from scratch.

Attaining such a competitive benchmark score by only fine-tuning an open-source model required a novel approach. Direct preference optimization (DPO) incorporates human preference feedback, much as reinforcement learning from human feedback (RLHF) does. The advantage of DPO is that it does not require a reward model and is more computationally efficient: it trains directly on data consisting of prompts paired with preferred and non-preferred responses. A separate blog post explains the supervised fine-tuning process and its use of DPO.
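To make the idea concrete, here is a minimal sketch of the standard DPO loss for a single prompt with one preferred and one non-preferred response. It is not the team's training code. The inputs are assumed to be summed log-probabilities of each response under the model being trained (the policy) and under a frozen reference model, and `beta=0.1` is an illustrative value, not one taken from this post.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """DPO loss for one (prompt, preferred, non-preferred) triple.

    Each response's implicit reward is beta times how much more likely
    the policy makes it than the reference model does. No separate
    reward model is involved.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # The loss falls as the preferred response's reward margin grows.
    return -log_sigmoid(chosen_reward - rejected_reward)


# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the preferred response.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy matches the reference, the margin is zero and the loss equals log 2; it drops below that as the policy favors the preferred response more strongly than the reference does.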

A separate blog post walks through an example to get started using this model for inference, from installing the proper libraries to loading the model and performing inference. You can try inference on Intel® Developer Cloud, running the same code on Intel Gaudi2 AI accelerators, Intel® Data Center Max Series GPUs, or 4th Generation Intel® Xeon® Scalable processors. The GitHub* repo for Intel Extension for Transformers contains a variety of examples for fine-tuning, customizing, and optimizing the NeuralChat chatbot.
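As a rough idea of what loading the model and performing inference can look like, here is a minimal sketch using the plain Hugging Face Transformers API rather than the NeuralChat interface. Several things here are assumptions not stated in this post: the model ID `Intel/neural-chat-7b-v3-1`, the "### System / ### User / ### Assistant" prompt layout, and the helper names `build_prompt` and `generate`. Check the model card for the exact ID and template before relying on them. It also assumes `transformers` and PyTorch are installed.

```python
def build_prompt(system_message: str, user_message: str) -> str:
    """Format a single-turn prompt (layout is an assumption, see model card)."""
    return (
        f"### System:\n{system_message}\n"
        f"### User:\n{user_message}\n"
        f"### Assistant:\n"
    )


def generate(
    user_message: str,
    system_message: str = "You are a helpful assistant.",
    model_id: str = "Intel/neural-chat-7b-v3-1",  # assumed model ID
    max_new_tokens: int = 128,
) -> str:
    """Load the model and generate a reply for one user message."""
    # Imported here so build_prompt stays usable without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    prompt = build_prompt(system_message, user_message)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # The decoded text contains the prompt followed by the model's reply.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("What is direct preference optimization?")` downloads the weights on first use, which takes several gigabytes of disk space and memory, so it is best tried on a machine sized for a 7-billion-parameter model.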

Intel Extension for Transformers and Intel Neural Compressor are both available, along with a full suite of end-to-end AI software from Intel.

Notices & Disclaimers: Performance varies by use, configuration and other factors. Learn more on the Performance Index site. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates.