Microsoft Introduces New LLM phi-1: Specialized in Python Coding Tasks

ODSC - Open Data Science
2 min read · Jul 7, 2023

In a new paper, a team from Microsoft introduces phi-1, a new transformer-based large language model for code. Specialized in Python coding, it is significantly smaller than competing models. In the study, the team also investigates how high-quality data can enhance the performance of state-of-the-art (SOTA) LLMs while reducing dataset size and training computation.

According to the team, the model is trained on “textbook quality” data, which includes synthetic data generated with GPT-3.5 and filtered data sourced from the web. This is followed by fine-tuning on “textbook-exercise-like” data. The result is a 1.3B-parameter model. Despite phi-1’s smaller size, it outperforms larger competitors, demonstrating the potential of high-quality data in optimizing LLM performance.

The paper also examines how to improve data quality, most notably through data cleaning, a critical step in building modern datasets. This, in turn, could lead to more streamlined datasets that provide the ability to iterate over the data more extensively. In terms of performance, phi-1 attained 50.6% pass@1 accuracy on HumanEval and 55.5% pass@1 accuracy on MBPP (Mostly Basic Python Programs). According to the team, this is one of the best self-reported numbers using only one LLM generation.
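For readers unfamiliar with the metric, pass@1 is the share of benchmark problems for which a single generated solution passes the problem's unit tests. The following is a minimal Python sketch of the pass@k estimator commonly used with HumanEval. It is not the phi-1 team's evaluation code, and the function names and toy results are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass the tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random set of
    # k samples contains no passing sample. math.comb returns 0 when
    # k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs, one pair per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical toy data: four problems, one generation each (n = 1),
# where c is 1 if that generation passed the tests and 0 otherwise.
toy_results = [(1, 1), (1, 0), (1, 1), (1, 0)]
print(benchmark_pass_at_k(toy_results, k=1))  # 0.5
```

With one generation per problem, as in the toy data, the estimate reduces to the fraction of problems solved on the first attempt, which is how a figure such as 50.6% on HumanEval can be read.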

As mentioned above, what makes this significant is the combination of strong Python coding performance with reduced computational requirements and less training data. The team at Microsoft showed that phi-1 can achieve impressive accuracy on code-related tasks under these constraints while being orders of magnitude smaller than competing models.

This could, in theory, help lead to more efficient and effective language models, reshaping the market in the near future by providing developers and their organizations with a new tool. Not only could it streamline coding tasks, but it could also help tech-focused organizations and developers enhance their productivity while reducing environmental costs through lower resource use.

Originally posted on OpenDataScience.com

