Supercharging Your Data Science Projects with GitHub Tools

ODSC - Open Data Science
7 min read · Nov 10, 2023

Technology is advancing at a rapid pace, bringing new innovations that are transforming our workplaces. One role that is being especially disrupted by these advancements is that of the data scientist. Data science is already an exciting field, but new tools are taking it to the next level in terms of productivity and capabilities. With the help of these new technologies, data scientists can work faster and more efficiently than ever before. In this post, we will show you these advancements in action.

A data science project with Python, VS Code and GitHub Tools

Let’s dive deep into some innovative GitHub tools and features that can improve the productivity of your data science workflow. To explore them, let’s imagine we have been asked to create a predictive model to forecast the number of rentals for a bicycle rental business based on seasonality and weather conditions.

To build such a model, starting from a historical rentals dataset, we are going to perform some data analysis and experiments in a Python Jupyter notebook in VS Code. The secret sauce behind our project's productivity boost has two main ingredients:

  • GitHub Copilot, an AI-powered assistant embedded in the VS Code interface that offers inline suggestions, slash commands and a chat experience.
  • GitHub Codespaces, a pre-defined development environment hosted in the cloud.

Creating our workspace

Before we can write our first line of Python code, or even create a new Jupyter notebook, we need the latest version of Python installed on our local machine and the Python extension installed in VS Code. Then we have to install the Python libraries needed to explore, clean, and visualize the data, plus those needed to train and evaluate our machine learning model. These prerequisites vary from one project to another, and some of them may have conflicting dependencies, which adds extra effort to our workflow. Also, if we collaborate with colleagues on the same project, each of them has to repeat the same installation process before contributing to our code.

This is where GitHub Codespaces is tremendously helpful: it lets us create a reproducible, pre-configured workspace for our project that we can host and share in the cloud. But how do we get started?

Once we have enabled GitHub Copilot Chat in our IDE (VS Code), we can interact with this built-in virtual assistant through a chat interface, by asking questions in natural language or using pre-defined slash commands to define the scope of the answer we expect.

For example, the prompt “/createWorkspace for a Jupyter Python notebook with a GitHub Codespaces configuration installing pandas, numpy and scikit-learn” will output a suggested directory structure for our project:

  • `.devcontainer/devcontainer.json`: configuration file for the GitHub Codespaces development container, specifying the Docker image to use and the extensions to install in the container.
  • `.devcontainer/requirements.txt`: configuration file listing the Python packages to install in the development container.
  • `data/my_data.csv`: placeholder for the file containing the data to be used in the Jupyter notebook.
  • `notebooks/my_notebook.ipynb`: template Jupyter notebook file, importing pandas, numpy and scikit-learn.
  • `README.md`: placeholder file containing the documentation for the project.

Clicking “Create Workspace” at the bottom creates the directory structure locally and initializes the files with some basic content that we can customize for our scenario.
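To make the layout concrete, here is a minimal Python sketch that creates the same directory structure by hand. It is not what GitHub Copilot generates: the container name, the base image, the `postCreateCommand` value and the README text are all assumptions chosen for illustration, and the files Copilot produces will differ.

```python
import json
import tempfile
from pathlib import Path

# Packages named in the prompt above.
REQUIREMENTS = ["pandas", "numpy", "scikit-learn"]


def scaffold(root: Path) -> list[str]:
    """Create the project layout under root and return the relative paths created."""
    # Hypothetical container settings; adjust name and image to your project.
    devcontainer = {
        "name": "bike-rentals",
        "image": "mcr.microsoft.com/devcontainers/python:3",
        "postCreateCommand": "pip install -r .devcontainer/requirements.txt",
    }
    files = {
        ".devcontainer/devcontainer.json": json.dumps(devcontainer, indent=2),
        ".devcontainer/requirements.txt": "\n".join(REQUIREMENTS) + "\n",
        "data/my_data.csv": "",  # empty placeholder for the dataset
        "notebooks/my_notebook.ipynb": json.dumps(
            {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}
        ),
        "README.md": "# Bike rentals forecast\n",
    }
    for rel_path, content in files.items():
        path = root / rel_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content, encoding="utf-8")
    return sorted(files)


if __name__ == "__main__":
    # Write into a temporary directory so the sketch leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        for created in scaffold(Path(tmp)):
            print(created)
```

Pointing `scaffold` at a real project folder instead of a temporary directory would leave the files in place for further editing.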

Now, to create a codespace from here, we should first publish our code on GitHub through the Source Control view in the Visual Studio Code sidebar.

Then, we can customize the configuration files that will be used to build the container. For example, we can add the GitHub Copilot and GitHub Copilot Chat extensions to the devcontainer.json file with the following lines:

"customizations": {
  "vscode": {
    "extensions": [
      "github.copilot",
      "github.copilot-chat"
    ]
  }
}

Note that the ‘customizations’ field should sit at the same level of the JSON structure as the container ‘name’. If the JSON file created by GitHub Copilot already has an extensions array, we just have to append the two extension IDs to it.
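The merge rule described here can be sketched in Python. This is only an illustration working on an in-memory dictionary: the `add_extensions` helper and the sample configuration are hypothetical, and a real devcontainer.json may contain comments that the standard `json` module cannot parse.

```python
import copy
import json

COPILOT_EXTENSIONS = ["github.copilot", "github.copilot-chat"]


def add_extensions(config: dict, extensions: list[str]) -> dict:
    """Return a copy of config with the extensions appended under customizations.vscode.extensions."""
    updated = copy.deepcopy(config)  # leave the caller's dictionary untouched
    # 'customizations' is created at the top level, next to 'name', if it is missing.
    vscode = updated.setdefault("customizations", {}).setdefault("vscode", {})
    existing = vscode.setdefault("extensions", [])
    for ext in extensions:
        if ext not in existing:  # avoid listing an extension twice
            existing.append(ext)
    return updated


if __name__ == "__main__":
    # Hypothetical configuration that already lists one extension.
    config = {
        "name": "bike-rentals",
        "customizations": {"vscode": {"extensions": ["ms-python.python"]}},
    }
    print(json.dumps(add_extensions(config, COPILOT_EXTENSIONS), indent=2))
```

Existing entries keep their position and the two Copilot extensions are added at the end of the array, which matches the manual edit described above.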

This way, we’ll be able to use GitHub Copilot features in our remote environment as well.

After that, we can ask GitHub Copilot Chat for help again, this time to create a remote codespace on top of our repository: ‘How can I create now a GitHub Codespaces starting from the devcontainer configuration files I have in this folder structure?’

Following the instructions in the reply will let us build and open a codespace configured with the pre-defined requirements.

Writing, debugging, and documenting our Python Code

Once we open our codespace in Visual Studio Code, we can start our experiments. The first step of our project is to import the data we’ll use to train our model into a pandas DataFrame. As we write our Python code, GitHub Copilot offers inline suggestions, displayed as grey text, that we can accept in full, accept in part, or ignore.

Also, since the pandas library was listed in the requirements file used to build the codespace, no extra step is needed before executing our first code cell.
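A first code cell for this step might look like the sketch below. The column names and values are hypothetical stand-ins for the rentals dataset, and the data is read from an in-memory string so the example runs on its own; in the project, `load_rentals` would be given the path `data/my_data.csv` instead.

```python
import io

import pandas as pd

# Hypothetical sample standing in for data/my_data.csv.
SAMPLE_CSV = """season,temp,hum,windspeed,rentals
1,0.34,0.80,0.16,331
1,0.36,0.69,0.24,131
2,0.50,0.55,0.10,820
3,0.72,0.60,0.12,1405
"""


def load_rentals(source) -> pd.DataFrame:
    """Read rental records from a file path or file-like object into a DataFrame."""
    return pd.read_csv(source)


if __name__ == "__main__":
    rentals = load_rentals(io.StringIO(SAMPLE_CSV))
    print(rentals.head())       # first rows, to check the columns were parsed
    print(rentals.describe())   # summary statistics for the numeric columns
```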

Now let’s move on to some data visualization. Let’s imagine we would like to create a histogram with matplotlib showing the distribution of bike rentals in our dataset.

Suppose we forgot to define the axis object before plotting, so running the cell raises a NameError exception. In a case like this, GitHub Copilot can help us troubleshoot the error. We just need to click the Fix using Copilot button and we’ll get an analysis of the error along with suggested code changes to fix it.
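For reference, a corrected version of such a cell could look like this. It is a sketch under assumptions rather than the output of GitHub Copilot: the rental counts, labels and the `plot_rentals_histogram` helper are made up for illustration, and the non-interactive backend is selected only so the code also runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Hypothetical daily rental counts.
rentals = [331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 820, 1405, 952, 640]


def plot_rentals_histogram(values, bins=5):
    """Draw a histogram of rental counts and return the figure and axis."""
    # Without this line, ax.hist(...) raises NameError: name 'ax' is not defined.
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.hist(values, bins=bins)
    ax.set_title("Distribution of bike rentals")
    ax.set_xlabel("Rentals per day")
    ax.set_ylabel("Frequency")
    return fig, ax


if __name__ == "__main__":
    fig, ax = plot_rentals_histogram(rentals)
    fig.savefig("rentals_hist.png")  # in a notebook the figure is shown inline instead
```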

After some data exploration and cleaning, which are outside the scope of this article, let’s suppose we are ready to train our regression model. Since this is the core part of our solution, we would like clear documentation to accompany the code. We can accelerate the tedious but essential task of documenting our code with the GitHub Copilot /doc command.

By selecting the piece of code of interest and then typing the command in the chat window, we can easily get the desired output, which we can use, for example, as the content of a markdown cell in our notebook.
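To give an idea of the result, here is a sketch of a documented training function. The docstring is hand-written in the style such a command might produce and is not actual GitHub Copilot output. The choice of scikit-learn's `LinearRegression`, the function name, the default parameters and the synthetic data are all assumptions, since the article does not specify the model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def train_rentals_model(X, y, test_size=0.3, random_state=0):
    """Train a linear regression model to predict bike rentals.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        Feature matrix, for example seasonal and weather variables.
    y : array-like of shape (n_samples,)
        Number of rentals for each sample.
    test_size : float
        Fraction of the data held out for evaluation.
    random_state : int
        Seed that makes the train/test split reproducible.

    Returns
    -------
    model : LinearRegression
        The fitted model.
    rmse : float
        Root mean squared error on the held-out data.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
    return model, rmse


if __name__ == "__main__":
    # Synthetic stand-ins for two features and a noisy rentals target.
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(100, 2))
    y = 500 + 800 * X[:, 0] - 300 * X[:, 1] + rng.normal(0, 20, size=100)
    model, rmse = train_rentals_model(X, y)
    print(f"RMSE: {rmse:.1f}")
```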

Summary

In this article we provided some tips and tricks for data scientists to improve their productivity and enhance collaboration by leveraging GitHub tools, Python, and VS Code. We covered how to create a reproducible workspace using GitHub Codespaces and interact with GitHub Copilot through a chat interface to streamline project setup. We also demonstrated how GitHub Copilot provides inline suggestions, assists with debugging, and helps automate documentation tasks, ultimately improving the efficiency and effectiveness of data science projects.

If you try the prompts we used in our demo in your own environment, be aware that you might get different results. This is because GitHub Copilot is powered by the OpenAI GPT-4 model, which, like every large language model, is non-deterministic: the same input can produce different outputs.

Interested in learning more about using GitHub and VS Code to improve your productivity? Watch the VS Code Explains series and attend the Supercharging your Data Science projects with GitHub tools webinar.

Originally posted on OpenDataScience.com

