Data Wrangling with Python

Michael Stephenson
5 min read · Feb 21, 2023

Data wrangling is the processing of raw data to make it easier to analyze and interpret. Python is well suited to the task because it can swiftly and effectively handle data structures, carry out calculations, and apply algorithms. This post covers the goal of data cleaning, the cleaning process itself, the choice of programming language and libraries, and the overall methodology and findings.

Data wrangling starts with cleaning the data. This entails searching the data for missing values and then either removing them or imputing values in their place. Through this process, the data becomes more accurate and ready for analysis.

Data wrangling prepares raw data for analysis by cleaning, converting, and manipulating it. It can be time-consuming, but it is a necessary stage in data analysis. This post looks at manipulating data using Python and Jupyter Notebooks.

Getting Started

First, we need to import the necessary libraries. Pandas is a powerful data manipulation library for Python, which we'll use to load, transform, and analyze the data. We'll also use the NumPy and Matplotlib libraries for numerical computations and data visualization.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Loading Data

The first step in data wrangling is loading the data into a Pandas data frame. There are different ways to load data into a data frame, such as from a CSV file, an Excel file, a SQL database, or a web API. In this example, we'll load a CSV file using the read_csv() method.

data = pd.read_csv('data.csv')
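As a minimal sketch of what loading and a first inspection might look like, the snippet below reads CSV text from an in-memory buffer, so it runs without a data.csv file on disk. The column names and values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for data.csv
csv_text = """name,age,city
Alice,34,Leeds
Bob,,York
Alice,34,Leeds
"""

# read_csv accepts a file path or any file-like object
data = pd.read_csv(io.StringIO(csv_text))

# Quick checks after loading: size, column types, missing values per column
print(data.shape)
print(data.dtypes)
print(data.isna().sum())
```

Checking the shape and the per-column count of missing values right after loading gives a rough idea of how much cleaning the data will need.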

Cleaning Data

Once we have loaded the data, we must clean it by removing any missing or duplicated values. The dropna() method can remove any rows with missing values.

data = data.dropna()

We can also use the drop_duplicates() method to remove duplicated rows.

data = data.drop_duplicates()
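Dropping rows is not the only option for missing values; they can also be imputed. The sketch below compares the two approaches on a small made-up data frame. The column names and the choice of the median as the fill value are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one duplicated row
data = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Cara"],
    "age": [34.0, np.nan, 34.0, 29.0],
})

# Option 1: drop only the rows where 'age' is missing
dropped = data.dropna(subset=["age"])

# Option 2: keep every row and fill the gap with the column median
imputed = data.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())

# Remove exact duplicate rows, keeping the first occurrence
deduped = imputed.drop_duplicates()

print(dropped)
print(deduped)
```

Which option fits depends on the data: dropping is simpler, while imputing keeps more rows at the cost of introducing estimated values.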

Transforming Data

Next, we can transform the data by applying various operations to it. For example, we can convert a column to a different data type or merge multiple data frames into one.

To change the data type of a column, we can use the astype() method.

data['column_name'] = data['column_name'].astype('int')
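Note that astype('int') raises an error if the column contains missing values or text that cannot be parsed as a number. One possible workaround, sketched below on made-up values, is to convert with pd.to_numeric() first and then use the nullable 'Int64' type.

```python
import pandas as pd

# Hypothetical column stored as text, with one unparseable entry
data = pd.DataFrame({"column_name": ["1", "2", "oops", "4"]})

# errors="coerce" turns unparseable entries into NaN instead of raising
numeric = pd.to_numeric(data["column_name"], errors="coerce")

# 'Int64' (capital I) is the pandas nullable integer type, so gaps are allowed
data["column_name"] = numeric.astype("Int64")

print(data)
print(data.dtypes)
```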

We can use the merge() method to merge two data frames.

merged_data = pd.merge(data1, data2, on='column_name')
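By default, merge() performs an inner join, which keeps only the keys found in both data frames. The sketch below uses two small made-up data frames to show how the how parameter changes which rows survive.

```python
import pandas as pd

# Hypothetical frames sharing the key column 'column_name'
data1 = pd.DataFrame({"column_name": [1, 2, 3], "left_value": ["a", "b", "c"]})
data2 = pd.DataFrame({"column_name": [2, 3, 4], "right_value": ["x", "y", "z"]})

# Inner join (the default): only keys present in both frames are kept
inner = pd.merge(data1, data2, on="column_name")

# Left join: every row of data1 is kept; unmatched rows get NaN on the right
left = pd.merge(data1, data2, on="column_name", how="left")

print(inner)
print(left)
```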

Aggregating Data

We can also aggregate the data by grouping it based on one or more columns and applying a function to each group. For example, we can calculate each group's average column value.

grouped_data = data.groupby('column_name')['column_to_aggregate'].mean()
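Several statistics can be computed per group in one pass with agg(). The following sketch uses made-up values; the column names simply mirror the placeholders used above.

```python
import pandas as pd

# Hypothetical data: two groups, 'a' and 'b'
data = pd.DataFrame({
    "column_name": ["a", "a", "b", "b", "b"],
    "column_to_aggregate": [10, 20, 1, 2, 3],
})

# One row per group, one column per statistic
summary = data.groupby("column_name")["column_to_aggregate"].agg(["mean", "count", "max"])

print(summary)
```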

Visualizing Data

Finally, we can visualize the data using the matplotlib library. We can create various types of plots, such as line plots, bar plots, and scatter plots.

plt.plot(data['column_name'])
plt.show()
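A slightly fuller sketch adds axis labels and a title and saves the figure to a file. The data values and the output file name are made up, and the backend line is only needed when running as a script on a machine without a display; in a Jupyter Notebook it can be omitted.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values to plot
data = pd.DataFrame({"column_name": [3, 7, 4, 9, 6]})

fig, ax = plt.subplots()
ax.plot(data.index, data["column_name"], marker="o")
ax.set_xlabel("Row index")
ax.set_ylabel("column_name")
ax.set_title("Line plot of column_name")

# Hypothetical output file name
fig.savefig("column_name_plot.png")
```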

Data wrangling is an essential step in the data analysis process. In this blog post, we have explored some basic data-wrangling techniques using Python and Jupyter Notebooks. By following these techniques, we can clean, transform, and analyze the data to gain insights and make informed decisions.

Python for manipulating data:

Renaming Columns

To rename columns in a Pandas data frame, we can use the rename() method. For example, if we want to rename the column "old_column_name" to "new_column_name", we can use the following code (here and below, df stands for any data frame):

df = df.rename(columns={'old_column_name': 'new_column_name'})

Replacing Values

To replace values in a Pandas data frame, we can use the replace() method. For example, if we want to replace all occurrences of "old_value" with "new_value" in a column called "column_name," we can use the following code:

df['column_name'] = df['column_name'].replace('old_value', 'new_value')
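replace() also accepts a dictionary, which maps several old values to their replacements in one call. The sketch below uses made-up values. Note that replace() on a column matches whole cell values, not substrings.

```python
import pandas as pd

# Hypothetical column with two values to clean up
df = pd.DataFrame({"column_name": ["old_value", "other", "old_value", "N/A"]})

# Keys are the values to find; dictionary values are their replacements
df["column_name"] = df["column_name"].replace(
    {"old_value": "new_value", "N/A": "unknown"}
)

print(df)
```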

Filtering Data

To filter data in a Pandas data frame based on certain conditions, we can use the following code:

filtered_data = df[df['column_name'] > 50]

This code will create a new data frame called "filtered_data" that contains only the rows where the value in the "column_name" column is greater than 50.
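Filters can combine several conditions. The sketch below, on a made-up data frame with a hypothetical "category" column, shows three common forms: boolean operators, isin(), and query().

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "column_name": [10, 55, 70, 90],
    "category": ["x", "y", "x", "y"],
})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
both = df[(df["column_name"] > 50) & (df["category"] == "x")]

# isin() keeps rows whose value appears in the given list
selected = df[df["category"].isin(["y"])]

# query() expresses the same filter as 'both' using a string
queried = df.query("column_name > 50 and category == 'x'")

print(both)
print(selected)
```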

Aggregating Data

To aggregate data in a Pandas data frame, we can use the groupby() method. For example, if we want to calculate the mean of a column called "column_name" for each value in a column called "group_column", we can use the following code:

grouped_data = df.groupby('group_column')['column_name'].mean()

This code produces a Pandas Series called "grouped_data" (not a data frame), indexed by the values in "group_column" and holding the mean of the "column_name" column for each group.

Pivoting Data

To pivot data in a Pandas data frame, we can use the pivot_table() method. For example, if we have a data frame with columns "column1", "column2", and "value," and we want to create a new data frame where the values in "column1" are the rows, the values in "column2" are the columns, and the values in "value" are the cell values, we can use the following code:

pivoted_data = df.pivot_table(index='column1', columns='column2', values='value')

The resulting data frame, "pivoted_data", has one row for each distinct value in "column1" and one column for each distinct value in "column2".
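When several rows share the same "column1"/"column2" pair, pivot_table() aggregates them, using the mean by default. The sketch below uses made-up values to show the default behavior alongside the aggfunc and fill_value parameters.

```python
import pandas as pd

# Hypothetical data: the pair (r1, c1) appears twice, (r2, c2) never
df = pd.DataFrame({
    "column1": ["r1", "r1", "r1", "r2"],
    "column2": ["c1", "c1", "c2", "c1"],
    "value": [1, 3, 5, 7],
})

# Default aggregation is the mean: (1 + 3) / 2 = 2.0 for (r1, c1)
pivoted_data = df.pivot_table(index="column1", columns="column2", values="value")

# aggfunc changes the aggregation; fill_value replaces empty cells
pivoted_sum = df.pivot_table(
    index="column1", columns="column2", values="value",
    aggfunc="sum", fill_value=0,
)

print(pivoted_data)
print(pivoted_sum)
```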


import numpy as np
import pandas as pd

def clean_csv(csv_file):
    # Load the CSV file into a Pandas dataframe
    df = pd.read_csv(csv_file)
    # Drop any rows with missing values
    df = df.dropna()
    # Drop any duplicated rows
    df = df.drop_duplicates()
    # Convert any string columns with dates to datetime format
    date_columns = ['date_column1', 'date_column2']
    df[date_columns] = df[date_columns].apply(pd.to_datetime)
    # Replace any string values that indicate missing data with NaN
    missing_data_values = ['N/A', 'NA', 'n/a', 'na', 'NULL', 'null', '', ' ', '-']
    df = df.replace(missing_data_values, np.nan)
    # Convert any string columns with numerical data to numeric format
    numeric_columns = ['numeric_column1', 'numeric_column2']
    df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric)
    # Reset the index of the dataframe
    df = df.reset_index(drop=True)
    # Return the cleaned dataframe
    return df

🎁 Let’s make it easy, shall we?

The function above takes the path to a CSV file as input and performs the following cleaning operations, in order:

Loads the CSV file into a Pandas data frame

Drops any rows with missing values

Drops any duplicated rows

Converts any string columns with dates to datetime format

Replaces any string values that indicate missing data with NaN

Converts any string columns with numerical data to numeric format

Resets the index of the data frame

Returns the cleaned data frame

You can call this function by passing the path to the CSV file as an argument:

cleaned_data = clean_csv('path/to/csv_file.csv')
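One thing to watch in clean_csv is the order of the steps: placeholder strings such as "-" are converted to NaN only after dropna() has already run, so those rows stay in the result. The sketch below is a hypothetical variant, not the function above, that replaces placeholders first and takes the column lists as parameters. The function name, its parameters, and the sample data are all assumptions for illustration.

```python
import io
import numpy as np
import pandas as pd

def clean_csv_reordered(csv_file, date_columns, numeric_columns):
    """Hypothetical variant of clean_csv: normalize placeholders before dropping rows."""
    df = pd.read_csv(csv_file)
    # Turn placeholder strings into NaN first, so dropna() can see them
    placeholders = ["N/A", "NA", "n/a", "na", "NULL", "null", "", " ", "-"]
    df = df.replace(placeholders, np.nan)
    # errors="coerce" turns unparseable entries into NaT/NaN instead of raising
    for col in date_columns:
        df[col] = pd.to_datetime(df[col], errors="coerce")
    for col in numeric_columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Drop incomplete rows, then duplicates, then renumber the index
    return df.dropna().drop_duplicates().reset_index(drop=True)

# Made-up CSV content with one placeholder and one duplicated row
csv_text = """date_column1,numeric_column1
2023-01-05,10
2023-01-06,-
2023-01-05,10
2023-01-07,12
"""

cleaned = clean_csv_reordered(
    io.StringIO(csv_text), ["date_column1"], ["numeric_column1"]
)
print(cleaned)
```

Passing the column lists as arguments avoids hard-coding names such as 'date_column1' inside the function, which makes it easier to reuse on other files.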

Michael Stephenson

Applying Computer Vision Technologies to MLOps pipelines is my area of interest. I also have an Academic background in Data Analytics.