Dogs vs Cats Audio Classification

Using PyTorch Deep Learning Framework and CNN Architecture

May 27, 2023

Motivation

Build a proof of concept for audio classification using a deep neural network built with the PyTorch framework.

Audio Classification is a machine learning task that involves identifying and tagging audio signals into different classes or categories. The goal of audio classification is to enable machines to automatically recognize and distinguish between different types of audio, such as music, speech, and environmental sounds. — Papers With Code

Scenario

Given a sound clip of a cat or a dog, determine whether the sound was made by a dog or a cat. Data Source here. Because every clip comes with a label, this is a supervised learning problem.

The finished notebook reaches 90% validation accuracy, which is encouraging given the relatively small dataset.

Strategy

While researching possible approaches, I came across the research article ‘Animal Sound Classification Using A Convolutional Neural Network’, which intrigued me and led me to study its approach in more depth.

My implementation will:

Convert WAV sound files to spectrogram image files using the librosa library
- librosa loads the waveform and computes its short-time Fourier transform (STFT); the resulting amplitudes are then converted to decibels. The time-frequency image that comes out captures the features of cat and dog sounds for the neural network.
Use a Convolutional Neural Network to classify the spectrograms as either cat or dog.
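
To make the decibel step concrete: for an amplitude spectrogram, the conversion is essentially 20 * log10(amplitude / reference). The sketch below is a simplified, pure-Python stand-in for that idea, not librosa's implementation. The function name, the reference of 1.0, and the floor value are assumptions made for illustration.

```python
import math


def amplitude_to_db_sketch(amplitudes, ref=1.0, floor=1e-5):
    """Convert amplitudes to decibels: 20 * log10(amplitude / ref).

    Hypothetical helper for illustration only. Values below `floor` are
    clamped so that log10 never receives zero.
    """
    return [20.0 * math.log10(max(a, floor) / ref) for a in amplitudes]


# A tenfold change in amplitude corresponds to a 20 dB change.
levels = amplitude_to_db_sketch([0.1, 1.0, 10.0])
```

The logarithmic scale compresses the huge range between quiet and loud frequency bins, which tends to make the structure of a bark or a meow visible in the image.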

Intuition of Convolutional Neural Networks by StatQuest with Josh Starmer

Notebook

Link to source code

Data Prep

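import os
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def create_spectogram(audio_file_name, source_path, save_path):
    """Render one WAV file as a spectrogram image and save it as a JPG."""
    # Load the waveform (librosa resamples to 22050 Hz by default)
    x, sr = librosa.load(os.path.join(source_path, audio_file_name))
    # Short-time Fourier transform, then amplitude -> decibels
    X = librosa.stft(x)
    Xdb = librosa.amplitude_to_db(np.abs(X))
    plt.figure(figsize=(14, 5))
    librosa.display.specshow(Xdb, sr=sr, y_axis='hz')
    plt.ylabel('')
    plt.axis('off')
    file_name = audio_file_name.replace('.wav', '')
    plt.savefig(os.path.join(save_path, file_name + '.jpg'), bbox_inches='tight', pad_inches=0)
    plt.close()  # Comment out if you want to see the image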

original code credit to Alessandro Bombini

The function create_spectogram reads an audio file from the source directory and saves its spectrogram image to the target directory.
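
To convert a whole dataset, the function can be called once per WAV file. The sketch below only plans that work using the standard library: it lists the WAV files in a source folder along with the JPG path each one would produce. The helper name and the folder names are assumptions, and the actual call to create_spectogram is left as a comment.

```python
from pathlib import Path


def plan_conversions(source_dir, save_dir):
    """Return (wav_name, jpg_path) pairs for every .wav file in source_dir.

    Hypothetical helper: it does not touch librosa, it only mirrors the
    naming rule used above ('name.wav' -> 'name.jpg').
    """
    pairs = []
    for wav in sorted(Path(source_dir).glob('*.wav')):
        pairs.append((wav.name, str(Path(save_dir) / (wav.stem + '.jpg'))))
    return pairs


# Usage, assuming create_spectogram is defined and both folders exist:
# for wav_name, _ in plan_conversions('audio/', 'spectrograms/'):
#     create_spectogram(wav_name, 'audio/', 'spectrograms/')
```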

Build Dataset and Data loader

Data loaders help modularize our notebook by separating the data preparation step and the model training step.

By storing only each image's path in the image_location column, I can keep the images on disk instead of loading them all into memory. During training, images are read from disk and streamed into the neural network batch by batch.

Loading all the images into memory up front and then feeding them to the neural network would increase the chance of a failure, such as running out of memory, during training.

from PIL import Image
from torch.utils.data import Dataset


class CatDogDataset(Dataset):
    """User-defined class to build a dataset using the PyTorch Dataset class."""

    def __init__(self, data, transform=None):
        """Store the labels and image paths; the images themselves stay on disk."""
        self.img_labels = data['target']
        self.img_loc = data['image_location']
        self.transform = transform

    def __getitem__(self, idx):
        """Load one image from disk and return it with its numeric label."""
        img_path = self.img_loc.iloc[idx]

        # Encode the label: cat -> 0, anything else (dog) -> 1
        label = 0 if self.img_labels.iloc[idx] == 'cat' else 1

        # Force 3 channels to match the CNN's in_channels = 3
        image = Image.open(img_path).convert('RGB')

        if self.transform is not None:
            image = self.transform(image)

        return image, label

    def __len__(self):
        return len(self.img_labels)

The Dataset class stores the image location and label for each cat/dog sound file. When an item is requested via __getitem__, the object looks up the image path, opens the image, applies the transform, and returns the image data and label. These two objects will be fed into our neural network.
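
CatDogDataset expects a table with an image_location column and a target column. One way such a table might be assembled is sketched below, under the assumption (not confirmed by the notebook) that each spectrogram's file name starts with cat or dog. The helper name is hypothetical; the resulting rows could be handed to pandas.DataFrame to get the .iloc indexing the class relies on.

```python
import os


def build_index(image_dir, file_names):
    """Build one row per spectrogram with its path and its text label.

    Assumes (hypothetically) that file names begin with 'cat' or 'dog';
    files matching neither prefix are skipped.
    """
    rows = []
    for name in sorted(file_names):
        lowered = name.lower()
        if lowered.startswith('cat'):
            target = 'cat'
        elif lowered.startswith('dog'):
            target = 'dog'
        else:
            continue
        rows.append({'image_location': os.path.join(image_dir, name), 'target': target})
    return rows


rows = build_index('spectrograms', ['dog_barking_3.jpg', 'cat_12.jpg', 'readme.txt'])
```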

transform = transforms.Compose([
    transforms.PILToTensor(),
    transforms.ConvertImageDtype(torch.float),
    transforms.Resize(size = (256,256)),
])

Transformations are run each time an image is fetched. These steps convert the image to a Tensor, the data type the neural network operates on, rescale its pixel values to floating point, and resize it to 256×256. The neural network is then trained on these tensor representations of the spectrograms to learn which images correspond to a cat sound and which to a dog sound.

train_set = CatDogDataset(train_csv, transform)

batch_size = 15
train_loader = DataLoader(train_set, batch_size=batch_size)

CNN Architecture


class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.net=nn.Sequential(
                      
            # C1
            nn.Conv2d(in_channels = 3, out_channels = 12, kernel_size = 3, stride = 1, padding = 1),            
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 2),
            
            # C2
            nn.Conv2d(in_channels = 12, out_channels = 24, kernel_size = 3, stride = 1, padding = 1), 
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 2),
            
            #C3
            nn.Conv2d(in_channels = 24, out_channels = 12, kernel_size = 3, stride = 1, padding = 1), 
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = 2),
            
            
            # Dropout
            nn.Dropout(.2),
            
            nn.Flatten(), # 256/2/2/2 = 32
            
            # FC1
            nn.Linear(in_features = 32 * 32 * 12, out_features = 128), 
            nn.ReLU(),
            
            # FC2
            nn.Linear(in_features = 128, out_features = 64),  
            nn.ReLU(),
            
            #output layer
            nn.Linear(in_features = 64, out_features = 1),
            nn.Sigmoid()
                       )

    def forward(self, X):
        
        X = self.net(X)
                
        return X

This CNN architecture was adapted from this example. It includes 3 convolutional layers, a dropout layer for regularization, a flattening layer that feeds into the 2 fully connected layers, and an output layer for the resulting prediction. Since the scenario has only 2 possible classes (cat or dog), I decided to use a single output neuron with the sigmoid function, which returns a probability.
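
The in_features = 32 * 32 * 12 of the first fully connected layer follows from the layer arithmetic. A small pure-Python check is sketched below, using the standard output-size formula floor((n + 2p - k) / s) + 1 for convolution and pooling layers. The helper names are made up for this sketch.

```python
def conv_out(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


def flattened_features(size=256, out_channels=(12, 24, 12)):
    """Trace a square input through three conv(3x3, pad 1) + maxpool(2) stages."""
    for _ in out_channels:
        size = conv_out(size, kernel=3, stride=1, padding=1)  # conv keeps the size
        size = conv_out(size, kernel=2, stride=2)             # pool halves it
    return size, size * size * out_channels[-1]


side, features = flattened_features()
print(side, features)  # 32 12288
```

If the Resize target or the number of pooling layers changes, the in_features of the first Linear layer has to change with it.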

During prediction, the output is rounded: any value greater than .5 is predicted to be a dog sound, and anything else is assigned to the cat label.

Train Neural Network

Hyperparameters and optimization:

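model = CNN()
loss_fn = nn.BCELoss()  # binary cross entropy; expects probabilities, hence the final Sigmoid
optimizer = optim.Adam(model.parameters(), lr=0.0001)
epochs = 50
current_best = 0  # best validation accuracy seen so far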
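
With its default mean reduction, nn.BCELoss averages the quantity -(y * log(p) + (1 - y) * log(1 - p)) over the batch, where p is the sigmoid output and y is the 0/1 label. A hand-rolled version applied to a few hypothetical predictions shows how confident mistakes are punished. This is an illustration, not PyTorch's implementation.

```python
import math


def binary_cross_entropy(probs, labels):
    """Mean binary cross entropy for probabilities in (0, 1) and 0/1 labels."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, labels)
    ]
    return sum(losses) / len(losses)


# Labels follow the article's encoding: 0 = cat, 1 = dog.
unsure = binary_cross_entropy([0.5, 0.5], [0, 1])      # equals ln(2), about 0.693
confident = binary_cross_entropy([0.1, 0.9], [0, 1])   # small loss, about 0.105
wrong = binary_cross_entropy([0.9, 0.1], [0, 1])       # large loss, about 2.303
```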

Training Script:

import copy
import time

import torch

# Histories collected at each evaluation, used later to plot the curves
epoch_nums, training_loss, validation_loss, validation_acc = [], [], [], []

start = time.time()

for epoch in range(epochs):
    train_loss = 0
    model.train()
    for x, y in train_loader:

        # Clear the gradients left over from the previous batch
        optimizer.zero_grad()

        # Push the data forward through the model layers
        output = model(x)

        # Get the loss
        loss = loss_fn(output, y.reshape(-1, 1).to(torch.float32))

        # Keep a running total
        train_loss += loss.item()

        # Backpropagate, then update the weights
        loss.backward()
        optimizer.step()

    if epoch % 5 == 0:
        metric, test_loss = eval_model(model, test_loader)
        if metric > current_best:
            # Snapshot the weights; plain assignment would only alias the live model
            best_model = copy.deepcopy(model)
            current_best = metric
            print(f'best accuracy so far is {current_best}')
        epoch_nums.append(epoch)
        training_loss.append(train_loss)
        validation_loss.append(test_loss)
        validation_acc.append(metric)

The training script feeds the training spectrogram images to the network one batch at a time. For each batch, the neural network performs a forward pass followed by backpropagation.

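PyTorch records the operations of each forward pass in a dynamic computational graph (DCG). Calling loss.backward() walks this graph to compute the gradients, and optimizer.step() then uses them to update the weight and bias parameters of the neural network. Typically, the larger the neural network and the larger the dataset, the longer training takes to run; a larger batch makes each individual step slower but reduces the number of steps per epoch.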
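
The forward / backward / update cycle can be illustrated without PyTorch on a one-weight logistic model. The sketch below uses plain gradient descent rather than Adam, hand-derived gradients rather than autograd, and made-up data, so it mirrors only the shape of the training loop.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(w, b, xs, ys, lr):
    """One gradient-descent step on mean binary cross entropy; returns (w, b, loss)."""
    n = len(xs)
    preds = [sigmoid(w * x + b) for x in xs]                          # forward pass
    loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(preds, ys)) / n
    grad_w = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / n   # "backward"
    grad_b = sum(p - y for p, y in zip(preds, ys)) / n
    return w - lr * grad_w, b - lr * grad_b, loss                     # "optimizer step"


# Hypothetical 1-D features: negative values labelled 0 (cat), positive 1 (dog).
xs, ys = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]
w, b, history = 0.0, 0.0, []
for _ in range(20):
    w, b, loss = train_step(w, b, xs, ys, lr=0.5)
    history.append(loss)
```

The recorded losses shrink from one step to the next, which is the same behavior the training loss curve of the CNN is expected to show at a much larger scale.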

import torch
from sklearn.metrics import classification_report

# function for evaluating a model's performance
def eval_model(model, data_loader):
    model.eval()  # put dropout into inference mode
    y_true_list = []
    y_pred_list = []
    test_loss = 0
    with torch.no_grad():  # gradients are not needed for evaluation
        for x, y in data_loader:
            outputs = model(x)

            # an output > .5 will round up to 1 which corresponds to dog.
            y_pred = torch.round(outputs)
            # flatten the (batch, 1) predictions so they line up with the 1-D labels
            y_pred_list.extend(y_pred.flatten().int().tolist())
            y_true_list.extend(y.tolist())

            # Get the loss
            loss = loss_fn(outputs, y.reshape(-1, 1).to(torch.float32))

            # Keep a running total
            test_loss += loss.item()

    acc = classification_report(y_true_list, y_pred_list, output_dict=True)['accuracy']
    return acc, test_loss

The evaluation function eval_model is called on every fifth epoch (epochs 0, 5, ..., 45) using the held-out validation dataset. Calling model.eval() switches layers such as dropout to inference behavior, so the predictions are deterministic; no backpropagation or weight update takes place inside this function.
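
The accuracy that eval_model reads from classification_report is the fraction of predictions that match the labels. A plain-Python sketch with hypothetical sigmoid outputs follows; the helper name and the sample values are invented for illustration.

```python
def accuracy_from_probs(probs, labels, threshold=0.5):
    """Threshold probabilities (above threshold -> 1, dog) and score them against labels."""
    preds = [1 if p > threshold else 0 for p in probs]
    correct = sum(1 for pred, y in zip(preds, labels) if pred == y)
    return correct / len(labels)


probs = [0.92, 0.35, 0.61, 0.08, 0.47]   # hypothetical model outputs
labels = [1, 0, 0, 0, 1]                 # 0 = cat, 1 = dog
acc = accuracy_from_probs(probs, labels)
print(acc)  # 0.6
```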

Evaluation

Overall, these are good results: training and validation loss decrease steadily across epochs, which shows that the neural network is able to improve. Because validation loss is not significantly higher than training loss, the curves suggest that the model is NOT overfitting.
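
One caveat when reading the curves: the scripts above accumulate loss as a sum over batches, and the training and validation loaders usually hold different numbers of batches. Dividing each total by its batch count puts both curves on the same scale. The sketch below uses invented numbers, and both the helper names and the 1.2 tolerance are arbitrary assumptions, not a standard rule.

```python
def mean_batch_loss(total_loss, num_batches):
    """Average per-batch loss from a running total."""
    return total_loss / num_batches


def looks_overfit(train_total, train_batches, val_total, val_batches, tolerance=1.2):
    """Flag a run whose mean validation loss exceeds mean training loss by `tolerance`x."""
    train_mean = mean_batch_loss(train_total, train_batches)
    val_mean = mean_batch_loss(val_total, val_batches)
    return val_mean > tolerance * train_mean


# Hypothetical totals: 15 training batches vs 4 validation batches.
print(looks_overfit(4.5, 15, 1.3, 4))   # False: 0.325 vs 0.30 is within tolerance
print(looks_overfit(1.5, 15, 2.0, 4))   # True: 0.50 vs 0.10
```

In the notebook, the batch counts would be available as len(train_loader) and len(test_loader).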

Resource in interpreting loss curves

The Neural Network was able to learn and improve to 90% accuracy by the 50th epoch!

Running Inferences

On a handful of cat and dog sounds sourced from internet sites such as soundcamp and wavlist, accuracy is 80%. Although the sample is very small, this is an encouraging sign for the generalizability of the neural network.

Looking Ahead:

I am curious about other approaches to this use case, such as an RNN-type neural network, which is normally used for sequential data.
Hyperparameter tuning
- To improve the current model, I would run hyperparameter tuning jobs on AWS/Azure, since they offer parallel runs and early stopping functionality.
Load the model to the cloud (AWS/Azure)
Rearchitect the CNN using examples from research papers.