MalgudiGPT: How I used AI to document the past for the future.

Sre Chakra Yeddula
10 min read · Jan 26, 2024

In this experiment, I use regional-language speech models to help my ageing dad reconnect with his past and preserve it for the future.

Being back home in India for the holidays allowed me to work on this project. It brings together my August and November projects from my larger Project 2023.

My dad is well into his senior citizen years, and something I have always wanted was a way to preserve his account of his upbringing and his stories, something I can share with my children in the years ahead. Although my dad speaks good English, getting my kids to listen to his stories was a little difficult (anyone with kids knows the attention span of 3-year-olds). The details were just not there: my dad's stories in his native tongue, Telugu, were more vivid and visceral, and he just couldn't carry that over into English. Even though I have been trying to keep my kids bilingual, I have not been successful in those efforts, something I want to keep working on.

This problem gave me an opportunity to explore the native-language ML models of India. I used AI and some tech to bring my dad's childhood memories to life for my 3- and 5-year-olds. They were able to enjoy the wonderful stories of my dad's childhood both visually and in audio. To top it all off, I put together a bound book of these stories for my dad, one he will forever cherish. I felt happy bridging an important gap from the past to the future, and I used AI to do it.

Here are the results:

The Book:

Documenting the stories from my Dad’s voice notes

The Video: (Chapter 3)

Chapter 3: The Festive Spirits of Viruvur

Let's dive into the experiment:

This was a bittersweet experiment. It was my final project for 2023, but it was also the one closest to my heart. Over the last year or so I have been pondering ways to capture my recently contemplative dad's thoughts, and I looked into various apps and services like Storyworth and Remento. They all provided good platforms, but they just did not cut it for the non-English speaker, and their prompts were very Americana-focused.

So I set about building my own. These were my requirements:

  • I wanted to build a pipeline from story to visual and auditory elements.
  • This means the stories needed to be visually expressive.
  • I wanted to make recording the stories as easy as possible, something my dad could do with the click of a button.
  • I wanted to create prompts for him that are very region-specific. For example, I wanted to focus on what my dad did for Indian festivals like Diwali, Sankranthi, etc.
  • Once a story was ready, I needed to make it a visual feast for my kids.
  • The intention was to translate it into English with the right context.
  • Build beautiful imagery that complemented the story.
  • Animate the imagery for a more engaging story.
  • Bind a book for my dad's keepsakes.

Trying to get the tech setup right.

The requirements gave me a good starting point in finalizing the tech needed.

  1. For the story capture part, I needed a solution where I proactively prompt my dad to record a story. I decided the best way to do it would be to build a WhatsApp bot that captures the story from him as an audio message (quick sidenote: WhatsApp is very, very popular outside the US; it is the primary mode of communication in India, and its multimedia capabilities are great). I could also send WhatsApp messages reminding him to record for a prompt.

Sample of audio recording.

A sample of the voice notes I got from my dad for the process

2. Now for the translation and retelling of the story in English. I tried various speech-to-text tools like Whisper and Google Translate, with limited success. I finally landed on some really good Indian-language models that did the trick. You can find them here.

Here were my criteria for selecting the models.
— I started with the Whisper models from OpenAI. These can be found here: Speech to text — OpenAI API

Measuring the effectiveness of language models:

There are two key measures involved when evaluating language or speech models: a metric and a benchmark.

WER (Word Error Rate): WER is a metric used to assess speech recognition models by comparing the model’s transcribed text against a reference transcription. It calculates the proportion of errors (insertions, deletions, substitutions) relative to the total number of words in the reference, with a lower WER indicating better model performance.
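To make WER concrete, here is a quick illustration using the open-source jiwer package (the sentences are made up for the example):

# pip install jiwer
from jiwer import wer

# Hypothetical reference transcription vs. model output
reference = "the festival lights filled the whole village square"
hypothesis = "the festival light filled whole village square"

# One substitution (lights -> light) and one deletion (the)
# over 8 reference words: WER = 2 / 8 = 0.25
print(f"WER: {wer(reference, hypothesis):.2f}")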

Google FLEURS: Google FLEURS is a benchmark for evaluating machine learning models on their ability to process and understand spoken language across multiple languages, especially low-resource ones. It focuses on few-shot learning, testing how well models can generalize from limited data across various linguistic contexts.

The problem was that not all languages produce good results. The published results are here: openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision (github.com)

Telugu (the language for my experiment) had a WER of 39.3 on the Google FLEURS benchmark. That error rate was too high for me, so I decided to look for fine-tuned models. I stumbled upon the work done by the Speech Lab at IIT Madras (Speech Lab, IIT Madras — Home). They do great work developing models for Indian languages. I took their fine-tuned Whisper model for Telugu, vasista22/whisper-telugu-medium · Hugging Face, which has a much lower WER of 9.47.

I compared the results of the base Whisper model and this fine-tuned model, and the results came out strongly in favor of the fine-tuned one!
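If you want to reproduce that comparison, here is a minimal sketch (assuming a short local Telugu clip named sample.mp3) that runs the same audio through both the base Whisper model and the IIT Madras fine-tune:

import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
audio_path = "sample.mp3"  # hypothetical short Telugu clip

# Base multilingual Whisper vs. the Telugu fine-tune
for model_id in ["openai/whisper-medium", "vasista22/whisper-telugu-medium"]:
    asr = pipeline(task="automatic-speech-recognition", model=model_id, device=device)
    asr.model.config.forced_decoder_ids = asr.tokenizer.get_decoder_prompt_ids(
        language="te", task="transcribe"
    )
    print(model_id, "->", asr(audio_path)["text"])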

3. Now that the story was converted to English, I could use popular AI models for text-to-speech, text-to-image, and text-to-video to complete the story output. I decided to use OpenAI's new TTS for narration, Midjourney for the images, and RunwayML for the video generation.

The Steps

Let's start with the first step: building a WhatsApp bot. I used the Voiceflow service to do it. It provides a good interface to set up prompts and make communication easier.

Voiceflow is a great chatbot builder and offers WhatsApp integration. This is what I used for my workflow. Alternatively, you can just have the user record with a voice recorder and send the recording.
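The exact plumbing between the bot and my pipeline isn't shown here, but conceptually the bot hands you a URL to the recorded voice note. Here is a minimal sketch of a webhook that receives such a media URL and saves the audio; the endpoint name and payload fields are illustrative assumptions, not Voiceflow's or WhatsApp's actual schema.

# Minimal sketch of a webhook that saves incoming voice notes.
# The /voice-note route and the JSON field names are assumptions
# for illustration, not an actual Voiceflow/WhatsApp schema.
import requests
from flask import Flask, request

app = Flask(__name__)

@app.route("/voice-note", methods=["POST"])
def voice_note():
    payload = request.get_json()
    media_url = payload["media_url"]      # assumed field name
    prompt_id = payload.get("prompt_id")  # assumed field name

    # Download the voice note and save it for the transcription step
    audio = requests.get(media_url, timeout=30)
    filename = f"story_{prompt_id}.mp3"
    with open(filename, "wb") as f:
        f.write(audio.content)
    return {"status": "saved", "file": filename}

if __name__ == "__main__":
    app.run(port=5000)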

For the prompts to record, I created some that are more personal to me. I did take some inspiration from ChatGPT, but its suggestions had too much of an American flavor.
So here are some prompts with good local flavor, on topics that I thought were very much worth passing on generationally.

  • Describe the village/town/city you grew up in. Describe the house and surroundings you grew up in. What was everyday life like? Share one incident or anecdote about village life, the amenities in the village when you were growing up, and your favorite foods.
  • What were school days like? The journey from home to school, the class size, what the education was like, indoor vs. outdoor activities and games, what exams were like, and who your good friends at school were.
  • Describe the festivals of your childhood. Which festivals were the most celebrated, and how? Describe a festival in good detail, with anecdotes and any peculiar rituals.

The prompts below were more for older kids, and for sons and daughters like me.

  • Describe the first time you had to leave home to study or work. Describe any interesting romantic trysts. Describe a time when you faced difficulty and how you got around it.
  • Describe in detail your journey and experience of having your son/daughter: the circumstances of the birth, and any cool anecdotes.

Now for step 2: translating this into a cohesive story in English. Using the identified models, I was able to daisy-chain them with OpenAI to complete the storytelling narrative. Here is the code snippet that combines the Hugging Face Telugu fine-tuned model with OpenAI translation.

import os
import torch
from pydub import AudioSegment
from transformers import pipeline
from openai import OpenAI

openai_api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=openai_api_key)

# Path to the audio file to be transcribed
audio_path = "SettingVillage.mp3"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Initialize the ASR pipeline with the Telugu fine-tuned Whisper model
transcribe = pipeline(task="automatic-speech-recognition", model="vasista22/whisper-telugu-medium", device=device)
transcribe.model.config.forced_decoder_ids = transcribe.tokenizer.get_decoder_prompt_ids(language="te", task="transcribe")

# Load the audio file
audio = AudioSegment.from_mp3(audio_path)

# Define chunk length in milliseconds (30000 ms = 30 seconds)
chunk_length_ms = 30000

# Split the audio file into chunks
chunks = [audio[i:i + chunk_length_ms] for i in range(0, len(audio), chunk_length_ms)]

# Process each chunk and concatenate the results
final_transcript = ""
for i, chunk in enumerate(chunks):
    # Save the chunk as a temporary file
    chunk_name = f"chunk_{i}.wav"
    chunk.export(chunk_name, format="wav")

    # Transcribe the chunk
    result = transcribe(chunk_name)["text"]
    final_transcript += result + " "

    # Delete the temporary chunk file
    os.remove(chunk_name)

# Print the final transcript
print('Transcription: ', final_transcript)

system_prompt = "Take the transcript that is in telugu and translate it into english. Then take the translation and make it into a chapter in the style of malgudi days, the protagonist in this case is xyz and the town is x be very descriptive and visual"

def generate_story_transcript(system_prompt, transcript):
    # Translate and restyle the transcript in a single GPT-4 call
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": transcript}
        ]
    )
    return response.choices[0].message.content

final_text = generate_story_transcript(system_prompt, final_transcript)
print(final_text)

Here is a code version if you are using the Whisper model directly from OpenAI:

from openai import OpenAI

client = OpenAI()

audio_file = open("Schoolrecording.m4a", "rb")

# whisper-1 translates the audio straight into English text
transcript = client.audio.translations.create(
    model="whisper-1",
    file=audio_file
)

print(transcript.text)

I wanted to infuse a writing style into the stories, and since this is my experiment, I went with the writing that is my go-to comfort read. I used OpenAI to convert the transcripts into chapters in the style of Malgudi Days. (Malgudi Days is a very popular collection of Indian short stories written by the mercurial R. K. Narayan; they are the epitome of simple life and nostalgia.)

Step 3 is the one that took the longest. I had to create a pipeline that broke the story down into actionable image prompts, then run those through the Midjourney engine, and finally through RunwayML to complete the story.
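The breakdown step itself is a straightforward GPT-4 call. Here is a minimal sketch; the instruction wording is illustrative, not the exact prompt I used.

from openai import OpenAI

client = OpenAI()

def chapter_to_image_prompts(chapter_text, n_scenes=6):
    # Ask GPT-4 to split a chapter into visual scene prompts,
    # one per line, ready to paste into Midjourney.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    f"Break the following story chapter into {n_scenes} key visual "
                    "scenes. For each scene, write one Midjourney prompt describing "
                    "the setting, characters, lighting and mood of the story's "
                    "village. Return one prompt per line."
                ),
            },
            {"role": "user", "content": chapter_text},
        ],
    )
    return response.choices[0].message.content.splitlines()

# prompts = chapter_to_image_prompts(final_text)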

For Midjourney, I used V6 to generate results. There are great guides to using V6, and this one is my favorite.

I started by collecting images for different periods of the story. See example collages of images generated for each chapter.

Using ChatGPT to create visual prompts for the storyline

Chapter 2 told the story of my dad's life at school.

Images generated for Chapter 2 using the storyline

Human feedback from the user!

I went through each of the options with my dad and had him pick the image that most closely resembled his life. See the selection process below.

Now comes the part about animating these images. Here is a quick rundown of how RunwayML works for animating them.

Once you have your starter image, you can take it into RunwayML and use the Motion Brush option.

Let's take an example. I want to animate the lights in the image below.

Animating the light sources would make this image pop
Upload this image and select Motion Brush.

Here is an in progress and completed animation.

Brushes to highlight only the elements you want to animate

With the above updates, I went ahead and created a working animated story.

For the text-to-speech narration I used OpenAI's TTS models; here is the code to use them.

from openai import OpenAI

client = OpenAI()

speech_file_path = "speech1.mp3"

# Generate narration with the "onyx" voice and save it as an MP3
response = client.audio.speech.create(
    model="tts-1",
    voice="onyx",
    input="long storyline goes here"
)
response.stream_to_file(speech_file_path)
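One practical note: the TTS endpoint caps input at 4,096 characters, so a long chapter has to be split and the audio stitched back together. Here is a minimal sketch using pydub, which is already part of this pipeline:

from openai import OpenAI
from pydub import AudioSegment

client = OpenAI()

def narrate_long_text(text, out_path="narration.mp3", max_chars=4000):
    # Split on sentence boundaries so each chunk stays under the TTS limit
    chunks, current = [], ""
    for sentence in text.split(". "):
        if len(current) + len(sentence) + 2 > max_chars:
            chunks.append(current)
            current = ""
        current += sentence + ". "
    if current:
        chunks.append(current)

    # Synthesize each chunk, then stitch the MP3s together
    combined = AudioSegment.empty()
    for i, chunk in enumerate(chunks):
        part_path = f"part_{i}.mp3"
        response = client.audio.speech.create(model="tts-1", voice="onyx", input=chunk)
        response.stream_to_file(part_path)
        combined += AudioSegment.from_mp3(part_path)
    combined.export(out_path, format="mp3")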

The last thing to do was to proofread the stories, polish any misstated facts, and then put it all together.
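The final assembly can be scripted as well. Here is a rough sketch of stitching the animated clips and the narration together with moviepy; the clip filenames are placeholders for the RunwayML exports.

from moviepy.editor import AudioFileClip, VideoFileClip, concatenate_videoclips

# Placeholder filenames: one animated clip per scene from RunwayML
scene_files = ["scene_1.mp4", "scene_2.mp4", "scene_3.mp4"]
clips = [VideoFileClip(f) for f in scene_files]

video = concatenate_videoclips(clips)
narration = AudioFileClip("narration.mp3")

# Trim whichever track is longer so audio and video end together
duration = min(video.duration, narration.duration)
final = video.subclip(0, duration).set_audio(narration.subclip(0, duration))
final.write_videofile("chapter_3.mp4")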

This experiment let me discover the awesomeness of fine-tuned models and the brilliance of RunwayML. With these tools, the task of documenting the past just became easier.

I was finally able to add a worthy keepsake to my Dad’s collection.

A keepsake for my dad

