Query an LLM, like SQL, for dataset creation in one prompt query!

Practicing DatScy
5 min read · Jul 28, 2023

LLMs are prevalent and easy to use, and they can return factual data, so why not use them to create datasets for fact checking (that is, for statistical and factual analysis of existing phenomena)?

In this post, I use LangChain to create features for a simple dataset on disasters. LangChain is a Python library that lets one compose complex prompts for LLMs so that a reliable, organized output is returned. LangChain is similar to SQL in spirit: LangChain concisely prompts the model the way SQL concisely queries a database.
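As a rough illustration of the analogy, a LangChain prompt template is filled in with parameters much like a parameterized SQL query. A minimal sketch (the template text here is made up for illustration):

from langchain.prompts import PromptTemplate

# A prompt template behaves like a parameterized SQL query:
# the template text is fixed, and the parameters are filled in at call time.
template = PromptTemplate.from_template("List {n} disasters that occurred in {year}.")
print(template.format(n=3, year=2020))
# -> List 3 disasters that occurred in 2020.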

import os
from os import environ
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', None) # do not truncate column contents

from dotenv import load_dotenv, find_dotenv
dot_env_file_exist = load_dotenv(find_dotenv()) # read local .env file

# Returns true or false if .env exists in current directory
print('dot_env_file_exist: ', dot_env_file_exist)

Insert Credentials: environment variables

If you have never used shell environment variables in Python, below are instructions using the python-dotenv PyPI library.

  1. Install python-dotenv from PyPI: pip install python-dotenv
  2. The .env file that holds your credentials does not exist by default in the working directory of your jupyter-notebook or ipython session; you have to create it. The .env file is simply a list of variable names assigned to keys, as shown below.
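For reference, a .env file is just KEY=VALUE lines, for example:

OPENAI_API_KEY=XXX
HUGGINGFACEHUB_API_TOKEN=XXX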

Follow Way 0 or Way 1 below to create your .env file in your working directory.

Way 0: set an environment variable using the dotenv CLI

  1. cd /YOUR_CURRENT_WORKING_DIRECTORY
  2. Install python-dotenv[cli] from PyPI: pip install "python-dotenv[cli]". The .env file will be created automatically in your current working directory the first time you run dotenv set.
  3. Run the following commands to add or list contents of the .env file.
dotenv set OPENAI_API_KEY XXX   # Save a key on a line
dotenv list # view contents of .env
dotenv run -- python foo.py # to include .env credentials while running a .py file

Way 1: set an environment variable using basic bash commands

  1. cd /YOUR_CURRENT_WORKING_DIRECTORY
  2. Run basic bash commands to create the .env file and view its contents:
echo "OPENAI_API_KEY=XXX" >> .env  # Create .env file with the key saved on a line
cat .env # view contents of .env

Select a Model

After the credentials/keys for the models are inserted into the .env file, select a model.

# Select the LLM OR chat model

temperature = 0

which_model = 'openai_chat'

if which_model == 'openai_llm':
    import openai
    from langchain.llms import OpenAI
    # Get credential from .env file
    openai.api_key = os.getenv("OPENAI_API_KEY")
    # llm = OpenAI()
    # OR
    # llm = OpenAI(openai_api_key=openai.api_key, openai_organization="YOUR_ORGANIZATION_ID")
    # OR
    llm = OpenAI(openai_api_key=openai.api_key, temperature=temperature)  # Reliable output

elif which_model == 'openai_chat':
    import openai
    from langchain.chat_models import ChatOpenAI
    # Get credential from .env file
    openai.api_key = os.getenv("OPENAI_API_KEY")
    llm_chat = ChatOpenAI(temperature=temperature)

elif which_model == 'databricks_llm':
    # sudo python3.11 -m pip install openllm
    from langchain.llms import OpenLLM
    llm = OpenLLM(model_name="dolly-v2", model_id="databricks/dolly-v2-3b",
                  temperature=temperature, repetition_penalty=1.2)

elif which_model == 'vertexai_llm':
    from langchain.llms import VertexAI
    PROJECT_ID = os.getenv("PROJECT_ID")
    LOCATION = os.getenv("LOCATION")
    # Google Credentials:
    # https://cloud.google.com/docs/authentication/provide-credentials-adc#local-dev
    # In the command line run: gcloud auth application-default login
    # It creates the application_default_credentials.json file
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/YOUR_HOME_PATH/.config/gcloud/application_default_credentials.json"
    llm = VertexAI(location=LOCATION, max_output_tokens=128, model_name='text-bison', project=PROJECT_ID)

elif which_model == 'huggingface_llm':
    from langchain import HuggingFaceHub
    # Get credential from .env file and assign it to the Python environment variables
    huggingfacehub_api_token = os.getenv("HUGGINGFACEHUB_API_TOKEN")
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = huggingfacehub_api_token

    # See https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads for other options
    repo_id = "google/flan-t5-xxl"

    # Temperature can not be zero for HuggingFaceHub models
    if temperature == 0:
        temperature = 0.1
    llm = HuggingFaceHub(repo_id=repo_id, model_kwargs={"temperature": temperature, "max_length": 500})
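Before building the full dataset prompt, it helps to confirm that the credentials and the selected model actually work. A minimal sanity check (assuming which_model = 'openai_chat', so llm_chat exists; the prompt text is arbitrary):

# Quick sanity check that the chat model responds
print(llm_chat.predict("Reply with one word: ready"))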

Prompt/Query the LLM for a factual dataset

In this example, I was interested in information related to disasters, for feature engineering and ML/DL analysis.

# Feature engineering : dataset creation

### Example: More complex Q&A - multi-prompt requesting multiple outputs per prompt

from collections import defaultdict
from langchain.prompts import ChatPromptTemplate

# One-year windows: each prompt covers January 1 of from_year to January 1 of to_year
from_year = np.arange(2013, 2023)
to_year = np.arange(2014, 2024)

# Declare the langchain chat object (one time only)
temperature = 0
chat = ChatOpenAI(temperature=temperature)

df_tot = pd.DataFrame()

for ind, fy in enumerate(from_year):

    # [0] Insert information into text_template
    # Way 0 : could webscrape a news site for factual news
    # input_text = """List 100 disasters that occurred in the last 10 years"""
    input_text = f"""List 100 disasters that occurred from January 1, {fy} to January 1, {to_year[ind]}"""

    # OR

    # Way 1 : ask ChatGPT to use its own knowledge base

    # These feature names and descriptions could also be read from a pandas table
    feature0 = "date_of_disaster"
    description0 = "Date that the disaster occurred"

    feature1 = "name_of_disaster"
    description1 = "Name of the disaster that occurred"

    feature2 = "type_of_disaster"
    description2 = "Type of disaster that occurred"

    feature3 = "duration_of_disaster"
    description3 = "Duration of the disaster"

    feature4 = "cost_of_disaster"
    description4 = "Cost to repair the environment after the disaster"

    feature5 = "physical_health"
    description5 = "Prevalent physical health condition caused by the disaster"

    feature6 = "mental_health"
    description6 = "Prevalent mental health condition caused by the disaster"

    feature7 = "location_of_disaster"
    description7 = "Location of the disaster"

    feature8 = "area_of_disaster"
    description8 = "Area of physical damage in square meters caused by the disaster"

    text_template = f"""\
For the following topic in text {input_text}, extract the following information:
{feature0}: {description0}
{feature1}: {description1}
{feature2}: {description2}
{feature3}: {description3}
{feature4}: {description4}
{feature5}: {description5}
{feature6}: {description6}
{feature7}: {description7}
{feature8}: {description8}
Format the output as JSON with the following keys:
{feature0}
{feature1}
{feature2}
{feature3}
{feature4}
{feature5}
{feature6}
{feature7}
{feature8}
"""

    # [1] Declare prompt template
    prompt_template = ChatPromptTemplate.from_template(text_template)

    # [2] Declare the prompt (the template is already fully formatted, so no variables are passed)
    prompt = prompt_template.format_messages()

    # [3] Call OpenAI to process the complex langchain prompt
    response = chat(prompt)
    out = response.content  # string output
    # print('out: ', out)

    # [4] Split the response into lines, dropping the opening and closing JSON braces
    resp = out.split("\n")[1:-1]
    key = [feature0, feature1, feature2, feature3, feature4,
           feature5, feature6, feature7, feature8]

    # [5] Create a DataFrame: multi-row response per key
    d = defaultdict(list)
    for txtline in resp:
        for k in key:
            if txtline.find(k) != -1:
                # Clean text: remove quotes, the key prefix, and commas
                for r in ['"', "'", f"{k}: ", ',']:
                    txtline = txtline.replace(r, '')
                d[k].append(txtline.strip())

    df = pd.DataFrame.from_dict(d)
    df_tot = pd.concat([df_tot, df], axis=0)


# Save data to a .csv file
df_tot.reset_index(drop=True, inplace=True)
df_tot.to_csv('../data/observation0_disasterdata.csv', index=False)
After a few minutes we have a factual pandas dataset.
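Because the prompt explicitly asks for JSON output, a more robust alternative to the line-splitting above is to parse the response with Python's standard json module. A sketch, assuming the model returned well-formed JSON (it may not, so the string-cleaning fallback above is still useful):

import json

try:
    # The model may return one JSON object or an array of objects
    records = json.loads(out)
    if isinstance(records, dict):
        records = [records]
    df = pd.DataFrame(records)
except json.JSONDecodeError:
    pass  # fall back to the line-by-line string cleaning shown above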

Upload the Dataset to GCP for SQL analysis

bq load \
--location=$location \
--source_format=CSV \
--skip_leading_rows=1 \
--autodetect \
$dataset_name.$TABLE_name \
./$CSV_NAME
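The bq command assumes the shell variables are already defined. For example (the dataset and table names here are hypothetical; the CSV name matches the file written earlier):

location=US
dataset_name=disaster_dataset
TABLE_name=observation0
CSV_NAME=observation0_disasterdata.csv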

I was very excited to see that this method can be used to create any type of factual dataset imaginable! 🎉

References

  1. DeepLearning.AI, LangChain for LLM Application Development: https://www.deeplearning.ai/short-courses/. This class inspired the text template organization.
  2. LangChain: https://python.langchain.com/docs/get_started/introduction.html

Happy Practicing! 👋

