Query an LLM, like SQL, for dataset creation in one prompt query!

Practicing DatScy
5 min read · Jul 28, 2023

LLMs are prevalent and easy to use, and they can return factual data, so why not use them to create datasets for fact checking (that is, for statistical and factual analysis of existing phenomena)?

In this post, I use LangChain to create features for a simple dataset on disasters. LangChain is a Python library that lets one compose complex prompts for LLMs so that a reliable, organized output is returned. LangChain is similar to SQL in spirit: LangChain concisely prompts the model the way SQL concisely queries a database.
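As a rough illustration of the analogy, a LangChain prompt template is filled in with parameters much like a parameterized SQL query. A minimal sketch (the template text here is made up for illustration):

from langchain.prompts import PromptTemplate

# A prompt template behaves like a parameterized SQL query:
# the template text is fixed, and the parameters are filled in at call time.
template = PromptTemplate.from_template("List {n} disasters that occurred in {year}.")
print(template.format(n=3, year=2020))
# -> List 3 disasters that occurred in 2020.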

import os
from os import environ
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', None) # do not truncate column contents

from dotenv import load_dotenv, find_dotenv
dot_env_file_exist = load_dotenv(find_dotenv()) # read local .env file

# Returns true or false if .env exists in current directory
print('dot_env_file_exist: ', dot_env_file_exist)

Insert Credentials: environment variables

If you have never used shell environment variables in Python, below are instructions using the python-dotenv PyPI library.

  1. Install python-dotenv from PyPI: pip install python-dotenv
  2. The .env file that holds your credentials does not exist by default in the working directory of your jupyter-notebook or ipython session; you have to create it. The .env file is simply a list of variable names assigned to keys, as shown below.
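For reference, a .env file is just KEY=VALUE lines, for example:

OPENAI_API_KEY=XXX
HUGGINGFACEHUB_API_TOKEN=XXX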

Follow Way 0 or Way 1 below to create your .env file in your working directory.

Way 0: set an environment variable using the dotenv CLI

  1. cd /YOUR_CURRENT_WORKING_DIRECTORY
  2. Install python-dotenv[cli] from PyPI: pip install "python-dotenv[cli]". The .env file will be created automatically in your current working directory the first time you run dotenv set.
  3. Run the following commands to add or list contents of the .env file.
dotenv set OPENAI_API_KEY XXX   # Save a key on a line
dotenv list # view contents of .env
dotenv run -- python foo.py # to include .env credentials while running a .py file

Way 1: set an environment variable using basic bash commands

  1. cd /YOUR_CURRENT_WORKING_DIRECTORY
  2. Run basic bash commands to create the .env file and view its contents:
echo "OPENAI_API_KEY=XXX" >> .env  # Create .env file with the key saved on a line
cat .env # view contents of .env

Select a Model

After the credentials/keys for the models are inserted into the .env file, select a model.

# Select the LLM OR chat model

temperature = 0

which_model = 'openai_chat'

if which_model == 'openai_llm':
    import openai
    from langchain.llms import OpenAI
    # Get credential from .env file
    openai.api_key = os.getenv("OPENAI_API_KEY")
    # llm = OpenAI()
    # OR
    # llm = OpenAI(openai_api_key=openai.api_key, openai_organization="YOUR_ORGANIZATION_ID")
    # OR
    llm = OpenAI(openai_api_key=openai.api_key, temperature=temperature)  # Reliable output

elif which_model == 'openai_chat':
    import openai
    from langchain.chat_models import ChatOpenAI
    # Get credential from .env file
    openai.api_key = os.getenv("OPENAI_API_KEY")
    llm_chat = ChatOpenAI(temperature=temperature)

elif which_model == 'databricks_llm':
    # sudo python3.11 -m pip install openllm
    from langchain.llms import OpenLLM
    llm = OpenLLM(model_name="dolly-v2", model_id="databricks/dolly-v2-3b",
                  temperature=temperature, repetition_penalty=1.2)

elif which_model == 'vertexai_llm':
    from langchain.llms import VertexAI
    PROJECT_ID = os.getenv("PROJECT_ID")
    LOCATION = os.getenv("LOCATION")
    # Google Credentials:
    # https://cloud.google.com/docs/authentication/provide-credentials-adc#local-dev
    # In the command line run: gcloud auth application-default login
    # It creates the application_default_credentials.json file
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/YOUR_HOME_PATH/.config/gcloud/application_default_credentials.json"
    llm = VertexAI(location=LOCATION, max_output_tokens=128, model_name='text-bison', project=PROJECT_ID)

elif which_model == 'huggingface_llm':
    from langchain import HuggingFaceHub
    # Get credential from .env file and assign it to the Python environment variables
    huggingfacehub_api_token = os.getenv("HUGGINGFACEHUB_API_TOKEN")
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = huggingfacehub_api_token

    # See https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads for other options
    repo_id = "google/flan-t5-xxl"

    # Temperature can not be zero for HuggingFaceHub models
    if temperature == 0:
        temperature = 0.1
    llm = HuggingFaceHub(repo_id=repo_id, model_kwargs={"temperature": temperature, "max_length": 500})
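Before building the full dataset prompt, it helps to confirm that the credentials and the selected model actually work. A minimal sanity check (assuming which_model = 'openai_chat', so llm_chat exists; the prompt text is arbitrary):

# Quick sanity check that the chat model responds
print(llm_chat.predict("Reply with one word: ready"))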

Prompt/Query the LLM for a factual dataset

In this example, I was interested in information related to disasters, for feature engineering and ML/DL analysis.

# Feature engineering : dataset creation

### Example: More complex Q&A - multi-prompt requesting multiple outputs per prompt

from collections import defaultdict
from langchain.prompts import ChatPromptTemplate

# One-year windows: each prompt covers January 1 of from_year to January 1 of to_year
from_year = np.arange(2013, 2023)
to_year = np.arange(2014, 2024)

# Declare the langchain chat object (one time only)
temperature = 0
chat = ChatOpenAI(temperature=temperature)

df_tot = pd.DataFrame()

for ind, fy in enumerate(from_year):

    # [0] Insert information into text_template
    # Way 0 : could webscrape a news site for factual news
    # input_text = """List 100 disasters that occurred in the last 10 years"""
    input_text = f"""List 100 disasters that occurred from January 1, {fy} to January 1, {to_year[ind]}"""

    # OR

    # Way 1 : ask ChatGPT to use its own knowledge base

    # These feature names and descriptions could also be read from a pandas table
    feature0 = "date_of_disaster"
    description0 = "Date that the disaster occurred"

    feature1 = "name_of_disaster"
    description1 = "Name of the disaster that occurred"

    feature2 = "type_of_disaster"
    description2 = "Type of disaster that occurred"

    feature3 = "duration_of_disaster"
    description3 = "Duration of the disaster"

    feature4 = "cost_of_disaster"
    description4 = "Cost to repair the environment after the disaster"

    feature5 = "physical_health"
    description5 = "Prevalent physical health condition caused by the disaster"

    feature6 = "mental_health"
    description6 = "Prevalent mental health condition caused by the disaster"

    feature7 = "location_of_disaster"
    description7 = "Location of the disaster"

    feature8 = "area_of_disaster"
    description8 = "Area of physical damage in square meters caused by the disaster"

    text_template = f"""\
For the following topic in text {input_text}, extract the following information:
{feature0}: {description0}
{feature1}: {description1}
{feature2}: {description2}
{feature3}: {description3}
{feature4}: {description4}
{feature5}: {description5}
{feature6}: {description6}
{feature7}: {description7}
{feature8}: {description8}
Format the output as JSON with the following keys:
{feature0}
{feature1}
{feature2}
{feature3}
{feature4}
{feature5}
{feature6}
{feature7}
{feature8}
"""

    # [1] Declare prompt template
    prompt_template = ChatPromptTemplate.from_template(text_template)

    # [2] Declare the prompt (the template is already fully formatted, so no variables are passed)
    prompt = prompt_template.format_messages()

    # [3] Call OpenAI to process the complex langchain prompt
    response = chat(prompt)
    out = response.content  # string output
    # print('out: ', out)

    # [4] Split the response into lines, dropping the opening and closing JSON braces
    resp = out.split("\n")[1:-1]
    key = [feature0, feature1, feature2, feature3, feature4,
           feature5, feature6, feature7, feature8]

    # [5] Create a DataFrame: multi-row response per key
    d = defaultdict(list)
    for txtline in resp:
        for k in key:
            if txtline.find(k) != -1:
                # Clean text: remove quotes, the key prefix, and commas
                for r in ['"', "'", f"{k}: ", ',']:
                    txtline = txtline.replace(r, '')
                d[k].append(txtline.strip())

    df = pd.DataFrame.from_dict(d)
    df_tot = pd.concat([df_tot, df], axis=0)


# Save data to a .csv file
df_tot.reset_index(drop=True, inplace=True)
df_tot.to_csv('../data/observation0_disasterdata.csv', index=False)
After a few minutes we have a factual pandas dataset.
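Because the prompt explicitly asks for JSON output, a more robust alternative to the line-splitting above is to parse the response with Python's standard json module. A sketch, assuming the model returned well-formed JSON (it may not, so the string-cleaning fallback above is still useful):

import json

try:
    # The model may return one JSON object or an array of objects
    records = json.loads(out)
    if isinstance(records, dict):
        records = [records]
    df = pd.DataFrame(records)
except json.JSONDecodeError:
    pass  # fall back to the line-by-line string cleaning shown above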

Upload the Dataset to GCP for SQL analysis

bq load \
--location=$location \
--source_format=CSV \
--skip_leading_rows=1 \
--autodetect \
$dataset_name.$TABLE_name \
./$CSV_NAME
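The bq command assumes the shell variables are already defined. For example (the dataset and table names here are hypothetical; the CSV name matches the file written earlier):

location=US
dataset_name=disaster_dataset
TABLE_name=observation0
CSV_NAME=observation0_disasterdata.csv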

I was very excited to see that this method can be used to create any type of factual dataset imaginable! 🎉

References

  1. DeepLearning.AI, LangChain for LLM Application Development: https://www.deeplearning.ai/short-courses/. This class inspired the text template organization.
  2. LangChain: https://python.langchain.com/docs/get_started/introduction.html

Happy Practicing! 👋

