
Lab Workbook: Introduction to JSON and Big Data for AI Conversational Memory

Preamble:
Welcome to this foundational lab on JSON and Big Data in the context of AI conversational memory!
In this lab, we'll explore how JSON (JavaScript Object Notation) is used to structure and store conversational data for AI language models.
We'll also touch on Big Data concepts and how they relate to training and operating large language models.
Understanding how to work with JSON and manage large datasets is crucial for developing AI applications, especially chatbots and conversational AI.
This lab will prepare you for future work with more complex AI models and data-processing tasks. Another goal of the course is to learn the tooling for AI data analytics with R and, later, Power BI.

Let's get started!

Part 1: Setting Up the Environment (15 minutes)

We'll be using Python in Google Colab for this lab. Later we will introduce a more sophisticated and feature-rich tool called RStudio.

Python is widely used in AI and data science, and Colab provides a free, easy-to-use environment.
Open Google Colab:
Create a new notebook
Rename it to "JSON_BigData_AI_ConversationalMemory_Lab"

In the first cell, let's install and import the necessary libraries:

!pip install pandas nltk

import json
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
Run this cell.
You should see output indicating successful installation and imports.

Part 2: Introduction to JSON (30 minutes)

JSON is a lightweight, text-based data interchange format that's easy for humans to read and write and easy for machines to parse and generate.
It's commonly used for storing and transmitting data in web applications, including AI models.
One of the great wins for application development is that we can change the shape of our data containers programmatically at runtime.
In a traditional MVC web application, your business rules and algorithms live in the controller, encoded as programming structures (loops, if/then statements).
With JSON, we can instead store our business processes in the database, because we can describe them in declarative languages such as BPEL (Business Process Execution Language) and BPMN (Business Process Model and Notation).
JSON is the datastore for Service-Oriented Architectures (SOA), and SOA is the "front end" point of interaction between humans and AI LLMs:

Note: There are two architectures in the world right now for constructing enterprise systems.

(1) SOA: What is a service? Just a method call on an object. When we build an SOA, we are calling methods on objects which live inside operating systems that we connect to via TCP/IP. In SOA there is one block (one container) per algorithm. Each block talks to the others via TCP/IP, and each has its own port number to be addressed by. Each block also has its very own database, usually a JSON data store. SOA is phenomenally useful for building distributed BPEL applications, which is becoming a big thing now with the Internet of Things and distributed edge computing applications.
(2) MVC: MVC is a very narrow subcase of SOA. MVC is still SOA, but everything runs within one runtime environment: all of the algorithms are trapped inside a monolithic cement block called the controller.
Watch my video on how SOA applications are built:
BPEL and BPMN are XML-based markup languages which, like HTML and JSON, describe data declaratively. They describe and control your business processes, can be stored in JSON databases like MongoDB, and can be changed programmatically under AI control at runtime to adapt to changing market, business, and environmental conditions.
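The runtime-changeable-business-rules idea can be sketched in miniature. The rule structure below is purely illustrative (it is not a real BPEL or BPMN document); it shows a business rule stored as JSON-shaped data, evaluated as data, and then modified at runtime rather than being hard-coded in a controller:

```python
import json

# A hypothetical discount rule stored as data rather than code.
# The field names here are illustrative, not a BPEL/BPMN standard.
rule = {
    "process": "order_discount",
    "condition": {"field": "order_total", "op": ">=", "value": 100},
    "action": {"set_discount": 0.10},
}

def apply_rule(rule, record):
    """Evaluate a JSON-described condition against a record."""
    cond = rule["condition"]
    if cond["op"] == ">=" and record[cond["field"]] >= cond["value"]:
        return rule["action"]["set_discount"]
    return 0.0

order = {"order_total": 150}
print(apply_rule(rule, order))  # 0.1

# Because the rule is data, we can change it at runtime
# (no redeploy of the controller needed):
rule["condition"]["value"] = 200
print(apply_rule(rule, order))  # 0.0
```

Hard-coded in a controller, changing that threshold would require editing and redeploying code; stored as data, it is just an update to a document.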

IBM Watson was ChatGPT before there was ChatGPT.


Let's create a simple JSON structure to represent a conversation:
import json
conversation = {
    "conversation_id": "12345",
    "participants": ["user", "ai"],
    "messages": [
        {
            "sender": "user",
            "content": "Hello, how are you?",
            "timestamp": "2023-11-20T10:00:00Z"
        },
        {
            "sender": "ai",
            "content": "Hello! I'm functioning well, thank you. How can I assist you today?",
            "timestamp": "2023-11-20T10:00:05Z"
        }
    ]
}

print(json.dumps(conversation, indent=2))
Run this cell. You'll see a nicely formatted JSON output representing a simple conversation.
Now, let's write this conversation to a file:

with open('conversation.json', 'w') as f:
    json.dump(conversation, f)

print("Conversation saved to file.")
And read it back:

with open('conversation.json', 'r') as f:
    loaded_conversation = json.load(f)

print("Loaded conversation:")
print(json.dumps(loaded_conversation, indent=2))
You should see the same conversation structure printed out.
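Real-world files are not always well-formed. When loading JSON from disk, it is good practice to catch json.JSONDecodeError, which reports where parsing failed. A small sketch with deliberately malformed input:

```python
import json

# A deliberately malformed JSON string (unclosed array).
raw = '{"conversation_id": "12345", "messages": [}'

try:
    data = json.loads(raw)
except json.JSONDecodeError as e:
    # JSONDecodeError carries the message and the position of the failure.
    print(f"Could not parse JSON: {e.msg} at line {e.lineno}, column {e.colno}")
```

In a production pipeline you would typically log the bad record and move on rather than crash.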

Part 3: Working with Conversational Data

Now, let's simulate a larger dataset of conversations. We'll create multiple conversations and store them in a list:
import random
import datetime

def generate_random_conversation(conv_id):
    user_messages = [
        "Hello, how are you?",
        "What's the weather like today?",
        "Can you tell me a joke?",
        "What's the capital of France?",
        "How do I bake a cake?"
    ]
    ai_responses = [
        "Hello! I'm doing well. How can I assist you?",
        "I'm sorry, I don't have real-time weather information. You might want to check a weather website or app for the most current data.",
        "Sure! Why don't scientists trust atoms? Because they make up everything!",
        "The capital of France is Paris.",
        "To bake a cake, you'll need ingredients like flour, sugar, eggs, and butter. Start by preheating your oven, then mix your dry ingredients..."
    ]
    messages = []
    for _ in range(random.randint(2, 5)):
        user_msg = random.choice(user_messages)
        ai_msg = ai_responses[user_messages.index(user_msg)]
        messages.append({
            "sender": "user",
            "content": user_msg,
            "timestamp": datetime.datetime.now().isoformat()
        })
        messages.append({
            "sender": "ai",
            "content": ai_msg,
            "timestamp": (datetime.datetime.now() + datetime.timedelta(seconds=5)).isoformat()
        })
    return {
        "conversation_id": str(conv_id),
        "participants": ["user", "ai"],
        "messages": messages
    }

conversations = [generate_random_conversation(i) for i in range(1000)]

print(f"Generated {len(conversations)} conversations.")
print("\nSample conversation:")
print(json.dumps(conversations[0], indent=2))
This script generates 1000 simulated conversations. Run it and examine the output to see a sample conversation.
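For genuinely large datasets, a single JSON array becomes unwieldy because it must be parsed all at once. A common big-data-friendly alternative is JSON Lines (one JSON object per line), which can be written and read in a streaming fashion. A minimal sketch, using a two-conversation stand-in for our generated list:

```python
import json

# A couple of minimal conversations standing in for the generated dataset.
conversations = [
    {"conversation_id": "0", "messages": [{"sender": "user", "content": "Hi"}]},
    {"conversation_id": "1", "messages": [{"sender": "ai", "content": "Hello"}]},
]

# Write one JSON object per line (JSON Lines / NDJSON).
with open("conversations.jsonl", "w") as f:
    for conv in conversations:
        f.write(json.dumps(conv) + "\n")

# Read it back line by line -- no need to hold the whole file in memory.
loaded = []
with open("conversations.jsonl") as f:
    for line in f:
        loaded.append(json.loads(line))

print(f"Loaded {len(loaded)} conversations.")
```

Because each line is independent, JSON Lines files can be appended to, split, and processed in parallel, which is why the format is common in big-data tooling.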

Part 4: Analyzing Conversational Data

Now that we have a dataset, let's perform some basic analysis:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# Convert to DataFrame for easier analysis
df = pd.json_normalize(conversations, record_path='messages', meta=['conversation_id'])

print("Total number of messages:", len(df))
print("\nMessages per sender:")
print(df['sender'].value_counts())

print("\nUnique conversations:", df['conversation_id'].nunique())

print("\nAverage message length:")
df['message_length'] = df['content'].apply(lambda x: len(word_tokenize(x)))
print(df.groupby('sender')['message_length'].mean())
This code converts our JSON data to a pandas DataFrame and performs some basic analysis. Run it and examine the output.
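Beyond the counts above, pandas makes it easy to slice the data further, for example counting messages per conversation or finding the most frequent user prompts. The tiny DataFrame below is a stand-in for the one built from our generated conversations:

```python
import pandas as pd

# Minimal stand-in for the message DataFrame built above.
df = pd.DataFrame({
    "conversation_id": ["0", "0", "1", "1", "1", "1"],
    "sender": ["user", "ai", "user", "ai", "user", "ai"],
    "content": ["Hi", "Hello!", "Joke?", "Sure!", "Thanks", "Welcome"],
})

# Messages per conversation
msgs_per_conv = df.groupby("conversation_id").size()
print(msgs_per_conv)

# Which user prompts appear most often?
top_user = df[df["sender"] == "user"]["content"].value_counts().head(3)
print(top_user)
```

Run the same two snippets against the full DataFrame from the step above to profile the generated dataset.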

Part 5: Storing and Retrieving Conversational Memory

In a real AI system, we'd typically store this data in a database. For this lab, we'll simulate database operations using JSON files.
Let's create a simple system to store and retrieve conversational memory:
class ConversationMemory:
    def __init__(self, filename='conversation_memory.json'):
        self.filename = filename
        try:
            with open(self.filename, 'r') as f:
                self.memory = json.load(f)
        except FileNotFoundError:
            self.memory = {}

    def save_conversation(self, conversation):
        self.memory[conversation['conversation_id']] = conversation
        with open(self.filename, 'w') as f:
            json.dump(self.memory, f)

    def get_conversation(self, conversation_id):
        return self.memory.get(conversation_id, None)

    def get_last_message(self, conversation_id):
        conversation = self.get_conversation(conversation_id)
        if conversation and conversation['messages']:
            return conversation['messages'][-1]['content']
        return None

# Initialize ConversationMemory
memory = ConversationMemory()

# Save a few conversations
for conv in conversations[:5]:
    memory.save_conversation(conv)

print("Saved 5 conversations to memory.")

# Retrieve a conversation
retrieved_conv = memory.get_conversation('2')
print("\nRetrieved conversation 2:")
print(json.dumps(retrieved_conv, indent=2))

# Get last message from a conversation
last_message = memory.get_last_message('3')
print("\nLast message from conversation 3:")
print(last_message)
This code demonstrates a simple system for storing and retrieving conversational memory.
Run it and examine the output.
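A natural extension, not covered in the lab code, is appending new messages to a stored conversation as a chat continues. The sketch below uses a trimmed copy of the class and adds a hypothetical add_message helper:

```python
import json

class ConversationMemory:
    # Trimmed copy of the lab's class, plus a hypothetical add_message helper.
    def __init__(self, filename='conversation_memory.json'):
        self.filename = filename
        try:
            with open(self.filename, 'r') as f:
                self.memory = json.load(f)
        except FileNotFoundError:
            self.memory = {}

    def save_conversation(self, conversation):
        self.memory[conversation['conversation_id']] = conversation
        with open(self.filename, 'w') as f:
            json.dump(self.memory, f)

    def add_message(self, conversation_id, sender, content):
        """Append a message to an existing conversation and persist it."""
        conv = self.memory.get(conversation_id)
        if conv is None:
            return False
        conv['messages'].append({"sender": sender, "content": content})
        self.save_conversation(conv)
        return True

memory = ConversationMemory('memory_demo.json')
memory.save_conversation({
    "conversation_id": "42",
    "participants": ["user", "ai"],
    "messages": []
})
memory.add_message("42", "user", "Remember me?")
print(memory.memory["42"]["messages"][-1]["content"])  # Remember me?
```

Because every call rewrites the whole file, this only scales so far; the database technologies in the lecture notes below handle incremental updates properly.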

Lecture Notes:

This lecture note surveys the database technologies typically used for storing conversational memory in AI systems, along with sample code and workflows.

It expands on the JSON file-based approach we used in the lab and introduces more scalable solutions.

Lecture Note: Database Technologies for AI Conversational Memory
In production AI systems, especially those dealing with large-scale conversational data, we typically use more robust and scalable database solutions than simple JSON files.
Here are some common database technologies used in AI applications:
1. NoSQL Databases (JSON)
2. Relational Databases
3. Vector Databases

1. NoSQL Databases: NoSQL databases are often preferred for storing conversational data due to their flexibility with unstructured data and their scalability.
Example: MongoDB
MongoDB is a popular document-oriented NoSQL database that stores data in BSON (Binary JSON) format, making it ideal for JSON-like conversational data.
Sample code using PyMongo (MongoDB's Python driver):
```python
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
db = client['conversational_ai_db']
conversations = db['conversations']

# Store a conversation
conversation = {
    "conversation_id": "12345",
    "participants": ["user", "ai"],
    "messages": [
        {
            "sender": "user",
            "content": "Hello, how are you?",
            "timestamp": "2023-11-20T10:00:00Z"
        },
        {
            "sender": "ai",
            "content": "Hello! I'm functioning well, thank you. How can I assist you today?",
            "timestamp": "2023-11-20T10:00:05Z"
        }
    ]
}

result = conversations.insert_one(conversation)
print(f"Conversation inserted with id: {result.inserted_id}")

# Retrieve a conversation
retrieved_conv = conversations.find_one({"conversation_id": "12345"})
print(retrieved_conv)
```
2. Relational Databases:
While less common for storing raw conversational data, relational databases are often used in conjunction with NoSQL databases for structured metadata or analytics.
Example: PostgreSQL
Sample code using psycopg2 (PostgreSQL adapter for Python):
```python
import psycopg2
import json

# Connect to PostgreSQL
conn = psycopg2.connect("dbname=conversational_ai user=postgres password=password")
cur = conn.cursor()

# Create tables
cur.execute("""
    CREATE TABLE IF NOT EXISTS conversations (
        id SERIAL PRIMARY KEY,
        conversation_id VARCHAR(255) UNIQUE,
        participants JSONB,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

cur.execute("""
    CREATE TABLE IF NOT EXISTS messages (
        id SERIAL PRIMARY KEY,
        conversation_id VARCHAR(255) REFERENCES conversations(conversation_id),
        sender VARCHAR(50),
        content TEXT,
        timestamp TIMESTAMP
    )
""")

# Insert a conversation
conversation = {
    "conversation_id": "12345",
    "participants": ["user", "ai"],
    "messages": [
        {
            "sender": "user",
            "content": "Hello, how are you?",
            "timestamp": "2023-11-20T10:00:00Z"
        },
        {
            "sender": "ai",
            "content": "Hello! I'm functioning well, thank you. How can I assist you today?",
            "timestamp": "2023-11-20T10:00:05Z"
        }
    ]
}

cur.execute(
    "INSERT INTO conversations (conversation_id, participants) VALUES (%s, %s)",
    (conversation['conversation_id'], json.dumps(conversation['participants']))
)

for message in conversation['messages']:
    cur.execute(
        "INSERT INTO messages (conversation_id, sender, content, timestamp) VALUES (%s, %s, %s, %s)",
        (conversation['conversation_id'], message['sender'], message['content'], message['timestamp'])
    )

conn.commit()

# Retrieve a conversation
cur.execute("""
    SELECT c.conversation_id, c.participants,
           json_agg(json_build_object('sender', m.sender, 'content', m.content, 'timestamp', m.timestamp)) AS messages
    FROM conversations c
    JOIN messages m ON c.conversation_id = m.conversation_id
    WHERE c.conversation_id = %s
    GROUP BY c.conversation_id, c.participants
""", ("12345",))

retrieved_conv = cur.fetchone()
print(json.dumps(retrieved_conv, indent=2))

cur.close()
conn.close()
```
3. Vector Databases: For advanced AI applications, especially those involving semantic search or similarity matching, vector databases are becoming increasingly popular.
Example: Pinecone
Sample code using Pinecone:
```python
import pinecone
from sentence_transformers import SentenceTransformer

# Initialize Pinecone
pinecone.init(api_key="your-api-key", environment="your-environment")
index_name = "conversational-ai"

# Create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=384, metric="cosine")

# Connect to the index
index = pinecone.Index(index_name)

# Load a sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Function to upsert a conversation
def upsert_conversation(conversation):
    for i, message in enumerate(conversation['messages']):
        vector = model.encode(message['content']).tolist()
        unique_id = f"{conversation['conversation_id']}_{i}"
        metadata = {
            "conversation_id": conversation['conversation_id'],
            "sender": message['sender'],
            "timestamp": message['timestamp']
        }
        index.upsert([(unique_id, vector, metadata)])

# Upsert a conversation
conversation = {
    "conversation_id": "12345",
    "messages": [
        {
            "sender": "user",
            "content": "Hello, how are you?",
            "timestamp": "2023-11-20T10:00:00Z"
        },
        {
            "sender": "ai",
            "content": "Hello! I'm functioning well, thank you. How can I assist you today?",
            "timestamp": "2023-11-20T10:00:05Z"
        }
    ]
}

upsert_conversation(conversation)

# Query similar messages
query_vector = model.encode("How are you doing?").tolist()
results = index.query(query_vector, top_k=5, include_metadata=True)

for result in results['matches']:
    print(f"Score: {result['score']}, Content: {result['metadata']}")
```
Workflow Considerations:
1. Data Ingestion: Set up a pipeline to continuously ingest conversational data into your chosen database.
2. Data Processing: Implement pre-processing steps (e.g., tokenization, embedding generation) before storage.
3. Indexing: For faster retrieval, create appropriate indexes on frequently queried fields.
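Step 2 above (pre-processing before storage) can be sketched with a minimal, dependency-free tokenizer; a production pipeline would typically use nltk or an embedding model instead:

```python
import re

def preprocess(message: str) -> list[str]:
    """Lowercase, strip punctuation, and split a message into tokens
    before it is stored or embedded."""
    cleaned = re.sub(r"[^\w\s]", "", message.lower())
    return cleaned.split()

tokens = preprocess("Hello, how are you?")
print(tokens)  # ['hello', 'how', 'are', 'you']
```

Normalizing messages this way before storage keeps downstream analytics (message lengths, word counts, embeddings) consistent across the dataset.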