Lecture: Databases and Big Data for Cloud DevOps and AI Application Development
Introduction
Welcome to today's lecture on databases and Big Data in the context of Cloud DevOps and AI application development. As the world becomes more data-driven, understanding the role of databases and Big Data in developing AI applications is crucial.
Note: These concepts will be covered on the midterm and final exams!
This lecture will provide you with a solid understanding of the fundamental concepts and tools necessary to work with databases and Big Data effectively when building AI solutions.
A practical example:
Student Sarah was enthusiastic about applying Big Data techniques to build her own ChatGPT-style generative AI language model for a class project.
She followed these detailed steps to accomplish her goal:
Data Collection: Sarah began by gathering a large and diverse dataset of text, which included conversational data, articles, and various other sources [2]. She ensured that the dataset was extensive enough to train her generative AI model effectively.
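As a small illustration of this step, here is a minimal sketch of assembling a text corpus from local files. The directory name and file layout are assumptions for illustration, not details from Sarah's actual project.

```python
from pathlib import Path

corpus = []
for path in Path("raw_data").glob("*.txt"):   # articles, transcripts, etc.
    corpus.append(path.read_text(encoding="utf-8"))

print(f"collected {len(corpus)} documents, {sum(len(d) for d in corpus)} characters")
```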
Data Preprocessing: Sarah cleaned the data by removing irrelevant content and inconsistencies [3]. She also tokenized and encoded the text, converting it into a format that her machine learning model could understand.
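A minimal preprocessing sketch using only the Python standard library is shown below. Production pipelines typically use a subword tokenizer (such as BPE) rather than this simple whitespace approach, so treat it as a conceptual illustration.

```python
import re

raw = "Hello, WORLD!!  Visit https://example.com for more."
clean = re.sub(r"https?://\S+", "", raw).lower()   # drop URLs, normalize case
tokens = re.findall(r"[a-z']+", clean)             # crude word tokenization
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
encoded = [vocab[t] for t in tokens]               # text -> integer ids
print(tokens)
print(encoded)
```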
Exploratory Data Analysis (EDA):
EDA involves forming and testing hypotheses about our dataset. A hypothesis is an assumption about what is true; you then test that assumption against real-world data to find out whether it holds.
Sarah analyzed the dataset's characteristics, such as word frequency, sentence length, and common themes, to gain insights into her training data [3].
Mathematical concepts factor into this: we apply statistics and probability, for example to build Bayesian cause → effect graphs.
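Here is a small EDA sketch covering the two characteristics named above, word frequency and sentence length. The tiny inline corpus is a stand-in for Sarah's dataset.

```python
from collections import Counter
import re

corpus = ["Databases store data. Big Data is big.", "Data drives AI."]
text = " ".join(corpus)

words = re.findall(r"[a-z]+", text.lower())
print("top words:", Counter(words).most_common(3))   # word frequency

sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
lengths = [len(s.split()) for s in sentences]
print("mean sentence length:", sum(lengths) / len(lengths))
```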
Model Selection: Based on her research, Sarah chose a generative pretrained transformer (GPT) for her language model, as it's known for its effectiveness in generating human-like text [2].
Model Training: Sarah split her dataset into training and validation sets and began training her GPT model using the training data.
She carefully monitored the model's performance during the training process [3].
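A minimal training-loop sketch in PyTorch (the framework named later in this lecture) is shown below. The tiny embedding-plus-linear model and the random token data are placeholders, not Sarah's actual GPT.

```python
import torch
import torch.nn as nn

vocab_size = 100
data = torch.randint(0, vocab_size, (64, 16))   # 64 toy sequences of 16 token ids
train, val = data[:48], data[48:]               # val is held out for evaluation

# A tiny stand-in model: embedding + linear head predicting the next token.
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for i in range(0, len(train), 8):                    # mini-batches of 8
        batch = train[i:i + 8]
        inputs, targets = batch[:, :-1], batch[:, 1:]    # next-token prediction
        logits = model(inputs)
        loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last training loss {loss.item():.3f}")  # monitoring
```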
Hyperparameter Tuning: To optimize her model's performance, Sarah experimented with various hyperparameter settings, such as learning rate and batch size [3].
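One simple tuning strategy is a grid search over candidate settings, sketched below. The `train_and_validate` function is a stub standing in for a full training run plus a validation pass; it returns a random number here just so the sketch runs end to end.

```python
import random

def train_and_validate(lr, batch_size):
    # Stub standing in for: train with these settings, return validation loss.
    return random.random()

best = None
for lr in (1e-2, 1e-3, 1e-4):           # candidate learning rates
    for batch_size in (8, 16, 32):      # candidate batch sizes
        val_loss = train_and_validate(lr, batch_size)
        if best is None or val_loss < best[0]:
            best = (val_loss, lr, batch_size)

print("best (val_loss, lr, batch_size):", best)
```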
Model Evaluation: Sarah evaluated her GPT model's performance on the validation set to ensure that it was generating coherent and contextually relevant text [3].
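One common way to summarize a language model's validation performance is perplexity, the exponential of the average cross-entropy loss. The loss values below are assumed for illustration.

```python
import math

# Per-batch cross-entropy losses measured on the validation set (assumed values).
val_losses = [3.91, 3.87, 3.95]
avg_loss = sum(val_losses) / len(val_losses)
perplexity = math.exp(avg_loss)   # lower means the model is less "surprised"
print(f"validation perplexity: {perplexity:.1f}")
```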
Iterative Improvement: Based on the evaluation results, Sarah fine-tuned her model by adjusting its parameters and retraining it, aiming to achieve better performance.
In a real-world deployment, your project plan should include provisions to identify and correct for model drift.
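Below is a toy drift check that compares a simple statistic (mean prompt length) between a reference sample from training time and recent production traffic. The prompts and threshold are invented for illustration; real deployments use richer statistical tests and monitoring tools.

```python
reference = ["how do I reset my password", "what is SQL", "define big data"]
recent = ["explain quantum entanglement and string theory in great depth"]

ref_mean = sum(len(p.split()) for p in reference) / len(reference)
rec_mean = sum(len(p.split()) for p in recent) / len(recent)

if abs(rec_mean - ref_mean) / ref_mean > 0.25:   # threshold is illustrative
    print("possible input drift: re-evaluate the model on recent traffic")
```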
Deployment: Once Sarah was satisfied with her ChatGPT model's performance, she built a user-friendly interface using Gradio and deployed it as a web app for others to interact with and test [3].
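A minimal Gradio deployment sketch is shown below. The `generate_reply` function is a hypothetical stand-in for calling Sarah's trained model; the rest is standard Gradio usage.

```python
import gradio as gr

def generate_reply(prompt: str) -> str:
    # Hypothetical stand-in: the real app would run the trained GPT model here.
    return f"(model output for: {prompt})"

demo = gr.Interface(fn=generate_reply, inputs="text", outputs="text",
                    title="Class-Project Chatbot")
demo.launch()   # serves a local web UI; share=True creates a public link
```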
Throughout the project, Sarah applied Big Data techniques to process, analyze, and manage the massive dataset that she used for training her ChatGPT generative AI language model.
By doing so, she successfully built an AI model capable of generating human-like text, providing intelligent, insightful, and contextually appropriate answers to user questions about its training data set.
Large social media companies use these same Big Data techniques to infer your personal details, sell you advertising content, and influence your behavior, including how you vote in political elections.
1. What Are Databases and Big Data?
A database is an organized collection of structured data designed to be easily accessed, managed, and updated. Databases allow us to store, retrieve, and manipulate data efficiently, making them essential for various applications, including AI development.
Big Data refers to the enormous volume, variety, and velocity of data that is too complex to be effectively processed using traditional data management tools.
In the context of building a generative AI language model, we cannot assign a primary key to unstructured text; how would that even work? So instead we wrap the data in flexible containers called JSON documents.
It involves the collection, storage, analysis, and visualization of massive datasets, which often require cloud computing resources: distributed computing and advanced analytics.
This is our introduction to Cloud DevOps in the context of building a generative AI language model.
2. Types of Databases
There are two main types of databases:
2.1 Relational Databases
Relational databases, centered on the primary key and Codd's 12 rules, use tables to store data, with each table consisting of rows and columns. They are based on the relational model, in which data is organized into relationships between tables, expressed as joins on matching key values.
SQL (Structured Query Language) is commonly used to interact with relational databases. At runtime, the database server maintains these tables as in-memory data structures backed by durable storage on disk.
Examples include Oracle, MySQL, PostgreSQL, and Microsoft SQL Server.
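Here is a minimal relational sketch using Python's built-in sqlite3 module: two tables linked by a primary/foreign key, queried with an SQL join. The schema and rows are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")   # throwaway in-memory database
con.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY,
                        author_id INTEGER REFERENCES authors(id),
                        body TEXT);
    INSERT INTO authors VALUES (1, 'Sarah');
    INSERT INTO posts VALUES (10, 1, 'Training my GPT model today.');
""")

# A join reconstructs the relationship between the two tables.
for row in con.execute("""SELECT a.name, p.body
                          FROM posts p JOIN authors a ON p.author_id = a.id"""):
    print(row)   # ('Sarah', 'Training my GPT model today.')
```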
2.2 NoSQL Databases
NoSQL databases, or "not only SQL," are designed to handle unstructured data and are more flexible and scalable than relational databases.
JSON (JavaScript Object Notation) is the "container" for the data. Because JSON is plain text, program code can change the shape of the data container simply by changing the text of the { key: value } documents (a small sketch follows the list below).
NoSQL databases can be categorized into four main types:
Document-based: Store data as { key: value } documents in JSON or BSON format (e.g., MongoDB, Couchbase).
Key-value: Store data as key-value pairs (e.g., Redis, Amazon DynamoDB).
Column-family: Store data in columns instead of rows (e.g., Apache Cassandra, HBase).
Graph: Store data as nodes and edges in a graph (e.g., Neo4j, Amazon Neptune); Facebook's social graph is a well-known application of this model.
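The following sketch illustrates the document model using only Python's standard json module: a document whose "shape" is changed in code, exactly as described above.

```python
import json

doc = {"user": "sarah", "role": "student"}
doc["projects"] = ["chatgpt-clone"]    # reshape the container at runtime
text = json.dumps(doc)                 # documents travel as plain text
print(text)                            # {"user": "sarah", ...}
print(json.loads(text)["projects"])    # and parse back into live objects
```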
3. Database Management Systems (DBMS)
The database itself is the file that stores the tables (in SQL) or the JSON documents.
The database by itself is only a small part of the story. To do useful work, we need a "front end" set of tools to perform CRUD operations (create, read, update, delete).
A DBMS is the user interface software that allows users to create, maintain, and interact with databases.
It provides a set of tools to manage data (CRUD: create, read, update, delete), enforce data integrity (for example, through keys and join predicates), and ensure data security. A minimal CRUD sketch follows the list of options below.
Popular DBMS options include:
MySQL
PostgreSQL
Microsoft SQL Server
MongoDB: JSON database
Oracle Database
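As a concrete CRUD illustration using one of the systems listed above, here is a minimal sketch with MongoDB's official Python driver, pymongo. It assumes a MongoDB server is reachable on localhost:27017; the database and collection names are invented.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
notes = client["lecture_db"]["notes"]   # database and collection (invented names)

notes.insert_one({"topic": "NoSQL", "rating": 5})                  # Create
doc = notes.find_one({"topic": "NoSQL"})                           # Read
notes.update_one({"topic": "NoSQL"}, {"$set": {"rating": 4}})      # Update
notes.delete_one({"topic": "NoSQL"})                               # Delete
print(doc)
```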
4. Big Data Technologies
When dealing with Big Data, traditional databases may not be sufficient.
Some popular Big Data technologies include:
4.1 Hadoop
Apache Hadoop is an open-source framework that allows distributed storage and processing of large datasets using a cluster of computers.
Hadoop consists of two main components: HDFS (Hadoop Distributed File System) for distributed storage and MapReduce for distributed processing.
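The sketch below illustrates the MapReduce idea in pure Python; it is a conceptual model, not actual Hadoop code. Real Hadoop runs the same three phases in parallel across a cluster, reading from HDFS.

```python
from collections import defaultdict

lines = ["big data is big", "data is everywhere"]

# Map phase: emit a (word, 1) pair for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: the framework groups pairs by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts within each group.
reduced = {word: sum(counts) for word, counts in groups.items()}
print(reduced)   # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```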
4.2 Spark
Apache Spark is an open-source data processing engine designed for large-scale data processing, machine learning, and graph processing. It is faster and more flexible than Hadoop and can be used with various data storage systems.
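The same word count expressed in PySpark looks like this. It assumes pyspark is installed and that a file named corpus.txt exists; both are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
counts = (spark.sparkContext.textFile("corpus.txt")    # distributed read
          .flatMap(lambda line: line.split())          # map: split into words
          .map(lambda word: (word, 1))                 # map: (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))            # reduce: sum per word
print(counts.take(5))   # first five (word, count) pairs
spark.stop()
```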
4.3 NoSQL Databases
As mentioned earlier, NoSQL databases are well-suited for handling Big Data due to their flexibility and scalability.
5. Databases in Cloud DevOps
In the context of Cloud DevOps, databases play a crucial role in providing a seamless development and deployment experience. Cloud-based databases offer scalability, flexibility, and ease of management, making them ideal for DevOps environments.
Another big advantage of cloud-based databases is the subscription, pay-per-use model: we access what we need when we need it and pay only for what we use, with no need to install and operate computers at our own site.
Some popular cloud-based DBMS options include:
Amazon RDS (Relational Database Service)
Google Cloud SQL
Microsoft Azure SQL Database
Amazon DynamoDB (NoSQL)
Google Firestore (NoSQL)
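As a small example against one of these services, here is a sketch using boto3, AWS's Python SDK, to write and read an item in Amazon DynamoDB. It assumes AWS credentials are configured and that a table named "Users" with a string partition key "user_id" already exists; both are assumptions for illustration.

```python
import boto3

table = boto3.resource("dynamodb").Table("Users")               # existing table
table.put_item(Item={"user_id": "sarah", "role": "student"})    # write an item
response = table.get_item(Key={"user_id": "sarah"})             # read it back
print(response.get("Item"))
```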
6. Big Data and AI Application Development
Big Data is essential for AI application development, as it provides the massive datasets needed for training machine learning models.
Technologies like Hadoop and Spark facilitate the processing and analysis of these datasets, while databases and cloud-based storage solutions provide the necessary infrastructure to store and manage the data on a pay-as-you-go basis.
By understanding the role of databases and Big Data in AI application development, you'll be better equipped to design, develop, and deploy effective AI solutions. We are now also equipped to understand what application development tools like PyTorch are doing under the hood when we build transformer-based language models.
Conclusion
In summary, databases and Big Data play a critical role in Cloud DevOps and AI application development. By understanding the different types of databases, DBMS, and Big Data technologies, you'll be better equipped to manage and manipulate the vast amounts of data required to train and validate the AI applications that we build, and to guard against model drift once they are deployed.
As the world becomes increasingly data-driven, having a solid grasp of these concepts will prove invaluable in your career as an AI developer.