Hi, I’m Janvi, a Software Engineer at Coda. I’m excited about the applications of generative AI, especially Coda AI.
Within the last few months, I’ve watched ChatGPT, DALL-E, and GPT-4 take the internet by storm. Even though AI research has been around for 80 years, the latest developments have captured the curiosity of millions inside and outside the tech industry, finally making AI mainstream #basic. Access to large language models has caused a surge in AI development at both large companies and startups: one in every five companies in Y Combinator’s 2023 winter cohort built in the generative AI space. Knowing that AI is going to be widely available across many tools inspired me to learn more about how it works and why generative AI has become so disruptive. If you are also interested in a deeper understanding of how the latest machine learning models work, this post is for you. It aims to build an intuition around key machine learning terms and demystify some of the inner workings of these developments.

Ever wondered how computers understand concepts that humans perceive with our senses, like language and images? For example, when users type text into ChatGPT, how does it understand what they are saying? Embeddings. Computers don’t understand English; they understand numbers, vectors, and matrices. Embeddings are a way to map concepts to meaningful numbers. “Meaningful” in the sense that similar concepts end up physically close to each other on a graph.

For example, if you wanted to build a movie-recommendation model, you would need a way to embed movies. A potential embedding could consider where movies lie on a graph defined by kid-friendliness and humor. Voilà: the movie The Hangover can now be meaningfully represented by numbers. Based on this representation, if you’ve watched The Hangover, the recommendation system is more likely to recommend The Incredibles to you than Soul, because The Incredibles is closer on the graph. It is a pretty bad recommendation, but hey, it’s a start.

To properly represent a movie, many more traits would need to be captured, resulting in a longer embedding vector. In GloVe, a publicly available set of embeddings for English words trained on Wikipedia text, the embedding for the word “king” is a 50-dimensional vector:

Source: Jay Alammar, The Illustrated Word2vec
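To make “close on the graph” concrete, here is a minimal sketch using made-up two-dimensional movie embeddings of the form (kid-friendliness, humor). The coordinates are invented purely for illustration:

```python
import math

# Hypothetical 2D movie embeddings: (kid-friendliness, humor), both 0-1.
# These coordinates are made up purely for illustration.
movies = {
    "The Hangover":    (0.1, 0.9),
    "The Incredibles": (0.9, 0.7),
    "Soul":            (0.8, 0.3),
}

def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The recommender treats movies that are close on the graph as similar.
watched = movies["The Hangover"]
for title in ("The Incredibles", "Soul"):
    print(title, round(distance(watched, movies[title]), 2))
# The Incredibles 0.82  <- closer, so more likely to be recommended
# Soul 0.92
```

With these invented numbers, The Incredibles comes out closer to The Hangover than Soul does, matching the intuition above.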
Images require embeddings as well. One very basic way to embed images is to represent each pixel by its color. Now an image is just a matrix of numbers like so:
Source: Romain Beaumont’s Medium
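As a toy illustration (the pixel values here are invented), a grayscale image really is just a matrix of numbers; a color image would simply store three numbers (R, G, B) per pixel instead:

```python
import numpy as np

# A tiny 3x3 grayscale "image": each entry is a pixel's brightness, 0-255.
image = np.array([
    [  0, 128, 255],
    [ 64, 192,  32],
    [255,   0, 128],
])

print(image.shape)  # (3, 3): the image is literally a matrix of numbers
```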
Determining how to embed a model’s input is part of the training process, which takes us to our next concept: what is a model, and what does it mean to train it?

Models are fundamental to machine learning. A machine learning model is a computer program that finds patterns in, or makes decisions about, data it hasn’t seen before. For example, a model could be built to predict how likely a startup is to IPO, or a computer-vision model could be built to detect dogs in images. Mathematically, a model can be represented as something that takes an input and spits out an output; for example, the equation y = mx is a model. A model must be trained (i.e., taught) to return the right output. Once the model has been trained, it is tested on data it has never seen before to see how accurate it is. If the accuracy is high enough, the model is ready to be used; if not, it must be retrained.

What is happening when a model is being trained? At its core, training involves finding the parameter values (a.k.a. weights) that result in the highest accuracy on unseen data. Parameters can be understood through a simple example: y = mx is a model with one parameter (namely, m). Suppose you wanted to model how much groups of people pay for ice cream, and your data showed every group of x people paying 3x dollars; during the training process you would set the parameter m to 3. Now that the model is trained, even though we have no data about the amount paid by a group of 100, we can make a good guess.

When models are “large,” it means that they have a lot of parameters. It doesn’t necessarily mean that they were trained on a lot of data, though the two are correlated. Further, a model’s size determines how much space (KB, MB, etc.) it takes up. Contrary to common belief, increasing a model’s size does not always lead to better results. Instead, it might cause overfitting, where the model “memorizes” the training data as opposed to learning something general. Hence, finding the Goldilocks number for a model’s size is an art.

While we’re talking about inputs, it’s worth mentioning tokens. You might have heard that “GPT-3 only accepts 4,000 tokens.” Tokenization cuts the input data into meaningful parts, or tokens. When dealing with images, tokenization involves splitting the image into patches. When dealing with language, it involves splitting text into frequent words and phrases. If you’re confused about the difference between tokens and embeddings, here is a comparison to help:

Comparing tokens to embeddings
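Circling back to the ice-cream example, here is a minimal sketch of what “training” means for that one-parameter model. The data points are invented so that the best-fit value of m comes out to 3:

```python
# Toy training data for the ice-cream model y = m * x:
# x = number of people in the group, y = total amount paid ($).
data = [(1, 3), (2, 6), (5, 15), (10, 30)]

# "Training" here is just a least-squares fit of the single parameter m.
m = sum(x * y for x, y in data) / sum(x * x for x, y in data)
print(m)  # 3.0

# With the trained parameter, the model can handle unseen inputs:
print(m * 100)  # predicted amount paid by a group of 100 -> 300.0
```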
How the iPhone identifies plants
To wrap up the discussion on models, let’s consider how an iPhone can identify the species of plants and animals in your photos. It’s a pretty cool feature if you haven’t used it before. I used to think that Apple was comparing my image to millions of other images of plants and animals and identifying the closest match, a huge computational feat. 😵💫 That is not what is happening. Apple has already trained a model that is good at identifying species on powerful computers. After the training is complete, they are left with a complicated equation (a model) that accepts photos as inputs. This model is stored either locally on my phone or in the cloud, depending on whether the species-identification feature needs to work offline. When I use the feature, my image is fed into this model: the photo gets tokenized, each token gets embedded into a vector, and then a bunch of linear algebra happens using the parameters chosen during training, finally returning the species. 🌱
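Put together, the pipeline looks roughly like the sketch below. Everything here is a hypothetical stand-in: the patch size, the pooling step, the weight matrix, and the labels are all invented to show the flow of data, not Apple’s actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: the labels, shapes, and weights below are
# invented purely to show the flow of data through a trained model.
LABELS = ["fern", "oak", "rose"]
W = rng.normal(size=(16, 3))  # "parameters chosen during training"

def tokenize(photo):
    """Split an 8x8 photo into four 4x4 patches (the tokens)."""
    return [photo[i:i + 4, j:j + 4].ravel()
            for i in range(0, 8, 4) for j in range(0, 8, 4)]

def embed(patch):
    """Turn a patch into a vector of numbers (here: scale to 0-1)."""
    return patch / 255.0

def identify(photo):
    tokens = tokenize(photo)               # 1. tokenize the photo
    vectors = [embed(t) for t in tokens]   # 2. embed each token
    features = np.mean(vectors, axis=0)    # 3. pool the token vectors
    scores = features @ W                  # 4. linear algebra with the weights
    return LABELS[int(np.argmax(scores))]  # 5. return the predicted species

photo = rng.integers(0, 256, size=(8, 8)).astype(float)
print(identify(photo))
```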
In summary, parameters can be thought of as numbers that encode what the model learned from being exposed to the training data. A model’s size is the number of parameters it has, and it determines how much space the model takes up. In rough pseudo-math: output = embedded(input) × parameters. The example model y = mx has 1 parameter, but GPT-3 has 175 billion parameters. 🤯

Having covered key concepts used across all forms of machine learning, we can delve into a niche of AI that is particularly hot these days: natural language processing (NLP). NLP focuses on giving computers the ability to understand written and spoken language the way human beings can. In short, it is AI x Linguistics. A specific type of NLP model is the language model, a notable example being GPT-3. Language models are built to predict the next word given part of a sentence. When a computer predicts the next word accurately again and again, it ends up generating meaningful content (e.g., blog posts written by ChatGPT). The language model most of us interact with on a day-to-day basis is the autocomplete on our smartphone keyboards. It follows from what we already know that large language models, or LLMs, are just models that predict the next word and have a lot of parameters (hundreds of millions to hundreds of billions). Some of the most famous LLMs are GPT, BERT, and T5. Here is how they compare in size:

Source: Hugging Face
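To make “predict the next word” concrete, here is a toy next-word predictor that simply counts which word most often follows which. A real LLM replaces this lookup table with billions of learned parameters, but the task is the same:

```python
from collections import Counter, defaultdict

# Count which word follows which in a tiny corpus (invented example text).
text = "the cat sat on the mat and the cat slept"
words = text.split()

follower_counts = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    follower_counts[current][nxt] += 1

def predict_next(word):
    """Predict the most frequent follower of the given word."""
    return follower_counts[word].most_common(1)[0][0]

print(predict_next("the"))  # -> "cat" ("cat" follows "the" twice, "mat" once)
```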
Thus far we have treated models as black boxes: you input some data and, boom, you get the output. Looking inside this black box, one model structure frequently used in language models is the neural network, named as such because it is structured to loosely mimic how neurons in our brains work.

Diagram of a neural network
The neural network diagram can seem intimidating, but it is just a sequence of layers. The first layer accepts the input, the final layer produces the output, and the layers in between are called hidden layers. The input data enters the first layer and gets passed through and processed by the nodes in each layer until it reaches the final layer. The model’s weights (parameters), which are chosen during the training process, are represented by the connections between the nodes.

Looking at how data gets passed between layers: every node in the neural network has a value between 0 and 1, called its activation. A node’s activation is computed from the activations of the incoming nodes in the previous layer, each multiplied by the weight of its connection and then summed. If the result is below a threshold value, the node passes no data to the next layer. If it exceeds the threshold, the node “fires,” which means it passes the processed data along all of its outgoing connections.
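Here is a minimal sketch of that forward pass through a tiny 2-3-1 network. A sigmoid keeps every activation between 0 and 1 (real networks use a variety of activation functions, and the weights here are random, i.e., untrained):

```python
import math
import random

random.seed(0)

def sigmoid(z):
    """Squash any number into the range 0-1."""
    return 1 / (1 + math.exp(-z))

def layer(activations, weight_rows, biases):
    """One layer: each node takes a weighted sum of the previous layer's
    activations, then squashes the result to a value between 0 and 1."""
    return [sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weight_rows, biases)]

# A tiny 2 -> 3 -> 1 network with random (untrained) parameters.
w_hidden = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
b_hidden = [0.0, 0.0, 0.0]
w_output = [[random.uniform(-1, 1) for _ in range(3)]]
b_output = [0.0]

x = [0.5, 0.8]                              # input layer activations
hidden = layer(x, w_hidden, b_hidden)       # hidden layer activations
output = layer(hidden, w_output, b_output)  # output layer
print(output)                               # a single value between 0 and 1
```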
And that’s a wrap! I hope this post helped you gain a deeper understanding of how machine learning developments are fueling exciting AI-enabled technologies. Learn more about Coda AI. This article was adjusted and reposted with permission from Janvi Kalra.