Companies are already using sentiment analysis to gauge consumer mood towards their product or brand. By mining tweets, reviews, and other sources, companies can easily derive sentiment from natural language. But what about when consumers are not online? Customer representatives in stores can see when a customer is frustrated or angry, but they can’t be everywhere at once. However, companies have been using in-store cameras to monitor customer behavior. Orwellian as it may seem, companies use this data to track hot-spots in the store, the paths customers take, and even what they’re specifically looking at. So why not track emotions as customers shop as well?
The primary issue is that it’s difficult to translate contortions of 43 facial muscles into emotions. It’s easy for humans because we’ve had years of practice, but computers see the world as a grid of numbers that represent pixel values. We’re able to look at an image of a person’s face and easily differentiate between a smile and a frown, but for a machine learning model, it’s a much more difficult task. To solve this problem, we’re going to use a deep convolutional neural net implemented in a machine learning framework called Keras.
A convolutional neural net extracts features from 2D data and assigns weights to those features, eventually resulting in a prediction. For example, if we wanted to train a CNN to recognize handwritten numbers, we would have a dataset of 100x100 px images of numbers. The CNN would recognize curves and straight lines in 10x10 px sections, and after detecting these features, the model would learn that combinations of certain curves and lines are indicative of certain numbers. An especially curvy number like 8 would be distinguishable from a straight number like 1 or 7. In the case of facial emotion detection, the upward curves of a smile would be associated with happiness.
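To make this concrete, here’s a toy sketch (not from the original project) of the core idea: a single hand-written “vertical edge” filter slid across a tiny image responds most strongly where the pixel values jump from dark to bright, which is exactly the kind of pattern a CNN learns to detect on its own.

```python
import numpy as np

# Toy example: a 4x6 "image" that is dark on the left and bright on the right.
image = np.array([
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
], dtype=float)

# A 3x3 vertical-edge filter: negative on the left, positive on the right.
vertical_edge_filter = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

# Slide the filter over every 3x3 patch and record the response.
rows = image.shape[0] - 3 + 1
cols = image.shape[1] - 3 + 1
response = np.zeros((rows, cols))
for r in range(rows):
    for c in range(cols):
        patch = image[r:r + 3, c:c + 3]
        response[r, c] = np.sum(patch * vertical_edge_filter)

print(response)  # the largest values sit exactly on the dark-to-bright boundary
```

A real convolutional layer works the same way, except it learns dozens of such filters from the data instead of having them written by hand.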
Why deep learning?
Currently, researchers use the distances between facial landmarks to detect emotion. A face is represented as the position of the nose, eyes, mouth, cheeks, and other areas, and then the distances between those points are calculated. Then, thresholds are established to detect the emotion. If the face in the image is smiling, the cheek positions would be closer to the eyes, the mouth would be stretched, and the eyes would be squinted. This approach works in controlled settings, but what if you can only see half the face in the image? What if the face is slightly turned? To get accurate facial landmarks, you would have to artificially transform the image so that the face is centered and looking straight at the camera. With a deep learning approach, the model can learn to be flexible and detect features in faces no matter how they’re oriented. All you need is data.
I gathered several datasets of images. The images are labeled with the emotions happy, sad, disgusted, angry, surprised, fearful, and neutral. To normalize the images and prepare them for training, I converted all of them to grayscale, cropped and scaled them to 192x192 px, and normalized pixel intensity to values between 0 and 1. Then I converted the images to numpy arrays to be used in training. Because there weren’t enough images in the anger and disgust categories, I merged them into one group, resulting in 6 categories.
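As a rough sketch, the preprocessing described above might look like the following (the use of Pillow and the function name are my own assumptions, not the actual pipeline):

```python
import numpy as np
from PIL import Image

def preprocess(path):
    """Illustrative helper, not the original code."""
    img = Image.open(path).convert("L")              # convert to grayscale
    img = img.resize((192, 192))                     # scale to 192x192 px
    arr = np.asarray(img, dtype=np.float32) / 255.0  # normalize to [0, 1]
    return arr.reshape(192, 192, 1)                  # add a channel axis for Keras

# The preprocessed images would then be stacked into numpy arrays, e.g.:
# X = np.stack([preprocess(p) for p in image_paths])
```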
I built the model with Keras. It makes building and testing models easy by allowing you to simply stack layers. What would’ve taken hundreds of lines of code in TensorFlow, or thousands in plain Python, took about 30 with Keras.
Conv2D
The first convolutional layer sees a 192x192 px image and looks at 5x5 px portions of the image. It has 32 filters, meaning there are 32 different patterns the layer will look for in each 5x5 px portion. The patterns are determined as the model learns from the data, but they typically end up resembling simple edges, curves, and corners.
Each pattern has a weight, and those weights are also adjusted as the model learns. Because the 5x5 portions of the image have 32 different representations, this layer is 188x188x32. You’ll notice that this does not match the original dimension of the image (192x192). This is because the 5x5 portions are created with a sliding window, moving one pixel at a time across the image.
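In Keras, a layer like this is declared in one line; the snippet below (a sketch, with the ReLU activation as my own assumption) confirms the 188x188x32 output shape:

```python
from keras.models import Sequential
from keras.layers import Conv2D

# 32 filters, each scanning a 5x5 window of a 192x192 grayscale image.
# With no padding, the sliding window yields a 188x188x32 output.
model = Sequential()
model.add(Conv2D(32, (5, 5), activation="relu", input_shape=(192, 192, 1)))

print(model.output_shape)  # (None, 188, 188, 32)
```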
MaxPooling2D
This is a subsampling layer, which takes the max value of a 2x2 window. This prevents the model from overfitting, which is important when it has to predict emotions on images it’s never seen before. Each 188x188 slice of the convolutional layer is pooled, resulting in a layer with dimensions 94x94x32.
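Here’s a tiny standalone illustration (not from the article) of what 2x2 max pooling does to a feature map:

```python
import numpy as np

# A 4x4 feature map; each non-overlapping 2x2 window is replaced by its
# maximum value, halving both dimensions.
feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 5, 6, 2],
    [1, 2, 3, 8],
], dtype=float)

pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[4. 2.]
#  [5. 8.]]
```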
Flatten, Dense, and Dropout
The last pooled layer is flattened and fed into a dense layer with a softmax activation function. This is the layer that produces the classification of the emotion, which is why its output is a 1x6 vector. The largest value of this vector corresponds to one of six emotions: happy, sad, fearful, angry, surprised, or neutral. Dropout is used to prevent the model from overfitting to the data. A dropout value of .5 means that in every training cycle, roughly half the “neurons” are randomly left out, so the model will be able to generalize better to new images.
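Stacking these layers in Keras might look roughly like the sketch below. Only the details stated above (192x192 grayscale input, 32 5x5 filters, 2x2 pooling, a dropout of .5, and a 6-way softmax output) come from the article; the activation, optimizer, and loss choices are my assumptions, not the exact architecture used.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
model.add(Conv2D(32, (5, 5), activation="relu", input_shape=(192, 192, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))   # 188x188x32 -> 94x94x32
model.add(Flatten())                        # 94*94*32 values in one long vector
model.add(Dropout(0.5))                     # randomly drop half the activations each update
model.add(Dense(6, activation="softmax"))   # 1x6 vector of emotion probabilities

# Optimizer and loss are assumptions for the sketch.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```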
Training
To train the model, I created an automation script that took “experiments” and created models out of them. I experimented with using larger image sizes, adding more convolutional layers, increasing and decreasing dropout, among other things. The results of the experiments can be found on my
After 15 epochs of training (the model had seen all 40,000 images 15 times), the model was able to guess the correct emotion 60% of the time. This test was run on images the model had never seen before.
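Continuing the sketches above, training and evaluation could look something like this (the variable names and batch size are illustrative; `X_train`, `y_train`, and the held-out sets are assumed to come from the preprocessing step):

```python
# 15 passes over the ~40,000 training images.
history = model.fit(X_train, y_train,
                    epochs=15,
                    batch_size=64,
                    validation_data=(X_val, y_val))

# Accuracy is measured on images the model has never seen.
loss, accuracy = model.evaluate(X_test, y_test)
print(accuracy)  # roughly 0.60 in the experiments described above
```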
Conclusion
I hope this guide made “deep learning” into less of a buzzword and more of a tangible concept to use in your business. If you’re interested in going deeper (pun unintended), check out Andrew Ng’s new course on deep learning.