My journey with the team unofficially started one night at a gathering in the ThoughtWorks office. Andy, who had taken note of some fellow ThoughtWorkers with interests in Artificial Intelligence / Machine Learning, also noticed that I was in between projects. He explained that RIOT is an installation whose narrative takes a turn depending on how you react to it, with your facial expression interpreted by a machine. Having gone through a few immersive theatre / “choose-your-own-adventure” experiences before, I found that it sounded both familiar enough to click and refreshingly interesting.
Just like that, the next day, I started pairing with Angelica, as our mission in the overall project became clearer: improve the accuracy of the facial emotion recognition portion of the installation, to subsequently help improve the overall experience of the installation.
Karen had collaborated with Dr. Hongying Meng and a few of his students on the prototype RIOT installation before, and we used that as a starting point. They had two approaches implemented in MATLAB: one using a time-delay neural network (TDNN; architected based on this paper), and another using deep learning. We did not have the source code of these MATLAB implementations, but we had access to a trained MATLAB model (presumably from the TDNN implementation).
The Big Role of Data
It soon became clear to us that, like in most machine learning projects, data plays a huge part in our success / failure. This is one of our biggest challenges – there aren’t many large datasets of facial expressions with emotion labels in the first place, and the relevant datasets we found are typically made available for research purposes. But more than that, we need data to benchmark our experiments and tweaks against the system used in the prototype, because without that, we don’t even know if we’re getting closer to achieving our goal or moving away from it. We just won’t be effective.
The dataset used for this benchmarking should also ideally be representative of the installation, as that is where we want this system to eventually be used and to perform well.
Planning the data collection and lining things up for it would take a while, so in the meantime Angelica and I started exploring other machine learning approaches.
Facial Emotion Recognition: State of the Art
As non-specialists in the field of facial emotion recognition, we got a sense of the best-performing approaches mostly from research papers we found on arXiv, especially ones reporting that their approach had done well in a competition such as Emotion in the Wild. Emotion in the Wild is particularly relevant to us because its entries are benchmarked against facial expressions captured in the wild (not in a lab setting). These expressions have been semi-automatically extracted from movies, and therefore come with varied lighting conditions, backgrounds, head poses, etc.
A technique that I think is going to be key for us is transfer learning, which allows us to take pre-trained neural network architectures that have been proven to work well in general image classification tasks and retrain them to classify emotions instead. One example is Inception-V3, which can be retrained with a few scripts in TensorFlow. Keras has made it straightforward to do the same with even more neural network architectures.
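To make the idea concrete, here is a minimal sketch of transfer learning with Keras, not the code we actually ran. It assumes TensorFlow 2 is installed. The function name build_emotion_classifier and the choice of 7 emotion classes are hypothetical, chosen only for illustration. The pre-trained Inception-V3 layers are frozen, and only a small new classification layer on top is trained.

```python
import tensorflow as tf


def build_emotion_classifier(num_classes=7, weights="imagenet"):
    """Return an Inception-V3 based classifier with a new emotion output layer.

    num_classes=7 is an illustrative assumption, not a project decision.
    weights="imagenet" downloads the pre-trained ImageNet weights.
    """
    # Pre-trained feature extractor without its original 1000-class top layer.
    base = tf.keras.applications.InceptionV3(
        include_top=False,
        weights=weights,
        input_shape=(299, 299, 3),
        pooling="avg",  # global average pooling gives one feature vector per image
    )
    base.trainable = False  # freeze the pre-trained layers

    # Inputs are expected to be preprocessed with
    # tf.keras.applications.inception_v3.preprocess_input.
    inputs = tf.keras.Input(shape=(299, 299, 3))
    features = base(inputs, training=False)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(features)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # integer emotion labels
        metrics=["accuracy"],
    )
    return model
```

Calling build_emotion_classifier() and then model.fit on labelled face images would train only the final layer. Unfreezing some of the upper Inception-V3 layers afterwards, with a low learning rate, is a common follow-up step that we may or may not need.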
What Lies Ahead
Having explored state-of-the-art techniques in facial emotion recognition, what we’re preparing ourselves for is rapid experimentation, to find the one that works the best in our specific situation — the RIOT installation. As described above, determining which is best will require us to have our custom data set that is representative of the eventual RIOT environment.
With that nailed, we will experiment further with various techniques for synthetically expanding our training dataset (random scaling, crops, brightness adjustments, etc.), with feature extraction, and with the hyperparameters of the system. Lastly, as we’re likely to come up with a system that classifies emotion based on static images but need it to eventually analyze a real-time video feed, we will experiment with various techniques to smooth out our emotion predictions within a given time window.
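As one illustration of the smoothing idea, the sketch below averages per-frame class probabilities over a sliding window of recent frames and reports the emotion with the highest average. This is only one possible technique and we have not settled on it. The class name EmotionSmoother, the labels, and the window size are all hypothetical.

```python
from collections import deque


class EmotionSmoother:
    """Average per-frame emotion probabilities over the last few frames."""

    def __init__(self, labels, window_size=5):
        if window_size < 1:
            raise ValueError("window_size must be at least 1")
        self.labels = list(labels)
        # deque with maxlen drops the oldest frame automatically
        self.window = deque(maxlen=window_size)

    def update(self, probabilities):
        """Add one frame's probabilities; return (smoothed label, averages)."""
        if len(probabilities) != len(self.labels):
            raise ValueError("expected one probability per label")
        self.window.append(list(probabilities))
        count = len(self.window)
        averaged = [
            sum(frame[i] for frame in self.window) / count
            for i in range(len(self.labels))
        ]
        best = max(range(len(averaged)), key=averaged.__getitem__)
        return self.labels[best], averaged
```

With a window of three frames, a single frame that briefly flips to another emotion does not change the reported label, while a sustained change does after a short delay. A larger window gives steadier output at the cost of slower reactions, which is a trade-off we would need to tune for the installation.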