How I improved a Human Action Classifier to 80% Validation Accuracy in 6 Easy Steps
How many of you are master procrastinators? If you are, you have come to the right place.
In this post, I would like to share with you guys some tips and tricks I have picked up during my time as a Data Scientist, and how I used them to quickly beef up my model. You will also see an ensemble approach to performing Human Action Classification on the University of Texas at Dallas Multimodal Human Action Dataset (UTD-MHAD). The ensemble achieved a validation accuracy of 0.821, a significant improvement over the baseline paper’s accuracy of 0.672.
—
Background (The Problem)
I was tasked to apply data fusion on UTD-MHAD to build a model to classify 27 different human actions, and like all master procrastinators, I left it to the last week to start doing it. *MAXIMUM PRESSURE = HIGHEST PRODUCTIVITY!*
The UTD-MHAD is an open dataset collected from a Kinect camera and one wearable inertial sensor. The dataset contains 27 actions performed by 8 subjects (4 females and 4 males) with each subject repeating each action 4 times. After removing 3 corrupted sequences, the dataset is left with 861 data sequences. The dataset contains 4 data modalities, namely:
- RGB videos (spatio-temporal)
- Depth videos (spatio-temporal)
- Skeleton joint positions (spatio-temporal)
- Inertial sensor signals (temporal)
All 4 modalities were time-synchronized, with the RGB videos stored in .avi format and the remaining three modalities stored in .mat format.
Task: Beat the baseline accuracy of 0.672
The dataset came with a paper (C. Chen et al., 2015) which uses a Collaborative Representation Classifier (CRC) that achieved a validation accuracy of 0.672. This was calculated on a train-validation split where subjects 1, 3, 5, 7 were used for training and subjects 2, 4, 6, 8 for validation, and it was the baseline accuracy I had to beat!
All fired up, I immediately went on the web to look for existing code and tutorials. After spending roughly 30 mins searching, I came to realize that there was no reusable code! *STRESS LEVEL INCREASES*. It then dawned upon me that I had to do this all from scratch. I quickly took out a pen and notebook, and started devising my strategy.
Overview of the 6 steps
- Understand the data
- Quickly prototype
- Performance metrics
- Automate the parts you can, and ship your training to Google Colab
- Google the web and discuss with colleagues to get inspiration
- Ensemble your models
—
Step 1: Understand the data
Before you begin anything, it is important to know what you are dealing with. In this case, the best way is to plot it! I used the NumPy, SciPy, and Matplotlib libraries to do this efficiently. Below are the plots of the Depth, Skeleton, and Inertial data of a subject performing a tennis swing. For more details, please refer to my code posted in my GitHub repo here.
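If you want to reproduce such a plot yourself, here is a minimal sketch for the inertial modality. The file path and the `d_iner` variable key are assumptions based on UTD-MHAD’s naming conventions; check with `scipy.io.whosmat` if your copy of the dataset differs.

```python
import scipy.io
import matplotlib.pyplot as plt

# Load one inertial sequence. The file name and the "d_iner" key are
# assumed from UTD-MHAD's conventions; verify with scipy.io.whosmat().
mat = scipy.io.loadmat("Inertial/a1_s1_t1_inertial.mat")
signal = mat["d_iner"]  # shape (T, 6): 3 accelerometer + 3 gyroscope channels

fig, axes = plt.subplots(2, 1, sharex=True)
axes[0].plot(signal[:, :3])
axes[0].set_title("Accelerometer (x, y, z)")
axes[1].plot(signal[:, 3:])
axes[1].set_title("Gyroscope (x, y, z)")
axes[1].set_xlabel("Time step")
plt.tight_layout()
plt.show()
```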
We have just finished one full iteration of going from Step 1 -> 4 in the above flow chart, and we got a first validation accuracy of 0.238. This is nowhere near ideal, but it is a pretty good first start. We have set ourselves up with a highly iterative Data Science pipeline where we can efficiently explore, build, and evaluate our project. Ask any practitioner and they would agree that Data Science is a highly iterative journey.
With this foundation formed, we can now get creative and try different things to improve our model. I shall spare you guys the agony of seeing all the different things I tried, so in the following sections I shall just show you the key results I found using this iterative pipeline.
Pre-processing
With this pipeline, I also found that resampling the sequences to the mean length of 180 time steps leads to better convergence than zero padding. Normalizing the amplitude led to no obvious improvement in model performance, so we skip it to avoid unnecessary computation.
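One way to do the resampling is with `scipy.signal.resample`; here is a minimal sketch, assuming each sequence is a (T, C) array:

```python
import numpy as np
from scipy.signal import resample

TARGET_LEN = 180  # roughly the mean sequence length across the dataset

def resample_sequence(seq, target_len=TARGET_LEN):
    """FFT-based resampling of a (T, C) sequence to (target_len, C)."""
    return resample(seq, target_len, axis=0)

# Example with a hypothetical 245-step, 6-channel inertial sequence.
seq = np.random.rand(245, 6)
print(resample_sequence(seq).shape)  # (180, 6)
```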
Step 4. Automate the parts you can, and ship your training to Google Colab
Since we will most probably be repeating certain steps quite often, it is worthwhile to take some time and automate them. We can convert frequently used code into scripts and perform functional abstraction on them. Your not-so-future self will be highly grateful that you did.
Keras Callbacks
Keras callbacks are one of the best things that can happen to anyone dipping their toes into deep learning. They are tools that automate your model training, and I shall share 3 of my favorite callbacks which have greatly aided me in my various projects.
First, the TensorBoard callback. This makes Keras save an event log file which is constantly updated during training and can be read and visualized by TensorBoard. This gives you a real-time, graphical view of your model training, and I highly recommend it as an alternative to just watching the output of Keras’s model.fit().

Second, the ModelCheckpoint callback. This allows your Keras model to save its weights to a given file directory. Useful arguments such as `monitor` and `save_best_only` give you some control over how you want Keras to save your weights.

Last but not least, the EarlyStopping callback. This makes Keras stop your training based on a condition you specify. For my case, as shown below, I set `min_delta=0` and `patience=5`. This means that Keras stops the training if it finds that the model’s validation accuracy has not improved for 5 epochs.
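Here is a minimal, self-contained sketch of how the three callbacks might be wired together. The toy model and random data are placeholders for the real setup, and note that older Keras versions name the metric `val_acc` rather than `val_accuracy`.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint, EarlyStopping

# Toy stand-ins for the real (N, 180, 6) inertial data.
X_train, y_train = np.random.rand(32, 180, 6), np.random.randint(0, 27, 32)
X_val, y_val = np.random.rand(8, 180, 6), np.random.randint(0, 27, 8)

model = models.Sequential([
    layers.LSTM(64, input_shape=(180, 6)),
    layers.Dense(27, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    TensorBoard(log_dir="logs"),                       # live training curves
    ModelCheckpoint("best.h5", monitor="val_accuracy",
                    save_best_only=True),              # keep only the best weights
    EarlyStopping(monitor="val_accuracy",
                  min_delta=0, patience=5),            # stop after 5 stagnant epochs
]

model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=100, callbacks=callbacks)
```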
With these 3 callbacks set in place, we can safely leave our model training while we head out for lunch.
Google Colaboratory
As we all know, training Deep Learning models is a very GPU-intensive process. Luckily for us, Google Colaboratory provides powerful GPU and TPU runtimes for free! Those who cannot afford a powerful GPU can consider shipping their training to Google Colab. Google Colab also provides a familiar Jupyter-notebook-like interface, making it very intuitive to use. You can also mount your Google Drive, so you can easily read your data into Colab, and weights and logs can easily be saved back to it.
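Mounting your Drive takes just two lines inside a Colab notebook:

```python
from google.colab import drive

# Mount your Google Drive inside the Colab runtime; data and saved
# weights then live under /content/drive/My Drive/.
drive.mount('/content/drive')
```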
Step 5. Google the web and discuss with colleagues to get inspiration
With a semi-automated pipeline of fast prototyping and evaluation set up in sections 2–4, it is time to get inspiration and find creative ways to improve our model’s validation accuracy. Googling different search terms, or browsing portals like Google Scholar, ScienceDirect, and PubMed, can give us insights. Chatting with colleagues about your problem can spark serendipitous “Eureka” moments.
I was chatting with a colleague working on a Natural Language Processing (NLP) project, and that gave me the inspiration to try a Bi-Directional LSTM (BLSTM) (M. Schuster et al., 1997). The BLSTM adds a second set of hidden layers that process the sequence in reverse and connects both to the same output layer, so the output receives information from past and future states simultaneously. Just adding a BLSTM layer nearly doubled my validation accuracy to 0.465.
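In Keras, a BLSTM layer is essentially a one-liner via the `Bidirectional` wrapper. The layer sizes below are my own placeholders, not the exact architecture (see the network diagram below for that):

```python
from tensorflow.keras import layers, models

# Illustrative sketch only: layer sizes are assumptions, not the
# exact architecture from the repo.
model = models.Sequential([
    layers.Bidirectional(layers.LSTM(64), input_shape=(180, 6)),
    layers.Dropout(0.5),
    layers.Dense(27, activation="softmax"),
])
```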
Network Diagram of Bi-Directional LSTM model
Network Diagram of Conv LSTM model
Conv LSTM model
The main breakthrough came when I added convolutional layers for feature extraction. As the input data is a 1D signal, this model uses a series of 1D convolutional and 1D max-pooling layers to extract higher-dimensional, latent features before feeding them into 2 LSTM units which capture the temporal information. The output of the LSTM units is then flattened, and we attach a Dropout layer with a dropout rate of 0.5 before adding a Dense layer with a softmax activation to classify all 27 actions.
This got my validation accuracy to 0.700 on just the Inertial data, which is the first time we beat the CRC model baseline of 0.672. For all our models, we used the Adam optimizer (D. P. Kingma et al., 2014) with a learning rate of 1e-4, β1 of 0.9, and β2 of 0.999. We initialized our trainable parameters using the Xavier Glorot initializer (X. Glorot et al., 2010), and set our batch size to 3 to allow our model to generalize better (E. Hoffer et al., 2017).
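Putting the pieces together, here is a sketch of the Conv LSTM idea with the training configuration above. The filter counts and kernel sizes are my own assumptions; refer to the repo for the actual architecture.

```python
from tensorflow.keras import layers, models, optimizers

# Sketch of the Conv LSTM idea; filter counts and kernel sizes are
# assumptions, not the exact architecture from the repo.
model = models.Sequential([
    layers.Conv1D(64, 3, activation="relu", input_shape=(180, 6)),
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 3, activation="relu"),
    layers.MaxPooling1D(2),
    layers.LSTM(64, return_sequences=True),   # two stacked LSTM units
    layers.LSTM(64, return_sequences=True),
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(27, activation="softmax"),   # one output per action class
])

# Training configuration described in the text; Glorot initialization
# is already Keras's default for Dense/Conv kernels.
model.compile(
    optimizer=optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```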
UNet LSTM model
The UNet (O. Ronneberger et al., 2015) is a Fully Convolutional Neural Network (FCNN) that is almost symmetric in its contraction and expansion paths. In the contraction path, the input is fed through a series of convolutions and max-pooling operations, increasing the number of feature maps while decreasing the resolution. This increases the “what” and decreases the “where”. In the expansion path, the high-dimensional, low-resolution features are up-sampled via convolutional kernels, and the feature maps are reduced along the way. A novel feature of UNet is that it concatenates the high-dimensional features from the contraction path onto the lower-dimensional feature maps of the expansion layers. As before, I fed the extracted convolutional features into 2 LSTM units, flattened the output, attached a Dropout layer with a dropout rate of 0.5, and finished off with a Dense layer with a softmax activation to classify all 27 actions. I have attached the network diagram in the Appendix below.
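For the curious, here is a rough functional-API sketch of what a 1D UNet with skip connections feeding into LSTMs could look like. The depths and filter counts are my own assumptions; the diagram in the Appendix is the authoritative reference.

```python
from tensorflow.keras import layers, models

inp = layers.Input(shape=(180, 6))
# Contraction path: more feature maps, lower resolution.
c1 = layers.Conv1D(32, 3, padding="same", activation="relu")(inp)
p1 = layers.MaxPooling1D(2)(c1)                      # (90, 32)
c2 = layers.Conv1D(64, 3, padding="same", activation="relu")(p1)
p2 = layers.MaxPooling1D(2)(c2)                      # (45, 64)
b = layers.Conv1D(128, 3, padding="same", activation="relu")(p2)
# Expansion path: up-sample and concatenate the skip connections.
u1 = layers.UpSampling1D(2)(b)                       # (90, 128)
u1 = layers.concatenate([u1, c2])                    # skip from contraction
c3 = layers.Conv1D(64, 3, padding="same", activation="relu")(u1)
u2 = layers.UpSampling1D(2)(c3)                      # (180, 64)
u2 = layers.concatenate([u2, c1])
c4 = layers.Conv1D(32, 3, padding="same", activation="relu")(u2)
# Temporal modelling and classification head, as in the Conv LSTM.
t = layers.LSTM(64, return_sequences=True)(c4)
t = layers.LSTM(64, return_sequences=True)(t)
out = layers.Dense(27, activation="softmax")(
    layers.Dropout(0.5)(layers.Flatten()(t)))

model = models.Model(inp, out)
```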
The UNet LSTM model achieved a validation accuracy of 0.712 on the Inertial data.
Step 6. Ensemble your models
With both the Conv LSTM and the UNet LSTM performing pretty well on the validation data, we can combine their softmax outputs by taking the average. This immediately increased the validation accuracy to 0.765!
For most Supervised Learning problems, the ensemble method tends to outperform a single-model method. This is currently understood to be due to its ability to traverse the hypothesis space: an ensemble can arrive at a better hypothesis that is not in the hypothesis space of any of the single models from which it is built.
Empirically, ensembles tend to yield better results when there is diversity among the models (L. Kuncheva et al., 2003). From the confusion matrices shown below, we can see that the Conv LSTM picks up actions like swipe right and squat better, while the UNet LSTM picks up actions like basketball shoot and draw x better. This indicates that there is diversity between the two models, and true enough, by ensembling them together, we raised the validation accuracy from 0.700 and 0.712 to 0.765!
Confusion Matrices of Conv LSTM (left) and UNet LSTM (right) on Inertial data
Below is the equation I used to create the ensemble. For code implementation, please refer to the repo.
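In essence, it is just the average of the two softmax vectors followed by an argmax. A minimal sketch, with stand-in arrays in place of the real `model.predict` outputs:

```python
import numpy as np

# Stand-ins for the two models' softmax outputs on the validation set;
# in practice these would come from conv_model.predict(X_val) and
# unet_model.predict(X_val), each of shape (N, 27).
probs_conv = np.random.dirichlet(np.ones(27), size=8)
probs_unet = np.random.dirichlet(np.ones(27), size=8)

ensemble_probs = (probs_conv + probs_unet) / 2.0  # average the class probabilities
y_pred = ensemble_probs.argmax(axis=1)            # final predicted action per sample
```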
Combining with the Skeleton data
To achieve the promised 80% validation accuracy stated in the title, I added the Skeleton data, also resampling it to a length of 180 time steps. After fusing this with the 6-channel Inertial data, we have an input shape of (N, 180, 66), where N is the number of samples. A table of all the validation accuracies is compiled below.
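The fusion itself is a simple concatenation along the channel axis. A sketch for a single sample, with hypothetical shapes (20 skeleton joints with 3 coordinates per frame gives 60 channels, plus 6 inertial channels):

```python
import numpy as np
from scipy.signal import resample

# Hypothetical single-sample arrays: a raw skeleton sequence of 20 joints
# x 3 coordinates per frame, and an inertial sequence already at 180 steps.
skeleton = np.random.rand(245, 20, 3)
inertial = np.random.rand(180, 6)

# Flatten the joints into 60 channels, resample to 180 steps, then fuse.
skeleton_180 = resample(skeleton.reshape(len(skeleton), -1), 180, axis=0)  # (180, 60)
fused = np.concatenate([inertial, skeleton_180], axis=-1)                  # (180, 66)
```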
Lo and behold, the confusion matrix of our best performing model with a validation accuracy of 0.821 is shown below.
—
Summary
Congratulations on making it all the way here! If you have followed these steps thoroughly, you will have successfully built your very own ensembled Human Action Classifier!
Model zoo
Some key takeaways
- Plotting is a quick and easy way to understand your data
- Data Science is a highly iterative process
- Automate the things you can
- Ensembling is a quick way to get the best bang for your buck out of your trained models
- Use Google Colab to increase your training speed
- Keras is the framework of choice for quick prototyping of deep learning models
If you are up for a challenge and feel that 0.821 is not enough, you may read the following subsection to improve your model.
—
What more could be done
A. Issue of over-fitting
Throughout our training, over-fitting at early epochs was the main recurring challenge we faced. We tried adding Dropout layers and ensembling to make our model generalize better, but we can still go further. Over-fitting tends to happen when our model tries to learn high-frequency features that may not be useful. Adding zero-mean Gaussian noise, which contains components at all frequencies, might enhance the learning capability of our model. Similarly, the time sequences of different subjects vary considerably, even for the same activities. Performing data augmentation using time scaling and translation would increase the amount of training data, allowing our model to generalize better.
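Here is a hypothetical sketch of both suggestions for a (T, C) sequence: a circular time shift, a random speed factor, and zero-mean Gaussian noise. The parameter values are placeholders, not tuned settings.

```python
import numpy as np

def augment(seq, max_shift=20, scale_range=(0.8, 1.2), noise_sigma=0.01):
    """Hypothetical augmentations for a (T, C) sequence."""
    T = len(seq)
    # Translation: circularly shift the sequence in time.
    seq = np.roll(seq, np.random.randint(-max_shift, max_shift + 1), axis=0)
    # Time scaling: re-index time with a random speed factor
    # (clipped at the last frame when playing "faster").
    factor = np.random.uniform(*scale_range)
    idx = np.clip((np.arange(T) * factor).astype(int), 0, T - 1)
    seq = seq[idx]
    # Zero-mean Gaussian noise, which has power at all frequencies.
    return seq + np.random.normal(0.0, noise_sigma, size=seq.shape)

augmented = augment(np.random.rand(180, 6))  # e.g. one inertial sequence
```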
On a side note, our model could also be trimmed further to reduce its complexity, and with it the risk of over-fitting. With recent Neural Architecture Search papers like NAS (B. Zoph et al., 2016), NASNet (B. Zoph et al., 2017) and Efficient-NAS (H. Pham et al., 2018) gaining traction, we could also try applying them, since this is also a classification task.
B. Data Fusion of RGB and Depth Data
We played with the Inertial data, and we added the Skeleton data towards the end to feed our data-hungry models more information. In order to push our model further, we would have to find ways to fuse it with the Depth and RGB data. This would allow for more input training variables to learn and extract features from, hence improving the validation accuracies.
C. Try other Ensemble Learning Techniques
Instead of doing a simple average, we could try more advanced ensemble learning approaches such as Boosting and Bagging.
—
Special thanks to Raimi and Derek for proof reading and giving me feedback on this article.
—
For original post, visit me here!
Feel free to connect with me via twitter, LinkedIn!
If you are interested in other projects that I have worked on, feel free to visit my Github!
—
Appendix
Network Diagram of UNet LSTM model