How I improved a Human Action Classifier to 80% Validation Accuracy in 6 Easy Steps

How many of you are master procrastinators? 🙌 If you are, you have come to the right place.

In this post, I would like to share with you guys some tips and tricks I have picked up during my time as a Data Scientist and how I used them to quickly beef up my model. You shall also see an ensemble approach of performing Human Action Classification on the University of Texas at Dallas Multimodal Human Action Dataset (UTD-MHAD). The ensemble achieved a validation accuracy of 0.821 which is a significant improvement from the baseline paper’s accuracy of 0.672.


Background (The Problem)

I was tasked to apply data fusion on UTD-MHAD to build a model to classify 27 different human actions, and like all master procrastinators, I left it to the last week to start doing it. *MAXIMUM PRESSURE = HIGHEST PRODUCTIVITY!*

The UTD-MHAD is an open dataset collected from a Kinect camera and one wearable inertial sensor. The dataset contains 27 actions performed by 8 subjects (4 females and 4 males) with each subject repeating each action 4 times. After removing 3 corrupted sequences, the dataset is left with 861 data sequences. The dataset contains 4 data modalities, namely:

  1. RGB videos (spatio-temporal)
  2. Depth videos (spatio-temporal)
  3. Skeleton joint positions (spatio-temporal)
  4. Inertial sensor signals (temporal)

All 4 modalities were time synchronized. The RGB videos are stored in .avi format and the other three modalities in .mat format.

Task: Beat the baseline accuracy of 0.672

The dataset came with a paper (C. Chen, 2015) which uses a Collaborative Representation Classifier (CRC) that had a validation accuracy of 0.672. This was calculated on a train-validation split where subjects 1, 3, 5, 7 were used for training and subjects 2, 4, 6, 8 for validation. It was also the baseline accuracy I had to beat!
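The subject-based split described above can be sketched in a few lines of NumPy. This is only an illustration: the function name and the idea of carrying a `subject_ids` array alongside the features are my own assumptions, not something taken from the paper or the repo.

```python
import numpy as np

# Subject split used by the baseline paper: odd subjects train, even subjects validate.
TRAIN_SUBJECTS = (1, 3, 5, 7)
VALID_SUBJECTS = (2, 4, 6, 8)

def split_by_subject(features, labels, subject_ids):
    """Split samples into train/validation sets according to who performed them.

    `subject_ids` is a hypothetical array holding the subject number of each sample.
    """
    subject_ids = np.asarray(subject_ids)
    train_mask = np.isin(subject_ids, TRAIN_SUBJECTS)
    valid_mask = np.isin(subject_ids, VALID_SUBJECTS)
    train = (features[train_mask], labels[train_mask])
    valid = (features[valid_mask], labels[valid_mask])
    return train, valid
```

Splitting by subject rather than at random means the validation score measures how well the model generalizes to people it has never seen.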

All fired up, I immediately went on the web to look for existing code and tutorials. After spending roughly 30 minutes searching, I came to realize that there was no reusable code! *STRESS LEVEL INCREASE*. It then dawned upon me that I had to do this all from scratch. I quickly took out a pen and notebook, and started devising my strategy.


Overview of the 6 steps

  1. Understand the data
  2. Quickly prototype
  3. Performance metrics
  4. Automate the parts you can, and ship your training to Google Colab
  5. Google the web and discuss with colleagues to get inspiration
  6. Ensemble your models


Step 1: Understand the data

Before you begin anything, it is important to know what you are dealing with. In this case, the best way is to plot it! I used the NumPy, SciPy, and Matplotlib libraries to do this efficiently. Below are the plots of the Depth, Skeleton, and Inertial data of a subject performing a tennis swing. For more details, please refer to my code posted in my GitHub repo here.

  Video screenshot of a Tennis Swing
Depth videos of a Tennis Swing

Skeleton joint positions of a Tennis Swing
Inertial sensor signals of a Tennis Swing

Now that we have plotted them, we have to convert them to a suitable format to feed our model. My choice is the NumPy array. In this post, I will focus mainly on the Skeleton and Inertial data. For the RGB and Depth videos, special care is required to create a VideoDataGenerator that reads them from disk, as they are too big to load into memory.

The Skeleton and Inertial Data have varying periods and for the Inertial sensors, varying amplitudes. Histogram plotting is an effective way to show the distribution of these.

Period Distribution

Period Distribution of Inertial sensor data

This should not come as a surprise as these are various actions performed by different subjects. The experiment also did not specify how a particular action should be carried out, so I am guessing the subject would just execute the action based off their own experience.

Having these varying periods simply would not fly as our models would require a fixed input shape. We have two strategies to treat this:

  1. zero-pad the signals all to the max length of 326
  2. resample the signals to a mean period of 180
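Both strategies can be sketched with NumPy alone. The sketch below assumes each sequence is an array of shape (timesteps, channels) and uses linear interpolation for the resampling; that is my own choice for illustration (scipy.signal.resample would be another option), not necessarily what the repo does.

```python
import numpy as np

def zero_pad(sequence, target_len=326):
    """Pad a (timesteps, channels) array with trailing zeros up to target_len."""
    pad = target_len - sequence.shape[0]
    if pad < 0:
        raise ValueError("sequence is longer than target_len")
    return np.pad(sequence, ((0, pad), (0, 0)), mode="constant")

def resample(sequence, target_len=180):
    """Linearly interpolate every channel onto target_len evenly spaced time points."""
    old_t = np.linspace(0.0, 1.0, num=sequence.shape[0])
    new_t = np.linspace(0.0, 1.0, num=target_len)
    channels = [np.interp(new_t, old_t, sequence[:, c]) for c in range(sequence.shape[1])]
    return np.stack(channels, axis=1)
```

Zero-padding keeps the raw samples but fills short sequences with zeros, while resampling stretches or compresses every sequence so that the whole input window carries signal.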

Amplitude Distribution

Amplitude Distribution of 3-axial Gyroscope data (min on the left, max on the right)

Amplitude Distribution of 3-axial Accelerometer data (min on the left, max on the right)

The distribution of the amplitude closely resembles a long-tailed distribution. Since the amplitude does not affect the shape of our input data, we could choose not to apply any pre-processing to it. Otherwise, normalization techniques such as mean-variance normalization could be applied for pre-processing.
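For reference, mean-variance normalization can be written as below. Normalizing each channel of each sequence separately is an assumption on my part; normalizing with statistics computed over the whole training set would be an equally valid reading.

```python
import numpy as np

def mean_variance_normalize(sequence, eps=1e-8):
    """Scale every channel of a (timesteps, channels) array to zero mean and unit variance."""
    mean = sequence.mean(axis=0, keepdims=True)
    std = sequence.std(axis=0, keepdims=True)
    # eps guards against division by zero for a constant channel.
    return (sequence - mean) / (std + eps)
```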


Step 2: Quickly prototype

As the Lean Startup approach preaches, “Fail Fast, Fail Cheap”. The next step is to build a light-weight model that allows for quick iteration. Keras, the high-level neural network wrapper written in Python, is the framework of choice for this task. Keras offers a clean, minimalist way to build huge deep learning models with just a few lines of code. You can see how easy it is in the code implementation in the repo. P.S. We will be using it with a TensorFlow backend.

We shall begin by using only the Inertial data. Since the data is a sequence of 6 channels (3 axes for the accelerometer + 3 axes for the gyroscope), the very first model we will build is a Simple LSTM (S. Hochreiter et al., 1997) with an LSTM cell of 512 hidden units.

Minimalist Keras Code to implement the Simple LSTM model 
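A minimal sketch of such a model might look like the following. The 512 hidden units, 6 input channels and 27 classes come from the text; the sequence length of 180, the loss, and the default Adam optimizer are assumptions for illustration, and the code in the repo may differ.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_simple_lstm(timesteps=180, channels=6, num_classes=27, units=512):
    """Single LSTM layer followed by a softmax classifier."""
    model = keras.Sequential([
        keras.Input(shape=(timesteps, channels)),
        layers.LSTM(units),                               # returns only the last hidden state
        layers.Dense(num_classes, activation="softmax"),  # one probability per action
    ])
    # Assumed training setup: one-hot labels with categorical cross-entropy.
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```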

  Network Diagram of Simple LSTM model

Step 3: Performance metrics

With the model created, we now need a reliable feedback system to tell us how the model is performing. As this is a classification task with a well-balanced class distribution, accuracy suffices as the sole performance metric, without the need to calculate precision, recall or F1-score.

To see if our model is over-fitting, we can also plot the training and validation accuracy and loss curves. A 27-class confusion matrix can also be plotted to see which actions are often misclassified as another.
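Both metrics are easy to compute by hand. Here is a small NumPy sketch, assuming the labels and predictions are integer class indices from 0 to 26 (scikit-learn offers equivalent helpers if you prefer a library).

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def confusion_matrix(y_true, y_pred, num_classes=27):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when the same cell appears many times.
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm
```

The diagonal of the matrix counts the correct predictions per action, so a good model lights up the diagonal and leaves everything else close to zero.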

Loss (Left) Accuracy (Right) plots of the Training (Blue) and Validation (Green) set of the Simple LSTM

From the Accuracy-Loss plots, we can see that our model is over-fitting at very early epochs, with our validation accuracy plateauing after the 4th epoch. At epoch 15, we got a model with a validation accuracy of ~ 0.238, which is pretty far off from the baseline of 0.672 we have to beat.

This suggests that we would have to either change strategy or apply more regularization techniques such as Dropout layers.

Confusion matrix of the Simple LSTM model on Inertial data

Oh gosh! This confusion matrix looks like a screenshot of a Minesweeper game! The only saving graces were the “stand to sit” and “sit to stand” actions, for which the model predicted 16 (a perfect score) and 13 correctly, respectively. The other 25 actions had very poor performance.

Before we go off stressing ourselves out, let us take a step back and look at what we have done thus far.

Data Science Pipeline (adapted from Practical Data Science Cook Book)

We have just finished one full iteration of going from Step 1 -> 4 in the above flow chart and we got a first validation accuracy of 0.238. This is nowhere near ideal, but it is a pretty good start. We have set ourselves up with a highly iterative Data Science Pipeline where we can efficiently explore, build, and evaluate our project. Ask any practitioner and they would all agree that Data Science is a highly iterative journey.

With this foundation formed, we can now get creative and try different stuff to improve our model. I shall spare you guys the agony of seeing all the different trials I tried, so in the following sections, I shall just show you all the key results I found using this iterative pipeline.



With this pipeline, I also found that re-sampling the sequences to the mean of 180 leads to better convergence compared to zero padding. Normalization of the amplitude led to no obvious improvement of the model performance, so we would skip it to prevent unnecessary calculation.

Step 4: Automate the parts you can, and ship your training to Google Colab

Since we will most probably be repeating certain steps quite often, it is worthwhile to take some time to automate them. We can convert frequently used code into scripts and abstract it into functions. Your not-so-distant future self will be highly grateful to you for doing this.


Keras Callbacks

The Keras callbacks are one of the best things that can happen to anyone who is trying to dip their toes into deep learning. They are tools that automate your model training, and I shall share 3 of my favorite callbacks which greatly aid me in my various projects.

First, the TensorBoard callback. This allows Keras to save an event log file which is constantly updated during training and can be read and viewed by TensorBoard. This gives you a real-time, graphical visualization of your model training, and I highly recommend it as an alternative to just viewing Keras’s output.

Second, the ModelCheckpoint. This allows your Keras model to save weights to a given file directory. There are useful arguments such as monitor and save_best_only which give you some control over how you want Keras to save your weights.

Last but not least, the EarlyStopping callback. Having this allows Keras to stop your training based on the condition you specify. For my case, as shown below, I set min_delta=0 and patience=5. This means that Keras would stop the training if the model’s validation accuracy has not improved for 5 consecutive epochs.

With these 3 callbacks set in place, we can safely leave our model training while we head out for lunch.

Useful Keras Callbacks
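Put together, the three callbacks might be set up roughly like this. The min_delta=0 and patience=5 values are the ones mentioned above; the log directory, the weights file name and the monitored metric name val_accuracy are assumptions (older Keras versions call the metric val_acc).

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard

def make_callbacks(log_dir="logs", weights_path="best.weights.h5"):
    """Build the three callbacks; both paths are hypothetical placeholders."""
    return [
        # Writes event logs that TensorBoard can display while training runs.
        TensorBoard(log_dir=log_dir),
        # Keeps only the weights of the epoch with the best validation accuracy.
        ModelCheckpoint(filepath=weights_path,
                        monitor="val_accuracy",
                        save_best_only=True,
                        save_weights_only=True),
        # Stops once validation accuracy has not improved for 5 epochs.
        EarlyStopping(monitor="val_accuracy", min_delta=0, patience=5),
    ]

callbacks = make_callbacks()
```

The list is then passed to training through the callbacks argument of model.fit.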


Google Colaboratory

As we all know, training Deep Learning models is a very GPU-intensive process. Luckily for us, Google Colaboratory provides powerful GPU and TPU runtimes for free! Those who cannot afford a powerful GPU can consider shipping their training to Google Colab. Google Colab also provides a familiar Jupyter notebook-like interface, making it very intuitive to use. It can also mount your Google Drive, so you can easily read your data into Colab. Weights and logs can be saved there just as easily.

Step 5: Google the web and discuss with colleagues to get inspiration

With a semi-automated pipeline of fast prototyping and evaluation set up in Steps 2–4, it is time to get inspiration and find creative ways to improve our model’s validation accuracy. Googling different search terms, or going to portals like Google Scholar, ScienceDirect and PubMed, could give us insights. Chatting with colleagues about your problem could lead to serendipitous “Eureka” moments.

A chat with a colleague who was working on a Natural Language Processing (NLP) project gave me the inspiration to try a Bi-Directional LSTM (BLSTM) (M. Schuster et al., 1997). The BLSTM runs two hidden layers over the sequence in opposite directions and connects both to the same output, so the output layer receives information from past and future states simultaneously. Just adding a BLSTM layer roughly doubled my validation accuracy, from 0.238 to 0.465.
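In Keras, making an LSTM bi-directional is a one-line change using the Bidirectional wrapper. The sketch below simply reuses the 512 units and the 180-step input from earlier as assumed values; the exact BLSTM configuration I trained is in the repo.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_blstm(timesteps=180, channels=6, num_classes=27, units=512):
    """LSTM read forwards and backwards, with both final states concatenated."""
    model = keras.Sequential([
        keras.Input(shape=(timesteps, channels)),
        # The wrapper trains one LSTM per direction; outputs are concatenated by default.
        layers.Bidirectional(layers.LSTM(units)),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```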

Network Diagram of Bi-Directional LSTM model


Network Diagram of Conv LSTM model


Conv LSTM model

The main breakthrough came when I added Convolutional layers for feature extraction. As the input data is a 1D signal, this model uses a series of 1D Convolutional and 1D max-pooling layers to extract higher dimensional, latent features before feeding them into 2 LSTM units which capture the temporal information. The output of the LSTM units is then flattened and passed through a Dropout layer with a dropout rate of 0.5, followed by a Dense layer with a softmax activation to classify all 27 actions.

This got my validation accuracy to 0.700 on the Inertial data alone, which was the first time we beat the CRC model baseline of 0.672. For all our models, we used the Adam optimizer (D. P. Kingma et al., 2014) with a learning rate of 1e−4, β1 of 0.9, and β2 of 0.999. We initialized our trainable parameters using the Xavier Glorot initializer (X. Glorot et al., 2010), and set our batch size to 3 to allow our model to generalize better (E. Hoffer et al., 2017).
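To make the architecture and training setup concrete, here is a rough Keras sketch. The layer order, the dropout rate and the Adam settings follow the description above; the number of convolutional blocks, the filter counts, kernel sizes and LSTM widths are placeholder assumptions, so treat this as an outline rather than the exact model in the repo.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_conv_lstm(timesteps=180, channels=6, num_classes=27):
    """1D conv feature extractor -> 2 LSTM layers -> dropout -> softmax."""
    model = keras.Sequential([
        keras.Input(shape=(timesteps, channels)),
        # Filter counts and kernel sizes below are illustrative guesses.
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        # Both LSTMs return full sequences so their output can be flattened.
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(128, return_sequences=True),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    # Keras layers use the Glorot (Xavier) uniform initializer for kernels by default.
    optimizer = keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999)
    model.compile(optimizer=optimizer,
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```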


UNet LSTM model

The UNet (O. Ronneberger et al., 2015) is a Fully Convolutional Neural Network (FCNN) that is almost symmetric in its contraction and expansion paths. In the contraction path, the input is fed through a series of convolutions and max-pooling layers, increasing the number of feature maps and decreasing the resolution of the image. This increases the “what” and decreases the “where”. In the expansion path, the high dimensional, low resolution features are up-sampled via convolutional kernels, and the number of feature maps is reduced during this operation. A novel feature of UNet is that it concatenates the high dimensional features of the contraction path to the low dimensional feature maps of the expansion layers. As with the Conv LSTM, I fed the features extracted by the convolutional network into 2 LSTM units, flattened the output and attached a Dropout layer with a dropout rate of 0.5, finishing off with a Dense layer with a softmax activation to classify all 27 actions. I have attached the Network Diagram in the Appendix below.
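A heavily simplified 1D version of this idea, with a single contraction step, a single expansion step and one skip connection, could be sketched as follows. The depth, filter counts and LSTM widths are illustrative assumptions; the actual network (see the diagram in the Appendix) is deeper.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_unet_lstm(timesteps=180, channels=6, num_classes=27):
    """Tiny UNet-style 1D encoder/decoder feeding 2 LSTM layers (timesteps must be even)."""
    inputs = keras.Input(shape=(timesteps, channels))

    # Contraction path: more feature maps, lower temporal resolution.
    skip = layers.Conv1D(32, 3, padding="same", activation="relu")(inputs)
    down = layers.MaxPooling1D(2)(skip)
    bottleneck = layers.Conv1D(64, 3, padding="same", activation="relu")(down)

    # Expansion path: restore the resolution, reduce the feature maps.
    up = layers.UpSampling1D(2)(bottleneck)
    up = layers.Conv1D(32, 3, padding="same", activation="relu")(up)

    # Skip connection: concatenate contraction features with expansion features.
    merged = layers.Concatenate()([skip, up])

    x = layers.LSTM(64, return_sequences=True)(merged)
    x = layers.LSTM(64, return_sequences=True)(x)
    x = layers.Flatten()(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```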

The UNet LSTM model achieved a validation accuracy of 0.712 on the Inertial data.

Step 6: Ensemble your models

With both the Conv LSTM and the UNet LSTM performing pretty well on the validation data, we can combine their softmax outputs by taking the average. This immediately increased the validation accuracy to 0.765!

For most Supervised Learning problems, an ensemble tends to outperform a single model. This is currently understood to be because of its ability to traverse the hypothesis space: an ensemble can arrive at a better hypothesis that is not in the hypothesis space of any of the single models from which it is built.

Empirically, ensembles tend to yield better results when there is diversity among the models (L. Kuncheva et al., 2003). From the Confusion Matrices shown below, we can see that the Conv LSTM is able to pick up actions like swipe to the right and squat better, while the UNet LSTM is able to pick up actions like basketball shoot and draw x better. This indicates that there is diversity between the two models and, true enough, by ensembling them we lifted the validation accuracy from 0.700 and 0.712 to 0.765!

Confusion Matrices of Conv LSTM (left) and UNet LSTM (right) on Inertial data

Below is the equation I used to create the ensemble. For code implementation, please refer to the repo.

Average of the softmax output for the Conv LSTM and UNet LSTM
Softmax output for an action j
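In code, the averaging boils down to a mean over the models' softmax outputs followed by an argmax. The sketch below assumes each model's predictions are stored as an array of shape (N, 27); the function names are my own.

```python
import numpy as np

def ensemble_average(*softmax_outputs):
    """Average the per-class probabilities predicted by several models."""
    stacked = np.stack(softmax_outputs, axis=0)  # (models, N, classes)
    return stacked.mean(axis=0)                  # (N, classes)

def predict_classes(probabilities):
    """Pick the action with the highest averaged probability."""
    return probabilities.argmax(axis=1)
```

With Keras models, the inputs would typically be the arrays returned by calling predict on the validation data for the Conv LSTM and the UNet LSTM.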



Combining with the Skeleton data

To achieve the promised 80% validation accuracy stated in the title, I added the Skeleton data, also resampled to a period of 180 units. After fusing this with the 6-channel Inertial data, we have an input shape of (N, 180, 66), where N is the number of samples. A table of all the validation accuracies is compiled below.
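The fusion itself is just a concatenation along the channel axis once both signals share the same length. In this sketch I assume the skeleton of one sample arrives as an array of shape (20 joints, 3 coordinates, frames) and the inertial signal as (timesteps, 6); that layout and the linear-interpolation resampling are assumptions for illustration.

```python
import numpy as np

def resample(sequence, target_len=180):
    """Linearly interpolate a (timesteps, channels) array to target_len steps."""
    old_t = np.linspace(0.0, 1.0, num=sequence.shape[0])
    new_t = np.linspace(0.0, 1.0, num=target_len)
    channels = [np.interp(new_t, old_t, sequence[:, c]) for c in range(sequence.shape[1])]
    return np.stack(channels, axis=1)

def fuse_sample(skeleton, inertial, target_len=180):
    """Return one (target_len, 66) array from a skeleton and an inertial sequence."""
    frames = skeleton.shape[-1]
    # Flatten 20 joints x 3 coordinates into 60 channels, with time as the first axis.
    skeleton_flat = skeleton.reshape(-1, frames).T
    return np.concatenate(
        [resample(skeleton_flat, target_len), resample(inertial, target_len)],
        axis=1,
    )
```

Stacking the fused samples with np.stack then gives the (N, 180, 66) input described above.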

Summary of Validation Accuracy of the Different Models

Lo and behold, the confusion matrix of our best performing model with a validation accuracy of 0.821 is shown below.



Congratulations on making it all the way here! If you have followed these steps thoroughly, you will have successfully built your very own ensembled Human Action Classifier!


Model zoo


Some key takeaways

  • Plotting is a quick and easy way to understand your data
  • Data Science is a highly iterative process
  • Automate the things you can
  • Ensembling is a quick way to get the most bang for your buck out of your trained models
  • Use Google Colab to increase your training speed
  • Keras is the framework of choice for quick prototyping of deep learning models

If you are up for a challenge and feel that 0.821 is not enough, you may read the following subsection to improve your model.


What more could be done

A. Issue of over-fitting

Throughout our training, over-fitting at early epochs was the main recurring challenge we faced. We tried adding Dropout layers and ensembling to make our model generalize better, but we can still go further. Over-fitting tends to happen when our model tries to learn high frequency features that may not be useful. Adding zero-mean Gaussian noise, which has components at all frequencies, might enhance the learning capability of our model. Similarly, the time sequences of different subjects are quite varied even for the same activities. Performing data augmentation using time scaling and translation would increase the amount of training data, allowing our model to generalize better.
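I have not tried these augmentations in this project, so the following is only a sketch of what they could look like for a (timesteps, channels) sequence. The noise level, the padding by edge values and the crop-or-pad handling after scaling are all assumptions.

```python
import numpy as np

def add_gaussian_noise(sequence, sigma=0.01, rng=None):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    rng = np.random.default_rng() if rng is None else rng
    return sequence + rng.normal(0.0, sigma, size=sequence.shape)

def _interpolate(sequence, new_len):
    old_t = np.linspace(0.0, 1.0, num=sequence.shape[0])
    new_t = np.linspace(0.0, 1.0, num=new_len)
    return np.stack([np.interp(new_t, old_t, sequence[:, c])
                     for c in range(sequence.shape[1])], axis=1)

def time_scale(sequence, factor):
    """Stretch (factor > 1) or compress (factor < 1) in time, keeping the original length."""
    length = sequence.shape[0]
    scaled = _interpolate(sequence, max(2, int(round(length * factor))))
    if scaled.shape[0] >= length:
        return scaled[:length]  # crop the stretched signal
    pad = length - scaled.shape[0]
    return np.pad(scaled, ((0, pad), (0, 0)), mode="edge")  # hold the last value

def time_shift(sequence, shift):
    """Translate in time by `shift` steps (positive = delay), padding with edge values."""
    length = sequence.shape[0]
    if shift == 0:
        return sequence.copy()
    if shift > 0:
        return np.pad(sequence, ((shift, 0), (0, 0)), mode="edge")[:length]
    return np.pad(sequence, ((0, -shift), (0, 0)), mode="edge")[-length:]
```

Each training sequence could then be passed through a random combination of these functions to generate extra samples with the same label.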

On a side note, our model could also be trimmed further to reduce its complexity and, with it, the risk of over-fitting. With recent Neural Architecture Search papers such as NAS (B. Zoph, 2016), NASNet (B. Zoph, 2017) and Efficient-NAS (H. Pham, 2018) gaining traction, we could also try applying them, since this is also a classification task.

B. Data Fusion of RGB and Depth Data

We played with the Inertial data, and added the Skeleton data towards the end to give our data-hungry models more information. To push our model further, we would have to find ways to fuse it with the Depth and RGB data. This would give the model more input variables to learn and extract features from, which could improve the validation accuracy.

C. Try other Ensemble Learning Techniques

Instead of doing a simple average, we could try more advanced ensemble learning approaches such as Boosting and Bagging.

Special thanks to Raimi and Derek for proofreading and giving me feedback on this article.

For original post, visit me here!

Feel free to connect with me via Twitter or LinkedIn!

If you are interested in other projects that I have worked on, feel free to visit my GitHub!


Network Diagram of UNet LSTM model

Thanks to Derek Chia and Raimi Bin Karim.