For the first phase of the AI Apprenticeship Programme (AIAP Batch 6), all apprentices went through a two-month training phase. Subsequently, we were assigned a project from our preferred AI Project Tracks for the next seven months. By a twist of fate, the three of us were selected for one of the 100E projects, in which we help an external client company solve real-world business problems using AI.
Our project description is as follows:
While the project description is simple, doing the actual project was not. Coming from different backgrounds without any prior ML/AI experience, none of us was really equipped on day one with what was needed to deliver the Minimum Viable Product (MVP). Nevertheless, we were all excited to work on this project.
To get started, we broke down our project into several key questions:
- What annotation format should we use for object detection?
- How should we handle sequential data?
- Which model algorithm should we choose?
- How should we choose the set of augmentations to use?
- What evaluation metrics should we use?
By addressing these questions, we believed we would ultimately be able to tackle the project sufficiently. In the rest of the article, we will go through each of them in more detail.
To begin, we needed labelled data from our client, so our first task was to propose the data format for our client to collect. For object detection, the data generally comes in two files:
- Image File
- Annotation File
A specific format had to be chosen, and we eventually agreed upon:
- Image File in JPEG format
- Annotation file in Pascal VOC format stored in XML
For every image, there is a corresponding XML file following the same naming convention. The XML file stores information on every labelled object in that particular image; the key fields to note are the class label and the bounding box coordinates. A screenshot example of the image and annotation file that we used can be found below:
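To make the format concrete, here is a minimal sketch of reading one Pascal VOC annotation file using only Python's standard library. The sample XML and function name are invented for illustration; only the tag names follow the Pascal VOC convention.

```python
import xml.etree.ElementTree as ET

# A minimal, made-up Pascal VOC annotation for illustration.
SAMPLE_XML = """
<annotation>
  <filename>frame_0001.jpg</filename>
  <object>
    <name>car</name>
    <bndbox>
      <xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax>
    </bndbox>
  </object>
</annotation>
"""

def parse_voc(xml_text):
    """Return a list of (class_label, (xmin, ymin, xmax, ymax)) per object."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        box = obj.find("bndbox")
        coords = tuple(int(box.findtext(tag))
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((label, coords))
    return objects

print(parse_voc(SAMPLE_XML))  # [('car', (48, 240, 195, 371))]
```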
To find out more about other common annotation formats for object detection, visit this link:
Next, as with any model training, preparing quality data is of utmost importance if we want to get the most out of our model. To ensure the quality of our data, a key problem to resolve was the issue of sequential data in our project.
Because our data inputs consist of images taken from video streams, many images are similar to one another, as they are likely taken from the same video sequence at slightly different frame numbers. An illustration of sequential data can be seen below:
As we would have to split the data into train/validation/test sets eventually, doing so with sequential data would cause data leakage. Data leakage means information from training data is leaked into validation or test sets, causing model test results to be overly optimistic compared to production results.
To mitigate data leakage, we group images from the same sequence before splitting them into the various sets. We wrote a function in our data pipeline specifically to address this.
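As a sketch of the idea (the function name and filename convention below are our own for illustration, not the actual pipeline code), a group-aware split shuffles whole sequences rather than individual frames, so no sequence straddles two subsets:

```python
import random

def split_by_sequence(images, seq_id, test_frac=0.2, seed=42):
    """Split image filenames so that all frames of a video sequence
    land in the same subset, preventing data leakage."""
    groups = {}
    for img in images:
        groups.setdefault(seq_id(img), []).append(img)
    seq_ids = sorted(groups)
    random.Random(seed).shuffle(seq_ids)          # shuffle sequences, not frames
    n_test = max(1, round(test_frac * len(seq_ids)))
    test = [f for s in seq_ids[:n_test] for f in groups[s]]
    train = [f for s in seq_ids[n_test:] for f in groups[s]]
    return train, test

# Hypothetical frames named "<sequence>_<frame>.jpg".
images = [f"vid{v}_frame{f}.jpg" for v in range(5) for f in range(3)]
train, test = split_by_sequence(images, seq_id=lambda name: name.split("_")[0])
```

The same grouping idea extends to a three-way train/validation/test split.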
After setting up the data, the follow-up question is model selection. For an object detection problem, you cannot really go wrong with either Faster R-CNN or a YOLO (You Only Look Once) variant. In our case, we looked at both Faster R-CNN and YOLOv4.
More often than not, which model to select is directly linked to which evaluation metric matters more. In general, there are two axes along which to evaluate an object detection model: accuracy and inference speed.
For accuracy, Faster R-CNN has the advantage most of the time. One of the earliest algorithms of its kind, it is a so-called two-shot object detection model involving two main stages: region proposal followed by classification. Being a region-based detector, it tends to be more accurate than other models.
However, the trend is shifting towards single-shot detectors, e.g. the YOLO variants. Known for its near real-time inference speed, YOLO was at its 4th recognised version as of early 2021. It is worth noting that since the first version was released in 2015, many have worked on improving the YOLO algorithm for better accuracy and inference speed.
While the constant improvements made to YOLO are certainly for the better in the long run, they indirectly make applying the algorithm in the short run comparatively more complicated: we can expect fewer reference materials or applications for any particular version of YOLO than for more stable algorithms such as Faster R-CNN. This is something to bear in mind when using algorithms that are constantly being tweaked.
| Faster R-CNN | YOLOv4 |
| --- | --- |
| Paper released in 2015 | YOLOv1 released in 2015; YOLOv4 paper released in 2020 |
| Two-shot architecture (Region Proposal + Classification) | Single-shot architecture |
| Generally more accurate | Higher inference speed |
*Note that an unofficial YOLOv5 has been out there since June 2020, but it has not yet been recognised by many members of the community. To read more about the controversy, visit the link:
To further improve our model performance, we also used augmentations in our training data.
Image augmentation is the process of creating more data from the existing data, which further boosted the diversity of our dataset. An illustration of image augmentation is shown below:
The augmentation library we used was albumentations; the illustration above is taken from its GitHub repository. More information about the library can be found here: https://github.com/albumentations-team/albumentations
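The key point when augmenting detection data is that any geometric transform must be applied to the bounding boxes as well as the pixels. As a hand-rolled sketch of what a library like albumentations handles for you (the function below is our own illustration, not albumentations' API), a horizontal flip mirrors the x-coordinates of each Pascal VOC box:

```python
import numpy as np

def hflip_with_boxes(image, boxes):
    """Horizontally flip an image and its Pascal VOC boxes
    (x_min, y_min, x_max, y_max); note x_min and x_max swap roles."""
    width = image.shape[1]
    flipped = image[:, ::-1].copy()               # mirror the pixels
    new_boxes = [(width - x_max, y_min, width - x_min, y_max)
                 for (x_min, y_min, x_max, y_max) in boxes]
    return flipped, new_boxes

img = np.zeros((100, 200, 3), dtype=np.uint8)     # 200 px wide dummy image
flipped, boxes = hflip_with_boxes(img, [(10, 20, 50, 80)])
print(boxes)  # [(150, 20, 190, 80)]
```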
Understanding Evaluation Metrics
As mentioned in the earlier section, there are two types of evaluation metrics:
- Accuracy
- Inference Speed
Measuring inference speed is straightforward, as Frames Per Second (FPS) is really the only accepted metric: it measures how many frames/images the object detection model can infer in one second.
In comparison, there are many more ways to measure accuracy. But before we discuss these accuracy metrics, we need to understand a few basic concepts.
1) Two sets of information should be present before we can evaluate the model
- Ground Truth Information
- Prediction Output
2) Next, define what counts as a good prediction. Commonly, we consider a prediction a match when:
- IOU > IOU Threshold, against a Ground Truth
- Matching Class Label with Ground Truth
- Prediction Confidence > Confidence Threshold
- Note that a Ground Truth cannot be matched twice. In practice, Non-Maximum Suppression (NMS) is commonly used to reduce the number of poor predictions, which includes these duplicate cases. An NMS Threshold will have to be set as well.
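As a minimal sketch of greedy NMS (the box format and function names here are our own assumptions): keep the highest-confidence box, discard any remaining box that overlaps it beyond the NMS Threshold, and repeat with the next survivor.

```python
def iou(a, b):
    """IoU of two Pascal VOC boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(boxes, scores, nms_thr=0.5):
    """Return indices of boxes kept after greedy NMS."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                # highest remaining confidence
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= nms_thr]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
keep = nms(boxes, scores=[0.9, 0.8, 0.7])
print(keep)  # [0, 2]: the near-duplicate second box is suppressed
```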
Evaluation Metrics (mAP)
Deciding on specific threshold values to best evaluate the model is not an easy task. Hence, people have come up with mean average precision (mAP) to evaluate the model instead. At a given IOU Threshold, mAP calculates the average of the precision over all possible confidence thresholds, averaged across the classes.
Following the COCO standard, the widely accepted convention for measuring and presenting the accuracy of object detection models, we calculate mAP in 6 ways altogether (at various IOU Thresholds and object sizes). These 6 metrics are often generated together; among them, AP@IOU = 0.5:0.05:0.95 is the common choice, and also ours, for the decision metric when selecting between models.
Being by far the most comprehensive mAP calculation of the 6, its algorithm is as follows:
- starting from IoU = 0.5,
- increase in steps of 0.05 up to IoU = 0.95,
- then calculate the average of all the resulting APs (AP@0.5, AP@0.55, …, AP@0.95).
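Numerically, the primary metric is just the mean of ten APs, one per IoU threshold. A small sketch, where `ap_at_iou` is a hypothetical stand-in for a full per-threshold AP computation:

```python
def coco_map(ap_at_iou):
    """Average AP over the IoU thresholds 0.5, 0.55, ..., 0.95."""
    thresholds = [0.5 + 0.05 * i for i in range(10)]
    return sum(ap_at_iou(t) for t in thresholds) / len(thresholds)

# Hypothetical example: an AP that degrades linearly as the
# IoU requirement tightens.
print(round(coco_map(lambda t: 1.0 - t), 3))  # 0.275
```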
In summary, these are the 6 standard COCO metrics:
| # | Metric | Description |
| --- | --- | --- |
| 1 | AP@IOU = 0.5:0.05:0.95 | Primary Challenge Metric |
| 2 | AP@IOU = 0.5 | Pascal VOC Metric |
| 3 | AP@IOU = 0.75 | Strict Metric |
| 4 | AP@Small | AP for small objects: area < 32×32 |
| 5 | AP@Medium | AP for medium objects: 32×32 < area < 96×96 |
| 6 | AP@Large | AP for large objects: area > 96×96 |
* Note that in COCO, AP = mAP.
To read up more on these evaluation metrics, you can visit this link:
Using mAP, we can determine our most accurately trained object detection model. But can we really use this model to infer new images right away? No. If we don't apply any filtering thresholds, the raw inference output will not be as useful as we would expect: there would be many predictions that are either "empty" or duplicated. To obtain a meaningful inference, the output must be refined, which we can do by setting the NMS and confidence thresholds.
An illustration of the inference output with and without filtering is shown below:
To determine the optimal threshold values to use, we can make use of two universal machine learning concepts:
| Metric | Definition |
| --- | --- |
| Precision = TP/(TP+FP) | The ability of a classifier to identify only relevant objects. |
| Recall = TP/(TP+FN) | The ability of a classifier to find all the ground truths. |
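In code, assuming the TP/FP/FN counts are already available, both metrics are one-liners (the guards against empty denominators are a common convention, not part of the definitions):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from TP/FP/FN counts; TN is not needed."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

print(precision_recall(tp=8, fp=2, fn=8))  # (0.8, 0.5)
```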
To do so, we first filter the predictions with various confidence and NMS threshold values. Then, for each set of threshold values, we calculate the resulting TP/FP/FN counts using these definitions:
| Outcome | Definition | What to count |
| --- | --- | --- |
| True Positive (TP) | IOU > IOU Threshold and correct class label | Number of matches |
| False Positive (FP) | IOU ≤ IOU Threshold, or IOU > IOU Threshold but wrong class label | Number of predictions that are not a match |
| False Negative (FN) | Ground truth not detected by the model | Number of ground truths not detected |
| True Negative (TN) | Every part of the image where we did not predict an object | Not possible to calculate for object detection; hence, we ignore TN |
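The matching rules above can be sketched as follows. The box format and function names are our own assumptions; predictions are assumed to be pre-filtered by confidence and sorted in descending confidence, and each ground truth may be matched at most once:

```python
def iou(a, b):
    """IoU of two Pascal VOC boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def count_tp_fp_fn(preds, gts, iou_thr=0.5):
    """preds/gts: lists of (box, class_label), preds sorted by
    descending confidence. A ground truth is matched at most once."""
    matched, tp = set(), 0
    for pbox, plabel in preds:
        best, best_iou = None, iou_thr
        for i, (gbox, glabel) in enumerate(gts):
            if i in matched or glabel != plabel:
                continue
            overlap = iou(pbox, gbox)
            if overlap > best_iou:        # IoU must exceed the threshold
                best, best_iou = i, overlap
        if best is not None:
            matched.add(best)
            tp += 1
    fp = len(preds) - tp                  # predictions without a match
    fn = len(gts) - len(matched)          # ground truths never detected
    return tp, fp, fn
```

For example, a prediction that overlaps a ground truth perfectly but carries the wrong class label counts as an FP, and the unmatched ground truth as an FN.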
Finally, we measure the precision and recall of the trained model at various levels of:
- IOU Threshold*
- Confidence Threshold
- NMS Threshold
Depending on whether precision or recall is more important, we can then decide the threshold values of confidence and NMS to use for our final inference output.
*Note that we do not have to tune the IOU Threshold. This is because in production there would be no labelled ground truth against which to compute IOU for the predictions; the IOU Threshold is introduced only to allow the calculation of precision and recall.
Indeed, there are many calculations to be made when trying to find the optimal threshold values for both confidence and NMS. Visualizing the filtered inference output on the test images also provides valuable insight into the model's inference capabilities. In our project, we did so with a visualization dashboard built using Streamlit in Python, from which various insights could be drawn.
Below are some features of the dashboard*:
* Note that the figures in the dashboard are masked due to NDA.
- Visualization of the Filtered Inference Output
On the left, we are able to visualize the filtered inference output onto a test image.
On the right, we have a detailed breakdown of every single ground truth and prediction.
- Analysis of Mislabel Class
From this chart, we were able to find out which class labels the model often confused. For example, the 124 counts for [6, 3] mean that the prediction did not classify the matched ground truth with the correct class label (it predicted class 3 when it should have been 6).
In conclusion, object detection has always been an interesting AI problem to tackle. The problem is often simple to explain, yet difficult to solve well. There were many issues to address along the way, as covered in the sections above. We tried our best on the project, and the journey has been a fulfilling one.
We hope that reading about our project gives you an easier path the next time you tackle a problem involving object detection. Cheers!
This article was contributed by the team of graduates from batch 6 of the AI Apprenticeship Programme (AIAP)® consisting of Andy Au, Alvin Kwee and Calvin Neo.