Small Object Detection: An Image Tiling Based Approach

Bingi Nagesh
19 min read · Oct 16, 2021


Source: [5]

Contents

Abstract
1. Introduction
2. Related Work
3. Data Collection
4. Exploratory Data Analysis
5. Preprocessing
6. Experimentation and Results
7. Error Analysis
8. Data Pipeline
9. TensorFlow and TensorFlow Lite Conversion
10. Deployment and GUI
11. Conclusion
12. Future Work
Code Reference
Contact Links
References

Abstract

Computer vision is challenging because a machine sees an image as numbers, unlike we humans. To perceive a scene the way we do, the computer has to detect and classify the objects present in it. Popular object detectors such as the R-CNN family or the YOLO family detect medium and large objects well but struggle with small objects. Most of these detectors use Convolutional Neural Networks to extract features for object localization and object classification. The features of small objects may disappear in the deeper layers, which makes small objects hard to detect. One solution is to use high-resolution images for small object detection, but training on high-resolution images is slow, needs a lot of GPU memory, and increases inference time. To overcome this difficulty we have used an image tiling based approach to detect small objects. A custom YOLOv4 (small) model is used for transfer learning, and detections are run on a CPU for performance evaluation. The metric used for evaluation is mAP. Finally, the model is deployed on a local cloud and a GUI is developed to run object detection locally.

1. Introduction

Computer Vision is a challenging task because computers see images as numbers. Object detection is object localization plus object classification. In real-world applications, object detection is one of the crucial steps, for example detecting traffic lights, road signs, and pedestrians in a self-driving car. CNNs are at the heart of modern object detection algorithms. The features of smaller objects may disappear in the deeper layers, and it then becomes difficult for the detector to find them. Example: in the hidden layers of a typical detection architecture, an image of ~512 x 512 is reduced to a feature map of roughly 30 x 30. After that much downsampling, a small object covers only a cell or two of the feature map, so its features effectively disappear in a deeper network and it never gets detected.

Source: SOD-MTGAN: Small Object Detection via Multi-Task Generative Adversarial Network

From the above table, we can see that the mAP for small objects is significantly lower than for medium and large objects.

One solution is to use high-resolution images for small object detection. But training models on high-resolution images is slow and needs a lot of GPU memory, and the time and computational resources needed at inference are also large.

If an object has dimensions 30 x 30 in a 2048 x 2048 image and we divide the image into 512 x 512 blocks (tiles), the object keeps the same 30 x 30 dimensions inside its tile. Comparing the relative areas, the object occupies 0.0214% of the original image but 0.343% of the 512 x 512 tile. So, we have divided the images into 416 x 416 blocks (tiles) and used a custom YOLOv4 (small) model for training. At inference, an input image is divided into 416 x 416 blocks, each block is fed to the model and objects (if any) are detected, and finally the blocks (tiles) are stitched back into the original image with the detected objects.

Note: If the image dimensions are not multiples of 416, the image is resized to the nearest multiple of 416. Example: an image of 5616 x 3744 is resized to 5824 x 3744 (5616 / 416 = 13.5, which rounds to 14, so 14 × 416 = 5824; 3744 / 416 = 9, already a multiple).
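A minimal sketch of this resize rule is shown below, assuming OpenCV for the resize; the helper used in the code repository may differ in its details.

import math
import cv2

def resize_to_multiple(image, tile=416):
    # Resize an image so both sides become multiples of the tile size.
    h, w = image.shape[:2]
    new_w = int(round(w / tile)) * tile   # e.g. 5616 -> 14 * 416 = 5824
    new_h = int(round(h / tile)) * tile   # e.g. 3744 -> 9 * 416 = 3744 (unchanged)
    return cv2.resize(image, (new_w, new_h))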

2. Related Work

In [2], the author collected data for detecting birds around a wind farm. Each bounding box in this dataset is labeled as b (bird), n (non-bird), or u (unknown flying object). The author divided the problem into two sub-tasks: detecting b vs n, and classifying b objects into crows and hawks. Two algorithms were used for detection and classification: Haar-like and HOG features computed on the background-subtracted image with AdaBoost, and a CNN architecture. The Haar-like features performed best on the detection task while the CNN performed best on the classification task. The TPR vs FPR curve was used as the metric.

The architecture in [3] (YOLO-fine) is inspired by the YOLOv3 architecture. YOLOv3 is not able to detect objects smaller than about 8 px because its detection grids are subsampled by factors of 32, 16, and 8. Instead, the author used a subsampling factor of 4 with skip connections, as shown in the image. Based on experiments, the author found that the last two convolutional blocks of Darknet-53 have a huge number of parameters and are not useful for small object detection, so these two blocks were removed. The author kept the same 3-level detection scheme used in YOLOv3, which helps the model search for and detect objects at different scales. Three datasets were used to assess detection performance, namely VEDAI, MUNICH, and xView. On all three datasets, YOLO-fine achieved a higher mAP than peers such as Faster R-CNN, YOLOv2, YOLOv3, and YOLO-tiny.

YOLO-fine architecture [3]

High-resolution Detection Network (HRDNet) [4] internally has two networks: a Multi-Depth Image Pyramid Network (MD-IPN) and a Multi-Scale Feature Pyramid Network (MS-FPN). As shown in the architecture below, the input image I₀ is downscaled by a factor alpha and features are extracted at each scale. MD-IPN therefore generates multi-scale, multi-level features, and MS-FPN is proposed to fuse them: information propagates not only from high-level features to low-level features but also from the deep stream (low resolution) to the shallow stream (high resolution). The author used two setups for experiments: small images from the COCO and Pascal VOC datasets, and VisDrone2019, in which most of the objects are very small. HRDNet achieved a higher mAP than peers such as the R-CNN family and SSD.

HRDNet architecture [4]

In [5], the input image is scaled down and scaled up, a shared CNN produces response maps at each scale, object detection is performed on these response maps, and the results are merged for the final detection. The author used the FDDB and WIDER FACE datasets for experiments; the faces in these datasets are roughly 40–140 px in height. The A-type template is for faces in the 40–140 px range and the B-type template is for faces smaller than 20 px. The author obtained a superior Precision-Recall curve and TPR vs FPR plot compared to peers.

Finding Tiny Faces architecture [5]

SlimNet [6] is a modified version of Faster RCNN architecture. The feature extraction network of Faster RCNN is replaced by SlimNet. The dataset used is the Urban Object Detection dataset. The author achieved superior mAP compared to its peers.

Faster RCNN framework [6]
Network structure of SlimNet [6]

The architecture in [7] is one of the simplest for small object detection. The feature maps from the 3rd, 4th, and 5th layers are concatenated and fed to a fully connected network. Images from the Pascal VOC dataset with object dimensions between 16 x 16 and 42 x 42 are considered for the experiment. The author compared the results of this model with Faster R-CNN and achieved a higher mAP.

Small Object Detection with Multiscale Features [7]

[8] is a blog post on Medium. The author mentions some techniques to improve small object detection:

  • Splitting image into tiles
  • Use of temporal nature of images for background subtraction
  • Modifying anchor box sizes in the detector

The author of the blog also mentions specialized models for small object detection, some of which are discussed above.

3. Data Collection

The Vehicle Detection in Aerial Imagery (VEDAI) dataset is used for training the object detector. The model built on VEDAI is also tested on other datasets, namely MUNICH and VisDrone2019.

4. Exploratory Data Analysis

The VEDAI dataset has images spread over 5 zip files. Each zip file has images taken in both the visible and infrared domains; we considered only the visible-domain images for analysis. Each image has dimensions of 1024 x 1024. The annotations for each image are stored in a separate zip file. Each target in the dataset has been annotated by one human operator in the following way: the coordinates of the center of the vehicle in the image, its orientation (the angle it makes with the horizontal line, modulo π or 2π), the coordinates (in pixels) of its 4 corners, and its class label. In addition, there are two binary flags stating whether the vehicle is occluded and whether it is fully contained in the image.
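The helper functions used in the analysis below come from the code repository linked at the end of the post. Purely as an illustration, a hypothetical parse of one annotation line into an axis-aligned bounding box could look like the sketch below; the field order is assumed from the description above, so check the dataset documentation for the exact column layout.

def parse_vedai_line(line):
    # Hypothetical parser: one annotation line -> (class_id, xmin, ymin, xmax, ymax).
    # Assumed field order: center_x center_y orientation x1 x2 x3 x4 y1 y2 y3 y4 class flag1 flag2
    f = line.split()
    xs = [float(v) for v in f[3:7]]   # x coordinates of the 4 corners
    ys = [float(v) for v in f[7:11]]  # y coordinates of the 4 corners
    return int(f[11]), min(xs), min(ys), max(xs), max(ys)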

Let us find the distribution of classes and percentage area of objects in the dataset.

# get the percentage area and classes from the dataset
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# know_data_ is a helper from the code repository (see Code Reference)
area, classes = know_data_("Vehicules1024/", "Annotations1024/")

# plotting the distribution of percentage area
sns.displot(x=area)
plt.xlabel("Percentage area occupied by objects in an image")
plt.title("Distribution plot of percentage area of objects in an image")
plt.grid()
plt.show()

# quantile information of percentage area
s = pd.Series(area)
s.quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9, .99])

Output:

0.10    0.064087
0.20    0.076103
0.30    0.086117
0.40    0.097713
0.50    0.110149
0.60    0.126457
0.70    0.146866
0.80    0.178375
0.90    0.250206
0.99    0.913994

From the output above, we can see that 99% of the objects occupy less than 1% of the image area, and 90% of the objects occupy at most about 0.25%. In other words, virtually every object in the dataset is small relative to the image.

# distribution of classes
plot_distribution(classes, "Entire dataset:")

Output:

Entire dataset: Counter({1: 1399, 11: 950, 5: 395, 2: 304, 10: 204, 4: 188, 23: 170, 9: 101, 31: 48, 7: 4, 8: 3})

Class names:
1  - car
2  - truck
4  - tractor
5  - camping car
7  - motorcycle
8  - bus
9  - van
10 - other
11 - pickup
23 - boat
31 - plane

From the above output, we can see that there are only 4 motorcycle and 3 bus objects in the entire dataset, so we ignored these two classes for the object detection task.

sample images from truck, pickup, and van classes

Observation: From the above sample images, we can see that truck, pickup, and van look similar.

sample images from other class

Observation: From the above images, we can see that the other class contains different types of construction vehicles.

5. Preprocessing

Let us split the data into train, cross-validation (cv), and test sets.

import os
import random

# make_datasets is a helper from the code repository: it places the sampled
# images and their annotations into the destination folder
# test split (~20% of the images)
test_sample = random.sample(range(len(os.listdir("Vehicules1024/"))), int(0.2 * len(os.listdir("Vehicules1024/"))))
make_datasets("Vehicules1024/", "/content/Annotations1024/", "test/", test_sample)
# cross-validation (cv) split
cv_sample = random.sample(range(len(os.listdir("Vehicules1024/"))), int(0.25 * len(os.listdir("Vehicules1024/"))))
make_datasets("Vehicules1024/", "/content/Annotations1024/", "cv/", cv_sample)
# train split (remaining images)
make_datasets("Vehicules1024/", "Annotations1024/", "train/")
print(len(os.listdir("train")), len(os.listdir("cv")), len(os.listdir("test")))

Output:

(1500, 502, 490)

There are 11 classes for object detection in the dataset, and we have ignored two of them, namely motorcycle and bus. Let us relabel the remaining classes from 1 to 9 and plot the class distribution in the train, cv, and test sets.

# plotting class distribution
train_classes = know_data("train/")
cv_classes = know_data("cv/")
test_classes = know_data("test/")
# plotting distribution
plot_distribution(train_classes, "train set:")
plot_distribution(cv_classes, "cv set:")
plot_distribution(test_classes, "test set:")

Output:

train set: Counter({1: 780, 7: 562, 5: 239, 2: 192, 8: 145, 4: 126, 6: 107, 9: 61, 3: 22})
cv set:    Counter({1: 313, 7: 195, 5: 81, 2: 62, 4: 30, 8: 30, 6: 27, 9: 19, 3: 11})
test set:  Counter({1: 306, 7: 193, 5: 75, 2: 50, 6: 36, 4: 32, 8: 29, 9: 21, 3: 15})

From the above plots, we can see that the class distribution is roughly the same across the train, cv, and test sets.

Let us convert annotations to YOLO format.

to_yolo("train/")
to_yolo("cv/")
to_yolo("test/")

Let us sanity check our conversion to YOLO format.

display_images_and_labels("train/00000073.png")

Let us convert the images into blocks of dimensions 416 x 416.

ext = ".png"
size = 416
train_src = "train/"
cv_src = "cv/"
test_src = "test/"
train_imnames = [train_src + name for name in os.listdir(train_src) if name.endswith(ext)]
cv_imnames = [cv_src + name for name in os.listdir(cv_src) if name.endswith(ext)]
test_imnames = [test_src + name for name in os.listdir(test_src) if name.endswith(ext)]
tiler(train_imnames, "train_tiled/", None, size, ext)
tiler(cv_imnames, "cv_tiled/", None, size, ext)
tiler(test_imnames, "test_tiled/", None, size, ext)

6. Experimentation and Results

6.1 Custom YOLOv4 (small) training: A custom YOLOv4 (small) model is trained for about 11,200 epochs and achieves an mAP of 49.57%. The per-class AP and the mAP on the cross-validation dataset are shown below.

Machine confidence vs IOU is shown below.

Observations:

  • As conf_thresh increases, IOU increases.
  • Truck, pickup, and van look similar, so their AP is lower.
  • The other class contains different types of construction vehicles, so its AP is also lower.

On the test set, an IOU of 54.93% is achieved at a machine confidence threshold of 0.25, and the mAP is 57.29%.

6.2 Custom YOLOv4 (small) training with custom anchor boxes: Now, custom anchor boxes are calculated for our dataset and used during training. The per-class AP and the mAP on the cross-validation dataset are shown below.

Machine confidence vs IOU is shown below.

On the test set, an IOU of 50.85% is achieved at a machine confidence threshold of 0.25, and the mAP is 61.14%.

Observations:

  • As conf_thresh increases, IOU increases.
  • There is a slight improvement in the mAP on the cv dataset in this case, but the IOU at various confidence thresholds is lower than with the previous model.

So, our best model is the YOLOv4 model with the original anchor boxes.
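As an aside, the custom anchors used in 6.2 are typically obtained by clustering the ground-truth box dimensions of the training set (the darknet repo provides a calc_anchors utility for this). The sketch below shows the idea with plain k-means; it is only an approximation, since darknet clusters with an IoU-based distance, and it assumes the (width, height) pairs have already been read from the YOLO label files.

import numpy as np
from sklearn.cluster import KMeans

def estimate_anchors(wh_pairs, n_anchors=9, net_size=416):
    # Cluster normalized box sizes and scale the cluster centers to the network input size.
    km = KMeans(n_clusters=n_anchors, n_init=10, random_state=0).fit(np.asarray(wh_pairs))
    anchors = km.cluster_centers_ * net_size
    anchors = anchors[np.argsort(anchors.prod(axis=1))]  # sort by area, smallest first
    return np.round(anchors).astype(int)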

6.3 Results: Our best model is tested on 3 different datasets, namely VEDAI, MUNICH, and VisDrone2019, under three different scenarios.

6.3.1 VEDAI dataset: Image dimensions: 1024 x 1024

6.3.1.1 First Scenario: In the first scenario, the width and height in the config file (.cfg) are set to the nearest multiple of 32 of the original image dimensions. Example: The dimensions of images in the MUNICH dataset are 5616 x 3744. So, the width and height in the config file are set to 5632 and 3744 respectively. Object detection output:

The time taken for object detection is 11.8s. RAM usage is 1.57 GB (OS + darknet build occupies 0.92 GB).

6.3.1.2 Second Scenario: In the second scenario, the width and height in the config file are set to 416 irrespective of the image dimensions, i.e., the image is compressed to 416 x 416 and fed to the model. Object detection output:

The time taken for object detection is 2.24s. RAM Usage is 1.23 GB.

6.3.1.3 Third Scenario: In the third scenario, the image is divided into blocks (tiles), each tile is fed to the model, and finally the tiles are stitched back into the original image with the detected objects. Object detection output:

The time taken for object detection is 10.29s. RAM Usage is 1.23 GB.

6.3.2 VisDrone2019 dataset: Image dimensions: 1360 x 725

6.3.2.1 First Scenario: In the first scenario, the width and height in the config file (.cfg) are set to the nearest multiple of 32 of the original image dimensions. Object detection output:

The time taken for object detection is 11.5s. RAM usage is 1.58 GB. Most of the car objects in the above image are not detected by our model. One thing to keep in mind is that we trained our model on the VEDAI dataset, where the car objects look quite different.

6.3.2.2 Second Scenario: In the second scenario, the width and height in the config file are set to 416 irrespective of the image dimensions, i.e., the image is compressed to 416 x 416 and fed to the model. Object detection output:

The time taken for object detection is 1.73s. RAM Usage is 1.23 GB.

6.3.2.3 Third Scenario: In the third scenario, the image is divided into blocks (tiles), each tile is fed to the model, and finally the tiles are stitched back into the original image with the detected objects. Object detection output:

The time taken for object detection is 8.87s. RAM Usage is 1.23 GB.

6.3.3 MUNICH dataset: Image dimensions: 5616 x 3744

6.3.3.1 First Scenario: In the first scenario, the width and height in the config file (.cfg) are set to the nearest multiple of 32 of the original image dimensions. Object detection output:

The time taken for object detection is 4min 8s. RAM usage is 11.67 GB.

Observation: From the above, we can see that when a high-resolution image is given as input in this scenario, the model needs a lot of time and a lot of RAM.

6.3.3.2 Second Scenario: In the second scenario, the width and height in the config file are set to 416 irrespective of the image dimensions, i.e., the image is compressed to 416 x 416 and fed to the model. Object detection output:

The time taken for object detection is 4.46s. RAM Usage is 1.23 GB.

Observation: From the above, we can see that because the image is compressed to 416 x 416, almost no objects are detected.

6.3.3.3 Third Scenario: In the third scenario, the image is divided into blocks (tiles), each tile is fed to the model, and finally the tiles are stitched back into the original image with the detected objects. Object detection output:

The time taken for object detection is 2min 15s. RAM Usage is 1.23 GB.

7. Error Analysis

Object detection is run on 10 random images and the observations are documented. The default machine confidence threshold of 0.25 is used, but it is modified in some cases.

Note: The left side is ground truth and the right side is detections by the model.

Observation: There are only two objects in the image and both of them are correctly detected with very good machine confidence.

Observation: Only one object is present in the image, but our model found two. The detected car has a machine confidence of 0.25 and the pickup has very good machine confidence.

Observation: No object is detected, even when the machine confidence threshold is lowered to 0.01.

Observation: Two objects are detected with good machine confidence.

Ground truth
The left image is detections with a machine confidence threshold of 0.25 whereas the right image is detections with a machine confidence threshold of 0.15.

Observations:

  • One important thing to note is that when images are tiled, a single object may be divided into at most 4 pieces, with each piece falling in a different sub-image tile.
  • We can see that when the confidence threshold is 0.25, the back portion of the truck is not detected, but when we reduce the threshold to 0.15, the back portion of the truck is also detected.

Observation: The pickup is detected correctly, but the van is detected as a car with a fair amount of confidence.

Observation: We can see that the car object is detected correctly with very good confidence.

Ground truth
The left image is detections with a machine confidence threshold of 0.25 whereas the right image is detections with a machine confidence threshold of 0.10.

Observations:

  • Overlapping objects are detected, but one of the detected overlapping objects (car) has 0.3 confidence and the other (pickup) has 0.87 confidence.
  • The tractor and one truck object are not detected.
  • One of the truck objects is detected as a pickup.
  • There is no ground-truth label for it, but a pickup object is detected with confidence 0.41.
  • When the confidence threshold is decreased to 0.1, a tractor object is detected, along with some additional overlapping detections.

Observation: There is a single truck object and that is detected with good confidence.

Observation: There is a single truck object and that is detected with good confidence.

Performance Evaluation: Modified notation in (source)

TP: if IoU ≥ 0.5, the detection is counted as a True Positive (TP).

FP: if IoU < 0.5, the detection is wrong and is counted as a False Positive (FP); a detection of the wrong object class is also counted as an FP.

FN: when a ground-truth object is present in the image and the model fails to detect it, it is counted as a False Negative (FN).

For the calculation of TP, FP, and FN below, a machine confidence threshold of 0.25 is used.

TP = 18

FP = 4

FN = 4
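From these counts, precision = TP / (TP + FP) = 18 / 22 ≈ 0.82 and recall = TP / (TP + FN) = 18 / 22 ≈ 0.82 on this 10-image sample.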

Observation: For this sample of 10 images, most of the objects are detected by our model, with a few false positives and false negatives.

Overall Observations:

  • When images are tiled, a single object may be divided into at most 4 pieces, with each piece falling in a different sub-image tile.
  • In EDA, it was observed that truck, pickup, and van look similar.
  • The other class has different types of construction vehicles.

8. Data Pipeline

# confidence threshold = 0.25
datapipeline("test/", "00001115.png")

9. TensorFlow and TensorFlow Lite Conversion

Until now we have used the darknet repo for training the object detection model and for inference. This GitHub repo is used to convert the Darknet YOLOv4 model to TensorFlow and TensorFlow Lite. Our best model, i.e., YOLOv4 (small), is converted to TensorFlow and TensorFlow Lite. Let us measure the inference time of the TensorFlow and TensorFlow Lite versions.

# inference time for TensorFlow version
detect_tiles()
# inference time for TensorFlow-lite version
detect_tiles_lite()

Output:

time taken by tf:  61.815514087677s
time taken by tflite: 1.6593375205993652s

The input image is 1024 x 1024, which is broken into 9 tiles of 416 x 416. The TensorFlow inference time per tile is therefore about 6.87s (= 61.82 / 9), while the TensorFlow Lite inference time per tile is about 0.18s (= 1.66 / 9).
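detect_tiles_lite wraps the converted .tflite model; under the hood it uses the standard tf.lite.Interpreter API. A minimal sketch of timing a single 416 x 416 tile is shown below; the model path is a placeholder and the random array stands in for a real preprocessed tile.

import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="yolov4-416.tflite")  # placeholder path
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
outs = interpreter.get_output_details()

tile = np.random.rand(1, 416, 416, 3).astype(np.float32)  # stand-in for a real tile
start = time.time()
interpreter.set_tensor(inp["index"], tile)
interpreter.invoke()
preds = [interpreter.get_tensor(o["index"]) for o in outs]
print("time per tile:", time.time() - start, "s")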

10. Deployment and GUI

Both TensorFlow and TensorFlow-Lite models are deployed in the local environment. The demo video of the deployment is shown below.

A Graphical User Interface (GUI) is also developed for users who do not want to send their data anywhere: detection runs entirely on the local machine. The demo video of the GUI is shown below.

11. Conclusion

Computer Vision is challenging because computers see images as numbers. CNNs are at the heart of modern object detection algorithms, but the features of small objects may disappear in deeper layers, which makes small objects hard to detect. Experiments were done with the YOLOv4 architecture. The best YOLOv4 model was converted to TensorFlow, but it takes significantly more time than Darknet; this is expected, as Darknet runs directly in C whereas TensorFlow adds Python wrappers around C/C++ code. The TensorFlow model was then converted to TensorFlow Lite to run on low-compute edge devices. The model is deployed in a local environment and a demo video is shown. For users who do not want to send their data anywhere, a GUI is developed that can run on a local machine or a low-compute edge device.

12. Future Work

  • When images are divided into blocks, a single object may be split across up to 4 different sub-image tiles, so it may be detected more than once. An algorithm has to be developed to identify these pieces and merge them into a single object while stitching the image back together, similar to the video below; one possible starting point is sketched after this list.
  • Instead of training a small model, we can train a large model and use knowledge distillation to train the small model and improve its mAP. We need good compute resources for this.
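A possible starting point for that merging step is a greedy pass that joins same-class boxes that overlap or nearly touch (e.g. pieces of one object cut by a tile border) into their union. The sketch below is only one simple direction, not a final algorithm; the gap threshold is illustrative.

def boxes_adjacent(a, b, gap=4):
    # True if boxes a and b (xmin, ymin, xmax, ymax) overlap or are within `gap` pixels
    return not (a[2] + gap < b[0] or b[2] + gap < a[0] or
                a[3] + gap < b[1] or b[3] + gap < a[1])

def merge_split_boxes(dets, gap=4):
    # dets: list of (class_id, confidence, (xmin, ymin, xmax, ymax)) in full-image coordinates
    merged = []
    for cls, conf, box in sorted(dets, key=lambda d: d[1], reverse=True):
        for m in merged:
            if m[0] == cls and boxes_adjacent(m[2], box, gap):
                # same class and touching: grow the kept box to the union of the two
                m[2] = (min(m[2][0], box[0]), min(m[2][1], box[1]),
                        max(m[2][2], box[2]), max(m[2][3], box[3]))
                m[1] = max(m[1], conf)
                break
        else:
            merged.append([cls, conf, box])
    return [(c, s, b) for c, s, b in merged]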

Code Reference

https://github.com/nagi1995/Small-Object-Detection-An-Image-Tiling-Based-Approach
https://github.com/nagi1995/yolo-tiling
https://github.com/nagi1995/darknet
https://github.com/nagi1995/tensorflow-yolov4-tflite

Contact Links

Email: binginagesh@gmail.com
LinkedIn: https://www.linkedin.com/in/bingi-nagesh-5b0412b7/

References

[1] appliedaicourse.com

[2] Evaluation of Bird Detection using Time-lapse Images around a Wind Farm

[3] YOLO-Fine: One-Stage Detector of Small Objects Under Various Backgrounds in Remote Sensing Images

[4] HRDNet: High-resolution Detection Network for Small Objects

[5] Finding Tiny Faces

[6] Detecting Small Objects in Urban Settings Using SlimNet Model

[7] Small Object Detection with Multiscale Features

[8] Small objects detection problem
