Identify and Track a Target

Project Info

  • DATE: 2017

In this project, we will train a deep neural network to identify and track a target in simulation. So-called “follow me” applications like this are key to many fields of robotics and the very same techniques you apply here could be extended to scenarios like advanced cruise control in autonomous vehicles or human-robot collaboration in industry. Here’s the video result.


Network Architecture

I started with 2 encoder layers, a 1×1 convolution, and 2 decoder layers, but that model failed to detect the target when it was far away from the quadrotor. I then added one more encoder layer and one more decoder layer, and settled on a final architecture of 3 encoder layers, a 1×1 convolution with 256 filters, and 3 decoder layers, as follows:

def fcn_model(inputs, num_classes):
    # Add Encoder Blocks. 
    layer_1 = encoder_block(inputs, 32, 2)
    layer_2 = encoder_block(layer_1, 64, 2)
    layer_3 = encoder_block(layer_2, 128, 2)
    # Add 1x1 Convolution layer using conv2d_batchnorm().
    layer_4 = conv2d_batchnorm(layer_3, 256, kernel_size=1, strides=1)
    # Add the same number of Decoder Blocks as the number of Encoder Blocks
    layer_5 = decoder_block(layer_4, layer_2, 128)
    layer_6 = decoder_block(layer_5, layer_1, 64)
    layer_7 = decoder_block(layer_6, inputs, 32)
    # The function returns the output layer of your model. "layer_7" is the final layer obtained from the last decoder_block()
    return layers.Conv2D(num_classes, 1, activation='softmax', padding='same')(layer_7)
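
To see why each decoder block is paired with the skip input it receives, it can help to trace the tensor shapes through the model. The sketch below is plain Python, not the project code. It assumes a 160×160×3 input (the input size is not stated in this write-up), 'same' padding, and that each decoder block upsamples by a factor of 2; `trace_shapes` is a hypothetical helper name.

```python
import math

def trace_shapes(height, width, channels=3):
    """Trace (height, width, channels) through the 3-encoder / 1x1 / 3-decoder model.

    Assumes 'same' padding, so a stride-2 layer outputs ceil(size / 2),
    and assumes each decoder block upsamples by exactly 2.
    """
    shapes = [("inputs", (height, width, channels))]
    h, w = height, width
    # Encoder blocks: stride 2 halves the spatial size, filters set the depth.
    for name, filters in [("layer_1", 32), ("layer_2", 64), ("layer_3", 128)]:
        h, w = math.ceil(h / 2), math.ceil(w / 2)
        shapes.append((name, (h, w, filters)))
    # 1x1 convolution: stride 1 keeps the spatial size, only the depth changes.
    shapes.append(("layer_4", (h, w, 256)))
    # Decoder blocks: each one doubles the spatial size.
    for name, filters in [("layer_5", 128), ("layer_6", 64), ("layer_7", 32)]:
        h, w = h * 2, w * 2
        shapes.append((name, (h, w, filters)))
    return shapes

for name, shape in trace_shapes(160, 160):
    print(f"{name}: {shape}")
```

Under these assumptions, layer_5 has the same spatial size as layer_2, layer_6 the same as layer_1, and layer_7 the same as inputs, which is what allows each decoder block to be combined with its skip input.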

Fully Connected Layer

In principle, fully connected layers are the same as a traditional multi-layer perceptron (MLP). Such a network generally consists of at least three layers: an input layer, an output layer, and at least one hidden layer. Every neuron in one layer is connected to every neuron in the next, which is why it is called a “fully connected” layer. Fully connected layers do not preserve spatial information. Replacing them with convolutional layers yields a fully convolutional network (FCN), which preserves spatial information throughout the entire network.

1-By-1 Convolutions

As noted above, a fully connected layer does not preserve spatial information, so our fully convolutional network uses a 1×1 convolutional layer instead, which keeps the spatial layout of its input. A 1×1 convolution can also change the dimensionality of the filter space, either increasing or reducing the number of channels. A fully connected layer of the same size would produce the same number of features. However, replacing fully connected layers with convolutional layers has the added advantage that during inference (testing your model), you can feed images of any size into your trained network.
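
One way to picture a 1×1 convolution is as the same small linear map applied independently to the channel vector of every pixel. The sketch below is a minimal pure-Python illustration of that idea, without bias, batch normalization, or activation; `conv_1x1` and the weight values are hypothetical and are not taken from the project code.

```python
def conv_1x1(feature_map, weights):
    """Apply a 1x1 convolution (no bias, no activation) to a feature map.

    feature_map: nested lists indexed [row][col][in_channel]
    weights:     nested lists indexed [in_channel][out_channel]
    """
    out_channels = len(weights[0])
    return [
        [
            [
                sum(pixel[c] * weights[c][k] for c in range(len(pixel)))
                for k in range(out_channels)
            ]
            for pixel in row
        ]
        for row in feature_map
    ]

# Hypothetical weights mapping 3 input channels to 2 output channels.
weights = [[1, 0], [0, 1], [1, 1]]

small = [[[1, 2, 3], [4, 5, 6]]]             # 1 x 2 image, 3 channels
large = [[[1, 2, 3]] * 4 for _ in range(3)]  # 3 x 4 image, 3 channels

print(conv_1x1(small, weights))              # [[[4, 5], [10, 11]]]
result = conv_1x1(large, weights)
print(len(result), len(result[0]), len(result[0][0]))  # 3 4 2
```

The same weights work on both the 1×2 and the 3×4 input, and the output keeps the input's rows and columns while only the channel count changes. That is the property that lets the network accept images of any size.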

Encoder and Decoder

The encoder is usually a pre-trained classification network like VGG/ResNet, followed by a decoder network. The decoder network/mechanism is mostly where these architectures differ. The task of the decoder is to semantically project the discriminative features (lower resolution) learned by the encoder onto the pixel space (higher resolution) to get a dense classification. (Sources)

The goals of the encoder and decoder:

Encoder : To extract features from the image.

Decoder : To up-scale the output of the encoder to the same size as the original image.

Skip Connections

Skip connections allow the neural network to use information from multiple resolution scales. Above, the encoder extracts features from the image and the decoder scales the encoder output back up to the original size, but some information may be lost along the way. Adding skip connections is an easy way to retain that information. The image below shows the difference with and without skip connections.
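
A single decoder step with a skip connection can be sketched as "upsample, then concatenate with the matching encoder output". The example below is a simplified pure-Python illustration. It uses nearest-neighbour upsampling for brevity, which is an assumption: the internals of decoder_block() are not shown in this write-up, and a real decoder block would typically follow the concatenation with convolution layers. The helper names are hypothetical.

```python
def upsample_2x(feature_map):
    """Nearest-neighbour 2x upsampling of a [row][col][channel] feature map."""
    out = []
    for row in feature_map:
        wide = [list(pixel) for pixel in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append([list(pixel) for pixel in wide])               # repeat each row
    return out

def concat_channels(a, b):
    """Concatenate two same-sized feature maps along the channel axis."""
    assert len(a) == len(b) and len(a[0]) == len(b[0]), "spatial sizes must match"
    return [[pa + pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]

coarse = [[[9]]]                      # 1 x 1 map, 1 channel (decoder input)
skip = [[[1], [2]], [[3], [4]]]       # 2 x 2 map, 1 channel (from the encoder)

merged = concat_channels(upsample_2x(coarse), skip)
print(merged)  # [[[9, 1], [9, 2]], [[9, 3], [9, 4]]]
```

After upsampling, every pixel of the coarse map carries the same value, so the fine detail in the merged map comes entirely from the skip input. This is the information that would otherwise be lost.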


I tried several hyper-parameter combinations and settled on the following because the resulting model can detect a target that is far away from the quadrotor’s view.

Learning rate : 0.002 (When tuning the learning rate, I observed that the network learned faster with a higher learning rate. Because I wanted a higher learning rate, I made a minor increase from 0.001 to 0.002.)

Batch size : 32 (kept at default)

Number of epochs : 100 (This number of epochs is a bit high and might cause over-fitting. However, I decided to keep it because I plan to revisit this project when I have time.)

Steps per epoch : 200 (kept at default)

Validation steps : 50 (kept at default)

Workers : 5
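
As a quick sanity check on what these settings imply, the sketch below computes how many images are processed per epoch and how many weight updates the full run performs. It assumes that each step processes exactly one batch and that validation uses the same batch size. The dictionary keys and the `training_budget` helper are illustrative names, not necessarily those used in the project code.

```python
# Hyper-parameter values taken from the list above; the key names are illustrative.
hyper_params = {
    "learning_rate": 0.002,
    "batch_size": 32,
    "num_epochs": 100,
    "steps_per_epoch": 200,
    "validation_steps": 50,
    "workers": 5,
}

def training_budget(params):
    """Return (images per epoch, total weight updates, validation images per epoch).

    Assumes one batch per step and the same batch size for validation.
    """
    images_per_epoch = params["batch_size"] * params["steps_per_epoch"]
    total_updates = params["num_epochs"] * params["steps_per_epoch"]
    val_images_per_epoch = params["batch_size"] * params["validation_steps"]
    return images_per_epoch, total_updates, val_images_per_epoch

print(training_budget(hyper_params))  # (6400, 20000, 1600)
```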

Training Machines

Deep learning is computationally intensive and does not run efficiently on a CPU. So I decided to train this fully convolutional network on my GPU (Nvidia GTX 1070), which is over 10 times faster than my CPU. Below are the specs of the PC I used for training.

CPU : Intel i7-6700K
GPU : Nvidia GTX 1070
RAM : 32GB
Average time per epoch : 55 sec

Final Score : 0.41

Images while following the target

This works pretty well, although there is a tiny error in the third image.

Images while at Patrol without target

As you can see from the image below, this works pretty well and detects nothing when the quadrotor is on patrol with no target in view!

Images while at patrol with target

Detecting the target when it is very far from the quadrotor is challenging. I was surprised that the model still manages to detect it.


Future Enhancement

Nothing is perfect, and there is always plenty of fun work left to improve.

  1. The model architecture and hyper-parameters can be improved, because the result only slightly exceeds the requirements.
  2. Implement data augmentation techniques to increase the amount of data.
  3. Use more visualization tools for further analysis.
  4. Collect data for other targets, such as dogs, cats, or other animals, to train different segmentation classes. Our current model might not work well for detecting other objects like dogs, cars, trees, or buildings. However, if we train the model from scratch with new data, it might work when the object is close enough.
  5. The training curve suggests that the model is over-fitting: at the final epoch, the training loss is 0.0098 and the validation loss is 0.0274. Increasing the amount of data might improve this.
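
As one possible form of the data augmentation mentioned in item 2, the sketch below doubles a dataset by mirroring each image together with its segmentation mask. Horizontal flipping is an assumption here, chosen only as a simple example; it has not been tried in this project, and the helper names and toy data are hypothetical.

```python
def flip_horizontal(grid):
    """Mirror a [row][col] grid left to right without modifying the input."""
    return [list(reversed(row)) for row in grid]

def augment_with_flips(pairs):
    """Given (image, mask) pairs, return the originals plus mirrored copies.

    The image and its mask are flipped together so every pixel keeps its label.
    """
    augmented = []
    for image, mask in pairs:
        augmented.append((image, mask))
        augmented.append((flip_horizontal(image), flip_horizontal(mask)))
    return augmented

# Toy 2 x 3 single-channel image and its mask (1 = target pixel).
image = [[10, 20, 30], [40, 50, 60]]
mask = [[0, 0, 1], [0, 1, 1]]

dataset = augment_with_flips([(image, mask)])
print(len(dataset))   # 2
print(dataset[1])     # ([[30, 20, 10], [60, 50, 40]], [[1, 0, 0], [1, 1, 0]])
```

Flipping the mask with the image matters for segmentation: if only the image were mirrored, the labels would point at the wrong pixels.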