Colourizing the World with AI
I believe there is no deep difference between what can be achieved by a biological brain and what can be achieved by a computer. It therefore follows that computers can, in theory, emulate human intelligence — and exceed it. — Professor Stephen Hawking
Indeed, Hawking's observation rings true. Training a machine learning model to do a task is similar to teaching a child to do the same job: at first the model performs very poorly, but gradually, with time and more practice (seeing more data), it improves. In this article, I will explain how we can leverage the concept of autoencoders to train a model to colour any grayscale image.
The applications of such a model are wide-ranging. It could be used to restore old grayscale photographs as coloured RGB images, or tweaked to convert old monochrome films into high-resolution colour films. Shown below is an example where the model was used to convert a grayscale image (left) into a high-resolution coloured image (right).
Before I explain the model, we need to understand the concept of autoencoders.
AUTOENCODERS
Autoencoders are a type of artificial neural network. An autoencoder consists mainly of three parts: an encoder, a bottleneck, and a decoder.
The encoder block consists of an input layer and several hidden layers, primarily a combination of convolution, pooling, and dropout layers. In the encoder block, the autoencoder learns a representation of the input images and stores it in a latent space with far fewer dimensions than the input image.
The decoder block consists of some hidden layers, generally a combination of convolutional transpose and upsampling layers, followed by an output layer that gives the reconstructed image. In the decoder block, the autoencoder learns to reconstruct the original image from the previously learned latent space. A simple autoencoder network consisting of all three parts is shown in the figure below.
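To make the three parts concrete, here is a minimal convolutional autoencoder sketch in Keras. The layer widths and the 224×224 input size are illustrative assumptions, not the exact network in the figure:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, UpSampling2D

autoencoder = Sequential([
    # Encoder: convolution + pooling shrink the spatial dimensions
    Conv2D(32, (3, 3), activation='relu', padding='same',
           input_shape=(224, 224, 1)),
    MaxPooling2D((2, 2)),                                    # 112x112
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),                                    # 56x56 bottleneck
    # Decoder: upsampling restores the original resolution
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    UpSampling2D((2, 2)),                                    # 112x112
    Conv2D(32, (3, 3), activation='relu', padding='same'),
    UpSampling2D((2, 2)),                                    # 224x224
    Conv2D(1, (3, 3), activation='sigmoid', padding='same'), # reconstruction
])
autoencoder.compile(optimizer='adam', loss='mse')
```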
INTUITION
Now, with this understanding of autoencoders, we can tweak the network so that during training we feed in grayscale images and set the targets to their corresponding colour images. In this way, the autoencoder learns how to convert the latent-space representation of a grayscale image into a coloured image. The figure below illustrates this intuition:
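In practice, the (grayscale input, colour target) training pairs can be generated from any colour dataset. A hedged sketch, assuming the images are float tensors scaled to [0, 1]:

```python
import tensorflow as tf

def make_training_pairs(colour_images):
    """colour_images: float tensor of shape (N, H, W, 3) in [0, 1]."""
    gray = tf.image.rgb_to_grayscale(colour_images)  # (N, H, W, 1)
    # Repeat the single channel three times so the grayscale input
    # matches the 3-channel shape expected by pre-trained CNN encoders.
    gray3 = tf.repeat(gray, repeats=3, axis=-1)      # (N, H, W, 3)
    return gray3, colour_images
```

Repeating the grayscale channel keeps the input compatible with encoders that expect RGB, which becomes relevant in the next section.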
TRANSFER LEARNING
Transfer Learning will be the next driver of ML success — Andrew Ng (NIPS 2016 Tutorial)
Transfer learning is something that comes naturally to humans. We have a built-in ability to transfer knowledge across tasks: when learning one task, we can seamlessly apply the knowledge we have acquired to solve related tasks. The more similar the tasks are, the easier it is to transfer our existing knowledge to the new one.
For instance,
- If we learn how to write code in one programming language, like C++, it would be much easier to learn a new programming language, say, Python.
- If we know how to ride a bike, it would be easier for us to learn how to drive a car.
Transfer learning is a machine learning method where a model developed for a task is reused as the starting point for a model on a second task.
It is a popular approach in deep learning, where pre-trained models are used as the starting point for computer vision and natural language processing tasks, given the vast compute and time required to develop neural network models for these problems from scratch, and the huge jumps in skill that pre-trained models provide on related problems.
CREATING THE MODEL
ENCODER: For the encoder part of our autoencoder, we can use a pre-trained convolutional neural network to embed the information in images into the latent space. I chose a VGG16 model pre-trained on the ImageNet dataset to classify 1,000 classes. The architecture of the VGG16 model used is shown below.
Since we only need the embedded information from the hidden convolutional layers, we can remove the final classification layers from the VGG16 architecture. The code for using the pre-trained VGG16 model is given below:
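A minimal sketch of this step, using the VGG16 application bundled with Keras and assuming 224×224 inputs:

```python
from tensorflow.keras.applications import VGG16

# Load VGG16 pre-trained on ImageNet; include_top=False drops the
# final classification layers, leaving the convolutional stack that
# embeds a 224x224x3 image into a 7x7x512 latent volume.
encoder = VGG16(weights='imagenet', include_top=False,
                input_shape=(224, 224, 3))
encoder.trainable = False  # the encoder stays frozen during training
```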
DECODER: For the decoder part of the autoencoder, we need to upsample the latent representation back up to the size of the original image. For this, we can use upsampling layers. Using TensorFlow with the Keras API, the code for the decoder can be written as follows:
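A minimal sketch of such a decoder, assuming the 7×7×512 VGG16 output from above; the layer widths here are illustrative choices:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, UpSampling2D

# Five Conv2D + UpSampling2D blocks take the 7x7x512 latent volume
# back up to a 224x224x3 colour image.
decoder = Sequential([
    Conv2D(256, (3, 3), activation='relu', padding='same',
           input_shape=(7, 7, 512)),
    UpSampling2D((2, 2)),                                    # 14x14
    Conv2D(128, (3, 3), activation='relu', padding='same'),
    UpSampling2D((2, 2)),                                    # 28x28
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    UpSampling2D((2, 2)),                                    # 56x56
    Conv2D(32, (3, 3), activation='relu', padding='same'),
    UpSampling2D((2, 2)),                                    # 112x112
    Conv2D(16, (3, 3), activation='relu', padding='same'),
    UpSampling2D((2, 2)),                                    # 224x224
    Conv2D(3, (3, 3), activation='sigmoid', padding='same'), # RGB in [0, 1]
])
```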
TRAINING THE MODEL
For training, only the decoder network was trained; the encoder was frozen, since the pre-trained VGG16 model already extracts useful features from the input images. The dataset we use for training is extremely important. If our main aim is restoring old photographs of people, it would be wise to train on a dataset of people's images; similarly, if we are interested in colouring flowers, we should choose a dataset with many flower images.
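A hedged sketch of how the pieces fit together for training, assuming the `encoder` and `decoder` defined above and hypothetical arrays `gray_images` and `colour_images` prepared as in the earlier snippet:

```python
from tensorflow.keras.models import Model

# Chain the frozen VGG16 encoder into the trainable decoder; only the
# decoder's weights are updated by the optimizer.
model = Model(inputs=encoder.input, outputs=decoder(encoder.output))
model.compile(optimizer='adam', loss='mse')

# gray_images:   grayscale inputs repeated to 3 channels, (N, 224, 224, 3)
# colour_images: matching colour targets, (N, 224, 224, 3), both in [0, 1]
model.fit(gray_images, colour_images, epochs=100, batch_size=32)
```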
The model was finally trained for 100 epochs on an NVIDIA Tesla K80 GPU and gave excellent results, as shown in the graph below:
RESULTS
Shown below are some of the results from the above-trained model.
Left Side Image: Model Output (Coloured Image)
Right Side Image: Model Input (Grayscale Image)