This write up was featured on

VQA can yield more robust visual aids by adding complexity to intelligent systems-based “perception”; this technique allows people to ask open-ended, common sense questions about the visual world, setting the stage for more flexible, personalized engagement.

Automated object detection in image and video content continues to get better and better, thanks to ongoing advances in deep learning. We’re already seeing incredible applications of object detection in our daily lives, from Netflix to medical imaging diagnoses. However, the field is currently constrained by human ability to effectively model the problem space of computer vision - leaving room for even more progress. Much of our ongoing research, especially in the areas of image classification, has been made possible by the publicly-available ImageNet database - which contains over four million images labeled with over a thousand categories. Over the last few years, research groups around the globe have tried innovative approaches to image classification tasks on the ImageNet database. One such interesting application is Visual Question Answering. It is a new and upcoming problem in Computer Vision where the data consists of open-ended questions about images. In order to answer these questions, an effective system would need to have an understanding of “vision, language and common-sense.”

This post will first dig into the basic theory behind the Visual Question Answering task. Then, we’ll discuss and build two approaches to VQA: the “bag-of-words” and the “recurrent” model. Finally, we’ll provide a tutorial workflow for training your own models and setting up a REST API on FloydHub to start detecting objects in your own images. The project code is in Python (Keras + TensorFlow). You can view my experiments directly on FloydHub, as well as the code (along with the weight files and data) on Github.

Let’s say we’ve got an image of a high-speed train.

Like that one.

Now, let’s say that we wanted to develop a model that would be able to answer some questions about this picture. For example, I’d like to be able to ask the model: Which type of vehicle is it?, and I’d expect it to confidently tell me that it’s a picture of a train.

In order to do this, the model would need to understand several things - let’s break them down into sub-tasks:

  • Identifying the various objects in the image (the train, traffic signals, tracks, pavement, person, etc).
  • The text of the question itself, which is processed as a ‘sequence’ of words.

  • After that, mapping the appropriate sections of the image (in this case - the train) to the input text question.
  • Finally, generating natural language text in the form of an answer with an acceptable certainty.

If that sounds like fun to you, then let’s keep going!


The open-source VQA dataset contains multiple open-ended questions about various images. All my experiments were performed with V2 of the dataset (though I’ve processed v1 of the dataset as well- much smaller in size), which contains:

  • 82,783 training images from COCO (common objects in context) dataset.
  • 443,757 question-answer pairs for training images.
  • 40,504 validation images to perform own testing.
  • 214,354 question-answer pairs for validation images.

As you might expect, this dataset is huge (12.4 GB of training images). I’ve provided a helpful script that can be used to process the questions and annotations (src/ As for the images, I decided to use a pre-trained model of VGG16 architecture trained on COCO itself by Andrej Karpathy for his Image Captioning project. I’ve created a public dataset on FloydHub called vgg-coco to store this dependency (VGG-Net) - if you’re following along, you can simply mount this dataset to your jobs (there’s no need to upload it again yourself).

Upon processing the data, we’ll eventually obtain the following preprocessed text files that’ll be used for training-

├── preprocessed                          # Text files used for training.
    ├── questions_train2014.txt            # Training questions.
    ├── questions_lengths_train2014.txt    # Length of each question (later used for padding sequences of words)
    ├── questions_id_train2014.txt         # Map ques-imgs. (Format: imageID-quesNumber)
    ├── images_train2014.txt               # Image IDs (used for mapping)
    └── answers_train2014_modal.txt        # Answers associated with training questions.
├── data                  # Data used and/or generated
   ├──        # Execute first to download all preprocessed data- ready for training.
├── src                   # Source Files
   ├──   # Train the feed-forward model.
   ├──      # Train the recurrent model.
   ├──           # Utility methods reqd for training.
   ├─       # Prepare data for training from the VQA source.
   ├──        # Determines accuracy of the model.     
   ├──            # Test file to run with your own image-question pairs.
├── preprocessed          # Preprocessed Data for reference(described bellow)
├── models                # Stored model files required for execution of tests

Like any supervised learning project, our core task is to frame a set of input features and feed them through some model weights in order to get the output. In its most basic form, VQA is no different.

As a result, this is first approach that we’ll take – simply coalesce the feature vectors of our image and the text question input to feed them into a fully connected network that can predict an answer to the question.

Baseline: Bag of Words!

There are many ways to represent text data with machine learning. One of most simple – and elegant – approaches is the Bag of Words model. The Bag of Words approach is simple to understand, straightforward to implement, and has seen great success in problems such as language modelling and document classification. You can check out this post for a quick intro to BoW model.

It’s important to observe that the feature vector for our question (which is being fed into the network) is the sum of all the feature vectors of the words of that particular question. Therefore, regardless of the word order of the original question, this feature vector remains the same – thus, the Bag of Words. The topology of this network is defined as follows–

num_hidden_units = 1024
num_hidden_layers = 3
batch_size = 128
dropout = 0.5
activation = 'tanh'
img_dim = 4096
word2vec_dim = 300
num_epochs = 100

When training your model, it’s a good idea to add a TensorBoard integration to visualise your training process more effectively. This is quite simple when using FloydHub. All you have to do is export your TensorBoard logs to /output/Graph

model = Sequential()
model.add(Dense(num_hidden_units, input_dim=word2vec_dim+img_dim,
for i in range(num_hidden_layers):
    model.add(Dense(num_hidden_units, kernel_initializer='uniform'))
model.add(Dense(nb_classes, kernel_initializer='uniform'))

tensorboard = TensorBoard(log_dir='/output/Graph',

Another subtle implementation observation is that the questions are of variable length. In order to represent the question vectors as Tensors, the smaller questions should be zero padded.

Note: FloydHub does not contain internal dependencies (Word Vectors) of SpaCy installed, so in order to run the script on FloydHub execute

` – floyd run –data sominw/datasets/vgg-coco/1:input –tensorboard “python -m spacy download en && python src/ –logdir /output/Graph”`

Our baseline performed fairly well with about 48% accuracy. This architecture should took around 400 sec/epoch to complete on FloydHub’s K80 GPU instances. It is also interesting to note that this baseline architecture flattens out after 35-40 epochs.

Here’s what I learned in the process of building this baseline model:

  • Dealing with this much data requires a lot of patience - Starting out on a task like this with and (pre)processing certain elements like textual questions, generation of image features & lastly the hyperparameter tuning part requires significant cognitive effort along with a lot of patience. But you learn a LOT in the process. For me, engaging in things like manipulation of text/image data made me realise how powerful is the vectorized computation with NumPy.
  • Start Simple - Start out with the absolute basic approach. Because vision & NLP are growing exponentially you’ll see a ton of ways to for feature generation & textual modelling and it’s easy to get overwhelmed with all the exotic implementations. But the key is to not get distracted by all the novel state-of-the-art approaches until you have something concrete working.
  • Start out small, then scale - Always start with a subset of data, it is not wise to train a model on the entire data set (spend a good deal of time doing so) only to dismantle everything in the end. For instance, you can develop a baseline approach and test it out on 20% of the data. If you see your accuracy go up (loss decrease), then go ahead and train on the entire data.

The Recurrent Model

As a next step, let’s try to improve the accuracy of our model through a posterior processing of text. This approach is called a Recurrent Model.

In our previous approach, the order and structure of the words in the question were discarded — it was just a “bag” of words or we can say that the words in our question were “independent” of each other. In a Recurrent Model, this sequence (order) of words is preserved. Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far. This nature of preserving long sequences is what makes RNNs perfect for NLP related tasks.

We choose to go ahead with LSTMs to avoid a fundamental limitations of vanilla RNNs – Vanishing Gradient Problem. If you do however, wish to know more about RNNs (limitations, strengths & types), you can refer to this great article by Andrej Karpathy.

The neural network has 3 stacked 512-unit LSTM layers to process questions which is then merged with the image model. One fully-connected regular layer takes the merged model output and brings it back to the size of the vocabulary (as depicted in the figure above).

The LSTM network in Keras expects the input (Tensor X) to be provided with a specific array structure in the form of: [samples, time steps, features]. Currently our data is of the form [samples, features]. So we define two different models – an image model to process image feature vector (len: 4096) & a language model to process the sequences of the question text (len: 300, timestep:30 – max length of question available with us).

number_of_hidden_units_LSTM = 512
max_length_questions = 30

### Image model
model_image = Sequential()

### Language Model
model_language = Sequential()

### Merging the two models
model = Sequential()
model.add(Merge([model_language, model_image],

This architecture should took around 460-500 sec/epoch to complete on FloydHub’s K80 GPU instances and the performance flattened out after 50 epochs.

-- floyd run --data sominw/datasets/vgg-coco/1:input --tensorboard "python -m spacy download en && python src/ --logdir /output/Graph"

There still remains a lot of scope for hyperparameter tuning in both the architectures i.e number of layers, percentage of dropout, timesteps in case of LSTM etc.

Takeaways & Tests

The evaluation performed on the validation set yielded the following results –

Model Accuracy (%)
Baseline (MLP) 48.33
Recurrent Model (LSTM) 54.88

The results obtained are in line with the original VQA Paper, though we have evaluated it on the validation set.

Our key takeaways from this project were -

  • Working with very large data: Everything involved with VQA - from provided data, pretrained image models to our output models - were substantially large in size. Managing and moving this data in itself is a task. For starters, you’d need a tremendous amount of compute and storage.
  • Starting with a baseline saves time: This may seem counterintuitive at first, but as you progress through the various tasks involved, having a quick reference is always going to help while building bigger complex architectures.
  • Learning is constant: Be it your grasp of the problem/solution you’re working or the training of your model. For instance, even though we knew the working of a RNN, but implementing them for our purpose still required us to go through the theory & different implementations (e.g. FastAI) to make the task-at-hand easier. So it’s never a bad idea to gain a thorough understanding of the concepts before applying them.


After building these two potential solutions to the VQA problem, we followed the Keras REST API tutorial to create a serving endpoint so that we can test out live using new images. A simple API call of the following form will start the server from where you can directly fetch the results on your new images through a simple CURL form request.-

Note: To do the same on FloydHub, all you have to do is create a new server initiation script- which handles the incoming request, executes the code in and returns the output on port 5000. You can checkout their documentation for specific details. Once you have the above directory structure figured out, a simple floyd run on serve mode will take care of the rest.

floyd run --data sominw/datasets/vgg-coco/1:input --mode serve

Once you have the server running on your machine or the cloud, you can send any image-question pair as request to this API and it will return the top 5 predicted answers.

curl -F image=@<image_path> -F ques="<Question>" ''

Here’s an example:

Sincere thanks to J. Raunaq for being a part of this project all along & contributing to it.

If you try to replicate or run some of your own tests and have some feedback, get stuck or pursue a similar project in this domain, do ping @me or @Raunaq on twitter. We’d love to see what you’re up to.


  1. VQA Paper
  2. Recent advances in Visual Question Answering
  3. KDnuggets Blog on VQA - For Images & Evaluation Ideas