VLAD: An extension of Bag of Words

Recently, I participated in TagMe, an image categorization competition conducted by Microsoft and the Indian Institute of Science, Bangalore. The problem statement was to classify a given set of images into five classes: faces, shoes, flowers, buildings and vehicles. As such, it is not a trivial problem to solve, so I first tried my existing bag-of-words pipeline on it. It worked to an extent: I got an accuracy of approximately 86% with SIFT features and an RBF SVM for classification. To improve my score, though, I decided to look at better methods of feature quantization. I had already been looking at VLAD (Vector of Locally Aggregated Descriptors), a first-order extension to BoW, for my Leaf Recognition project.

So, I decided to implement VLAD with OpenCV, writing a small function based on the BoW API currently in OpenCV. The results showed a remarkable improvement: an accuracy of 96.5% using SURF descriptors on the validation dataset provided by the organizers.

What is VLAD?

Recall that BoW simply counts the number of descriptors associated with each cluster in a codebook (vocabulary) and creates a histogram from each image's set of descriptors, thus representing the information in an image in a compact vector. VLAD is an extension of this concept: instead of only counting, we accumulate the residual of each descriptor with respect to its assigned cluster. In simpler terms, we match each descriptor to its closest cluster, and then, for each cluster, we store the sum of the differences between the descriptors assigned to that cluster and the cluster centroid. Let us have a look at the math behind VLAD.

Mathematical Formulation

As with bag of words, we first train a codebook from the descriptors of our training dataset, C = \{c_1, c_2, \ldots, c_k\}, where k is the number of clusters in k-means. We then associate each d-dimensional local descriptor x from an image with its nearest neighbour in the codebook.
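For concreteness, here is a minimal sketch of the codebook-training step in Python with OpenCV. It is only an illustration of the idea, not the code I used for TagMe: the image paths, the choice of k, and the use of SIFT via cv2.SIFT_create (SURF requires a non-free OpenCV build) are all my assumptions.

```python
import cv2
import numpy as np

# Hypothetical training images -- substitute your own dataset.
train_images = ["img_0001.jpg", "img_0002.jpg"]

k = 64                             # number of clusters; a tunable choice, not a value from the post
sift = cv2.SIFT_create()           # OpenCV >= 4.4; older builds need cv2.xfeatures2d
trainer = cv2.BOWKMeansTrainer(k)  # runs k-means over all pooled training descriptors

for path in train_images:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    if desc is not None:
        trainer.add(desc.astype(np.float32))

codebook = trainer.cluster()       # shape (k, 128): the centroids c_1 .. c_k
```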

The idea behind VLAD feature quantization is that, for each cluster centroid c_i, we accumulate the difference x - c_i over all descriptors x for which c_i = NN(x), the nearest centroid to x.

Representing the VLAD vector for each image by v, we have,

v_{ij} = \sum_{x : NN(x) = c_i} (x_j - c_{ij})

where i = 1, 2, \ldots, k and j = 1, 2, \ldots, d.

The vector v is subsequently normalized by its L_2 norm: v = \frac{v}{\|v\|_2}.
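A minimal numpy sketch of this encoding step follows; the function name and array shapes are my own choices. It assumes descriptors is an (n, d) array of local descriptors from one image and codebook is the (k, d) array of centroids trained above.

```python
import numpy as np

def vlad_encode(descriptors, codebook):
    """Encode one image's local descriptors as an L2-normalized VLAD vector."""
    k, d = codebook.shape
    # Hard-assign each descriptor to its nearest centroid: c_i = NN(x).
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    nn = np.argmin(dists, axis=1)

    v = np.zeros((k, d))
    for i in range(k):
        assigned = descriptors[nn == i]
        if len(assigned):
            # Accumulate the residuals x - c_i within the i-th Voronoi cell.
            v[i] = (assigned - codebook[i]).sum(axis=0)

    flat = v.ravel()                  # final vector is k*d dimensional
    norm = np.linalg.norm(flat)
    return flat / norm if norm > 0 else flat
```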

Comparison with BoW

The primary advantage of VLAD over BoW is that we add a more discriminative property to our feature vector: the difference of each descriptor from the mean of its Voronoi cell. This first-order statistic adds more information to our feature vector and hence gives us better discrimination for our classifier. It also points us to other possible improvements: adding higher-order statistics to our feature encoding, as well as looking at soft assignment, i.e. assigning each descriptor to multiple centroids, weighted by their distance from the descriptor.

Experiments

Here are a few of my results on the TagMe dataset.

[Figure: results on the TagMe dataset]

Improvements to VLAD

There are several possible extensions to VLAD, primarily various normalization options. Arandjelović and Zisserman, in their paper All About VLAD, propose several normalization techniques, including intra-normalization and power normalization, along with a spatial extension, MultiVLAD. Delhumeau et al. propose several different normalization techniques as well as modifications to the VLAD pipeline, showing results close to the state of the art.
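As an illustration, the two normalizations can be sketched in a few lines of numpy. This is a sketch of the ideas only, not the exact pipelines from those papers; alpha = 0.5 is the common signed-square-root choice, and the function names are mine.

```python
import numpy as np

def power_normalize(v, alpha=0.5):
    # Signed power law: sign(v) * |v|^alpha, followed by global L2 normalization.
    v = np.sign(v) * np.abs(v) ** alpha
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def intra_normalize(v, k, d):
    # Intra-normalization: L2-normalize the residual block of each centroid
    # independently, so no single visual word dominates the representation.
    blocks = v.reshape(k, d).astype(np.float64)
    norms = np.linalg.norm(blocks, axis=1, keepdims=True)
    np.divide(blocks, norms, out=blocks, where=norms > 0)
    return blocks.ravel()
```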

Other references also stress spatial pooling, i.e. dividing the image into regions and computing a VLAD vector for each tile, to better represent local features and spatial structure. A few also advise soft assignment, which assigns each descriptor to multiple clusters, weighted by its distance from each cluster.
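Here is a hedged sketch of what soft assignment could look like; the number of neighbours m and the inverse-distance weighting are my own illustrative choices, and other weighting schemes (e.g. Gaussian kernels) are common.

```python
import numpy as np

def vlad_soft_encode(descriptors, codebook, m=3):
    """VLAD with each descriptor spread over its m nearest centroids."""
    k, d = codebook.shape
    # Distances from every descriptor to every centroid.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)

    v = np.zeros((k, d))
    for x, dist in zip(descriptors, dists):
        idx = np.argsort(dist)[:m]          # the m nearest centroids
        w = 1.0 / (dist[idx] + 1e-12)       # inverse-distance weights (one choice)
        w /= w.sum()
        for i, wi in zip(idx, w):
            v[i] += wi * (x - codebook[i])  # weighted residual

    flat = v.ravel()
    norm = np.linalg.norm(flat)
    return flat / norm if norm > 0 else flat
```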

Code

Here is a link to my code for TagMe. It was a quick hack job for testing, so it is not very clean, though I am going to clean it up soon.

https://github.com/ameya005/VLAD-Implementation

Also, a few references for those who want to read the papers I referred to:

1. Jégou, H., Perronnin, F., Douze, M., Sánchez, J., Pérez, P., & Schmid, C. (2012). Aggregating local image descriptors into compact codes. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(9), 1704-1716.

2. Delhumeau, J., Gosselin, P. H., Jégou, H., & Pérez, P. (2013, October). Revisiting the VLAD image representation. In Proceedings of the 21st ACM international conference on Multimedia (pp. 653-656). ACM.

3. Arandjelovic, R., & Zisserman, A. (2013, June). All about VLAD. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on (pp. 1578-1585). IEEE.

Well, that concludes this post. Be on the lookout for more on image retrieval – improvements to VLAD and Fisher Vectors.

 


4 comments

  1. Hey, first of all very cool. Thanks a lot.

Can you tell me if the VLAD response .xml files should be way larger than the normal XML files generated in the leafdex approach?

    My sizes are:
leafdex: 700 KB – 1.5 MB
VLAD: 96 MB – 228 MB

    This seems a bit crazy, and now my program exits saying it has insufficient memory.

    1. Hey,

      Thanks for the appreciation :).

As for the file sizes, yes, they will be significantly bigger. I am not sure if they should be as big as 200 MB, as I do not have any on my PC right now to check. Basically, the size increase is due to the increased dimensionality of each vector. As for insufficient memory, I tested this out on a cluster at my university. Why don't you try PCA on the SIFT data in order to reduce dimensionality? It should make it possible for you to run it on your PC, and it might also boost your accuracy by a significant amount if you get the right parameters.

Thanks a lot, I found that the issue was that I was saving the descriptors into the response file and not the response histogram.

        I have another question though:

This is my second attempt to classify road traffic signs. In my previous attempt I trained 4 one-vs-all SVM classifiers to distinguish between 4 specific signs, but my accuracy was quite low. Unfortunately my accuracy is not much better with this implementation. (I can't figure out why; I am not an expert on SVMs.)

But I was wondering: why don't you use negatives when training the SVM?
