# Deep Networks: A Reboot

It’s been ages since I last posted here. But it is time to reboot this blog.

Since I have always been all about computer vision, and my current job involves using the ‘deep’ and the ‘wide’ to solve medical imaging problems, here I am, back to talk about deep learning, recent advances in computer vision and my trysts with them.

Since this is a quick update post, I will keep it short… but do await the new series of posts on me tangling with the latest and greatest of computer vision research.

# Kernels Part 1: What is an RBF Kernel? Really?

An intriguing article. Looking at an RBF kernel as a low-pass filter is something novel. It also shows why RBF kernels work brilliantly on high-dimensional images: given that image features generally lie in a continuous domain, an RBF kernel can fit smooth solutions and thereby create more relevant separating hyperplanes, especially in the case of multiple classes.
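As a reminder of what the kernel itself computes, here is a minimal sketch in plain Python. The function name `rbf_kernel` and the value of `gamma` are my own choices for illustration and are not taken from the reblogged article. The kernel value depends only on the distance between two points and varies smoothly with it, which is where the smoothness of the fitted solutions comes from.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """RBF (Gaussian) kernel: k(x, y) = exp(-gamma * ||x - y||^2).

    gamma is a hypothetical default chosen only for this example.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

a = [0.0, 0.0]
near = [0.1, 0.0]
far = [3.0, 4.0]

print(rbf_kernel(a, a))     # identical points: similarity is exactly 1.0
print(rbf_kernel(a, near))  # nearby points: close to 1
print(rbf_kernel(a, far))   # distant points: close to 0
```

A larger `gamma` makes the similarity fall off faster with distance, so the decision surface can bend more sharply; a smaller `gamma` gives smoother solutions.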

My first blog on machine learning is to discuss a pet peeve I have about working in the industry, namely why not to apply an RBF kernel to text classification tasks.

I wrote this as a follow up to a Quora Answer on the subject:

http://www.quora.com/Machine-Learning/How-does-one-decide-on-which-kernel-to-choose-for-an-SVM-RBF-vs-linear-vs-poly-kernel

I will eventually re-write this entry once I get better at LaTeX. For now, refer to

Smola, Schölkopf, and Müller, The connection between regularization operators and support vector kernels: http://cbio.ensmp.fr/~jvert/svn/bibli/local/Smola1998connection.pdf

I expand on one point: why not to use Radial Basis Function (RBF) Kernels for Text Classification. I encountered this while a consultant at eBay a few years ago, where not one but 3 of the teams (local, German, and Indian) were all doing this, with no success. They were treating a multi-class text classification problem using an SVM with an RBF Kernel. What is worse, they were claiming the RBF calculations…


# An Interesting History of Computer Vision

Dr. Fei-Fei Li from Stanford discusses the advent and growth of computer vision in recent years. Particularly interesting is her recent research on multimodal interactions and large-scale visual recognition, which has been made possible primarily by the growth in GPU technology. I hope to try out Theano and Caffe for deep learning in this scenario soon.

Video:

Recent publications from Fei-Fei Li’s group:

http://vision.stanford.edu/publications.html

# Giving eyes to a microcontroller: DCMI interface on an STM32F4

It has been another long hiatus between posts. But, I have managed to learn and do quite a bit of stuff in these last few months and it has been rewarding to say the least.

Recently, I have had to work on an embedded platform for image processing. It was quite a big deal as I had never worked with any sort of embedded platform before and the kind of work is quite different from what I have done before. So, my first task was to interface a camera with a microcontroller. After consulting with my friends, Shrenik and Vinod, I decided to use a microcontroller which provides a hardware camera interface instead of writing the complete firmware from scratch for an ATMEGA as I had planned on doing earlier.

After some research, I ended up selecting the well-known STM32F4 series of microcontrollers. The STM32F407 is a high-powered μC with an ARM Cortex-M4 processor running at 168 MHz. The development board available has 1 MB of on-chip flash and 192 kB of SRAM. After playing with GBs of RAM, it sure was tough to be excited about a few kB of SRAM, but solving the problem with as few resources as possible was a different kind of challenge. The STM32F407 features a hardware camera interface known as DCMI (Digital Camera Interface), which is compatible with a huge range of camera modules on the market.

I had also decided to use the OV2640 camera module, as it features an on-board JPEG encoder and is quite well documented. After a few days of familiarizing myself with the basic concepts and fiddling around with the standard peripherals library from STM, I came across this amazing implementation, OpenMV. I found it extremely helpful for understanding the intricacies of image processing on embedded systems.
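To put the 192 kB of SRAM in perspective, and to show why an on-board JPEG encoder is attractive, here is a rough back-of-the-envelope calculation. The resolutions are standard frame sizes picked for illustration, and 2 bytes per pixel assumes a format such as RGB565 or YUV422. The sketch also ignores the stack, heap and everything else that has to share the SRAM, so treat it as an optimistic upper bound rather than a real memory budget.

```python
SRAM_BYTES = 192 * 1024  # total on-chip SRAM quoted above

def raw_frame_bytes(width, height, bytes_per_pixel=2):
    """Size of one uncompressed frame (2 bytes/pixel assumes RGB565 or YUV422)."""
    return width * height * bytes_per_pixel

# Standard frame sizes, chosen only as examples.
resolutions = {"QVGA": (320, 240), "VGA": (640, 480), "UXGA": (1600, 1200)}

for name, (w, h) in resolutions.items():
    size = raw_frame_bytes(w, h)
    verdict = "fits" if size <= SRAM_BYTES else "does not fit"
    print(f"{name:5s} {w}x{h}: {size:>9,d} bytes, {verdict} in {SRAM_BYTES:,d} bytes")
```

Under these assumptions only the smallest frame fits uncompressed, which is why letting the camera hand over JPEG data instead of raw pixels matters so much on a part like this.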

The primary issues that I faced while working on this project were:

• Understanding DCMI and DMA interfaces of the STM32 controller.
• Understanding the communication and synchronization between the camera and the controller.
• Clocks and STM’s unique proposition of allowing us to turn off peripheral clocks when required for low power usage.

I am going to be blogging soon about my foray into the world of microcontrollers and camera control. Until then, here are a few images that I captured with my setup.

# Back to Basics: Sparse Coding?

A good introduction to Sparse Coding. Hope to do some stuff regarding this in the future.

by Gooly (Li Yang Ku)

It’s always good to go back to the reason that lured you into computer vision once in a while. Mine was to understand the brain after I astonishingly realized that computers have no intelligence while I was studying EE in undergrad. In fact, if they had used the translation “computer” instead of “electrical brain” in my mother language, I would probably have been better off.

Anyway, I am currently revisiting some of the first few computer vision papers I read, and to tell the truth I still learn a lot from reading stuff I have read several times before, which you could also interpret as me never having actually understood a paper.

So back to the papers,

Simoncelli, Eero P., and Bruno A. Olshausen. “Natural image statistics and neural representation.” Annual review of neuroscience 24.1 (2001): 1193-1216.

Olshausen, Bruno A., and David J. Field. “Sparse coding with an…


# VLAD: An extension of Bag of Words

Recently, I was a participant at TagMe, an image categorization competition conducted by Microsoft and the Indian Institute of Science, Bangalore. The problem statement was to classify a set of given images into five classes: faces, shoes, flowers, buildings and vehicles. As it goes, it is not a trivial problem to solve. So, I decided to try my existing bag-of-words algorithm on it. It worked to an extent: I got an accuracy of approximately 86% with SIFT features and an RBF SVM for classification. In order to improve my score though, I decided to look at better methods of feature quantization. I had already been looking at VLAD (Vector of Locally Aggregated Descriptors), a first-order extension to BoW, for my Leaf Recognition project.

So, I decided to try VLAD using OpenCV and implemented a small function for it, based on the BoW API currently in OpenCV. The results showed remarkable improvement, with an accuracy of 96.5% using SURF descriptors on the validation dataset provided by the organizers.

Recalling BoW, it involves simply counting the number of descriptors associated with each cluster in a codebook (vocabulary) and creating a histogram for each set of descriptors from an image, thus representing the information in an image in a compact vector. VLAD is an extension of this concept: we accumulate the residual of each descriptor with respect to its assigned cluster. In simpler terms, we match each descriptor to its closest cluster, and then, for each cluster, we store the sum of the differences between the descriptors assigned to that cluster and its centroid. Let us have a look at the math behind VLAD.

### Mathematical Formulation

As with bag of words, we first train a codebook from the descriptors of our training dataset, $C=\{c_1,c_2,\ldots,c_k\}$, where $k$ is the number of clusters in K-means. We then associate each $d$-dimensional local descriptor $x$ from an image with its nearest neighbour in the codebook, $NN(x)$.

The idea behind VLAD feature quantization is that, for each cluster centroid $c_i$, we accumulate the difference $x-c_i$ over every descriptor $x$ for which $c_i = NN(x)$.

Representing the VLAD vector for each image by $v$, we have,

$v_{ij} = \sum_{x \,:\, NN(x)=c_i} (x_j - c_{ij})$

where $i=1,2,\ldots,k$ and $j=1,2,\ldots,d$, and $x_j$ and $c_{ij}$ denote the $j$-th components of $x$ and $c_i$ respectively.

The vector $v$ is subsequently normalized by its $L_2$ norm: $v \leftarrow \frac{v}{\|v\|_2}$
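The formulation above can be sketched in a few lines of plain Python. This is only a toy illustration of the math, not the OpenCV-based function mentioned earlier. The names `nearest_centroid` and `vlad`, the `normalize` flag and the 2-D toy data are all assumptions made for this example, and the codebook is written by hand instead of being trained with K-means.

```python
import math

def nearest_centroid(x, codebook):
    """Index i of the centroid closest to x, i.e. c_i = NN(x)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, codebook[i])))

def vlad(descriptors, codebook, normalize=True):
    """Accumulate residuals x - c_i per cluster, flatten to k*d values, then L2-normalize."""
    k, d = len(codebook), len(codebook[0])
    v = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        i = nearest_centroid(x, codebook)
        for j in range(d):
            v[i][j] += x[j] - codebook[i][j]   # v_ij += x_j - c_ij
    flat = [value for row in v for value in row]
    if not normalize:
        return flat
    norm = math.sqrt(sum(value * value for value in flat))
    return [value / norm for value in flat] if norm > 0 else flat

# Toy data: k = 2 centroids, d = 2 dimensions (hand-written, not trained).
codebook = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 2.0], [9.0, 11.0]]

print(vlad(descriptors, codebook, normalize=False))  # [1.0, 2.0, -1.0, 1.0]
print(vlad(descriptors, codebook))                   # same direction, unit L2 norm
```

The first two descriptors fall into the first cluster and their residuals add up to $(1, 2)$; the third falls into the second cluster with residual $(-1, 1)$. The final vector always has $k \times d$ entries regardless of how many descriptors the image has.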

### Comparison with BoW

The primary advantage of VLAD over BoW is that the feature vector becomes more discriminative, because we add the difference of each descriptor from the mean of its Voronoi cell. This first-order statistic adds more information to the feature vector and hence gives our classifier better discrimination. It also points to other possible improvements: adding higher-order statistics to the feature encoding, and soft assignment, i.e. assigning each descriptor to multiple centroids weighted by their distance from the descriptor.
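A tiny, contrived example makes the difference concrete. The two "images" below are hypothetical sets of 2-D descriptors that I made up so that they land in the same clusters; the helper names are likewise just for illustration. BoW cannot tell the two apart, while the (unnormalized) VLAD vectors differ.

```python
def nearest_centroid(x, codebook):
    """Index of the centroid closest to descriptor x (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, codebook[i])))

def bow_histogram(descriptors, codebook):
    """BoW: count how many descriptors are assigned to each cluster."""
    hist = [0] * len(codebook)
    for x in descriptors:
        hist[nearest_centroid(x, codebook)] += 1
    return hist

def vlad_unnormalized(descriptors, codebook):
    """VLAD before normalization: per-cluster sum of residuals, flattened."""
    v = [[0.0] * len(codebook[0]) for _ in codebook]
    for x in descriptors:
        i = nearest_centroid(x, codebook)
        for j, (xj, cj) in enumerate(zip(x, codebook[i])):
            v[i][j] += xj - cj
    return [value for row in v for value in row]

codebook = [[0.0, 0.0], [10.0, 10.0]]
image_a = [[1.0, 0.0], [9.0, 10.0]]    # made-up descriptors
image_b = [[0.0, -1.0], [10.0, 11.0]]  # same clusters, different offsets

print(bow_histogram(image_a, codebook), bow_histogram(image_b, codebook))  # [1, 1] [1, 1]
print(vlad_unnormalized(image_a, codebook))  # [1.0, 0.0, -1.0, 0.0]
print(vlad_unnormalized(image_b, codebook))  # [0.0, -1.0, 0.0, 1.0]
```

Real descriptors are of course much higher dimensional and far noisier, so this only shows the mechanism, not the size of the gain.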

### Experiments

Here are a few of my results on the TagMe dataset.

There are several extensions possible for VLAD, primarily various normalization options. Arandjelović and Zisserman, in their paper All about VLAD, discuss several normalization techniques, including intra-normalization and power normalization, along with a spatial extension, MultiVLAD. Delhumeau et al. propose several different normalization techniques as well as a modification to the VLAD pipeline, showing improvements to almost state of the art.
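For intuition, here is how I understand two of these normalizations, sketched in plain Python. Power normalization replaces each component by its signed power, and intra-normalization L2-normalizes each cluster's block of residuals on its own before the final L2 normalization. The function names, the exponent `alpha=0.5` and the toy numbers are assumptions for this example; the papers are the authoritative source for the exact recipes.

```python
import math

def l2_normalize(vec):
    """Scale vec to unit L2 norm (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def power_normalize(vec, alpha=0.5):
    """Signed power sign(x) * |x|^alpha per component, followed by L2 normalization."""
    return l2_normalize([math.copysign(abs(x) ** alpha, x) for x in vec])

def intra_normalize(blocks):
    """L2-normalize each cluster's residual block separately, then the whole vector."""
    flat = [x for block in blocks for x in l2_normalize(block)]
    return l2_normalize(flat)

# Toy unnormalized VLAD with k = 2 blocks of d = 2 values each.
blocks = [[3.0, 4.0], [0.0, -0.5]]

print(power_normalize([x for block in blocks for x in block]))
print(intra_normalize(blocks))
```

In the toy input the first cluster has a much larger residual than the second. After intra-normalization both blocks contribute equally to the final vector, which is the point of the technique: no single cluster gets to dominate.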

Other references also stress spatial pooling, i.e. dividing the image into regions and computing a VLAD vector for each tile, to better represent local features and spatial structure. A few also advise soft assignment, which refers to assigning descriptors to multiple clusters, weighted by their distance from the cluster.
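Soft assignment can be sketched as follows. The exponential weighting and the `beta` parameter are one common choice that I am assuming here for illustration; the references may well use a different weighting scheme, and the function name is made up.

```python
import math

def soft_assignment_weights(x, codebook, beta=0.1):
    """Weights over all centroids that decay with squared distance and sum to 1.

    The exp(-beta * distance^2) form and beta itself are assumptions for this sketch.
    """
    sq_dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in codebook]
    closest = min(sq_dists)  # subtracting the minimum keeps exp() numerically stable
    scores = [math.exp(-beta * (dist - closest)) for dist in sq_dists]
    total = sum(scores)
    return [score / total for score in scores]

codebook = [[0.0, 0.0], [10.0, 10.0]]

print(soft_assignment_weights([1.0, 0.0], codebook))  # almost all weight on the first centroid
print(soft_assignment_weights([5.0, 5.0], codebook))  # [0.5, 0.5]: equidistant from both
```

With weights like these, each descriptor would add its residual to every cluster, scaled by the corresponding weight, instead of contributing to its nearest cluster only.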

### Code:

Here is a link to my code for TagMe. It was a quick hack job for testing, so it is not very clean, though I am going to clean it up soon.