Friday, February 20, 2015

Reading for Monday 2/23

A. Krizhevsky, I. Sutskever and G. Hinton. Imagenet classification with deep convolutional neural networks, NIPS, 2012.

And additionally:

K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv preprint arXiv:1409.1556 (2014).

M. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks, ECCV, 2014.



  2. I read the paper "A. Krizhevsky, I. Sutskever and G. Hinton. Imagenet classification with deep convolutional neural networks, NIPS, 2012." and some of its references. I am sharing my thoughts here.

    The authors describe the architecture of a convolutional neural network which, when trained on 1.2 million images from ImageNet, classified the test set from the ILSVRC-2010 competition with a top-1 error rate of just 37.5%, which is 8.2 percentage points lower than the best previously published result.

    What does it explain well?
    i) The authors give a very good, in-depth description of the final architecture of the CNN, along with the many optimizations they made to reduce its training time. From the specification (and number) of the GPUs they used for training, to the kinds of optimizations they applied and the number of days they trained the network, they have comprehensively covered every aspect required to reproduce this implementation.
    ii) Except for resizing and cropping the training images to 256 x 256 and using tricks to expand the training set, there doesn't seem to be any other hand-made artifact in the entire pipeline. No assumptions are built into the architecture of the CNN, and any bias the final model learns comes from the training data alone (and not from any other man-made artifact).
    iii) The authors give a fairly good explanation of how optimizations such as ReLU and dropout can be used to improve the training of the CNN. It seems that these generic optimizations could be used in CNNs for other applications as well.
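    The two techniques mentioned above can be sketched in a few lines of NumPy. This is a rough illustration of the ideas, not the paper's GPU implementation; the function names and the "inverted" dropout scaling are my own choices:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): the non-saturating nonlinearity the paper credits
    # with training several times faster than equivalent tanh units.
    return np.maximum(0.0, x)

def dropout(x, p=0.5, rng=None, train=True):
    # At training time each unit is zeroed with probability p. Here I use
    # "inverted" dropout (scaling by 1/(1-p) during training) so test-time
    # activations need no rescaling; the paper instead halves the outputs
    # at test time, which is equivalent in expectation.
    if not train:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.array([-1.0, 0.5, 2.0])
print(relu(x))  # [0.  0.5 2. ]
```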

    What does it not explain so well (in my opinion)?
    i) While the authors have taken a lot of effort to precisely document every detail of the CNN's architecture and how to train it, a reader is left with a lot of "why" questions that remain unanswered in the paper. After reading it, one gets the feeling that the final architecture was the result of trial and error. While the work has strong experimental backing, it fails to discuss "why" it works. For example, they say, "we found that removing any convolutional layer (each of which contains no more than 1% of the model’s parameters) resulted in inferior performance." However, they don't discuss why they think removing a single layer could affect performance to such an extent. Therefore, while this paper acts as a very good performance benchmark for future work, it fails to provide any room/direction for others to improve the proposed architecture, owing to its lack of theoretical backing.
    ii) They say, "All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available." It would have been nice if they had attempted to quantify how much they expect performance to improve as a function of dataset size and GPU speed. This is, again, a result of the lack of a theoretical backing, i.e., a theoretical explanation of "why" the proposed architecture results in good performance.

    I have another, probably unrelated, comment. The authors describe methods they employed to avoid overfitting to the training dataset. I am surprised that with a training dataset of about 1.2 million images (collected almost independently and from different sources), overfitting is still a problem. Is it possible that we will have another dataset in the future, similar to ImageNet, where this algorithm and other state-of-the-art methods would fail to perform as well as they did on ImageNet or LabelMe?


  4. Looking at this paper from an application perspective, the results are nothing short of excellent. More specifically, the authors trained a deep CNN that achieves top-1 and top-5 error rates significantly better than prior art. However, one of my concerns with the final CNN (and CNNs in general) is that it requires a lot of convolutions, which potentially inhibits real-time performance on many platforms. To mitigate this issue, we can follow the authors' direction and optimize for fast platforms, such as many-core processors like the GPU. But what if we approach this from a different angle? Rather than performing convolutions with full kernels, we can utilize a variety of approximation techniques to obtain an approximate convolution result more efficiently. For example, one paper discusses a method of approximating filters with a linear combination of separable filters; another approach is to approximate filters with a linear combination of box filters; and a third paper directly addresses the issue by exploiting redundancies between different filters. What are your thoughts?
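    The separable-approximation idea mentioned above can be sketched via the SVD: a 2D kernel K is written as a sum of rank-1 (separable) terms, so convolving with a column vector and then a row vector costs O(k) per pixel each instead of O(k^2). This is a sketch of the general technique, not any specific paper's algorithm:

```python
import numpy as np

def separable_terms(kernel, rank):
    # Approximate a 2D kernel as a sum of `rank` separable filters via SVD:
    # K ~ sum_i s_i * u_i v_i^T. Each term is an outer product of a column
    # filter u_i and a row filter v_i, which can be applied as two cheap 1D
    # convolutions instead of one full 2D convolution.
    U, s, Vt = np.linalg.svd(kernel)
    return [(s[i], U[:, i:i+1], Vt[i:i+1, :]) for i in range(rank)]

# A Sobel kernel is exactly rank 1, so one separable term reconstructs it.
K = np.outer([1, 2, 1], [1, 0, -1]).astype(float)
s0, u0, v0 = separable_terms(K, rank=1)[0]
approx = s0 * (u0 @ v0)
print(np.allclose(approx, K))  # True
```

    For generic learned filters the approximation is lossy, and the number of terms needed (the kernel's effective rank) determines the speedup.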

  5. The paper presents a useful summary of how the authors solve the problem of visual recognition on a large dataset using a customized CNN.

    --> There were a few technical points that are unclear to me:
    1) In section 3.2, it's unclear why the trick of having "the GPUs communicate only in certain layers" reduces the top-1 and top-5 error rates. The authors intended this trick to minimize communication across the 2 GPUs so that they could train a large network. While training a large network allows one to learn more parameters, it's unclear why skipping outputs from certain kernel maps works well: "... kernels in layer 4 take input only from those kernel maps in layer 3 which reside on the same GPU".

    2) In section 3.3, the authors don't specify how they chose a validation set for learning the hyper-parameters k, n, alpha, and beta in the response normalization formula.
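    For reference, the response-normalization formula in question can be sketched as follows. The constants shown are the values the paper reports choosing on a validation set (k=2, n=5, alpha=1e-4, beta=0.75); the loop-based implementation is my own illustrative version:

```python
import numpy as np

def lrn(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # Local response normalization across channels, as in section 3.3:
    #   b_i = a_i / (k + alpha * sum_{j in window} a_j^2)^beta
    # where the window covers up to n adjacent kernel maps centred on
    # channel i. a has shape (C, H, W).
    C = a.shape[0]
    b = np.empty_like(a)
    half = n // 2
    for i in range(C):
        lo, hi = max(0, i - half), min(C - 1, i + half)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

a = np.ones((8, 1, 1))
b = lrn(a)
print(b.shape)  # (8, 1, 1)
```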

    3) In section 4, the authors mention that it is difficult to learn a large number of parameters without overfitting. Why didn't the authors use unsupervised learning on different datasets (other than ImageNet) to combat overfitting, along with the other approaches described in section 4?

    --> I agree with Venkatesh M that the authors don't really discuss "why the proposed architecture gives a good performance". An explanation of this sort would help one understand and evaluate when a certain tweak to the neural network is useful.

  6. Good paper. Questions that I had while reading it were:

    1. The authors failed to explain the intuition behind letting layer 3 take inputs from all kernel maps of the previous layer, while layers 2, 4, and 5 take inputs only from maps on the same GPU. What was the reasoning behind such an approach? It seems to me that their design was based on a purely empirical approach, which happened to work really well. Why not try the same "trick" on all the layers? Error rates could possibly be reduced further.

    2. Why not apply max pooling after all layers as well? Max pooling summarizes the responses of a layer, so it should be a useful input for the layer following the current one.
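    As a concrete picture of the pooling being discussed (and the overlapping variant from section 3.4, where size > stride; the paper uses size 3, stride 2 and reports a 0.4% top-1 improvement over non-overlapping 2x2 pooling), here is a naive single-channel sketch:

```python
import numpy as np

def max_pool(x, size=3, stride=2):
    # Max pooling over a 2D feature map. With size > stride the windows
    # overlap, as in section 3.4 of the paper; with size == stride they
    # tile the map without overlap (the traditional setting).
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.arange(25.0).reshape(5, 5)
print(max_pool(x))  # 2x2 output from overlapping 3x3 windows, stride 2
```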

    3. Following Karan's comment, why not utilize a shared-memory pattern for the convolution kernels for a given image? Certainly, this would only involve a few changes to the implementation: load an apron (tile) of data into shared memory and adjust the kernel indices so that reads come from the much faster on-chip shared memory/L1 cache.

  7. This is a well-written, easy-to-read paper that elaborates the architecture of the handcrafted CNN and the resulting performance under each setting. It's definitely a convincing paper, and I'm really impressed by Alex! However, similar to the comments above, even though Alex always mentions the benefits of doing certain things or using certain tricks (e.g., the performance will increase by x% by doing xxx), most of the time he doesn't explain the intuition behind them. For example, he says that using overlapping pooling (which differs from past practice) reduces the probability of overfitting and leads to better performance (reducing the error rate by 0.4%). I've kept thinking about this, but I just can't see why. I'm also curious why Alex chose to put the max-pooling layers after layers 1, 2, and 5, but not the others.

    I know Professor Gupta has mentioned that doing deep learning is like conducting experiments (does this mean we cannot explain everything theoretically?), but I'd really love to know the intuition behind designing these deep architectures, since that would definitely help me get onto the right track faster.

  8. This is a very good paper, one that set the milestone for the current deep learning trend in computer vision.

    I'm not clear about the multi-GPU implementation. Most current multi-GPU implementations I have seen use the exact same network on each GPU but distribute different batches of images to different GPUs. AlexNet, however, splits the network itself into two parts. I don't quite understand the intuition behind this "split" idea, since most of the memory burden comes from the intermediate activations for each batch of input images, and the network model itself is relatively small.
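    The contrast between the two strategies can be illustrated with a toy fully connected layer y = Wx. In data parallelism each "GPU" holds the full weight matrix and processes half the batch; in AlexNet-style model parallelism each holds half the output units' weights and processes the full batch. Names and shapes here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))   # full weight matrix of a toy layer
X = rng.standard_normal((4, 6))   # batch of 6 input vectors (columns)

# Data parallelism: both "GPUs" hold all of W; each processes half the batch,
# and the results are concatenated along the batch (column) axis.
y_data = np.concatenate([W @ X[:, :3], W @ X[:, 3:]], axis=1)

# Model parallelism (AlexNet-style split): each "GPU" holds half of W's rows
# (half the output units) and sees the whole batch; results are concatenated
# along the unit (row) axis instead.
y_model = np.concatenate([W[:4] @ X, W[4:] @ X], axis=0)

print(np.allclose(y_data, W @ X), np.allclose(y_model, W @ X))  # True True
```

    Both recover the same output; they differ in what must be communicated between devices (gradients of the shared weights vs. activations at the layers where the two halves connect).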