Saturday, April 4, 2015

Reading for Monday 4/6

Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean and A. Ng. Building high-level features using large scale unsupervised learning, ICML, 2012.

And additionally:

B. Russell, A. Efros, J. Sivic, B. Freeman, A. Zisserman. Using Multiple Segmentations to Discover Objects and their Extent in Image Collections, CVPR, 2006.

Y. Lee and K. Grauman. Object-Graphs for Context-Aware Visual Category Discovery, CVPR, 2010.

21 comments:

  1. Regarding "the cat paper" (Q. Le et al.): as far as I understand, the main contribution of this paper was to demonstrate a deep autoencoder at a much greater scale than previous attempts (parameters, dataset, cluster size). This was made possible by distributed asynchronous SGD running on Google's cluster.

    They managed to learn pretty convincing deep features and show improvement over random classifiers and shallower architectures.
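
    To make the asynchronous SGD idea concrete, here is a minimal single-process caricature in Python (the class name, the toy gradient, and all hyperparameters are my own stand-ins; the paper's actual system ran across thousands of machines):

    ```python
    import threading
    import numpy as np

    class ParameterServer:
        """Holds the shared parameters; replicas push gradients asynchronously."""
        def __init__(self, dim, lr=0.01):
            self.w = np.zeros(dim)
            self.lr = lr
            self.lock = threading.Lock()

        def pull(self):
            with self.lock:
                return self.w.copy()

        def push(self, grad):
            # Apply a (possibly stale) gradient without waiting for other replicas.
            with self.lock:
                self.w -= self.lr * grad

    def replica(server, data, n_steps=100):
        rng = np.random.default_rng()
        for _ in range(n_steps):
            w = server.pull()                  # fetch current parameters
            x = data[rng.integers(len(data))]  # sample one training example
            grad = 2 * (w - x)                 # toy gradient: pulls w toward the data mean
            server.push(grad)                  # send the gradient back, no barrier

    # Each thread stands in for one model replica on the cluster.
    server = ParameterServer(dim=8)
    data = np.random.randn(1000, 8) + 3.0
    threads = [threading.Thread(target=replica, args=(server, data)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(server.w)  # approaches the data mean (about 3.0 per coordinate)
    ```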

    Some points to discuss:
    - This is one of the main unsupervised deep learning papers, yet it discovers only 3 classes (faces, cats, bodies).
    - There is only a slight improvement on ImageNet when fine-tuning on top of their network? I guess it is enough to show that it works...
    - It is not immediately clear to me why they did not use convolutional layers.
    - Is there a motivation for only looking at single neurons in isolation? Why not choose the best combination of neurons to detect a face?
    - Did they not try fixing the lower layers and fine-tuning only the logistic classifiers? It looks like their fine-tuning method considers the network and the logistic classifiers separately.

    Replies
    1. In the paper the authors say that their approach is motivated by the neuroscientific conjecture that there exist highly class-specific neurons in the human brain. So they were just looking to find the "cat/face-neuron" in their artificial brain/neural network :).

      I think the reason for the small improvement after fine-tuning is that the network was initially not trained in a full top-down manner but piecewise (one encoder layer at a time), while the fine-tuning was done in a full top-down manner.
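
      To make the two options concrete, here is a hedged PyTorch-style sketch (the layer sizes are arbitrary stand-ins, not the paper's architecture):

      ```python
      import torch.nn as nn

      # Stand-in for the layerwise-pretrained deep encoder.
      encoder = nn.Sequential(
          nn.Linear(784, 256), nn.ReLU(),
          nn.Linear(256, 64), nn.ReLU(),
      )
      classifier = nn.Linear(64, 10)  # logistic classifier on top

      # Option A: fix the lower layers, fine-tune only the classifier.
      for p in encoder.parameters():
          p.requires_grad = False
      params_a = list(classifier.parameters())

      # Option B (what the paper reportedly does): fine-tune the whole stack top-down.
      for p in encoder.parameters():
          p.requires_grad = True
      params_b = list(encoder.parameters()) + list(classifier.parameters())
      ```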

  2. Also, a link to Russell's paper if anyone is interested (the other one was broken, for me at least): http://www.robots.ox.ac.uk/~vgg/publications/2006/Russell06/russell06.pdf

  3. I also read the “cat paper” (Building high-level features using large scale unsupervised learning). This paper presents a model for learning class-specific high-level features in an unsupervised fashion. As pointed out already by Gunnar, the main contribution of the paper seems to be showing that a sparse deep autoencoder using local receptive fields can be trained on truly large-scale unlabeled data. They show convincing results on 3 categories and considerable improvements over the state of the art among unsupervised methods.

    Regarding Gunnar’s point about not using conv layers: the authors’ justification, that not sharing parameters across different locations of the image (as conv layers do) allows the network to learn greater invariances, does seem plausible to me. However, I too have my concerns about the scalability of the method to other object categories.

  4. The paper on unsupervised learning of faces is an interesting one that raises a lot of questions.
    Firstly, how good is this method at learning things that aren't too frequently occurring? For example, cars appear frequently in videos, although perhaps not as often as cats; how well would this perform on cars?
    Secondly, would this have performed better if conv layers were used along with a deeper architecture (such that the number of parameters remains the same)?

  5. The paper “Building High-level Features Using Large Scale Unsupervised Learning” discusses a very interesting approach to developing object detectors from unlabeled data alone using deep learning. The authors acknowledge that training deep learning algorithms can be very time-intensive. To handle this issue, they use asynchronous gradient descent to exploit data parallelism on a cluster of 1000 nodes with 16000 cores. Experimental results show that the authors were able to train a feature that is selective for faces in a translation- and out-of-plane-rotation-invariant manner. It would be interesting to see how fast the network would train with fewer nodes (say, on the order of tens), but with each node sporting a GPU to run its respective portion of the training. This could reduce the resource cost while still enabling fast training.

  6. The Quoc V. Le paper introduces an extremely large model and trains it in an unsupervised manner. Its classification results are actually much worse than AlexNet's, but it came out before AlexNet.

  7. I read the "cat paper", and it is very intriguing. For a paper published in 2012, it is indeed surprising to see that high-level features can be learnt from unlabeled images. The model utilized by the authors is an autoencoder consisting of three parts: local receptive fields, L2 pooling, and local contrast normalization. The model was made into a deep autoencoder by replicating those parts three times, so the overall model consists of nine layers. The local receptive fields are not convolutional; the authors argue that by not sharing weights across locations, the model can learn features invariant to more than just translation (e.g., rotation and scaling). The optimization problem they minimize is based on Topographic Independent Component Analysis (TICA) and consists of two terms: a reconstruction term that preserves important information about the input, and a second term that allows L2 pooling to group similar features.
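
    If I remember the paper correctly, the objective has roughly this form (my reconstruction from memory, so treat the notation as approximate):

    ```latex
    \min_{W_1, W_2} \; \sum_{i=1}^{m} \left( \left\lVert W_2 W_1^{\top} x^{(i)} - x^{(i)} \right\rVert_2^2
      + \lambda \sum_{j=1}^{k} \sqrt{\epsilon + H_j \left( W_1 x^{(i)} \right)^2} \right)
    ```

    The first term is the reconstruction error (with encoding weights W_1 and decoding weights W_2), and the second is the TICA-style pooling penalty, where H_j encodes the fixed pooling structure and \epsilon is a small constant for numerical stability.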

    Just as Karan pointed out, I was surprised that the authors had not used GPUs for model parallelism. That might have allowed them to afford convolutional layers in their architecture. Perhaps they could have simultaneously trained a conventional convolutional neural network to learn high-level features alongside the model proposed in the paper for unsupervised learning. Had they done so, it would have been fascinating to see the results.

  8. This comment has been removed by the author.

  9. The main paper was one of the first to explore deep learning with an extremely large dataset. It does a lot worse than standard supervised deep learning classification; however, the point of the paper is that it can classify better than standard k-means methods, and even standard deep autoencoder methods. I think this paper doesn't use convolution; it captures a similar idea with local receptive fields. As Lerrel said, it would be interesting if they could use an architecture with convolution and pooling layers instead of local receptive fields. Overall, I think the main contribution of this paper is showing that with enough data and a better parallel structure, deep networks can outperform state-of-the-art methods on many image tasks.

  10. I read the first paper. They train an autoencoder in an unsupervised manner and then test each neuron to see how well it can classify face from non-face (for some 20 thresholds of its activation value) and take the best one. Fortunately, one of them gets really fixated on faces and fires almost every time there's a face in the image, hence becoming a 'face detector'. So I guess the reason is that faces occur frequently, and that neuron is responsible for reconstructing faces. As Lerrel points out, I wonder how it'll do for things that are not so frequent in the visual world.
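
    A minimal numpy sketch of that evaluation protocol as I understand it (variable names and the fake data are mine; the real experiment used a held-out face/non-face test set):

    ```python
    import numpy as np

    def best_face_neuron(activations, labels, n_thresholds=20):
        """activations: (n_images, n_neurons); labels: 1 = face, 0 = non-face.
        Sweep ~20 thresholds per neuron, keep the most accurate (neuron, threshold)."""
        best = (-1.0, -1, 0.0)  # (accuracy, neuron index, threshold)
        for j in range(activations.shape[1]):
            a = activations[:, j]
            for t in np.linspace(a.min(), a.max(), n_thresholds):
                acc = np.mean((a > t).astype(int) == labels)
                best = max(best, (acc, j, float(t)))
        return best

    # Toy usage with random data; a real 'face neuron' would stand out clearly.
    acts = np.random.rand(1000, 50)
    labs = np.random.randint(0, 2, size=1000)
    print(best_face_neuron(acts, labs))
    ```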

  11. I read the paper "Building high-level features using large scale unsupervised learning". I understand the basic intuition of the paper: replicate the human brain's ability to recognize certain objects (faces here). Is the optimization function mentioned similar to a loss function on which weights are learned? As others have mentioned, it may work because of the frequency of faces in the visual world; can we learn a neuron with similar accuracy for a different category? I see the paper as a precursor to other deep learning methods. Would the accuracy improve if labels were given and a loss function were minimized to learn the weights?

  12. I read the main paper "Building high-level features using large scale unsupervised learning".
    The authors successfully apply the idea of local receptive fields of neurons to large input images. Also, the lack of a GPU seems justified once the authors describe the importance of asynchronous SGD, which lets independent parts of the optimization proceed in parallel.

    I think the result of the paper, learning a neuron that is really good at recognizing faces, is amazing given that the training data is unlabeled. However, I don't think the authors present enough information about the data to judge whether the network is really learning what it should. The use of unsupervised learning leaves a loophole: what if there are other things, like different types of cats (instead of just cats), or other objects that appear with higher frequency in the dataset? The use of random samples from YouTube videos is great for demonstrating that the particular network architecture presented in the paper can actually learn interesting high-level filters. However, it doesn't prove the robustness of the network: what if the network is supposed to learn something else from the given data?

    Replies
    1. Personally, I think their title explains everything. The authors proposed to find “class-specific” “high-level features” in an unsupervised manner. What they hope to do is find a good representation of the original class through the autoencoder, so that the detector still fires even if there are deformations or out-of-plane rotations in the image. So I guess even if there are different types of cats in the training dataset, their neurons would still find a robust representation of “cat” (high-level), not a representation of “what kind of cat” (attributes?). As for having lots of other objects in the dataset, honestly I think the approach would not work, since it would be confused about which representation it should learn.

  13. Building high-level features using large scale unsupervised learning
    ====================================================
    This paper tries to tackle a problem that very few have succeeded at before. Although unsupervised learning has been successful to a certain extent at learning low-level features, it has not proved very useful for learning high-level features like faces, humans, and cats. This paper tries to learn high-level features by training a deep network, which they call a deep sparse autoencoder, on unlabeled data. More specifically, it explains how their deep network can be used to detect faces in images. They also test its performance for recognizing objects in ImageNet, where they achieve an accuracy of 15.8%.

    Strengths
    -------------
    * The authors have made a very bold attempt and invested plenty of resources (time and computing) in a problem which many have shied away from.
    * They have got pretty good results and have beaten the previous state of the art by a significant margin.
    * They have backed their architecture with facts about how our brain works.

    Weaknesses
    -----------------
    * They claim that babies are able to detect faces without any form of reinforcement or reward. However, it is clear from their results that babies are still much better at detecting objects (or faces in this case).
    * I am not sure, even with the significant improvement in results, that this work will rekindle the faith of other researchers in unsupervised learning for high level features and object detection.

    Using Multiple Segmentations to Discover Objects and their Extent in Image Collections
    ===================================================================================
    This is a relatively old paper whose goal is to discover and segment object classes from a large collection of visually similar images. The authors explain that while the problem of discovering object classes had been tackled before, detecting and segmenting these object classes had not been addressed until this paper. The idea is to segment each image and use a visual bag of words within each segment to identify similarities; using segments introduces spatial information into the bag. To reduce the impact of bad segments, they perform multiple segmentations of each image with different parameters and pool the segments from all of them (see the sketch below).
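
    To illustrate the multiple-segmentation trick, a short hedged sketch (I am using scikit-image's Felzenszwalb segmenter as a stand-in; the paper itself used Normalized Cuts with varying parameters):

    ```python
    import numpy as np
    from skimage.segmentation import felzenszwalb

    def candidate_segments(image, scales=(50, 100, 200, 400)):
        """Run the same segmenter at several parameter settings and pool all
        resulting segments; the hope is that some of them are 'good' by luck."""
        segments = []
        for scale in scales:
            labels = felzenszwalb(image, scale=scale)  # one coarser/finer segmentation
            for s in np.unique(labels):
                segments.append(labels == s)           # boolean mask per segment
        return segments
    ```

    Each pooled segment would then be described by its histogram of visual words and fed to the topic model.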

    Strengths
    -------------
    * The idea is very simple and the results were good during the time of its publication.

    Weaknesses
    -----------------
    * The results could have been more extensive. They have not really compared their segmentations with competing algorithms (probably because none existed).
    * It would have been nice if they had shown how the model represents each class. For example, how does the model represent the class "car", which can have a number of different visual characteristics?

    Object-Graphs for Context-Aware Visual Category Discovery
    ===============================================
    In this paper, the authors use semi-supervised learning to discover object categories in images. They reason that if we have labelled data for some object categories, we can reasonably estimate other, unknown categories by identifying their interaction patterns with the known categories, in addition to visual similarities. They also use multiple segmentations, similar to the previous paper, to avoid issues with incorrect segmentations.

    Strengths
    -------------
    * The idea is very simple yet innovative. They encode the object level occurrence patterns in a layout graph.
    * The usage of multiple segmentations alleviates issues around incorrect and bad segmentations.
    * They have specified the exact parameters they used in their experiments, which makes it very easy to reproduce the results.

    Weaknesses
    -----------------
    * While the results are extensive in terms of the number of datasets tested, I feel that they have not sufficiently compared their results with competing algorithms. Again, this could probably be because they are the first ones to tackle this problem.

  14. I read the second paper: Object-Graphs for Context-Aware Visual Category Discovery.
    They propose an object-graph descriptor that captures the likely categories among the neighboring (known) segments and their proximity to the unknown base segment, by creating a histogram of localized counts of object presence weighted according to each class's posterior. They also use multiple segmentations to secure a robust estimate of the posterior. After constructing an object-graph and computing affinities between unknown regions, they cluster to discover the categories. Except for some unclearness in the write-up, I found their approach pretty straightforward for achieving their goal: context-aware category discovery in an unsupervised manner (rough sketch below).
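
    My rough reading of the descriptor, as a numpy sketch (heavily simplified: the real descriptor distinguishes neighbors above and below the base segment, and the posteriors come from classifiers trained on the known categories; all names here are mine):

    ```python
    import numpy as np

    def object_graph_descriptor(neighbor_ids, posteriors, n_nearest=10):
        """neighbor_ids: segment indices sorted by proximity to the base segment.
        posteriors: (n_segments, n_known_classes) class posteriors per segment.
        Accumulate posterior-weighted class counts over expanding neighborhoods."""
        hist, running = [], np.zeros(posteriors.shape[1])
        for idx in neighbor_ids[:n_nearest]:
            running += posteriors[idx]    # soft count of each known class so far
            hist.append(running.copy())   # one sub-histogram per neighborhood size
        return np.concatenate(hist)
    ```

    Affinities between unknown regions can then be computed on these descriptors before clustering.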

  15. I read the first paper on large-scale unsupervised learning. The model used here is mostly inspired by observations of biological models, such as "stacking a series of uniform modules switching between selectivity and tolerance layers". The use of the receptive-field concept is beneficial for parallelism. For an autoencoder-based network, it is expected, and shown here, that the model itself can learn visual concepts. But it raises a question: how much data is needed to learn one visual concept? Also, I don't quite understand why the authors chose models whose parameters are not shared across different locations. After all, an object can appear at different locations in an image or frame, so it is not intuitive to choose a spatially variant detector.

  16. Building high-level features using large scale unsupervised learning:

    This paper simulates the "grandmother neurons" of the human brain (neurons sensitive to specific object categories) using a very large amount of unlabeled data. Their results show a leap in accuracy over previous state-of-the-art unsupervised learning results.

    There are some arguments in previous posts about parallelizing the training process; this adds to the contribution in real-life implementations, but it is not the main focus. Also, I have doubts about generating more complicated category-specific neurons. The high-level features of the three categories selected for their experiments seem too simple.

  17. I've read the "cat paper". I'm not clear about the paper's dataset and goals.

    Does the dataset only contain faces? It seems that it contains other images as well. So why is this system able to learn faces and not other objects?

    Also, could a reconstruction system do the same work? Such a system would just need to label images that have a lower reconstruction error as face (or cat or body). A rough sketch of what I mean is below.
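
    Here is the idea as a toy sketch (the encode/decode functions are hypothetical stand-ins for a face-trained autoencoder; nothing like this is in the paper):

    ```python
    import numpy as np

    def label_by_reconstruction(x, encode, decode, threshold):
        """Call an image a 'face' when a face-trained autoencoder reconstructs
        it well, i.e. with low mean squared error."""
        err = np.mean((decode(encode(x)) - x) ** 2)
        return "face" if err < threshold else "non-face"
    ```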

  18. I read the paper "Building High-level Features Using Large Scale Unsupervised Learning". The problem they are trying to solve seems very difficult to the uninitiated: how can one possibly obtain class-specific high-level features without any kind of supervision at all? It's like a chicken-and-egg problem. Although ideas from clustering techniques like visual words come to mind, getting high-level features is hard compared to low-level features.

    I enjoyed reading about the parallels they draw with the human brain. The human brain's biggest advantages are data scale and computing power, and they move in that direction by using data and computing power an order of magnitude greater than standard deep models.

    Their results are promising and beat the state of the art on ImageNet. This is encouraging and it will motivate other researchers to move in this direction.

    Replies
    1. I somehow disagree that it is a chicken-and-egg problem. In my view, it is a problem of optimization (or overfitting): given a large database of images, if you try to learn some "sparse" representation of it, images that share a similar appearance will certainly be grouped together.
