Sunday, March 22, 2015

Reading for Monday 3/23

S. Singh, A. Gupta and A. Efros. Unsupervised discovery of mid-level discriminative patches, ECCV, 2012.

And additionally:

L. Bourdev and J. Malik. Poselets: Body Part Detectors Trained Using 3D Human Pose Annotations, ICCV, 2009.

Y. Li, L. Liu, C. Shen, A. van den Hengel. Mid-level Deep Pattern Mining, arXiv preprint arXiv:1411.6382 (2014).


  1. Paper 1 gives an intuitive, easy-to-understand algorithm for image representation. The authors propose an unsupervised approach to discovering representative and discriminative patches, intended to replace existing unsupervised image representations such as keypoint features. They iterate between training a discriminative classifier on the current clusters and using that detector to add patches to each cluster. They also extend the method to the supervised task of scene classification by simply discovering patches within a single class of scenes.

    I think the biggest draw of this paper is that it discovers parts of an image that (frequently) have some semantic meaning, compared to earlier approaches such as SIFT, which would only give some keypoints. Also, I found their extension to finding doublets interesting. In fact, I think it can be extended to n patches, enforcing spatial consistency over these patches using the fundamental matrix (just as we did over SIFT). Many scenes, like theaters and stadiums, can have multiple discriminative patches that almost always occur with some spatial consistency.

  2. Paper 3 is a deep approach to discovering discriminative patches. I'd like to point out two interesting observations they present on deep features (activations of the BVLC reference network): (i) a sparsified feature, keeping only the k highest values and setting all the rest to 0, works reasonably well for classification; (ii) even binarizing that feature has little effect on accuracy, so the discriminative information of a CNN feature is hidden in the indices of the largest magnitudes. This lets them represent the image as a 'transaction' in their pattern mining approach.

    It was also interesting to see how they adapted this problem to generic pattern mining (I've seen this for the first time, though they refer to a paper by Luc Van Gool that seems to do something similar). They represent each image as a 'transaction' made up of the high-valued indices in the deep feature, and then discover patterns through association rule mining (i.e. find those patterns which have high support, meaning they occur frequently in that class, and high confidence of association between the class label and the pattern). It seems to me quite a contrived method of discovering the patches, but I guess they went with it rather than something like Singh et al. with deep features because that would probably have been computationally infeasible (and this pattern mining is designed for big data).
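
The two observations and the mining step described in this comment can be illustrated with a small sketch. This is not the paper's code: the function names `to_transaction` and `support_and_confidence`, the toy activations, and the class labels are all made up for illustration, and support is taken here simply as the fraction of all transactions that contain the pattern.

```python
from typing import FrozenSet, Iterable, List, Sequence, Tuple


def to_transaction(activation: Sequence[float], k: int) -> FrozenSet[int]:
    """Sparsify and binarize: keep only the indices of the k largest activations."""
    order = sorted(range(len(activation)), key=lambda i: activation[i], reverse=True)
    return frozenset(order[:k])


def support_and_confidence(
    pattern: Iterable[int],
    transactions: List[FrozenSet[int]],
    labels: List[str],
    target: str,
) -> Tuple[float, float]:
    """Support and confidence of the rule `pattern -> target` (illustrative definitions)."""
    pattern = frozenset(pattern)
    # Labels of every transaction that contains all items of the pattern.
    covered = [lab for t, lab in zip(transactions, labels) if pattern <= t]
    support = len(covered) / len(transactions)
    confidence = covered.count(target) / len(covered) if covered else 0.0
    return support, confidence


# Toy 4-dimensional "deep features" for four patches (hypothetical values).
activations = [
    [0.1, 0.9, 0.0, 0.8],
    [0.0, 0.7, 0.2, 0.9],
    [0.9, 0.1, 0.8, 0.0],
    [0.2, 0.8, 0.1, 0.1],
]
labels = ["church", "church", "beach", "beach"]
transactions = [to_transaction(a, k=2) for a in activations]
result = support_and_confidence({1, 3}, transactions, labels, "church")
```

In this toy data the pattern {1, 3} appears in half of the transactions and only in the "church" ones, so it would be kept as a candidate mid-level element for that class; a pattern like {1} is more frequent but less confident.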

  3. I read paper 2. The authors propose an iterative discriminative clustering method to create mid-level visual patches. At each iteration, the patches in a cluster are used to train a linear SVM classifier against a natural world dataset, and the classifier is used to find a new cluster; eventually the method outputs the top n patches, which are regarded as the mid-level visual patches. In the process, they divide the dataset into two halves and swap them between iterations, which I found interesting. I didn't quite understand what they call "purity" and its justification. Hopefully the presenter addresses this and clears up my confusion.
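
The train-then-swap loop described in this comment might look roughly like the sketch below. It is only a sketch under loud assumptions: `train_detector` is a difference-of-means linear scorer standing in for the linear SVM, the firing threshold of 0 belongs to that stand-in, and the names `refine_clusters`, `m`, `min_size` and the toy 2-D "patches" are all hypothetical.

```python
from typing import Callable, List, Sequence

Vec = Sequence[float]
Detector = Callable[[Vec], float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean(vectors: List[Vec]) -> List[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def train_detector(pos: List[Vec], neg: List[Vec]) -> Detector:
    """Stand-in for a linear SVM: a difference-of-means linear scorer."""
    mp, mn = mean(pos), mean(neg)
    w = [p - q for p, q in zip(mp, mn)]
    midpoint = [(p + q) / 2 for p, q in zip(mp, mn)]
    b = -dot(w, midpoint)
    return lambda x: dot(w, x) + b


def refine_clusters(clusters, d1, d2, n1, n2, m=2, iters=4, min_size=2):
    """Alternate: train on one half, re-form clusters from detections on the other."""
    train_d, val_d, train_n, val_n = d1, d2, n1, n2
    for _ in range(iters):
        new_clusters = []
        for members in clusters:
            if len(members) < min_size:
                continue  # prune clusters that have become too small
            det = train_detector(members, train_n)
            ranked = sorted(val_d, key=det, reverse=True)
            # New members are the top-m firings on the held-out half.
            new_clusters.append([x for x in ranked[:m] if det(x) > 0])
        clusters = new_clusters
        # Swap the roles of the two halves for the next iteration.
        train_d, val_d = val_d, train_d
        train_n, val_n = val_n, train_n
    return clusters


# Toy data: two visual "concepts" (near the origin and near (5, 5)).
d1 = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
d2 = [[0.0, 0.1], [0.1, 0.1], [5.0, 5.1], [5.1, 5.1]]
n1 = [[2.0, 3.0], [3.0, 2.0]]  # "natural world" negatives, first half
n2 = [[2.5, 3.0], [3.0, 2.5]]  # "natural world" negatives, second half
initial = [d1[:2], d1[2:]]     # pretend these came from k-means on d1
refined = refine_clusters(initial, d1, d2, n1, n2)
```

Because each detector is always evaluated on the half it was not trained on, it cannot simply memorize its own members, which is the point of the swap.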

  4. I read the paper on Unsupervised Discovery of Mid-Level Discriminative Patches.

    I feel that the introduction is well-written. It goes into a good amount of detail describing various approaches for visual representation and why mid-level primitives are most appropriate for the task at hand. More specifically, low-level, bottom-up representations tend to have mixed generalization results, while high-level representations require a considerable amount of hand-labeled data. I wish more papers wrote such detailed and informative introductions; they really do facilitate understanding of both the task at hand and other approaches in the field.

    The authors motivate their methodology by the following two requirements for good discriminative patches: 1) frequently occurring and 2) sufficiently different from the rest of the world. They show that K-means, a standard approach, is not appropriate in the context of mid-level patches. To satisfy both of these requirements, the authors propose an iterative approach where they first perform clustering to find frequently occurring patches, then perform SVM-based detection to adjust the memberships, which ensures that patches in different clusters are different from each other. Detection in this context means that a cluster is trained to be discriminative against the rest of the visual world, which is represented by a natural world dataset. Furthermore, to reduce overfitting of the SVM classifiers, the authors incorporate cross-validation, which significantly improves cluster consistency.

    The authors generate patches by computing the HOG feature vectors at various resolutions for each image. Because patches are extracted so frequently, this algorithm seems to be very computationally expensive, which potentially inhibits applications in the real world. To mitigate this problem, the authors could consider using the integral HOG optimization, which speeds up feature extraction, especially if the HOG patches are large. Due to the computational redundancy of performing patch extraction at different scales, parallelizing patch extraction on the GPU also seems to be a viable way to improve performance. It would also be interesting to see how this algorithm performs with different features such as LBP or simpler histogram-based approaches.

    Overall, the results presented by the authors look great, both visually and quantitatively. The methodology used for quantitative evaluation does seem a bit unnatural (using PASCAL semantic category annotations for visual similarity); however, the authors do explicitly mention that this metric is not the best. To address this issue, the authors offer another smaller-scale experiment using human labelers to measure visual consistency. It would be interesting to see this experiment at a larger scale, perhaps utilizing Amazon's Mechanical Turk.
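
The integral HOG idea mentioned in this comment rests on integral histograms: one integral image per orientation bin, so that the unnormalized orientation histogram of any rectangle costs four lookups per bin, no matter how large the patch is. The sketch below assumes the per-pixel orientation bins and gradient magnitudes have already been computed; the function names and toy arrays are illustrative, and block normalization (which a full HOG descriptor needs) is left out.

```python
from typing import List


def build_integral_histogram(
    bin_img: List[List[int]], mag_img: List[List[float]], n_bins: int
) -> List[List[List[float]]]:
    """ih[b][y][x] = total magnitude of bin b over rows < y and columns < x."""
    h, w = len(bin_img), len(bin_img[0])
    ih = [[[0.0] * (w + 1) for _ in range(h + 1)] for _ in range(n_bins)]
    for y in range(h):
        for x in range(w):
            for b in range(n_bins):
                v = mag_img[y][x] if bin_img[y][x] == b else 0.0
                ih[b][y + 1][x + 1] = (
                    v + ih[b][y][x + 1] + ih[b][y + 1][x] - ih[b][y][x]
                )
    return ih


def rect_histogram(ih, top: int, left: int, bottom: int, right: int) -> List[float]:
    """Histogram over rows top..bottom-1 and columns left..right-1, O(1) per bin."""
    return [
        t[bottom][right] - t[top][right] - t[bottom][left] + t[top][left]
        for t in ih
    ]


# Toy 3x3 image: orientation bin index per pixel, unit gradient magnitude.
bin_img = [
    [0, 1, 1],
    [2, 0, 1],
    [2, 2, 0],
]
mag_img = [[1.0] * 3 for _ in range(3)]
ih = build_integral_histogram(bin_img, mag_img, n_bins=3)
whole = rect_histogram(ih, 0, 0, 3, 3)      # histogram of the full image
top_right = rect_histogram(ih, 0, 1, 2, 3)  # rows 0-1, columns 1-2
```

The build cost is paid once per image and scale, after which overlapping patches share all of that work.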

  5. The first paper approaches the problem of finding meaningful patches in a discriminative but unsupervised way. This overcomes the major limitation of poselets (described in the second paper), which require significant manual labelling to provide additional information for clustering patches that are close in both appearance space and 3D configuration space. In lieu of these additional labels, an iterative algorithm is employed for refining cluster assignments by enforcing that they be discriminative, in the sense that a linear SVM can separate them well from the natural world.

    While this approach appears to work very well, I have a few critiques/questions:

    1. The algorithm appears to be highly dependent on initialization; would multiple instantiations run on the same data set result in similar discriminative patches, or could the algorithm get stuck in poor local optima? But perhaps this wouldn't actually result in reduced performance in the final task, e.g. scene classification.

    2. As a (somewhat ad hoc) description of an algorithmic procedure, it's difficult to understand exactly what is being optimized. Is there some joint, global optimization problem that can be solved/approximated by this procedure?

    3. What kinds of biases are introduced by relying on HOG features and linear SVMs? The authors say that some clusters have poor purity because they contain multiple concepts that can be generalized simultaneously by a linear classifier. When might we expect this to happen? Would other features/classifiers work better?

    4. It's interesting that such a simple procedure tends to produce semantically-related clusters with no supervision. Why is this? Some property of the statistics of natural images or the types of objects/classes that people tend to be interested in?

    1. 1. I guess as long as there are enough samples in the initialization it is just approximate K-means, which should be reasonable. Furthermore, this looks like a mode-seeking procedure, which might explain the reasonable optima.

      2. As far as I can tell, this is an alternating optimization of their described goals, which are approximate discriminativeness and purity. And the iterative discriminative clustering looks kind of like mode seeking.

      4. Why? They are directly optimizing for the patches to be discriminative. Shouldn't that give semantic patches?

  6. I read paper 1 and found it very interesting and easy to follow. I found the way they introduce the problem, by arguing why mid-level primitives are better than pixels or visual words (or letters, as they later call them), very intuitive. It makes sense that mid-level patches will work better than visual words.

    Also, I liked how they reuse techniques from visual words by using clustering but then modify it to resolve the problem of distance manifold not being straightforward. They use a detector to solve the problem which is a nice combination of unsupervised and supervised approach.

    The simple but effective solution of using cross-validation training (which I think is a misnomer) to solve the overfitting issue also makes sense, but I don't know how computationally efficient it would be to keep repeating it until convergence. I am not sure if they mention it. Also, in this regard but not so relevant: I was pleasantly surprised to read the word 'Alas' in the paper. In the limited number of papers I have read, this is perhaps the first time I have noticed authors expressing emotions in a research paper!! :)

    Lastly, I like the fact the authors then go ahead and try the mid-level discriminative patches approach in various supervised settings as well.

  7. I read the first paper "Unsupervised Discovery of Mid-Level Discriminative Patches". I think this is a very well written paper and a very easy read.

    In this paper the authors present a new representation for visual data/images. They argue that they want the image representation to be representative (i.e. it occurs frequently in the images) and discriminative (i.e. it provides information to discriminate between various high-level visual phenomena). So they want to find these kinds of patches in an image at various scales in an unsupervised manner. They use an iterative approach to find these patches (clusters), first using k-means to initialize the clusters and then using an SVM to improve the clustering. The SVM is learned to differentiate a cluster from the rest of the visual world, which is a large set of patches taken from random images from the internet. To reduce overfitting, they divide their data into two parts, training and validation, and alternate between them to learn the SVM. They show improvements in scene classification tasks using these patches.

    I have some questions:
    Is it better to use an SVM in this case to improve the clusters, or would distance metric learning have been a better choice?

    After finding clusters of discriminative patches, do they use the mean patch as the feature, or a random patch from that cluster?

  8. Paper 1 aims to find representative and discriminative mid-level features for visual representation. The authors propose a discriminative clustering method to compute the clusters of visual patches. Unlike the discriminative k-means method, this paper suggests using a natural world dataset as the common negative data in the discriminative training step.

    Although this paper achieved the state of the art on the MIT Scene 67 dataset, it has a few drawbacks.

    1) The algorithm is very dependent on the initialization, and for k-means clustering the selection of the parameter k is not trivial.

    2) Every iteration needs to train an SVM, which is very time consuming.

    3) There are a lot of empirical parameters, such as the number of randomly sampled patches, the selection of the top 5 patches from the SVM, and the threshold to prune out small clusters. Replacing these empirical parameters with some learning-based method would make the algorithm cleaner.

  9. Paper 1 talks about how to find mid-level patches that are representative enough to describe an image. The paper argues that these patches are more useful and describe an image better than the standard visual words. It does make sense that the algorithm they describe would do better than visual words, since each cluster from K-means is augmented to be more discriminative by training a linear SVM. As others have pointed out, this algorithm does depend heavily on the initialization of K-means as well as the number of clusters, and it inherits the standard problems that come with basic clustering.

    I think that there are better options than a linear SVM for deciding whether these patches are discriminative, since the authors also point out that sometimes the linear SVM can generalize several concepts. Also, is there some sort of bias in when the SVM does this, since the paper relies solely on HOG features for both the clustering and the SVM?

  10. A few questions regarding the mid-level patches paper:
    1. Apart from low-energy rejection, were other patch rejection techniques tried? Is there something like a patch proposal framework that helps in getting good initial patches?

    2. I feel that these mid level patches learn something very similar to the 4-5th conv layer in deep nets. This seems quite interesting...

    3. The model seems like a simple but strongly engineered method. There are many parameters that have been set. Have any of these parameters been learnt? How did the authors figure out these parameters?

  11. I read paper 1, "Unsupervised Discovery of Mid-Level Discriminative Patches".

    This is a well written paper that delivers an interesting extension of "visual word" representation of visual information. The authors describe a new representation, namely "mid-level discriminative patches". This representation is "mid-level", because it describes visual features like body parts, object parts or whole objects, and the patches are representative of these mid-level features, and hence "discriminative".

    The authors present a "discriminative-clustering" algorithm to identify mid-level patches of objects within a dataset D. The algorithm is a revision of the k-means algorithm and has two steps. Given a set of initial clusters representing the mid-level patches, the algorithm first converts each cluster into a discriminative detector, treating the "natural world" as the negative set. Second, it updates each cluster to be its top "m" detections. These two steps are repeated iteratively (4-5 times). To help improve the accuracy of these clusters and to prevent the SVM (discriminator) from memorizing examples, the authors use cross-validation training in their discriminative clustering algorithm. Finally, the purity of each cluster (in that each cluster represents only the information present in its members) and the discriminativeness of each cluster (in that each cluster doesn't represent any other visual information) decide the score of each cluster. The algorithm then selects the top "n" discriminative mid-level patches.
    The authors also present an extension of the algorithm so that the learnt discriminative mid-level patches only represent one single visual aspect (see the section on detecting "doublets").

    Points for discussion:
    While the algorithm is interesting and yields good results, I am interested to know why the authors relied on a k-means clustering algorithm, and whether they considered using agglomerative clustering or affinity propagation, for example, or any algorithm where they didn't have to decide the number of clusters k.
    Further the natural world dataset is a random sample of internet images. I am curious as to whether this "random" sample was in any way biased towards some category that aided the results.

    Overall, the paper is quite instructive and easy to follow.
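
The final ranking step summarized in this comment can be sketched as follows. The definitions here are my reading rather than something the comment spells out: purity is approximated by the sum of the top-r detector scores on the discovery set, and discriminativeness by the fraction of all firings that land on the discovery set rather than on the natural world. The function name, the weight `lam`, the value of `r`, and the toy scores are hypothetical; the firing threshold of -1 is the one another commenter quotes from the paper.

```python
from typing import List, Tuple


def cluster_score(
    scores_d: List[float],
    scores_n: List[float],
    r: int = 2,
    threshold: float = -1.0,
    lam: float = 1.0,
) -> Tuple[float, float, float]:
    """Return (combined score, purity, discriminativeness) for one cluster's detector.

    scores_d: detector scores on patches from the discovery set D
    scores_n: detector scores on patches from the natural-world set N
    """
    # Purity proxy: how strongly the detector responds to its best r members.
    purity = sum(sorted(scores_d, reverse=True)[:r])
    # Discriminativeness proxy: share of firings that fall inside D.
    fired_d = sum(1 for s in scores_d if s > threshold)
    fired_n = sum(1 for s in scores_n if s > threshold)
    total = fired_d + fired_n
    discriminativeness = fired_d / total if total else 0.0
    return purity + lam * discriminativeness, purity, discriminativeness


# Toy detector scores (made up for illustration).
scores_d = [1.5, 0.5, -0.5, -2.0]
scores_n = [-0.9, -1.5, -3.0]
combined, purity, disc = cluster_score(scores_d, scores_n)
```

Ranking all clusters by the combined value and keeping the top n gives the final set of patches; how the two terms are weighted is exactly the kind of hand-set parameter several comments here ask about.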

  12. The first paper is very easy to read, and it clearly explains many things. The goals of the paper are well defined, and the authors provide very interesting results with their approach.

    A couple of comments/queries that I had:

    1. I wonder why the cluster purity would be lower if the number of cluster members is higher than 5. Since the approach they follow basically uses a linear SVM to learn splits, it would probably just have taken the algorithm a few more iterations to converge. Shouldn't a larger number of patches per cluster (probably) allow better generalization to a wider range of validation image patches?

    2. The idea of doublets is interesting. It was good to see two spatially co-occurring patches being grouped together as one, even though they vary in terms of appearance/illumination. The authors did well to adopt a HoG-based approach to model patches, and detect other clusters with similar patches computed across multiple scales.

  13. "Mid-level Deep Pattern Mining" explores a very simple yet very useful idea to use mid-level features for image classification (scene and object). I understand that it is the current state-of-the-art in mid-level visual element discovery.

    Without creating a new deep network, they use the existing Caffe reference model pre-trained on ImageNet data. They extract activations from the first fully connected layer and observe that they are useful for recognizing not only visual similarities but also semantic similarities. While this is interesting, I wonder how semantic features are available in the first fully connected layer. Going by intuition, I would expect the performance to improve if they extracted activations from the later fully connected layers, as more non-linearity is possible. The authors have not explained the rationale behind choosing the first fully connected layer.

  14. I read the first paper ‘Unsupervised Discovery of Mid-Level Discriminative Patches’. I think it's a very well written paper and easy to follow. In this paper, the authors have proposed a new representation for visual data, namely mid-level discriminative patches. They have presented an iterative clustering method using SVMs to compute the clusters for the patches in an unsupervised fashion, and they achieve state-of-the-art performance in scene classification tasks. I found the extension of the algorithm to doublets very interesting too. On the drawbacks, I agree with the comments made that the method can be very sensitive to the k-means initialization. I believe most of the important points regarding the paper have already been covered above.

  15. I read the paper "Unsupervised discovery of mid-level discriminative patches". As Esha mentioned, can other clustering methods be used in place of K-means clustering for detecting patches? The algorithm relies on many parameters, like 'm', the number of clusters, etc. Are the parameters found empirically? While finding the top 'm' firings, it is said that an SVM score greater than -1 is considered a firing, but when an SVM score is negative, isn't it a negative example? For the ranking, could the rank SVM discussed in the attributes paper be used for learning the weights of the linear combination? I didn't quite understand what is meant by 'second-order spatial co-occurrence'; is it something similar to SPM? I understand that instead of taking a single patch we now consider two patches making a doublet, which could be extended to a grouplet. At a high level, what I understand from the paper is that if we have one bowling image, we can split that image into a number of patches which are representative of bowling, and we can further establish a relation among these patches to classify the test images.

  16. Regarding the paper "Unsupervised discovery of mid-level discriminative patches", I found that the idea of doublets is quite similar to the Hough transform from the original SIFT paper. In this implementation, only the highly ranked patches are considered as a root group, and thus at least one must be contained in each of the doublets.

    Would it be possible to examine a combined ranking of any two spatially proximate groups instead? Would this require a different clustering algorithm, such that it would not rely on a fixed number of clusters (and potentially separating very similar data)? What if a constraint for dissimilarity between patches was added?

    I'm thinking of this from the perspective of something like a Kalman filter for 2 inputs. Neither one needs to be particularly clean (so long as it is actually providing data) to filter the output.

    Overall, it was an easy to read and understand paper.

    1. Harrison, could you explain your idea with the Kalman filter for 2 inputs?

  17. I went through the "Poselet" paper. It is similar to the usual detectors based on HOG and SVM, but the key difference is the use of a dataset with 3D annotations. They present the H3D dataset, which has both images and 3D pose annotations. Using the 3D pose information, they can collect all the mid-level patches that contain very similar 3D poses while allowing diverse appearance changes. Using this data, which is tightly clustered in both 3D joint configuration and 2D appearance, they train SVM classifiers and show performance comparable to DPM on VOC 2007.
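
To make "collect patches with very similar 3D pose" concrete, here is a deliberately simplified sketch: each example is a list of annotated 3D keypoints, and examples are ranked by squared distance to a seed configuration after removing translation. This is not the distance used in the Poselets paper (which, among other things, must handle scale and keypoint visibility); the function names and toy keypoints are hypothetical.

```python
from typing import List, Sequence

Point = Sequence[float]


def center(keypoints: List[Point]) -> List[List[float]]:
    """Subtract the centroid so the configuration is translation invariant."""
    n = len(keypoints)
    c = [sum(p[i] for p in keypoints) / n for i in range(3)]
    return [[p[i] - c[i] for i in range(3)] for p in keypoints]


def config_distance(a: List[Point], b: List[Point]) -> float:
    """Sum of squared distances between corresponding centered keypoints."""
    ca, cb = center(a), center(b)
    return sum((x - y) ** 2 for p, q in zip(ca, cb) for x, y in zip(p, q))


def nearest_examples(seed: List[Point], candidates: List[List[Point]], k: int) -> List[int]:
    """Indices of the k candidates whose keypoint configuration is closest to the seed."""
    order = sorted(
        range(len(candidates)), key=lambda i: config_distance(seed, candidates[i])
    )
    return order[:k]


# Toy configurations with three keypoints each (e.g. shoulder, elbow, wrist).
seed = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
candidates = [
    [[5.0, 5.0, 5.0], [6.0, 5.0, 5.0], [5.0, 6.0, 5.0]],  # same pose, translated
    [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]],  # same shape, twice as large
    [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]],  # arm pointing elsewhere
]
closest = nearest_examples(seed, candidates, k=1)
```

The image patches belonging to the closest examples would then serve as positives for one detector, which is why appearance can vary widely while the underlying pose stays tight.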

  18. Learning mid-level patches has proven to be effective, and the deep learning paper tries to use deep features instead of HOG to learn mid-level patches. I wonder whether it is possible to learn a deep network in an unsupervised way by stacking up the mid-level clustering procedure.