**H. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui and T. Darrell. On learning to localize objects with minimal supervision, ICML, 2014.**

And additionally:

X. Chen, A. Shrivastava and A. Gupta. Enriching Visual Knowledge Bases via Object Discovery and Segmentation, CVPR, 2014.

C. Wang, W. Ren, K. Huang and T. Tan. Weakly Supervised Object Localization with Latent Category Learning, ECCV, 2014.

I will present the paper "On learning to localize objects with minimal supervision", while Larrel will present the other two.

This paper presents an approach to learning object detectors from weakly-labeled images, i.e. images labeled only with whether the object of interest is present or not.

Overall, they formulate the whole problem as a latent SVM (LSVM), with the bounding box positions in positive images as the latent variables (analogous to the root filter position in the Deformable Part Model paper).

One problem is that the LSVM objective is nonconvex and highly dependent on a good initialization of the bounding box positions. In the classical setting where bounding boxes are given, the latent variables start from a very good position (note that the root filter position can still move in the DPM paper, so even with no parts, DPM still solves an LSVM problem), and the LSVM can converge to a very good local minimum.

So the crucial difference between training object detectors with and without bounding boxes is the initialization of the object locations.
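To make the alternation concrete, here is a minimal sketch of one step of a hard latent SVM in this MIL setting, assuming linear scoring over precomputed box features; the function name and the simple subgradient update are my own simplifications, not the paper's implementation:

```python
import numpy as np

def hard_lsvm_step(w, pos_boxes, neg_feats, lr=0.01, C=1.0):
    """One alternation of a (hard) latent SVM:
    1. Impute latent variables: pick the highest-scoring box per positive image.
    2. Take a subgradient step on the resulting hinge-loss objective."""
    # Step 1: latent imputation (argmax over candidate boxes in each positive image)
    pos_feats = np.array([boxes[np.argmax(boxes @ w)] for boxes in pos_boxes])

    # Step 2: subgradient of  0.5 * ||w||^2 + C * sum of hinge losses
    grad = w.copy()
    for x, y in [(f, +1) for f in pos_feats] + [(f, -1) for f in neg_feats]:
        if y * (x @ w) < 1:          # margin violated by this example
            grad -= C * y * x
    return w - lr * grad
```

Iterating this step alternates between imputing the latent boxes and updating the detector; with a poor initial w, the argmax can lock onto the wrong boxes, which is exactly the initialization problem discussed above.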

Paper contribution 1: initialization of bounding boxes via submodular cover. To solve the initialization problem, the paper finds good candidate bounding boxes by solving a submodular cover problem. The basic idea is as follows. Consider all candidate boxes from all images as nodes of a graph (say 1000 boxes from each of 1000 images, giving 1M boxes), with edge weights given by box similarity. The set of true bounding boxes should be similar to one another and dissimilar from boxes in negative images. The authors formulate this as a particular submodular cover problem and solve it approximately with a greedy algorithm.
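As an illustration of the greedy step (not the paper's exact covering objective, which is built from nearest-neighbour relations and a threshold), a generic greedy routine for a coverage function might look like this:

```python
def greedy_cover(candidates, covers, k):
    """Greedy selection for a (sub)modular coverage objective.
    `covers[i]` is the set of graph nodes that candidate box i covers,
    e.g. its nearest-neighbour boxes in other positive images."""
    selected, covered = [], set()
    for _ in range(k):
        # marginal gain of a candidate = number of newly covered nodes
        best = max(candidates, key=lambda i: len(covers[i] - covered))
        if not covers[best] - covered:
            break  # no candidate adds coverage; stop early
        selected.append(best)
        covered |= covers[best]
    return selected
```

For submodular objectives, greedy selection comes with classical approximation guarantees, which is what justifies this cheap heuristic on a graph with a million boxes.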

Solving the cover problem yields a set of guessed bounding boxes that are nearly as good as ones labeled by humans.

Paper contribution 2: smooth formulation of LSVM. The traditional (hard) LSVM objective is non-smooth and requires optimization tricks. This paper instead solves a smooth approximation of it, allowing the use of efficient algorithms for smooth problems (e.g. L-BFGS). It turns out that this smooth formulation considers the top N (N > 1) scored root locations during optimization, rather than only the top-1 location as in the regular LSVM, making the optimization more robust. This is particularly useful here, since the bounding boxes we find are certainly noisy and not as precise as hand-labeled ones.
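The paper smooths the latent max with a strongly convex function on the probability simplex; with the entropy choice this reduces to the familiar log-sum-exp, sketched below. (The paper's experiments use the squared Euclidean norm instead, which yields exactly sparse top-N weights, but the entropy version shows the idea.)

```python
import numpy as np

def smoothed_max(scores, mu):
    """Log-sum-exp smoothing of the latent max with temperature mu.
    As mu -> 0 this recovers the hard max over latent locations."""
    s = np.asarray(scores, dtype=float) / mu
    m = np.max(s)
    return mu * (m + np.log(np.sum(np.exp(s - m))))

def latent_weights(scores, mu):
    """Softmax weights over latent locations: the gradient of the
    smoothed max spreads over several top-scoring boxes instead of one."""
    s = np.asarray(scores, dtype=float) / mu
    e = np.exp(s - np.max(s))
    return e / e.sum()
```

Small mu approaches the hard max (hard LSVM), while larger mu spreads the gradient over several high-scoring boxes, which is what makes the optimization more robust to noisy initial boxes.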

I read the "localize objects with minimal supervision" paper. I think Yimeng has already explained it in detail. I have one question: in the initial graph, can a box be connected to two boxes in the same image? It seems that in their formulation a box can only be connected to one box per image.

"For each box b, we find its nearest neighbor box in each (positive and negative) image. We sort the set N (b) of all such neighbors of b in increasing order by their distance to b."

According to the above, you are right. However, I think it should be easy to extend the algorithm to allow a box to be connected to two or more boxes in an image without a significant impact on time complexity. The algorithm can remember the top n nearest neighbors in each image instead of just one. When the sorting is performed and the top k neighbors are picked, the bad neighbors are automatically removed. I think this would improve the accuracy of the algorithm if the dataset has several images with multiple positive instances.
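A sketch of that extension, keeping the top n nearest neighbours per image instead of one; the function name and the plain Euclidean distance are my own assumptions for illustration, not the paper's:

```python
import numpy as np

def neighbours_per_image(box_feat, image_feats, n=1):
    """For one candidate box, return its n nearest neighbour boxes in each
    other image (the paper keeps n = 1; n > 1 lets images with multiple
    object instances contribute more than one link)."""
    neighbours = []
    for img_id, feats in enumerate(image_feats):
        d = np.linalg.norm(feats - box_feat, axis=1)   # distances to all boxes
        for j in np.argsort(d)[:n]:                    # n closest in this image
            neighbours.append((d[j], img_id, int(j)))
    neighbours.sort()                                  # global sort by distance
    return neighbours
```

Since only the per-image sort changes from top-1 to top-n, the overall time complexity is essentially unchanged, which matches the argument above.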

Where do you see this constraint of only one box? The graph over positive images has many box proposals, and the objective tries to find a cover of those.

I read the paper "On learning to localize objects with minimal supervision", which Yimeng summarizes well. I was wondering whether the submodular cover's assumption (that the data contains bounding boxes that are similar and cover all modes) is too strong to always yield a good initialization, since the LSVM depends so heavily on it. Adding annotations from other sources seems to improve the other algorithms; I wonder how much adding something like difficult annotations would improve this one?

I read the paper on localizing objects with minimal supervision. The authors do a good job explaining the intuition behind choosing a discriminative submodular cover, and explaining the attempt to solve the initialization problem. One of my gripes with the paper is that they did not contrast the two strongly convex functions on the probability simplex, the Euclidean norm versus the entropy. Although they explain and justify the choice of the smooth Euclidean norm w(u) for their experiments, they say nothing about the properties of the entropy alternative.

Overall, a really good paper that provides some good material on convex optimization, and is worthy of an ICML publication.

On learning to localize objects with minimal supervision:

Good initialization solves most of the unsupervised learning problem. I find the method for initializing the bounding boxes interesting: constructing a graph over arbitrary boxes and solving the submodular cover problem is a smart approach, though the greedy method may have high complexity.

I read the paper "On learning to localize objects with minimal supervision". Localizing the detector and modeling the detector is a chicken-and-egg problem. To deal with this, the authors propose to start with a good localization of the detectors, i.e. the window locations, by solving a submodular cover problem. The intuition is that good windows should contain patches that are both discriminative (appearing in very few negative images) and representative (their modes in the positive images are limited). Based on this, they define a covering score and optimize it with a greedy method. The threshold in the score seems to be very important, but I'm not sure how it is selected. For different classes, the modes of the object to detect (for example, which side of the object faces the camera) would be very different. Consequently, we should use different threshold values in the covering score, or the threshold should be optimized during training, along with the detectors.

I read the paper "On learning to localize objects with minimal supervision". Though I was able to understand the intuition, I was not able to follow the mathematical proofs presented in the paper. At a high level, I think the paper first finds candidate boxes (generally thousands per image) under the assumption that the true bounding boxes will be similar in an appropriate feature space, then constructs a graph over all the bounding boxes with edges given by similarity (or distance) in that feature space. Is 'V' just a bounding box and 'U' an entire image? I didn't understand what U and V are and how they differ from each other. Once the initialization is done by this set-cover mechanism, they smooth the latent SVM to complete the learning of the object detector. It would be nice if the presenters could go into some detail on the smoothed LSVM and the set cover parts.

The first paper approaches the problem of weakly-supervised object localization using discriminative submodular cover to discover an initial set of likely object locations, followed by iterative refinement using a smoothed formulation of latent SVM over CNN features of window proposals.

Like most MIL problems, this approach is non-convex and sensitive to initialization. Their initialization approach using submodular cover relies on two essential assumptions: the correct boxes are similar in feature space across positive images and they do not occur in negative images. Thus, the goal is to seek dense subgraphs spanning only positive images, a difficult combinatorial optimization problem that they approximate with submodular cover.

The smoothed version of latent SVM allows for fast quasi-Newton optimization methods and effectively considers the top-N configurations instead of the single maximum, leading to improved robustness. I think it's interesting that in this case the objective smoothing, which you typically think of as being solely for optimization efficiency purposes, actually improves overall performance.

I read "On learning to localize objects with minimal supervision". Yimeng presents a good summary and I agree with all comments written above.

However, I think this is a good paper where the authors focus clearly on both the initialization technique and the actual algorithm. It's interesting that the authors try to actually solve for a good initialization (not very common in the papers we have read earlier). I would request the presenter to explain the graph formulation in a bit more detail if possible; the math is a bit challenging to understand right now. It's also interesting that the authors use features learned from R-CNN for the weakly-supervised object detection task instead of hand-selected features.

I read "On learning to localize objects with minimal supervision".

The paper seems like a smart approach to solving the initialization problem. As far as I understand it, no initial bounding boxes are provided, so the MIL minimization has to solve for everything.

Would it be possible to add a small handful of example bounding boxes at the start? Could this accelerate the approach, or would it completely change the scope of the algorithm?

They actually initialize with region proposals, right?

Correct. There are around 1000 proposal boxes per image.

Also, I guess if you had actual examples this would be a semi-supervised learning problem?
