16-824: Learning-based Methods in Vision (S'15): Reading for Wednesday 3/4

Wednesday, February 25, 2015


P. Felzenszwalb, R. Girshick, D. McAllester and D. Ramanan. Object detection with discriminatively trained part-based models, TPAMI 32, no. 9 (2010): 1627-1645.

And additionally:

R. Girshick, J. Donahue, T. Darrell, J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation, arXiv preprint arXiv:1311.2524 (2013).

C. Szegedy, S. Reed, D. Erhan, D. Anguelov. Scalable, High-Quality Object Detection, arXiv preprint arXiv:1412.1441 (2014).

16 comments:

Lerrel, March 2, 2015 at 10:45 AM
The DPM paper (Object detection with discriminatively trained part-based models) is essentially a heavily engineered, multi-module paper. There is extensive math which, in hindsight, seems like overkill given how little intuition is offered for the corresponding concepts. For example, the latent SVM training seems sketchy to me. It would be nice if the presenters could explain the math in a more intuitive manner.

Conceptually the DPM seems sound. However, there are some intriguing engineering decisions that haven't been explained. For example, why is it assumed that the 'parts' have twice the resolution of the 'root'? Another instance of incomplete information is how exactly the "analytic" features derived from PCA on HOG help. How does this practically improve the dense feature map computation?

However, the biggest issue I have with the DPM is the lack of transparency into what exactly makes the algorithm work so well. Clearly from (http://arxiv.org/abs/1206.3714) we can see that the "deformable" nature of the parts isn't really the secret to object detection success. I personally feel that the mixture models gave the DPM its state-of-the-art (before deep nets) results. Any clues as to why this model does so well?
Ankit Laddha, March 2, 2015 at 6:48 PM
@Lerrel
PCA on HOG: This was done to reduce computation time. The filter response step requires a dot product of the filter coefficients with the feature map, so by reducing the dimensionality of the feature map the authors reduce the computation time.

But running PCA and then projecting each HOG vector onto the principal components also takes time, which eats into the computational gain. So the authors observed that the principal components could be expressed by sparse vectors. Projection onto these principal components then becomes a simple analytic operation. This way the authors don't have to do the PCA and projection step, which saves computation time.
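A minimal sketch of the analytic projection described above, assuming the 36-dimensional HOG cell feature is laid out as a 4 x 9 table (4 normalizations by 9 orientations) as in the DPM paper. Summing along each axis gives 9 + 4 = 13 values without any PCA at test time. The function name `analytic_projection` and the toy input are illustrative, not taken from the authors' code.

```python
def analytic_projection(cell):
    """Project a 4 x 9 HOG cell feature onto 13 sparse directions.

    cell: 4 rows (normalizations), each with 9 orientation values.
    """
    if len(cell) != 4 or any(len(row) != 9 for row in cell):
        raise ValueError("expected a 4 x 9 HOG cell feature")
    # Sum over the 4 normalizations for each orientation -> 9 values.
    per_orientation = [sum(row[j] for row in cell) for j in range(9)]
    # Sum over the 9 orientations for each normalization -> 4 values.
    per_normalization = [sum(row) for row in cell]
    return per_orientation + per_normalization


# Toy cell feature (hypothetical values, only for illustration).
cell = [[float(i + j) for j in range(9)] for i in range(4)]
feature = analytic_projection(cell)
print(len(feature))
```

Each output value is a plain sum, which is why the projection costs far less than a dense 36-dimensional dot product per principal component.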
Unknown, March 2, 2015 at 6:57 PM
I started reading the paper titled "Scalable, High-Quality Object Detection" and found that it made several references to something called an "Inception" network. It sounded like an interesting concept, so I started reading the 20th citation in that paper, titled "Going deeper with convolutions" (which I highly recommend to you all). Since it is a generic concept that we can use to build CNNs for several applications, I thought I would give a short summary here that captures the essence of the idea.

We know that, in general, the performance of a CNN improves as the network gets larger. However, this comes with two problems. First, since the number of parameters grows with network size, the risk of overfitting increases, which means we need more training images. Second, the computational cost of training the network can grow quickly with its size. The authors describe how to design "Inception" modules that can be repeated layer after layer to create a CNN that mitigates these issues.

The solution they propose is to move from fully connected to sparsely connected architectures, even within the convolutions. The authors claim that this better mimics biological systems, in addition to improving performance while mitigating the above issues.

Step 1 : Cluster neurons whose outputs are highly correlated
When constructing a particular layer, the authors recommend that we cluster neurons from the previous layer whose outputs are highly correlated and wire them together in the current layer. This is better than having uniformly dense convolutions.

Step 2 : Apply dimension reductions
Among the convolutions from step 1, we identify those that may be too expensive. Let's say we have 3x3 and 5x5 convolutions, which can be computationally expensive. The authors recommend that we add 1x1 convolutions between the previous layer and these dense convolutions. This is based on the idea that even low-dimensional embeddings might contain a lot of information about a relatively large image patch.

Applying steps 1 and 2 gives us a single "Inception module". We can repeat this process layer after layer to produce a deep CNN that performs similarly to a regular deep CNN but does not suffer as much from the two issues above.
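To make the saving from step 2 concrete, here is a rough back-of-the-envelope sketch that counts multiply-accumulate operations for a 5x5 convolution with and without a 1x1 reduction in front of it. The layer sizes (28 x 28 map, 192 input channels, 16 reduced channels, 32 output channels) are assumed for illustration, and the helper names are hypothetical.

```python
def conv_macs(height, width, c_in, c_out, k):
    """Multiply-accumulates for a k x k convolution (stride 1, same padding)."""
    return height * width * c_in * c_out * k * k


def reduced_conv_macs(height, width, c_in, c_mid, c_out, k):
    """Cost of a 1x1 reduction to c_mid channels followed by a k x k convolution."""
    return (conv_macs(height, width, c_in, c_mid, 1)
            + conv_macs(height, width, c_mid, c_out, k))


# Assumed layer sizes, chosen only to illustrate the effect.
direct = conv_macs(28, 28, 192, 32, 5)
reduced = reduced_conv_macs(28, 28, 192, 16, 32, 5)
print(direct, reduced, round(direct / reduced, 2))
```

With these assumed sizes the reduced path needs roughly a tenth of the operations; the exact ratio depends on how aggressively the 1x1 layer shrinks the channel count.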
Unknown, March 2, 2015 at 10:43 PM
1.) The DPM paper is essentially a smart and intuitive extension of existing object detection methods such as Dalal & Triggs. The intuition is that objects are deformable, so to detect them we can build a hierarchical structure over the different parts of the object. By assigning a score to each part and a loss to its deformation, we can effectively compute the confidence that the object is present at a given location. The authors expect some prior knowledge of concepts like the latent SVM. It would be nice if the presenters explained the semi-convexity of the latent SVM and the reason behind it.

2.) An interesting part of the paper Scalable, High-Quality Object Detection is the processing of 50 images per second on a CPU. The loss mentioned in the paper is similar to the one in DPM. There are some interesting aspects of the paper, like training with missing positive labels. I think it would be an important takeaway if the presenters could explain it in detail. The paper mentions calculating the prior of each bounding box, which could have been discussed in more detail.
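Returning to point 1, the "score for each part and a loss for the deformation" idea can be sketched as follows. This is a simplified one-dimensional toy with a brute-force search, not the paper's implementation (which uses generalized distance transforms over 2D feature maps); the function names and the numbers are hypothetical.

```python
def best_part_placement(responses, anchor, linear_cost, quadratic_cost):
    """Return (score, position) maximizing filter response minus deformation cost."""
    best = None
    for pos, response in enumerate(responses):
        dx = pos - anchor  # displacement from the part's anchor position
        score = response - (linear_cost * dx + quadratic_cost * dx * dx)
        if best is None or score > best[0]:
            best = (score, pos)
    return best


def hypothesis_score(root_score, parts, bias=0.0):
    """Root score plus the best deformation-penalized score of every part."""
    total = root_score + bias
    placements = []
    for responses, anchor, linear_cost, quadratic_cost in parts:
        score, pos = best_part_placement(responses, anchor, linear_cost, quadratic_cost)
        total += score
        placements.append(pos)
    return total, placements


# Two toy parts: (filter responses along a line, anchor, linear cost, quadratic cost).
parts = [
    ([0.0, 1.0, 3.0, 0.5], 1, 0.0, 0.5),
    ([2.0, 0.0, 0.0, 0.0], 2, 0.0, 1.0),
]
total, placements = hypothesis_score(1.0, parts)
print(total, placements)
```

The first part moves one step away from its anchor because the stronger response outweighs the deformation cost, while the second part stays at its anchor because moving would cost more than it gains.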
Lerrel, March 3, 2015 at 8:21 AM
Regarding the R-CNN paper (Rich feature hierarchies for accurate object detection and semantic segmentation):
This is a well-written paper in which the authors clearly state the reasons for their design choices and draw inspiration from previous work and from theories of how recognition works in the brain. However, here are my concerns that may require further clarification. Firstly, the authors compare their 'simple' approach to complex methods of the past (DPM). But when you first pre-train on ILSVRC, use region proposals, fine-tune your network, and then follow that up with an SVM, it seems that simpler methods could exist. For example: why not directly perform multivariate regression on the whole image (basically allow the network to find the bounding boxes without spoon-feeding it region proposals)? With the eclectic style of the R-CNN, the onus of success is on all the individual modules. Region proposals, I believe, are biased towards certain objects, which means the overall algorithm would also be biased towards certain objects.

Another issue of concern is the hand-wavy way in which warping region proposals to a 224X224 image is justified. Although in Appendix A the authors mention other methods of forcing their region proposals into their deep network, I'm afraid that this does not address the issue of detecting elongated objects that aren't as discriminative as a human body.
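To put a number on the warping concern, here is a small sketch that computes the per-axis scale factors when a box is stretched to a square network input. The 224 target side is the value quoted in this comment, and the 400 x 50 box is a made-up example of an elongated proposal; `warp_scale_factors` is a hypothetical helper.

```python
def warp_scale_factors(box_width, box_height, target):
    """Per-axis scale factors for anisotropic warping to a target x target input.

    Returns (sx, sy, distortion), where distortion is the ratio between the
    larger and the smaller scale factor (1.0 means the aspect ratio is kept).
    """
    if box_width <= 0 or box_height <= 0 or target <= 0:
        raise ValueError("all sizes must be positive")
    sx = target / box_width
    sy = target / box_height
    return sx, sy, max(sx, sy) / min(sx, sy)


# Hypothetical elongated proposal, e.g. a long thin object.
sx, sy, distortion = warp_scale_factors(400, 50, target=224)
print(round(sx, 3), round(sy, 3), round(distortion, 3))
```

The distortion equals the box's own aspect ratio, so it does not depend on the target size: any 8:1 proposal is stretched eight times more along one axis than the other.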

An important question that should arise if you read the DPM paper is whether this method works with deformable objects shown in different poses. One may argue that the deep network understands this and then gives the 4096-dimensional feature description of the image patch. But with a trained SVM that doesn't explicitly model the multi-modal behaviour of object categories, I do not think this would generalize well to objects that DPM does a good job on. The biggest strength of the R-CNN, in my opinion, is its great features; what it probably lacks is an unconstrained view of object categories. Would it be interesting to use R-CNN to detect parts and roots and then build a DPM over it, instead of a simple SVM?

Unknown, March 3, 2015 at 8:37 PM
I have a quick question regarding the DPM paper.

When creating the feature pyramid, the authors set \lambda, the number of levels in an octave, to 5 in training and 10 in testing. In Figure 3, the authors show an example where the part filters are taken from the original image and the root filter is taken from the half-resolution image, with \lambda = 2. However, I don't understand why we need to set \lambda to different values. Take Fig. 3 for example: even if \lambda = 5, the resolutions of the root filter and the part filters still remain the same. So what is this parameter used for? Any thoughts?
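One way to look at this question is to compute the pyramid scales directly. The sketch below assumes level 0 is the full-resolution image, each level shrinks the image by a factor of 2^(1/\lambda), and parts are placed \lambda levels finer than the root, which is how I read the paper's definition. The function names are hypothetical.

```python
def pyramid_scales(levels_per_octave, num_octaves):
    """Scale of each pyramid level relative to the input (level 0 = full resolution)."""
    return [2.0 ** (-level / levels_per_octave)
            for level in range(levels_per_octave * num_octaves + 1)]


def part_to_root_resolution(scales, root_level, levels_per_octave):
    """Resolution ratio between the part level (one octave finer) and the root level."""
    part_level = root_level - levels_per_octave
    if part_level < 0:
        raise ValueError("root level must be at least one octave down the pyramid")
    return scales[part_level] / scales[root_level]


for lam in (2, 5, 10):
    scales = pyramid_scales(lam, num_octaves=2)
    ratio = part_to_root_resolution(scales, root_level=lam, levels_per_octave=lam)
    step = scales[1] / scales[0]  # scale change between adjacent levels
    print(lam, len(scales), round(ratio, 6), round(step, 4))
```

Under these assumptions the part-to-root ratio is 2 for every \lambda, as the comment observes. What changes is the step between adjacent levels: a larger \lambda samples object scales more finely, at the cost of more levels to search.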
Calvin Murdock, March 3, 2015 at 10:53 PM
In section 5 of the R-CNN paper, the "fg" strategy computes CNN features only on a foreground mask by replacing the background pixels with the mean input so that they are zero after mean subtraction. What effect does this have on the final features? With most inputs zeroed out, I can imagine that the final feature vector magnitude could be significantly smaller for small objects. In addition, it would likely not be invariant to translation or pose changes, but this could potentially be beneficial for segmentation tasks (e.g. certain classes tend to be smaller while others tend to be positioned in certain regions of the image). It would be interesting to see experiments that explore the effects of this particular approach to computing CNN features over arbitrary, non-rectangular regions. (Is there another way...?)
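A toy sketch of the masking step described above, on a single-channel 2 x 2 patch with a scalar mean. It only shows what happens to the network input, not to the final features, and the helper names and pixel values are made up for illustration.

```python
def mask_background(patch, mask, mean):
    """Replace background pixels with the mean, then subtract the mean.

    patch: rows of pixel values; mask: rows of 1 (foreground) / 0 (background).
    Background pixels end up exactly zero after mean subtraction.
    """
    return [
        [(pixel if keep else mean) - mean for pixel, keep in zip(patch_row, mask_row)]
        for patch_row, mask_row in zip(patch, mask)
    ]


def l2_norm(rows):
    """Euclidean norm of all values in a 2D list."""
    return sum(v * v for row in rows for v in row) ** 0.5


patch = [[10, 20], [30, 40]]   # hypothetical pixel values
mean = 25                      # hypothetical dataset mean
small_object = [[1, 0], [0, 0]]
whole_patch = [[1, 1], [1, 1]]

masked = mask_background(patch, small_object, mean)
full = mask_background(patch, whole_patch, mean)
print(masked, l2_norm(masked), l2_norm(full))
```

In this toy case the masked input has a smaller norm than the unmasked one, which is consistent with the worry about small objects, though how much of that survives through the network's nonlinear layers would need an actual experiment.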

Furthermore, I wonder if there is a way to adapt deep learning for detection without relying on bounding box proposals. The fully convolutional networks from last time address this for segmentation, but they are unable to separate multiple instances of the same object or restrict regions to potentially overlapping bounding boxes.
Unknown, March 4, 2015 at 7:02 AM
In the DPM paper:

- The features used in the paper include the latent variables, i.e. where the parts are relative to the root. Given the parameter \beta, which includes the anchor locations of the parts (v_i) in the learned model and the cost of displacing the parts (d_i) from their anchor positions, these variables can be estimated using the generalized distance transform.

- But the question is, even given the bounding box containing the object in the training data, how can we find those latent variables and learn the SVM parameters? The solution is the iterative optimization given in this paper. The parameters and latent variables are optimized alternately, fixing one and optimizing the other.

- Data-mining hard examples: This method is used to deal with the issue that the number of negative samples is much larger than the number of positive samples. Instead of training once using all the negative samples, we solve a sequence of training problems, each using a small number of hard examples. This is similar to boosting methods, e.g. AdaBoost, where samples classified incorrectly during the sequential training of weak learners are weighted more than those classified correctly.

- Question: I have one question regarding the training of DPM: how is \beta initialized in training? That is, we don't know the number and patterns of the filters, and we also don't know the part anchor v or the cost of part displacement. In turn, we also don't know the latent variables (dx, dy), the displacement of the parts from their positions in the model, since we don't know the part anchor v.
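For the iterative optimization in the second bullet, it may help to write the latent SVM objective out. This is a sketch in the DPM paper's notation as far as I can reconstruct it: z ranges over the latent part placements Z(x), \Phi(x, z) is the feature vector for a placement, y_i is +1 or -1, and C is the regularization constant.

```latex
f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z)

L_D(\beta) = \frac{1}{2}\,\|\beta\|^2
           + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\, f_\beta(x_i)\bigr)
```

Since f_\beta is a maximum of functions that are linear in \beta, it is convex in \beta. For a negative example (y_i = -1) the hinge term max(0, 1 + f_\beta(x_i)) is therefore convex, but for a positive example max(0, 1 - f_\beta(x_i)) is not convex in general. If the latent value z is fixed for each positive example, f_\beta becomes linear there and the whole objective is convex. That is the "semi-convexity" asked about earlier in the thread, and it is why the training alternates between choosing the best z for the positives under the current \beta and then optimizing \beta with those z held fixed.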