Wednesday, February 25, 2015

Reading for Wednesday 3/4

P. Felzenszwalb, R. Girshick, D. McAllester and D. Ramanan. Object detection with discriminatively trained part-based models, TPAMI 32, no. 9 (2010): 1627-1645.

And additionally:

R. Girshick, J. Donahue, T. Darrell, J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation, arXiv preprint arXiv:1311.2524 (2013).

C. Szegedy, S. Reed, D. Erhan, D. Anguelov. Scalable, High-Quality Object Detection, arXiv preprint arXiv:1412.1441 (2014).


  1. The DPM paper (Object detection with discriminatively trained part-based models) is essentially a heavily engineered, multi-modular paper. There is extensive math which in hindsight seems like overkill, given the lack of intuition provided for the corresponding concepts. For example, the latent SVM's training seems sketchy to me. It would be nice if the presenters could explain the math in a more intuitive manner.

    Conceptually the DPM seems sound. However, there are some intriguing engineering decisions that haven't been explained. For example, why is it assumed that 'parts' have twice the resolution of the 'root'? Another instance of incomplete information is how exactly the "analytic" features derived from PCA on HOG help. How does this practically improve the dense feature map computation?

    However, the biggest issue I have with the DPM is transparency into what exactly makes the algorithm work so well. Clearly from the results we can see that the "deformable" nature of the parts isn't really the secret to object detection success. I personally feel that the mixture models gave the DPM its state-of-the-art (before deep nets) results. Any clues as to why this model does so well?

    1. At least for pedestrian detection, methods which are not DPM variants achieve better performance than DPM [1]. Other than in the case of occlusion, there is no particular advantage to using DPM.

    2. Without parts, DPM gets around 30% on VOC07, which means the root filters are already very powerful. With parts it gains a 4-5% improvement, which was actually huge at the time. I will explain all the tricks leading to the good performance in class; they can also be transferred to other vision tasks.

    3. I think it is reasonable for the part filters to have a higher resolution than the root filter. The role of the part filters is to extract small but salient features of the object. Such details won't be detected if the resolution is too coarse, as it is with the root filter.

  2. @Lerrel
    PCA on HOG: This was done to reduce the computation time. The filter response calculation step requires a dot product of the filter coefficients with the feature map. So by reducing the dimensionality of the feature map, the authors reduce the computation time.

    But the PCA projection of each HOG vector onto the principal components also takes time, which would reduce the gain in computation. So the authors made an observation about the principal components: they can be closely approximated by sparse vectors. The projection onto these principal components then becomes a simple analytic operation. This way the authors don't have to do the explicit PCA projection step, which saves computation time.

    1. I do understand the general intuition behind your explanation. What I do not understand is the following: PCA is a linear operator, right? This means you have an analytic expression for each principal component. The projection operation is also analytic, which means the projection of a HoG vector onto the principal components is analytic. To me it seems like the same thing.
      I guess the sparse vectors have a simpler analytic form, and it would have been great if the authors had mentioned that.

      Thanks for the explanation Ankit!
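    To make the speedup concrete, here is a toy numpy sketch of the dimensionality reduction being discussed. The 31-dimensional HOG features and the top-11 principal components follow the paper; the data is random, and the sparse analytic approximation of the components (sums over rows/columns of the HOG block) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for HOG cells: n cells, each a 31-d feature (as in the DPM paper).
n_cells, d = 1000, 31
hog = rng.random((n_cells, d))

# PCA on the cell features (in the paper, computed over features from many images).
mean = hog.mean(axis=0)
centered = hog - mean
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
k = 11                    # keep the top-k principal components (11 in the paper)
P = Vt[:k]                # (k, d) projection matrix

# Projected features: each filter-response dot product now costs k instead of d
# multiplications, which is where the speedup in the dense feature maps comes from.
hog_pca = centered @ P.T  # (n_cells, k)
```

    The paper's extra trick is that `P` turns out to be nearly sparse, so the matrix multiply above can be replaced by a few sums, avoiding the projection cost entirely.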

  3. I started reading the paper titled "Scalable, High-Quality Object Detection" and found that it made several references to something called an "Inception" network. It sounded like an interesting concept, so I started reading the 20th citation in that paper, titled "Going deeper with convolutions" (which I highly recommend to you all). Since it is a generic concept that we can use to build CNNs for several applications, I thought I would give a short summary here that captures the essence of the idea.

    We know that the performance of a CNN generally increases with the size of the network. However, this comes with two problems. Since the number of network parameters grows with the size, the possibility of overfitting increases, which means that we will need more training images. The second problem is that the computational complexity of training the network can quickly grow with its size. The authors describe how to design "Inception" modules that can be repeated layer after layer to create a CNN that mitigates the above issues.

    The resolution, they propose, is to move from a fully connected to a sparsely connected architecture, even within the convolutions. The authors claim that this better mimics biological systems, apart from improving performance while mitigating the above issues.

    Step 1 : Cluster neurons whose outputs are highly correlated
    When constructing a particular layer, the authors recommend that we cluster neurons from the previous layer whose outputs are highly correlated and wire them together in the current layer. This is better than having uniformly dense convolutions.

    Step 2 : Apply dimension reductions
    In the convolutions from step 1, we identify those that may be too expensive. Let's say we have 3x3 and 5x5 convolutions; they can be computationally expensive. The authors recommend that we add 1x1 convolutions between the previous layer and these dense convolutions. This is based on the idea that even low-dimensional embeddings might contain a lot of information about a relatively large image patch.

    Applying steps 1 and 2 would give us a single "Inception module". We can repeat this process layer-after-layer to produce a deep CNN that does not suffer as much from the above two issues as a regular deep CNN but performs similarly.
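    The saving from step 2 is easy to work out by counting weights. The channel numbers below are made up for illustration (not taken from the GoogLeNet paper):

```python
# Weights in a 5x5 convolution over a 256-channel input producing 128 output
# channels, with and without a 1x1 "bottleneck" down to 64 channels first.

def conv_params(k, c_in, c_out):
    """Number of weights in a k x k convolution from c_in to c_out channels (no bias)."""
    return k * k * c_in * c_out

direct = conv_params(5, 256, 128)                            # 5x5 straight on the input
reduced = conv_params(1, 256, 64) + conv_params(5, 64, 128)  # 1x1 down to 64, then 5x5

print(direct, reduced)  # the bottleneck version uses roughly a quarter of the weights
```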

  4. 1.) The DPM paper is essentially a smart and intuitive extension of existing object detection methods like Dalal & Triggs. The intuition lies in the fact that objects are deformable, and in order to detect them we can come up with a hierarchical structure of different parts of the object. By computing a score for each part and a cost for the deformation, we can effectively estimate the confidence of the object at a given location. The authors expect some prior knowledge of concepts like the latent SVM; it would be nice if the presenters explained the semi-convexity of the latent SVM and the reasoning behind it.

    2.) An interesting part of the paper Scalable, High-Quality Object Detection is the processing of 50 images per second on a CPU. The loss mentioned in the paper is similar to the one in DPM. There are some interesting aspects of the paper, like training with missing positive labels; I think it would be an important takeaway if the presenters explained it in detail. The paper mentions calculating the prior of each bounding box, which could have been discussed in more detail.
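    The part-score-plus-deformation-cost idea from 1.) can be sketched in a few lines. This is a toy sketch of a single hypothesis score, not the paper's implementation (which maximizes over part placements with distance transforms); names and weights here are illustrative.

```python
def dpm_score(root_score, part_scores, displacements, deform_costs, bias=0.0):
    """Toy DPM hypothesis score: root filter response, plus part filter
    responses, minus a quadratic deformation penalty for each part's
    displacement (dx, dy) from its anchor."""
    score = root_score + bias
    for p_score, (dx, dy), d in zip(part_scores, displacements, deform_costs):
        # d = (d1, d2, d3, d4): weights on (dx, dy, dx^2, dy^2), as in the paper
        penalty = d[0] * dx + d[1] * dy + d[2] * dx**2 + d[3] * dy**2
        score += p_score - penalty
    return score
```

    For example, one part displaced by one cell with small deformation weights only slightly reduces the combined score: `dpm_score(1.0, [0.5], [(1, 0)], [(0.1, 0.1, 0.05, 0.05)])`.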

  5. Regarding the R-CNN paper (Rich feature hierarchies for accurate object detection and semantic segmentation):
    This is a well-written paper in which the authors clearly state the reasons for their design choices and draw inspiration from previous work and from theories of how recognition works in the brain. However, here are my concerns that may require further clarification. Firstly, the authors compare their 'simple' approach to complex methods of the past (DPM). But when you first pre-train on ILSVRC, use region proposals, and fine-tune your network, then follow that up with an SVM, it seems that simpler methods could exist. For example: why not just directly perform multivariate regression on the whole image (basically allowing the network to find the bounding box without spoon-feeding it region proposals)? With the eclectic style of the R-CNN, the onus of success is on all the individual modules. Region proposals, I believe, have biases towards certain objects, which means the overall algorithm would also be biased towards certain objects.

    Another issue of concern is the hand-wavy way in which warping region proposals to a 224x224 image is justified. Although in Appendix A the authors mention other methods of forcing their region proposals into their deep network, I'm afraid that this does not address the issue of detecting elongated objects that aren't as discriminative as a human body.

    An important question, which should arise if you have read the DPM paper, is whether this method works with deformable objects shown in different poses. One may argue that the deep network understands this and then gives the 4096-dimensional feature description of the image patch. But with a trained SVM, which doesn't explicitly understand the multimodal behaviour of object categories, I do not think this would generalize well to objects that DPM does a good job on. The biggest strength of the R-CNN, in my opinion, is great features; what it lacks, probably, is an unconstrained view of object categories. Would it be interesting to use R-CNN to detect parts and roots and then build a DPM over it, instead of a simple SVM?

    1. While the two methods might look very different, you can actually write the DPM architecture as a CNN architecture (something like 4 layers: edges + pooling + parts + roots).

      That being said, even though the SVM in R-CNN might seem naive, it is actually just the top layer in a very flexible model. For example, the 4096 features are indeed very invariant to different appearances and deformations of an object's "parts".

      So they are very much the same, except that DPM has a more constrained architecture (fewer part filters, no sharing, fixed features, etc.).

    2. Regarding R-CNN: you are correct in that it indeed feels bizarre that you have to run the network many, many times on the same image, and there should be a better way of "sharing" computation (which is a useful feature of CNNs). However, there is a long history of "coarse-to-fine" approaches to detection, simply because if you don't use coarse-to-fine you have so many possibilities that it is hard to keep the false positive rate down.

  6. I have a quick question regarding the DPM paper.

    When creating the feature pyramid, the authors set \lambda, the number of levels in an octave, to 5 and 10 in training and testing respectively. In Figure 3, the authors show an example where the part filters are taken from the original image and the root filter is taken at half the resolution, with \lambda = 2. However, I don't understand why we need to set \lambda to different values. Take Fig. 3 for example: even if \lambda = 5, the relative resolution of the root filter and part filters remains identical. So what's the use of this parameter? Any thoughts?

    1. My guess for why they used more levels in testing is that they simply don't want to miss a positive detection at test time. So they have to be more exhaustive.
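    One way to see the role of \lambda is to write out the pyramid's scale factors: moving down \lambda levels always halves the resolution, so the root/part relationship is fixed, and \lambda only controls how finely scale space is sampled. A small sketch (values illustrative):

```python
# Scale factors of a feature pyramid with `lam` levels per octave: each level
# shrinks the image by a factor of 2 ** (-1 / lam).

def pyramid_scales(n_levels, lam):
    return [2 ** (-i / lam) for i in range(n_levels)]

lam = 5
scales = pyramid_scales(11, lam)
# Level i + lam is always exactly half the resolution of level i, so the part
# filters at level i pair with the root filter at level i + lam for any lam.
assert abs(scales[lam] - 0.5) < 1e-12
```

    So a larger \lambda at test time means more intermediate scales are searched (fewer missed detections, as suggested above), at the cost of more filter evaluations.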

  7. In section 5 of the R-CNN paper, the "fg" strategy computes CNN features only on a foreground mask by replacing the other background pixels with the mean input so that they are zero after mean subtraction. What effect does this have on the final features? With most inputs zeroed out, I can imagine that the final feature vector magnitude could be significantly smaller with small objects. In addition, it would likely not be invariant to translation or pose changes, but this could potentially be beneficial for segmentation tasks (e.g. certain classes tend to be smaller while others tend to be positioned in certain regions of the image.) It would be interesting to see experiments that explore the effects of this particular approach to computing CNN features over arbitrary, non-rectangular regions. (Is there another way...?)

    Furthermore, I wonder if there is a way to adapt deep learning for detection without relying on bounding box proposals. The fully convolutional networks from last time address this for segmentation, but they are unable to separate multiple instances of the same object or to restrict regions to potentially overlapping bounding boxes.

    1. With regards to your question about replacing background pixels with the mean input, while it is true that the final feature vector's magnitude will still vary with the size of the object, I still believe that the net effect is beneficial in many cases. More specifically, in the case of object detection, the magnitude of all similar-sized objects will become more consistent regardless of the background. Since the background of the object of interest can vary significantly, and in certain cases, independently of the object of interest, having background information can end up hurting a CNN's ability to detect the object in these cases. Even if the object of interest varies in scale and location, the background information will serve as clutter and only affect the magnitude of the feature vector in a non-meaningful way. Of course, this depends on the specific application at hand. In many cases, retaining the background information can actually help. I wonder if there is a way to automatically determine this...
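    For reference, the "fg" masking step being discussed is simple to write down. A minimal numpy sketch (image size, mask, and mean pixel are all made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy image patch (H, W, 3) and a boolean foreground mask.
img = rng.random((8, 8, 3)).astype(np.float32)
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True

mean_pixel = np.array([0.48, 0.46, 0.41], dtype=np.float32)  # made-up dataset mean

# "fg" strategy: replace background pixels with the mean input...
masked = np.where(mask[..., None], img, mean_pixel)
# ...so that after mean subtraction (the CNN's preprocessing) background is zero.
centered = masked - mean_pixel
```

    After this, only foreground pixels contribute non-zero activations in the first layer, which is the source of both the background invariance and the magnitude effects discussed above.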

  8. In the DPM paper:

    - The features used in the paper include the latent variables, i.e. where the parts are relative to the root. Given the parameters \beta, which include the anchor locations of the parts (v_i) in the learned model and the costs of displacing the parts (d_i) from their anchor positions, these variables can be estimated using the generalized distance transform.

    - But the question is, even given the box containing the object in the training data, how can we find those latent variables and learn the SVM parameters? The solution is the iterative optimization given in this paper: the parameters and latent variables are optimized iteratively, fixing one and optimizing the other.

    - Data-mining hard examples: This method is used to deal with the issue that the number of negative samples is much larger than the number of positive samples. Instead of training once using all the negative samples, we solve a sequence of training problems using a small number of hard examples. This is similar to boosting methods, e.g. AdaBoost, where samples classified incorrectly in the sequence of training weak learners are weighted more than those classified correctly.

    - I have one question regarding the training of DPM: how is \beta initialized in the training session? That is, we don't know the number and patterns of the filters, and we also don't know the cost of part displacement from the anchor v. In turn, we also don't know the latent variables (dx, dy), the displacements of the parts from their positions in the model, since we don't know the part displacement anchors v.
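    The hard-negative mining loop described above can be sketched with a toy linear scorer. This is only a sketch of the alternation (train, then mine negatives the current model misclassifies); the real DPM wraps latent SVM training, and the least-squares "trainer", data, and thresholds here are all stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a few positives and a large pool of negatives (too many to use at once).
dim = 16
pos = rng.normal(1.0, 1.0, size=(50, dim))
neg_pool = rng.normal(-1.0, 1.0, size=(5000, dim))

def train(X, y):
    """Stand-in for SVM training: least-squares linear classifier with a bias."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def score(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

cache = neg_pool[:100]  # start from a small cache of negatives
for _ in range(3):
    X = np.vstack([pos, cache])
    y = np.hstack([np.ones(len(pos)), -np.ones(len(cache))])
    w = train(X, y)
    # Mine hard negatives: negatives the current model scores inside the margin.
    hard = neg_pool[score(w, neg_pool) > -1]
    cache = np.unique(np.vstack([cache, hard]), axis=0)
```

    Each round only adds negatives the current model gets wrong, which keeps the training set small, much like boosting's reweighting of misclassified samples.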