Sunday, March 15, 2015

Reading for Monday 3/16

M. Aubry, D. Maturana, A. Efros, B. Russell, J. Sivic. Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models, CVPR, 2014.

And additionally:

J. Lim, A. Khosla and A. Torralba. FPM: Fine pose Parts-based Model with 3D CAD models, ECCV, 2014.

A. Dosovitskiy, J. Springenberg, T. Brox. Learning to Generate Chairs with Convolutional Neural Networks, arXiv preprint arXiv:1411.5928 (2014).


  1. Paper 2: FPM: Fine pose Parts-based Model with 3D CAD models
    Things I like about the use of CAD models:
    (1) Rendering gives effectively unlimited training data
    (2) Part locations are no longer latent
    (3) Weighting the contribution of each part --> the biggest contribution, to me

    Things I don't like:
    (1) We know from DPM that, given high-resolution images, the part filters are not needed. Yet this paper uses synthetic images, whose resolution they can increase however they want, and still uses part filters.
    (2) Why is HOG still used, given the availability of deep-learning features?

    1. Two other comments:
      (1) Since they used CAD models, why not do edge detection on the test image, detect salient locations along the edges, and then try to fit the 3D model to either the edges or the salient locations using some deformable fitting criterion such as as-rigid-as-possible?
      (2) They argue that the synthetic images are different from real images. Yet, what happens if they do an intrinsic image decomposition? The shading image can be synthesized exactly given the 3D model. Can we use that to help the detection?

  2. I am wondering about the performance differences between paper 1 and paper 2. It appears that the biggest contribution of these two papers is the notion of part distinctiveness. Paper 1 takes the parts that are farthest from the mean of the whitened data. Paper 2 marginalizes over all views to compute statistics of self-occlusion and, hence, the parts' weighting. To me, the two approaches seem very similar.
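
    If it helps make the comparison concrete, here is a tiny numpy sketch of the whitened-distance idea from paper 1 (all data and sizes are made up, and this is only my reading of the selection criterion, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32                                  # toy descriptor length (real HOG is larger)
patches = rng.normal(size=(500, d))     # stand-ins for mid-level HOG patches

# Whiten the patches with their mean and (regularized) covariance.
mu = patches.mean(axis=0)
cov = np.cov(patches, rowvar=False) + 0.01 * np.eye(d)
L = np.linalg.cholesky(cov)
white = np.linalg.solve(L, (patches - mu).T).T   # now roughly zero-mean, unit-cov

# Distinctiveness proxy, as I read paper 1: distance from the mean in
# whitened space; keep the top-k patches as the "discriminative" ones.
dist = np.linalg.norm(white, axis=1)
top_k = np.argsort(dist)[::-1][:50]
print(dist[top_k[0]] >= dist[top_k[-1]])   # True: sorted by distinctiveness
```

    Paper 2's occlusion-statistics weighting would replace the `dist` criterion with a score marginalized over rendered views, which is why the two feel so similar to me.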

  3. I found the algorithm in paper 1 simple and easy to understand. To sum up, they generate multiple viewpoints of a 3D model and compute mid-level patches over this synthetically generated dataset. The mid-level patches (HOG templates) are then calibrated against a negative dataset and used for detection. Hence, for each detection they get an estimate of the fine pose (orientation) along with the bounding box. (I couldn't completely understand the calibration part though; I would appreciate it if the speakers could talk a bit about that.)

    I think the biggest strength of this paper is that they are able to do away with real training data (and still show good enough performance on PASCAL). As we know, labeled object training data is sparse, and this is one of the main problems in training end-to-end deep networks for object detection. I guess generating training data from such 3D models for training deep networks could be an interesting problem to pursue.
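
    As a heavily simplified illustration of how a detector can be built from a single rendered patch plus generic negative statistics, with no real per-class training data (toy numpy sketch, my own construction rather than the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32                      # toy HOG descriptor length (the real one is much larger)

# "Universal" negative statistics, estimated once from many background patches.
neg = rng.normal(size=(10000, d))
mu0 = neg.mean(axis=0)
cov = np.cov(neg, rowvar=False) + 0.01 * np.eye(d)  # regularized covariance

def lda_detector(patch):
    """LDA-style template for one positive patch: w = Sigma^-1 (x - mu0)."""
    return np.linalg.solve(cov, patch - mu0)

# A synthetic "rendered chair patch" and a detector built from it alone.
chair_patch = mu0 + rng.normal(size=d) * 2.0
w = lda_detector(chair_patch)

def score(window):
    return w @ (window - mu0)

# The rendered patch should score higher than random background windows.
bg_scores = [score(rng.normal(size=d)) for _ in range(100)]
print(score(chair_patch) > max(bg_scores))
```

    The point is that the only class-specific input is the rendered patch itself; everything else comes from background statistics, which is what makes the "no training data" setup possible.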

  5. I read the first paper, which attempts to solve the 2D-3D alignment problem for chairs. It's an interesting read, and the authors take some novel approaches to this age-old problem. The paper starts with the history of object alignment and then talks about the re-emerging trend toward geometric methods for object recognition, something that Professor Gupta had also mentioned in the introductory lectures.

    The authors look at the alignment problem from a data-centric angle. This resonates with the theme of the class that more data leads to better prediction. One key difference is that the authors synthesize the data for this paper. They take 1300 3D models of chairs and render 62 viewpoints of each chair on a white background. This allows them to model the problem as a style-plus-pose recognition problem in 2D.

    What I also like are their neat little tricks, like using LDA instead of an SVM because they have about a million negative examples and an iterative algorithm would be slow. Also, the idea of using only the most discriminative patches is intuitive and resembles feature selection techniques.

    I don't understand the calibration part either. It seems to be used to assign a weight to each visual element to tune its contribution to the recognition task, but I may be totally wrong. The language is a bit confusing! :)

  6. I read paper 3, which trains a convolutional neural network to generate 2D projections of chairs using high-level information such as orientation, color, and viewpoint as input. The network is trained on 3D chair models and learns the manifold of these models, so that it can also interpolate between different models and generate 2D images of chairs it hasn't seen before. The key technical insight of this paper is the use of "up-convolutions" to go from a low-resolution representation of neuron outputs in lower layers to a clean high-resolution 2D projection of a chair. Further, the authors extensively analyze how the network works using single- and multi-neuron outputs at different layers. This analysis is instructive in helping one understand how effectively deconvolution works.
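
    A minimal sketch of the upsample-then-filter idea behind such "up-convolutions" (my own toy single-channel code, not the paper's architecture; the real network uses learned filters and many channels):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling, the 'unpooling' half of an up-convolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def filter2d(x, k):
    """Naive 'same' 2D filtering (cross-correlation style) for illustration."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2,) * 2, (kw // 2,) * 2))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + kh, j:j + kw] * k).sum()
    return out

# Start from a tiny 4x4 "feature map" and grow it to 16x16 in two stages,
# smoothing with a small averaging kernel after each upsampling.
feat = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3)) / 9.0          # stand-in for a learned filter
img = filter2d(upsample2x(filter2d(upsample2x(feat), kernel)), kernel)
print(img.shape)  # (16, 16)
```

    Stacking several such stages is how a compact high-level code (style, pose, color) can be expanded into a full-resolution chair image.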

  7. For paper 1. The authors match CAD models of chairs to 2D images. They use CAD models to render different views of the targets and train a discriminative part-based classifier to find the correspondence.

    Several comments

    1. I doubt the speed of this system. Part-based models are known to be slow, and they need to exhaustively render every view of the chairs, which makes the system very inefficient. Since they only experimented on chairs, I also doubt whether this method can generalize to arbitrary objects as dataset sizes keep expanding.

    2. Although they use a coarse-to-fine method, the geometry of the object is not well exploited. I think further improvements could be made by using geometric hints, such as pose estimation, to provide proper initializations.

    3. I don't quite understand Sec.3.2. Why do they need to calibrate visual element detectors? What does it mean?

    1. Re: comment 3. Since they trained all of the classifiers independently, I believe the magnitudes of the scores are not directly comparable without some procedure to relate them. (One classifier having a higher score than another might not be meaningful before calibration.) The same is true of SVMs with unnormalized margins.
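
    To make that concrete, here is a toy version of one possible calibration scheme: standardizing each detector's scores on a shared negative set. (The paper's actual procedure may differ; the detectors and numbers here are invented.)

```python
import numpy as np

rng = np.random.default_rng(2)

# Two independently trained "detectors": their raw scores live on different scales.
def det_a(x): return 5.0 * x + 3.0      # toy scoring functions
def det_b(x): return 0.5 * x - 1.0

negatives = rng.normal(size=1000)       # shared negative set

def calibrate(det, negs):
    """Affine calibration: rescale so scores on the shared negative set
    are zero-mean and unit-variance."""
    s = det(negs)
    mu, sd = s.mean(), s.std()
    return lambda x: (det(x) - mu) / sd

cal_a, cal_b = calibrate(det_a, negatives), calibrate(det_b, negatives)

# After calibration the two detectors give comparable scores on the same input.
x = 2.0
print(abs(cal_a(x) - cal_b(x)) < 0.2)   # True, though raw scores differed by ~9
```

    After a step like this, "detector A fired with a higher score than detector B" actually carries meaning, which is presumably the point of Sec. 3.2.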

  8. Automatically generating chairs is certainly an interesting application of convolutional neural networks. I did notice that the authors use rectified linear units (ReLU) as the activation function. Recently, parameterized rectified linear units (PReLU), which automatically learn the negative slopes, were introduced into the literature, as per our recent homework. We know that PReLUs can outperform ReLUs in terms of classification performance; it would be interesting to see how PReLUs affect the visual reconstruction of chairs.
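
    For reference, the difference between the two is tiny in code (a sketch; in a real network the slope `a` is a learned parameter, typically one per channel):

```python
import numpy as np

def relu(x):
    """Standard rectifier: zero out all negative activations."""
    return np.maximum(0.0, x)

def prelu(x, a):
    """Parametric ReLU: negative inputs are scaled by a learned slope `a`.
    a = 0 recovers ReLU; a = 1 is the identity."""
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))          # negatives clipped to 0, positives unchanged
print(prelu(x, 0.25))   # negatives scaled by 0.25 instead of clipped
```

    Since ReLU discards all negative activations while PReLU lets some of that signal through, one could imagine it mattering for reconstruction detail, which is exactly the open question above.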

  9. - These papers (the 1st and 2nd) address the detection problem by leveraging 3D CAD-like models. The main advantage of using 3D CAD models is that we can generate an infinite number of arbitrary views, which allows fine-pose alignment that provides 3D pose information for the object. The major challenge in these approaches is that the rendered images usually have different characteristics from real images (texture, illumination, occlusion, and background clutter).

    - The two papers aim at a similar problem, but their goals are slightly different. While the FPM paper focuses on detection with exactly the same model, the "Seeing 3D chairs" paper tries to detect objects in a category similar to the 3D model.

    - Surprisingly, the "Seeing 3D chairs" paper does not compare its results with the FPM paper. I guess the main reason is that they focus more on category-level detection. However, their results show that their categories actually contain chairs of almost the same shape (only the color or proportions differ slightly). To be honest, I believe FPM would still work in these cases.

    - The evaluation in the "Seeing 3D chairs" paper is also somewhat weaker than FPM's, in that it mainly uses the usual bounding-box-based evaluation, compared to the 3D distance used in the FPM paper.

    - I personally prefer the FPM paper because it explicitly models the importance of views and parts. One major drawback of rendered 3D models is that they cannot handle occluded parts, which usually appear in real images. Additionally, as mentioned, some views of the 3D model might generate too many false positives; applying these techniques to a large number of object classes would only make this worse. Therefore, it is worth reducing the search space by focusing on the more frequent and important example views.

    - Since both papers consider only a fixed number of view changes, neither of them is actually fully fine alignment. Even if the output provides additional information compared to a conventional detector, I doubt its usefulness.

    - I think these approaches may not be scalable. It is obvious that the number of available 3D models is very limited. Also, I am curious how their algorithms perform under shape changes. If they require a 3D model very similar to the target, they are hardly practical.

    1. Most of all, in what application is this approach useful? For example, why not 3D-3D matching (a 3D CAD model against Kinect or stereo-camera data)? In robotics applications we actually need finely tuned alignment, so a more accurate method is needed; if there is no interaction with the target object, the coarse pose information that a DPM can produce would be sufficient. The output of these approaches is somewhere in between, and it is not clear why we need this technique in practice.

  10. Since Hanbyul and Minh have already covered many of the facts about all three papers, I would like to highlight a couple of interesting points that struck me in the first paper. That said, I agree with Hanbyul that no useful practical application of this solution comes to mind.

    The method employed by the authors is a very interesting blend of the old 3D and geometry based recognition methods we saw in the first lecture and the more recent part based and exemplar templates.

    The authors attribute the success of their method to their use of mid-level HOG descriptors, which have the side effect of capturing object boundaries without explicitly detecting edges. This is an interesting way to think about and use HOG descriptors in other applications as well, since reliably detecting object edges is a hard problem.

  11. The first two papers address similar problems, using training patches generated from 3D CAD models to train a model for 2D chair classification and fitting. 3D CAD models can generate large amounts of data easily by changing view angles. I also find it interesting to combine DPM with part importance.

    The third paper introduces a CNN approach to an interesting application: generating 2D chair images from 3D CAD data parameters. The overall architecture is easy to comprehend, but I am still uncertain how they settled on the specific layers.

  12. Both the first and the second paper focus on leveraging the power of 3D CAD models (i.e., the ability to generate infinite examples from infinite views) to do fine-pose alignment as well as to increase detection accuracy. As Hanbyul and Ming mentioned/questioned above, the two papers not only have similar goals but also similar contributions. However, I personally like the second paper more, since it uses statistics to compute part occlusion and then derive the corresponding weights, which I think is a more straightforward and reasonable way to explicitly model the importance of views and parts.

  13. The first paper takes a geometric approach to category-level recognition by extracting discriminative mid-level visual elements from a large dataset of rendered views of 3D chair models. This allows for leveraging modern discriminative techniques for transferring higher-level information such as 3D pose.

    This works very well for chairs because of the availability of a large number of 3D models that effectively span the space of intra-class variability. However, most classes do not have large databases of 3D models. Instead, it seems like a better approach would be to have a single 3D prototype that could be aligned with 2D images despite significant visual differences. I wonder if something like SIFT Flow could be used for an initial 2D alignment in order to reduce the need for many class instances.

  14. Using 3D CAD models to find a 2D alignment is clearly useful, and the first two papers use different algorithms to do it. The first paper takes a very data-driven approach. As others have mentioned, now that we have more 3D captures from SfM and Kinect data, is there less need for 3D-to-2D alignment? A data-driven approach using depth data might yield a much more useful result, given that we now have far more 3D data than ever before. Can all the little tricks in the contributions of the first two papers be adapted to a more data-driven 3D approach? As an extrapolation of the third paper, I wonder whether one could generate 3D chairs with a similar architecture; although a lot is lost in extrapolating to 3D, the interpolation between different models could be useful for generating noisy training data or a more bootstrapped method for 2D generation.