## Tuesday, February 24, 2015

P. Arbelaez, M. Maire, C. Fowlkes and J. Malik. Contour detection and hierarchical image segmentation, TPAMI 33, no. 5 (2011): 898-916.

J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation, CVPR, 2010.

J. Long, E. Shelhamer and T. Darrell. Fully convolutional networks for semantic segmentation, arXiv preprint arXiv:1411.4038 (2014).

1. 1. Contour Detection and Hierarchical Image Segmentation
- For contour detection, the authors developed gPb contour detector, of which the globalized probability of boundary is formed as a weighted sum of local, Pb or mPb, and spectral signals sPb.
- For segmentation, using the non-max suppressed gPb contours, firstly they perform Oriented Watershed Transform (OWT) to construct a hierarchical segmentation, and then with an interpretation that the boundary strength as an estimate of the probability being a true contour, they create Ultrametric Contour Map (UCM), the real-valued image obtained by weighting each boundary by its scale of disappearance.
- The Pb contour detector basically computes the oriented gradient signal by placing a circular disc at location (x,y) and taking Chi-squared distance between two half-disc histograms. And with linear combination of different scales of gradients they create one single multi-scale oriented signal.
- The results of gPb-owt-ucm shows the best performance in both boundary and region quality on every detest they tested.
- They argue that a limited range of local cues covers a variety of complex scenes, but I still wonder if it is sufficient to show that it works on a wide range of scales for object detection.

2. Constrained Parametric Min-Cuts for Automatic Object Segmentation
- In short, they generate multiple segments using constrained parametric min-cuts (CPMC), reject many redundant segments and finally use machine learning tools to rank them by telling the importance of the features based on graph, region, Gestalt properties.
- The approach of having multiple segmentations are exploited in many works, but their final working set of segments are mostly accurate.
- They compute the figure-ground segmentations as an optimization problem in which the energy over the pixels is to be minimized. In this energy function, they measure the similarity between adjacent pixels by taking exponential max of gPb output. I was not so clear about the intuition and the set-up behind this penalty term as well as how this can perform that useful as a single-max problem.
- They performed a fast rejection step with some fixed settings of minimum pixels in the segments, the total number of segments to remain, percentage overlap to tell the similarity. It would be greater if they showed a numerical result to compare the betweens.
- Besides the proximity, similarity and good continuation as a grouping characteristics, 26 features are used and learned by the random forests regressor. The result ranks the minor axis of the ellipse having the same normalized second central moments as the segment. It seems that inter-region similarity has more feature importance, which is actually expectable.
- It is notable that the quality of the segments in VOC2009 actually shows better performance that CPMC is more accurate that gPb-owt-ucm.

2. 3. Fully Convolutional Networks for Semantic Segmentation
- In short, their key approach is in the following: They performed a fully convolutional network (FCN) training for pixel-wise prediction by extending the deep classification nets to fully convolutional layers with pre-training, and they train for segmentation by fine-tuning.
- At first, they regard the fully connected layers as convolutions with kernels that cover the entire input regions, which is not particularly new, but a very good interpretation. This enables the dense prediction of any input size and output classification maps. They perform subsampling of the output at the end.
- They also found out that dense prediction can be performed by up sampling, or same as the backwards convolution, which can be learned.
- They selected AlexNet, VGG16, and GoogLeNet, and replace the fully connected layers at classifier layer with convolutional layers.
- They also combine the final prediction layer with lower layers with finer strides in order to compensate the limitations in the upsampled output. But, I did not understand quite well why finer scale predictions should need fewer layers.
- The result of finer stride nets shows a qualitatively better segmentations. Although it reaches to the ground-truth as the stride gets smaller, fast and effectively, I still doubt the reliability of up sampling and I don't quite understand why this should work better. They also perform "skip" architecture, and lead to the focus that it combines deep, coarse, semantic information and shallow, fine appearance information, which presumably means where and what, but I am just not sure why this has to work.

3. I think Namhoon has summarized the papers in great detail.

1. Contour Detection and Hierarchical Image Segmentation
- In the chi-square distance calculation between the two half-discs they apply a second- order Savitzky-Golay filtering to reduce multiple detection peaks. I am not sure why this is required?
- How $D_{ii}=\sum_{j}W_{ij}$ is encoding global information?
2. Constrained Parametric Min-Cuts for automatic Object Segmentation
- CPMC already used gPb in their pairwise energy term so is the comparison with gPb-owt-ucm fair?
- One important limitation of CPMC is that it could only be used to segment foreground objects. We can’t get the regions/segmentation of whole image.

1. I think both semantic contour detction and segmentation are difficult. So, I agree that we need some sort of algorithms to remove meaningless peaks otherwise, they may be too bad so the later segmentation. However, I am not sure why we need the Savitzky-Golay filter. It seems to me that a simple non-max suppression based on local response is sufficient.

4. Regarding Ankit's question above 1 (ii):
As far as I understand, W_{ij} is a measure of how 'connected' the pixels i and j are (high value if the max strength of gradient over the line connecting the 2 points is low). D_ii is an element of a diagonal matrix that is a measure of the sum of such 'connected-ness' of a pixel to every other pixel in the image. Hence, it sort of encodes the global information by connecting every pixel to every other pixel of the image. It is actually a standard technique in spectral clustering (ref: http://en.wikipedia.org/wiki/Spectral_clustering)

5. In reply to Ankit's question, I think the matrix D represents a normalization of the cluster affinities in the image. Similar clusters are present in the same region, while different clusters are not. But, what if similar clusters are present in different parts of the image? The effect of the following equation (on page 7, (D-W)v = lambda*D*v) could be to balance/normalize the affinities of all the regions in the image.

Savitzky-Golay tries to look at a patch centered around a pixel, fit a polynomial to that patch, and compute the "filtered" value for that pixel based on the (x,y) coordinate of that pixel. In this case, the (x,y) = (0,0) since the patch is centered around some pixel. All one needs to do then is to simply evaluate the polynomial equation (given the coefficients) at (0,0). An excellent explanation is given here: http://research.microsoft.com/en-us/um/people/jckrumm/savgol/savgol.htm

6. FYI, Savitzky-Golay can be used for regression analysis; it is useful to fit a lower-order polynomial to noisy data. In my opinion, it is sort of an intermediate point between linear regression, and fanicer kernel-based regression approaches like RBF's.

7. I found "Fully convolutional networks for semantic segmentation" by Long et al to be a very innovative adaptation of the AlexNet (and few other classification nets). Since we discussed AlexNet in the previous class, I have attempted to briefly capture the essence of this work as an "evolution" story of AlexNet in a slightly different angle than Namhoon.

The biggest challenge with respect to adapting image classification nets for spatially dense prediction task is that the dimensions of the output are no longer fixed. We would like to have a per-pixel output label in this case. The majority of the paper talks about modifications made to achieve this. The authors describe the process of mapping the coarse output of convolutional layers to dense pixels as deconvolution. This to be a very interesting way to visualize the process. I find it to be similar to how we tried to "deconvolve" AlexNet in the previous class by trying to map the behavior of neurons back to the pixels that triggered them. How this is achieved is well summarized by Namhoon.

In the previous class, we used the concept of "switches" to be able to reverse the process of max-pooling in AlexNet when backtracking to the pixels that fired specific neurons. It was because it would otherwise be impossible to recover the information lost when pooling. However, the FCN described in this paper seems to be able to recover this lost information through upsampling and "shift-and-stitch". Since a lot of information is already lost in the lower layers, it is not very clear how this is done or how this is even possible.

8. The authors of the "Fully Convolutional Networks for Semantic Segmentation" paper have done a good job describing their novel convolutional neural network that is able to make dense predictions for per-pixel tasks such as semantic segmentation. The authors report excellent performance, achieving state-of-the-art segmenetaiton results on three datasets. One important thing to note is that the algorithm utilizes a supervised pre-training step and a fine-tuning step. There are various issues with ground-truth training data, especailly when the training data consists of segmentation masks, which are not only very laborous to obtain but can be very subjective. Given these Issues, I wonder if we can replace the supervised pre-training step with an unsupervised alternative. For example, the authors of the following paper describe "Sparse Laplacian filter learning" to obtain CNN filters with large amounts of labeled data for vehicle type classification: http://iitlab.bit.edu.cn/mcislab/~wuyuwei/paper_PDF/TITS%202014_Vehicle%20Type%20Classification%20Using%20Semi-Supervised%20Convolutional%20Neural%20Network.pdf

9. In the paper on "Constrained parametric min-cuts for automatic object segmentation", it was interesting to see the authors borrowing ideas from Gestalt theorists. Their statement that proximity, similarity and good continuation are key to visual
grouping is intuitive, and in fact is helpful in this segmentation approach.

As Ankit pointed out, this algorithm is limited to segmenting foreground object(s). If there are disconnected foreground objects, they would ideally be represented in differently ranked hypothesis. Looking into more than just the top ranked hypothesis makes sense for most real world images that have multiple foreground objects.

10. I read the “Contour Detection and Hierarchical Image Segmentation” by Arbelaez et al and have two minor questions, both about segmentation. (Contour detection is surprisingly straightforward!)

(1)In order to correct the artifact caused by the original Watershed Transform, the authors first approximated the arcs with line segments and then showed that their approximation methods are pretty promising (in Fig. 13). Although this seems to be not that important, personally I’m curious about how the authors initialize those end points (by random or not?) and whether the initialization matters or not.

(2) In the Ultrametric Contour Map (UCM) section, the author proposed a metric to measure the distance between two regions (eq. 15 & eq. 16). Although I fully understand the concept of this metric, it seems that the author did not make use of this property in their approach, did they? Please correct me if I missed something…

11. One of the key contributions of the paper was introducing a Multiscale cue combination but the paper doesn't mention in detail how they learned their weights for Multiscale cue combination. The paper also mentions that simple linear combination of spectral and multiscale cue is sufficient but fails to explain the intuition behind or empirical evidence.

For segmentation the paper uses a greedy approach, would there be an improvement if we use min-cut partition on the graph?

This paper is an evidence of how using better signal and image processing techniques and using various mathematical formulations can improve the performance but this can be achieved only after you have read sufficient literature regarding not just the problem but also many other fields of signal processing. I would like to see how CNNs were designed to achieve better performance. Now this leaves an important question for us to answer. Do we need to do research in the traditional way to solve a problem or go the "Deep Learning" way?

12. This comment has been removed by the author.

13. Contour Detection and Hierarchical Image Segmentation:
- Histogram-based gradient captures more appearance information than the
conventional Gaussian gradient, since it is free from the averaging process. The
choice of the histogram-based gradient makes sense here given the image is of
low noise level. Flters such as Savitzky-Golay can be used to deal with the
issues introduced by noise. Here we have one more options in terms of gradient
in our toolbox.

- The training process of parameters in eq.(10) and eq.(14) is interesting, since
the objective function being optimized is the F-measure itself rather than some
energy function with more physical-sense (eg. inter-variance or intra-variance
of pixels). So I think the goal achieved by using the F-measure as the objective
function is that the learned parameters in eq.(10) and eq.(14) maximized the
likelihood of the behavior of the system to human observer (assuming human
observer can achieve the best F-measure).

- UCM relates the detected boundaries to the regions. The equivalency is rooted in
the watershed algorithm used in the OWT process, since the watershed process
guarantees that the boundaries are closed. In this sense, new metric measuring
the distance between regions can be defined based on the intensities of the
boundaries - eq.(16). Generally this distance is not based on what is INSIDE the
region, but is based on what is ALONG the boundaries. The advantages of this
are at least two-folds: first, it greatly reduce the number of data points since
there's no need to consider the pixel/super-pixel inside the regions; second,
this makes it natural to build a hierarchical structure of the image simply
based on the threshold values of the boundaries.

- I'm wondering what eq.(16) implies intuitively. This inequality is not used
after it is defined..

14. Great summaries presented here.

I am also confused about the use of the properties in eq.15, 16, because comparing the distance between two regions R1, R2 doesn't affect the algorithm for generating the contour map, it seems to be a result of the algorithm rather than a driver for the algorithm's correctness.

15. Regarding the paper on "Contour detection and Hierarchical Image Segmentation", I find the literature survey and comparison quite detailed. It might be interesting to have like a brief summary of this in class.

I have a couple of comments.
1. From their results it seems that the main power of their segmentation is from the superior contour detection. Along the lines of what Ganesh pointed, it could actually be possible to get better results by using a graph cut algorithm.

2. How important is the Texton map. Im not sure if I saw any results where they ignore the Textons. This seems like an arbitrary addition without explanation. Would it be interesting to use arbitrary 17 filters?

16. While the paper detailed pretty much everything very well, something that I thought was lacking in the paper "Contour detection and Hierarchical Image Segmentation" was data to support the claim of "fast" in section 4.4.4. It does not seem that there is an examination of this; maybe it is left to the reader to extract from the papers describing all the stages of the pipeline, or did I miss something?

Also, great summaries guys.

17. The paper "Contour detection and Hierarchical Image Segmentation" introduce a very robust contour detector as well as a image segmentation methods. I like the following aspects of the paper.

1. Use texon as another image channel. Unlike the other three channels, texon is a low level representation of the image and carries more robust information. From Fig.6, we can see texon provides a higher (maybe noisier) heat map which enhance the gradients clue.

2. The usage of graph laplacian to globalize local gradient response.

3. The UCM method builds a hierarchical representation of segmentations.

But still, the paper relies on gradient based method for contour extraction, which cannot circumvent the great influence of illumination. I think future work could be done by proposing some learnt features to represent more robust and semantic meaning.

18. Several thoughts on the paper "Contour detection and Hierarchical Image Segmentation":
1. One of the advantages of this paper is that this gPb-UCM algorithm combines local and global contour signals into a hierarchical region tree. This hierarchical structure allows segmentations on multiple scales and therefore further allows top-down knowledge infusion.
2. However, this algorithm involves many intentional and complicated designs, which may require expensive computations (such as learning alpha in mPb). I’d like to see whether these goals can be achieved in an automatic way, such as using deep network in “Fully Convolutional Networks for Semantic Segmentation”.
3. The success of mPb may due to that the learning process directly uses F-measure as objective function.

19. This comment has been removed by the author.