A. Shrivastava, S. Singh and A. Gupta. Constrained Semi-Supervised Learning Using Attributes and Comparative Attributes, ECCV, 2012.
And additionally:
R. Fergus, Y. Weiss and A. Torralba. Semi-supervised Learning in Gigantic Image Collections, NIPS, 2009.
S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev and R. Fergus. Training Convolutional Networks with Noisy Labels, arXiv preprint arXiv:1406.2080, 2014.
I am presenting with Ankit tomorrow, and I am covering the additional papers.
The gigantic SSL paper is the main one that I am presenting. It builds on an interesting machine learning idea: instead of sub-sampling massive datasets, just assume that the dataset is infinite.
The infinite-data assumption leads to the idea of working with functions instead of vectors. The paper then makes some (reasonable?) assumptions to attack that problem, such as assuming that the dimensions of the data are independent, and uses numerical approximations to the eigenvectors of the graph Laplacian.
The other paper I am presenting, the noisy-label paper (note that it is a different paper than the one on the website!), will be covered briefly. It elaborates on the idea of modeling label noise in CNNs and introduces an extra layer on top of the network, along with constraints, to model the label noise.
Following are my thoughts about the paper "Constrained Semi-Supervised Learning using Attributes and Comparative Attributes".
The paper proposes training classifiers for image classification using bootstrapping as a semi-supervised learning technique. Although bootstrapping is a common technique in the semi-supervised learning domain, the authors propose training the classifiers together instead of independently to alleviate semantic drift. To this end they propose additional constraints such as mutual exclusion, a binary attribute constraint, and a comparative attributes constraint. They also include a self-cleaning step to further improve performance.
Overall, training classifiers together instead of independently is an innovative idea, since humans tend to use comparative attributes a lot while reasoning about scenes.
If space were not a constraint, I would have liked to see more experiments that quantitatively analyze the impact of each constraint individually, as I am not sure all of the constraints improve performance equally. For example, I am not sure whether the binary attribute constraint would improve performance as much as the comparative attributes constraint.
Also, the self-cleaning phase seems rather counter-intuitive: it rejects labeled data that the classifiers classify with low confidence. I would have expected something more like reinforcement learning, which would correct the classifiers so that they classify the labeled image with higher confidence.
After introspection the removed images are added back to the unlabeled pool; it's not the case that they are completely thrown out of the experiments. So if a positive example gets rejected during introspection, it can be picked up again in subsequent iterations.
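To make the re-pooling concrete, here is a minimal sketch of such an introspection step, assuming a scikit-learn-style classifier; the function name and rejection threshold are hypothetical, not taken from the paper:

```python
def introspection_step(labeled_pool, unlabeled_pool, classifier, reject_thresh=0.3):
    """Hypothetical sketch: labeled images that the current classifier scores
    with low confidence are moved back to the unlabeled pool, so they remain
    available for re-selection in later iterations."""
    kept, rejected = [], []
    for image_id, features in labeled_pool:
        confidence = classifier.predict_proba([features])[0].max()
        (kept if confidence >= reject_thresh else rejected).append((image_id, features))
    # Rejected images are not discarded; they simply rejoin the unlabeled pool.
    return kept, unlabeled_pool + rejected
```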
I read the Constrained SSL using attributes and comparative attributes paper. It is an interesting approach to SSL, and I was particularly taken with the mutual exclusion (ME) and comparative attributes constraints. These seem very straightforward to implement in order to reduce semantic drift, and are particularly good ideas. The self-introspection constraint is not very convincing. As Venkatesh suggested, it seems counter-productive to kill positives that receive low scores during classification. Pooling high-confidence classifications or reinforcement learning could have been used instead.
I read the constrained SSL paper. Training classifiers together is a natural extension of training them separately. The way the paper does this is to add a couple of constraints such as shared attributes, dissimilarity, etc. Using separate classifiers for scenes and attributes, as well as a comparative pairwise classifier, the outputs are combined in a specific energy function. The idea is that the comparative pairwise classifier would help reject negative examples. The annotations used for the binary class attributes and comparative attributes are a fixed set.
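For intuition, here is a toy sketch of what combining those scores into a single energy might look like; the weights and data structures are placeholders of mine, not the paper's exact formulation:

```python
def energy(labels, scene_scores, attr_scores, comp_scores,
           w_scene=1.0, w_attr=1.0, w_comp=1.0):
    """Toy joint energy over a candidate label assignment.
    labels[i]                  -> class assigned to image i
    scene_scores[i][c]         -> scene classifier score for image i, class c (unary)
    attr_scores[i][c]          -> attribute-based score for image i, class c (unary)
    comp_scores[i][j][(ci,cj)] -> comparative-attribute score for the pair (pairwise)
    All weights and structures here are illustrative placeholders."""
    e = 0.0
    n = len(labels)
    for i in range(n):
        e -= w_scene * scene_scores[i][labels[i]]
        e -= w_attr * attr_scores[i][labels[i]]
    for i in range(n):
        for j in range(i + 1, n):
            e -= w_comp * comp_scores[i][j][(labels[i], labels[j])]
    return e  # lower energy = a more consistent joint labeling
```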
It is interesting that the 19 binary constraints do worse than the 10 binary constraints, and the paper states that this is because of saturation. I wonder whether varying the annotations would drastically affect these results, since the difference wasn't small. Another thing is that adding comparative constraints improves the 19 binary constraints by more than changing the number of binary constraints does. It seems that both are really needed. It would have been nice to see the result for the 10 binary constraints plus comparative constraints, or to learn the most "optimal" set of binary constraints.
Yes. I agree that they should have also shown the experiment with 10BA+CA.
Delete"The annotations used for the binary class attributes and comparative attributes are a fixed set" - This is the case only in the first experiment where they want to see if their approach even works. In the subsequent experiments they add the annotation of the selected images for a class to retrain the attribute classifiers.
I read the "Constrained SSL .." paper. The paper presents intuitive use of adaptive constraints for SSL of classifiers and a thorough variety of experiments to analyze their approach.
I didn't find the qualitative result explanation satisfactory - at one point they talk about how their classifier was adding bedrooms (Figure 4) to the class of auditoriums in later iterations of the first experiment. However, in the second experiment they state that the classifier was working really well up to 60 iterations, even though in Figure 6 there are a couple of bedrooms in the auditoriums category. I am curious whether there is a mechanism hidden in the approach that could help one learn which attributes cause such bugs, and how one could avoid them.
The paper “Training Convolutional Networks with Noisy Labels” by Sukhbaatar et al. describes a method of training CNNs that is more robust to “noisy” labels by using an extra noise layer. The method models two kinds of label noise: “label flip noise”, where the image is given the incorrect label of another class within the dataset, and “outlier noise”, where the image label does not belong to any of the classes. This sort of network can be very useful in real-world scenarios because obtaining labels that are 100% accurate is a laborious and expensive task. What I found surprising is just how robust this sort of network (and CNNs in general) is to this sort of noise. For example, in one of the experiments, the authors flipped half of the ground truth labels on a 1000-class ImageNet classification task, and performance using the authors' method was around 5.4% worse than the case where there was no noise (errors of 45.2% and 39.8% respectively). Without the noise layer, the error rises to 53.7%, which is significantly worse than 39.8%, but given that chance performance has an error of 99.9%, it shows that the network is still learning reasonably well in spite of the noise. What would be interesting to see is a more formal way to compare standard networks trained on perfectly labeled data without the noise model against networks trained on possibly more, but noisily labeled, data with the noise model. What is the “equivalent” number of perfectly labeled ground truth examples given some fixed number of noisily labeled examples?
I read the paper 'Constrained Semi-Supervised Learning Using Attributes and Comparative Attributes'. The paper uses relative attributes and some additional constraints to improve semi-supervised learning. All the constraints mentioned in the paper seem very intuitive. The binary attribute constraint and the comparative attributes constraint are not independent, i.e., if we are enforcing CA, wouldn't BA also be enforced? Are the attributes used in both CA and BA the same? I didn't understand the mathematical formulation; it would be nice if the presenter could explain the derivation of the energy function. In learning the classifier for CA, differential features x_i - x_j were used. I understand these as distances in feature space, but does using a better distance metric improve accuracy? Or what if we learn a classifier as in relative attributes?
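On the differential-feature point, my reading is that the comparative classifier is trained directly on feature differences; here is a hedged sketch, with logistic regression standing in for whatever classifier the paper actually uses:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_comparative_attribute(x_pairs, y):
    """x_pairs: list of (x_i, x_j) feature-vector pairs; y[k] = 1 if image i shows
    'more' of the attribute than image j in pair k, else 0.  The differential
    feature is simply x_i - x_j, so the learned weight vector acts like a
    linear direction along which the attribute increases."""
    X_diff = np.array([xi - xj for xi, xj in x_pairs])
    return LogisticRegression().fit(X_diff, y)

# A learned metric could replace the plain subtraction, e.g. projecting
# (x_i - x_j) through a learned matrix M, which is essentially the question
# about better distance metrics raised above.
```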
I read the 'SSL in Gigantic Image Collections' paper. The major contribution of the paper is to use efficient approximations that let SSL algorithms scale from polynomial to linear time in the number of classes and images. The main idea is, given infinite data, to find functions f which agree with the labeled data but are also smooth with respect to the graph. For this, they construct accurate numerical approximations to the eigenvectors of the normalized graph Laplacian and use these to propagate labels across the dataset. They claim that if the independence conditions between the components are not violated and the SSL solution is a linear combination, then their algorithm exactly recovers the solution as n tends to infinity. They show some convincing results; however, I was hoping for stronger quantitative comparisons.
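For a rough sense of the underlying machinery, here is a small dense-graph sketch of "fit a smooth function from the smallest eigenvectors of the normalized graph Laplacian to the labels". Note this is the O(n^2)-and-worse baseline, not the paper's linear-time eigenfunction approximation, and the hyperparameters are made up:

```python
import numpy as np

def laplacian_ssl(X, y, labeled_idx, k=10, sigma=1.0, lam=10.0):
    """Toy semi-supervised scoring: express f as a combination of the k
    smoothest eigenvectors of the normalized graph Laplacian, then fit the
    coefficients so f agrees with the labeled points (y in {-1, +1})."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))                     # dense affinity graph
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(1))
    L = np.eye(len(X)) - (d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :])
    vals, vecs = np.linalg.eigh(L)                         # ascending eigenvalues
    U, lam_k = vecs[:, :k], vals[:k]                       # k smoothest eigenvectors
    A = U[labeled_idx]
    # Least-squares fit to the labels, regularized by eigenvalue (smoothness).
    alpha = np.linalg.solve(A.T @ A + lam * np.diag(lam_k), A.T @ y[labeled_idx])
    return U @ alpha                                       # soft scores for all points
```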
Constrained Semi-Supervised Learning using Attributes and Comparative Attributes
Semi-Supervised Learning is similar to Unsupervised Learning in that it requires a large amount of data. The paper also uses the notion of joint attributes and attribute ranking.
It is interesting to see an experiment comparing the number of attributes against the resulting accuracy. I am wondering if there is a way to determine the best attributes, or the best number of attributes, for a specific scene.
I read the third paper, "Training Convolutional Networks with Noisy Labels". I really love this concept, since it aims to solve a basic problem of deep learning: requiring tons of labeled training data. Since the authors assume that "label flips" are independent of the input data \vec{x}, what they do is simply add a constrained linear layer on top of the original classification network to model the noise distribution, and then train end-to-end. Although the authors make many assumptions throughout the paper (some of which I think are not that reasonable, e.g. I don't think label flips should be independent of \vec{x}), the performance still seems pretty good. I think this work may be a good starting point for using noisy data to train a DNN!
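To make the idea concrete, here is a bare-bones sketch of such a noise layer in PyTorch; the parameterization (row-softmax instead of the paper's constrained linear weights) and the initialization are simplifications of mine, not the authors' exact layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseAdaptationLayer(nn.Module):
    """Linear 'noise' layer on top of the base classifier's softmax output.
    Row i of Q approximates P(observed noisy label = j | true label = i);
    rows are kept stochastic via a softmax parameterization."""
    def __init__(self, num_classes):
        super().__init__()
        # Start near the identity so training begins from the clean model.
        self.q_logits = nn.Parameter(5.0 * torch.eye(num_classes))

    def forward(self, clean_probs):
        Q = F.softmax(self.q_logits, dim=1)   # each row sums to 1
        return clean_probs @ Q                # predicted distribution over noisy labels

# Usage sketch: clean_probs = F.softmax(base_net(x), dim=1);
# loss = F.nll_loss(torch.log(noise_layer(clean_probs) + 1e-8), noisy_labels)
```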
ReplyDeleteI read the "Constrained Semi-Supervised Learning...." paper.
One thing that I did not understand was the attribute labeling. Are these attributes part of the ground truth? If so, does that mean that two sets of images (such as amphitheater and auditorium) would end up having overlapping classifications if their attributes did not have any clear separation (no "has rounded edges")? From what I understand, ME wouldn't allow it, so what would happen?
In general, this paper presents a framework to prevent the semantic drift problem in semi-supervised learning approaches such as bootstrapping. Unlike standard bootstrapping approaches, where samples and classes are pruned and processed independently and "rigidly", the authors propose to consider the relations among them and classify the samples in a way that is not as exclusive as before.
More specifically, the relations between samples are modeled using attributes and comparative attributes. The non-rigid classification is achieved using an adaptive mutual exclusion constraint: a candidate added to the pool of one class is not used as a negative example for the others.
To account for the intra-relations between samples (images in this case), a fully-connected graph is used here. But I think the fully-connected graph is very difficult to optimize over, and this is where the intractability of the problem comes from. Although approximation methods such as loopy belief propagation and Gibbs sampling exist, those methods are not guaranteed to converge to the optimal solution, especially for a fully-connected graph with millions of nodes. So I'm curious whether using a partially connected graph would make the problem more tractable or easier to optimize.
The first paper addresses the problem of concept drift in bootstrap-learning-based approaches for semi-supervised scene classification. Specifically, the authors claim that the initially labeled examples are insufficient to constrain the learning process, causing the classifiers to diverge from what they were intended to represent. Previous attempts to overcome this issue relied on multi-class classifiers, which strictly enforce the mutual exclusion constraint: a positive classification by one classifier implies a negative classification for all others. However, for visually similar classes this constraint is likely too stringent, resulting in overfitting and poor performance. Thus, this paper proposes relaxing the constraint. In addition, the authors introduce two other constraints based on binary attribute and comparative attribute classifiers, whose scores are combined as unary and binary potentials in an energy function. Images can then be classified jointly by (approximately) minimizing this function.
Since optimizing the energy function is intractable, only a single iteration of loopy belief propagation is performed. Is this sufficient to give reasonable results? How sensitive is it to initialization? Also, how efficient is the training procedure? It seems that multiple passes through the data would be required at each iteration. How does the computation time scale with the number of classes/attributes?
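For reference, a single synchronous round of sum-product loopy BP on a dense pairwise model looks roughly like the sketch below; this is a generic illustration, not the authors' implementation, and it ignores damping and scheduling details:

```python
import numpy as np

def one_lbp_round(unary, pairwise):
    """unary:    (n, k) non-negative potentials, one row per node.
    pairwise: (n, n, k, k) potentials over ordered pairs (label_i, label_j).
    Returns approximate per-node beliefs after one message-passing pass."""
    n, k = unary.shape
    msgs = np.ones((n, n, k))          # msgs[i, j]: message from node i to node j
    new_msgs = np.ones_like(msgs)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Product of incoming messages to i (excluding j's), times i's unary.
            incoming = unary[i] * np.prod(msgs[np.arange(n) != j][:, i, :], axis=0)
            new_msgs[i, j] = incoming @ pairwise[i, j]   # marginalize over i's label
            new_msgs[i, j] /= new_msgs[i, j].sum()       # normalize for stability
    beliefs = unary * np.prod(new_msgs, axis=0)          # messages into each node
    return beliefs / beliefs.sum(axis=1, keepdims=True)
```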
Is this approach restricted to fixed, pre-defined attributes? Attributes seem to be useful in situations where easily-interpretable, discriminative properties can be picked out manually from the classes of interest, but this may not always be the case. Could we also learn the attributes at the same time, or would there also be a drift in what the attributes represent?
I read the paper “Training Convolutional Networks with Noisy Labels”. This work models the noise by using a class confusion matrix. The idea seems very simple but effective.
One thing I'm not clear about: to model the label noise, especially label flips, is it appropriate to use class confusion as a feature? To my understanding, class confusion models class-level error, for example class "cat" being easier to confuse with class "tiger". Label-flip noise is more like an instance-level error, where whether one instance is flipped or not is simply a matter of randomness. Would it be better to consider the instance itself (such as its similarity to other data in the same class) to model the flipping noise?