Tuesday, March 17, 2015

Reading for Wednesday 3/18

D. Parikh and K. Grauman. Relative Attributes, ICCV, 2011.

And additionally:

A. Farhadi, I. Endres, D. Hoiem and D. Forsyth. Describing Objects by their Attributes, CVPR, 2009.

N. Kumar, A. Berg, P. Belhumeur and S. Nayar. Attribute and Simile Classifiers for Face Verification, ICCV, 2009.

18 comments:

  1. Hi All,

    I'll be presenting the "Relative Attributes" paper tomorrow. To get a feel for the paper and for relative attributes, you can try out the following in your free time.

    http://godel.ece.vt.edu/whittle/

    The basic idea of the paper is how we humans communicate when describing something we know of but the listener has never seen. For example, if you need to describe a person you know to someone who has never seen that person, you would describe them relative to people the listener has seen: "he is shorter than him, darker than him," etc. This is the fundamental idea behind "Relative Attributes." We shall discuss how to actually formulate this relativeness mathematically (with an SVM), and later we shall see some applications of relative attributes. As a teaser, the sketch below shows what the relative supervision looks like as data.
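
    A minimal sketch (my own illustration, not code from the paper; the features and pairs are made up) of the training data a relative-attribute learner consumes:

      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(6, 512))     # 6 images, 512-dim features (placeholder for GIST+color)

      # For the attribute "smiling":
      ordered_pairs = [(0, 1), (2, 3)]  # image 0 smiles more than image 1, etc.
      similar_pairs = [(4, 5)]          # images 4 and 5 smile about equally

      # The goal is a weight vector w with w @ X[i] > w @ X[j] for ordered
      # pairs and w @ X[i] ~= w @ X[j] for similar pairs; ranking by w @ x
      # then gives the degree of "smiling" for any new image.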

    Let me know if you want me to discuss anything in particular about the paper in more detail.

    Replies
    1. One question I have is regarding the statement below.
      "Our goal is instead to estimate the degree of that attribute’s presence—which, importantly, differs from the probability of a binary classifier’s prediction." How is using the probability of a binary classifier to rank the images by attributes different from the authors' approach?

  2. Here's my summary of, and comments on, "Attribute and Simile Classifiers for Face Verification":

    In addition to using describable attributes (e.g., blond hair, female), the authors propose "simile" traits, which describe how much a part of the test face looks like the corresponding part of a reference face. This is very intuitive if we think about how we usually describe a person's appearance.
    The simile traits have two advantages:
    1. They describe characteristics in terms of the data itself: here, a part of the test face is described by its similarity to the same part of the reference faces. This "relative" definition makes sense, since we often describe appearance by comparison, in addition to absolute descriptions.

    2. The simile classifiers need much less labeled data, since the label set is only {of the reference person, not of the reference person}, compared with the labels needed for the absolute attributes (65 labels for one sample).

    This also suggests another way to define traits: compare the training/testing data with reference data of the same type. In this example, the data type is the appearance of faces, but it could be any other type: the appearance of X, motion patterns, spatial configurations, etc. A toy sketch of the idea is below.
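
    Here's how I picture the simile features (illustrative only; the descriptors and training data are random stand-ins, not the paper's pipeline):

      import numpy as np
      from sklearn.svm import SVC

      rng = np.random.default_rng(0)

      # Toy simile classifier: "does this nose look like person A's nose?"
      pos = rng.normal(loc=1.0, size=(20, 64))  # descriptors of person A's nose region
      neg = rng.normal(loc=0.0, size=(20, 64))  # nose regions of other people
      clf = SVC(kernel="rbf").fit(np.vstack([pos, neg]), [1] * 20 + [0] * 20)

      # On a new face, the signed score from each (reference person, face part)
      # classifier becomes one entry of the simile feature vector used for
      # verification.
      test_nose = rng.normal(size=(1, 64))
      simile_entry = clf.decision_function(test_nose).item()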

    In experiment section 4.2, the authors state that they did not use the regions outside the face in their method, but they do not mention whether those regions are used by the comparison methods in their experiments. Readers may also be interested in the effect of using the regions outside the face in the authors' own method, so an additional experiment comparing performance with and without those regions would be appreciated.

  3. In my opinion, the first two papers are similar to some extent. Both introduce a new way to describe an object: one uses "relative attributes" and the other "discriminative attributes." Conceptually, though, I think they are doing the same thing, namely comparing an object with others and then describing it. The main difference is that the former paper uses relative phrases to describe an attribute (e.g., taller than, bigger than), so an attribute is no longer a yes/no question, while the latter simply uses "xxx has ooo / yyy doesn't have ooo" to describe objects, making it a two-option question. (Personally, I feel this is similar to soft assignment vs. hard assignment.)

    Also, one thing worth mentioning is that Farhadi et al. introduce a novel feature selection method that learns attributes which generalize well across categories. The idea is to focus on within-category prediction ability. For example, instead of directly training a wheel classifier, which may accidentally learn the concept "metallic" because all wheels in the dataset are surrounded by metal, the authors select features that can distinguish "cars with wheels" from "cars without wheels," and "trains with wheels" from "trains without wheels." Since all cars and trains have metallic surfaces, this reduces the chance of being confused by it. A sketch of that selection step is below.
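
    A minimal sketch of within-category feature selection as I understand it (toy data and names; Farhadi et al. use L1-regularized logistic regression for this step):

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      def select_features_within_category(X, has_attr, l1_C=0.1):
          """Fit an L1-regularized classifier on images of ONE category
          (e.g. only cars), labeled by attribute presence (wheel / no wheel).
          Features with nonzero weight are informative *within* the category,
          so they cannot be shortcuts shared by the whole category
          (e.g. "metallic")."""
          clf = LogisticRegression(penalty="l1", solver="liblinear", C=l1_C)
          clf.fit(X, has_attr)
          return np.flatnonzero(clf.coef_.ravel())

      rng = np.random.default_rng(0)
      cars = rng.normal(size=(100, 200))            # toy features for car images
      car_has_wheel = rng.integers(0, 2, size=100)  # toy attribute labels
      keep = select_features_within_category(cars, car_has_wheel)
      # The union of selected indices across categories forms the feature
      # pool for the final cross-category attribute classifier.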

  4. One of the interesting things about learning attributes is that you can localize them and get reasonable results even when ground-truth segmentation masks are not in the training set.

    In terms of learning to rank, there is also a similar paper from Fernando's group:
    http://www.humansensing.cs.cmu.edu/projects/mmed/MMED_CVPR12.pdf

    More maths inside though :P

  5. In "Attribute and Simile Classifiers for Face Verification," are the face parts aligned for the simile classifiers? If not, how do they deal with people in various poses having different-sized noses at different locations in the image?

    Replies
    1. The face images are roughly aligned simply by virtue of being face images. Also, GIST uses pooling, so it has some spatial invariance.

  6. The Relative Attributes paper seems like a high-level vision paper.
    I have a few questions:
    First, how expensive is the relative-attribute data collection? Don't 8 categories seem very few?
    Are there results for image features apart from GIST?

    Would it be interesting to learn the most discriminative attributes?

    Replies
    1. In their case, I don't think it is very expensive, since they are just learning a single "line." But in general, without any assumptions, you might need to look at all pairs in your training data, which is prohibitively expensive. Yes, 8 categories is very few, but this was just a proof of concept; learning to rank is expensive.

      They also used LAB color features, but isn't GIST a pretty reasonable choice for general image features?

  7. Regarding the main paper: this was obviously a seminal work (Marr Prize and all that), and its contribution was getting relative supervision to work nicely in vision. However, I still wonder whether it is really solving the problem of "relative attributes," since it just uses relative supervision to learn a "line" in feature space along which all of the images lie. That means the main results of the paper (zero-shot learning and caption generation) essentially just use the learned mapping to project an image to a "smileness" score, for example, which is then used to do something interesting.

    This is a nice concept, but I was curious to hear what you all think. What they are saying is that they can find a single nice (linear) manifold that corresponds to the "smileness" of a picture; so it is not learning that something is more smiling than something else, it is just learning to rank. Is this the right way to do it? (For concreteness, my toy reading of the zero-shot step built on those scores is below.)
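
    Here is a toy, single-attribute version of the zero-shot step as I read it (made-up numbers; the paper uses one ranker per attribute and a Gaussian per class in the joint score space):

      import numpy as np
      from scipy.stats import norm

      # Ranking scores w @ x for training images of two SEEN classes:
      scores_A = np.array([0.20, 0.30, 0.25])  # a "not very smiling" class
      scores_B = np.array([0.90, 1.00, 0.95])  # a "very smiling" class

      mu_A, mu_B = scores_A.mean(), scores_B.mean()
      sigma = np.concatenate([scores_A - mu_A, scores_B - mu_B]).std()

      # An UNSEEN class described only relatively ("smiles more than A,
      # less than B") gets its mean placed between the two seen means.
      mu_unseen = (mu_A + mu_B) / 2.0

      # A test image is assigned to the class whose Gaussian gives its
      # score the highest likelihood.
      x_score = 0.6
      means = {"A": mu_A, "B": mu_B, "unseen": mu_unseen}
      pred = max(means, key=lambda c: norm.pdf(x_score, loc=means[c], scale=sigma))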

  8. The authors adapt the Ranking-SVM formulation (Joachims '02), except with a quadratic loss and some similarity constraints. I'm not sure they give any justification, or show experiments for where/why/how this works better than plain Ranking-SVM out of the box... A sketch of the modified objective as I read it is below.
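
    A minimal sketch of that objective (my reading: a squared hinge on ordered-pair violations plus a squared penalty tying similar pairs together), optimized here with plain gradient descent on toy data rather than the Newton-style solver one would use in practice:

      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(10, 16))       # toy image features
      ordered = [(0, 1), (2, 3), (4, 5)]  # i should outrank j
      similar = [(6, 7), (8, 9)]          # i and j should score alike
      C, lr = 10.0, 0.01
      w = np.zeros(16)

      for _ in range(500):
          grad = w.copy()                       # from the 0.5 * ||w||^2 term
          for i, j in ordered:
              d = X[i] - X[j]
              slack = 1.0 - w @ d
              if slack > 0:                     # squared hinge: max(0, 1 - w.d)^2
                  grad -= C * 2.0 * slack * d
          for i, j in similar:
              d = X[i] - X[j]
              grad += C * 2.0 * (w @ d) * d     # squared similarity penalty
          w -= lr * grad

      # Higher w @ X[i] now means "more of the attribute" under the toy pairs.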

  9. In the Relative Attributes paper, the authors describe a method for generating "relative" descriptions of images, which differ from the more typical "binary" descriptions. An example of a relative description is: "Person A is smiling more than Person B." A binary description would simply indicate whether or not Person A is smiling and whether Person B is smiling. The advantage of the former is that it is less restrictive and more natural in human speech. To quantitatively evaluate the quality of the relative descriptions, the authors recruited 18 subjects and found that subjects were more likely to identify the correct image from relative descriptions than from binary ones (48% vs. 34%). Although the increase in identification accuracy is noteworthy, I have an issue with the sample size and with the fact that "some" of the subjects were familiar with computer vision. I would like to see the authors scale this experiment up to hundreds of individuals; one way to do that is Amazon's Mechanical Turk platform.

  10. I found "Relative Attributes" interesting because it is the first paper I have read where the authors base an algorithm entirely on how well its input and output can be interfaced with humans, rather than on the novelty of the algorithm itself. They claim that supervised human input to a system is more accurate/reliable when attributes are described relative to a similar object rather than as an absolute measure. Similarly, they claim that humans tend to prefer descriptions of objects relative to similar objects over stand-alone ones. Their algorithm performs better than related work chiefly because the human input to the system is more accurate and the output is better accepted by humans.

    However, one detail I could not grasp was regarding the following statement:
    "Our goal is instead to estimate the degree of that attribute’s presence—which, importantly, differs from the probability of a binary classifier’s prediction." How is using the probability of a binary classifier to rank the images by attributes different from their approach?

    Replies
    1. The binary classifiers are usually trained discriminatively (like an SVM): they only care about separating the negatives from the positives, not about the ordering among the positives. The probability just reflects the distance from the separating hyperplane, which indicates how hard it is to predict the attribute's presence in the image; that is not the same as the strength of the attribute in it. So the probability of a binary classifier doesn't actually give you the strength of the attribute. (A tiny numeric illustration is below.)
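
      A tiny numeric illustration of this point (made-up margins): a Platt-style sigmoid on the classifier margin saturates far from the boundary, so it stops distinguishing a big grin from a slight smile, even though a learned ranker would still order them.

        import numpy as np

        def prob(margin):
            # Platt-style sigmoid mapping an SVM margin to a probability
            return 1.0 / (1.0 + np.exp(-margin))

        slight_smile, big_grin = 4.0, 9.0          # both clearly "smiling"
        print(prob(slight_smile), prob(big_grin))  # ~0.982 vs ~0.9999, nearly equal
        # The classifier was never trained to order positives, so this
        # near-saturated probability says little about smile strength.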

  11. "Relative attribute" is smart in that it not only categorize images, but also relate them to others. Each image is defined by its own degree of attributes, thus we can easily compare or group images by our own standard. One thing I noticed is the scoring system. There is no universal way of defining a scoring system for attributes, how did they deal with scoring for different attributes? This seems to be more objective based since different people rank different attributes differently.

  12. Something I noticed about "relative attributes" is that the attributes were tied to the dataset. For instance, the smile attribute was tied to faces, meaning a face already had to be known to exist in the image. While this isn't very problematic for faces, it could be for something more vague, like an outdoor scene. Is there a way to apply this relative-attributes method without prior context for the data?

    Also, the data used seems to have only a single point of interest per image. Is there a way to use this method on images with more than one instance of an attribute (i.e., two people in one image)?

    Replies
    1. To answer the second question: they used a general picture descriptor for the whole image (GIST+LAB, etc.). It should be straightforward to generalize this to the case where you run a generic category detector first (a face detector, etc.) and then only compute the descriptor for each bounding box, along the lines of the sketch below.
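
      A rough sketch of that pipeline with OpenCV (the cascade and the LAB histogram are stand-ins; the paper's actual GIST+LAB extraction is not shown, and "photo.jpg" is a hypothetical input):

        import cv2
        import numpy as np

        img = cv2.imread("photo.jpg")
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

        # Generic category detector: a stock frontal-face cascade.
        cascade = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

        descriptors = []
        for (x, y, w, h) in faces:
            crop = img[y:y + h, x:x + w]
            lab = cv2.cvtColor(crop, cv2.COLOR_BGR2LAB)
            # Cheap stand-in descriptor: a LAB color histogram computed
            # only inside the detected box.
            hist = cv2.calcHist([lab], [0, 1, 2], None, [8, 8, 8],
                                [0, 256, 0, 256, 0, 256]).ravel()
            descriptors.append(hist / hist.sum())
        # Each per-box descriptor would then be fed to the attribute rankers.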

  13. The problem I see with this paper is that there seems to be more subjectivity involved than in other algorithms. I like that relative attributes are a closer representation of how humans use attributes than binary classification categories are. However, how do they deal with pairwise image comparisons that are ambiguous: do they throw these out of the training set? As some people have noted, does the number of categories drastically affect the results, and can this be scaled? I think fundamentally it can't be, although a minimal set might be found. I also agree with the sentiment that this needs to be tried at a larger scale before it can be fully accepted, since part of why it does better is that the human input is more accurate under the relative-attributes description.
