B. Yao and F. Li. Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities, CVPR, 2010.
And additionally:
Z. Tu, Auto-context and Its Application to High-level Vision Tasks, CVPR, 2008.
D. Hoiem, A.A. Efros, and M. Hebert, Putting Objects in Perspective, IJCV 2008.
"Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities"
The authors build a strong argument for why we should think about context when working on object detection and pose estimation. They claim that each problem can benefit from using the other as mutual context. For this paper they focus on sports images, a specific case of Human-Object Interaction (HOI). As an example, the authors suggest that detecting small objects like a cricket ball might be a very hard task, but if we're able to detect another 'easier' object like a cricket bat, which helps in human pose estimation, this pose information can be used as context to get the possible locations of the ball.
They use a hierarchical random field to model the relationships between 'activity classes', 'objects' and 'human pose', where the human pose is described by various body parts. They also include latent variables to model the different locations of body parts that still contribute to the same 'pose'. Learning is then posed as a structure learning problem (which we've discussed previously).
The authors also claim that their model is capable of learning co-occurrence of activity, human pose and objects, different human poses for each activity, spatial context between objects and body parts, and relationships between other models. Besides results that show the advantage of using pose and object detections as mutual context for either problem, they also show good results in HOI action classification.
It would be interesting to see how such pose and object relationships can be scaled to deep networks (assuming we have lots of data) for action classification or maybe object detection. Would it make sense to use two networks, one for pose estimation and the other for object detection, and then fuse the top layers? (This would be similar to the two convolutional nets in the action classification paper from Prof. Zisserman's lab, which we discussed in class on Monday.)
I read the paper "Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities".
A main theme in the paper is the simultaneous detection of objects and humans in scenarios where both occur together and even complement or occlude each other. The hierarchical random field model is an intuitive representation of how a human body is composed of parts, how different body parts perform actions on the object (such as a cricket bat used by a batsman), and how a human performs an action on/with/using an object (a batsman swings the bat to hit a cricket ball).
The authors present a structural max-margin learning approach for learning which body parts in a human act on an object (a right-handed batsman will put his left leg forward during a front-foot drive shot in cricket). The model yields competitive results and successfully understands human-object interactions.
As Sri points out, a deep learning approach for emulating such a model would be interesting.
I think a supervised training approach could work, where we train a single-stream CNN on several versions of each dataset image: the original image (a batsman playing a defensive shot, for example), a tweaked image where the object (bat) is missing, a tweaked image where the human is missing, and finally a set of images where individual body parts of the human are occluded. This way, the CNN can learn relationships between each body part and the object, similar to the hierarchical model presented in this paper.
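The occlusion idea above can be sketched as follows. This is only a minimal illustration under assumptions: the image is a plain list of pixel rows, the region boxes are assumed to come from existing annotations, and the names `occlude` and `occlusion_variants` are hypothetical, not something from the paper.

```python
def occlude(image, box, fill=0):
    """Return a copy of `image` (a list of rows) with `box` filled in.

    `box` is (top, left, bottom, right), with bottom/right exclusive.
    """
    top, left, bottom, right = box
    return [
        [fill if top <= r < bottom and left <= c < right else px
         for c, px in enumerate(row)]
        for r, row in enumerate(image)
    ]


def occlusion_variants(image, regions):
    """Map each region name to a copy of the image with that region masked."""
    variants = {"original": [row[:] for row in image]}
    for name, box in regions.items():
        variants[f"no_{name}"] = occlude(image, box)
    return variants


# Toy 4x4 "image" and hypothetical annotated regions (object and one body part).
image = [[1] * 4 for _ in range(4)]
regions = {"bat": (0, 0, 2, 2), "left_leg": (2, 2, 4, 4)}
variants = occlusion_variants(image, regions)
```

Each variant would be fed to the network with the same activity label, so the training signal has to come from the parts that remain visible.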
I read the paper "Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities". The authors of the paper have tried to model context-based relationships between objects and human poses in images of people playing sports. They provide illustrations of how challenging it is to perform both object detection and human pose estimation in these images, and explain that it is beneficial to share/combine mutual context information.
The authors use a hierarchical random field model with many activity classes, objects and human poses. They utilize the connections between the object, pose and body parts in order to determine which parts of the body should actually be connected to the object and the pose. For example, a tennis racket held in the right hand of a person is most likely going to be used for hitting the tennis ball. To this end, they use potential functions to describe the various object-pose relationships, such as object co-occurrence with a specific human pose, different human poses for the same activity, etc. The model learns these relationships from labeled training images, using structure learning and max-margin parameter estimation. Structure learning endeavors to find connections between the object and pose, while max-margin parameter estimation tries to discriminate between activity classes. Given a testing image, they detect the object and estimate the pose of the human by maximizing the model's likelihood score for the testing image.
The results of the paper are somewhat expected because the authors try to reason about almost all possible human poses present in the training set, either visible or occluded. I was of the opinion that background cues would also help the model learn the relationships between objects and human poses. The authors have also predicted/echoed that this might enhance the results of the model.
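To make the "potential functions plus maximum score" description concrete, here is a minimal sketch of joint inference on a toy model. Everything in it is an assumption for illustration: the labels, the potential values, and the function names `score` and `best_configuration` are hypothetical, and the paper's actual model also reasons over body-part locations, which this sketch ignores.

```python
from itertools import product

# Hypothetical toy label sets.
ACTIVITIES = ["cricket_batting", "tennis_serve"]
OBJECTS = ["cricket_bat", "tennis_racket"]
POSES = ["front_foot_drive", "overhead_reach"]

# Assumed compatibility potentials (higher means more compatible).
ACT_OBJ = {("cricket_batting", "cricket_bat"): 2.0,
           ("tennis_serve", "tennis_racket"): 2.0}
ACT_POSE = {("cricket_batting", "front_foot_drive"): 1.5,
            ("tennis_serve", "overhead_reach"): 1.5}
OBJ_POSE = {("cricket_bat", "front_foot_drive"): 1.0,
            ("tennis_racket", "overhead_reach"): 1.0}


def score(activity, obj, pose, obj_det, pose_det):
    """Sum of pairwise compatibilities plus the individual detector scores."""
    return (ACT_OBJ.get((activity, obj), 0.0)
            + ACT_POSE.get((activity, pose), 0.0)
            + OBJ_POSE.get((obj, pose), 0.0)
            + obj_det[obj]
            + pose_det[pose])


def best_configuration(obj_det, pose_det):
    """Exhaustively search all (activity, object, pose) triples."""
    return max(product(ACTIVITIES, OBJECTS, POSES),
               key=lambda c: score(*c, obj_det, pose_det))


# The object detector alone slightly prefers the racket, but the pose
# evidence pulls the joint decision toward the cricket bat.
obj_det = {"cricket_bat": 0.4, "tennis_racket": 0.5}
pose_det = {"front_foot_drive": 0.9, "overhead_reach": 0.1}
best = best_configuration(obj_det, pose_det)
```

In this toy setting the pose acts as context for the object: the joint score overrides the weak, slightly wrong object detection. Exhaustive search only works here because the label sets are tiny.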
I read the main paper. It's an interesting paper with quite impressive results and a nice motivation. People above have summarized it well so I won't repeat that.
I think this approach of modelling context can be applied to many other problems as well. In general, object detection can benefit from the image context. For example, animals would rarely be found co-occurring with tables, chairs, etc. Furthermore, human pose estimation can help estimate an object's fine pose and vice versa; for example, a person sitting on a chair can help find the fine pose of the chair along with the pose of the person.
I read the paper titled "Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities".
In this paper the authors are trying to improve object detection, pose estimation and activity recognition by building a joint hierarchical Random Field (RF). The authors claim that this RF lets each task provide context to the others, which in turn helps improve performance on all of them.
They learn the parameters as well as the structure of the RF. The structure is learned using a hill-climbing approach, and the parameters are learned using a max-margin approach with a newly proposed objective and constraints.
One thing I am not clear about is how they divide the poses in every action class into subclasses. Is the number of subclasses fixed for every action category, or is it equal to the number of training examples in that action category?
As far as I understand, at each iteration they pick the worst (least discriminative) model and then cluster the samples in that model. A pose is just a model, and they stop when they have enough models (poses) such that each model makes at most 3 mistakes.
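For readers unfamiliar with hill climbing over graph structures, here is a generic sketch of the idea. It is not the paper's procedure: the scoring function, the node names, and `hill_climb_structure` itself are hypothetical, and the toy score simply rewards a fixed set of "useful" edges and penalizes the rest.

```python
from itertools import combinations


def hill_climb_structure(nodes, score_fn, initial_edges=frozenset()):
    """Greedy hill climbing: toggle one edge per step, keep the best gain."""
    current = frozenset(initial_edges)
    current_score = score_fn(current)
    candidates = [frozenset(pair) for pair in combinations(nodes, 2)]
    while True:
        best, best_score = None, current_score
        for edge in candidates:
            neighbor = current ^ {edge}  # add the edge if absent, else remove it
            s = score_fn(neighbor)
            if s > best_score:
                best, best_score = neighbor, s
        if best is None:  # no single toggle improves the score
            return current, current_score
        current, current_score = best, best_score


# Toy score (assumed): +1 for each useful edge, -0.5 for any other edge.
nodes = ["object", "torso", "right_hand", "left_leg"]
useful = {frozenset({"object", "right_hand"}), frozenset({"object", "torso"})}


def toy_score(edge_set):
    return sum(1.0 if e in useful else -0.5 for e in edge_set)


edges, final_score = hill_climb_structure(nodes, toy_score)
```

Like any hill climbing, this returns a local optimum, so the result can depend on `initial_edges` when the score is less well behaved than this toy one.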
The paper "Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities" by Yao et al. discusses the following challenging tasks: 1) detecting objects in cluttered scenes and 2) estimating human pose. The authors describe a random field model that facilitates both human pose estimation and object recognition by learning the mutual context of human poses and objects in a sports dataset consisting of classes of human-object interactions. They evaluate their proposed method by first separately testing object detection performance and human pose estimation performance against various algorithms, and then by inferring the human-object interaction label from the human pose and the objects. The authors show improvement not only over baseline algorithms, but also over recent work in the literature. Overall, this work seems like a very elegant method of utilizing context to perform various difficult computer vision tasks. It would be interesting to see this method applied for security purposes such as detecting pedestrians who are behaving suspiciously (say, based on their walking patterns or on the objects they are carrying).
I read the first paper. It is very intuitive that jointly modeling the object and the human pose within human-object activities should increase performance. Sharing information across detectors that are each error-prone also lets the individual detectors do better than they would alone. The hierarchical random field structure suggests that a deep network applied to this problem would most likely do well. I think the two-network design Sri suggested doesn't model the random field structure well enough, so combining everything into one large network seems a better approach. It wasn't a surprise that this paper performed better than the state of the art, since it uses more information to produce a better estimate of the action as well as of the object detection and pose estimation.
The authors of "Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities" have shown that using the mutual context between small occluded objects and human pose can greatly improve the detection accuracy of both. They have shown almost a threefold improvement in accuracy over sliding-window-based approaches that do not consider context. They used a graphical model to encode the interaction between the human body parts and the object.
While the results are very promising, I feel they have over-constrained their problem to just human poses and objects. They could have included a section about how this method can be adapted to detect, say, cars and roads. I am also unsure whether it is possible to encode a variable number of object parts in the graph they proposed, depending on the object under consideration.
"Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities":
I like the idea proposed in this paper of exploiting the mutual relationship between objects and human poses. The performance seems promising (although there might be some problems with the data they use, as it is too specific). However, to be honest, I'm a little bit suspicious about structure learning, especially inferring a non-tree structure from data. There is no analysis of how good the local optima are or of how much the learned structures vary across different initial points.
I read the first paper, "Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities." Since the authors focus on unveiling the relationship between human poses and objects (specifically on a sports dataset), I think the idea is pretty straightforward. Although I'm not 100% sure, I guess deep learning can also model the context between these two easily without any handcrafted potentials. Also, as Ming mentioned above, the authors did not analyze the difference in performance under different initializations, but I guess it doesn't really matter since it works so well...?
I read the paper "Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities".
The idea of incorporating context information during detection leads to an improvement in detection performance, as demonstrated in the experiments. I think it is interesting to compare the method in this paper to the Deformable Parts Model (DPM). Although those two papers tackle the detection problem in very different ways (one from the perspective of a hierarchical graphical model; the other, DPM, uses a two-layer representation of targets), they both model the relation (mostly the spatial relation) between "parts". In this paper, the object can be thought of as the counterpart of a "part". The difference between the object here and the parts of the target is that the object is also directly correlated with the action classification, and the object itself is not decomposed into different parts. The energy models defined in Sec 3.1 also support this view: the spatial relationships between parts, and between the object and parts, are almost the same (eq. 1 and eq. 2). The difference lies in the first energy model (agreement between A-action, O-object and H-human).
As Sri has suggested, I also think using a two-stream (or more) CNN framework is promising for improving the performance even further, since the relationship between "parts" can likely be modeled by a CNN.
In the paper "Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities", can the approach be used to understand scenes? For instance, if a person stands with a cricket bat on a tennis court, can we use the correlation between the net and maybe some other features to say that the cricket bat is an anomaly given the context? Also, since the authors are trying to classify a pose based on the correlation of the pose with the object, could a video be used so that motion contributes to classification, given that a tennis serve or some other pose generally follows the same major motion trajectory?
"Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities" raises an interesting point: object detection and pose estimation can mutually refine each other with the context information they provide.
They built their model using a graph structure that captures the relevance between pose, object and activity class. The graph representation makes me wonder whether there is any graph-based algorithm that would lower the complexity of searching for relevance using the model.
I read the paper "Putting Objects in Perspective". I think it is a great paper and one of the classics in the field, with over 680 citations to date. In this paper they present a framework for object detection in a 3D scene that captures and models the interplay between different parts of the scene. They represent the scene as a tree structure over viewpoint, objects and surface geometry, under the reasonable assumption that all objects of interest lie on the ground plane. Using such reasoning they limit the search space of the object detectors and hence refine the predicted bounding boxes. They do this in a cyclic fashion, where probabilistic predictions are used to refine the geometry and vice versa. They have also performed extensive evaluations to show the improvements in results. I think this is one of the important papers in the field that really tries to understand the interplay between recognition and geometry in a very structured way.
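As a rough illustration of how the ground-plane assumption constrains detections, here is a sketch under simplifying assumptions that are mine, not necessarily the paper's exact formulation: a level pinhole camera, an object standing upright on the ground plane, and pixel rows that increase downward. Under those assumptions, similar triangles give world height = image height × camera height / (rows between the horizon and the object's base). The function name `world_height` is hypothetical.

```python
def world_height(image_height_px, bottom_row, horizon_row, camera_height_m):
    """Estimate the real-world height of an object standing on the ground.

    Assumes a level pinhole camera and image rows that increase downward.
    """
    rows_below_horizon = bottom_row - horizon_row
    if rows_below_horizon <= 0:
        raise ValueError("object base must lie below the horizon")
    return image_height_px * camera_height_m / rows_below_horizon


# Camera 1.6 m above the ground, horizon at row 200, detection base at row 300.
plausible = world_height(106, 300, 200, 1.6)    # about 1.7 m: fits a pedestrian
implausible = world_height(300, 300, 200, 1.6)  # 4.8 m: too tall for a pedestrian
```

A detector could use such an estimate to down-weight boxes whose implied height is unlikely for the object class, which is the flavor of pruning the comment describes.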
Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities
This paper approaches the intuitive problem of trying to use mutual context to improve both human pose estimation and object detection. The results are promising, showing significant improvements over approaches that use only weak context, e.g. the bounding box of a pedestrian detector. This suggests that previous context-based detectors that only showed minor improvements may not have been using rich enough contextual information.
My main issue with this paper is that learning does not seem to be scalable. Inference must be done on all possible action classes separately. Furthermore, they only test on very small datasets with very few classes and give no indication of running time.
Is there a more efficient, scalable approach to incorporating rich context into problems like this without having to resort to complicated graphical models with inefficient learning/inference?