Sunday, March 29, 2015

Reading for Monday 3/30

K. Simonyan, A. Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos, arXiv preprint arXiv:1406.2199 (2014).

And additionally:

A. Jain, A. Gupta, M. Rodriguez, L. Davis. Representing videos using mid-level discriminative patches, CVPR, 2013.

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, F. Li. Large-scale Video Classification with Convolutional Neural Networks, CVPR, 2014.


  1. I read “Large-scale Video Classification with Convolutional Neural Networks” by Karpathy et al. In this paper, the authors extend the connectivity of their network from a single image to multiple frames in time, so that the CNN has access not only to the appearance information in single, static images but also to its temporal evolution. However, this can lengthen training, since the network must process several frames at a time rather than one. The authors therefore propose a multiresolution, foveated architecture to speed up training while maintaining decent performance. I think the paper's overall concept is reasonable and very straightforward (e.g. how they extend the network to the time dimension). It's good to see that it really works, and I believe the dataset they released (Sports-1M) definitely opens the door to deep learning on videos!
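A minimal sketch of the foveated, two-resolution input described in this comment, assuming a 178×178 frame that is split into a half-resolution context view and a full-resolution center crop. The function name `foveated_streams` and the 2×2 average pooling used for downsampling are assumptions made for illustration, not details taken from the paper's implementation.

```python
import numpy as np

def foveated_streams(frame):
    """Split one frame into a low-resolution context view and a center fovea crop.

    frame: array of shape (H, W, C) with even H and W.
    Returns two arrays, each of shape (H // 2, W // 2, C).
    """
    h, w, c = frame.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # Context stream: the whole frame at half resolution (2x2 average pooling).
    context = frame.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
    # Fovea stream: full-resolution center crop spanning half of each side.
    top, left = h // 4, w // 4
    fovea = frame[top:top + h // 2, left:left + w // 2]
    return context, fovea

frame = np.random.rand(178, 178, 3)  # stand-in for a real video frame
context, fovea = foveated_streams(frame)
print(context.shape, fovea.shape)
```

Each stream sees a quarter as many pixels as the full frame, which is where the speed-up would come from under these assumptions.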

  2. Two-Stream Convolutional Networks for Action Recognition in Videos:

    Things I like: (1) Explicitly using optical flow to train the temporal net. This avoids having to learn difficult motion filters. (2) Multitask learning to deal with the small datasets.

    Things I am upset about: although the proposed architecture gives much better results than any previous net, the improvement is still minimal compared to state-of-the-art shallow models running on hand-crafted features.

    Things I don't like: the paper claims that current datasets are still small, yet it does not try the 1M-video dataset recently introduced by Fei-Fei Li's group.

  3. I read the paper "Representing Videos Using Mid-level Discriminative Patches". In this paper, the authors tackle the problem of defining relevant actions in videos and classifying videos. This is done through spatio-temporal patches, or small snippets of motion, which are combined with exemplar SVMs (eSVMs) to classify the actions. Rather than clustering all candidate patches directly with something like k-means (which would fail because standard distance metrics break down in such a high-dimensional space), the candidate patches are separated using eSVMs, after KNN is applied to reduce the candidate set to a manageable size (from thousands to hundreds). Patches are ranked by two criteria: Appearance Consistency (eSVM scores of the top 10 detections) and Purity (the ratio of in-class to out-of-class patches). At test time the eSVMs are applied with a sliding-cuboid approach. Finally, a cost function is minimized to establish a correspondence between training and testing videos, which makes it possible to align similar videos on top of one another. The paper was straightforward to understand.

    One question I have is:
    What happens if a given action is defined by the same movement repeated? Or if a motion pattern is shared between two classes?

    1. In the case where a motion pattern is shared between two classes, the context-dependent patch selection technique should help. Through this technique, one can construct activation units that fire only if motions specific to a certain kind of action are present. For example, if the action vocabulary is "bend", "rotate", "throw", "jump", "run", then we can think of a "long jump" as the vector [bend = 1, rotate = 0, throw = 0, run = 0, jump = 1] and a "discus throw" as the vector [bend = 1, rotate = 1, throw = 1, run = 0, jump = 0] (just for the sake of an example). The two sports actions share the motion "bend", but the context-dependent patch selection will help distinguish them.
      In the case where an action is defined by the same movement repeated, the technique should work as before, with the action defined by a single motion. For example, a "track race" is defined only by running [run = 1, all others = 0]. I think the activation units will still work the same way.

  4. In the paper "Two-Stream Convolutional Networks for Action Recognition in Videos" I found the method a bit against the deep learning philosophy. Just as we learn from monocular images, can't we give a volume of images to the network and then let it learn the motion from the successive frames? Why do we need to enforce motion instead of letting the network learn it? In section three, where they stack the optical flow, I didn't understand why the x-displacement and the y-displacement are stacked separately, while in the figure it looks like the vector d_x + d_y has been stacked. Is the bi-directional optical flow taken into consideration just to make the optical flow more accurate? There is no significant improvement (0.2%) from introducing bi-directional optical flow. Were the results compared against using just one stream? Also, for one of the streams we sample a video and a frame in that video; how much does this sampling affect the performance? Why do fine-tuning and just changing the last layer perform better than training from scratch? Is it because, even though we have lots of images (frames), they are not much different from each other? Are there any other deep learning architectures that beat the state of the art in video classification?

    1. The authors argue that there are papers which tried learning the motion from a volume of images, but these papers found that this doesn't work (section 1.1, second-to-last paragraph).

      Also, I think the bi-directional flow was just to make the flow more accurate.

      Even if you consider all of the frames in the datasets, I don't think they have as much data as ImageNet has. Also, the frames in a video are slight variations of each other, so they act like the data you get from data augmentation, not real data. The real data is just one frame per video. That's why I think the network trained from scratch didn't work very well. I am not sure why just changing the last layer works better than fine-tuning the whole network.
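On the question of how d_x and d_y are stacked, here is a minimal sketch of one reading of the paper's optical flow stacking: the two components of each flow field are kept as separate channels (not summed), so L consecutive flow fields give an input volume with 2L channels. The function name `stack_flow` and the (L, H, W, 2) array layout are assumptions made for illustration.

```python
import numpy as np

def stack_flow(flows):
    """Stack L flow fields into a single (H, W, 2L) input volume.

    flows: array-like of shape (L, H, W, 2), where the last axis holds the
    horizontal (d_x) and vertical (d_y) displacement at each pixel.
    Channel 2k of the result is d_x of flow k and channel 2k + 1 is d_y
    of flow k (0-based k).
    """
    flows = np.asarray(flows)
    num_flows, height, width, components = flows.shape
    if components != 2:
        raise ValueError("each flow field needs exactly two components")
    # Move the flow index next to the component axis, then merge the two.
    return flows.transpose(1, 2, 0, 3).reshape(height, width, 2 * num_flows)

# Ten random flow fields stand in for real optical flow between 11 frames.
volume = stack_flow(np.random.randn(10, 224, 224, 2))
print(volume.shape)
```

Keeping the components in separate channels lets the first convolutional layer learn filters over direction as well as magnitude, which a summed scalar would not allow.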

  5. I read the paper "Two-Stream Convolutional Networks for Action Recognition in Videos".

    This paper deals with action recognition in videos, using deep CNNs for the task. Videos give us two complementary kinds of information: spatial and temporal. The authors train two different nets for the two types of information and fuse them at the end using averaging or an SVM. The spatial information is encoded in the still frames of the videos, so they fine-tune an existing CNN trained on ImageNet to predict actions from them. They encode the temporal information using optical flow. Optical flow is itself encoded in two different ways: stacking the optical flow of L consecutive frames as input to the CNN, and stacking the flow along the trajectories of points over L consecutive frames. This CNN is trained from scratch using multi-task learning, which is used because far less data is available in current action recognition datasets.

    I have one question:
    They fuse the outputs of both CNNs at the end. But wouldn’t it be better if they also fused them at some of the earlier layers?
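A minimal sketch of the late fusion by averaging mentioned in this comment, assuming each stream outputs one vector of class logits per video. The function names and the equal weighting of the two streams are assumptions made for illustration; the SVM-based fusion is not shown.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def fuse_by_averaging(spatial_logits, temporal_logits):
    """Average the two streams' class probabilities and pick the best class."""
    scores = 0.5 * (softmax(spatial_logits) + softmax(temporal_logits))
    return scores, scores.argmax(axis=-1)

# One video, three classes: the spatial net prefers class 0, while the
# temporal net prefers class 2 with higher confidence.
spatial = np.array([[2.0, 0.0, 0.0]])
temporal = np.array([[0.0, 0.0, 3.0]])
scores, labels = fuse_by_averaging(spatial, temporal)
print(scores, labels)
```

Because the streams only meet at the score level, neither network can use the other's intermediate features, which is exactly the limitation the question about earlier fusion points at.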

  6. In the "Two-Stream Convolutional Networks" paper, the idea of combining networks seems very interesting.
    I wonder if this can be made better by giving a 3D video input directly. Also, I read that their method doesn't surpass the best non-deep method. Is there any reason for this apart from the lack of data?

  7. I've read "Large-Scale Video Classification with Convolutional Neural Networks"

    1. I like the idea of the multiresolution CNN architecture. Using a smaller crop of the image can greatly speed up the training process. And from the results, it seems multires can even outperform the baseline method.

    2. In the "Single-frame context only" result, there is only a 1% drop in top-5 hits compared to single-frame multires. I think this suggests that if we want to try out an idea quickly, resizing to a smaller image could be the first thing to try.

    3. I don't quite understand the large improvement of CNN Average compared to slow fusion. Since slow fusion combines the ideas of early and late fusion, if this network is learnt properly, the information from the other models (single, early and late) should be redundant, which cannot explain the improvement from combining all model outputs (CNN Average).

    1. I think your question in part 3 is interesting. It's partly answered in the paper where the authors discuss "Contributions of motion". The authors suggest via their experiments that the Slow-fusion CNNs are motion-aware, but don't perform well in the presence of camera shake and zoom. This suggests that using the Single, Early and Late CNNs along with Slow-fusion may help combat the issue of camera motion.

  8. The paper "Large-scale Video Classification with Convolutional Neural Networks" is a very naive first try at using CNNs for action recognition.

    Their approach is just extracting a very limited number of frames from the video, throwing them into the CNN and doing classification. There is no study of how to use temporal information beyond using a clip of 4 or 10 frames as input.

    The results are actually very bad. Their number on UCF-101 is 20% worse than the state-of-the-art method (improved dense trajectories), but they do not include it in the comparison.

  9. Two-Stream Convolutional Networks for Action Recognition in Videos:

    This paper attempts to overcome previous difficulties in applying deep learning to video by using two parallel convolutional neural networks that mimic the separation of the human visual cortex into the ventral stream (which recognizes objects) and the dorsal stream (which recognizes motion.)

    The temporal stream convnet uses multi-frame optical flow fields as input, which seems to go against the typical, fully-learned philosophy of deep learning. Perhaps using optical flow is not as "bad" as using hand-engineered features since its goal is an objective, physical quantity (displacement vectors of densely sampled points.) Playing devil's advocate, though, I do wonder whether the assumptions required for the optical flow estimation (intensity/gradient constancy, displacement field smoothness) or its JPEG compression introduce any biases in the network. (Though I assume they would be less significant than those introduced using something like SIFT features in a shallow model.)

    The comparisons of multiple representations for optical flow were interesting, though not particularly enlightening due to similar performance, especially when comparing regular optical flow stacking with dense trajectory stacking (which one might expect to result in a more significant difference in performance.)

    Recurrent networks also seem like a natural fit for video (more so than caption generation). Do they result in better performance than this?

  10. This comment has been removed by the author.

  11. I read the paper "Two-Stream Convolutional Networks for Action Recognition in Videos", and found their method of decoupling temporal and spatial information very interesting. To second Calvin's comment above, I don't completely agree that they are going against the end-to-end learning philosophy of deep learning. They use optical flow to capture the temporal information, which is not otherwise available in any format in their architecture. Hence, I see it simply as a way of inputting that information. Clearly, inputting the whole 3D (4D?) video blocks would have been better suited to the philosophy though (as Lerrel pointed out above).

    Anyway, they show almost all the experiments I was hoping for. I would be interested in knowing for which classes the difference in performance between the spatial network alone and spatial+temporal was not that significant.

  12. In the first paper I found it very interesting that it uses optical flow as an input to the neural network. I don't really like it in terms of deep learning; however, given how videos are structured, I feel that it isn't a bad approach, since it forces the network to learn more relevant things. I wonder if they could fuse the neural networks as a cascade. We see that the SVM does better than averaging by more than a minor amount, so a cascade might learn some filters that are useful when combining spatial and temporal data. It doesn't perform better than the state of the art; however, I think that can be attributed to the optical flow methods, since optical flow doesn't encode the temporal data with 100% accuracy, just a specific representation of that data. Maybe if there were a way to use the raw frames of a video it would do a lot better, perhaps in a recurrent net structure?

  13. The paper, Two-Stream Convolutional Networks for Action Recognition in Videos, is interesting. The authors utilize two streams in their deep network architecture in order to do action recognition in videos: stream 1 - spatial component stream for action recognition from individual images, and stream 2 - temporal component stream for computing displacement fields (and trajectory) through optical flow.

    I think the use of bi-directional optical flow is a plus for performing action recognition. However, I did not quite understand what the authors meant when they said that they subtract the mean vector from each displacement field. I have read prior papers which suggest that it is good for motion compensation, but it would have been helpful if the authors had elaborated on this.

    Also, I would like to second Ankit's comment when he said that it would have been better if the authors had fused their outputs from the previous layers instead of just at the end. However, I do remember Prof. Abhinav mentioning something to this effect in class when he said that pooling outputs in the earlier layers, and allowing the pooled result to be used as input for the next few layers somewhat decreases the performance since we are losing a lot of general information by following that approach.
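On the mean-subtraction question raised in this comment, a minimal sketch of one plausible reading: the average displacement over a flow field approximates global (camera) motion, so subtracting it leaves mostly the motion of the scene relative to the camera. The function name `subtract_mean_flow` and the toy flow field are assumptions made for illustration.

```python
import numpy as np

def subtract_mean_flow(flow):
    """Remove the mean displacement vector from one flow field.

    flow: array of shape (H, W, 2) holding (d_x, d_y) per pixel.
    """
    mean_vector = flow.mean(axis=(0, 1), keepdims=True)  # shape (1, 1, 2)
    return flow - mean_vector

# Toy example: a uniform camera pan of (3, -1) pixels everywhere, plus one
# pixel that additionally moves 16 pixels to the right.
flow = np.zeros((4, 4, 2))
flow[..., 0] = 3.0
flow[..., 1] = -1.0
flow[0, 0, 0] += 16.0
compensated = subtract_mean_flow(flow)
print(compensated[1, 1], compensated[0, 0])
```

In this toy case the pan is removed up to the small offset contributed by the moving pixel, while that pixel's own motion still stands out clearly.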

  14. The paper Large-scale Video Classification with Convolutional Neural Networks by Karpathy et al. describes a unique method of classifying video files using convolutional neural networks. Specifically, the authors exploit the temporal information of consecutive frames in video sequences in addition to the images themselves. To facilitate more efficient classification, inputs are processed at two different scales. The authors run experiments that show the validity of their approach. Another major contribution by the authors is the release of a dataset consisting of 1 million videos of 487 classes of sports actions, called the Sports-1M dataset. From a real-time performance perspective, it would be nice to see some benchmarks of how long the network takes to evaluate a video. Even with the two-scale optimization described above, evaluating a single video has the potential to be computationally expensive, requiring the sampling of 20 clips and the evaluation of each clip 4 times (with different crops and flips).
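A minimal sketch of the evaluation cost described in this comment, assuming the per-clip, per-view class probabilities are simply averaged into one video-level prediction. The function name `video_prediction` and the random probabilities standing in for network outputs are assumptions made for illustration.

```python
import numpy as np

def video_prediction(clip_view_probs):
    """Average class probabilities over all clips and all views of each clip.

    clip_view_probs: array of shape (num_clips, num_views, num_classes).
    Returns the video-level probabilities and the number of forward passes
    that were needed to produce the input.
    """
    probs = np.asarray(clip_view_probs)
    num_clips, num_views, _ = probs.shape
    forward_passes = num_clips * num_views
    return probs.mean(axis=(0, 1)), forward_passes

# 20 clips x 4 crops/flips x 487 classes of made-up probabilities.
rng = np.random.default_rng(0)
fake_probs = rng.dirichlet(np.ones(487), size=(20, 4))
video_probs, passes = video_prediction(fake_probs)
print(video_probs.shape, passes)
```

With 20 clips and 4 views per clip, one video costs 80 forward passes, which is why per-video timing numbers would be useful.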

  15. The authors of "Two-Stream Convolutional Networks for Action Recognition in Videos" have used a very interesting technique to encode temporal information in the input to the CNN. They have encoded optical flow in the hope that it completely captures the temporal information of actions between frames. While the results are promising, I feel that the fact that they have introduced a manual artifact like optical flow goes against the goodness of CNNs. Ideally, the network should be able to extract this information on its own in some layer. I am of the opinion that a network that can extract optical flow information in some intermediate layer can perform better than the proposed network. Therefore, I am not surprised that there is only a marginal improvement in the performance when compared to single-frame models (unlike how the authors seem surprised).

    On the other hand, using this simplified input scheme allows for much faster training if we consider the time taken to train using a single frame.

  16. I read the paper "Two-Stream Convolutional Networks for Action Recognition in Videos". The authors present a parallel spatio-temporal network architecture for action recognition in videos. Optical flow is used to encode the temporal information between frames. While the paper is well written and shows extensive evaluation against state-of-the-art performance, it is unclear to me why we need an entire parallel stack. I would be interested in seeing how the results degrade (or not) if we just use optical flow as an input feature layer along with still image frames.

  17. "Two-Stream Convolutional Networks for Action Recognition in Videos"

    I like the idea of using the temporal information of video streams. The parallel CNN architecture using both spatial and temporal features is also interesting.

    However, is optical flow a valid way to represent and cover temporal information? It places many restrictions on the training frames. The performance may be acceptable on the chosen dataset, but it would be hard to extend to arbitrary datasets. Also, we could consider merging the training results midway through the spatial/temporal training process.

  18. I read the paper "Two-Stream Convolutional Networks for Action Recognition in Videos".

    To use the temporal information in the video for action recognition, the authors propose to use a "descriptor" for the motion: optical flow or trajectories. With this idea, the temporal and appearance information of the video are decoupled, and it becomes possible to use a pre-trained model for the appearance part, leveraging the large-scale databases available for image recognition. This is one of the advantages of the decoupling. But I think the disadvantage here is the raw representation of the motion: is optical flow the best raw representation of motion for action recognition, the counterpart of pixel values in an image? After all, optical flow itself is not 100% correct and suffers from occlusion, lack of texture, large motion, etc. So the question is: can we learn the motion descriptor directly from the raw video rather than "transforming" it into an optical flow volume first? Also, I think it would be interesting if we could extract the key frames from the video, i.e. the frames that are most informative for action recognition.