This blog post is written by AI CDT student Isabella Degen
In the rapidly evolving field of computer vision and artificial intelligence, action recognition in videos remains a challenging yet crucial task. A recent talk at BIAS24 by Dr. Davide Moltisanti from the University of Bath shed light on some often-overlooked aspects of this problem, particularly the impact that semantic and temporal ambiguity in video labelling has on classification. This blog post delves into the key insights from Dr. Moltisanti’s presentation, exploring the challenges faced by current models and the innovative solutions proposed to address them.
The Challenge of Action Recognition
At its core, action recognition aims to identify actions occurring in video sequences. While this may seem straightforward, the reality is far more complex. Traditional approaches rely heavily on supervised learning, where models are trained on labelled datasets. However, as Dr. Moltisanti pointed out, we often take these labels for granted without considering the inherent ambiguities in the labelling process itself. The research presented by Dr. Moltisanti explored the impact such ambiguity has on the training and testing of models and suggested solutions to improve the accuracy of action classification in videos.
Semantic Ambiguity: When Words Fail Us
One of the primary issues highlighted in the talk is semantic ambiguity. This occurs when multiple verbs can describe similar motions or when the same verb can represent different actions. For example, “push drawer” and “close drawer” might refer to the same action, while “cut” could describe various activities depending on the context. When annotators label videos, this leads to wide variability in the labels they use.
This ambiguity poses a significant challenge for classifiers, which struggle to handle class overlap effectively. Dr. Moltisanti proposed an innovative solution: pseudo-labels, generated by identifying actions that lie close to a given verb in feature space. For example, “cut” might be associated with “chop” or “slice”.
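As a rough illustration of how such pseudo-label sets might be built, one could threshold the pairwise cosine similarity between per-class feature embeddings. This is a minimal sketch under assumed inputs (the embedding source and the `threshold` value are illustrative, not taken from the talk):

```python
import numpy as np

def pseudo_label_sets(verb_embeddings, threshold=0.8):
    """Group verbs whose feature embeddings are close as mutual pseudo-labels.

    verb_embeddings: (num_classes, dim) array, e.g. an averaged video
    feature per verb class (an assumption made for this sketch).
    """
    # Cosine similarity between every pair of verb-class embeddings.
    normed = verb_embeddings / np.linalg.norm(verb_embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    # Any verb at least `threshold`-similar to class c becomes a
    # pseudo-label for c (c itself is included, since sim[c, c] == 1).
    return {c: set(np.flatnonzero(sim[c] >= threshold)) for c in range(len(sim))}
```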
Two approaches to using these pseudo-labels were tested:
- Masking pseudo-labels during training to weaken the loss function
- Using pseudo-labels as actual labels
Interestingly, the masking approach proved more effective, with both methods outperforming existing benchmarks on the EPIC Kitchens dataset. An ablation study further revealed that instance-level pseudo-labels were more beneficial than class-level ones, highlighting the importance of fine-grained action understanding.
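To make the masking idea concrete, here is a minimal sketch of one plausible implementation in PyTorch. Logits for a sample’s pseudo-label classes are suppressed so the loss no longer penalises confusion among semantically overlapping verbs; the function name and the exact masking scheme are assumptions, not the authors’ code:

```python
import torch.nn.functional as F

def masked_cross_entropy(logits, targets, pseudo_label_sets):
    """Cross-entropy that ignores each sample's pseudo-label classes.

    logits:  (batch, num_classes) raw classifier outputs
    targets: (batch,) ground-truth verb class indices
    pseudo_label_sets: dict mapping a class index to the set of class
        indices judged semantically similar (e.g. cut -> {chop, slice})
    """
    masked = logits.clone()
    for i, t in enumerate(targets.tolist()):
        for c in pseudo_label_sets[t]:
            if c != t:
                # Push similar classes out of the softmax so the model is
                # not penalised for placing probability mass on them.
                masked[i, c] = -1e9
    return F.cross_entropy(masked, targets)
```

The second approach, using pseudo-labels as actual labels, would instead train against the whole pseudo-label set as targets, for example as a multi-label objective.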
Temporal Ambiguity: The Elusive Boundaries of Action
The second major issue addressed was temporal ambiguity: the difficulty in precisely defining the start and end points of an action. This ambiguity leads to inconsistencies across datasets and can significantly impact model performance. Dr. Moltisanti’s research showed that even minor variations in temporal boundaries could result in accuracy fluctuations of up to 10%.
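One way to probe this sensitivity (a hypothetical experiment for illustration, not the protocol from the talk) is to jitter the annotated boundaries by a bounded random offset and measure how accuracy changes on the perturbed segments:

```python
import random

def jitter_boundaries(start, end, max_shift=1.0, rng=random):
    """Perturb an action's start/end times (in seconds) by up to max_shift.

    Evaluating a model on segments jittered like this gives a rough
    measure of how sensitive its accuracy is to boundary placement.
    """
    new_start = max(0.0, start + rng.uniform(-max_shift, max_shift))
    new_end = max(new_start + 0.1, end + rng.uniform(-max_shift, max_shift))
    return new_start, new_end
```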
To address this, the team introduced the concept of “Rubicon boundaries,” drawing from psychology to define clearer action phases. By providing annotators with more precise guidelines, they achieved more consistent labelling, resulting in improved model accuracy.
Image 1 compares the original variability in labelling (left box plot for each label) with the variability achieved using Rubicon Boundaries (RB). For RB annotations the mean is closer to 1, and the variance in start and end times between different annotators is smaller. This change in labelling improved model accuracy from 61.2% to 65.6%.
Efficient Video Labelling: A Novel Approach
Recognizing the tedious and expensive nature of traditional video labelling, Dr. Moltisanti proposed an innovative method requiring only a single time point per action instead of start and end times. This approach uses a distribution around the chosen point and employs curriculum learning to gradually refine the model’s understanding of action boundaries.
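As a sketch of how a single timestamp can drive training, one can place a plateau-shaped sampling distribution over the frames around the annotated point: frames near the point are drawn with near-uniform probability, and the weight falls off smoothly further away. The shape and parameter names below are illustrative assumptions, not the exact formulation from the talk:

```python
import numpy as np

def plateau_weights(num_frames, center, width, sharpness=0.3):
    """Sampling distribution around a single annotated timestamp.

    Frames within `width` of `center` get near-uniform weight; the weight
    decays smoothly outside, so early training trusts frames close to
    the annotation.
    """
    t = np.arange(num_frames)
    # Flat top around the centre with soft sigmoid shoulders on both sides.
    w = 1.0 / ((np.exp(sharpness * (t - center - width)) + 1.0) *
               (np.exp(sharpness * (center - width - t)) + 1.0))
    return w / w.sum()

# Sample a training frame for an action annotated at frame 120 of 300.
weights = plateau_weights(num_frames=300, center=120, width=15)
frame = np.random.choice(300, p=weights)
```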
The method also incorporates a ranking system based on confidence scores to determine the best representative frames for each action. This approach proved particularly effective for shorter actions and denser datasets, offering a promising direction for more efficient video annotation.
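Continuing the sketch above (and reusing its NumPy import), the curriculum step could rank frames by classifier confidence under the current distribution and re-centre the plateau on the most confident ones, gradually sharpening the estimate of where the action occurs. The voting scheme here is an assumption:

```python
def recentre(confidences, weights, top_k=5):
    """Move the plateau centre towards confidently classified frames.

    confidences: per-frame classifier scores for the action's class
    weights:     current plateau sampling weights (same length)
    """
    # Rank frames by confidence weighted by the current distribution,
    # then let the top-k frames vote on the new centre.
    ranked = np.argsort(confidences * weights)[::-1][:top_k]
    return int(ranked.mean())
```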
Image 2 shows how, from a single timestamp for an action in a video, an initial estimate (dotted lines) is found and automatically updated (solid lines) to best capture the action in both location and duration.
The Importance of Negative Cues
An intriguing point raised towards the end of the talk was the significance of negative cues in action recognition. While most models focus on positive cues (what to look for), Dr. Moltisanti emphasized the importance of also considering what the model should not focus on, such as ethnicity or attire, to reduce bias in recognition systems.
Conclusion: Rethinking Action Recognition
Dr. Moltisanti’s talk serves as a reminder of the complexities involved in action recognition and the importance of considering the impact of data labelling. By addressing semantic and temporal ambiguities in labels, we can develop more robust and accurate models for understanding actions in videos.
As the field continues to evolve, these insights pave the way for more nuanced approaches to video understanding, potentially leading to breakthroughs in applications ranging from surveillance and security to human-computer interaction and automated video analysis.
The research presented not only offers practical solutions to current challenges but also encourages us to think more critically about the fundamental processes underlying our AI systems.
*Written with the help of Anthropic’s Claude 3.5 Sonnet