FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding
For slides and more information on the paper, visit https://ai.science/e/fine-gym-fine-gym-a-dataset-for-fine-grained-video-action-understanding-and-our-experience-of-building-a-high-quality-dataset--rS08FM6kLXXd0MLjTBpm
Discussion lead: Dian Shao (PhD Candidate, CUHK)
Discussion moderator: Xiyang Chen (CTO, Aggregate Intellect)
We will be hosting another livestream session featuring Dian Shao from CUHK, who will speak about her team's latest work, FineGym, a fine-grained action understanding dataset that received three "strongly accept" scores at CVPR this year. Dian will also share their experience and lessons learned from building a high-quality dataset.
Join us live here: https://ai.science/e/fine-gym-fine-gym-a-dataset-for-fine-grained-video-action-understanding-and-our-experience-of-building-a-high-quality-dataset--rS08FM6kLXXd0MLjTBpm
Link to the paper's homepage: https://sdolivia.github.io/FineGym/
What will be discussed?
- Introduction to FineGym
- Why it is important to go fine-grained for action understanding tasks
- Lessons learned from creating a high-quality dataset
- How to strike a balance between accuracy and efficiency on subtly different actions
- How to model complex temporal dynamics efficiently, effectively, and robustly
- Future work in action understanding
Abstract
On public benchmarks, current action recognition techniques have achieved great success. However, when used in real-world applications, e.g., sports analysis, which require the capability of parsing an activity into phases and differentiating between subtly different actions, their performance remains far from satisfactory. To take action recognition to a new level, we develop FineGym, a new dataset built on top of gymnastics videos. Compared to existing action recognition datasets, FineGym is distinguished by its richness, quality, and diversity. In particular, it provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy. For example, a "balance beam" event will be annotated as a sequence of elementary sub-actions derived from five sets: "leap-jump-hop", "beam-turns", "flight-salto", "flight-handspring", and "dismount", where the sub-action in each set will be further annotated with finely defined class labels. This new level of granularity presents significant challenges for action recognition, e.g., how to parse the temporal structures of a coherent action, and how to distinguish between subtly different action classes. We systematically investigate representative methods on this dataset and obtain a number of interesting findings. We hope this dataset can advance research towards action understanding.
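To make the three-level hierarchy concrete, here is a minimal sketch of how such an annotation could be represented in Python. The field names, timestamps, and element labels are illustrative assumptions rather than FineGym's actual annotation format; only the set names come from the abstract above.

```python
# Hypothetical annotation record for one "balance beam" event.
# Field names, timestamps, and element labels are made up for illustration;
# the set names ("leap-jump-hop", "dismount") are quoted from the abstract.
balance_beam_event = {
    "event": "balance beam",
    "start": 12.4,            # event-level temporal boundary (seconds), illustrative
    "end": 101.7,
    "segments": [             # sub-action level: an ordered sequence of elements
        {
            "set": "leap-jump-hop",                 # middle level: element set
            "element": "split leap with 180 turn",  # finest level: class label (example)
            "start": 15.2,
            "end": 16.0,
        },
        {
            "set": "dismount",
            "element": "double salto backward tucked",  # example label
            "start": 98.1,
            "end": 99.4,
        },
    ],
}

# Walk the hierarchy, e.g. to collect per-set element labels.
for seg in balance_beam_event["segments"]:
    print(seg["set"], "->", seg["element"])
```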