Temporal Segmentation of Fine-grained Semantic Action: A Motion-Centered Figure Skating Dataset

The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Shenglan Liu1, Aibin Zhang1,*, Yunheng Li1,*, Jian Zhou1, Li Xu2, Zhuben Dong1, Renhao Zhang1
1 Dalian University of Technology, Dalian, Liaoning, 116024 China
2 Alibaba Group
liusl@dlut.edu.cn, renwei.xl@alibaba-inc.com
* Equal contribution.
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Temporal Action Segmentation (TAS) has achieved great success in many fields such as exercise rehabilitation, movie editing, etc. Currently, task-driven TAS is a central topic in human action analysis. However, motion-centered TAS, an important topic in its own right, has received little research attention due to the lack of suitable datasets. In order to explore more models and practical applications of motion-centered TAS, we introduce a Motion-Centered Figure Skating (MCFS) dataset in this paper. Compared with existing temporal action segmentation datasets, the MCFS dataset is fine-grained in semantics, specialized and motion-centered. Besides, RGB-based and Skeleton-based features are provided in the MCFS dataset. Experimental results show that existing state-of-the-art methods struggle to achieve excellent segmentation results (in terms of accuracy, edit and F1 score) on the MCFS dataset. This indicates that MCFS is a challenging dataset for motion-centered TAS. The latest dataset can be downloaded at https://shenglanliu.github.io/mcfs-dataset/.

Introduction

Temporal action segmentation (TAS) has been widely used in sports competitions (Urban and Russell 2003), exercise rehabilitation (Lin and Kulić 2013), movie editing (Magliano and Zacks 2011) and other fields. Technically, TAS has been extended to many new topics, such as video action localization (Lee, Uh, and Byun 2020) and moment retrieval (Zhang et al. 2019). In recent years, TAS has made remarkable progress on task-based video, especially in designing new temporal convolutional networks (TCN), e.g. Encoder-Decoder TCN (ED-TCN) (Lea et al. 2017), Multi-Stage Temporal Convolutional Network (MS-TCN) (Farha and Gall 2019) and Self-Supervised Temporal Domain Adaptation (SSTDA) (Chen et al. 2020a), which achieve high performance on cooking task datasets such as GTEA (Fathi, Ren, and Rehg 2011), 50Salads (Stein and Mckenna 2013) and Breakfast (Kuehne, Arslan, and Serre 2014).

However, the existing datasets have three limitations for TAS research, which can be summarized as follows.

Coarse-grained semantics. Coarse-grained TAS is relatively easy for existing models. However, it is difficult to meet the needs of applications requiring fine-grained semantics (Sun et al. 2015; Piergiovanni and Ryoo 2018), which are more challenging for frame-level action classification.

Spatial characteristics. In most TAS datasets, scene, tools and objects (sometimes even more important than the action itself) play very important roles in human action recognition. However, we should pay more attention to the action itself in many practical applications (Bhattacharya et al. 2020; Li et al. 2019). Besides, task-driven datasets cannot show the full human body in an expected manner. Therefore, it is difficult to extract additional modal features to perform TAS tasks.

Temporal characteristics. Generally, the action content categories of task-driven TAS datasets are simple, and the speed difference between distinct actions is too small. Such small speed variance hardly causes frame-level feature changes, which makes these datasets less challenging for TAS. (Temporal characteristics mainly refer to the spatial position and the temporal change of the action sequence; in addition, they include certain statistical characteristics of actions, such as the variance of action duration and the variance of action speed.)
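As a concrete illustration of these statistical temporal characteristics, the following minimal sketch computes per-class segment durations and their variance from a frame-level label sequence. The label sequence, class names and frame rate used here are hypothetical and only serve to show the computation.

```python
import numpy as np

def segments_from_frame_labels(labels):
    """Collapse a frame-level label sequence into (label, length_in_frames) runs."""
    segments, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((labels[start], t - start))
            start = t
    return segments

# Hypothetical annotation at 30 fps: a short jump followed by a long step sequence.
fps = 30
frame_labels = ["NONE"] * 90 + ["Jump"] * 60 + ["StepSequence"] * 900 + ["NONE"] * 30

durations = {}
for label, length in segments_from_frame_labels(frame_labels):
    durations.setdefault(label, []).append(length / fps)  # seconds

for label, secs in durations.items():
    print(f"{label}: mean {np.mean(secs):.1f}s, variance {np.var(secs):.2f}")
```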
The issues above limit the broader research of TAS models. In order to facilitate new methods for motion-centered TAS, this paper proposes a new dataset named MCFS. MCFS is composed of 271 single figure skating performance videos. The videos are taken from 17.3 hours of competition footage of the 2017-2019 World Figure Skating Championships. Each clip is 30 frames per second, with a resolution of 1080×720 and a length of 162s to 285s. All actions are annotated with semantic labels on three levels (see Fig. 1). The camera focuses on the skater to ensure that he or she appears in every frame during the action. Compared with existing datasets, MCFS has five remarkable advantages, listed as follows.

Multi-level fine-grained semantics. All annotations are carried out at three levels, namely set, subset and element. Fine-grained semantics means that similar actions may have different labels because of the motion-centered characteristics of figure skating (e.g. Lutz and Flip jumps are similar in terms of motion, but are two different jumps). Such a semantic hierarchy provides a distinct structure for comprehending coarse-grained and fine-grained operations.

Multi-modal action features. Previous datasets only offer features based on RGB video content, such as Flow and I3D (Carreira and Zisserman 2017), while MCFS additionally provides a Skeleton feature (Cao et al. 2017), which offers new opportunities and is significant for TAS methodological research.

Figure 1: A video in MCFS. The labels of this video belong to the subset level.

Motion-centered human actions. All actions are independent of scenes and objects in the MCFS dataset (i.e. most action classes are dominantly determined by the skater's pose).

Large variance of action speed & duration. In MCFS, the action content is complicated, and the speed difference between distinct actions is very large. For instance, one jumping action is completed within about 2s; by contrast, the longest step sequence can reach 72s. The large speed variance also produces large variance in action duration across different actions, which can be regarded as a great challenge for frame-based TAS.

Specialization. All videos in MCFS are high-resolution recordings taken from the World Figure Skating Championships. Moreover, professional quality control is carried out on the full sequence of video annotations to guarantee the correctness, reliability and consistency of the annotations.

According to the characteristics of MCFS, a series of empirical studies are conducted to discover the challenges of motion-centered TAS. Specifically, we first tested various TAS techniques and observed that the performance of these methods is far from satisfactory on high-speed motion TAS. In order to assist future research, we also reviewed some modeling options, such as input data patterns. We found that for the fine-grained TAS task, 1) motion information plays a very important role, rather than the scene and objects of the video content; 2) fine-grained categories are more likely to increase frame-based misjudgments (the "burr" phenomenon) in clip action decisions, which may become a new challenge for existing TAS models; 3) the input modality of the TAS model is very important, and new input modalities (e.g. Skeleton) will open a new branch of TAS research (e.g. GNN-based (Scarselli et al. 2008) TAS approaches).

Taken together, the contributions of this work to the study of the TAS task can be summarized in the following two aspects:

(1) The MCFS dataset we collected is the first challenging dataset for the TAS task with large action speed and duration variance and complex motion-centered actions. It provides high-quality, fine-grained annotations of full sequences; the annotations are divided into three semantic levels, namely set, subset and element.

(2) We make an in-depth study of MCFS, explore optional multi-modal features as input data for TAS models, and reveal the major challenges for future research and potential applications for high-speed-variance motion tasks.
Related Work

Methods for TAS

Unsupervised Learning Approaches. For the unsupervised TAS task, the major technique is to exploit discriminative information by clustering spatio-temporal features. Such models introduce temporal consistency into TAS methods by using LSTMs (Bhatnagar et al. 2017) or the generalized Mallows model (Sener and Yao 2018). Kukleva et al. (Kukleva et al. 2019) utilized both frame-wise clustering and video-wise clustering to model a bag-of-words representation of each video. Besides, in order to fully use the contextual events in videos, Garcia et al. (Garcia del Molino, Lim, and Tan 2018) proposed an LSTM-based generative network for solving the TAS task. For dynamic TAS, Aakur et al. (Aakur and Sarkar 2019) proposed a self-supervised and predictive learning framework that uses features of adjacent frames in the loss function. As another efficient dynamic TAS approach without training or clustering, MWS only involves the curvature of action features in a neighborhood to locate the segmentation points of a clip.

Weakly Supervised Approaches. The key idea of weakly supervised TAS is to mitigate the dependence on direct labeling by using indirect supervision to achieve high TAS performance. For order-level weakly supervised TAS, Ding et al. (Ding and Xu 2018) proposed a temporal autoencoder to predict frame-by-frame labels, combined with soft boundary assignment to iteratively optimize the segmentation results. To further explore the temporal structure, Kuehne et al. (Kuehne, Arslan, and Serre 2014) used a "task graph" for order description and developed a hierarchical model based on HMMs for task-driven TAS. For online TAS, Richard et al. (Richard et al. 2018) used a Viterbi-based loss to build a new deep model that achieves the frame-wise TAS goal. Recently, an order-free TAS method (Richard, Kuehne, and Gall 2018) based on a probabilistic model was proposed for set-level weakly supervised TAS.

Fully Supervised Approaches. Fully supervised TAS aims to segment the video into semantically consistent "blocks". A large amount of related work has explored fully supervised TAS. Most supervised TAS models adopt an autoencoder architecture to preserve temporal consistency between input and output. For example, Lea et al. (Lea et al. 2017) proposed a temporal convolutional network for TAS, which utilized dilated convolutions to improve the pooling and upsampling process. In (Li et al. 2020), Farha et al. proposed a multi-stage structure combined with a smoothing loss for TAS, which also involves an autoencoder network. Lei (Lei and Todorovic 2018) developed a temporal deformable residual network using deformable temporal convolutions to enhance TAS performance. Yet these methods suffer from long training times and unsatisfactory segmentation accuracy, which might be explained by the model architecture.

TAS-related Datasets

TAS-related datasets cover both TAS and action localization. Action localization aims to localize the temporal intervals of query actions, for example FineGym (Shao et al. 2020), while TAS intends to divide a video into independent actions at the frame level. We focus on TAS in this paper. Among the early datasets, GTEA (Fathi, Ren, and Rehg 2011) and 50Salads (Stein and Mckenna 2013), which are based on coarse-grained cooking tasks only, have fewer than 20 action categories (11 and 19 categories, respectively), while the surgical activity dataset JIGSAWS (Gao et al. 2014) only consists of 3 categories. Existing methods can achieve good performance on these datasets, which are limited by the number of categories and video duration. Later, many datasets were improved in terms of video duration, action categories, body motion, and fine-grained semantics (including temporally fine-grained units and semantically fine-grained classes). All the above improvements make TAS-related datasets more challenging and practical. For TAS with body motion, the Ikea-FA (Toyer et al. 2017) and MPII (Schiele et al. 2012) datasets capture upper (partially occluded) body motion, due to the particularity of furniture assembly and cooking tasks. Recently, most datasets have focused on finer determination of action boundaries, especially on temporally fine-grained action units. Breakfast (Kuehne, Arslan, and Serre 2014) constructs an order graph and unit descriptions; EPIC-KITCHENS (Damen et al. 2018), which introduces visual object detection to form temporally fine-grained action units, is a large-scale cooking TAS dataset. Actually, the above two tasks can be regarded as a procedure segmentation task in cooking, which is proposed in the YouCook2 (Zhou, Xu, and Corso 2018) dataset. YouCook2 provides not only temporal locations, but also sentence descriptions of the actions. In addition, a tool-object fine-grained semantic class is offered in the MPII dataset. However, most of the existing datasets are based on tool-object content in TAS-related tasks, and it is impractical to extract Skeleton features without full-body motion. Besides, the lack of characters, fine-grained semantics and categories in the existing datasets also limits the development of the TAS task. Table 1 shows the development of TAS-related datasets in the past decade. These datasets, where the action segmentation is based more on hands, tools and objects, are mainly task-driven. These reasons have hindered the development of TAS methods based on human motion. MCFS makes up for the shortcomings of the existing TAS datasets and promotes the discovery of new problems in TAS tasks. We believe MCFS will be a new challenging dataset for motion-centered TAS.
Table 1: Comparison of the attributes of existing datasets. CA: cooking activities, SA: surgical activities, AF: assembling furniture, FS: figure skating.

Dataset         Duration  People  Segments  Task  Classes  RGB  Skeleton  Fine-grained       Year
GTEA            0.57h     4       922       CA    11       yes  no        no                 2011
MPII            9.8h      12      5609      CA    -        yes  no        yes (Semantics)    2012
50Salads        5.3h      27      966       CA    19       yes  no        no                 2013
JIGSAWS         2.6h      -       1703      SA    3        yes  no        no                 2014
Breakfast       77h       52      8456      CA    48       yes  no        yes (Temporally)   2014
Ikea-FA         3.9h      32      -         AF    5        yes  no        no                 2017
EPIC-KITCHENS   55h       32      39596     CA    5        yes  no        no                 2018
MCFS (ours)     17.3h     186     11656     FS    130      yes  yes       yes (Semantics)    2021

The MCFS Dataset

MCFS aims to be a motion-centered dataset for the TAS task, which can better support the development of new TAS models. In this section, we introduce the challenges of category definition, data annotation and quality control in MCFS. Moreover, the detailed construction process, including data preparation and annotation details, is introduced. Finally, we show that MCFS exhibits richer statistical and physical motion characteristics, which makes it competitive with existing datasets.

Key Challenges

There are a series of challenges during the data collection procedure, due to the top-level professionalism and complexity of figure skating. Firstly, as figure skating is a highly professional sport, it is impractical to define categories manually. Fortunately, the exactitude of the labels can be ensured under the guidance of the official technical documents of figure skating. Secondly, it is difficult for annotators to determine the category and boundary of actions, since the actions are fast and highly similar to other actions within the same subset. To address this issue, the annotators are trained with the necessary specialized knowledge by figure skating professionals. Besides, the annotation is guided by the labels in the original video sequences (the upper-left label in Fig. 2). Based on the above criteria, the labeling and division in MCFS are more convincing. Nevertheless, we still enforce a series of measures to ensure high-quality dataset labels, as described in the next part.
Figure 2: Label variation in the original video sequences. The upper-left label in the original video sequences changes when the action changes.

Figure 3: The three-level semantic annotation structure, with 4 sets (e.g. Spin), 22 subsets (e.g. CamelSpin) and 130 elements (e.g. CamelSpin3).

Dataset Construction

Data Preparation. For MCFS, we collect 38 official videos of 186 skaters from the 2017 to 2019 World Figure Skating Championships. With a sufficient number of skaters, the complexity of the data distribution within the same set can be ensured. Each video sequence is of high resolution and frame rate to preserve the integrity of actions. Besides, duplicate videos are removed through manual checking. In addition, we provide I3D and Skeleton features for subsequent experiments.

Annotation Collection. In MCFS, we apply three-level semantic annotations and collect 4 sets, 22 subsets and 130 elements at the respective annotation levels. The MCFS structure is shown in Fig. 3. For example, the set "Spin" can be expressed as Spin = {ChCombospin, CamelSpin, Laybackspin, Sitspin, ChSitSpin, ChCamelspin} with 6 subsets. The elements in each subset are further annotated with defined element labels. Such a semantic hierarchy provides a distinct structure for comprehending coarse-grained and fine-grained operations. The following requirements are observed throughout the annotation process. First, it is necessary to refer to the real-time labels provided in the original videos to determine the start frame and the end frame; meanwhile, all incomplete video clips are removed in this process. Second, according to the element-level action category, the official manual and categorization structure are consulted to classify the elements into subsets and sets.
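To make the set → subset → element hierarchy concrete, a minimal sketch of how such labels could be organized is shown below. Only the set name "Spin", its six subsets and the element "CamelSpin3" are taken from the paper; the remaining set names and element names are hypothetical placeholders.

```python
# Sketch of MCFS's three-level label hierarchy (set -> subset -> element).
# Only "Spin", its six subsets and the element "CamelSpin3" come from the paper;
# other names below are hypothetical placeholders for illustration.
mcfs_hierarchy = {
    "Spin": {
        "CamelSpin": ["CamelSpin2", "CamelSpin3"],
        "ChCombospin": [], "Laybackspin": [], "Sitspin": [],
        "ChSitSpin": [], "ChCamelspin": [],
    },
    "Jump": {},       # remaining sets and subsets omitted for brevity
    "Sequence": {},
    "NONE": {},
}

def coarsen(element, hierarchy=mcfs_hierarchy):
    """Map an element-level label back to its (set, subset) labels."""
    for set_name, subsets in hierarchy.items():
        for subset_name, elements in subsets.items():
            if element in elements:
                return set_name, subset_name
    return None

print(coarsen("CamelSpin3"))  # -> ("Spin", "CamelSpin")
```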
Annotation Tool. As the segments vary in length and content, the workload of annotating MCFS with a conventional annotation tool would be crushing. In order to improve annotation efficiency, we develop a new tool that previews the two frames before and after the current frame. In addition, with this tool, the start and end frames can be selected directly while updating the category manually.

Quality Control. We annotate all frame-level fine-grained action categories and temporal segmentation boundaries in MCFS. To assure the quality of the MCFS dataset, a series of control mechanisms are adopted, including: 1) training annotators with professional knowledge; 2) providing reference documents and sample videos; 3) strictly testing each annotator's labeling proficiency before formal annotation; 4) reviewing the annotated videos.

Statistics

The MCFS dataset consists of 271 samples captured from 38 competition videos, comprising more than 1.73 million frames. All annotations are carried out at three levels, and the number of categories at each level is 4, 22 and 130, respectively. 93 out of the 130 elements have at least two samples, and the samples present a natural heavy-tailed distribution. Excluding the "NONE" category, we annotate 2,995 effective clips across all samples. The distribution of video durations is shown in Fig. 4 (a). The total video length is 15.9 hours, with an average duration of 212s per video. All videos remain untrimmed and can be up to 300s long. The distribution of segment durations is shown in Fig. 4 (b), with a mean and standard deviation of 9.4s and 8.3s, respectively. The longest segment lasts 72s and the shortest one lasts 1s. The large range of sample durations is a challenge for the TAS task.

Figure 4: MCFS-22 dataset duration statistics.

Dataset Properties

Motion-centered Human Actions. For most TAS datasets, many factors such as hands, tools and objects can affect the results. For example, the action and scene are the same in the 50Salads dataset for "cut-tomato" and "cut-cucumber"; the discriminative information only depends on the object in the hand. By contrast, all samples have a relatively consistent scene in the MCFS dataset. Discrimination of the action category and boundary must rely solely on the human body motion, which is the challenge that new TAS methods need to meet.

Multi-modal Action Features. We extract not only Flow and I3D features but also two Skeleton features (2D locations of 18 and 25 major body joints) based on the RGB video in the MCFS dataset. Fig. 2 shows the Skeleton features. Skeleton features offer a new opportunity for combining GCNs (Huang, Sugano, and Sato 2020) with TAS methods. They may also open a new direction for multi-modal temporal human action segmentation. We hope the MCFS dataset can promote machine learning research on temporal segmentation of human actions.

Large Variance of Action Speed and Duration. For sports datasets like MCFS, different actions have large variance in speed and duration. For example, a jump is generally completed in 2-3s, but the longest sequence can be more than 70s. We have calculated the nearest-neighbor variance of I3D features over 21 frames for four datasets, as shown in Fig. 5. It can be clearly seen that, compared to the other three kitchen datasets, there are dramatic changes between different actions in the MCFS dataset. This brings great challenges to the division of boundaries.

Figure 5: The nearest-neighbor variance results of the GTEA, 50Salads, Breakfast and MCFS-22 datasets.
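The paper does not spell out the "nearest-neighbor variance" computation in code; the sketch below is one plausible reading, assuming it means the variance of per-frame I3D features within a centered 21-frame window, averaged over feature dimensions. The data here is random and purely illustrative.

```python
import numpy as np

def neighborhood_variance(features, window=21):
    """Per-frame variance of features over a centered temporal window.

    features: (T, D) array of per-frame descriptors (e.g. 2048-dim I3D).
    Returns a (T,) curve; large values indicate rapid feature change, e.g. near
    action boundaries or during fast motion.
    """
    half = window // 2
    T = features.shape[0]
    curve = np.zeros(T)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        curve[t] = features[lo:hi].var(axis=0).mean()
    return curve

# Illustrative usage on random data: 900 frames of 2048-dim I3D features.
i3d = np.random.randn(900, 2048).astype(np.float32)
print(neighborhood_variance(i3d, window=21).shape)  # (900,)
```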
High Similarity of Categories. In the MCFS dataset, two samples of different categories may differ only in a few frames, called key frames. For example, among single jumps, "Lutz" and "Flip" are performed basically the same way, except for the difference between the inside and outside edge of the ice skate blade. For continuous jumps like "3Lutz 3Loop" and "3Lutz 3Toeloop", the first jump is exactly the same, and the difference only lies in the subsequent jump. Such subtle differences can easily make a model misjudge the category and segmentation point of actions. Meanwhile, because similar frames may appear in different actions, multi-semantic frames become another inevitable problem.

Experiments

In this section, the experimental setup for the TAS task is first introduced. Then, we report experimental results on benchmark datasets (50Salads, GTEA and Breakfast) and list baseline results of state-of-the-art TAS methods on the MCFS dataset. In addition, the characteristics of MCFS are discussed based on the experimental results.

Experimental Setup

Data. MCFS is randomly split into 189 and 82 videos for training and testing, respectively. We then use 5-fold cross validation to assess the generalization of the models. MCFS-4, MCFS-22 and MCFS-130 share the same splits, but are annotated with the three different hierarchical semantic labels (set, subset, element), respectively, as introduced in the section "The MCFS Dataset".

I3D Feature Based on RGB. For each frame, a 2048-dimensional I3D feature vector, pretrained on Kinetics (Kay et al. 2017), is obtained by concatenating the vectors from the RGB and flow streams. Specifically, the temporal window of I3D for a frame consists of its 20 temporally nearest neighboring frames (21 frames altogether). More details can be found in (Carreira and Zisserman 2017).

Skeleton Feature. On MCFS, we use the 2D pose estimation results from the OpenPose (Cao et al. 2017) toolbox, which outputs 18 joints and 25 joints. The joints of the Skeleton feature are normalized by dividing the two spatial coordinates of each joint by the corresponding frame size, and then centralized with respect to the waist joint (center joint). All our experiments utilize the 25-joint Skeleton feature.
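A minimal sketch of this Skeleton preprocessing is given below, assuming an OpenPose output of shape (T, 25, 2) in pixel coordinates and taking joint index 8 (MidHip in OpenPose's BODY_25 layout) as the "waist joint"; both assumptions are ours, not specified in the paper.

```python
import numpy as np

def normalize_skeleton(joints, frame_w, frame_h, center_joint=8):
    """Normalize 2D pose sequences in the spirit of the MCFS preprocessing.

    joints: (T, J, 2) array of pixel coordinates from OpenPose (J = 25 assumed).
    Step 1: divide x by the frame width and y by the frame height.
    Step 2: centralize each frame on the waist/center joint (index 8 = MidHip
            in OpenPose's BODY_25 layout, used here as an assumption).
    """
    joints = joints.astype(np.float32).copy()
    joints[..., 0] /= frame_w
    joints[..., 1] /= frame_h
    center = joints[:, center_joint:center_joint + 1, :]  # (T, 1, 2)
    return joints - center

# Illustrative usage: 300 frames, 25 joints, 1080x720 video.
pose = np.random.rand(300, 25, 2) * np.array([1080.0, 720.0])
print(normalize_skeleton(pose, frame_w=1080, frame_h=720).shape)  # (300, 25, 2)
```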
Table 2: Comparison with the state of the art on the 50Salads, GTEA and Breakfast datasets (all numbers are taken from (Farha and Gall 2019) and (Chen et al. 2020a)).

50Salads     F1@10  F1@25  F1@50  Edit   Acc
  Bi-LSTM     62.6   58.3   47.0   55.6   55.7
  ED-TCN      68.0   63.9   52.6   59.8   55.7
  MS-TCN      76.3   74.0   64.5   67.9   80.7
  SSTDA       83.0   81.5   73.8   75.8   83.2
GTEA
  Bi-LSTM     66.5   59.0   43.6   -      55.5
  ED-TCN      72.2   69.3   56.0   -      64.0
  MS-TCN      85.8   83.4   69.8   79.0   76.3
  SSTDA       90.0   89.1   78.0   86.2   79.8
Breakfast
  Bi-LSTM     33.4   21.9   13.6   35.8   56.6
  ED-TCN      48.6   43.1   27.7   38.6   67.3
  MS-TCN      52.6   48.1   37.9   61.7   66.3
  SSTDA       75.0   69.1   55.2   73.7   70.2

Evaluation Metric. For evaluation, we report the frame-wise accuracy (Acc), the segmental edit distance, and the segmental F1 score at overlap thresholds of 10%, 25% and 50%, denoted F1@{10, 25, 50} (Farha and Gall 2019). The F1 score penalizes over-segmentation errors while not penalizing minor temporal shifts between the predictions and the ground truth. This is appropriate for the TAS task because it is important to avoid over-segmentation errors in video summarization; for this reason, we use the F1 score as a measure of prediction quality. A detailed description of the above evaluation metrics can be found in (Lea et al. 2017).
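For readers unfamiliar with these metrics, a compact reimplementation sketch follows. It mirrors the standard definitions from Lea et al. (2017), matching predicted and ground-truth segments of the same label by IoU, but it is an illustrative sketch rather than the authors' evaluation code.

```python
import numpy as np

def to_segments(labels):
    """Collapse frame labels into (label, start, end) segments, end exclusive."""
    segs, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segs.append((labels[start], start, t))
            start = t
    return segs

def frame_accuracy(pred, gt):
    return float(np.mean(np.array(pred) == np.array(gt)))

def segmental_f1(pred, gt, overlap=0.5):
    """F1@k: a predicted segment counts as a true positive if its IoU with an
    unmatched ground-truth segment of the same label reaches the threshold."""
    p_segs, g_segs = to_segments(pred), to_segments(gt)
    used = [False] * len(g_segs)
    tp = 0
    for lbl, s, e in p_segs:
        best, best_j = 0.0, -1
        for j, (gl, gs, ge) in enumerate(g_segs):
            if gl != lbl or used[j]:
                continue
            inter = max(0, min(e, ge) - max(s, gs))
            union = max(e, ge) - min(s, gs)
            iou = inter / union
            if iou > best:
                best, best_j = iou, j
        if best >= overlap and best_j >= 0:
            tp += 1
            used[best_j] = True
    fp = len(p_segs) - tp
    fn = len(g_segs) - tp
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)

def edit_score(pred, gt):
    """Segmental edit distance: Levenshtein distance between the two segment
    label orderings, normalized to [0, 100]."""
    a = [s[0] for s in to_segments(pred)]
    b = [s[0] for s in to_segments(gt)]
    D = np.zeros((len(a) + 1, len(b) + 1))
    D[:, 0] = np.arange(len(a) + 1)
    D[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            D[i, j] = min(D[i - 1, j] + 1, D[i, j - 1] + 1, D[i - 1, j - 1] + cost)
    return (1 - D[-1, -1] / max(len(a), len(b), 1)) * 100
```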
Baselines for Temporal Action Segmentation

In this subsection, we conduct experiments using the I3D feature on the 50Salads, GTEA and Breakfast datasets, and list the detailed results of four TAS methods, based on either TCNs (ED-TCN (Lea et al. 2017), MS-TCN (Farha and Gall 2019) and SSTDA (Chen et al. 2020a)) or LSTMs (Bi-LSTM (Graves, Fernández, and Schmidhuber 2005)), in Table 2. We show results for the two modalities (I3D and Skeleton) of MCFS, as well as for the four TAS methods, in Table 3 (we only select the two state-of-the-art models for Skeleton). The detailed experimental results illustrate three challenging properties of MCFS, as follows.

Motion-centered. Table 2 illustrates that I3D can achieve superior performance on the benchmark datasets (50Salads, GTEA and Breakfast). In particular, most metric values (accuracy, segmental edit distance and F1 score) of the SSTDA model are over 70% (only F1@25,50 on Breakfast is below 70%). This is because scene and objects, which I3D characterizes well in a video sequence, play important roles on object-based TAS datasets such as 50Salads. For example, recognizing "cut tomato" versus "cut cheese" does not hinge on the "cut" motion, but on distinguishing the different characteristics of tomato and cheese. In contrast, figure skating pays no attention to scene and objects. Specifically, the accuracies in Table 3 are generally much lower than those in Table 2 under the same experimental setup. In addition, some action categories may be confused because of the extremely high similarity of motion in MCFS. As shown in Fig. 6, "Toeloop" is wrongly recognized as "Salchow" and "Lutz 3Toeloop". The reason for this confusion is that a single jump can only be recognized from a few key frames, while a combination jump requires attention to more key frames across the two consecutive jumps. The above results illustrate that MCFS is challenging for motion-centered TAS.

Table 3: Results of representative methods on MCFS-4, MCFS-22 and MCFS-130 (set-, subset- and element-level annotations, respectively).

MCFS-4       Modality   F1@10  F1@25  F1@50  Edit   Acc
  Bi-LSTM    I3D         33.4   21.9   13.6   35.8   56.6
  ED-TCN     I3D         48.6   43.1   27.7   38.6   67.3
  MS-TCN     I3D         74.1   67.4   50.2   79.6   71.9
  MS-TCN     Skeleton    86.8   82.6   72.1   86.9   82.0
  SSTDA      I3D         75.8   69.9   52.5   82.1   71.4
  SSTDA      Skeleton    88.7   84.9   74.6   89.3   82.0
MCFS-22
  Bi-LSTM    I3D         14.8    5.9    1.5   13.6   54.3
  ED-TCN     I3D         32.3   25.7   11.6   25.6   58.8
  MS-TCN     I3D         49.4   44.1   29.8   52.6   62.6
  MS-TCN     Skeleton    74.3   69.7   59.5   74.2   75.6
  SSTDA      I3D         52.7   46.3   31.1   56.3   59.1
  SSTDA      Skeleton    76.7   72.2   61.2   77.5   75.7
MCFS-130
  Bi-LSTM    I3D          9.9    2.5    0.3    7.6   54.3
  ED-TCN     I3D         30.2   22.7   10.6   23.1   54.5
  MS-TCN     I3D         36.6   30.5   20.0   36.3   58.0
  MS-TCN     Skeleton    56.4   52.2   42.5   54.5   65.7
  SSTDA      I3D         42.6   37.3   24.6   44.4   55.1
  SSTDA      Skeleton    63.8   60.1   49.8   63.5   65.4

Figure 6: The confusion matrix of MCFS-22 (Skeleton) using MS-TCN.

Temporal Information. In the TAS task, it is very important to capture temporal dynamics. Both TCN-based and LSTM-based methods work well on the existing datasets (50Salads, GTEA, etc.), which lack complex temporal characteristics. Due to the large variance of action speed and duration in MCFS, the LSTM-based method (Bi-LSTM) suffers from vanishing gradients on long input sequences, while the TCN-based methods avoid this issue and obtain far superior performance (Table 3). Another possible reason for the above issue is the weak temporal correlation among actions in MCFS. Besides, the complex transition motions (non-regular patterns in content, duration and location) interspersed among actions also make it challenging for TCN-based networks to determine the temporal intervals of actions.

Fine-grained Semantic Labels. MCFS provides three levels of fine-grained annotations, which leads to confusion between different categories of similar actions. The fine-grained characteristics are likely to cause many over-segmentation errors in label prediction. MCFS-22 achieves better accuracy than 50Salads, as shown in Tables 2 and 3, when using ED-TCN with the same setup. However, the segmental edit distance and F1 score of MCFS-22 are much lower than those of 50Salads and GTEA. It is a serious problem that finer semantic labels lead to more over-segmentation errors in MCFS. For example, MCFS-130 performs worse than either MCFS-4 or MCFS-22 for every compared TAS method in Table 3.

Figure 7: Qualitative results for the TAS task on MCFS.

Skeleton Features of Action

As can be seen in Fig. 7 and Table 3, both the action recognition errors and the over-segmentation errors of predictions based on I3D are far more numerous than those based on Skeleton, which illustrates that MCFS depends on human motion. In addition, OpenPose can easily be used for Skeleton extraction because the skater's whole body appears in the video. The two TAS methods (MS-TCN and SSTDA) using the Skeleton feature achieve better performance than the same methods using the I3D feature. For example, on MCFS-22, the performance of SSTDA using Skeleton is 24% and 30.1% higher than that using I3D on F1@0.1 and F1@0.5, respectively.

Directions for Future Works

In human action classification, GNN-based models have developed rapidly, such as ST-GCN (Yan, Xiong, and Lin 2018), 2S-AGCN (Shi et al. 2019) and MS-G3D (Liu et al. 2020). As far as we are aware, due to the lack of Skeleton features in existing datasets, GNN-based approaches have not been used in the TAS task. MCFS could be used to develop more powerful multi-modal and Skeleton-based models by exploiting optical flow and Skeleton features in the TAS field.
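To hint at what such a GNN-based pipeline might consume, the sketch below packs an MCFS-style skeleton sequence into the (N, C, T, V, M) tensor layout commonly used by ST-GCN-family models; this layout convention is an assumption borrowed from that line of work, not something specified in the paper.

```python
import numpy as np

def skeleton_to_stgcn_input(joints):
    """Pack a (T, V, C) 2D-pose sequence into the (N, C, T, V, M) layout used by
    many ST-GCN-style skeleton models (N = batch, C = coordinates, V = joints,
    M = number of persons; a single skater here, so M = 1)."""
    T, V, C = joints.shape
    x = joints.transpose(2, 0, 1)               # (C, T, V)
    return x[np.newaxis, :, :, :, np.newaxis]   # (1, C, T, V, 1)

seq = np.random.rand(6000, 25, 2).astype(np.float32)  # ~200 s of 25-joint poses at 30 fps
print(skeleton_to_stgcn_input(seq).shape)  # (1, 2, 6000, 25, 1)
```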
Potential Applications

The high-quality data of MCFS offers a foundation for various applications. Besides fine-grained action segmentation, it enables several potential applications.

Video Description. While there has been increasing interest in describing video with natural language (Xu et al. 2016; Wang et al. 2018), current computer vision algorithms are still severely limited in the language they can associate with video. We believe MCFS can be utilized for video description because it supports building an embedding between video frames and words.

Action Reasoning. Action reasoning (Pirsiavash, Vondrick, and Torralba 2014) is an interesting problem. For example, it is straightforward to conclude that a 3Lutz-3Toeloop combination has occurred if a single 3Lutz jump and a 3Toeloop jump have been recognized. This direction provides more empirical research ideas for model design.

Video-Text Retrieval. Cross-modal retrieval between videos and texts (Chen et al. 2020b) has attracted growing attention. We believe that MCFS can contribute to video-text retrieval, since all actions in MCFS are annotated with semantic labels on three levels. Besides, such a hierarchical structure gives methods better generalization and improves their ability to distinguish fine-grained semantic differences.

Conclusion

In this paper, we introduce a new fine-grained dataset called MCFS for the TAS task. The hierarchical semantic structure of our dataset has been organized with professional knowledge. In addition, MCFS differs from existing TAS datasets in multiple aspects, including motion-centered human actions, large variance of action speed and duration, multi-modal action features and high category similarity. Based on these differences, a number of comparative experiments are conducted on MCFS. The experimental results indicate that MCFS is a promising and challenging dataset for the TAS task. Besides, MCFS could be used to develop more powerful multi-modal and Skeleton-based models by exploiting optical flow and Skeleton features in the TAS field. We will move on to propose more state-of-the-art TAS methods, and we hope that our dataset will promote the development of action analysis and related research topics.

References

Aakur, S. N.; and Sarkar, S. 2019. A Perceptual Prediction Framework for Self Supervised Event Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Bhatnagar, B. L.; Singh, S.; Arora, C.; Jawahar, C.; and CVIT, K. 2017. Unsupervised Learning of Deep Feature Representation for Clustering Egocentric Actions. In IJCAI, 1447-1453.

Bhattacharya, U.; Mittal, T.; Chandra, R.; Randhavane, T.; Bera, A.; and Manocha, D. 2020. STEP: Spatial Temporal Graph Convolutional Networks for Emotion Perception from Gaits. In AAAI, 1342-1350.

Cao, Z.; Simon, T.; Wei, S.-E.; and Sheikh, Y. 2017. Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Carreira, J.; and Zisserman, A. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299-6308.

Chen, M.-H.; Li, B.; Bao, Y.; AlRegib, G.; and Kira, Z. 2020a. Action Segmentation with Joint Self-Supervised Temporal Domain Adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9454-9463.

Chen, S.; Zhao, Y.; Jin, Q.; and Wu, Q. 2020b. Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10638-10647.

Damen, D.; Doughty, H.; Maria Farinella, G.; Fidler, S.; Furnari, A.; Kazakos, E.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W.; et al. 2018. Scaling egocentric vision: The EPIC-KITCHENS dataset. In Proceedings of the European Conference on Computer Vision (ECCV), 720-736.

Ding, L.; and Xu, C. 2018. Weakly-Supervised Action Segmentation with Iterative Soft Boundary Assignment. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Farha, Y. A.; and Gall, J. 2019. MS-TCN: Multi-stage temporal convolutional network for action segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3575-3584.

Fathi, A.; Ren, X.; and Rehg, J. M. 2011. Learning to recognize objects in egocentric activities. In IEEE Conference on Computer Vision & Pattern Recognition.
Gao, Y.; Vedula, S. S.; Reiley, C. E.; Ahmidi, N.; Varadarajan, B.; Lin, H. C.; Tao, L.; Zappella, L.; Béjar, B.; Yuh, D. D.; et al. 2014. JHU-ISI gesture and skill assessment working set (JIGSAWS): A surgical activity dataset for human motion modeling. In MICCAI Workshop: M2CAI, volume 3, 3.

Garcia del Molino, A.; Lim, J.-H.; and Tan, A.-H. 2018. Predicting visual context for unsupervised event segmentation in continuous photo-streams. In Proceedings of the 26th ACM International Conference on Multimedia, 10-17.

Graves, A.; Fernández, S.; and Schmidhuber, J. 2005. Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition. In Artificial Neural Networks: Formal Models and Their Applications - ICANN, International Conference, Warsaw, Poland, September.

Huang, Y.; Sugano, Y.; and Sato, Y. 2020. Improving Action Segmentation via Graph-Based Temporal Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14024-14034.

Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. 2017. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.

Kuehne, H.; Arslan, A.; and Serre, T. 2014. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 780-787.

Kukleva, A.; Kuehne, H.; Sener, F.; and Gall, J. 2019. Unsupervised learning of action classes with continuous temporal embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12066-12074.

Lea, C.; Flynn, M. D.; Vidal, R.; Reiter, A.; and Hager, G. D. 2017. Temporal convolutional networks for action segmentation and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 156-165.

Lee, P.; Uh, Y.; and Byun, H. 2020. Background Suppression Network for Weakly-Supervised Temporal Action Localization. In AAAI, 11320-11327.

Lei, P.; and Todorovic, S. 2018. Temporal deformable residual networks for action segmentation in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6742-6751.

Li, J.; Wang, J.; Tian, Q.; Gao, W.; and Zhang, S. 2019. Global-local temporal representations for video person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, 3958-3967.
Li, S.-J.; AbuFarha, Y.; Liu, Y.; Cheng, M.-M.; and Gall, J. 2020. MS-TCN++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Lin, J. F.-S.; and Kulić, D. 2013. Online segmentation of human motion for automated rehabilitation exercise analysis. IEEE Transactions on Neural Systems and Rehabilitation Engineering 22(1): 168-180.

Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; and Ouyang, W. 2020. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 143-152.

Magliano, J. P.; and Zacks, J. M. 2011. The impact of continuity editing in narrative film on event segmentation. Cognitive Science 35(8): 1489-1517.

Piergiovanni, A.; and Ryoo, M. S. 2018. Fine-grained activity recognition in baseball videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 1740-1748.

Pirsiavash, H.; Vondrick, C.; and Torralba, A. 2014. Assessing the quality of actions. In European Conference on Computer Vision, 556-571. Springer.

Richard, A.; Kuehne, H.; and Gall, J. 2018. Action sets: Weakly supervised action segmentation without ordering constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5987-5996.

Richard, A.; Kuehne, H.; Iqbal, A.; and Gall, J. 2018. NeuralNetwork-Viterbi: A framework for weakly supervised video learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7386-7395.

Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2008. The graph neural network model. IEEE Transactions on Neural Networks 20(1): 61-80.

Schiele, B.; Andriluka, M.; Amin, S.; and Rohrbach, M. 2012. A database for fine grained activity detection of cooking activities. In IEEE Conference on Computer Vision & Pattern Recognition.

Sener, F.; and Yao, A. 2018. Unsupervised learning and segmentation of complex activities from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8368-8376.

Shao, D.; Zhao, Y.; Dai, B.; and Lin, D. 2020. FineGym: A hierarchical video dataset for fine-grained action understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2616-2625.

Shi, L.; Zhang, Y.; Cheng, J.; and Lu, H. 2019. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12026-12035.
Stein, S.; and McKenna, S. J. 2013. Combining embedded accelerometers with computer vision for recognizing food preparation activities. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing.

Sun, C.; Shetty, S.; Sukthankar, R.; and Nevatia, R. 2015. Temporal localization of fine-grained actions in videos by domain transfer from web images. In Proceedings of the 23rd ACM International Conference on Multimedia, 371-380.

Toyer, S.; Cherian, A.; Han, T.; and Gould, S. 2017. Human pose forecasting via deep Markov models. In 2017 International Conference on Digital Image Computing: Techniques and Applications (DICTA), 1-8. IEEE.

Urban, T. L.; and Russell, R. A. 2003. Scheduling sports competitions on multiple venues. European Journal of Operational Research 148(2): 302-311.

Wang, B.; Ma, L.; Zhang, W.; and Liu, W. 2018. Reconstruction network for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7622-7631.

Xu, J.; Mei, T.; Yao, T.; and Rui, Y. 2016. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5288-5296.

Yan, S.; Xiong, Y.; and Lin, D. 2018. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In AAAI.

Zhang, D.; Dai, X.; Wang, X.; Wang, Y.-F.; and Davis, L. S. 2019. MAN: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1247-1257.

Zhou, L.; Xu, C.; and Corso, J. 2018. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Temporal Segmentation of Fine-gained Semantic Action: A Motion-Centered Figure Skating Dataset

Proceedings of the AAAI Conference on Artificial IntelligenceMay 18, 2021

Loading next page...
 
/lp/unpaywall/temporal-segmentation-of-fine-gained-semantic-action-a-motion-centered-ChIKGF1wOn

References

References for this paper are not available at this time. We will be adding them shortly, thank you for your patience.

Publisher
Unpaywall
ISSN
2159-5399
DOI
10.1609/aaai.v35i3.16314
Publisher site
See Article on Publisher Site

Abstract

The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21) Temporal Segmentation of Fine-grained Semantic Action: A Motion-Centered Figure Skating Dataset 1 1,* 1,* Shenglan Liu , Aibin zhang , Yunheng Li , 1 2 1 1 Jian Zhou , Li Xu , Zhuben Dong , Renhao Zhang Dalian University of Technology, Dalian, Liaoning, 116024 China Alibaba Group liusl@dlut.edu.cn, renwei.xl@alibaba-inc.com Abstract Coarse-grained semantics. The coarse-grained TAS is relatively easy for the existing models. However, it is dif- Temporal Action Segmentation (TAS) has achieved great suc- ficult to meet the related applications of fine-grained seman- cess in many fields such as exercise rehabilitation, movie tics (Sun et al. 2015; Piergiovanni and Ryoo 2018) which is editing, etc. Currently, task-driven TAS is a central topic more challenging for frame-level action classification. in human action analysis. However, motion-centered TAS, Spatial characteristics. In most TAS datasets, scene, tool as an important topic, is little researched due to unavail- able datasets. In order to explore more models and prac- and object (even more important than the action itself some- tical applications of motion-centered TAS, we introduce a times) play very important roles in human action recogni- Motion-Centered Figure Skating (MCFS) dataset in this pa- tion. However, we should pay more attention to the action per. Compared with existing temporal action segmentation in many practical applications (Bhattacharya et al. 2020; Li datasets, the MCFS dataset is fine-grained semantic, special- et al. 2019). Besides, the task-driven datasets cannot show ized and motion-centered. Besides, RGB-based and Skeleton- the full human body in an expected manner. Therefore, it based features are provided in the MCFS dataset. Experi- is difficult to extract more modal features to perform TAS mental results show that existing state-of-the-art methods are tasks. difficult to achieve excellent segmentation results (includ- Temporal characteristics. Generally, the action content ing accuracy, edit and F1 score) in the MCFS dataset. This categories for task-driven TAS datasets are simple, and the indicates that MCFS is a challenging dataset for motion- centered TAS. The latest dataset can be downloaded at speed difference in distinct actions is too small. The lit- https://shenglanliu.github.io/mcfs-dataset/. tle speed variance is difficult to cause frame-level feature changes, which is less challenging for TAS tasks. The issues above limit the broader research of TAS mod- Introduction els. In order to exploit new methods on the task of motion- Temporal action segmentation (TAS) has been widely used centered TAS, this paper proposed a new dataset named in sports competitions (Urban and Russell 2003), exer- MCFS. MCFS is composed of 271 single figure skating per- cise rehabilitation (Lin and Kulic ´ 2013), movie editing formance videos. The videos are taken from the 17.3 hours (Magliano and Zacks 2011) and other fields. Technically, competition of the 2017-2019 World Figure Skating Cham- TAS has been extended to many new topics, such as video pionships. Each clip is 30 frames per second, with a resolu- action localization (Lee, Uh, and Byun 2020) and mo- tion of 1080 720 and a length of 162s to 285s. All actions ment retrieval (Zhang et al. 2019) etc. In recent years, are annotated with the semantic labels on three levels (see TAS has made remarkable progress on task-based video, es- Fig. 1). 
The camera focuses on the skater to ensure that he pecially in designing new temporal convolution networks (she) appears in every frame during the action. Compared (TCN) (e.g. Encoder-Decoder TCN (ED-TCN) (Lea et al. with the existing datasets, MCFS has five remarkable ad- 2017), Multi-Stage Temporal Convolutional Network (MS- vantages which are listed as follows. TCN) (Farha and Gall 2019) and Self-Supervised Temporal Multi-level fine-grained semantics. All the annotations Domain Adaptation (SSTDA) (Chen et al. 2020a)) which are carried at three levels, namely set, subset and element achieve higher performance on cooking task datasets (e.g. in this paper. Fine-grained semantics means that similar ac- GTEA (Fathi, Ren, and Rehg 2011), 50Salads (Stein and tions may have different labels because of motion-centered Mckenna 2013) and Breakfast (Kuehne, Arslan, and Serre characteristics in figure skating (e.g. Lutz and Flip jumps 2014), etc.). are similar in motion aspect, but are two different jumps.). However, it can be found that the existing datasets have Such a semantic hierarchy provides a distinct structure for three limitations for TAS research, which can be mainly comprehending coarse-grained and fine-grained operations. summarized as follows. Multi-modal action features. Previous datasets only of- fer features based on RGB video content such as Flow and *Equal contribution. I3D (Carreira and Zisserman 2017).etc., while MCFS pro- Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. vides extra Skeleton (Cao et al. 2017) feature which provides 2163 Figure 1: A video in MCFS. The labels of this video belong to the subset-level. new opportunities and has significant to TAS methodologi- ance, and complex motion-centered actions. It can be used cal research. to provide high quality and fine-grained annotations of full sequence, special annotations can be divided into three se- Motion-centered human actions. All actions are inde- mantic levels, namely set, subset and element. pendent of scenes and objects in MCFS dataset (i.e. most classes of actions are dominantly biased for skaters’ pose.) (2) We make in-depth study on MCFS, explored optional Large variance of action speed & duration. In the multi-modal features as input data of TAS model, and reveal MCFS, the action content is complicated, and the speed dif- the major challenges of future research and potential appli- cations for high-vary speed motion tasks. ference in distinct actions is too large. For instance, one jumping action is completed within about 2s, by contrast, the longest step could reach 72s. The large speed variance Related Work always makes the large action duration variance of differ- Methods for TAS ent actions, which can be regarded as a great challenge to frame-based TAS task. Unsupervised Learning Approaches. For unsupervised Specialization. All videos in the MCFS are high- TAS task, the major technique is to exploit discriminative resolution records taken from the World Figure Skating information by clustering of spatio-temporal features. Such Championships. Moreover, professional quality control is models introduce temporal consistency into the TAS meth- carried out on the full sequence of video annotations to guar- ods by using LSTMs (Bhatnagar et al. 2017) or general- antee the correctness, reliability and consistency of annota- ized mallows model (Sener and Yao 2018). Kukleva et al. tions. (Kukleva et al. 
2019) utilized both frame-wise clustering According to the characteristics of MCFS, a series of and video-wise clustering to model bag-of-words represen- empirical studies are conducted to discover the challenges tation of each video. Besides, in order to use the contextual of motion-centered TAS. Specifically, we first tested vari- event in videos fully, Garcia et al. (Garcia del Molino, Lim, ous TAS techniques and observed the performance of these and Tan 2018) proposed an LSTM-based generative network methods is far from satisfactory in high-speed motion TAS. when solving TAS task. For dynamics TAS task, Aakur et In order to provide assistance for future research, we also re- al. (Aakur and Sarkar 2019) proposed a self-supervised and viewed some modeling options, such as input data patterns. predictive learning framework by utilizing features of adja- We found that for fine-grained TAS task, 1) motion infor- cent frames as loss function. As another efficient dynamics mation plays a very important role, rather than depending TAS approach without training or clustering, MWS only in- on the scene and object of the video content. 2) The fine- volves curvature of action features in a neighborhood to re- grained categories are more likely to be used to increase the alize segmentation’s locations of a clip. frame-based misjudgments (“burr” phenomenon) in clip ac- Weakly Supervised Approaches. The key idea of the tion decision, which might become a new challenge for the weakly supervised TAS task is to mitigate the dependence existing TAS models. 3) The input modal of TAS model is of direct labeling by using indirect supervision manner to very important, and new modal of the input (e.g. Skeleton) achieve highly TAS performance. For order-level weakly su- will exploit the research of TAS a new branch (e.g. GNN- pervised TAS task, Ding et al. (Ding and Xu 2018) proposed based (Scarselli et al. 2008) TAS approaches). a temporal autoencoder to predict frame-by-frame labels, Taken together, the work has contributed to the study of and combined soft boundary assignment to iteratively opti- TAS task can be listed as the following two aspects: mize the segmentation results. To further explore the tempo- (1) The MCFS dataset we collected is the first challenging ral structure, Kuehne et al. (Kuehne, Arslan, and Serre 2014) dataset for TAS task with large action speed, duration vari- used “task graph” for order description and developed a hi- erarchical model based on HMMs for task-driven TAS. For Mainly refers to the spatial position and the time change of online TAS, Richard et al. (Richard et al. 2018) used Viterbi- action sequence. In addition, it includes certain statistical charac- based loss offer a new deep model to achieve the frame-wise teristics of actions, such as: the variance of the action duration, and the variance of the action speed. TAS goal. Recently, an order-free TAS method (Richard, 2164 Dataset Duration People Segments Task Classes RGB? Skeleton? Fine-grained? Year GTEA 0.57h 4 922 CA 11   2011 p p p MPII 9.8h 12 5609 CA - (Semantics) 2012 50Salads 5.3h 27 966 CA 19   2013 JIGSAWS 2.6h - 1703 SA 3   2014 p p Breakfast 77h 52 8456 CA 48  (Temporally) 2014 Ikea-FA 3.9h 32 - MS 5   2017 EPIC-KITCHENS 55h 32 39596 CA 5   2018 p p p MCFS(ours) 17.3h 186 11656 FS 130 (Semantics) 2021 Table 1: Comparisons of attributes existing datasets. 
*CA: cooking activities, SA: surgical activities, AF: assembling furniture, FS: figure skating Kuehne, and Gall 2018) based on probabilistic model for poral fine-grained action units. Breakfast (Kuehne, Arslan, set-level weakly supervised TAS is proposed. and Serre 2014) constructs an order graph and units descrip- tion; EPIC-KITCHEN (Damen et al. 2018), which intro- Fully Supervised Approaches. Fully supervised TAS duces visual object detection to form a temporal fine-grained aims to segment the video into semantically consistent action unit, is a large-scale cooking TAS dataset. Actually, “blocks”. A large amount of related works have explored the above two tasks can be regarded as a procedure segmen- for supervised TAS tasks. Most supervised TAS models tation task in cooking, which is proposed in the Youcook2 adopt autoencoder architecture for preserving temporal con- (Zhou, Xu, and Corso 2018) dataset. Youcook2 provides not sistence between input and output. For example, Lea et al. only a temporal location, but also descriptions of the actions (Lea et al. 2017) proposed a temporal convolutional network in a sentence. In addition, tool-object fine-grained seman- for TAS, which utilized dilated convolutions to improve the tic class is offered in the MPII dataset. However, most of process of pooling and upsampling. In (Li et al. 2020), Farha the existing datasets are based on tool-objection content in et al. proposed a multi-stage structure combining smoothing TAS-related tasks, and it is impractical to extract Skeleton loss for TAS tasks, which also involved autoencoder net- features without full-body motion. Besides, the lack of char- work. Lei (Lei and Todorovic 2018) developed temporal de- acters, fine-grained semantics and categories in the existing formable residual network using deformable temporal con- datasets also limits the development of the TAS task. Ta- volutions to enhance the TAS performance. Yet these meth- ble 1 shows the development of TAS-related datasets in the ods suffer from long training time and unsatisfactory seg- past decade. These datasets, where the action segmentations mentation accuracy, which might be explained by the model are more based on hands, tool and objects, are mainly task- architecture. driven. These reasons hindered the development of the TAS methods based on human motion. MCFS will make up for TAS-related Datasets the shortcomings of the existing TAS datasets, and promote TAS-related datasets include TAS and action localization. the discovery of new problems in the TAS tasks. We believe Action localization aims to localize the temporal intervals MCFS will be a new challenging dataset for motion-centered of query actions, for example FineGym (Shao et al. 2020), TAS. while TAS intend to divide a video into independent actions at frame-level. We focus on TAS in this paper. For the early The MCFS Dataset datasets, GTEA (Fathi, Ren, and Rehg 2011) and 50Sal- MCFS aims to be a motion-centered dataset for TAS task, ads (Stein and Mckenna 2013), which are based on coarse- which can better exploit new TAS models. In this section, grained cooking tasks only, have less than 20 categories of we introduce the challenges of category definition, data an- actions (11 and 19 categories, separately), while the surgi- notation and quality control in MCFS, seperately. Moreover, cal activity dataset of JIGSAWS (Gao et al. 2014) only con- the detailed construction process including data preparation sists of 3 categories. 
Existing methods can achieve good performance on these datasets because they are limited in the number of categories and in video duration. Later, many datasets were improved in terms of video duration, action categories, body motion and fine-grained semantics (including temporally fine-grained units and semantically fine-grained classes). All of these improvements make TAS-related datasets more challenging and practical. For TAS with body motion, the Ikea-FA (Toyer et al. 2017) and MPII (Schiele et al. 2012) datasets capture upper (and partially occluded) body motion, due to the particularities of furniture assembly and cooking tasks. More recently, most datasets have focused on finer determination of action boundaries, especially on temporally fine-grained action units. Breakfast (Kuehne, Arslan, and Serre 2014) constructs an order graph and unit descriptions; EPIC-KITCHENS (Damen et al. 2018), which introduces visual object detection to form temporally fine-grained action units, is a large-scale cooking TAS dataset. Actually, these two tasks can be regarded as procedure segmentation in cooking, which was proposed with the Youcook2 (Zhou, Xu, and Corso 2018) dataset. Youcook2 provides not only temporal locations but also sentence-level descriptions of the actions. In addition, a tool-object fine-grained semantic class is offered in the MPII dataset. However, most of the existing datasets are based on tool-object content, and it is impractical to extract Skeleton features without full-body motion. Besides, the small number of people and the lack of fine-grained semantics and categories in the existing datasets also limit the development of the TAS task. Table 1 shows the development of TAS-related datasets in the past decade. These datasets, whose action segments are defined mostly by hands, tools and objects, are mainly task-driven. These reasons have hindered the development of TAS methods based on human motion. MCFS makes up for the shortcomings of the existing TAS datasets and promotes the discovery of new problems in TAS. We believe MCFS will be a new challenging dataset for motion-centered TAS.

The MCFS Dataset

MCFS aims to be a motion-centered dataset for the TAS task, which can better support the exploration of new TAS models. In this section, we introduce the challenges of category definition, data annotation and quality control in MCFS, separately. Moreover, the detailed construction process, including data preparation and annotation details, is introduced. Finally, we show that MCFS exhibits statistical and physical-motion characteristics that are competitive with the existing datasets.

Key Challenges

There are a series of challenges in the data collection procedure, due to the top-level professionalism and complexity of figure skating. Firstly, as a highly professional sport, it is impractical to define categories for figure skating manually. Fortunately, the exactitude of the labels can be ensured under the guidance of the official technical documents of figure skating. Secondly, it is difficult for the annotators to determine the category and boundary of actions, since the actions are fast and highly similar to other actions within the same subset. To address this issue, the annotators are trained with the necessary specialized knowledge by professionals in figure skating. Besides, this work is guided by the labels in the original video sequences (the upper-left label in Fig. 2). Based on the above criteria, the labeling and division in MCFS are more convincing. However, we still enforce a series of measures to ensure high-quality labels, as described in the next part.

Figure 2: Label variation in the original video sequences. The upper-left label in the original video sequences changes when the action changes.
Dataset Construction

Data Preparation. In MCFS, we collect 38 official videos of 186 skaters from the 2017-2019 World Figure Skating Championships. Sufficient skaters ensure the complexity of the data distribution within the same set. Each video sequence has high resolution and frame rate to preserve the integrity of actions. Besides, duplicate videos are removed through manual checking. In addition, we provide I3D and Skeleton features for subsequent experiments.

Annotation Collection. In MCFS, we apply three-level semantic annotations and collect 4 sets, 22 subsets and 130 elements at the respective annotation levels. The MCFS structure is shown in Fig. 3. For example, the set "Spin" can be expressed as Spin = {ChCombospin, CamelSpin, Laybackspin, Sitspin, ChSitSpin, ChCamelspin} with 6 subsets. The elements in each subset are further annotated with defined element labels. Such a semantic hierarchy provides a distinct structure for comprehending coarse-grained and fine-grained operations. The following requirements are observed throughout the annotation process. First, it is necessary to refer to the real-time labels provided in the original videos to determine the start frame and the end frame; meanwhile, all incomplete video clips are removed in this process. Second, according to the element-level action category, the official manual and categorization structure are consulted to classify the elements into subsets and sets.

Figure 3: The three-level semantic annotation: 4 sets (e.g., Spin), 22 subsets (e.g., CamelSpin) and 130 elements (e.g., CamelSpin3) are collected at the respective annotation levels.

Annotation Tool. As the segments vary in length and content, the workload of annotating MCFS with a conventional annotation tool would be crushing. In order to improve annotation efficiency, we develop a new tool that previews the two frames before and after the current frame. In addition, with this tool, the start and end frames can be selected directly while the category is updated manually.

Quality Control. We annotate all the frame-level fine-grained action categories and temporal segmentation boundaries in MCFS. To assure the quality of the MCFS dataset, a series of control mechanisms are adopted, including: 1) training annotators with professional knowledge; 2) providing reference documents and sample videos; 3) strictly testing each annotator's labeling level before formal annotation; and 4) reviewing the annotated videos.

Statistic

The MCFS dataset consists of 271 samples captured from 38 competition videos, which include more than 1.73 million frames. All the annotations are carried out at three levels, and the number of categories at each level is 4, 22 and 130, respectively. 93 out of the 130 elements have at least two samples, and the elements present a natural heavy-tailed distribution. Excluding the "NONE" category, we annotate 2,995 effective clips across all samples. The distribution of video durations is shown in Fig. 4 (a). The total video length is 15.9 hours, with an average duration of 212s per video. All the videos remain untrimmed and can be up to 300s long. The distribution of segment durations is shown in Fig. 4 (b), with a mean and standard deviation of 9.4s and 8.3s, respectively. The longest segment lasts 72s and the shortest one lasts 1s. The large range of sample durations is a challenge for the TAS task.

Figure 4: MCFS-22 dataset duration statistics.
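To make the three-level hierarchy (4 sets, 22 subsets, 130 elements) concrete, the following minimal Python sketch shows how an element-level (MCFS-130) frame labeling could be collapsed to the subset (MCFS-22) and set (MCFS-4) levels. This is not the released dataset tooling; apart from the CamelSpin3/CamelSpin/Spin example taken from Fig. 3, the element, subset and set names are illustrative assumptions.

```python
# Minimal sketch of the three-level MCFS label hierarchy (element -> subset -> set).
# Only CamelSpin3 -> CamelSpin -> Spin comes from the paper; the other names are
# hypothetical placeholders used purely for illustration.
HIERARCHY = {
    # element         : (subset,         set)
    "CamelSpin3"      : ("CamelSpin",    "Spin"),
    "3Lutz"           : ("Lutz",         "Jump"),          # assumed names
    "3Lutz_3Toeloop"  : ("Lutz_Toeloop", "Jump"),          # assumed names
    "NONE"            : ("NONE",         "NONE"),
}

def collapse(frame_labels, level):
    """Map an element-level frame labeling to 'subset' (MCFS-22) or 'set' (MCFS-4)."""
    idx = {"subset": 0, "set": 1}[level]
    return [HIERARCHY[lab][idx] for lab in frame_labels]

if __name__ == "__main__":
    frames = ["NONE", "3Lutz", "3Lutz", "CamelSpin3", "NONE"]
    print(collapse(frames, "subset"))  # ['NONE', 'Lutz', 'Lutz', 'CamelSpin', 'NONE']
    print(collapse(frames, "set"))     # ['NONE', 'Jump', 'Jump', 'Spin', 'NONE']
```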
MCFS dataset can promote the research of machine learning MCFS-4, 22 and 130 share the same splits, but are anno- for temporal action segmentation of human action. tated by the different three hierarchical semantic labels (set, subset, element), respectively, which have been introduced Large Variance of Action Speed and Duration. For in section “the MCFS Dataset”. sport datasets like MCFS, the different actions are with large I3D Feature Based on RGB. For each frame, a 2048 di- action speed and duration variance. For example, jump is mensional feature vector of I3D, whose final feature vec- generally completed in 2-3s, but the longest sequence can tor for each frame is obtained by concatenating the vectors be more than 70s. We have calculated the nearest neigh- form both RGB and flow streams which results in a 2048 bor variance of I3D features for 21 frames in four datasets dimensional vector for each frame, is pretrained on Kinetics as shown in Fig. 5. It can be clearly seen that compared to (Kay et al. 2017). Specifically, temporal window for I3D of the other three kitchen datasets, there are dramatic changes a frame consists of 20 temporal nearest neighbored frames between different actions in the MCFS dataset. This brings of current frame (altogether 21 frames). More details can be great challenges to the division of boundaries. referred to reference (Carreira and Zisserman 2017). Skeleton Feature. On the MCFS, we use the 2D pose estimation results from the OpenPose (Cao et al. 2017) toolbox which outputs 18 joints and 25 joints. In addition, these joints of Skeleton feature are normalized by dividing two spatial direction coordinates of joints by corresponding frame size respectively, and then centralized by the waist joint (center joint). All our experiments utilize the Skeleton feature of 25 joints. Dataset F1@f10,25,50g Edit Acc 50Salads Bi-LSTM 62.6 58.3 47.0 55.6 55.7 ED-TCN 68.0 63.9 52.6 59.8 55.7 MS-TCN 76.3 74.0 64.5 67.9 80.7 SSTDA 83.0 81.5 73.8 75.8 83.2 Figure 5: The nearest neighbor variance results of GTEA, GTEA 50Salads, Breakfast and MCFS-22 dataset. Bi-LSTM 66.5 59.0 43.6 - 55.5 ED-TCN 72.2 69.3 56.0 - 64.0 MS-TCN 85.8 83.4 69.8 79.0 76.3 High Similarity of Category. In the MCFS dataset, two SSTDA 90.0 89.1 78.0 86.2 79.8 samples of different categories may only have few dif- Breakfast ferent frames called key frames. For example, in a sin- Bi-LSTM 33.4 21.9 13.6 35.8 56.6 gle jump, “Lutz” and “Flip” performs are basically the ED-TCN 48.6 43.1 27.7 38.6 67.3 same, except for the differences inside and outside the ice MS-TCN 52.6 48.1 37.9 61.7 66.3 skate blade. For continuous jump like “3Lutz 3Loop” and SSTDA 75.0 69.1 55.2 73.7 70.2 “3Lutz 3Toeloop”, the first jump is exactly the same, while the difference only depends on the subsequent jump. Such subtle differences can easily make the model misjudge the Table 2: Comparison with the state-of-the-art on 50Salads, category and segmentation point of actions. Meanwhile, be- GTEA, and the Breakfast dataset (All data obtained from cause similar frames may appear in different actions, multi- (Farha and Gall 2019) and (Chen et al. 2020a)). semantics frames become another inevitable problem. Evaluation Metric. For evaluation, we report the frame- Experiments wise accuracy (Acc), segmental edit distance and the seg- In this section, experimental setup is first introduced for TAS mental F1 score at overlapping thresholds 10%, 25% and task. 
High Similarity of Category. In the MCFS dataset, two samples of different categories may differ in only a few frames, called key frames. For example, among single jumps, "Lutz" and "Flip" are performed almost identically, except for the difference between the inside and outside edges of the skate blade. For combination jumps like "3Lutz 3Loop" and "3Lutz 3Toeloop", the first jump is exactly the same, and the difference depends only on the subsequent jump. Such subtle differences can easily make a model misjudge the category and segmentation point of actions. Meanwhile, because similar frames may appear in different actions, multi-semantics frames become another inevitable problem.

Experiments

In this section, the experimental setup for the TAS task is first introduced. Then, we report the experimental results on benchmark datasets (50Salads, GTEA and Breakfast) and list the baseline results of state-of-the-art TAS methods on the MCFS dataset. In addition, the characteristics of MCFS are discussed based on the experimental results.

Experimental Setup

Data. MCFS is randomly split into 189 and 82 videos for training and testing, respectively. Then, we use 5-fold cross validation to assess the generalization of the models. MCFS-4, 22 and 130 share the same splits but are annotated with the three different hierarchical semantic labels (set, subset, element), respectively, which have been introduced in the section "The MCFS Dataset".

I3D Feature Based on RGB. For each frame, we extract an I3D feature vector pretrained on Kinetics (Kay et al. 2017): the vectors from the RGB and flow streams are concatenated, resulting in a 2048-dimensional vector per frame. Specifically, the temporal window for the I3D feature of a frame consists of the 20 temporally nearest frames of the current frame (21 frames altogether). More details can be found in the reference (Carreira and Zisserman 2017).

Skeleton Feature. On MCFS, we use the 2D pose estimation results from the OpenPose (Cao et al. 2017) toolbox, which outputs 18 joints and 25 joints. In addition, the joint coordinates are normalized by dividing the two spatial coordinates of each joint by the corresponding frame size, and then centralized on the waist joint (center joint). All our experiments use the 25-joint Skeleton feature.
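A minimal sketch of the skeleton normalization described above (divide each coordinate by the frame size, then center on the waist joint) is shown below. It is not the authors' preprocessing code; the use of index 8 as the waist/mid-hip joint assumes OpenPose's BODY_25 layout.

```python
import numpy as np

def normalize_skeleton(joints, frame_w, frame_h, center_idx=8):
    """Normalize one frame of 2D pose: divide x by frame width and y by frame
    height, then subtract the waist joint so the pose is centered on it.
    `joints` is a (25, 2) array of (x, y) pixel coordinates from OpenPose;
    index 8 is assumed to be the mid-hip ("waist") joint in the BODY_25 layout."""
    j = np.asarray(joints, dtype=np.float32).copy()
    j[:, 0] /= frame_w
    j[:, 1] /= frame_h
    return j - j[center_idx]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pose = rng.uniform(low=(0, 0), high=(1080, 720), size=(25, 2))  # toy pose
    norm = normalize_skeleton(pose, frame_w=1080, frame_h=720)
    print(norm[8])  # the center joint maps to (0, 0)
```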
Evaluation Metric. For evaluation, we report the frame-wise accuracy (Acc), the segmental edit distance and the segmental F1 score at overlap thresholds of 10%, 25% and 50%, denoted by F1@{10, 25, 50} (Farha and Gall 2019). The F1 score penalizes over-segmentation errors while not penalizing minor temporal shifts between the predictions and the ground truth. This is appropriate for the TAS task because avoiding over-segmentation errors is important for video summarization; for this reason, we use the F1 score as a measure of the quality of the predictions. A detailed description of the above evaluation metrics can be found in the related reference (Lea et al. 2017).
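The following is a minimal sketch (not the evaluation code used for the tables below) of the frame-wise accuracy and the segmental F1@k metric following the protocol of Lea et al. (2017): frame labels are collapsed into segments, and a predicted segment counts as a true positive if its temporal IoU with an unmatched ground-truth segment of the same class exceeds the threshold. The segmental edit distance (not shown) is a normalized Levenshtein distance between the two segment label sequences. Label names in the toy example are illustrative.

```python
import numpy as np

def segments(labels):
    """Collapse a frame-wise label sequence into (label, start, end) runs, end exclusive."""
    segs, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segs.append((labels[start], start, t))
            start = t
    return segs

def frame_accuracy(pred, gt):
    return float(np.mean(np.asarray(pred) == np.asarray(gt)))

def f1_at_overlap(pred, gt, tau=0.5, bg=("NONE",)):
    """Segmental F1: a predicted segment is a true positive if its IoU with an
    unmatched ground-truth segment of the same label is at least `tau`."""
    p_segs = [s for s in segments(pred) if s[0] not in bg]
    g_segs = [s for s in segments(gt) if s[0] not in bg]
    used, tp = [False] * len(g_segs), 0
    for lab, s, e in p_segs:
        best, best_j = 0.0, -1
        for j, (glab, gs, ge) in enumerate(g_segs):
            if used[j] or glab != lab:
                continue
            inter = max(0, min(e, ge) - max(s, gs))
            union = max(e, ge) - min(s, gs)
            iou = inter / union if union else 0.0
            if iou > best:
                best, best_j = iou, j
        if best_j >= 0 and best >= tau:
            tp += 1
            used[best_j] = True
    fp, fn = len(p_segs) - tp, len(g_segs) - tp
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

if __name__ == "__main__":
    gt   = ["NONE"] * 5 + ["3Lutz"] * 10 + ["NONE"] * 5 + ["CamelSpin3"] * 10
    pred = ["NONE"] * 6 + ["3Lutz"] * 8  + ["NONE"] * 6 + ["CamelSpin3"] * 10
    print(frame_accuracy(pred, gt), f1_at_overlap(pred, gt, tau=0.5))
```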
Baselines for Temporal Action Segmentation

In this subsection, we conduct experiments using the I3D feature on the 50Salads, GTEA and Breakfast datasets, and list in Table 2 the detailed results of four TAS methods based on TCNs (ED-TCN (Lea et al. 2017), MS-TCN (Farha and Gall 2019) and SSTDA (Chen et al. 2020a)) and on LSTMs (Bi-LSTM (Graves, Fernández, and Schmidhuber 2005)). We show results for the two modalities (I3D and Skeleton) of MCFS, as well as for the four TAS methods, in Table 3 (we select only the two state-of-the-art models for Skeleton). The detailed experimental results illustrate three challenging properties of MCFS, as follows.

Dataset | Method | F1@10 | F1@25 | F1@50 | Edit | Acc
50Salads | Bi-LSTM | 62.6 | 58.3 | 47.0 | 55.6 | 55.7
50Salads | ED-TCN | 68.0 | 63.9 | 52.6 | 59.8 | 55.7
50Salads | MS-TCN | 76.3 | 74.0 | 64.5 | 67.9 | 80.7
50Salads | SSTDA | 83.0 | 81.5 | 73.8 | 75.8 | 83.2
GTEA | Bi-LSTM | 66.5 | 59.0 | 43.6 | – | 55.5
GTEA | ED-TCN | 72.2 | 69.3 | 56.0 | – | 64.0
GTEA | MS-TCN | 85.8 | 83.4 | 69.8 | 79.0 | 76.3
GTEA | SSTDA | 90.0 | 89.1 | 78.0 | 86.2 | 79.8
Breakfast | Bi-LSTM | 33.4 | 21.9 | 13.6 | 35.8 | 56.6
Breakfast | ED-TCN | 48.6 | 43.1 | 27.7 | 38.6 | 67.3
Breakfast | MS-TCN | 52.6 | 48.1 | 37.9 | 61.7 | 66.3
Breakfast | SSTDA | 75.0 | 69.1 | 55.2 | 73.7 | 70.2

Table 2: Comparison with the state of the art on the 50Salads, GTEA and Breakfast datasets (all data obtained from (Farha and Gall 2019) and (Chen et al. 2020a)).

Dataset | Method | Modality | F1@10 | F1@25 | F1@50 | Edit | Acc
MCFS-4 | Bi-LSTM | I3D | 33.4 | 21.9 | 13.6 | 35.8 | 56.6
MCFS-4 | ED-TCN | I3D | 48.6 | 43.1 | 27.7 | 38.6 | 67.3
MCFS-4 | MS-TCN | I3D | 74.1 | 67.4 | 50.2 | 79.6 | 71.9
MCFS-4 | MS-TCN | Skeleton | 86.8 | 82.6 | 72.1 | 86.9 | 82.0
MCFS-4 | SSTDA | I3D | 75.8 | 69.9 | 52.5 | 82.1 | 71.4
MCFS-4 | SSTDA | Skeleton | 88.7 | 84.9 | 74.6 | 89.3 | 82.0
MCFS-22 | Bi-LSTM | I3D | 14.8 | 5.9 | 1.5 | 13.6 | 54.3
MCFS-22 | ED-TCN | I3D | 32.3 | 25.7 | 11.6 | 25.6 | 58.8
MCFS-22 | MS-TCN | I3D | 49.4 | 44.1 | 29.8 | 52.6 | 62.6
MCFS-22 | MS-TCN | Skeleton | 74.3 | 69.7 | 59.5 | 74.2 | 75.6
MCFS-22 | SSTDA | I3D | 52.7 | 46.3 | 31.1 | 56.3 | 59.1
MCFS-22 | SSTDA | Skeleton | 76.7 | 72.2 | 61.2 | 77.5 | 75.7
MCFS-130 | Bi-LSTM | I3D | 9.9 | 2.5 | 0.3 | 7.6 | 54.3
MCFS-130 | ED-TCN | I3D | 30.2 | 22.7 | 10.6 | 23.1 | 54.5
MCFS-130 | MS-TCN | I3D | 36.6 | 30.5 | 20.0 | 36.3 | 58.0
MCFS-130 | MS-TCN | Skeleton | 56.4 | 52.2 | 42.5 | 54.5 | 65.7
MCFS-130 | SSTDA | I3D | 42.6 | 37.3 | 24.6 | 44.4 | 55.1
MCFS-130 | SSTDA | Skeleton | 63.8 | 60.1 | 49.8 | 63.5 | 65.4

Table 3: Results of representative TAS methods on MCFS at the set (MCFS-4), subset (MCFS-22) and element (MCFS-130) levels.

Figure 6: The confusion matrix results of MCFS-22 (Skeleton) using MS-TCN.

Motion-centered. Table 2 illustrates that I3D can achieve superior performance on the benchmark datasets (50Salads, GTEA and Breakfast). Specifically, most metric values (accuracy, segmental edit distance and F1 score) of the SSTDA model are over 70% (only F1@25 and F1@50 on Breakfast are below 70%). This is because the scene and objects, which are well characterized by I3D in a video sequence, play important roles in object-based TAS datasets such as 50Salads. For example, recognizing "cut tomato" versus "cut cheese" does not hinge on "cut" itself, but on distinguishing the different characteristics of tomato and cheese. In contrast, figure skating pays no attention to scene and objects. Accordingly, the accuracies in Table 3 are generally much lower than those in Table 2 under the same experimental setup. In addition, some action categories may be confused because of the extremely high similarity of motion in MCFS. As shown in Fig. 6, "Toeloop" is wrongly recognized as "Salchow" and "Lutz 3Toeloop". The reason for this confusion is that a single jump can be recognized only from a few key frames, while a combination jump requires attention to more key frames across the two consecutive jumps. The above results illustrate that MCFS is challenging for motion-centered TAS.

Temporal Information. In the TAS task, it is very important to capture the temporal dynamics. Both TCN-based and LSTM-based methods can work well on the existing datasets (50Salads, GTEA, etc.), which lack complex temporal characteristics. Due to the large variance of action speed and duration in MCFS, the LSTM-based method (Bi-LSTM) suffers from vanishing gradients on long input sequences, while the TCN-based methods avoid this issue and obtain far superior performance (Table 3). Another possible reason is the weak temporal correlation among actions in MCFS. Besides, the complex transition motion (non-regular patterns in content, duration and location) interspersed among actions also makes it challenging for TCN-based networks to determine the temporal intervals of actions.

Figure 7: Qualitative results for the TAS task on MCFS.

Fine-grained Semantics Label. MCFS provides three levels of fine-grained annotations, which leads to confusion between different categories with similar actions. The fine-grained characteristics can cause many over-segmentation errors in label prediction. MCFS-22 achieves better accuracy than 50Salads, as shown in Table 2 and Table 3, using ED-TCN with the same setup. However, the segmental edit distance and F1 score of MCFS-22 are much lower than those of 50Salads and GTEA. It is a serious problem that finer semantic labels lead to more over-segmentation errors in MCFS. For example, MCFS-130 performs worse than either MCFS-4 or MCFS-22 for every compared TAS method in Table 3.

Skeleton Features of Action

As can be seen in Fig. 7 and Table 3, both the action recognition errors and the over-segmentation errors of the I3D-based predictions are far more numerous than those based on Skeleton, which illustrates that MCFS depends on human motion. In addition, OpenPose can easily be used for Skeleton extraction because the skater's whole body appears in the video. The two TAS methods (MS-TCN and SSTDA) using the Skeleton feature achieve better performance than those using the I3D feature. For example, on MCFS-22, SSTDA with Skeleton is 24% and 30.1% higher than SSTDA with I3D on F1@10 and F1@50, respectively.

Directions for Future Works

In the human action classification task, GNN-based models such as ST-GCN (Yan, Xiong, and Lin 2018), 2S-AGCN (Shi et al. 2019) and MS-G3D (Liu et al. 2020) have developed rapidly. As far as we are aware, due to the lack of Skeleton features in the existing datasets, GNN-based approaches have not been used for the TAS task. MCFS could be used to develop better multi-modal and Skeleton-based models that exploit optical flow and Skeleton features in the TAS field.

Potential Applications

The high-quality data of MCFS offers a foundation for various applications. Besides fine-grained action segmentation, MCFS supports several potential applications.

Video Description. While there has been increasing interest in the task of describing video with natural language (Xu et al. 2016; Wang et al. 2018), current computer vision algorithms are still severely limited in the language they can associate with video. We believe MCFS can be used for video description because it can support building embeddings between video frames and words.

Action Reasoning. Action reasoning (Pirsiavash, Vondrick, and Torralba 2014) is an interesting issue. For example, it is straightforward to conclude that a 3Lutz-3Toeloop jump occurred if a single 3Lutz jump and a 3Toeloop jump have been recognized. This direction provides more empirical research ideas for model design.

Video-Text Retrieval. Cross-modal retrieval between videos and texts (Chen et al. 2020b) has attracted growing attention. We believe that MCFS can contribute to video-text retrieval, since all actions in MCFS are annotated with semantic labels at three levels. Besides, such a hierarchical structure enables methods to generalize better and improves their ability to distinguish fine-grained semantic differences.

Conclusion

In this paper, we introduce a new fine-grained dataset called MCFS for the TAS task. The hierarchical semantic structure of our dataset has been organized with professional knowledge. In addition, MCFS differs from existing TAS datasets in multiple aspects, including motion-centered human actions, large variance of action speed and duration, multi-modal action features and high category similarity. Based on the above differences, a number of comparative experiments are conducted on MCFS. The experimental results indicate that MCFS is a promising and challenging dataset for the TAS task. Besides, MCFS could be used to develop better multi-modal and Skeleton-based models that exploit optical flow and Skeleton features in the TAS field. We will move on to propose more state-of-the-art TAS methods. We hope that our dataset will promote the development of action analysis and related research topics.

References

Aakur, S. N.; and Sarkar, S. 2019. A Perceptual Prediction Framework for Self Supervised Event Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Bhatnagar, B. L.; Singh, S.; Arora, C.; Jawahar, C.; and CVIT, K. 2017. Unsupervised Learning of Deep Feature Representation for Clustering Egocentric Actions. In IJCAI, 1447–1453.
Bhattacharya, U.; Mittal, T.; Chandra, R.; Randhavane, T.; Bera, A.; and Manocha, D. 2020. STEP: Spatial Temporal Graph Convolutional Networks for Emotion Perception from Gaits. In AAAI, 1342–1350.
Cao, Z.; Simon, T.; Wei, S.-E.; and Sheikh, Y. 2017. Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Carreira, J.; and Zisserman, A. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299–6308.
Chen, M.-H.; Li, B.; Bao, Y.; AlRegib, G.; and Kira, Z. 2020a. Action Segmentation with Joint Self-Supervised Temporal Domain Adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9454–9463.
Chen, S.; Zhao, Y.; Jin, Q.; and Wu, Q. 2020b. Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10638–10647.
Damen, D.; Doughty, H.; Maria Farinella, G.; Fidler, S.; Furnari, A.; Kazakos, E.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W.; et al. 2018. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset. In Proceedings of the European Conference on Computer Vision (ECCV), 720–736.
Ding, L.; and Xu, C. 2018. Weakly-Supervised Action Segmentation with Iterative Soft Boundary Assignment. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Farha, Y. A.; and Gall, J. 2019. MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3575–3584.
Fathi, A.; Ren, X.; and Rehg, J. M. 2011. Learning to Recognize Objects in Egocentric Activities. In IEEE Conference on Computer Vision & Pattern Recognition.
Gao, Y.; Vedula, S. S.; Reiley, C. E.; Ahmidi, N.; Varadarajan, B.; Lin, H. C.; Tao, L.; Zappella, L.; Béjar, B.; Yuh, D. D.; et al. 2014. JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS): A Surgical Activity Dataset for Human Motion Modeling. In MICCAI Workshop: M2CAI, volume 3, 3.
Garcia del Molino, A.; Lim, J.-H.; and Tan, A.-H. 2018. Predicting Visual Context for Unsupervised Event Segmentation in Continuous Photo-Streams. In Proceedings of the 26th ACM International Conference on Multimedia, 10–17.
Graves, A.; Fernández, S.; and Schmidhuber, J. 2005. Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition. In Artificial Neural Networks: Formal Models and Their Applications – ICANN, International Conference, Warsaw, Poland, September.
Huang, Y.; Sugano, Y.; and Sato, Y. 2020. Improving Action Segmentation via Graph-Based Temporal Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14024–14034.
Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. 2017. The Kinetics Human Action Video Dataset. arXiv preprint arXiv:1705.06950.
Kuehne, H.; Arslan, A.; and Serre, T. 2014. The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 780–787.
Kukleva, A.; Kuehne, H.; Sener, F.; and Gall, J. 2019. Unsupervised Learning of Action Classes with Continuous Temporal Embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12066–12074.
Lea, C.; Flynn, M. D.; Vidal, R.; Reiter, A.; and Hager, G. D. 2017. Temporal Convolutional Networks for Action Segmentation and Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 156–165.
Lee, P.; Uh, Y.; and Byun, H. 2020. Background Suppression Network for Weakly-Supervised Temporal Action Localization. In AAAI, 11320–11327.
Lei, P.; and Todorovic, S. 2018. Temporal Deformable Residual Networks for Action Segmentation in Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6742–6751.
Li, J.; Wang, J.; Tian, Q.; Gao, W.; and Zhang, S. 2019. Global-Local Temporal Representations for Video Person Re-identification. In Proceedings of the IEEE International Conference on Computer Vision, 3958–3967.
Li, S.-J.; AbuFarha, Y.; Liu, Y.; Cheng, M.-M.; and Gall, J. 2020. MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Lin, J. F.-S.; and Kulić, D. 2013. Online Segmentation of Human Motion for Automated Rehabilitation Exercise Analysis. IEEE Transactions on Neural Systems and Rehabilitation Engineering 22(1): 168–180.
Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; and Ouyang, W. 2020. Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 143–152.
Magliano, J. P.; and Zacks, J. M. 2011. The Impact of Continuity Editing in Narrative Film on Event Segmentation. Cognitive Science 35(8): 1489–1517.
Piergiovanni, A.; and Ryoo, M. S. 2018. Fine-Grained Activity Recognition in Baseball Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 1740–1748.
Pirsiavash, H.; Vondrick, C.; and Torralba, A. 2014. Assessing the Quality of Actions. In European Conference on Computer Vision, 556–571. Springer.
Richard, A.; Kuehne, H.; and Gall, J. 2018. Action Sets: Weakly Supervised Action Segmentation without Ordering Constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5987–5996.
Richard, A.; Kuehne, H.; Iqbal, A.; and Gall, J. 2018. NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7386–7395.
Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2008. The Graph Neural Network Model. IEEE Transactions on Neural Networks 20(1): 61–80.
Schiele, B.; Andriluka, M.; Amin, S.; and Rohrbach, M. 2012. A Database for Fine Grained Activity Detection of Cooking Activities. In IEEE Conference on Computer Vision & Pattern Recognition.
Sener, F.; and Yao, A. 2018. Unsupervised Learning and Segmentation of Complex Activities from Video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8368–8376.
Shao, D.; Zhao, Y.; Dai, B.; and Lin, D. 2020. FineGym: A Hierarchical Video Dataset for Fine-Grained Action Understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2616–2625.
Shi, L.; Zhang, Y.; Cheng, J.; and Lu, H. 2019. Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12026–12035.
Stein, S.; and Mckenna, S. J. 2013. Combining Embedded Accelerometers with Computer Vision for Recognizing Food Preparation Activities. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing.
Sun, C.; Shetty, S.; Sukthankar, R.; and Nevatia, R. 2015. Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images. In Proceedings of the 23rd ACM International Conference on Multimedia, 371–380.
Toyer, S.; Cherian, A.; Han, T.; and Gould, S. 2017. Human Pose Forecasting via Deep Markov Models. In 2017 International Conference on Digital Image Computing: Techniques and Applications (DICTA), 1–8. IEEE.
Urban, T. L.; and Russell, R. A. 2003. Scheduling Sports Competitions on Multiple Venues. European Journal of Operational Research 148(2): 302–311.
Wang, B.; Ma, L.; Zhang, W.; and Liu, W. 2018. Reconstruction Network for Video Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7622–7631.
Xu, J.; Mei, T.; Yao, T.; and Rui, Y. 2016. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5288–5296.
Yan, S.; Xiong, Y.; and Lin, D. 2018. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In AAAI.
Zhang, D.; Dai, X.; Wang, X.; Wang, Y.-F.; and Davis, L. S. 2019. MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1247–1257.
Zhou, L.; Xu, C.; and Corso, J. 2018. Towards Automatic Learning of Procedures from Web Instructional Videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
