Convolutional Transformer with Similarity-based Boundary Prediction for Action Segmentation
Is part of
2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI), 2022, p.855-860
Place / Publisher
IEEE
Year of publication
2022
Source
IEEE Electronic Library Online
Descriptions/Notes
Action classification has made great progress, but segmenting and recognizing actions in long videos remains a challenging problem. Recently, Transformer-based models with strong sequence modeling ability have succeeded in many sequence modeling tasks. However, the lack of inductive bias and the difficulty of handling long video sequences limit the application of the Transformer to the action segmentation task. To explore the potential of the Transformer in this task, we replace specific linear layers in the vanilla Transformer with dilated temporal convolutions, and a sparse attention mechanism is used to reduce the time and space complexity of processing long video sequences. Moreover, training the model directly with a frame-wise classification loss treats frames at the boundaries of actions the same as those in the middle of actions, so the learned features are not sensitive to boundaries. We propose a new local log-context attention module to predict whether each frame is at the beginning, middle, or end of an action. Since boundary frames are similar to their neighboring frames of different classes, our similarity-based boundary prediction helps learn more discriminative features. Extensive experiments on three datasets show the effectiveness of our method.
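The following is a minimal sketch, not the authors' implementation, of the general idea the abstract describes: replacing the position-wise linear (feed-forward) layers of a Transformer encoder block with dilated 1-D temporal convolutions to add local inductive bias for frame sequences. All module names, hyper-parameters, and the use of PyTorch's standard dense multi-head attention (instead of the paper's sparse attention) are assumptions made for illustration only.

```python
import torch
import torch.nn as nn


class DilatedConvFeedForward(nn.Module):
    """Feed-forward sub-layer with dilated temporal convolutions in place
    of the usual position-wise linear layers (illustrative, not the paper's)."""

    def __init__(self, dim, hidden_dim, dilation):
        super().__init__()
        # kernel_size=3 with padding=dilation keeps the temporal length unchanged
        self.conv1 = nn.Conv1d(dim, hidden_dim, kernel_size=3,
                               padding=dilation, dilation=dilation)
        self.conv2 = nn.Conv1d(hidden_dim, dim, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):            # x: (batch, time, dim)
        y = x.transpose(1, 2)        # -> (batch, dim, time) for Conv1d
        y = self.conv2(self.act(self.conv1(y)))
        return y.transpose(1, 2)     # back to (batch, time, dim)


class ConvTransformerBlock(nn.Module):
    """One encoder block: self-attention plus dilated-conv feed-forward."""

    def __init__(self, dim=64, heads=4, dilation=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = DilatedConvFeedForward(dim, 2 * dim, dilation)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        a, _ = self.attn(x, x, x)          # dense attention here; the paper uses a sparse variant
        x = self.norm1(x + a)
        return self.norm2(x + self.ff(x))


# Example: a sequence of 1000 frame features of dimension 64
frames = torch.randn(1, 1000, 64)
out = ConvTransformerBlock(dilation=2)(frames)
print(out.shape)                            # torch.Size([1, 1000, 64])
```

Growing the dilation across stacked blocks enlarges the temporal receptive field without adding parameters, which is one common motivation for using dilated convolutions on long frame sequences.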