
Details

Author(s) / Contributors
Title
Video Moment Retrieval With Noisy Labels
Is part of
  • IEEE Transactions on Neural Networks and Learning Systems, 2024-05, Vol.35 (5), p.6779-6791
Place / Publisher
United States: IEEE
Year of publication
2024
Source
IEEE/IET Electronic Library (IEL)
Descriptions/Notes
  • Video moment retrieval (VMR) aims to localize the target moment in an untrimmed video according to a given natural language query. Existing algorithms typically rely on clean annotations to train their models; however, annotations produced by human labelers can introduce considerable noise, so video moment retrieval models are often not well trained in practice. In this article, we present a simple yet effective video moment retrieval framework with a bottom-up schema that is trained end to end and is robust to noisy labels. Specifically, we extract multimodal features with syntactic graph convolutional networks and multihead attention layers and fuse them via cross gates and a bilinear approach. Feature pyramid networks are then constructed to encode rich scene relationships and capture high-level semantics. Furthermore, to mitigate the effects of noisy annotations, we devise multilevel losses at two levels: a frame-level loss that improves noise tolerance and an instance-level loss that reduces the adverse effects of negative instances. At the frame level, we apply Gaussian smoothing to treat noisy labels as soft labels through partial fitting. At the instance level, we use a pair of structurally identical models that teach each other across training iterations. The resulting robust video moment retrieval model significantly outperforms state-of-the-art approaches on the standard public datasets ActivityCaption and textually annotated cooking scene (TACoS). We also evaluate the proposed approach under different levels of manual annotation noise to further demonstrate the effectiveness of our model.
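
The two noise-handling ideas named in the abstract, Gaussian-smoothed soft labels at the frame level and mutual teaching between two structurally identical models at the instance level, can be illustrated with a short PyTorch-style sketch. This is only an illustration under assumed names: gaussian_soft_labels, frame_level_loss, per_instance_loss, sigma, and keep_ratio are hypothetical, and the small-loss selection rule is a common co-teaching heuristic, not necessarily the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def gaussian_soft_labels(start_idx, end_idx, num_frames, sigma=2.0):
        # Frame-level idea: replace hard, possibly noisy boundary annotations
        # with Gaussian-smoothed soft labels so the model only partially fits
        # the annotated boundary frames.
        t = torch.arange(num_frames, dtype=torch.float32)
        start_soft = torch.exp(-(t - start_idx) ** 2 / (2 * sigma ** 2))
        end_soft = torch.exp(-(t - end_idx) ** 2 / (2 * sigma ** 2))
        return start_soft / start_soft.sum(), end_soft / end_soft.sum()

    def frame_level_loss(start_logits, end_logits, start_soft, end_soft):
        # Cross-entropy against the soft targets instead of one-hot noisy labels.
        loss_start = -(start_soft * F.log_softmax(start_logits, dim=-1)).sum()
        loss_end = -(end_soft * F.log_softmax(end_logits, dim=-1)).sum()
        return loss_start + loss_end

    def co_teaching_step(model_a, model_b, opt_a, opt_b, batch, keep_ratio=0.8):
        # Instance-level idea: two structurally identical models teach each other.
        # Here each model trains only on the small-loss instances selected by its
        # peer (an assumed co-teaching heuristic, not the paper's exact rule).
        losses_a = per_instance_loss(model_a, batch)  # hypothetical helper: one loss per instance
        losses_b = per_instance_loss(model_b, batch)
        k = max(1, int(keep_ratio * losses_a.numel()))
        idx_for_b = torch.argsort(losses_a.detach())[:k]  # A selects instances for B
        idx_for_a = torch.argsort(losses_b.detach())[:k]  # B selects instances for A

        opt_a.zero_grad()
        losses_a[idx_for_a].mean().backward()
        opt_a.step()

        opt_b.zero_grad()
        losses_b[idx_for_b].mean().backward()
        opt_b.step()

In this sketch, sigma controls how much probability mass spreads to frames near the annotated boundary, which is what makes the fit to a possibly misplaced label only partial; keep_ratio controls how many instances each model accepts from its peer per iteration.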
