Sie befinden Sich nicht im Netzwerk der Universität Paderborn. Der Zugriff auf elektronische Ressourcen ist gegebenenfalls nur via VPN oder Shibboleth (DFN-AAI) möglich. mehr Informationen...
Ergebnis 5 von 37457
ACM transactions on multimedia computing communications and applications, 2020-06, Vol.16 (2), p.1-23
2020
Volltextzugriff (PDF)

Details

Autor(en) / Beteiligte
Titel
Dual-path Convolutional Image-Text Embeddings with Instance Loss
Ist Teil von
  • ACM transactions on multimedia computing communications and applications, 2020-06, Vol.16 (2), p.1-23
Erscheinungsjahr
2020
Quelle
Alma/SFX Local Collection
Beschreibungen/Notizen
  • Matching images and sentences demands a fine understanding of both modalities. In this article, we propose a new system to discriminatively embed the image and text to a shared visual-textual space. In this field, most existing works apply the ranking loss to pull the positive image/text pairs close and push the negative pairs apart from each other. However, directly deploying the ranking loss on heterogeneous features (i.e., text and image features) is less effective, because it is hard to find appropriate triplets at the beginning. So the naive way of using the ranking loss may compromise the network from learning inter-modal relationship. To address this problem, we propose the instance loss, which explicitly considers the intra-modal data distribution. It is based on an unsupervised assumption that each image/text group can be viewed as a class. So the network can learn the fine granularity from every image/text group. The experiment shows that the instance loss offers better weight initialization for the ranking loss, so that more discriminative embeddings can be learned. Besides, existing works usually apply the off-the-shelf features, i.e., word2vec and fixed visual feature. So in a minor contribution, this article constructs an end-to-end dual-path convolutional network to learn the image and text representations. End-to-end learning allows the system to directly learn from the data and fully utilize the supervision. On two generic retrieval datasets (Flickr30k and MSCOCO), experiments demonstrate that our method yields competitive accuracy compared to state-of-the-art methods. Moreover, in language-based person retrieval, we improve the state of the art by a large margin. The code has been made publicly available.
Sprache
Englisch
Identifikatoren
ISSN: 1551-6857
eISSN: 1551-6865
DOI: 10.1145/3383184
Titel-ID: cdi_crossref_primary_10_1145_3383184
Format

Weiterführende Literatur

Empfehlungen zum selben Thema automatisch vorgeschlagen von bX