Multilingual sentence embeddings capture rich semantic information not only for measuring similarity between texts but also for catering to a broad range of downstream cross-lingual NLP tasks. State-of-the-art multilingual sentence embedding models require large parallel corpora to learn efficiently, which confines the scope of these models. In this paper, we propose a novel sentence embedding framework based on an unsupervised loss function for generating effective multilingual sentence embeddings, eliminating the need for parallel corpora. We capture semantic similarity and relatedness between sentences using a multi-task loss function for training a dual encoder model mapping different languages onto the same vector space. We demonstrate the efficacy of an unsupervised as well as a weakly supervised variant of our framework on STS, BUCC and Tatoeba benchmark tasks. The proposed unsupervised sentence embedding framework outperforms even supervised state-of-the-art methods for certain under-resourced languages on the Tatoeba dataset and on a monolingual benchmark. Further, we show enhanced zero-shot learning capabilities for more than 30 languages, with the model being trained on only 13 languages. Our model can be extended to a wide range of languages from any language family, as it overcomes the requirement of parallel corpora for training.

Goswami, K., Dutta, S., Assem, H., Fransen, T., McCrae, J. P., Cross-lingual Sentence Embedding using Multi-Task Learning, in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, (Online and Punta Cana, 07-11 November 2021), Association for Computational Linguistics, Punta Cana 2021: 9099-9113. [10.18653/v1/2021.emnlp-main.716] [https://hdl.handle.net/10807/270179]

Cross-lingual Sentence Embedding using Multi-Task Learning

Fransen, Theodorus
2021

2021
English
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
2021 Conference on Empirical Methods in Natural Language Processing
Online and Punta Cana
7 November 2021
11 November 2021
978-1-955917-09-4
Association for Computational Linguistics

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10807/270179