Goswami, K., Sarkar, R., Chakravarthi, B. R., Fransen, T., McCrae, J. P., Unsupervised Deep Language and Dialect Identification for Short Texts, in Proceedings of the 28th International Conference on Computational Linguistics, (Barcelona, Spain (online), 2020-04-08), International Committee on Computational Linguistics, Barcelona, Spain (online) 2020: 1606-1617. [10.18653/v1/2020.coling-main.141] [https://hdl.handle.net/10807/270154]

Unsupervised Deep Language and Dialect Identification for Short Texts

Fransen, Theodorus
2020

Abstract

Automatic Language Identification (LI) or Dialect Identification (DI) of short texts in closely related languages or dialects is one of the primary steps in many natural language processing pipelines. Language identification is considered a solved task in many cases; however, for very closely related languages, or in an unsupervised scenario (where the languages are not known in advance), performance is still poor. In this paper, we propose the Unsupervised Deep Language and Dialect Identification (UDLDI) method, which can simultaneously learn sentence embeddings and cluster assignments from short texts. The UDLDI model captures the sentence constructions of languages by applying attention to character relations, which helps to optimize the clustering of languages. We performed our experiments on three short-text datasets for different language families, each consisting of closely related languages or dialects, with minimal training sets. Our experimental evaluations on these datasets show significant improvement over state-of-the-art unsupervised methods, and our model outperforms state-of-the-art LI and DI systems in supervised settings.
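To make the unsupervised setting concrete, the following is a minimal sketch of clustering short texts by language without any labels. It is not the UDLDI model: the learned sentence embeddings and character-level attention described in the abstract are replaced here by plain character bigram counts and a simple spherical k-means, purely for illustration. All function names (`char_ngrams`, `cluster_texts`, and so on) and the toy input strings are assumptions introduced for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Count the character n-grams of one short text (space-padded)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def normalize(vec):
    """Scale a sparse vector (dict of feature -> weight) to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else dict(vec)


def cosine(a, b):
    """Cosine similarity of two unit-norm sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def mean_vector(vectors):
    """Unit-norm centroid of a list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for k, v in vec.items():
            total[k] += v
    return normalize(total)


def cluster_texts(texts, k, iterations=20):
    """Assign each text to one of k clusters using spherical k-means."""
    vectors = [normalize(char_ngrams(t)) for t in texts]
    # Deterministic farthest-first initialisation: start from the first
    # text, then repeatedly add the text least similar to the chosen centroids.
    centroids = [vectors[0]]
    while len(centroids) < k:
        far = min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        centroids.append(far)
    labels = []
    for _ in range(iterations):
        # Assignment step: pick the most similar centroid for each text.
        new_labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the procedure has converged
        labels = new_labels
        # Update step: recompute each centroid from its current members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels


if __name__ == "__main__":
    # Synthetic toy strings standing in for two "languages" (an assumption;
    # these are not samples from the paper's datasets).
    toy = ["aaaa abab", "abba baab", "xyxy yyxx", "xxyy yxyx"]
    print(cluster_texts(toy, k=2))
```

A fixed bigram representation like this one cannot adapt its features to the clustering objective, which is the gap that jointly learning embeddings and cluster assignments, as the abstract describes, is meant to address.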
2020
English
Proceedings of the 28th International Conference on Computational Linguistics
28th International Conference on Computational Linguistics
Barcelona, Spain (online)
8-apr-2020
13-apr-2024
International Committee on Computational Linguistics

Use this identifier to cite or link to this document: https://hdl.handle.net/10807/270154