The Cardamom Workbench for Historical and Under-Resourced Languages

Doyle, Adrian; Fransen, Theodorus; Stearns, Bernardo; McCrae, John P.; Dereza, Oksana; Rani, Priya

This paper describes the creation of a workbench tool designed to make technologies developed throughout the lifespan of the Cardamom project easily accessible to researchers who could most benefit from them, but who may not have the technical expertise to apply bleeding-edge technologies to their own datasets. The workbench provides an intuitive graphical user interface (GUI) and workflow that shield users from the underlying technical tasks, while giving them access to a suite of powerful NLP tools developed by the Cardamom team. These include tokenisers, POS-taggers, various annotation tools, and ML models. The performance of workbench tools can be improved as users add text and annotations. It is envisioned that this workbench will provide a simple route to digital publication for academics in the humanities, and more specifically for linguists working with under-resourced or historical languages, who have collected text data but are unable to make it available online because of financial or technical constraints. This has the added benefit of increasing the availability of high-quality annotated text data to NLP researchers, thereby providing value to both communities of researchers.

Doyle, A., Fransen, T., Stearns, B., McCrae, J. P., Dereza, O., Rani, P., The Cardamom Workbench for Historical and Under-Resourced Languages, in Proceedings of the 4th Conference on Language, Data and Knowledge, (Vienna, Austria, 12-15 September 2023), NOVA CLUNL, Portugal, Lisbon 2023: 109-120 [https://hdl.handle.net/10807/270174]