Morphology Beyond Inflection. Building a Word Formation Based Lexicon for Latin

Litta Modignani Picozzi, Eleonora Maria Gabriella

Computational linguistics is gradually turning its attention to building new derivational morphology resources and tools. This is especially true of modern languages, with resources such as DeriNet,1 the lexical network for Czech, and DErivBase,2 the derivational lexicon for German. For the Classical languages, although lexical resources and NLP tools (especially for Latin) are now numerous and varied, there has so far been no attempt to create a derivational morphology tool in which lemmas are segmented and analysed into their derivational components, so as to establish relationships between them on the basis of word formation. With such a tool, for example, the verbal noun amator can be connected to the verb amo through suffixation with -a-tor.

The first steps towards a word-formation-based lexicon for Latin were made by Marco Passarotti and Francesco Mambrini in 2012, in a paper proposing a model for the semi-automatic extraction of word formation rules and the subsequent pairing of each lemma with its morphologically simplest (i.e. non-derived) lemma.3 In this context, the Word Formation Latin project (WFL) has been awarded a Marie Curie individual fellowship to expand on these efforts and create a definitive derivational lexicon for Classical Latin. The lexicon will ultimately be integrated into LEMLAT, the automatic lemmatiser for Latin (http://www.ilc.cnr.it/lemlat/lemlat/index.html, accessed 21/01/2016, due to be updated soon), creating a comprehensive resource for the study of Latin morphology.

The data are collected and organised in a MySQL relational database according to the following steps:
a) A list of lemmas is automatically extracted from the LEMLAT dataset.
b) The word formation rules (WFRs) are conceived according to the Item-and-Arrangement model, which considers word forms either as simple morphemes (simplex) or as concatenations of morphemes fulfilling the following conditions: 1) base and affixes are both lexical elements, i.e. both are morphemes (Baudouin’s assumption); 2) they are dualistic, having both form and meaning (Bloomfield’s “sign-base” morpheme theory); 3) they both exist in a lexicon (Bloomfield’s “lexical morpheme” theory) (Passarotti and Mambrini 2012; Hockett 1954). In Passarotti and Mambrini (2012), a list of WFRs was obtained both manually and automatically, then identified and formalised in a table according to their type (prefixal, suffixal, compound and conversion) and according to the category of transformation undergone by the input lexical element (N-to-N, N-to-V, N-to-A, etc.). In the first phase of the WFL project, input and output candidate lemmas are found automatically for each WFR by means of SQL queries (an output lemma can belong to only one WFR). In phase 2, morphological families are induced from the data. A morphological family is the set of lemmas morphologically derived from one common ancestor lemma: all lemmas, simple or complex, that share the same base are assigned to the same morphological family.

1 Ševčíková, Magda, and Zdeněk Žabokrtský. 2014. “Word-Formation Network for Czech.” In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), 1087–93.
2 Zeller, Britta D., Jan Šnajder, and Sebastian Padó. 2013. “DErivBase: Inducing and Evaluating a Derivational Morphology Resource for German.” In ACL (1), 1201–11. http://anthology.aclweb.org/P/P13/P13-1118.pdf
3 Passarotti, Marco, and Francesco Mambrini. 2012. “First Steps towards the Semi-automatic Development of a Word-formation-based Lexicon of Latin.” In Proceedings of LREC 2012, Istanbul, Turkey, 852–859.
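The candidate search of the first phase can be illustrated with a minimal sketch that runs one prefixal rule against a toy lemma table. Everything in it is an assumption made for illustration: the table layout, the function name prefixal_candidates, the sample lemmas and the choice of the prefix ad-. SQLite stands in for the project's MySQL database so that the example runs without a server; the actual WFL queries are not reproduced here.

```python
import sqlite3

# Hypothetical schema and toy data: not the WFL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lemma (id INTEGER PRIMARY KEY, form TEXT, pos TEXT)")
conn.executemany(
    "INSERT INTO lemma (form, pos) VALUES (?, ?)",
    [
        ("duco", "V"),
        ("adduco", "V"),
        ("venio", "V"),
        ("advenio", "V"),
        ("adeps", "N"),  # starts with "ad" but is not a prefixed verb
    ],
)


def prefixal_candidates(conn, prefix, pos_in, pos_out):
    """Pair each lemma of category pos_out whose form is exactly
    prefix + the form of an existing lemma of category pos_in."""
    query = """
        SELECT base.form, derived.form
        FROM lemma AS derived
        JOIN lemma AS base ON derived.form = ? || base.form
        WHERE base.pos = ? AND derived.pos = ?
        ORDER BY derived.form
    """
    return conn.execute(query, (prefix, pos_in, pos_out)).fetchall()


# V-to-V rule with the prefix ad-: a list of (input lemma, output lemma) pairs.
pairs = prefixal_candidates(conn, "ad", "V", "V")
print(pairs)
```

A plain string concatenation like this only matches transparent formations. Rules involving morphotactic changes at the boundary between prefix and base would need looser matching, and their output would need more manual checking.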
Finally, the members of each family are automatically linked to each other according to their part of speech, inflectional category and affixes, by means of the WFR assignment. The simple lemma is assigned the role of ancestor of the family. This automatic procedure is not regarded as definitive for building morphological families; rather, it provides filtered data that must be checked manually. Manual checking identifies false results, duplicates and gaps left by the automatic process, and manual hard-coding is necessary for lemmas produced by poorly productive WFRs or by morphotactically obscure word formation processes. For example, in the treatment of the rule that forms nominal adjectives by adding the suffix -a-cius/-a-cis/-a-x, the SQL script generates two candidate inputs for fugax: fuga and fugium. This duplicate result needs to be analysed and rectified, because there must be only one simple input form for each output form, just as there must be only one WFR associated with each derived lemma. The language resource is evaluated by manually checking data organised into homogeneous groups based on WFRs (coverage of rules) and on stemming (coverage of morphological families). Precision and recall will be used as evaluation metrics to measure the rates of positive and negative results. Since the start of the project three months ago, 68 prefixal and 27 suffixal rules have been covered, and around 7500 lemmas have been assigned to a WFR. The precision of the SQL queries is higher when there are fewer morphotactic mutations: for prefixal rules precision is about 80% to 95%, while for the first suffixal rules treated it ranges from 75% down to as little as 30%.
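The checks described in this paragraph can be sketched in a few lines of Python, under stated assumptions. The function names, the data layout and the sample links are hypothetical: fuga/fugium and fugax, and amo and amator, come from the text, while amatorius is added only to show a two-step chain. The confirmed set is a placeholder for the outcome of manual checking, not a project decision about any lemma.

```python
from collections import defaultdict

# Illustrative candidate links (input lemma, output lemma), shaped like query results.
candidates = [
    ("fuga", "fugax"),
    ("fugium", "fugax"),
    ("amo", "amator"),
    ("amator", "amatorius"),
]


def ambiguous_outputs(links):
    """Return output lemmas that were paired with more than one input lemma."""
    inputs = defaultdict(set)
    for base, derived in links:
        inputs[derived].add(base)
    return {d: sorted(b) for d, b in inputs.items() if len(b) > 1}


def build_families(links):
    """Group lemmas under their non-derived ancestor (one input per output assumed)."""
    parent = {derived: base for base, derived in links}

    def ancestor(lemma):
        seen = set()
        while lemma in parent and lemma not in seen:  # guard against cycles
            seen.add(lemma)
            lemma = parent[lemma]
        return lemma

    families = defaultdict(set)
    for base, derived in links:
        families[ancestor(derived)].update((base, derived))
    return {root: sorted(members) for root, members in families.items()}


def precision(proposed, confirmed):
    """Share of proposed links that manual checking confirmed."""
    proposed = set(proposed)
    return len(proposed & set(confirmed)) / len(proposed) if proposed else 0.0


flagged = ambiguous_outputs(candidates)  # outputs left for manual resolution
unambiguous = [(b, d) for b, d in candidates if d not in flagged]
families = build_families(unambiguous)
confirmed = {("amo", "amator"), ("amator", "amatorius")}  # placeholder only
score = precision(candidates, confirmed)
print(flagged, families, score)
```

Families are built only from links whose output has a single input candidate, so an unresolved case such as fugax stays out of every family until it has been checked by hand.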
These results are to be considered provisional, as fine-tuning the queries and excluding lemmas that appear on an ever-growing list of already assigned lemmas can reduce the discrepancy between query-generated results and the manual screening of candidates. Recall will have to be evaluated at the end of the project, as it is currently impossible to verify how many lemmas the queries fail to pick up. The presentation will illustrate the methodology employed to build the digital resource, the challenges that Latin poses as a dead language, the progress made against the project schedule, and a mock-up visualisation of the final result.

Litta Modignani Picozzi, E. M. G., Morphology Beyond Inflection. Building a Word Formation Based Lexicon for Latin, in Formal Representation and the Digital Humanities, (Verona, 28-29 June 2016), Cambridge Scholars Publishing, Newcastle upon Tyne 2018: 97-114 [http://hdl.handle.net/10807/130504]