March 15, 2010
Errors Made In Basque Analyzed For Use In Automatic Correctors And Language Learning Tools
For a number of years now, the IXA group of the Faculty of Information Technology at the University of the Basque Country (UPV/EHU) has been carrying out research aimed at developing semi-automatic systems of benefit to the Basque language (Euskara). These systems include the automatic treatment of mistakes in Basque and tools for learning the language by means of information technology. In her PhD thesis presented at the UPV/EHU, Ms Larraitz Uria, a member of the IXA group, has laid the foundations for developing both kinds of system by establishing several criteria for the analysis of errors and deviations.
Ms Uria's PhD thesis is entitled Euskarazko erroreen eta desbideratzeen analisirako lan-ingurunea. Determinatzaile-erroreen azterketa eta prozesamendua (Working environment for the analysis of errors and deviations in the Basque language. Evaluation and processing of errors with determiners). First, errors were differentiated from deviations, which is one of the most important contributions of the research. Errors are mistakes in spelling or grammar. Deviations are grammatically correct words that are used inappropriately in a given context, and they are related to register or dialect. The distinction is relevant because the automatic systems of the future are expected to tell the two concepts apart.
In her thesis Ms Uria describes two databases in which examples and details of errors and deviations have already begun to be stored. These have been put into operation by the IXA group and each is adapted to its own application. The first stores the information needed to develop the automatic treatment of errors in Euskara (correctors, markers of dialectal variations, etc.). The second gathers data that facilitates the creation of tools for learning the language using information technology. It is highly unusual to combine these two lines of work, but much of the data for the automatic treatment of errors is also useful for learning the language using information technology, and vice versa. This is one of the contributions of the work.
Essential for developing a detector of errors
Another contribution is the corpus, which is already up and running and is the main pillar of the databases. The first examples of errors and deviations were extracted from it, an essential step in designing a system capable of detecting them. A corpus of 113,290 words was compiled from texts written in Basque by students at various levels. A number of students' texts in technical Basque and texts by ordinary speakers were also included. With the criteria for creating the corpus defined, a significant amount of information has been collected in this first stage in order to start the analysis process.
The first step in this analysis is labeling. Specifically, in this PhD thesis, and as a starting point for the research, it was mostly mistakes involving determiners that were labeled. Given that mistakes with determiners in Basque are not very common but are serious when they do occur, Ms Uria considered them an appropriate example for a preliminary trial. In any case, the intention is to develop the detection of all kinds of errors and deviations in the future. For the labeling process, EtikErro, an editor created by the IXA group, was used. Apart from labeling errors, it exports the labeled examples to the database, along with the linguistic information required for their analysis.
The thesis also makes a substantial contribution to the classification stage, which comes just after labeling. The main structure of the classification was defined, with the category for errors with determiners developed in particular detail. Finally, once these stages had been completed, work began on creating the two databases. Both store the same examples and linguistic information, but they differ in the additional information they hold. The database for the automatic treatment of errors in Basque includes technical information, whereas the database for learning the language using information technology stores psycholinguistic information.
First results of automatic treatment
Ms Uria, together with the IXA group, has already carried out the first trials to test the results produced by the automatic treatment of errors, based on the tools mentioned. Using a technique and a set of rules suited to mistakes made with determiners, she measured the precision of the treatment. That is, she checked the efficacy of the treatment using a computer program. The precision was only 45.5% at the beginning. Nevertheless, if non-labeled errors are first eliminated, the "background noise" disappears and the precision rises to 80%. Ms Uria also concluded that the more extensive the corpus, the greater the efficacy of the treatment. The contribution of this thesis is no more than the first step in a challenge for the future.
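As a rough illustration of how such a precision figure can be computed, the following Python sketch may help. It is only an assumption about the general method, not the procedure used in the thesis: the status names and the toy counts are invented for the example, and "non-labeled errors" is read here as items flagged by the system that have no corresponding label in the corpus.

```python
# Hypothetical illustration: the statuses and counts below are invented,
# they are NOT taken from the thesis or from the IXA group's data.
from typing import List


def precision(statuses: List[str]) -> float:
    """Share of flagged items that correspond to a labeled error."""
    if not statuses:
        return 0.0
    hits = sum(1 for status in statuses if status == "labeled_error")
    return hits / len(statuses)


# Each entry describes one item flagged by an (assumed) rule-based detector.
flagged = (
    ["labeled_error"] * 3      # flagged and labeled in the corpus
    + ["unlabeled_error"] * 2  # flagged, but the corpus has no label for it
    + ["no_error"] * 1         # flagged although nothing is wrong
)

# Precision over everything the detector flagged.
precision_all = precision(flagged)

# Precision after first discarding the items with no label in the corpus.
filtered = [status for status in flagged if status != "unlabeled_error"]
precision_filtered = precision(filtered)

print(f"all flagged items: {precision_all:.1%}")
print(f"without non-labeled errors: {precision_filtered:.1%}")
```

With these toy counts, precision goes from 3 out of 6 to 3 out of 4 once the non-labeled items are set aside, which mirrors the direction of the change reported above without reproducing its figures.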
About the author
Ms Larraitz Uria Garín (Hernani, 1977) is a graduate in English Language and in Primary Education Teaching. She wrote her thesis under the direction of Ms Igone Zabala Unzalu and Ms Montse Maritxalar Anglada, from the Department of Basque Language and the Information Technology Faculty, respectively. She is currently a researcher with the IXA group at the UPV/EHU and with the IKER group at the University of Bayonne.