Sunday, October 24, 2010

Dying embers of lost passion: Post-mortem of Finnish Annotator



What Finnish Annotator?


Finnish Annotator was my CALL (computer-assisted language learning) website, developed around 2005-2006. In those years, I was finishing my studies and spent my summers writing the website. The site featured an annotator for Finnish and Chinese, a flashcard program and a character-drawing exercise. I took it down in 2008 as it had no users.

An annotator is a "text dictionary": it decodes the inflection of each word in a copy-pasted text and looks up an explanation for it. While Google Translate is free, annotators are more useful for language learners. You can keep reading the original text for as long as you understand it, hovering your mouse over an annotation only when you have to.
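As a rough illustration of the idea (not the site's actual code), an annotator can be sketched as a tokenizer plus a dictionary lookup. The `GLOSSARY` table and the `annotate` function are hypothetical names made up for this example, and the glossary is keyed by surface form only to keep the sketch short; a real annotator would deinflect each word first.

```python
import re

# Hypothetical glossary keyed by surface form: word -> (base form, gloss).
# A real annotator would deinflect words instead of listing inflected forms.
GLOSSARY = {
    "kun": ("kun", "when"),
    "talossa": ("talo", "house (inessive: 'in the house')"),
}

def annotate(text):
    """Split text into word and non-word tokens and attach a gloss where known.

    Returns a list of (token, entry) pairs; entry is None for spaces,
    punctuation and words missing from the glossary.
    """
    annotated = []
    for token in re.findall(r"\w+|\W+", text):
        annotated.append((token, GLOSSARY.get(token.lower())))
    return annotated

if __name__ == "__main__":
    for token, entry in annotate("Kun olin talossa."):
        if entry is not None:
            base, gloss = entry
            print(f"{token}: {base} = {gloss}")
```

Because the non-word tokens are kept, joining the tokens back together reproduces the original text, which is what lets the annotations sit on top of the text as hover targets instead of replacing it.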

The entry page shows how it annotated Chinese text. It also describes how you could turn a copy-pasted text into a flashcard deck. The post about the fundamental problem of flashcards mentioned that my website tried to solve it by taking example sentences from the annotated text. Indeed, after you pressed "show answer", it showed an annotated example sentence where "kun" was used.

The Finnish demo contained a 1000-word test vocabulary. The demo page used to work in all browsers, but currently crashes Firefox. Because I was acutely aware of the need for context, the word definitions contain well-split meanings and example phrases, and sometimes even comparisons and contrasts with related words.

At the bottom of the Chinese entry page there is a screenshot of the character-drawing exercise, where you move the brush with your mouse and the stroke appears if you are moving the brush correctly. This may seem similar to Skritter, since both programs were influenced by WriteChinese, a piece of prior art from the nineties.

Morphology engine and master's thesis


About half of the code in the website deinflects Finnish words. Finnish inflection is very complex: for example, nouns can take four different types of suffixes. The site used two-level morphology and state machines to decode the words. These methods were somewhat dated, but they were proven to work for Finnish and clearly described in Koskenniemi's book. Modern methods would have required access to commercial state machine libraries, which I didn't have.
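To give a feel for what deinflection involves, here is a toy sketch. It is plain suffix stripping, not two-level morphology, and it is not the site's code. It assumes the four suffix types are number, case, possessive and clitic, attached in that order; the `LEXICON`, `SLOTS` and `deinflect` names and the tiny data tables are invented for illustration. It ignores consonant gradation, vowel harmony and stem changes, which are the kind of alternations that two-level rules are designed to handle.

```python
# Toy lexicon, invented for illustration only.
LEXICON = {"talo": "house", "auto": "car"}

# Assumed suffix slots, in the order they attach to a noun stem.
# Every slot contains "" because each suffix is optional.
SLOTS = [
    ("number", {"": "singular", "i": "plural"}),
    ("case", {"": "nominative", "ssa": "inessive", "sta": "elative", "lla": "adessive"}),
    ("possessive", {"": None, "ni": "my", "si": "your"}),
    ("clitic", {"": None, "kin": "also", "ko": "question"}),
]

def deinflect(word):
    """Return all (stem, features) analyses whose stem is found in LEXICON."""
    analyses = []

    def strip(rest, slot_index, features):
        # All slots consumed: accept only if what is left is a known stem.
        if slot_index < 0:
            if rest in LEXICON:
                analyses.append((rest, features))
            return
        slot_name, suffixes = SLOTS[slot_index]
        for suffix, label in suffixes.items():
            if rest.endswith(suffix):
                stem_side = rest[: len(rest) - len(suffix)]
                found = features if label is None else {**features, slot_name: label}
                strip(stem_side, slot_index - 1, found)

    # Suffixes are peeled off from the end, so start with the last slot.
    strip(word, len(SLOTS) - 1, {})
    return analyses

if __name__ == "__main__":
    print(deinflect("taloissanikin"))  # talo + i + ssa + ni + kin
```

For "taloissanikin" ("in my houses too") the sketch finds a single analysis with the stem "talo". A word such as "kaupassa" (from "kauppa") would already defeat it, because the stem itself changes.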

My thesis described the algorithms in the morphology engine. It used mathematical notation and also contained a few proofs. When I submitted it, it got full points.

The algorithm for compiling two-level inflection rules contained a minor simplification. The thesis examiner said that it was actually publishable research, but I didn't follow up on that, since I was not planning to return to school. Anyway, it kind of demonstrates that I already know how to do research; I just don't know how to identify it and wrap it into a form that can be sent to conferences and journals.

How it failed


Since I consider myself economically rational and didn't work for two summers, I had to rationalize away the cognitive dissonance somehow. My feeble excuse was that I was building a semi-commercial system, which would continue to mill extra income after the initial setup effort. In practice, what I did was closer to a mild form of hikikomori.

Firstly, I didn't tell many people about the system, thinking that I would publish the product when it was ready. Therefore not a single person became interested enough in it to give feedback and criticize away obvious weaknesses, which were easy to correct but to which I was blind, having spent too much time on it. For example, the need to log in first was such a weakness.

Also, in those days I had not yet discovered the Game of Talking and I kept getting bad outcomes in human relationships without really understanding what the hell went wrong. When I wrote last year "Most people develop these surfacial skills as young adults. Unfortunately, you can't skip the development of social skills. If you fail to complete this developmental task as a young person, it will continue to haunt you and drag you down until you solve it.", I also had Finnish Annotator in mind. My lack of social skills severely limited my ability to get feedback on the system.

The system was quite close in function not just to the MDBG annotator, but also to Lukutulkki, a commercial system for annotating English text for Finnish speakers. Had I presented it right, some CALL researchers would probably have become interested in it.

The most damaging hit from the commercial mindset was my reluctance to use vocabularies with gray copyright status. It was also a question of quality, as the dict vocabularies had neither split meanings nor example phrases. I actually started to collect my own Finnish vocabulary. In the end, it had inflections for about 5000 words, and meanings and example phrases for somewhat over 1000 words. At that point, Google Translate added Finnish, so I thought that there was no way in hell I would get the vocabulary collected before free services offered something better than what I had. Since the system had no users, I took it down. It was a really idiotic move to start collecting vocabulary from scratch. I believe now that Takkirauta's talk about Manstein's matrix has a seed of truth: if you notice that you are doing a lot of repetitive informational work (like vocabulary collection), you are probably doing something wrong and should stop to ponder different options. Don't just do something, stand there!

The main lessons I learned from it are the importance of social skills and the awareness that I am prone to obsessive-compulsive tunnel vision, which makes me exert a lot of effort when the right solution would be to look at different options.

What parts of it are still useful


Before I can apply for graduate studies, I need to find a research group. Finnish Annotator is my main merit for persuading others to include me in their work and publications. Next, I'll list examples of how the technologies and components in FA could contribute to CALL research.

The character-drawing engine can be modified to train students to write Russian or Arabic characters. In the first Arabic course I attended, learning to read the script was a huge part of the course. Speeding it up with a spaced repetition system, which gradually introduces new material after ensuring that the student has mastered its dependencies, could make a big enough difference for a publishable paper.
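The dependency idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Card` class, the `MASTERY_STREAK` threshold of three correct answers in a row and the interval-doubling rule are all invented for the example, and the scheduling is far cruder than what a real spaced repetition system would use.

```python
from dataclasses import dataclass, field

MASTERY_STREAK = 3  # assumed threshold: correct answers in a row

@dataclass
class Card:
    name: str
    prerequisites: list = field(default_factory=list)
    interval_days: int = 0  # 0 means the card has not been introduced yet
    streak: int = 0

def review(card, correct):
    """Update a card after one review: double the interval or start over."""
    if correct:
        card.streak += 1
        card.interval_days = max(1, card.interval_days * 2)
    else:
        card.streak = 0
        card.interval_days = 1

def unlocked(deck):
    """Names of not-yet-introduced cards whose prerequisites are all mastered."""
    return [
        card.name
        for card in deck.values()
        if card.interval_days == 0
        and all(deck[p].streak >= MASTERY_STREAK for p in card.prerequisites)
    ]

if __name__ == "__main__":
    deck = {
        "alif": Card("alif"),
        "ba": Card("ba"),
        "word: bab": Card("word: bab", prerequisites=["alif", "ba"]),
    }
    print(unlocked(deck))  # only the letters, which have no prerequisites
    for _ in range(MASTERY_STREAK):
        review(deck["alif"], True)
        review(deck["ba"], True)
    print(unlocked(deck))  # the word becomes available once both letters are mastered
```

The point of the gating is that a word written with the letters alif and ba is never shown before both letters have been drilled, so each new card only asks the student to learn one new thing.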

Since two-level morphology can handle Finnish inflection, it can deal with almost any language. Annotation works best when embedded in other services. FA didn't just annotate copy-pasted text; it also annotated any example sentences in the flashcards. Annotation can be integrated into existing CALL research projects to boost them.
