Tagger

INFO

Read this first!

The information below is long and detailed, but do not let that put you off. Here is a brief guide that lets you start using most of the applications without reading the detailed Info page below.
From the drop-down menu on the left, choose the application that you wish to use. In the left text box, type anything that IS NOT found in the target text. The easiest way is to press some key at least three times, because such a string is very unlikely to be part of the target text. Then press the tab Tag. A short text that helps you to use the application will be printed in the right text box.
Some applications also print long lists that help the user to choose the correct search key.
More detailed information follows further down on this page.

These instructions apply to the screen of a normal PC. If you use a smartphone or tablet, the view is different. Instead of two text boxes side by side, only one text box is visible on the screen, and that is the left box. The right box, where the output appears, is underneath and can be scrolled into view. When reading the output, holding the phone horizontally might work best.

Intelligent search

In this location, there are three intelligent search machines. Each of these machines is applied to four text corpora, (a) Suomen perustuslaki, (b) vuoden 1938 Raamattu, (c) vuoden 1992 Raamattu, and (d) Apokryfikirjat.

The search machines make use of language analysis and disambiguation. This makes the search results both complete and precise. That is, all relevant hits will be found, and the result contains only the hits that were searched for.

The target text is in the following format:
1Moos_1:1 Alussa {alku_N} Jumala {Jumala_ERISN} loi {luoda_V} taivaan {taivas_N} ja {ja_CC} maan {maa_N}.

In the Biblical texts, each line is one verse.
In Perustuslaki, each line is one sentence. A paragraph in Perustuslaki is cut into separate sentences, each with a unique numerical symbol.

Here we use the concepts surface form and base form.
For example, Alussa is a surface form, and {alku_N} is a base form.
It is important to understand the difference between these two, because with some search engines the search can be directed to surface form or base form.
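
To make the distinction concrete, here is a minimal sketch of how a line in the target format could be split into surface forms and base forms. The regular expression and the function name parse_line are assumptions made for illustration; they are not part of the search system itself.

```python
import re

# Assumed pattern: a surface word followed by its base form as {lemma_POS}.
TOKEN = re.compile(r"(\S+)\s+\{([^_{}]+)_([A-Z]+)\}")

def parse_line(line):
    """Split a corpus line into its identifier and (surface, lemma, POS) triples."""
    ident, _, rest = line.partition(" ")
    return ident, TOKEN.findall(rest)

line = ("1Moos_1:1 Alussa {alku_N} Jumala {Jumala_ERISN} loi {luoda_V} "
        "taivaan {taivas_N} ja {ja_CC} maan {maa_N}.")
ident, tokens = parse_line(line)
for surface, lemma, pos in tokens:
    # e.g. surface form "Alussa" has the base form "alku" with the tag "N"
    print(surface, lemma, pos)
```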

If the search key produces an empty result, you will get information on why this happened. You will also get examples of how to formulate search keys in that particular application.

Search engines with the extension -EASY

In this search method, the typed word will first be analysed, and the analysis result will be the actual search word. The analysis system finds the base form of the word, and also shows its part-of-speech.
Example: If you type the word kirjoissaankaan, it will be converted to the form {kirja_N}. The search will be done on the basis of this form.
In some cases the search word is ambiguous. For example, the search word tuli can be a noun or a verb. The ambiguity can be resolved by typing a form of the word that is not ambiguous. The form tulessa will return only the noun {tuli_N}. The form tulivat will return only the verb {tulla_V}.
Note that with this system you cannot search for surface strings. Neither can you combine words or use partial word search. For this purpose we have the search engines with the extension -PRO.
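
The conversion from a typed word to a search key can be pictured with the following toy sketch. The dictionary ANALYSES is only a stand-in for the runtime analyser and contains just the four examples mentioned above; the function names are hypothetical.

```python
# Toy stand-in for the runtime analyser: typed word -> possible base forms.
# A real analyser covers the whole language; these entries are only the
# examples given in the text.
ANALYSES = {
    "kirjoissaankaan": ["{kirja_N}"],
    "tuli": ["{tuli_N}", "{tulla_V}"],
    "tulessa": ["{tuli_N}"],
    "tulivat": ["{tulla_V}"],
}

def to_search_keys(word):
    """Return the base-form keys that a typed word is converted to."""
    return ANALYSES.get(word, [])

def is_ambiguous(word):
    """A word is ambiguous when the analysis gives more than one base form."""
    return len(to_search_keys(word)) > 1

for word in ("kirjoissaankaan", "tuli", "tulessa", "tulivat"):
    print(word, to_search_keys(word), "ambiguous" if is_ambiguous(word) else "")
```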

Search engines with the extension -PRO

In this search method, the search is targeted to the analysed and disambiguated text file, as in the above method. The difference is that here we do not use the runtime analyser. We direct the search to the corpus, which has the original text, and the lemma and its part-of-speech tag after each surface word. Therefore, we can do many types of searches.

A search engine with the extension -PRO is a flexible search system. It allows precise search of the word using the defined lemma in the search box. For example, if you enter {perustuslaki_N} into the search box, all sentences containing any form of the noun 'perustuslaki' will be found.
If you enter {laki_N} into the search box, all the sentences containing any form of the word 'laki' will be found.
If you leave the word boundary mark '{' out of the search word, you will get less precise output. For example, the search word laki_N will find all the nouns whose base form ends with 'laki', such as 'laki', 'perustuslaki', 'hallintolaki' etc.
In order to get only the base form hits and not surface form hits, you must include the underscore '_' into the search string, e.g. laki_.

Below is a description of all search possibilities of the search engines with the extension -PRO. The types of examples used here apply to all four text corpora.

In precise base form search, we make use of the language analysis, and instead of searching for the surface string we search for the lemma of the word coupled with its word class tag. For example, if you enter {kirja_N} into the search box, you will find all those verses where the noun 'kirja' occurs in any form, and nothing else. You can also abbreviate the search key by omitting the final } and you will get the same result. You can also omit the part-of-speech (POS) tag N and you will get the same result. But you must be careful when omitting features from the search key, because the stem can sometimes have more than one POS tag. If you omit both curly braces and type kirja_N, you will find all occurrences of the word 'kirja', but also all words where the last part of the base form is 'kirja', such as lainkirja and erokirja. If you want to find only the forms of the word 'kirja', you should type {kirja_N or {kirja_.
Note that the output shows all the lines, where the key word occurs, and shows also the base form of the surface form after it.
In output, base forms are surrounded with curly braces, { and }.
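
The effect of the boundary mark can be illustrated with a small sketch that treats the search key as a literal substring of the tagged line. This is an assumption about the matching, made only for illustration, and the three sample lines are invented; they are not taken from the corpora.

```python
# Invented sample lines in the target format (not real corpus lines).
LINES = [
    "1 Perustuslaki {perustuslaki_N} tulee {tulla_V} voimaan {voima_N}.",
    "2 Lain {laki_N} mukaan {mukaan_POST}.",
    "3 Kirja {kirja_N} on {olla_V} hyvä {hyvä_A}.",
]

def search(key, lines):
    """Return the lines whose tagged text contains the key as a literal substring."""
    return [line for line in lines if key in line]

# With the boundary mark, only the lemma 'laki' itself matches.
print(search("{laki_N", LINES))
# Without it, every lemma that ends in 'laki' matches as well.
print(search("laki_N", LINES))
```

Because {perustuslaki_N} contains the string laki_N but not the string {laki_N, the first search returns one line and the second returns two.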

NOTE! For base word boundary marks { and }, there is a single alternative mark '.' (full stop, or dot). For the underscore _ between the stem and the part-of-speech code, there are two alternative marks, + (plus sign) and - (dash). These aliases were added, so that it is easier to use the search system with mobile devices.
Therefore, all these search keys work in the same way (vertical bar '|' added here for separating individual search keys): {laki_ | .laki_ | .laki- | .laki+ | {laki- | {laki+
Also the following search keys (in this case for finding all nouns) work in the same way: _N} | -N} | +N} | _N. | -N. | +N
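
One way to picture the aliases is as a normalisation step that rewrites them into the canonical marks before the search. The sketch below is an assumption about how this could be done, and the function name normalise_key is hypothetical.

```python
def normalise_key(key):
    """Rewrite the alias marks into the canonical marks {, } and _."""
    # A leading dot stands for the left boundary mark '{'.
    if key.startswith("."):
        key = "{" + key[1:]
    # A trailing dot stands for the right boundary mark '}'.
    if key.endswith("."):
        key = key[:-1] + "}"
    # '+' and '-' stand for the underscore between stem and POS code.
    # Simplification: this would also rewrite a hyphen inside a word.
    return key.replace("+", "_").replace("-", "_")

for key in ("{laki_", ".laki_", ".laki-", ".laki+", "{laki-", "{laki+"):
    print(key, "->", normalise_key(key))
```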

Searching full surface words
You can also search for full surface forms of words. In this case you will find the lines where the input string occurs as a full word. The search system does not check what its base form or POS is. Capital and lowercase letters are also treated as different. For example, if you want to find all occurrences of the surface word kuningas, you can type the search string [Kk]uningas or (K|k)uningas. The output will contain the occurrences with a capital initial and those with a lowercase initial.
In output, full surface words are surrounded with angle brackets < and >.

Searching two or three consecutive words
It is possible to search for two or three consecutive words with two methods.
In one method, you should type the words exactly in the form, where they are in target text. Also capital initial letters matter.
There should be one or more empty spaces between words.
Example: Jumalan sana
Example: Kaikessa julkisessa toiminnassa
In output, the word cluster is surrounded with angle brackets < and >.

In the other method, you should type each key word in a form that links it to the base form.
Example: Jumala_ sana_
Example: kaikki_ julkinen_ toiminta_
In output, the word is surrounded with curly braces { and }.

Free searching of partial surface words
It is also possible to use such search strings, which match with part of the word only. The search string can be any part of the target word. The string must occur somewhere in the word. This search method does not check in which part of the word the match is.
Examples: vanh, hursk, kaus
Words found with free partial string search are surrounded with square brackets [ and ].

NOTE! In targeting at surface words, you can use any capital and lower-case letters. However, the search string must have at least two characters. The only exceptions are the capital letters N, A and V, which are part-of-speech codes (N=noun, A=adjective, V=verb). Using one of these three letters alone as a search string, all words of that category will be retrieved. Note that in these cases the search is targeted to base forms.

Controlled searching of partial surface words
For searching partial strings in a controlled way, you can use the asterisk '*' to show that only part of the search word is typed into the search box. If you put the asterisk in the end of the word, all words which match with the search word before asterisk will be printed. If you put the asterisk in the beginning of the word, all words that match with the search word after asterisk will be printed.
Note that if you use asterisk, the part of the word that you type must be the whole beginning part or end part of the word. You cannot type just some letters inside the word.
Examples: vanhur*, *hurskaus
laki* will find such words as laki, lakialoite, lakiosa.
*laki will find such words as laki, perustuslaki, avioliittolaki.

But if you omit asterisk, you will get the following output:
laki will find such words as laki, perustuslaki, lakialoite, perustuslakivaliokunta.
Words found with controlled partial string search are surrounded with square brackets [ and ].
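
The difference between the three kinds of partial search can be sketched by translating the search key into a regular expression. The translation below is an illustrative assumption and not the system's own code; key_to_regex and hits are hypothetical names.

```python
import re

def key_to_regex(key):
    """Translate a partial-word search key into a regular expression."""
    if key.endswith("*"):
        # laki*  : the typed part must be the whole beginning of the word
        return re.compile(r"\b" + re.escape(key[:-1]) + r"\w*")
    if key.startswith("*"):
        # *laki  : the typed part must be the whole end of the word
        return re.compile(r"\w*" + re.escape(key[1:]) + r"\b")
    # laki   : free search, the typed part may be anywhere in the word
    return re.compile(r"\w*" + re.escape(key) + r"\w*")

WORDS = "laki perustuslaki lakialoite perustuslakivaliokunta kirja"

def hits(key):
    """Return the words of WORDS that the key matches."""
    return key_to_regex(key).findall(WORDS)

print(hits("laki*"))   # words beginning with 'laki'
print(hits("*laki"))   # words ending with 'laki'
print(hits("laki"))    # words containing 'laki' anywhere
```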

Using Boolean operators
With this option it is also possible to use two Boolean operators, 'OR' and 'AND'.
OR-operator: You can make a search string, such as 'laki OR oikeus' for finding one of these two words with context. Also three options in search string are possible, such as 'laki OR oikeus OR rauha'. If one of these is found, the hit will be printed.
AND-operator: With this operator you will force the words to co-occur on the same line. For example, the search string 'laki AND oikeus' will find lines where the words 'laki' and 'oikeus' occur. Note that the words must be in the given order on the line. By switching the search words you will find lines, where the words are in the opposite order. Also here, it is possible to use three criteria, such as 'laki AND oikeus AND rauha'. All three words must occur on the same line and in the given order.
You can also use both operators in the same search string, such as 'laki AND oikeus OR rauha'. In this case you will find such lines, which have the word 'laki' and either 'oikeus' or 'rauha'.
The operator OR has also these alternative forms: 'or', 'TAI', 'tai', 'AU' and 'au'.
The operator AND has also these alternative forms: 'and', 'JA', 'ja', 'NA' and 'na'.
Note that when you type the search string, you must not surround it with single quotes. In this text the quotes are just for the sake of clarity.
Also note that although you type the search words in base form, the system finds also all inflected forms of the word. If there is ambiguity in POS classification, so that the word may belong to two POS categories, you can specify the search by typing also the POS category to the word. For example, if you type 'viranomainen_A AND kieli_N', you will find the lines, which have the adjective viranomainen and the noun kieli.
If you type 'viranomainen_N AND kieli_N', the system will print lines, where viranomainen is a noun.
There should be one or more empty spaces between words.
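
The behaviour of the operators described above can be sketched as follows. The sketch assumes that OR binds more tightly than AND, as in the example 'laki AND oikeus OR rauha', and that the AND parts must be found in the given order on the line. It works on a plain list of the base forms of one line; the function names are hypothetical.

```python
OR_WORDS = {"OR", "or", "TAI", "tai", "AU", "au"}
AND_WORDS = {"AND", "and", "JA", "ja", "NA", "na"}

def parse_query(query):
    """Parse a query into a list of AND-groups, each a list of OR-alternatives."""
    groups, current = [], []
    for token in query.split():
        if token in AND_WORDS:
            groups.append(current)
            current = []
        elif token not in OR_WORDS:
            current.append(token)
    groups.append(current)
    return groups

def matches(lemmas, groups):
    """True if every group matches some base form, in the order of the groups."""
    position = 0
    for alternatives in groups:
        found = next((i for i in range(position, len(lemmas))
                      if lemmas[i] in alternatives), None)
        if found is None:
            return False
        position = found + 1
    return True

groups = parse_query("laki AND oikeus OR rauha")
print(groups)
print(matches(["laki", "ja", "rauha"], groups))   # 'laki' first, then 'rauha'
print(matches(["rauha", "ja", "laki"], groups))   # wrong order
```

Note that the operator aliases such as 'ja' are consumed as operators in this sketch, so they cannot be used as search words here.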

Currently there are the following search engines with the extension -PRO:
FIND-PERLAKI-PRO (Suomen Perustuslaki)
FIND-VALMAKI-PRO (Suomen Valmiuslaki 2013)
FIND-RAAM1938-PRO (Suomenkielinen Raamattu 1938)
FIND-RAAM1992-PRO (Suomenkielinen Raamattu 1992)
FIND-RAAM1992-VT-PRO (Suomenkielinen Raamattu, Vanha Testamentti 1992)
FIND-RAAM1992-UT-PRO (Suomenkielinen Raamattu, Uusi Testamentti 1992)
FIND-APO-PRO (Apokryfikirjat)

Separate search of Vanha Testamentti and Uusi Testamentti

Search can be targeted to the Vanha Testamentti by using the search engine FIND-RAAM1992-VT-PRO.
Search can be targeted to the Uusi Testamentti by using the search engine FIND-RAAM1992-UT-PRO.

Search engines with the extension -COUNT

Search engines with the extension -COUNT make it possible to produce statistical lists of the word or words found in text. The output lists the words in alphabetical order and shows the number of hits for each word.
You can produce several kinds of lists.
The kind of list produced depends on which kinds of marked words the output contains.

Base form lists
If you indicate in the search string that you look for base forms, the system produces lists of words surrounded by curly braces { and }.

Lists of full surface words
If your search string targets full surface words, the system produces lists of words surrounded by angle brackets < and >.

Lists of partial surface words
If your search string targets partial surface words, the system produces lists of words surrounded by square brackets [ and ].

Mixed lists
There are cases where the search string matches with all three kinds of hits. In these cases the list contains all three types of hits.
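
The counting can be sketched as follows, assuming that the hits in the search output are marked with curly braces, angle brackets or square brackets as described above. The two sample lines are invented, and the function name count_marked is hypothetical.

```python
import re
from collections import Counter

# Matches {base form}, <full surface word> or [partial surface word].
MARKED = re.compile(r"\{[^{}]+\}|<[^<>]+>|\[[^\[\]]+\]")

def count_marked(lines):
    """Count the marked hits and list them alphabetically by the word inside the marks."""
    counts = Counter(MARKED.findall("\n".join(lines)))
    return sorted(counts.items(), key=lambda item: item[0][1:-1].lower())

# Invented sample output containing all three kinds of marks.
sample = [
    "1 <Lain> {laki_N} mukaan",
    "2 [perustuslaki] {laki_N}",
]
for word, n in count_marked(sample):
    print(n, word)
```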

Lists of POS categories
You may produce lists on the basis of POS categories.
These are the POS categories currently in the system:
V = verb
N = noun
A = adjective
ADV = adverb
NUM = numeral
PREP = preposition
POST = postposition
PRON = pronoun
QUEST = question
CONJ = conjunction
EXCLAM = exclamation
CC = coordinating conjunction
PROPN = proper name
ERISN = erisnimi (Finnish for proper name)
NEG = negation
You just type the POS tag into the search box and you will get a list of base words in that category.
Note that if you type a single-letter tag such as A, N, or V, you should add the end mark } after it. Otherwise the system will also find words whose capital initial matches the key letter.
Therefore A} and not bare A, and so on.

Search using lemma lists

FIND-RAAM1992-TEST is an implementation of the search engine using a disambiguated lemma list. The list was produced using the SALAMA analyser of Finnish.
With this search engine one may search for the occurrences of various surface word forms of the base form. You must type the word in the base form into the search box.
The system does not have POS information. Therefore, the search result is not fully accurate.

It is also possible to use Boolean operators, such as OR and AND for combining words in search string. The search words must be in base form. With OR you can combine several words, with the operator OR between each search word. With AND you can combine two words, which must be on the same line, and in the same order as in search string. It is also possible to combine these operators.
In output, the hit words are surrounded with curly braces { and }.

Feedback

The Bible search system (find-raam) gives the following kind of feedback:
1. If the word typed in the search box is a valid Finnish word but does not occur in the Bible, a message saying so is shown.
2. If the word typed in the search box is not a valid Finnish word, a message saying so is shown.
3. If you type a word that is valid Finnish but belongs to a word type that the system does not recognise, a message about the missing word types (such as pronouns, conjunctions, prepositions and postpositions) is shown.
4. If you type a word that occurs in the Bible, all Bible verses where the word occurs (in any form) are shown in the order in which they occur in the Bible.

Retrieving whole texts

It is also possible to retrieve whole texts. Perustuslaki and Apokryfikirjat can be retrieved to the right side box, each in one piece. Both Bible translations have been cut into three sections, two Old Testament sections and one New Testament section. This is because of capacity restrictions.
The names of the print engines start with the word PRINT-.
Into the search box you should type one of the following keys: search, hae, print, tulosta, printtaa. If you type something else, the whole text will not be printed.

If you type into the search box something that IS NOT in text, you will get a short info on how to use search keys in this application.

Retrieving a section of text

Particularly useful is the application where you can retrieve a section of text for reading. Each verse of the Bible has a unique code. By entering the code, or part of it, into the search box, you will get the part of the Bible text that starts with the matching string. For example, if you enter the code Mat_1 into the search box, you will get the text starting from the beginning of the New Testament and ending 1500 verses later. This is currently the size of the retrieved text, but it may change later.
For using this option, you should choose PRINT-RAAM1938 or PRINT-RAAM1992 from the drop-down menu.
There are also alternative program names in Finnish, LUE-RAAMATTUA1938 and LUE-RAAMATTUA1992.
You can retrieve a list of all identification codes by typing into the left text box something that IS NOT in target text. The short introduction and the list of codes will appear on the right text box.
Instead of the verse identification code, you can use also other search keys. The system retrieves 1500 lines of text starting from the first hit it encounters in Bible.
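
The retrieval described above amounts to finding the first matching line and returning a fixed number of lines from there. A minimal sketch, assuming the text is available as a list of lines; the function name read_section and the sample verse lines are hypothetical.

```python
def read_section(lines, key, size=1500):
    """Return up to `size` lines, starting at the first line that contains the key."""
    for start, line in enumerate(lines):
        if key in line:
            return lines[start:start + size]
    return []

# Invented sample lines; real lines start with the verse code.
verses = ["Mal_3:24 ...", "Mat_1:1 ...", "Mat_1:2 ...", "Mat_2:1 ..."]
print(read_section(verses, "Mat_1", size=2))
```

The default size of 1500 follows the figure given above, which may change later.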

Swahili Bible search

There are two search systems for the Swahili Bible, Union version.
FIND-BIBLIA-EASY allows searching with any wordform, provided that it is a Swahili word. However, the search is restricted to nouns (N), adjectives (A), proper names (PROPN) and verbs (V). All other word types are excluded. It is assumed that with this setup the search system serves the normal needs of the user. One may type any wordform in the search box, and the system finds all occurrences of the words that have the same lemma and POS class as the search word.

A particular feature in this system is that one may search also for multiword expressions, such as 'kutoa amri' and 'kuona njaa'. The words must be written as separate words in the search box. The verb can be in any form.

FIND-BIBLIA-PRO allows searching for surface wordforms as they are written in the text, and also searching on the basis of analysed wordforms. For a precise search, one must formulate the search word so that the system searches the base forms instead of inflected forms. The character '{' serves as the left boundary and '}' as the right boundary. If the underscore '_' appears in the search string, the search is always directed to the base form. For example, the search form 'ona_' finds all occurrences of the verb 'kuona'. The full search string would be '{ona_V}', but the shorter search string also finds all occurrences. However, the string 'ona_' would also find words such as {pona_V}. Therefore, if one wants to be sure that only the wanted occurrences are retrieved, one should demarcate the search string carefully. Remember that when using the underscore, you must put the word in base form, not in any inflected form.
Proper names do not inflect in Swahili. Therefore, they can be found using the written form of the proper name, for example 'Adamu'. However, in this implementation the proper names can also be searched with a lowercase initial, e.g. 'adamu', or 'adamu_', or 'adamu_PROPN', or 'adamu_P', or any such form which identifies it as the searched proper name. The full form of the word is '{adamu_PROPN}', and you can figure out the search form on the basis of it.
With this option it is possible to use also Boolean operators, such as 'OR' and 'AND'. For more details see Using Boolean operators above.

For searching partial strings, you can use the asterisk '*' to show that only part of the search word is typed into the search box. If you put the asterisk in the end of the word, all words which match with the search word before asterisk will be printed. If you put the asterisk in the beginning of the word, all words that match with the search word after asterisk will be printed. Note that if you use asterisk, the part of the word that you type must be the beginning or end of the word. You cannot type just some letters inside the word.
If you want to get hits that match with selected consecutive letters inside the word, you just type those letters into the search box without asterisk.

Reading sections of Swahili Bible

It is also possible to read a section of Swahili Bible by using the program SOMA-BIBLIA. Instructions are the same as in Retrieving a section of text above. The keys for all sections of Swahili Bible will be printed, if you type to the left text box something that IS NOT in target text and press Tab.

Salama Tagger

Salama Tagger performs the analysis and disambiguation of Swahili text. It comprehensively covers the vocabulary of current standard Swahili, as it is used in news media, in government reports, and in prose literature. The system is constantly updated as new words appear in the language. In addition to word analysis, the system has several thousand multiword expressions of various types. Salama Tagger also differs from most taggers in that it has disambiguated glosses in English.
The tagger produces two formats,
(1) basic format, which is the output of the parser,
(2) xml-format, which is suitable for compiling such text corpora that can be browsed with work benches such as Korp.

Morpho

Morpho gives plain and restricted analysis of text without disambiguation. The format of the output may change from time to time.

Vocabulary Compiler

There are five options for compiling vocabularies for a given text. The option VOC-ALL lists all lexical words of the text. It suits the absolute beginner. The other options remove common words to various degrees, as follows.
VOC-BASIC cuts off the 500 most common words.
VOC-MEDIUM cuts off the 1000 most common words.
VOC-ADVANCED cuts off the 1500 most common words.
VOC-EXPERT cuts off the 2000 most common words.
For example, the learner in initial stages of learning would perhaps need the option VOC-BASIC. It cuts off the 500 most common words of the language and produces the rest, if they are in the text. The option VOC-EXPERT would produce a much shorter list and be sufficient for an advanced student. It cuts off the 2000 most common words and produces the rest.
Each extended verb is considered a separate lexical entry. Therefore, a very common verb may appear on the list, if the extended form of the verb is not common.
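
The cut-off logic of the five options can be sketched as follows. The sketch assumes a list of the words of the language in frequency order, most common first; this list, the function name compile_vocabulary and the cut-off of 0 for VOC-ALL are assumptions made for illustration.

```python
# Number of most common words that each option cuts off.
CUTOFFS = {
    "VOC-ALL": 0,          # assumption: nothing is cut off
    "VOC-BASIC": 500,
    "VOC-MEDIUM": 1000,
    "VOC-ADVANCED": 1500,
    "VOC-EXPERT": 2000,
}

def compile_vocabulary(text_words, ranked_words, option):
    """List the words of the text that are not among the N most common words."""
    common = set(ranked_words[:CUTOFFS[option]])
    return sorted(set(text_words) - common)

# Placeholder frequency list: "w0" is the most common word, "w1" the next, etc.
ranked = [f"w{i}" for i in range(3000)]
text = ["w10", "w700", "w1800", "w2500"]
print(compile_vocabulary(text, ranked, "VOC-BASIC"))
print(compile_vocabulary(text, ranked, "VOC-EXPERT"))
```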

A new feature compared with normal vocabularies is that the system tries to figure out whether a cluster of words should be considered as a multiword expression. This depends on the context in sentence. Such clusters of multiword expressions are treated as single lexical entries with appropriate meanings in English.

Learning Swahili interactively

This website has a special learning environment for Swahili under the tab 'Learn'. It includes a number of guided lessons. In addition to this, we have a more flexible implementation of the learning system here under 'Tagger'.

There are two versions (Finnish and English) of introductory programs for learning how the system works. You may select SWA-LEARN-INTRO-FIN or SWA-LEARN-INTRO-ENG and type into the search box anything that is NOT Swahili. You will be led to the interactive introductory environment. Finally you will be guided to start the real learning system SWA-LEARN.

If you select the option SWA-LEARN, you may enter any Swahili word in the text box. Information on how to proceed will be displayed in the right side box.

You may proceed either by following the guided tour through the basic linguistic structures of the language, or by using your own vocabulary, in which case the system guides you if you make mistakes. It checks whether your word order is correct and tries to help you if possible. It also checks that the concord in your constructions is correct. The third thing that it checks is spelling.

You may retrieve lists of various types of Swahili words. The words are grouped into various clusters according to part-of-speech and noun class. Words in the lists are arranged in frequency order, the most common ones in the beginning. The lists are for aiding the learner to choose common words for practising structures.
The list of key words for printing various lists can be retrieved by typing into the left text box something that IS NOT in target text. Then press Tab. The list appears in the right side box.
More detailed information is in the web address:
http://www.njas.helsinki.fi/salama/rule-based-language-technology-and-self-tutored-language-learning-systems.pdf

Learning Finnish interactively

If you select the option SUO-LEARN, you may enter any Finnish word in the text box. Information on how to proceed will be displayed in the right side box.

Currently the learning system helps in learning Finnish noun phrases and the inflection of all Finnish verb types. You may use your own vocabulary, in which case the system guides you if you make mistakes. It checks whether your word order is correct and tries to help you if possible. It also checks that the concord in your constructions is correct. The third thing that it checks is spelling.
More detailed information is in the web address:
http://www.njas.helsinki.fi/salama/rule-based-language-technology-applied-to-learning-finnish.pdf

Learning Finnish subject and object cases

It is also possible to learn Finnish subject and object cases in short sentences. The cases are not easy to learn. Therefore, the learning system tests whether the cases are correct in each context. The system analyses the sentence and tells how the subject and object cases should be formed in each case. More detailed information is in the web address:
http://www.njas.helsinki.fi/salama/self-tutored-learning-of-finnish-subject-and-object-case.pdf

Converting Standard Finnish to Kitee dialect

When you select SUO-TO-KITEE, you can enter any type of Finnish text to the left box, and the result comes to the right box, converted into Kitee dialect. The text should be written in Standard Finnish. More detailed information is in the web address:
http://www.njas.helsinki.fi/salama/converting-standard-finnish-to-kitee-dialect.pdf

Correcting Finnish text

By selecting CORRECT-SUO, you can enter any Finnish text to the left box, and the result comes to the right box. The result is correct Finnish, where wrongly chosen words are replaced with correct ones. The system does not perform spell checking. It only keeps track of such language use that is not considered correct. More detailed information is in the web address:
http://www.njas.helsinki.fi/salama/correcting-text-via-language-analysis.pdf

Reading Finnish Bible in Kitee dialect

The Finnish Bible translation from year 1938 was converted into Kitee dialect, and this version can now be read online. Instructions on how to use the application are above under the heading Retrieving a section of text. Please note that if you want to find the list of codes for the books of the Bible, type to the left box something that is NOT part of Bible text.