Authors and developers: Grigori Sidorov, Alexander Gelbukh.
Go to the personal page of Grigori Sidorov.
Go to the personal page of Alexander Gelbukh.
NEW:
A complete wordlist (beta-version) generated with this system is available, see
below.
This is a program that
performs lemmatization and provides grammatical information for each word form in
a sentence. See the detailed description below.
The system is an EXE file for
Windows. A DLL is available on request.
The system processes
about 5 KB of text per second on a Pentium 4 computer.
For the time being, we use BDE
for data access, so performance is expected to improve in future versions.
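As a rough reading of the throughput figure above, the following sketch estimates how long a file of a given size would take to process. The function name is hypothetical, and treating 1 KB as 1024 bytes is an assumption; the real speed will depend on the machine and the text.

```python
def estimated_seconds(size_bytes: int, kb_per_second: float = 5.0) -> float:
    """Estimate processing time, assuming a constant rate in KB per second.

    Assumption: 1 KB = 1024 bytes. The default rate is the approximate
    figure quoted for a Pentium 4, not a measured value.
    """
    return size_bytes / (kb_per_second * 1024)


if __name__ == "__main__":
    one_megabyte = 1024 * 1024
    print(f"{estimated_seconds(one_megabyte):.1f} s for a 1 MB text file")
```

At the quoted rate, a 1 MB text file works out to a little under three and a half minutes.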
Dictionary size is about 25,000
head words.
The current release is a beta. We
would be grateful for your comments, suggestions, and bug reports.
We will provide updated
versions of the dictionary.
If you need a list of the words
added to the system's dictionary, please contact us.
The system contains a text
file "complex.dic" with compound words (a_partir_de, etc.). The same file can be used as a user
dictionary for single words (though we do not recommend it).
1. You can use this
program freely for academic purposes. No warranty.
2. You should inform
us about the usage of the program, and
3. You should cite the
corresponding paper (see below) in your publications obtained with the help of
the program or derived from any project, thesis, etc., that used the program.
The citation is to be put under the section References of the paper and not in
Acknowledgements, footnotes, etc. You are also invited to cite other related
papers, see www.cic.ipn.mx/~sidorov
and www.gelbukh.com/CV/CV.htm#Publications.
We would be grateful if you inform us about such citations.
Paper for citing:
A. Gelbukh,
G. Sidorov. Approach
to construction of automatic morphological analysis systems for inflective
languages with little effort. In:
Computational Linguistics and Intelligent Text Processing (CICLing-2003),
Lecture Notes in Computer Science, N 2588, Springer-Verlag,
2003, pp. 215–220.
Downloading means that you
accept the license. Thank you.
Download
the system (release 18/04/2007).
Download
the updated EXE only (instead of reinstalling the system, you can just
overwrite the EXE file) (release 18/04/2007).
Download wordlist (release 14/07/2007).
Note that in irregular forms and
adverbs the stem and the inflection are not separated; instead, the mark “@” is
added at the end.
abalanz-are V0SF3S0 abalanzar
abalanz-áremos V0SF1P0 abalanzar
abalanz-areis V0SF2P0 abalanzar
abalanz-aren V0SF3P0 abalanzar
abalanz-a V0R02S0 abalanzar
abalánz-ame [me] V0R02S0 abalanzar
abalánz-ate [te] V0R02S0 abalanzar
...
abrazadera- NCFS000 abrazadera
abrazadera-s NCFP000 abrazadera
abrazo- NCMS000 abrazo
abrazo-s NCMP000 abrazo
...
aerospacial- AGIS000 aerospacial
aerospacial-es AGIP000 aerospacial
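The sample lines above can be read programmatically. The following is a minimal sketch, not part of the distributed system. It assumes the field layout that the sample suggests: a word form with a hyphen between stem and inflection, an optional bracketed item such as [me] (assumed here to be an attached clitic), a grammatical tag, and the lemma. The names WordlistEntry and parse_wordlist_line are hypothetical.

```python
from typing import NamedTuple, Optional


class WordlistEntry(NamedTuple):
    stem: str
    ending: Optional[str]   # None for unsegmented forms marked with "@"
    clitic: Optional[str]   # assumed meaning of the bracketed field, e.g. "me"
    tag: str
    lemma: str


def parse_wordlist_line(line: str) -> WordlistEntry:
    """Parse one wordlist line, assuming whitespace-separated fields."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    form, rest = fields[0], fields[1:]

    # Optional bracketed field between the word form and the tag.
    clitic = None
    if rest and rest[0].startswith("[") and rest[0].endswith("]"):
        clitic = rest[0][1:-1]
        rest = rest[1:]

    if len(rest) != 2:
        raise ValueError(f"expected a tag and a lemma in: {line!r}")
    tag, lemma = rest

    if form.endswith("@"):
        # Irregular forms and adverbs: no stem/inflection split.
        stem, ending = form[:-1], None
    else:
        stem, sep, ending = form.rpartition("-")
        if not sep:  # no hyphen at all: treat the whole form as the stem
            stem, ending = form, ""
    return WordlistEntry(stem, ending, clitic, tag, lemma)


if __name__ == "__main__":
    entry = parse_wordlist_line("abalánz-ame [me] V0R02S0 abalanzar")
    print(entry.stem + (entry.ending or ""), entry.tag, entry.lemma)
```

Joining stem and ending gives back the surface word form; for lines marked with “@” the ending is left as None so that callers can tell unsegmented forms apart.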
The input file is any standard
text file (mind the encoding: DOS or ANSI).
The output is written to a file with the
same name plus the prefix “c_”, for example, “archive1.txt” →
“c_archive1.txt”.
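For scripts that post-process the results, the naming rule above can be sketched as follows. The function name is hypothetical, and the sketch assumes the output file is written to the same folder as the input, which the description does not state.

```python
from pathlib import Path


def output_path(input_path: str) -> Path:
    """Return the expected output file name: the input name prefixed with "c_".

    Assumption: the output file sits in the same directory as the input.
    """
    p = Path(input_path)
    return p.with_name("c_" + p.name)


if __name__ == "__main__":
    print(output_path("archive1.txt"))
```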
Format of the output:
Word (word_number_in_sentence) lemma1 (*info1) (number_of_grammar_info1) lemma2 (*info2) (number_of_grammar_info2)...
Example of the output:
me (13) yo (*PP1CSR0) (0)
encontré (14) encontrar (*VMID1S0) (0)
a (15) a (*SPS00) (0)
toda (16) todo (*DI3FS00) (0) todo (*PI3FS00) (1)
la (17) ella (*PP3FSR0) (0) la (*TDFS0) (1)
tripulación (18) tripulación (*NCFS000) (0)
hacinada (19) hacinar (*VMP0000) (0)
a (20) a (*SPS00) (0)
un (21) un (*MCMS00) (0) un (*TIMS0) (1)
lado (22) lado (*NCMS000) (0)
navío (24) navío (*NCMS000) (0)
, (25) , (*FC) (0)
In the field "Info",
the asterisk (*) should be ignored.
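A record in this format can be read with a short script. The sketch below is illustrative only and is not shipped with the system. It collapses all whitespace first, so a record that is wrapped over several lines is handled the same way as one on a single line, and it drops the asterisk from the info field. The names Analysis, Token and parse_output_record are hypothetical.

```python
import re
from typing import List, NamedTuple


class Analysis(NamedTuple):
    lemma: str
    tag: str     # info field with the leading asterisk removed
    index: int   # number of this grammar info for the word


class Token(NamedTuple):
    word: str
    position: int            # word number in the sentence
    analyses: List[Analysis]


# Word and its position, followed by the remaining analyses.
_HEAD = re.compile(r"^(\S+) \((\d+)\) ?(.*)$")
# One "lemma (*info) (n)" group.
_ANALYSIS = re.compile(r"(\S+) \(\*?([^)\s]+)\) \((\d+)\)")


def parse_output_record(record: str) -> Token:
    """Parse one output record into a word, its position, and its analyses."""
    text = " ".join(record.split())  # normalize line breaks and spacing
    head = _HEAD.match(text)
    if head is None:
        raise ValueError(f"not an output record: {record!r}")
    word, position, rest = head.groups()
    analyses = [
        Analysis(m.group(1), m.group(2), int(m.group(3)))
        for m in _ANALYSIS.finditer(rest)
    ]
    return Token(word, int(position), analyses)


if __name__ == "__main__":
    token = parse_output_record("toda (16) todo (*DI3FS00) (0) todo (*PI3FS00) (1)")
    for analysis in token.analyses:
        print(token.word, token.position, analysis.lemma, analysis.tag)
```

An ambiguous word such as "toda" yields one Analysis per reading, in the order in which the system lists them.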
The encoding scheme is similar
to the one used in the LEXESP corpus. For example, the first symbol gives the part of speech: N
- noun, V - verb, A - adjective, R - adverb, C - conjunction, S - preposition,
P - pronoun, etc.
There is an example of
decoding in the downloaded file.
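As a small illustration of reading the first symbol, the sketch below maps it to a part of speech. It covers only the letters listed above; any other first symbol is reported as "unknown" rather than guessed. The names CATEGORY and category_of are hypothetical, and the decoding example in the downloaded file remains the authoritative reference.

```python
# First-symbol codes listed in the description; other codes are not covered.
CATEGORY = {
    "N": "noun",
    "V": "verb",
    "A": "adjective",
    "R": "adverb",
    "C": "conjunction",
    "S": "preposition",
    "P": "pronoun",
}


def category_of(tag: str) -> str:
    """Return the part of speech for a tag, ignoring a leading asterisk."""
    tag = tag.lstrip("*")
    return CATEGORY.get(tag[:1], "unknown")


if __name__ == "__main__":
    for tag in ("*NCFS000", "*VMID1S0", "*SPS00"):
        print(tag, "->", category_of(tag))
```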