System for automatic morphological analysis of Spanish

 

Authors and developers: Grigori Sidorov, Alexander Gelbukh, Francisco Velazquez, Liliana Chanona.

 

Go to the personal page of Grigori Sidorov.

Go to the personal page of Alexander Gelbukh.

 

NEW: A complete wordlist (beta version) generated with this system is available; see below.

 

This program performs lemmatization and provides grammatical information for each word form in a sentence. See the detailed description below.

The system is an EXE file for Windows. A DLL is available on request.

 

The system processes about 5 KB of text per second on a Pentium 4 computer.

For the time being, we use BDE for data access, so we expect performance to improve in future versions.

 

Dictionary size is about 25,000 head words.

 

The current release is a beta version. We would be grateful for your comments, suggestions, and bug reports.

 

We will provide updated versions of the dictionary.

If you need a list of the words added to the system's dictionary, please contact us.

 

The system contains a text file "complex.dic" with compound words (a_partir_de, etc.). The same file can be used as a user dictionary for single words (though we do not recommend it).

LICENSE:

1.      You can use this program freely for academic purposes. No warranty.

2.      You should inform us about the usage of the program, and

3.      You should cite the corresponding paper (see below) in your publications obtained with the help of the program or derived from any project, thesis, etc., that used the program. The citation is to be put in the References section of the paper and not in Acknowledgements, footnotes, etc. You are also invited to cite other related papers, see www.cic.ipn.mx/~sidorov and www.gelbukh.com/CV/CV.htm#Publications. We would be grateful if you informed us about such citations.

 

Paper for citing:

A. Gelbukh, G. Sidorov. Approach to construction of automatic morphological analysis systems for inflective languages with little effort. In: Computational Linguistics and Intelligent Text Processing (CICLing-2003), Lecture Notes in Computer Science, N 2588, Springer-Verlag, 2003, pp. 215–220.

Download:

Downloading means that you accept the license. Thank you.

 

Download the system (release 18/04/2007).

 

Download the updated EXE only (instead of reinstalling the system, you can just overwrite the EXE file) (release 18/04/2007).

 

Download wordlist (release 14/07/2007).

 

Example of the wordlist

 

Note that in irregular forms and adverbs the stem and the inflection are not separated; instead, the mark “@” is added at the end.

 

abalanz-are V0SF3S0 abalanzar

abalanz-áremos V0SF1P0 abalanzar

abalanz-areis V0SF2P0 abalanzar

abalanz-aren V0SF3P0 abalanzar

abalanz-a V0R02S0 abalanzar

abalánz-ame [me] V0R02S0 abalanzar

abalánz-ate [te] V0R02S0 abalanzar

...

abrazadera- NCFS000 abrazadera

abrazadera-s NCFP000 abrazadera

abrazo- NCMS000 abrazo

abrazo-s NCMP000 abrazo

...

aerospacial- AGIS000 aerospacial

aerospacial-es AGIP000 aerospacial
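
The wordlist lines shown above can be read with a short script. The following is only a minimal sketch in Python, not part of the distributed system. It assumes that fields are separated by whitespace, that an optional clitic appears in square brackets between the word form and the tag, and that the “@” mark is appended directly to the end of the word form. The names WordlistEntry and parse_wordlist_line are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class WordlistEntry(NamedTuple):
    stem: str               # part before the hyphen (whole form if irregular)
    ending: str             # part after the hyphen, possibly empty
    clitic: Optional[str]   # content of the optional [..] field, e.g. "me"
    tag: str                # grammatical tag, e.g. "NCFS000"
    lemma: str
    unsegmented: bool       # True when the form carries the "@" mark


# form, optional [clitic], tag, lemma (whitespace-separated; an assumption)
LINE_RE = re.compile(r"^(\S+)\s+(?:\[(\w+)\]\s+)?(\S+)\s+(\S+)$")


def parse_wordlist_line(line: str) -> Optional[WordlistEntry]:
    """Parse one wordlist line; return None for lines such as '...'."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    form, clitic, tag, lemma = m.groups()
    if form.endswith("@"):
        # Assumed layout for irregular forms and adverbs: no hyphen, "@" last.
        return WordlistEntry(form[:-1], "", clitic, tag, lemma, True)
    stem, _, ending = form.partition("-")
    return WordlistEntry(stem, ending, clitic, tag, lemma, False)


entry = parse_wordlist_line("abalánz-ame [me] V0R02S0 abalanzar")
print(entry.stem + entry.ending, entry.clitic, entry.tag, entry.lemma)
```

The full word form is recovered by concatenating stem and ending.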

 

Detailed description of the system for automatic morphological analysis of Spanish:

 

The input file is any standard text file (mind the encoding – DOS or ANSI).
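
If the text to be analyzed is stored in another encoding, it can be converted beforehand. The following Python sketch is not part of the system. It assumes that the source text is UTF-8, that "ANSI" corresponds to Windows code page 1252, and that "DOS" corresponds to code page 850; these code pages are our guess for Spanish text and may differ on your machine. The function name convert_encoding is hypothetical.

```python
def convert_encoding(src_path: str, dst_path: str,
                     src_encoding: str = "utf-8",
                     dst_encoding: str = "cp1252") -> None:
    """Rewrite a text file in another encoding, keeping line endings as they are.

    cp1252 is assumed for "ANSI"; pass dst_encoding="cp850" for the assumed
    "DOS" code page. A character that the target encoding cannot represent
    raises UnicodeEncodeError instead of being silently replaced.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)


# Hypothetical usage:
# convert_encoding("archive1_utf8.txt", "archive1.txt")
```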

 

The output is a file with the same name plus the prefix “c_”, for example, “archive1.txt” → “c_archive1.txt”.

 

Format of the output:

Word (word_number_in_sentence) lemma1 (*info1) (number_of_grammar_info1) lemma2 (*info2) (number_of_grammar_info2)...

 

Example of the output:

 

me (13) yo (*PP1CSR0)  (0)

encontré (14) encontrar (*VMID1S0)  (0)

a (15) a (*SPS00)  (0)

toda (16) todo (*DI3FS00)  (0) todo (*PI3FS00)  (1)

la (17) ella (*PP3FSR0)  (0) la (*TDFS0)  (1)

tripulación (18) tripulación (*NCFS000)  (0)

hacinada (19) hacinar (*VMP0000)  (0)

a (20) a (*SPS00)  (0)

un (21) un (*MCMS00)  (0) un (*TIMS0)  (1)

lado (22) lado (*NCMS000)  (0)

del (23) del (*SPCMS)  (0)

navío (24) navío (*NCMS000)  (0)

, (25) , (*FC)  (0)

 

In the field "Info", the asterisk (*) should be ignored.
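
The output format described above can be parsed as follows. This is a minimal Python sketch, not part of the system. It assumes that words and lemmas contain no spaces (compound words are joined with underscores, as in a_partir_de) and that each analysis has the form: lemma, tag in parentheses preceded by an asterisk, and a number in parentheses. The names Analysis, Token, and parse_output_line are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class Analysis(NamedTuple):
    lemma: str
    tag: str     # grammatical info with the leading asterisk removed
    index: int   # number of this analysis for the word (0, 1, ...)


class Token(NamedTuple):
    word: str
    position: int            # word number in the sentence
    analyses: List[Analysis]


# "word (position)" at the start of the line
HEAD_RE = re.compile(r"^(\S+)\s+\((\d+)\)\s*")
# "lemma (*TAG)  (n)", repeated once per analysis
ANALYSIS_RE = re.compile(r"(\S+)\s+\(\*([^)\s]+)\)\s+\((\d+)\)")


def parse_output_line(line: str) -> Optional[Token]:
    """Parse one line of the output file; return None if it does not match."""
    line = line.strip()
    head = HEAD_RE.match(line)
    if head is None:
        return None
    rest = line[head.end():]
    analyses = [
        Analysis(m.group(1), m.group(2), int(m.group(3)))
        for m in ANALYSIS_RE.finditer(rest)
    ]
    return Token(head.group(1), int(head.group(2)), analyses)


token = parse_output_line("la (17) ella (*PP3FSR0)  (0) la (*TDFS0)  (1)")
for analysis in token.analyses:
    print(token.word, analysis.lemma, analysis.tag, analysis.index)
```

Ambiguous words such as "la" yield one Analysis per reading, in the order in which they appear in the line.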

The encoding scheme is similar to the one used in the LEXESP corpus. For example, the first symbol indicates the part of speech: N - noun, V - verb, A - adjective, R - adverb, C - conjunction, S - preposition, P - pronoun, etc.

The downloaded file includes an example of decoding.
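
As an illustration, the first symbol of a tag can be decoded as shown below. This Python sketch covers only the symbols listed above; any other first symbol is reported as "unknown" rather than guessed. The names CATEGORY_BY_FIRST_SYMBOL and category_of are hypothetical and do not come from the system.

```python
# Only the first-symbol meanings listed above; the full scheme has more.
CATEGORY_BY_FIRST_SYMBOL = {
    "N": "noun",
    "V": "verb",
    "A": "adjective",
    "R": "adverb",
    "C": "conjunction",
    "S": "preposition",
    "P": "pronoun",
}


def category_of(tag: str) -> str:
    """Return the part of speech encoded by the first symbol of a tag.

    A leading asterisk, as found in the "Info" field of the output,
    is ignored.
    """
    tag = tag.lstrip("*")
    if not tag:
        return "unknown"
    return CATEGORY_BY_FIRST_SYMBOL.get(tag[0], "unknown")


print(category_of("*NCFS000"))  # noun
print(category_of("*VMID1S0"))  # verb
```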

 
