Aligned English-Spanish corpus of literary texts (sentence level, 12 large novels)

The corpus is aligned at the sentence level using the Vanilla aligner. Corpora with paragraph-level alignment and with alignment produced by a genetic algorithm are coming very soon.

Download corpus (sentence level alignment, Vanilla aligner).

Papers to cite:

1. Alexander Gelbukh, Grigori Sidorov, José Ángel Vera-Félix. A Bilingual Corpus of Novels Aligned at Paragraph Level. Lecture Notes in Artificial Intelligence, N 4139, ISSN 0302-9743, Springer-Verlag, 2006, pp. 16-23.

2. Grigori Sidorov, Juan-Pablo Posadas-Durán, Héctor Jiménez-Salazar, Liliana Chanona-Hernández. A New Combined Lexical and Statistical based Sentence Level Alignment Algorithm for Parallel Texts. International Journal of Computational Linguistics and Applications, ISSN 0976-0962, Vol. 2 (1-2), 2011, pp. 257-263.

Texts in the corpus (the majority from Project Gutenberg):

| Author | English title | Paragraphs (English) | Spanish title | Paragraphs (Spanish) |
| --- | --- | --- | --- | --- |
| Carroll, Lewis | Alice’s adventures in wonderland | 905 | Alicia en el país de las maravillas | 1,148 |
| Carroll, Lewis | Through the looking-glass | 1,190 | Alicia a través del espejo | 1,230 |
| Conan Doyle, Arthur | The adventures of Sherlock Holmes | 2,260 | Las aventuras de Sherlock Holmes | 2,550 |
| James, Henry | The turn of the screw | 820 | Otra vuelta de tuerca | 1,141 |
| Kipling, Rudyard | The jungle book | 1,219 | El libro de la selva | 1,428 |
| Shelley, Mary | Frankenstein | 787 | Frankenstein | 835 |
| Stoker, Bram | Dracula | 2,276 | Drácula | 2,430 |
| Ubídia, Abdón | Advances in genetics [2] | 116 | De la genética y sus logros | 109 |
| Verne, Jules | Five weeks in a balloon | 2,068 | Cinco semanas en globo | 2,860 |
| Verne, Jules | From the earth to the moon | 894 | De la tierra a la luna | 1,235 |
| Verne, Jules | Michael Strogoff | 2,464 | Miguel Strogoff | 3,059 |
| Verne, Jules | Twenty thousand leagues under the sea [3] | 3,702 | Veinte mil leguas de viaje submarino | 3,515 |


2 This is a work of fiction, not a scientific text.

3 There are two English translations of this novel available.
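
The paragraph counts listed above differ between the English and Spanish versions of almost every text, which is one reason the texts cannot simply be paired one-to-one and need an aligner. As a rough illustration only, the following Python sketch computes the Spanish-to-English paragraph ratio for each text and for the listed texts as a whole. The counts are copied from the table; the names `PARAGRAPH_COUNTS`, `paragraph_ratio`, and `summarize` are hypothetical and are not part of the corpus distribution.

```python
# Paragraph counts copied from the table above:
# (author, English title, English paragraphs, Spanish paragraphs).
# The names used here are illustrative, not part of the corpus itself.
PARAGRAPH_COUNTS = [
    ("Carroll, Lewis", "Alice's adventures in wonderland", 905, 1148),
    ("Carroll, Lewis", "Through the looking-glass", 1190, 1230),
    ("Conan Doyle, Arthur", "The adventures of Sherlock Holmes", 2260, 2550),
    ("James, Henry", "The turn of the screw", 820, 1141),
    ("Kipling, Rudyard", "The jungle book", 1219, 1428),
    ("Shelley, Mary", "Frankenstein", 787, 835),
    ("Stoker, Bram", "Dracula", 2276, 2430),
    ("Ubídia, Abdón", "Advances in genetics", 116, 109),
    ("Verne, Jules", "Five weeks in a balloon", 2068, 2860),
    ("Verne, Jules", "From the earth to the moon", 894, 1235),
    ("Verne, Jules", "Michael Strogoff", 2464, 3059),
    ("Verne, Jules", "Twenty thousand leagues under the sea", 3702, 3515),
]


def paragraph_ratio(english, spanish):
    """Return the number of Spanish paragraphs per English paragraph."""
    return spanish / english


def summarize(rows):
    """Return (total English, total Spanish, overall ratio) for the rows."""
    total_en = sum(row[2] for row in rows)
    total_es = sum(row[3] for row in rows)
    return total_en, total_es, paragraph_ratio(total_en, total_es)


if __name__ == "__main__":
    for author, title, english, spanish in PARAGRAPH_COUNTS:
        ratio = paragraph_ratio(english, spanish)
        print(f"{author}: {title}: {ratio:.2f}")
    total_en, total_es, overall = summarize(PARAGRAPH_COUNTS)
    print(f"Total: {total_en} English, {total_es} Spanish, ratio {overall:.2f}")
```

A ratio well above 1 means the Spanish edition is split into noticeably more paragraphs than the English one. These figures describe paragraph segmentation only and say nothing about the quality of the sentence-level alignment.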