SpaDeLeF: A Dataset for Hierarchical Classification of Lexical Functions for Collocations in Spanish
———————————————————————————————
Authors and contact information: Yevhen Kostiuk (kosteugeneo@gmail.com), Grigori Sidorov (sidorov@cic.ipn.mx), and Olga Kolesnikova (kolesolga@gmail.com)
Link to the code: https://github.com/YevhenKost/spadelef
———————————————————————————————
SpaDeLeF is a dataset of the most frequent Spanish verb-noun collocations and the sentences in which they occur. Each collocation is assigned to one of 37 lexical functions, which are defined as classes for a hierarchical classification task. Each class represents a relation between the noun and the verb in a collocation, involving their semantic and syntactic features. We combine the classes in a tree-based structure and introduce classification objectives for each level of the structure. The dataset was created by dependency tree parsing and phrase matching on Spanish news text.
———————————————————————————————
Data Structure

The directory is structured as follows:

SpaDeLeF:
  -AntiPermOper1
      prohibir reproducción.jsonl
  -AntiReal3
      violar derecho.jsonl
  -Caus1Func1
      hacer parte.jsonl
      sacar conclusión.jsonl
      sacar provecho.jsonl
  ….

The root directory contains subdirectories named after the specific classes of lexical functions: AntiPermOper1, Caus1Func1, Oper1, etc. Each subdirectory stores jsonl files named after a verb-noun collocation: “<verb> <noun>.jsonl”. Each jsonl file holds the processed sentences that include the verb-noun collocation of the corresponding lexical function class. Every row of a jsonl file contains one JSON object.
For example:

{"start_cc_index": 15, "end_cc_index": 16, "tokens": ["pero", "redondo", "desprecia", "a", "los", "dirigentes", "que", "est\u00e1n", "hoy", "al", "frente", "del", "pp", "y", "busca", "sacar", "provecho", "de", "las", "dudas", "de", "los", "barones", "populares", "sobre", "si", "en", "estos", "momentos", "representan", "o", "no", "una", "alternativa", "de", "estado", "."], "lemmas": ["pero", "redondo", "desprecia", "a", "el", "dirigente", "que", "estar", "hoy", "al", "frente", "del", "pp", "y", "buscar", "sacar", "provecho", "de", "el", "duda", "de", "el", "bar\u00f3n", "popular", "sobre", "si", "en", "este", "momento", "representar", "o", "no", "uno", "alternativa", "de", "estado", "."], "lf": "sacar provecho"}

The keys are described as follows:

tokens: list of strings, the text tokenized with the stanza word tokenizer for Spanish
lemmas: list of strings, the tokenized text lemmatized with the stanza lemmatizer for Spanish
start_cc_index: int, zero-based index in tokens of the verb that is part of the collocation
end_cc_index: int, zero-based index in tokens of the noun that is part of the collocation
lf: str, the space-separated verb-noun collocation, “<verb> <noun>” (in the example above, "sacar provecho")

Some of the files may be empty, which indicates that the corpus contained no sentences associated with that collocation in that class.
———————————————————————————————
Scripts

For loading data from a jsonl file, we suggest using the following script:

```
import json


def read_jsonl(path):
    """Read a jsonl file and return a list of parsed JSON objects."""
    output = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, such as a trailing newline at the end of the file
            if line:
                output.append(json.loads(line))
    return output
```
———————————————————————————————
Citing

@misc{kostiuk2023spadelef,
      title={SpaDeLeF: A Dataset for Hierarchical Classification of Lexical Functions for Collocations in Spanish},
      author={Yevhen Kostiuk and Grigori Sidorov and Olga Kolesnikova},
      year={2023},
      eprint={2311.04189},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
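———————————————————————————————
Usage Sketch

The layout described under Data Structure (one subdirectory per lexical function class, one jsonl file per collocation) can be traversed with a small helper. The following is a minimal sketch and not part of the released code: the function names `iter_examples` and `collocation_tokens` are hypothetical, and it assumes the dataset has been unpacked locally so that the class directories sit directly under the root path you pass in.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse one jsonl file into a list of dicts (blank lines are skipped)."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def iter_examples(root):
    """Yield (class_name, collocation, record) for every sentence in the dataset.

    Assumed layout (hypothetical local copy): <root>/<class name>/<verb> <noun>.jsonl
    """
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue
        for jsonl_path in sorted(class_dir.glob("*.jsonl")):
            # Empty files (collocations with no sentences) yield nothing.
            for record in read_jsonl(jsonl_path):
                # The file stem is the collocation, e.g. "sacar provecho".
                yield class_dir.name, jsonl_path.stem, record


def collocation_tokens(record):
    """Return the (verb, noun) surface tokens marked by the two index fields."""
    tokens = record["tokens"]
    return tokens[record["start_cc_index"]], tokens[record["end_cc_index"]]
```

Calling `iter_examples("SpaDeLeF")` on a local copy would produce one tuple per sentence, with the class label taken from the directory name, and `collocation_tokens(record)` recovers the verb and noun as they appear in that sentence. Taking the label from the directory name rather than from the `lf` field is a choice made here because `lf` stores the collocation, not the class.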