This is a new version of the repository. Do let us know (dspace-clarin-it-ilc-help@ilc.cnr.it) if you encounter any issues.
What's New
corpusILC
Author(s):
Description:
The StarwarsNER French Italian Corpus - sample is a multilingual benchmark resource for Named Entity Recognition (NER) in the wastewater and stormwater management domain. With respect to **StarwarsNER French Italian Corpus - sample 1.0**, this new version adds English translations of the annotation guidelines and the plain texts, so that people who do not speak Italian or French can better understand and potentially replicate the experiment.

It supports research in:

- Information extraction
- Relation extraction
- Entity linking

The corpus consists of manually annotated parallel French and Italian documents, aligned at the sentence level. Annotations follow a domain-specific schema based on the Sewer Network Ontology <http://hdl.handle.net/20.500.11752/ILC-1037>. For copyright reasons, this release contains only a sample of the original corpus, namely 8 French documents from public administrations and their Italian translations.

---

## Resource Creation

1. **French corpus**
   - Collected from reports, regulations, and local media texts.
   - Manually annotated according to the STARWARS schema.
2. **Italian (and English) corpus**
   - Produced via machine translation of the French texts.
   - Reviewed and corrected by bilingual translation students and expert hydrologists.
3. **Annotation process**
   - Conducted with the **INCEpTION** annotation platform.
   - Ensured consistent alignment between French and Italian.

For details, please refer to the publication:

F.A. Cardillo, F. Debole, F. Frontini, M. Aelami, N. Chahinian, S. Conrad (2025) “Novel Benchmark for NER in the Wastewater and Stormwater Domain”, Proceedings of the 6th IEEE MNLP Conf. (CiST-MNLP’2025) 4-10 October 2025, Marrakech, Morocco. <https://arxiv.org/abs/2506.01938>

---

## Contents of this Package

- **Texts**: Provided in plain text, in French, with translations in Italian and English (the latter for reference only).
- **Annotations**: Provided in **CONLL 2003 format, as exported from INCEpTION**, for the French and Italian texts.
- **Annotation guidelines**: Included in **French**, with translations in **English** and **Italian**, as used by annotators.
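
A minimal sketch of how annotations in a CoNLL 2003-style file might be read. It assumes whitespace-separated columns with the token in the first column and a BIO-style entity tag in the last, and blank lines between sentences; the exact column layout of the INCEpTION export is not stated here and should be checked against the actual files. The entity labels in the sample (`NETWORK`, `LOC`) are hypothetical placeholders, not labels from the STARWARS schema.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Split CoNLL-style lines into sentences of (token, tag) pairs.

    Assumption: token is the first column, entity tag is the last column.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and -DOCSTART- markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[0], cols[-1]))
    if current:
        sentences.append(current)
    return sentences


def entity_spans(sentence: Sentence) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (label, text) spans."""
    spans: List[Tuple[str, str]] = []
    label, tokens = None, []
    for token, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            if tokens:
                spans.append((label, " ".join(tokens)))
            label, tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            tokens.append(token)
        else:  # "O" or any other non-entity tag closes the open span
            if tokens:
                spans.append((label, " ".join(tokens)))
            label, tokens = None, []
    if tokens:
        spans.append((label, " ".join(tokens)))
    return spans


# Hypothetical sample; labels and column filler are illustrative only.
SAMPLE = """\
Le _ _ O
réseau _ _ B-NETWORK
unitaire _ _ I-NETWORK
de _ _ O
Montpellier _ _ B-LOC
. _ _ O
"""

sentences = read_conll(SAMPLE.splitlines())
spans = entity_spans(sentences[0])
```

Because the French and Italian documents are aligned at the sentence level, the same reader could be run over both files and the resulting sentence lists compared by index, provided the exports preserve sentence order.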
This item contains 1 file (825.38 KB).
Publicly Available
corpusOPEN
Author(s):
Description:
The Words of Labour corpus – English Version 1.0 (WoLC-En1) contains 30 English legal texts on labour law from 26 African countries. The texts were selected by legal experts as the most relevant at the time of corpus construction (2024/25). The primary intended use of the corpus is the investigation of legal segmentation in African Global South countries.
This item contains 2 files (7.17 MB).
Publicly Available
corpusOPEN
Author(s):
Description:
This corpus contains 75 written autobiographical narratives related to affective objects, extracted from a larger dataset of 236 autobiographical narratives produced by Italian university students as part of a bilingual Italian–Spanish research protocol on self-concept and identity. Participants were asked to describe an object, person, or place they considered important or indispensable in their lives; this subcorpus includes exclusively the 75 responses in which participants chose to write about an object. Participants were prompted with the following question: "C'è qualche oggetto, persona o luogo importante per te o indispensabile nella tua vita? Descrivilo in dettaglio: la sua funzione, perché è così importante per te, ecc." (in English: "Is there an object, person, or place that is important to you or indispensable in your life? Describe it in detail: its function, why it is so important to you, etc."). Data were collected at three institutions in Emilia-Romagna: Università di Bologna (Political Science, Social and International Sciences), Università di Bologna (Modern Languages and Civilizations), and Università di Parma (Modern Languages and Civilizations). All participants provided written informed consent. The corpus is intended for research in corpus linguistics, psycholinguistics, and the study of autobiographical memory, identity, material culture, and anthropology. The corpus is structured as a CSV file with four columns: participant ID, gender (M/F), degree programme and institution, and narrative text.
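
A minimal sketch of loading a four-column CSV like the one described above. The column names in `FIELDS`, the presence of a header row, the comma delimiter, and the sample rows are all assumptions made for illustration; the actual file may differ and should be inspected first.

```python
import csv
import io
from collections import Counter

# Hypothetical column names for the four columns described above.
FIELDS = ["participant_id", "gender", "programme_institution", "narrative"]


def load_narratives(handle, has_header=True):
    """Read rows into dicts keyed by FIELDS, skipping malformed rows."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # assumption: first row holds column titles
    return [dict(zip(FIELDS, row)) for row in reader if len(row) == len(FIELDS)]


def gender_counts(rows):
    """Count narratives per value of the gender column (M/F)."""
    return Counter(row["gender"] for row in rows)


# Placeholder rows, not taken from the corpus. Quoted fields may contain commas.
SAMPLE = (
    "id,gender,programme,text\n"
    'P01,F,"Modern Languages, Parma","Placeholder narrative, with a comma."\n'
    "P02,M,Political Science Bologna,Another placeholder narrative.\n"
)

rows = load_narratives(io.StringIO(SAMPLE))
counts = gender_counts(rows)
```

For a file on disk, the same function can be called on `open(path, newline="", encoding="utf-8")`, where `path` is wherever the downloaded CSV was saved; the `csv` module handles quoted narratives that contain commas or line breaks.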
This item contains 1 file (37.79 KB).
Publicly Available
Most Viewed Items - Last Month
lexicalConceptualResourceILC
Author(s):
Description:
ItalWordNet (IWN) is a lexical-semantic database developed within two research projects: EuroWordNet (EWN) and Sistema Integrato per il Trattamento Automatico del Linguaggio (SI-TAL). Like the Princeton WordNet, IWN is structured around the notion of synset. Following the model designed in EWN, IWN encodes a rich set of semantic relations. In addition to the language-internal relations, equivalence relations are encoded between Italian synsets and the closest concepts in an Inter-Lingual Index (ILI), a separate language-independent module containing all WN1.5 synsets but not the relations among them. IWN now contains information about Italian nouns, verbs, adjectives and adverbs. This SQL version of IWN v2.0 contains a corrected and revised version of the original IWN:
- 49350 synsets (of which: 3459 proper nouns, 32073 nominal, 8903 verbal, 4374 adjectival, 541 adverbial)
- 48416 lemmas (of which: 3918 proper nouns, 29527 nouns, 8015 verbs, 5808 adjectives, 1090 adverbs)
- 68478 senses
This item contains 2 files (4.81 MB).
Publicly Available
corpusOPEN
Author(s):
Description:
HOW TO CITE: Cereser E., Mastrantonio, D. (a cura di), ALEF. Archivio per lo studio della Lingua degli Elaborati studenteschi - Ca’ Foscari, progetto digitale di F. Boschetti, Venezia, Università Ca’ Foscari, 2026.
ALEF is an archive of Università Ca’ Foscari that collects student essays from upper secondary schools. The collection and study of student writing form part of the work of Ce.Do.Di (Centro di documentazione e ricerca sulla scuola e la didattica, Department of Humanities, Università Ca’ Foscari). The mediation of the school section of ASLI (Associazione per la Storia della Lingua Italiana), formerly coordinated by Rita Fresu, was important in building a network of contacts with schools. The texts were acquired by Ca’ Foscari through agreements signed between the Department of Humanities and various Italian schools. For privacy reasons, the names of the many teachers who made the collection of the essays possible are not given here.
In its current form (March 2026), ALEF contains 227 texts written by students of upper secondary schools in various regions of Italy. The group of texts currently published (texts 270 to 496) is tied to the doctoral thesis of Eugenio Cereser, Analisi e classificazione degli errori lessicali per un archivio digitale di testi studenteschi contemporanei (38th cycle, funded by PNRR, PhD programme in Italian Studies, Università Ca’ Foscari), supervised by Davide Mastrantonio, Federico Boschetti and Michele Colombo (thesis defence scheduled for April 2026). Other groups of texts, not yet published, had previously been collected for the master's theses of Chiara Marino (Strategie argomentative degli studenti delle scuole secondarie di secondo grado, Università Ca’ Foscari, academic year 2022/2023, supervisor D. Mastrantonio) and Giulia Corrocher (Connettivi e relazioni logiche negli elaborati degli studenti delle scuole secondarie di secondo grado, Università Ca’ Foscari, academic year 2022/2023, supervisor D. Mastrantonio), both of whom helped transcribe the texts.
Each text in ALEF is identified by the following information: school year, year of study, type of school, region, type of exam according to the Fedeli reform, exam prompt, and sequential number of the individual text. The texts come from class exercises and in-class tests, and the exam prompts vary from class to class. Each text was transcribed in full in order to provide a diplomatic transcription of the essays. The following criteria were adopted: line breaks were preserved; legible deletions are rendered in strikethrough; illegible deletions are marked with asterisks (*); additions above the line are transcribed in superscript; any spelling errors are flagged with (sic).
This item contains 1 file (1.66 MB).
Publicly Available
corpusOPEN
Author(s):
Description:
Musisque Deoque, a corpus of all the Latin poets from the beginnings to the end of the 7th century, was established at the end of 2005 with the main goal of creating a single database of Latin poetry, supported by an electronic critical and exegetical apparatus. At present, the main collections of classical texts have been digitized, and resources, mostly online, allow quicker lexical searches. In most cases, however, a search only returns occurrences of a keyword within a fixed and ‘authoritarian’ text. The aim of Musisque Deoque is to overcome these limitations by making it possible to locate not only the forms chosen in the text of a reference edition, but also the variants in its critical apparatus. The website has recently been extended with new functions. These are the most important. Epigraphica: dedicated handling of the Carmina Latina Epigraphica, with search by corpus and by incipit, and further information such as place of origin, dating, and any accompanying prose paratext; in addition, a photographic archive of the catalogued inscriptions has been set up. Witnesses: the apparatuses now use a standard nomenclature for the manuscripts, giving the current proper names of city, library and collection together with the shelfmark; a list of poets and works present in the same manuscript; and a link to the library’s website and, where available, to the digitized images of the codex. Search by lemmas: available in the advanced search. Metrical scan: all the works in dactylic verse are scanned by the Pedecerto application. Co-occurrences: starting from a chosen source text, the whole corpus is searched for verbal or non-verbal rhythmic similarities. Hellenica: a digital archive of Greek poetry.
This item contains 1 file (16.96 MB).
Publicly Available