What's New
corpusOPEN
Author(s):
Description:
The UNITE corpus is a learner corpus of English consisting of written interactions between Italian university students and AI-based chatbots. The data were collected in 2024 in English as a Foreign Language (EFL) learning scenarios in which students interacted with chatbots in small talk and role-play tasks. Data collection took place at three universities in Italy: the University of Bologna, the University of Macerata, and the University of Naples "L'Orientale". The corpus contains 326 interactions (722,537 tokens) produced by as many learners aged 19–25, with mostly low- to upper-intermediate proficiency levels, enrolled in non-linguistic degree courses. The participants also include learners with disabilities and specific learning disorders, reflecting the project’s focus on inclusive language learning practices. The UNITE corpus is described using the Core Metadata Schema for Learner Corpora (Paquot et al. 2024) and is distributed as part of this repository in two versions. The first is a minimally annotated text version including metadata at the text, learner, task, and turn levels. The second version features learner-error annotation based on an adapted version of the Louvain Error Tagging scheme (Granger et al. 2022). The corpus is also accessible through the NoSketch Engine platform hosted at https://corpora.dipintra.it. The version of the corpus available there is further enriched with linguistic annotation (part-of-speech tags and lemmas).
This item contains 2 files (5.65 MB).
Publicly Available
corpusOPEN
Author(s):
Description:
ALEF is an archive of Ca’ Foscari University that collects student essays from upper secondary schools. The collection and study of student writing is carried out within Ce.Do.Di (Centro di documentazione e ricerca sulla scuola e la didattica, the centre for documentation and research on schools and teaching of the Department of Humanities at Ca’ Foscari University). The mediation of the schools section of ASLI (Associazione per la Storia della Lingua Italiana), formerly coordinated by Rita Fresu, was important in building a network of contacts with schools. The texts were acquired by Ca’ Foscari through agreements signed between the Department of Humanities and the various Italian schools. For privacy reasons, the names of the many teachers who made the collection possible are not mentioned here. In its current form (March 2026), ALEF contains 227 texts written by students of upper secondary schools from various regions of Italy. The set of texts currently published (texts 270 to 496) is linked to the doctoral thesis of Eugenio Cereser, Analisi e classificazione degli errori lessicali per un archivio digitale di testi studenteschi contemporanei, 38th cycle, funded by PNRR, PhD programme in Italian Studies at Ca’ Foscari University, supervised by Davide Mastrantonio, Federico Boschetti and Michele Colombo (thesis defence scheduled for April 2026). Other sets of texts, not yet published, had previously been collected for the master's theses of Chiara Marino (Strategie argomentative degli studenti delle scuole secondarie di secondo grado, Ca’ Foscari University, academic year 2022/2023, supervisor D. Mastrantonio) and Giulia Corrocher (Connettivi e relazioni logiche negli elaborati degli studenti delle scuole secondarie di secondo grado, Ca’ Foscari University, academic year 2022/2023, supervisor D. Mastrantonio), both of whom contributed to the transcription of the texts.
Each text in ALEF is identified by the following information: school year, year of study, type of school, region, type of test according to the Fedeli reform, test prompt, and sequential number of the individual text. The texts come from class exercises and in-class assignments, and the test prompts depend on the individual classes. Each text was transcribed in full to provide a diplomatic transcription of the essays. The following criteria were adopted: line breaks were preserved; legible deletions are rendered in strikethrough; illegible deletions are marked with asterisks (*); additions above the line are transcribed in superscript; any spelling errors are flagged with (sic).
This item contains 1 file (1.66 MB).
Publicly Available
corpusOPEN
Author(s):
Description:
NomadLingo is the first publicly available corpus documenting multilingual, naturally occurring interactions among European digital nomads, a rapidly growing yet understudied transnational community (Tedesco 2025). The corpus contains transcripts of extracts from naturally occurring conversations that were audio-recorded between November 2023 and April 2024 at social events organised and promoted within digital nomad communities based in Madeira and the Canary Islands. The total duration of the transcribed recordings is 11 hours 38 minutes. For further information about the texts in the corpus, see Section 4. Version 1.1 open is an update of the older NomadLingo1.0 open (Tedesco et al. 2025). In this version, two annotation layers have been added (see Section 8) and the transcripts have been further revised. The folder NomadLingo1.1 open contains:
- a readme file;
- a folder with the three annotated versions of the corpus (Annotated_versions_NomadLingo1.1), including Trans&repair&misunderstanding&interactionAnnotated_NomadLingo1.1, Trans&repairAnnotated_NomadLingo1.1, and TranslanguagingAnnotated_NomadLingo1.1;
- a folder named Naked_NomadLingo1.1 containing files that include only the transcribed conversations with structural and contextual annotation, but no linguistic annotation;
- a folder named MetadataNomadLingo1.1 containing information about the speakers and sessions in the corpus in .xls and .csv formats;
- a folder named Annotation_schema_NomadLingo1.1 containing the annotation scheme in .xls and .csv formats;
- the corpus in .csv format.
Another version of the corpus, to which access is restricted for privacy reasons, includes two additional folders: Recordings, containing the .wav files, and Privacy Notice and Informed Consent documents, containing legal and ethical documentation. To request access to the full version of the dataset, contact novella.tedesco2@unibo.it.
The project is available on GitHub (https://github.com/novella-tedesco/FLO) and OSF (DOI 10.17605/OSF.IO/UK9NY), where code for data processing and analysis is shared and updated.
This item contains 1 file (7.14 MB).
Publicly Available
Most Viewed Items - Last Month
corpusOPEN
Author(s):
Description:
NomadLingo is the first publicly available corpus documenting multilingual, naturally occurring interactions among European digital nomads, a rapidly growing yet understudied transnational community (Tedesco 2025). The corpus contains transcripts of extracts from naturally occurring conversations that were audio-recorded between November 2023 and April 2024 at social events organised and promoted within digital nomad communities based in Madeira and the Canary Islands. The total duration of the transcribed recordings is 11 hours 38 minutes. For further information about the texts in the corpus, see Section 4. Version 1.1 open is an update of the older NomadLingo1.0 open (Tedesco et al. 2025). In this version, two annotation layers have been added (see Section 8) and the transcripts have been further revised. The folder NomadLingo1.1 open contains:
- a readme file;
- a folder with the three annotated versions of the corpus (Annotated_versions_NomadLingo1.1), including Trans&repair&misunderstanding&interactionAnnotated_NomadLingo1.1, Trans&repairAnnotated_NomadLingo1.1, and TranslanguagingAnnotated_NomadLingo1.1;
- a folder named Naked_NomadLingo1.1 containing files that include only the transcribed conversations with structural and contextual annotation, but no linguistic annotation;
- a folder named MetadataNomadLingo1.1 containing information about the speakers and sessions in the corpus in .xls and .csv formats;
- a folder named Annotation_schema_NomadLingo1.1 containing the annotation scheme in .xls and .csv formats;
- the corpus in .csv format.
Another version of the corpus, to which access is restricted for privacy reasons, includes two additional folders: Recordings, containing the .wav files, and Privacy Notice and Informed Consent documents, containing legal and ethical documentation. To request access to the full version of the dataset, contact novella.tedesco2@unibo.it.
The project is available on GitHub (https://github.com/novella-tedesco/FLO) and OSF (DOI 10.17605/OSF.IO/UK9NY), where code for data processing and analysis is shared and updated.
This item contains 1 file (7.14 MB).
Publicly Available
lexicalConceptualResourceILC
Author(s):
Description:
ItalWordNet (IWN) is a lexical-semantic database developed in the framework of two different research projects: EuroWordNet (EWN) and Sistema Integrato per il Trattamento Automatico del Linguaggio (SI-TAL). IWN is structured in the same way as the Princeton WordNet, namely around the notion of synset. Following the model designed in EWN, IWN encodes a rich set of semantic relations. In addition to the language-internal relations, equivalence relations were also encoded between Italian synsets and the closest concepts in an Inter-Lingual Index (ILI), a separate language-independent module containing all WN1.5 synsets but not the relations among them. IWN now contains information about Italian nouns, verbs, adjectives and adverbs. This SQL version of IWN v2.0 contains a corrected and revised version of the original IWN:
- 49350 synsets (of which: 3459 proper nouns, 32073 nominal, 8903 verbal, 4374 adjectival, 541 adverbial)
- 48416 lemmas (of which: 3918 proper nouns, 29527 nouns, 8015 verbs, 5808 adjectives, 1090 adverbs)
- 68478 senses
This item contains 2 files (4.81 MB).
Publicly Available
corpusILC
Author(s):
Description:
The StarwarsNER French Italian Corpus - sample is a multilingual benchmark resource for Named Entity Recognition (NER) in the wastewater and stormwater management domain. It supports research in:

- Information extraction
- Relation extraction
- Entity linking

The corpus consists of manually annotated parallel French and Italian documents, aligned at the sentence level. Annotations follow a domain-specific schema based on the Sewer Network Ontology <http://hdl.handle.net/20.500.11752/ILC-1037>. For copyright reasons, this release contains only a sample of the original corpus, namely 8 French documents from public administrations and their Italian translations.

---

## Resource Creation

1. **French corpus**
   - Collected from reports, regulations, and local media texts.
   - Manually annotated according to the STARWARS schema.
2. **Italian corpus**
   - Produced via machine translation of the French texts.
   - Reviewed and corrected by bilingual translation students and expert hydrologists.
3. **Annotation process**
   - Conducted with the **INCEpTION** annotation platform.
   - Ensured consistent alignment between French and Italian.

For details, please refer to the publication: F.A. Cardillo, F. Debole, F. Frontini, M. Aelami, N. Chahinian, S. Conrad (2025) “Novel Benchmark for NER in the Wastewater and Stormwater Domain”, Proceedings of the 6th IEEE MNLP Conf. (CiST-MNLP’2025) 4-10 October 2025, Marrakech, Morocco. <https://arxiv.org/abs/2506.01938>

---

## Contents of this Package

- **Texts**: Provided in plain text.
- **Annotations**: Provided in **CONLL 2003 format, as exported from INCEpTION**.
- **Annotation guidelines**: Included in both **French** and **Italian**, as used by annotators.
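Since the annotations are distributed in CoNLL 2003 format, a minimal Python sketch of how such files might be read is given below. It assumes the common layout of one token per line with whitespace-separated columns, the token in the first column, a BIO-style entity tag in the last column, and a blank line between sentences; the actual column layout of the files in this package should be checked before use. The function names and the entity labels in the sample (`NETWORK`, `LOC`) are hypothetical and are not taken from the STARWARS schema.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences from CoNLL-style lines (assumed: token first, tag last)."""
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and -DOCSTART- markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) spans."""
    entities: List[Tuple[str, str]] = []
    label, tokens = None, []
    for token, tag in sentence:
        continues = tag.startswith("I-") and tag[2:] == label
        if continues:
            tokens.append(token)
            continue
        # Any other tag closes the span that was open, if there was one.
        if tokens:
            entities.append((label, " ".join(tokens)))
        if tag.startswith(("B-", "I-")):
            label, tokens = tag[2:], [token]
        else:
            label, tokens = None, []
    if tokens:
        entities.append((label, " ".join(tokens)))
    return entities


# Hypothetical two-column sample; labels are illustrative only.
SAMPLE = """\
Le O
réseau B-NETWORK
d'assainissement I-NETWORK
de O
Montpellier B-LOC

Il O
pleut O
"""

sentences = list(read_conll(SAMPLE.splitlines()))
entities = [extract_entities(s) for s in sentences]
```

To process a real file from the package, the same functions could be applied to an open file handle, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points to one of the annotation files.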
This item contains 1 file (559.42 KB).
Publicly Available