Linguistic Data and NLP Tools
Citation Support (with Persistent IDs)
Deposit Free and Safe
License of your Choice (Open licenses encouraged)
Easy to Find
Easy to Cite
“There ought to be only one grand dépôt of art in the world, to which the artist might repair with his works, and on presenting them receive what he required...” (Ludwig van Beethoven, 1801)

corpusOPEN
Author(s):
Description:
NomadLingo is the first publicly available corpus documenting multilingual, naturally occurring interactions among European digital nomads, a rapidly growing yet understudied transnational community (Tedesco 2025). The corpus contains transcripts of extracts from naturally occurring conversations that were audio-recorded between November 2023 and April 2024 at social events organised and promoted within digital nomad communities based in Madeira and the Canary Islands. The transcribed recordings total 11 hours 38 minutes. For further information about the texts in the corpus, see Section 4. Version 1.1 open updates the earlier NomadLingo1.0 open (Tedesco et al. 2025): two annotation layers have been added (see Section 8) and the transcripts have been further revised.
The folder NomadLingo1.1 open contains:
- a readme file;
- a folder with the three annotated versions of the corpus (Annotated_versions_NomadLingo1.1), including Trans&repair&misunderstanding&interactionAnnotated_NomadLingo1.1, Trans&repairAnnotated_NomadLingo1.1, and TranslanguagingAnnotated_NomadLingo1.1;
- a folder named Naked_NomadLingo1.1 containing files which only include transcribed conversation, structural and contextual annotation but no linguistic annotation;
- a folder named MetadataNomadLingo1.1 containing information about speakers and sessions in the corpus in .xls and .csv formats;
- a folder named Annotation_schema_NomadLingo1.1 containing the annotation scheme in .xls and .csv formats;
- the corpus in .csv format.
A further version of the corpus, to which access is restricted for privacy reasons, includes two additional folders: Recordings, containing the .wav files, and Privacy Notice and Informed Consent documents, containing legal and ethical documentation. To request access to the full version of the dataset, contact novella.tedesco2@unibo.it.
The project is also available on GitHub (https://github.com/novella-tedesco/FLO) and OSF (DOI 10.17605/OSF.IO/UK9NY), where the code for data processing and analysis is shared and updated.
This item contains 1 file (7.14 MB).
Publicly Available
corpusOPEN
Author(s):
Description:
The KIP corpus is part of the larger KIParla collection (www.kiparla.it), which can be freely queried through the NoSketch Engine interface. The KIP corpus was compiled within the framework of the LEAdhoC project – Linguistic Expression of Ad Hoc Categories, funded by the Italian Ministry of Education, University and Research (MIUR) under the SIR 2016 call.
It consists of approximately 70 hours of spoken data collected at the Universities of Bologna and Turin. The interactions, recorded between 2016 and 2019, involved over 180 speakers, including university students and professors from various regions of Italy, and took place in five types of communicative situation: lessons, exams, office hours, semi-structured interviews, and free conversations among students.
The transcriptions have been anonymized. Overall, the module is made up of 121 conversations and includes 184 speakers. This repository contains:
- metadata for both speakers (age, origin, occupation, gender) and conversations (type of interaction), in the metadata subfolder
- descriptions of the set of transcription conventions used for this module (Transcription conventions)
- transcripts of the recorded conversations in four formats: an .eaf file in the eaf/ folder (time-aligned Jefferson-style transcription), a .txt file in the linear-jefferson/ folder (linearized Jefferson-style transcription), a .txt file in the linear-orthographic/ folder (linearized transcription retaining only orthographic words), and a .tsv file in the tsv/ folder (tokenised version of the transcription)
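As a rough illustration of how the time-aligned .eaf transcriptions listed above might be read, the following Python sketch extracts timed segments from an ELAN annotation document using only the standard library. It relies on the general ELAN XML layout (TIME_SLOT, TIER and ALIGNABLE_ANNOTATION elements). The sample document, the tier name SPK001 and the function name read_eaf_segments are invented for illustration and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature ELAN document; real corpus files are larger and
# their tier names and contents will differ.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1200"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="2900"/>
  </TIME_ORDER>
  <TIER TIER_ID="SPK001" LINGUISTIC_TYPE_REF="default">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>buongiorno</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts3" TIME_SLOT_REF2="ts4">
        <ANNOTATION_VALUE>come va</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_eaf_segments(eaf_xml):
    """Return (tier_id, start_ms, end_ms, text) tuples sorted by start time."""
    root = ET.fromstring(eaf_xml)
    # Map each time slot ID to its offset in milliseconds.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    segments = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue  # skip annotations without resolvable times
            text = ann.findtext("ANNOTATION_VALUE", default="")
            segments.append((tier.get("TIER_ID"), start, end, text))
    return sorted(segments, key=lambda seg: seg[1])


for segment in read_eaf_segments(SAMPLE_EAF):
    print(segment)
```

For a file on disk, the text could be read with `open(path, encoding="utf-8").read()` and passed to the same function. The README.md in the repository remains the authoritative description of the actual tier layout.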
More information can be found in the README.md file.
Due to GDPR restrictions, pseudo-anonymized audio files (MP3) are available under a restricted-access license. To request access, please contact the corpus coordinators through the KIParla website and follow the provided procedure.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This item contains 4 files (12.97 MB).
Publicly Available
corpusOPEN
Author(s):
Description:
The KIPasti corpus is part of the larger KIParla collection (www.kiparla.it), which can be freely queried through the NoSketch Engine interface.
The KIPasti corpus was compiled within the framework of the “DiverSIta – Diversity in spoken Italian” project, funded by the Italian Ministry of University and Research (MUR) (PRIN 2022 PNRR Call).
It consists of over 40 hours of spoken data collected in thirteen Italian regions (Abruzzo, Basilicata, Calabria, Campania, Emilia-Romagna, Lazio, Lombardy, Marche, Apulia, Sardinia, Tuscany, Umbria, Veneto) during mealtime conversations, generally within family settings. The interactions, recorded between 2020 and 2024, involved 145 speakers of different origins, ages, education levels, and occupations. Italian predominates in all interactions, but most of them (78%) also contain passages in dialect. The transcriptions have been anonymized. Overall, the module is made up of 63 conversations.
This repository contains:
- metadata for both speakers (occupation, gender, age, origin, L1, educational achievement) and conversations (collection point, year, languages used), in the metadata subfolder
- descriptions of the set of transcription conventions used for this module
- for each conversation you will find: .eaf file in eaf/ folder (time-aligned Jefferson-style transcriptions); .txt file in linear-jefferson/ folder (linearized Jefferson-style transcription); .txt file in linear-orthographic/ folder (linearized transcription retaining only orthographic words); .tsv file in tsv/ folder (tokenised version of the transcription).
More information can be found in the README.md file.
Due to GDPR restrictions, pseudo-anonymized audio files (MP3) are available under a restricted-access license. To request access, please contact the corpus coordinators through the KIParla website and follow the provided procedure.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This item contains 4 files (9.89 MB).
Publicly Available
Most Viewed Items - Last Month
lexicalConceptualResourceILC
Author(s):
Description:
ItalWordNet (IWN) is a lexical-semantic database developed in the framework of two different research projects: EuroWordNet (EWN) and Sistema Integrato per il Trattamento Automatico del Linguaggio (SI-TAL).
IWN is structured in the same way as the Princeton WordNet, namely around the notion of synset. Following the model designed in EWN, IWN encodes a rich set of semantic relations. In addition to the internal language relations, equivalence relations were also encoded between Italian synsets and the closest concepts in an Inter-Lingual Index (ILI), a separate language-independent module containing all WN1.5 synsets but not the relations among them.
IWN now contains information about Italian Nouns, Verbs, Adjectives and Adverbs.
This SQL version of IWN v2.0 contains a corrected and revised version of the original IWN:
- 49350 synsets (of which: 3459 proper nouns, 32073 nominal, 8903 verbal, 4374 adjectival, 541 adverbial)
- 48416 lemmas (of which: 3918 proper nouns, 29527 nouns, 8015 verbs, 5808 adjectives, 1090 adverbs)
- 68478 senses
This item contains 2 files (4.81 MB).
Publicly Available
corpusILC
Author(s):
Frontini, Francesca ; et al.
Description:
The StarwarsNER French Italian Corpus - sample is a multilingual benchmark resource for Named Entity Recognition (NER) in the wastewater and stormwater management domain.
It supports research in:
- Information extraction
- Relation extraction
- Entity linking
The corpus consists of manually annotated parallel French and Italian documents, aligned at the sentence level. Annotations follow a domain-specific schema based on the Sewer Network Ontology <http://hdl.handle.net/20.500.11752/ILC-1037>.
For copyright reasons, this release contains only a sample of the original corpus, namely 8 French documents from public administrations and their Italian translations.
---
## Resource Creation
1. **French corpus**
- Collected from reports, regulations, and local media texts.
- Manually annotated according to the STARWARS schema.
2. **Italian corpus**
- Produced via machine translation of the French texts.
- Reviewed and corrected by bilingual translation students and expert hydrologists.
3. **Annotation process**
- Conducted with the **INCEpTION** annotation platform.
- Ensured consistent alignment between French and Italian.
For details, please refer to the publication:
F.A. Cardillo, F. Debole, F. Frontini, M. Aelami, N. Chahinian, S. Conrad (2025) “Novel Benchmark for NER in the Wastewater and Stormwater Domain”, Proceedings of the 6th IEEE MNLP Conf. (CiST-MNLP’2025) 4-10 October 2025, Marrakech, Morocco. <https://arxiv.org/abs/2506.01938>
---
## Contents of this Package
- **Texts**: Provided in plain text.
- **Annotations**: Provided in **CONLL 2003 format, as exported from INCEpTION**.
- **Annotation guidelines**: Included in both **French** and **Italian**, as used by annotators.
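
As a minimal sketch of how annotations in a CoNLL 2003 style file might be read, the Python example below groups token lines into sentences and collects entity spans from B-/I-/O tags. It assumes whitespace-separated columns with the token first and the entity tag last, and blank lines between sentences. The sample sentence, the labels `NETWORK` and `LOC`, and the function names are hypothetical and do not come from the STARWARS schema or the released files.

```python
# Hypothetical sample in a CoNLL 2003 style layout; labels are invented.
SAMPLE_CONLL = """-DOCSTART- _ _ O

Le _ _ O
réseau _ _ B-NETWORK
d'assainissement _ _ I-NETWORK
de _ _ O
Montpellier _ _ B-LOC
"""


def read_conll(text):
    """Split CoNLL text into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def entity_spans(sentence):
    """Collect (entity text, label) pairs from B-/I-/O tags."""
    spans, tokens, label = [], [], None
    for token, tag in sentence:
        prefix, _, tag_label = tag.partition("-")
        if prefix == "I" and tag_label == label:
            tokens.append(token)  # continue the open entity
            continue
        if tokens:
            spans.append((" ".join(tokens), label))  # close the open entity
        tokens, label = [], None
        if prefix in ("B", "I"):
            # Start a new entity; an I- tag without a preceding B- is tolerated.
            tokens, label = [token], tag_label
    if tokens:
        spans.append((" ".join(tokens), label))
    return spans


for sentence in read_conll(SAMPLE_CONLL):
    print(entity_spans(sentence))
```

Taking the tag from the last column keeps the sketch tolerant of exports with a different number of intermediate columns. The annotation guidelines in the package define the actual label set.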
This item contains 1 file (559.42 KB).
Publicly Available
corpusOPEN
Author(s):
Description:
NomadLingo is the first publicly available corpus documenting multilingual, naturally occurring interactions among European digital nomads, a rapidly growing yet understudied transnational community (Tedesco 2025). The corpus contains transcripts of extracts from naturally occurring conversations that were audio-recorded between November 2023 and April 2024 at social events organised and promoted within digital nomad communities based in Madeira and the Canary Islands. The transcribed recordings total 11 hours 38 minutes. For further information about the texts in the corpus, see Section 4. Version 1.1 open updates the earlier NomadLingo1.0 open (Tedesco et al. 2025): two annotation layers have been added (see Section 8) and the transcripts have been further revised.
The folder NomadLingo1.1 open contains:
- a readme file;
- a folder with the three annotated versions of the corpus (Annotated_versions_NomadLingo1.1), including Trans&repair&misunderstanding&interactionAnnotated_NomadLingo1.1, Trans&repairAnnotated_NomadLingo1.1, and TranslanguagingAnnotated_NomadLingo1.1;
- a folder named Naked_NomadLingo1.1 containing files which only include transcribed conversation, structural and contextual annotation but no linguistic annotation;
- a folder named MetadataNomadLingo1.1 containing information about speakers and sessions in the corpus in .xls and .csv formats;
- a folder named Annotation_schema_NomadLingo1.1 containing the annotation scheme in .xls and .csv formats;
- the corpus in .csv format.
A further version of the corpus, to which access is restricted for privacy reasons, includes two additional folders: Recordings, containing the .wav files, and Privacy Notice and Informed Consent documents, containing legal and ethical documentation. To request access to the full version of the dataset, contact novella.tedesco2@unibo.it.
The project is also available on GitHub (https://github.com/novella-tedesco/FLO) and OSF (DOI 10.17605/OSF.IO/UK9NY), where the code for data processing and analysis is shared and updated.
This item contains 1 file (7.14 MB).
Publicly Available





