What's New
corpusOPEN
Author(s):
Description:
The KIP corpus is part of the larger KIParla collection (www.kiparla.it), which can be freely queried through the NoSketch Engine interface. The KIP corpus was compiled within the framework of the LEAdhoC project – Linguistic Expression of Ad Hoc Categories, funded by the Italian Ministry of Education, University and Research (MIUR) under the SIR 2016 call. It consists of approximately 70 hours of spoken data collected at the Universities of Bologna and Turin. The interactions, recorded between 2016 and 2019, involved over 180 speakers, including university students and professors from various regions of Italy, and took place in five different types of communicative situations: lessons, exams, office hours, semi-structured interviews, free conversations (among students). The transcriptions have been anonymized. Overall, the module is made up of 121 conversations and includes 184 speakers. This repository contains: - metadata for both speakers (age, origin, occupation, gender) and conversations (type of interaction), in the metadata subfolder - descriptions of the set of transcription conventions used for this module (Transcription conventions) - transcripts of the recorded conversations in the following formats: .eaf file in eaf/ folder (time-aligned Jefferson-style transcriptions), .txt file in linear-jefferson/ folder (linearized Jefferson-style transcription), .txt file in linear-orthographic/ folder (linearized transcription retaining only orthographic words), .tsv file in tsv/ folder (tokenised version of the transcription) More information can be found in the README.md file. Due to GDPR restrictions, pseudo-anonymized audio files (MP3) are available under a restricted-access license. To request access, please contact the corpus coordinators through the KIParla website and follow the provided procedure. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This item contains 4 files (12.97 MB).
Publicly Available
corpusOPEN
Author(s):
Description:
The KIPasti corpus is part of the larger KIParla collection (www.kiparla.it), which can be freely queried through the NoSketch Engine interface. The ParlaBO corpus was compiled within the framework of “DiverSIta – Diversity in spoken Italian” project, funded by the Italian Ministry of University and Research (MUR) (PRIN 2022 PNRR Call). It consists of over 40 hours of spoken data collected in thirteen different Italian regions (Abruzzo, Basilicata, Calabria, Campania, Emilia-Romagna, Lazio, Lombardy, Marche, Apulia, Sardinia, Tuscany, Umbria, Veneto) during mealtime conversations, generally within family settings. The interactions, recorded between 2020 and 2024, involved 145 speakers with different origins, ages, education levels, and occupations. Italian is predominantly used in all interactions, but in most of them (78%), various passages in dialect are also present. The transcriptions have been anonymized. Overall, the module is made up of 63 conversations. This repository contains: - metadata for both speakers (occupation, gender, age, origin, L1, educational achievement) and conversations (collection point, year, languages used), in the metadata subfolder - descriptions of the set of transcription conventions used for this module - for each conversation you will find: .eaf file in eaf/ folder (time-aligned Jefferson-style transcriptions); .txt file in linear-jefferson/ folder (linearized Jefferson-style transcription); .txt file in linear-orthographic/ folder (linearized transcription retaining only orthographic words); .tsv file in tsv/ folder (tokenised version of the transcription). More information can be found in the README.md file. Due to GDPR restrictions, pseudo-anonymized audio files (MP3) are available under a restricted-access license. To request access, please contact the corpus coordinators through the KIParla website and follow the provided procedure. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This item contains 4 files (9.89 MB).
Publicly Available
corpusOPEN
Author(s):
Description:
The ParlaTO corpus is part of the larger KIParla collection (www.kiparla.it), which can be freely queried through the NoSketch Engine interface. The ParlaTO corpus was was funded by the CRT Foundation ("ParlaTO - Corpus del Parlato di Torino" project). It consists of about 50 hours of interactions collected in Turin and its province through semi-structured interviews. The interviews, conducted between 2018 and 2020, involved 88 speakers with different origins, ages, education levels, and types of occupation, and addressed personal life experiences in the city (study, work, leisure activities, retirement, memories of the past, etc.). The transcriptions have been anonymized. Overall, the module is made up of 68 conversations and includes 100 speakers. This repository contains: • metadata for both speakers (occupation, gender, age, origin, L1, educational achievement) and conversations (collection point, year, languages used), in the metadata subfolder • descriptions of the set of transcription conventions used for this module • for each conversation you will find: .eaf file in eaf/ folder (time-aligned Jefferson-style transcriptions); .txt file in linear-jefferson/ folder (linearized Jefferson-style transcription); .txt file in linear-orthographic/ folder (linearized transcription retaining only orthographic words); .tsv file in tsv/ folder (tokenised version of the transcription). More information can be found in the README.md file. Due to GDPR restrictions, pseudo-anonymized audio files (MP3) are available under a restricted-access license. To request access, please contact the corpus coordinators through the KIParla website and follow the provided procedure. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This item contains 4 files (14.46 MB).
Publicly Available
Most Viewed Items - Last Month
toolServiceILC
Author(s):
Description:
DigItAnt-search is the GUI web application od the DigItAnt platform, designed to explore, visualise and navigate the different sources of information created or linked within the national ItAnt project (https://www.prin-italia-antica.unifi.it/). DigItAnt is an innovative platform designed to support historical linguistic and epigraphic studies, and researchers in the creation, management and consultation of digital linguistic resources for the fragmentary ancient languages. DigItAnt-search allows to explore interactively various sources of information in a unified and easily accessible environment. The development of DigItAnt was funded by the Ministry of University and Research under the program Research Projects of Relevant National Interest (PRIN) 2017. This front-end application was developed by Michele Mallia under the supervision of Valeria Quochi and thanks to continuous discussion and exchange with the team, composed of: Andrea Bellandi, Alessandro Tommasi, Cesare Zavattari, Silvia Piccini, Michela Bandini, Chiara Fazzone.
This item contains no files.
lexicalConceptualResourceILC
Author(s):
Description:
ItalWordNet (IWN) is a lexical-semantic database developed in the framework of two different research projects: EuroWordNet (EWN) and Sistema Integrato per il Trattamento Automatico del Linguaggio (SI-TAL). IWN is structured in the same way as the Princeton WordNet, namely around the notion of synset. Following the model designed in EWN, IWN encodes a rich set of semantic relations. In addition to the internal language relations, equivalence relations were also encoded between Italian synsets and the closest concepts in an Inter-Lingual Index (ILI), a separate language-independent module containing all WN1.5 synsets but not the relations among them. IWN now contains information about Italian Nouns, Verbs, Adjectives and Adverbs. This SQL version of IWN v2.0 contains a corrected and revised version of the original IWN: 49350 Synsets (of which: 3459 proper nouns, 32073 nominal, 8903 verbal, 4374 adjectival, 541 adverbial) 48416 Lemmas (of which: 3918 proper nouns, 29527 nouns, 8015 verbs, 5808 adjectives, 1090 adverbs) 68478 Senses
This item contains 2 files (4.81 MB).
Publicly Available
corpusILC
Author(s):
Description:
The StarwarsNER French Italian Corpus - sample is a multilingual benchmark resource for Named Entity Recognition (NER) in the wastewater and stormwater management domain. It supports research in: - Information extraction - Relation extraction - Entity linking The corpus consists of manually annotated parallel French and Italian documents, aligned at the sentence level. Annotations follow a domain-specific schema based on the Sewer Network Ontology <http://hdl.handle.net/20.500.11752/ILC-1037>. For copyright reasons, this release contains only a sample of the original corpus, namely 8 French documents from public administrations and their Italian translations. --- ## Resource Creation 1. **French corpus** - Collected from reports, regulations, and local media texts. - Manually annotated according to the STARWARS schema. 2. **Italian corpus** - Produced via machine translation of the French texts. - Reviewed and corrected by bilingual translation students and expert hydrologists. 3. **Annotation process** - Conducted with the **INCEpTION** annotation platform. - Ensured consistent alignment between French and Italian. For details, please refer to the publication: F.A. Cardillo, F. Debole, F. Frontini, M. Aelami, N. Chahinian, S. Conrad (2025) “Novel Benchmark for NER in the Wastewater and Stormwater Domain”, Proceedings of the 6th IEEE MNLP Conf. (CiST-MNLP’2025) 4-10 October 2025, Marrakech, Morocco. <https://arxiv.org/abs/2506.01938> --- ## Contents of this Package - **Texts**: Provided in plain text. - **Annotations**: Provided in **CONLL 2003 format, as exported from INCEpTION**. - **Annotation guidelines**: Included in both **French** and **Italian**, as used by annotators.
This item contains 1 file (559.42 KB).
Publicly Available