StarwarsNER French Italian Corpus - sample
Please use the following text to cite this item:
Frontini, Francesca; et al., 2025, StarwarsNER French Italian Corpus - sample, ILC-CNR for CLARIN-IT repository hosted at Institute for Computational Linguistics "A. Zampolli", http://hdl.handle.net/20.500.11752/ILC-1052
Authors
Frontini, Francesca; et al.
Item identifier
Date issued
2025-10-07
Size
8 files
Description
The StarwarsNER French Italian Corpus - sample is a multilingual benchmark resource for Named Entity Recognition (NER) in the wastewater and stormwater management domain.
It supports research in:
- Information extraction
- Relation extraction
- Entity linking
The corpus consists of manually annotated parallel French and Italian documents, aligned at the sentence level. Annotations follow a domain-specific schema based on the Sewer Network Ontology <http://hdl.handle.net/20.500.11752/ILC-1037>.
For copyright reasons, this release contains only a sample of the original corpus, namely 8 French documents from public administrations and their Italian translations.
---
## Resource Creation
1. **French corpus**
   - Collected from reports, regulations, and local media texts.
   - Manually annotated according to the STARWARS schema.
2. **Italian corpus**
   - Produced via machine translation of the French texts.
   - Reviewed and corrected by bilingual translation students and expert hydrologists.
3. **Annotation process**
   - Conducted with the **INCEpTION** annotation platform.
   - Ensured consistent alignment between French and Italian.
For details, please refer to the publication:
F.A. Cardillo, F. Debole, F. Frontini, M. Aelami, N. Chahinian, S. Conrad (2025) “Novel Benchmark for NER in the Wastewater and Stormwater Domain”, Proceedings of the 6th IEEE MNLP Conf. (CiST-MNLP’2025) 4-10 October 2025, Marrakech, Morocco. <https://arxiv.org/abs/2506.01938>
---
## Contents of this Package
- **Texts**: Provided in plain text.
- **Annotations**: Provided in **CoNLL-2003 format, as exported from INCEpTION**.
- **Annotation guidelines**: Included in both **French** and **Italian**, as used by annotators.
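For readers who want to load the annotation files, the following is a minimal sketch of a reader for CoNLL-style output. It rests on assumptions not stated in this record: that each non-empty line holds whitespace-separated columns with the token first and the entity tag last, that sentences are separated by blank lines, and that tags use the `B-`/`I-`/`O` scheme. The function names and the labels `Structure` and `Location` in the sample are hypothetical illustrations, not the actual STARWARS schema; check the exported files and the annotation guidelines before relying on it.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Split CoNLL-style lines into sentences of (token, tag) pairs.

    Assumption: token is the first column, the entity tag is the last one.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers close the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (label, text) entity spans."""
    entities: List[Tuple[str, str]] = []
    label = None
    tokens: List[str] = []
    for token, tag in sentence:
        if tag.startswith("I-") and tag[2:] == label:
            tokens.append(token)  # continuation of the open span
            continue
        if label is not None:
            entities.append((label, " ".join(tokens)))  # close the open span
        if tag.startswith(("B-", "I-")):
            label, tokens = tag[2:], [token]  # open a new span
        else:
            label, tokens = None, []
    if label is not None:
        entities.append((label, " ".join(tokens)))
    return entities


# Hypothetical sample; the labels are invented for illustration only.
SAMPLE = """La O
station B-Structure
d'épuration I-Structure
de O
Montpellier B-Location

Fin O
"""

if __name__ == "__main__":
    for sent in read_conll(SAMPLE.splitlines()):
        print(extract_entities(sent))
```

Because the French and Italian documents are aligned at the sentence level, the same reader could be run on both sides and the resulting sentence lists compared position by position, provided the files keep the same sentence order.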
Acknowledgement
European Union's Horizon research and innovation program
Project code: 10108625
Project name: STARWARS (STormwAteR and WastewAteR networkS heterogeneous data AI-driven management)
This item is Publicly Available
and licensed under:
Files in this item
- Name: StarwarsCorpus.zip
- Size: 559.42 KB
- Format: application/zip
- Description: Zip
- MD5: 335f0f1037273b0ba3c1f347842cf962
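The MD5 value listed for StarwarsCorpus.zip can be used to check that a download is complete and unmodified. The sketch below uses Python's standard `hashlib` module; the function names and the local file path are assumptions made for illustration.

```python
import hashlib

# Checksum published in this record for StarwarsCorpus.zip.
EXPECTED_MD5 = "335f0f1037273b0ba3c1f347842cf962"


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare a file's MD5 digest with the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Hypothetical location of the downloaded archive.
    print(verify_download("StarwarsCorpus.zip"))
```

MD5 is adequate for detecting accidental corruption during transfer, which is what the repository publishes it for; it is not a guarantee against deliberate tampering.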


