StarwarsNER French Italian Corpus - sample 2.0

Please use the following text to cite this item or export to a predefined format:
Frontini, Francesca; et al., 2026, StarwarsNER French Italian Corpus - sample 2.0, ILC-CNR for CLARIN-IT repository hosted at Institute for Computational Linguistics "A. Zampolli", http://hdl.handle.net/20.500.11752/ILC-2180
Date issued
2026-05-19
Size
8 files
Language(s)
Description
The StarwarsNER French Italian Corpus - sample is a multilingual benchmark resource for Named Entity Recognition (NER) in the wastewater and stormwater management domain. Compared with **StarwarsNER French Italian Corpus - sample 1.0**, this version adds English translations of the annotation guidelines and the plain texts, so that readers who do not speak Italian or French can better understand and potentially replicate the experiment.

The corpus supports research in:

- Information extraction
- Relation extraction
- Entity linking

The corpus consists of manually annotated parallel French and Italian documents, aligned at the sentence level. Annotations follow a domain-specific schema based on the Sewer Network Ontology <http://hdl.handle.net/20.500.11752/ILC-1037>.

For copyright reasons, this release contains only a sample of the original corpus, namely 8 French documents from public administrations and their Italian translations.

---

## Resource Creation

1. **French corpus**
   - Collected from reports, regulations, and local media texts.
   - Manually annotated according to the STARWARS schema.
2. **Italian (and English) corpus**
   - Produced via machine translation of the French texts.
   - Reviewed and corrected by bilingual translation students and expert hydrologists.
3. **Annotation process**
   - Conducted with the **INCEpTION** annotation platform.
   - Ensured consistent alignment between French and Italian.

For details, please refer to the publication:

F.A. Cardillo, F. Debole, F. Frontini, M. Aelami, N. Chahinian, S. Conrad (2025) “Novel Benchmark for NER in the Wastewater and Stormwater Domain”, Proceedings of the 6th IEEE MNLP Conf. (CiST-MNLP’2025) 4-10 October 2025, Marrakech, Morocco. <https://arxiv.org/abs/2506.01938>

---

## Contents of this Package

- **Texts**: Provided in plain text, in French, with translations in Italian and English (the latter for reference only).
- **Annotations**: Provided in **CoNLL 2003 format, as exported from INCEpTION**, for the French and Italian texts.
- **Annotation guidelines**: Included in **French**, with translations in **English** and **Italian**, as used by annotators.
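
Since the annotations are distributed in CoNLL 2003 format, a reader may want to load them as sentences of token/tag pairs. The following is a minimal sketch, not the authors' tooling. It assumes the usual CoNLL 2003 layout: one token per line, whitespace-separated columns with the token first and the entity tag last, blank lines between sentences, and optional `-DOCSTART-` marker lines. The exact column layout of the INCEpTION export should be checked against the files in the package. The sample text and the `EXAMPLE` label are invented for illustration and do not come from the corpus or its schema.

```python
from typing import Iterable, List, Tuple

# A sentence is a list of (token, entity_tag) pairs.
Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, tag) pairs.

    Assumption: the token is the first column and the entity tag is the
    last column; any columns in between are ignored.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # A blank line or a document marker closes the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Invented sample (hypothetical label "EXAMPLE"), not taken from the corpus.
SAMPLE = (
    "La _ _ O\n"
    "station _ _ B-EXAMPLE\n"
    "d'épuration _ _ I-EXAMPLE\n"
    "\n"
    "-DOCSTART- -X- -X- O\n"
    "\n"
    "Fin _ _ O\n"
)

parsed = read_conll(SAMPLE.splitlines())
```

To read a file from the package, the same function can be applied to an open file handle, for example `read_conll(open(path, encoding="utf-8"))`, where `path` is whichever annotation file was extracted from the archive.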
Acknowledgement

Version History

Version | Date | Summary
3*
2026-05-19 15:19:45
Adding English Translations for corpus files
2025-10-17 18:39:28
* Selected version
This item is Publicly Available
and licensed under:
 Files in this item
Name
StarwarsCorpus - 2.0.zip
Size
825.38 KB
Format
application/zip
Description
Zip
MD5
0b0225f8061be7a7fee3dce4c5853fb6
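
The MD5 value above can be used to check that a downloaded copy of the archive is intact. The following is a minimal sketch using Python's standard `hashlib` module. The function names `md5_of_file` and `matches_expected` are illustrative, and the local file path passed to them is an assumption about where the archive was saved.

```python
import hashlib

# MD5 checksum published on this page for the archive.
EXPECTED_MD5 = "0b0225f8061be7a7fee3dce4c5853fb6"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare a file's MD5 digest with the expected value, ignoring case."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_expected("StarwarsCorpus - 2.0.zip")` would return `True` for an uncorrupted download saved under that name in the current directory. MD5 is suitable here as an integrity check against transfer errors, not as protection against deliberate tampering.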