Please use the following text to cite this item:
Frontini, Francesca; et al., 2026, StarwarsNER French Italian Corpus - sample 2.0, ILC-CNR for CLARIN-IT repository hosted at Institute for Computational Linguistics "A. Zampolli", http://hdl.handle.net/20.500.11752/ILC-2180
| dc.contributor.author | Frontini, Francesca |
| dc.contributor.author | Chahinian, Nanée |
| dc.contributor.author | Aelami, Mitra |
| dc.contributor.author | Cardillo, Franco Alberto |
| dc.contributor.author | Conrad, Serge |
| dc.contributor.author | Debole, Franca |
| dc.date.accessioned | 2026-05-21T13:09:23Z |
| dc.date.available | 2025-10-17T18:39:28Z |
| dc.date.available | 2026-05-21T13:09:23Z |
| dc.date.issued | 2026-05-19 |
| dc.description | The StarwarsNER French Italian Corpus - sample is a multilingual benchmark resource for Named Entity Recognition (NER) in the wastewater and stormwater management domain. With respect to **StarwarsNER French Italian Corpus - sample 1.0**, this new version adds English translations of the annotation guidelines and the plain texts, so that people who do not speak Italian or French can better understand and potentially replicate the experiment.

It supports research in:

- Information extraction
- Relation extraction
- Entity linking

The corpus consists of manually annotated parallel French and Italian documents, aligned at the sentence level. Annotations follow a domain-specific schema based on the Sewer Network Ontology <http://hdl.handle.net/20.500.11752/ILC-1037>. For copyright reasons, this release contains only a sample of the original corpus, namely 8 French documents from public administrations and their Italian translations.

---

## Resource Creation

1. **French corpus**
   - Collected from reports, regulations, and local media texts.
   - Manually annotated according to the STARWARS schema.
2. **Italian (and English) corpus**
   - Produced via machine translation of the French texts.
   - Reviewed and corrected by bilingual translation students and expert hydrologists.
3. **Annotation process**
   - Conducted with the **INCEpTION** annotation platform.
   - Ensured consistent alignment between French and Italian.

For details, please refer to the publication: F.A. Cardillo, F. Debole, F. Frontini, M. Aelami, N. Chahinian, S. Conrad (2025) “Novel Benchmark for NER in the Wastewater and Stormwater Domain”, Proceedings of the 6th IEEE MNLP Conf. (CiST-MNLP’2025) 4-10 October 2025, Marrakech, Morocco. <https://arxiv.org/abs/2506.01938>

---

## Contents of this Package

- **Texts**: Provided in plain text, in French, with translations in Italian and English (the latter for reference only).
- **Annotations**: Provided in **CONLL 2003 format, as exported from INCEpTION**, for the French and Italian texts.
- **Annotation guidelines**: Included in **French**, with translations in **English** and **Italian**, as used by annotators. |
| dc.identifier.uri | http://hdl.handle.net/20.500.11752/ILC-2180 |
| dc.language.iso | ita |
| dc.language.iso | fra |
| dc.publisher | Istituto di Linguistica Computazionale “A. Zampolli” - Consiglio Nazionale delle Ricerche (ILC-CNR) |
| dc.publisher | Institute of Information Science and Technologies "Alessandro Faedo" - National Research Council of Italy (ISTI CNR) |
| dc.publisher | Institut de Recherche pour le Développement |
| dc.publisher | Université de Montpellier |
| dc.relation.isreferencedby | https://doi.org/10.1109/CiSt65886.2025.11224095 |
| dc.relation.isreplacedby | http://hdl.handle.net/20.500.11752/ILC-2169 |
| dc.relation.replaces | http://hdl.handle.net/20.500.11752/ILC-1052 |
| dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
| dc.rights.label | PUB |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0 |
| dc.source.uri | https://sites.google.com/view/horizoneurope2020-starwars/ |
| dc.subject | Named Entity Recognition |
| dc.subject | Sewer Network |
| dc.title | StarwarsNER French Italian Corpus - sample 2.0 |
| dc.type | corpus |
| local.branding | ILC |
| local.contact.person | Francesca Frontini francesca.frontini@cnr.it Istituto di Linguistica Computazionale “A. Zampolli” - Consiglio Nazionale delle Ricerche (ILC-CNR) |
| local.files.count | 1 |
| local.files.size | 845192 |
| local.has.files | yes |
| local.language.name | Italian |
| local.language.name | French |
| local.size.info | 8 files |
| local.sponsor | Other euFunds 10108625 European Union's Horizon research and innovation program STARWARS (STormwAteR and WastewAteR networkS heterogeneous data AI-driven management) |
| metashare.ResourceInfo#ContentInfo.mediaType | text |
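
The description above states that the annotations are distributed in CONLL 2003 format as exported from INCEpTION. The following is a minimal sketch of how such a file might be read in Python. It assumes whitespace-separated columns with the token in the first column and the entity tag in the last one, blank lines between sentences, and optional `-DOCSTART-` lines; the actual column layout of the files in this package should be checked before relying on it. The function names and the `EXAMPLE` label are hypothetical and are not taken from the STARWARS schema.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# A sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group (token, tag) pairs into sentences.

    Assumption: columns are whitespace-separated, the token is the first
    column and the entity tag is the last column. Blank lines and
    -DOCSTART- lines close the current sentence.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:  # file may not end with a blank line
        sentences.append(current)
    return sentences


def tag_counts(sentences: List[Sentence]) -> Counter:
    """Count every tag other than the outside tag "O"."""
    return Counter(tag for sent in sentences for _, tag in sent if tag != "O")


# Invented sample with a hypothetical label; not taken from the corpus.
SAMPLE = [
    "La O",
    "station B-EXAMPLE",
    "d'épuration I-EXAMPLE",
    "",
    "Fin O",
]

if __name__ == "__main__":
    parsed = read_conll(SAMPLE)
    print(len(parsed), "sentences")
    print(tag_counts(parsed))
```

To read a real file, an open file handle can be passed in place of `SAMPLE`, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points to one of the annotation files extracted from the archive.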
This item is Publicly Available and licensed under:
Creative Commons - Attribution 4.0 International (CC BY 4.0)
Files in this item
| Name | StarwarsCorpus - 2.0.zip |
| Size | 825.38 KB |
| Format | application/zip |
| Description | Zip |
| MD5 | 0b0225f8061be7a7fee3dce4c5853fb6 |
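
The MD5 value listed for the archive can be used to check that a download is intact. The sketch below computes the MD5 digest of a local file with Python's standard `hashlib` module and compares it with the value from this record. The helper names are hypothetical, and the local file path is an assumption that depends on where the archive was saved.

```python
import hashlib

# MD5 value as listed for the archive in this record.
EXPECTED_MD5 = "0b0225f8061be7a7fee3dce4c5853fb6"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks so that large archives do not need to
    fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare the file's digest with the expected value, ignoring case."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_record("StarwarsCorpus - 2.0.zip")` would return `True` for an unmodified copy saved under that name in the working directory. MD5 is suitable here only as an integrity check against accidental corruption, not as a security guarantee.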


