dc.contributor.author Perišić, Olja
dc.contributor.author Stanković, Ranka
dc.contributor.author Vitas, Duško
dc.contributor.author Krstev, Cvetana
dc.contributor.author Moderc, Saša
dc.date.accessioned 2022-09-22T12:55:44Z
dc.date.available 2022-09-22T12:55:44Z
dc.date.issued 2022-09-12
dc.identifier.uri http://hdl.handle.net/20.500.11752/OPEN-980
dc.description It-Sr-NER-corp is an Italian-Serbian bilingual corpus of 10,000 aligned sentences, compiled in the scope of the It-Sr-project from samples of several Italian novels translated into Serbian and vice versa, with the aim of developing a CLARIN-compatible NER web service for parallel text, with a case study on Italian and Serbian. The set of 10,000 natural language segments is split into 4 files: one with 1,000 segments and three with 3,000 segments each (1*1000+3*3000). The corpus comprises: 1) text versions, Italian and Serbian, with one segment per line; 2) TMX (Translation Memory eXchange) bilingual aligned segments; 3) monolingual text and TMX files with automatically annotated named entities for six NER classes: demonyms (DEMO), works of art (WORK), person names (PERS), places (LOC), events (EVENT) and organizations (ORG). It-Sr-NER annotation uses a Convolutional Neural Network architecture within the spaCy tool: for Italian, WikiNER (Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, James R Curran), and for Serbian, SrpCNNER (Cvetana Krstev, Ranka Stanković, Milica Ikonić Nešić, Branislava Šandrih Todorović).
dc.language.iso srp
dc.language.iso ita
dc.publisher Università degli studi di Torino
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0
dc.rights.label PUB
dc.source.uri https://github.com/rankastankovic/It-Sr-NER/
dc.subject NER
dc.subject TMX
dc.subject Named Entity Recognition
dc.subject aligned corpus
dc.subject Serbian
dc.subject Italian
dc.title It-Sr-NER: CLARIN compatible NER and geoparsing web services for parallel texts: case study Italian and Serbian
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding OPEN
demo.uri https://github.com/rankastankovic/It-Sr-NER/tree/main/corpus
contact.person Olja Perišić olja.perisic@unito.it Università degli studi di Torino
sponsor CLARIN ERIC CLARIN Bridging Gaps project CLARIN Bridging Gaps Other
size.info 10000 sentences
files.size 7282687
files.count 1


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name: It-Sr-NER-corp.zip
Size: 6.95 MB
Format: application/zip
Description: It-Sr-NER corpus
MD5: f4086b834ae5a57a51111d35b750b248
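
The MD5 value listed above can be used to check that a downloaded copy of the archive is intact. The following is a minimal sketch, assuming the archive has been saved locally; the helper names md5_of_file and matches_record are hypothetical and not part of the repository.

```python
import hashlib

# Checksum copied from the MD5 field of this record.
EXPECTED_MD5 = "f4086b834ae5a57a51111d35b750b248"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """True if the file's MD5 equals the checksum published in the record."""
    return md5_of_file(path) == expected.lower()
```

Calling matches_record("It-Sr-NER-corp.zip") on the downloaded file should return True if the download is complete and unmodified; the local path is an assumption.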
 File Preview  
  • corpus
    • monolingual
      • Sr-txt-NER
        • txt_1-sr.txt.ner (450 kB)
        • txt_2-sr.txt.ner (452 kB)
        • txt_0-sr.txt.ner (182 kB)
        • txt_3-sr.txt.ner (457 kB)
      • It-txt-NER
        • txt_1-it.txt.ner (500 kB)
        • txt_2-it.txt.ner (497 kB)
        • txt_0-it.txt.ner (203 kB)
        • txt_3-it.txt.ner (503 kB)
      • Sr-txt
        • txt_3-sr.txt (439 kB)
        • txt_2-sr.txt (434 kB)
        • txt_1-sr.txt (433 kB)
        • txt_0-sr.txt (170 kB)
      • It-txt
        • txt_3-it.txt (480 kB)
        • txt_2-it.txt (474 kB)
        • txt_1-it.txt (477 kB)
        • txt_0-it.txt (188 kB)
    • readme.md (1 kB)
    • bilingual
      • It-Sr-tmx-NER
        • txt_3-it-sr-TMX.xml.ner (1 MB)
        • txt_1-it-sr-TMX.xml.ner (1 MB)
        • txt_2-it-sr-TMX.xml.ner (1 MB)
        • txt_0-it-sr-TMX.xml.ner (613 kB)
      • It-Sr-tmx
        • txt_2-it-sr-TMX.xml (1 MB)
        • readme.md (106 B)
        • txt_1-it-sr-TMX.xml (1 MB)
        • txt_3-it-sr-TMX.xml (1 MB)
        • txt_0-it-sr-TMX.xml (595 kB)
      • It-Sr-html
        • txt_3-it-sr.html (1 MB)
        • txt_1-it-sr.html (1 MB)
        • readme.md (18 B)
        • txt_2-it-sr.html (1 MB)
        • txt_0-it-sr.html (452 kB)
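
The aligned segments in the It-Sr-tmx files could be loaded as Italian/Serbian pairs along the following lines. This is a sketch under stated assumptions, not a description of the project's own tooling: it assumes the files follow the usual TMX layout (tu elements containing tuv elements with a seg child), that the language code is stored in xml:lang (or lang) and begins with "it" or "sr", and the function name read_tmx_pairs is hypothetical.

```python
import xml.etree.ElementTree as ET

# Expanded name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang="it", tgt_lang="sr"):
    """Return a list of (source, target) segment pairs from a TMX file.

    `source` may be a path or a binary file-like object.
    Assumption: language codes start with the two-letter prefixes given.
    """
    pairs = []
    root = ET.parse(source).getroot()
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segments[lang[:2]] = "".join(seg.itertext()).strip()
        # Keep only translation units that contain both languages.
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs
```

For example, read_tmx_pairs("corpus/bilingual/It-Sr-tmx/txt_0-it-sr-TMX.xml") would return the pairs from the first file, provided the archive is unpacked in the working directory and the files match the assumed layout.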
