Development of computational linguistic resources for automated detection of textual cyberbullying threats in Roman Urdu language

Authors

DOI:

https://doi.org/10.17993/3c%20tic.2021.102.101-121

Abstract

Automatic cyberbullying detection remains a very challenging task because social media content and conversations are usually posted as unstructured free text that disregards language norms. The major gap in formulating cyberbullying detection strategies is the scarcity of linguistic resources, particularly for newly evolved languages. Roman Urdu has emerged only recently and is therefore a resource-poor language. Urdu is widely known as the national language of Pakistan. However, because of socio-cultural and multilingual factors, Roman Urdu is widely used on the Internet by Asians, and more specifically by Pakistanis.

To fill this gap, this research presents guidelines for the data annotation process and develops two linguistic resources. (i) An annotated corpus in Roman Urdu for cyberaggression and offensive language detection. Data annotation was carried out by bilingual annotators rather than through crowdsourcing, which has the benefit of correctly annotating instances that constitute clear cases of cyberbullying without compromising data quality. The developed corpus is highly balanced (with almost negligible skew), unlike most existing corpora, even those in mature languages. (ii) A domain-specific stop-word list for Roman Urdu. Processing textual information for NLP tasks involves stop-word elimination as a sub-phase: stop words carry the least semantic information and inflate the feature space compared with the other tokens and index terms in a corpus. The list accounts for all lexical variants and was developed specifically in the context of aggression detection and the collected data. The work was carried out using the Python programming language and the PyCharm IDE.
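As a rough illustration of the stop-word elimination step described above, the following minimal sketch filters Roman Urdu tokens against a small list that groups spelling variants together. The entries in `STOP_WORD_VARIANTS`, the tokenizer, and the function names are assumptions made for this example; they are not the stop-word list or the code developed in this work.

```python
import re

# Hypothetical stop-word groups: a canonical form mapped to spelling variants.
# These entries are illustrative only, NOT the list developed by the authors.
STOP_WORD_VARIANTS = {
    "hai": {"hai", "hay", "hy", "he"},
    "ka": {"ka", "kaa"},
    "ki": {"ki", "kee"},
    "aur": {"aur", "or"},
    "mein": {"mein", "main", "me"},
}

# Flatten all variants into a single set for constant-time membership checks.
STOP_WORDS = frozenset(
    variant
    for variants in STOP_WORD_VARIANTS.values()
    for variant in variants
)

# Assumed tokenizer: lowercase the text and keep runs of Latin letters only.
TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Split free text into lowercase alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the tokens of `text` that are not in `stop_words`."""
    return [token for token in tokenize(text) if token not in stop_words]


if __name__ == "__main__":
    print(remove_stop_words("Yeh kitab ka cover hai"))
```

Grouping variants under one canonical form keeps the list maintainable when new spellings are observed in the data, while the flattened set keeps filtering fast.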

Author biographies

Amirita Dewani, Mehran University of Engineering & Technology, Jamshoro, Sindh, (Pakistan).

Mohsin Ali Memon, Mehran University of Engineering & Technology, Jamshoro, Sindh, (Pakistan).

Sania Bhatti, Mehran University of Engineering & Technology, Jamshoro, Sindh, (Pakistan).

Published

2021-06-29

How to cite

Dewani, A., Ali Memon, M., & Bhatti, S. (2021). Development of computational linguistic resources for automated detection of textual cyberbullying threats in Roman Urdu language. 3C TIC. Cuadernos De Desarrollo Aplicados a Las TIC, 10(2), 101–121. https://doi.org/10.17993/3c tic.2021.102.101-121