{"id":256,"date":"2024-02-10T09:21:16","date_gmt":"2024-02-10T09:21:16","guid":{"rendered":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/"},"modified":"2024-12-03T09:22:22","modified_gmt":"2024-12-03T09:22:22","slug":"pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert","status":"publish","type":"post","link":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/","title":{"rendered":"<b style=\"color: darkblue\">H. Ijazul,<\/b> Q. Weidong, G. Jie, and T. Peng, &#8220;Pashto Offensive Language Detection: A Benchmark Dataset and Monolingual Pashto BERT,&#8221; PeerJ Computer Science, 2023. <b style=\"color: #b85b00\">(SCI)<\/b>"},"content":{"rendered":"\n<h5 class=\"wp-block-heading\">Abstract<\/h5>\n\n\n\n<p>Social media platforms have become inundated with offensive language. This issue must be addressed for the growth of online social networks (OSNs) and a healthy online environment. While significant research has been devoted to identifying toxic content in major languages like English, this remains an open area of research in the low-resource Pashto language. This study aims to develop an AI model for the automatic detection of offensive textual content in Pashto. To achieve this goal, we have developed a benchmark dataset called the Pashto Offensive Language Dataset (POLD), which comprises tweets collected from Twitter and manually classified into two categories: \u201coffensive\u201d and \u201cnot offensive\u201d. To discriminate these two categories, we investigated the classic deep learning classifiers based on neural networks, including CNNs and RNNs, using static word embeddings: Word2Vec, fastText, and GloVe as features. Furthermore, we examined two transfer learning approaches. In the first approach, we fine-tuned the pre-trained multilingual language model, XLM-R, using the POLD dataset, whereas, in the second approach, we trained a monolingual BERT model for Pashto from scratch using a custom-developed text corpus. Pashto BERT was then fine-tuned similarly to XLM-R. The performance of all the deep learning and transformer learning models was evaluated using the POLD dataset. The experimental results demonstrate that our pre-trained Pashto BERT model outperforms the other models, achieving an F1-score of 94.34% and an accuracy of 94.77%.<\/p>\n\n\n\n<p>Full Paper: <a href=\"https:\/\/doi.org\/10.7717\/peerj-cs.1617\">https:\/\/doi.org\/10.7717\/peerj-cs.1617<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Abstract Social media platforms have become inundated with offensive language. This issue must be addressed for the growth of online social networks (OSNs) and a healthy online environment. While significant research has been devoted to identifying toxic content in major languages like English, this remains an open area of research in the low-resource Pashto language. [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":295,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[44,25,8,6,24,7,43,33],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v23.0 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>H. Ijazul, Q. Weidong, G. Jie, and T. Peng, &quot;Pashto Offensive Language Detection: A Benchmark Dataset and Monolingual Pashto BERT,&quot; PeerJ Computer Science, 2023. (SCI) - Ijazul Haq<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"H. Ijazul, Q. Weidong, G. Jie, and T. Peng, &quot;Pashto Offensive Language Detection: A Benchmark Dataset and Monolingual Pashto BERT,&quot; PeerJ Computer Science, 2023. (SCI) - Ijazul Haq\" \/>\n<meta property=\"og:description\" content=\"Abstract Social media platforms have become inundated with offensive language. This issue must be addressed for the growth of online social networks (OSNs) and a healthy online environment. While significant research has been devoted to identifying toxic content in major languages like English, this remains an open area of research in the low-resource Pashto language. [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/\" \/>\n<meta property=\"og:site_name\" content=\"Ijazul Haq\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/ijaz.phd\" \/>\n<meta property=\"article:author\" content=\"https:\/\/www.facebook.com\/ijaz.phd\" \/>\n<meta property=\"article:published_time\" content=\"2024-02-10T09:21:16+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-12-03T09:22:22+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/pold-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"500\" \/>\n\t<meta property=\"og:image:height\" content=\"322\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"ijaz.me\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"ijaz.me\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/\"},\"author\":{\"name\":\"ijaz.me\",\"@id\":\"https:\/\/ijaz.me\/#\/schema\/person\/3bb27d2451420d1138b550fcb6de0c7f\"},\"headline\":\"H. Ijazul, Q. Weidong, G. Jie, and T. Peng, &#8220;Pashto Offensive Language Detection: A Benchmark Dataset and Monolingual Pashto BERT,&#8221; PeerJ Computer Science, 2023. (SCI)\",\"datePublished\":\"2024-02-10T09:21:16+00:00\",\"dateModified\":\"2024-12-03T09:22:22+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/\"},\"wordCount\":258,\"publisher\":{\"@id\":\"https:\/\/ijaz.me\/#\/schema\/person\/3bb27d2451420d1138b550fcb6de0c7f\"},\"image\":{\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/pold-1.png\",\"keywords\":[\"emotion recognition in pashto\",\"Low-resource languages\",\"NLP\",\"offensive language detection\",\"OSNs Security and Privacy\",\"Pashto language\",\"pashto sentiment analysis\",\"sentiment analysis\"],\"articleSection\":[\"papers\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/\",\"url\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/\",\"name\":\"H. Ijazul, Q. Weidong, G. Jie, and T. Peng, \\\"Pashto Offensive Language Detection: A Benchmark Dataset and Monolingual Pashto BERT,\\\" PeerJ Computer Science, 2023. (SCI) - Ijazul Haq\",\"isPartOf\":{\"@id\":\"https:\/\/ijaz.me\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/pold-1.png\",\"datePublished\":\"2024-02-10T09:21:16+00:00\",\"dateModified\":\"2024-12-03T09:22:22+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/#primaryimage\",\"url\":\"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/pold-1.png\",\"contentUrl\":\"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/pold-1.png\",\"width\":500,\"height\":322},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/ijaz.me\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"H. Ijazul, Q. Weidong, G. Jie, and T. Peng, &#8220;Pashto Offensive Language Detection: A Benchmark Dataset and Monolingual Pashto BERT,&#8221; PeerJ Computer Science, 2023. (SCI)\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/ijaz.me\/#website\",\"url\":\"https:\/\/ijaz.me\/\",\"name\":\"ijazul Haq\",\"description\":\"AI Researcher\",\"publisher\":{\"@id\":\"https:\/\/ijaz.me\/#\/schema\/person\/3bb27d2451420d1138b550fcb6de0c7f\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/ijaz.me\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\/\/ijaz.me\/#\/schema\/person\/3bb27d2451420d1138b550fcb6de0c7f\",\"name\":\"ijaz.me\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/ijaz.me\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/ijaz.me\/wp-content\/uploads\/2020\/03\/img.jpg\",\"contentUrl\":\"https:\/\/ijaz.me\/wp-content\/uploads\/2020\/03\/img.jpg\",\"width\":350,\"height\":350,\"caption\":\"ijaz.me\"},\"logo\":{\"@id\":\"https:\/\/ijaz.me\/#\/schema\/person\/image\/\"},\"description\":\"Artificial Intelligence (AI) Researcher, Software Engineer, Programmer, and Entrepreneur. Exploring Natural Language Processing (NLP), Large Language Models (LLMs) and Computer Vision.\",\"sameAs\":[\"http:\/\/ijaz.me\",\"https:\/\/www.facebook.com\/ijaz.phd\",\"https:\/\/www.instagram.com\/ijaz.me\",\"https:\/\/www.linkedin.com\/in\/drijazulhaq\/\"],\"url\":\"https:\/\/ijaz.me\/index.php\/author\/ijaz-me\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"H. Ijazul, Q. Weidong, G. Jie, and T. Peng, \"Pashto Offensive Language Detection: A Benchmark Dataset and Monolingual Pashto BERT,\" PeerJ Computer Science, 2023. (SCI) - Ijazul Haq","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/","og_locale":"en_US","og_type":"article","og_title":"H. Ijazul, Q. Weidong, G. Jie, and T. Peng, \"Pashto Offensive Language Detection: A Benchmark Dataset and Monolingual Pashto BERT,\" PeerJ Computer Science, 2023. (SCI) - Ijazul Haq","og_description":"Abstract Social media platforms have become inundated with offensive language. This issue must be addressed for the growth of online social networks (OSNs) and a healthy online environment. While significant research has been devoted to identifying toxic content in major languages like English, this remains an open area of research in the low-resource Pashto language. [&hellip;]","og_url":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/","og_site_name":"Ijazul Haq","article_publisher":"https:\/\/www.facebook.com\/ijaz.phd","article_author":"https:\/\/www.facebook.com\/ijaz.phd","article_published_time":"2024-02-10T09:21:16+00:00","article_modified_time":"2024-12-03T09:22:22+00:00","og_image":[{"width":500,"height":322,"url":"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/pold-1.png","type":"image\/png"}],"author":"ijaz.me","twitter_card":"summary_large_image","twitter_misc":{"Written by":"ijaz.me","Est. reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/#article","isPartOf":{"@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/"},"author":{"name":"ijaz.me","@id":"https:\/\/ijaz.me\/#\/schema\/person\/3bb27d2451420d1138b550fcb6de0c7f"},"headline":"H. Ijazul, Q. Weidong, G. Jie, and T. Peng, &#8220;Pashto Offensive Language Detection: A Benchmark Dataset and Monolingual Pashto BERT,&#8221; PeerJ Computer Science, 2023. (SCI)","datePublished":"2024-02-10T09:21:16+00:00","dateModified":"2024-12-03T09:22:22+00:00","mainEntityOfPage":{"@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/"},"wordCount":258,"publisher":{"@id":"https:\/\/ijaz.me\/#\/schema\/person\/3bb27d2451420d1138b550fcb6de0c7f"},"image":{"@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/#primaryimage"},"thumbnailUrl":"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/pold-1.png","keywords":["emotion recognition in pashto","Low-resource languages","NLP","offensive language detection","OSNs Security and Privacy","Pashto language","pashto sentiment analysis","sentiment analysis"],"articleSection":["papers"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/","url":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/","name":"H. Ijazul, Q. Weidong, G. Jie, and T. Peng, \"Pashto Offensive Language Detection: A Benchmark Dataset and Monolingual Pashto BERT,\" PeerJ Computer Science, 2023. (SCI) - Ijazul Haq","isPartOf":{"@id":"https:\/\/ijaz.me\/#website"},"primaryImageOfPage":{"@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/#primaryimage"},"image":{"@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/#primaryimage"},"thumbnailUrl":"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/pold-1.png","datePublished":"2024-02-10T09:21:16+00:00","dateModified":"2024-12-03T09:22:22+00:00","breadcrumb":{"@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/#primaryimage","url":"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/pold-1.png","contentUrl":"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/pold-1.png","width":500,"height":322},{"@type":"BreadcrumbList","@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/pashto-offensive-language-detection-a-benchmark-dataset-and-monolingual-pashto-bert\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/ijaz.me\/"},{"@type":"ListItem","position":2,"name":"H. Ijazul, Q. Weidong, G. Jie, and T. Peng, &#8220;Pashto Offensive Language Detection: A Benchmark Dataset and Monolingual Pashto BERT,&#8221; PeerJ Computer Science, 2023. (SCI)"}]},{"@type":"WebSite","@id":"https:\/\/ijaz.me\/#website","url":"https:\/\/ijaz.me\/","name":"ijazul Haq","description":"AI Researcher","publisher":{"@id":"https:\/\/ijaz.me\/#\/schema\/person\/3bb27d2451420d1138b550fcb6de0c7f"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ijaz.me\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/ijaz.me\/#\/schema\/person\/3bb27d2451420d1138b550fcb6de0c7f","name":"ijaz.me","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ijaz.me\/#\/schema\/person\/image\/","url":"https:\/\/ijaz.me\/wp-content\/uploads\/2020\/03\/img.jpg","contentUrl":"https:\/\/ijaz.me\/wp-content\/uploads\/2020\/03\/img.jpg","width":350,"height":350,"caption":"ijaz.me"},"logo":{"@id":"https:\/\/ijaz.me\/#\/schema\/person\/image\/"},"description":"Artificial Intelligence (AI) Researcher, Software Engineer, Programmer, and Entrepreneur. Exploring Natural Language Processing (NLP), Large Language Models (LLMs) and Computer Vision.","sameAs":["http:\/\/ijaz.me","https:\/\/www.facebook.com\/ijaz.phd","https:\/\/www.instagram.com\/ijaz.me","https:\/\/www.linkedin.com\/in\/drijazulhaq\/"],"url":"https:\/\/ijaz.me\/index.php\/author\/ijaz-me\/"}]}},"_links":{"self":[{"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/posts\/256"}],"collection":[{"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/comments?post=256"}],"version-history":[{"count":8,"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/posts\/256\/revisions"}],"predecessor-version":[{"id":576,"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/posts\/256\/revisions\/576"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/media\/295"}],"wp:attachment":[{"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/media?parent=256"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/categories?post=256"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/tags?post=256"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}