Abstract
Online Social Networks (OSNs), which have emerged over the past two decades, have become an integral part of daily life, providing numerous benefits such as information sharing and ease of doing business. However, with the increased usage of these platforms, there has been a rise in abusive activities, such as the dissemination of offensive language, hate speech, and cyberbullying. This issue must be addressed to sustain the growth of OSNs and a healthy online environment. However, the sheer volume of content posted every day makes it challenging for organizations to monitor, analyze, and understand it. Natural Language Processing (NLP) techniques are therefore adopted to tackle these issues. A substantial amount of research has been devoted to incorporating NLP into information security and privacy, but this work focuses predominantly on major languages like English, inadvertently neglecting low-resource languages like Pashto.
This study aims to address these issues in the low-resource Pashto language, with a particular focus on identifying offensive textual content on OSNs. Pashto, despite being the language of over 50 million people, remains largely unexplored in NLP research and lacks off-the-shelf resources and tools, even for fundamental text-processing tasks such as text tokenization and chunking. As pioneering NLP research in Pashto, this study develops everything from scratch. Its key contributions include the development of four NLP models, for Space Correction, Word Segmentation, POS Tagging, and Offensive Language Detection, specifically adapted to the low-resource Pashto language. All four models are based on supervised learning and trained on labeled datasets; the creation of these benchmark datasets is another invaluable contribution of this study, because the scarcity of structured content is the main obstacle to NLP research in Pashto. Our contributions can be summarized as follows:
(1) Text chunking is the initial step in many NLP applications, in which the input text is split into its basic units for processing. However, chunking Pashto text is not straightforward, because there are no standard rules for the proper usage of whitespace in Pashto writing. Inconsistent whitespace usage leads to poor tokenization, which in turn degrades the performance of downstream applications. Hence, we developed a supervised tokenizer that predicts the correct positions of whitespace in the text before splitting it into tokens. For this task, we pretrained a Pashto BERT model at the character level and then fine-tuned it on a labeled dataset so that, given a sequence of input characters, it predicts whether each character should be followed by a whitespace. The fine-tuning dataset is a Pashto text corpus annotated with the correct positions of whitespace, developed specifically for this task.
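The character-level labeling scheme behind this space-correction task can be sketched as follows. This is a minimal illustration of the idea only, not the study's implementation: the function names are ours, and plain Python stands in for the character-level BERT classifier that actually predicts the labels.

```python
def make_char_labels(text):
    """From correctly spaced text, build the training pairs: the
    character sequence with spaces stripped, plus a binary label per
    character (1 = a whitespace followed it, 0 = it did not)."""
    chars, labels = [], []
    for i, ch in enumerate(text):
        if ch == " ":
            continue
        chars.append(ch)
        labels.append(1 if i + 1 < len(text) and text[i + 1] == " " else 0)
    return chars, labels

def restore_spaces(chars, labels):
    """Inverse operation at inference time: re-insert whitespace
    wherever the model predicts label 1."""
    out = []
    for ch, lab in zip(chars, labels):
        out.append(ch)
        if lab:
            out.append(" ")
    return "".join(out).rstrip()
```

At inference, the fine-tuned model plays the role of the gold labels here: it receives the unspaced character sequence and emits one binary decision per character, from which the corrected text is rebuilt.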
(2) The tokenizer may be sufficient for the majority of NLP applications, but it cannot retrieve compound words (words having more than one sub-part separated by space), even though a significant portion of the Pashto lexicon consists of such words. Some NLP applications, such as part-of-speech (POS) tagging or named entity recognition (NER), require the retrieval of "full words" rather than space-delimited tokens. To achieve this goal, another chunking approach, called word segmentation, is commonly employed. However, no machine learning-based word segmenter has been available for the Pashto language so far, and the baseline lexicon-based technique suffers from so-called out-of-vocabulary (OOV) errors. Hence, this study includes the development of a specialized word segmenter for Pashto. For this task, we employed the classic BERT architecture, trained on WordPiece-level tokens and fine-tuned on a labeled dataset. The model takes a sequence of WordPiece tokens as input and predicts whether the space following each token is a word delimiter or not.
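The segmentation labels for this task can be sketched in the same spirit. For brevity this toy uses whitespace tokens rather than the WordPiece tokens the study's model operates on, and the function names are our own; only the labeling idea mirrors the description above.

```python
def make_segmentation_labels(words):
    """From gold-standard 'full words' (compounds may contain internal
    spaces), flatten to space-delimited tokens and label each token:
    1 = the space after it ends a word, 0 = it falls inside a compound."""
    tokens, labels = [], []
    for word in words:
        parts = word.split(" ")
        for j, part in enumerate(parts):
            tokens.append(part)
            labels.append(1 if j == len(parts) - 1 else 0)
    return tokens, labels

def rebuild_words(tokens, labels):
    """Inverse at inference time: merge tokens back into full words
    according to the model's per-token delimiter predictions."""
    words, current = [], []
    for tok, lab in zip(tokens, labels):
        current.append(tok)
        if lab:
            words.append(" ".join(current))
            current = []
    if current:
        words.append(" ".join(current))
    return words
```

Because the classifier decides each boundary from context rather than from a fixed word list, it can segment compounds it has never seen, which is exactly what the lexicon-based baseline cannot do (the OOV problem).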
(3) We developed a specialized POS tagger for Pashto. This module includes four contributions. (i) We developed the pioneering POS tagset for Pashto, a concise and pragmatic set of 36 grammatical categories encompassing the entire Pashto vocabulary. (ii) According to this tagset, we annotated a Pashto text corpus using a web application developed specifically for this task, to expedite the manual tagging and ensure annotation quality. (iii) Using the POS-annotated corpus, we evaluated various up-to-date machine learning and deep learning sequence tagging models on Pashto text to establish a solid baseline for future sequence tagging tasks in Pashto NLP. (iv) Finally, we developed our Pashto POS tagger, which concatenates pretrained BERT embeddings with lexical features to obtain a richer representation of the input words, which is then passed to a linear layer to predict the grammatical category of each word.
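The final tagging step, concatenating a contextual embedding with lexical features and projecting through a linear layer, can be sketched in plain Python. This is an illustrative toy with made-up dimensions, weights, and names; the actual tagger uses pretrained PsBERT embeddings and learned parameters.

```python
def concat_features(bert_vec, lex_vec):
    """Concatenate a word's contextual embedding with its hand-crafted
    lexical feature vector into one input representation."""
    return bert_vec + lex_vec  # list concatenation

def linear_layer(x, weights, bias):
    """One linear projection: scores[k] = sum_i W[k][i] * x[i] + b[k],
    producing one score per POS tag."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def predict_tag(scores, tagset):
    """Pick the tag with the highest score (argmax over the tagset)."""
    best = max(range(len(scores)), key=lambda k: scores[k])
    return tagset[best]
```

The design choice reflected here is feature fusion: lexical cues (e.g. suffix or dictionary features) complement the contextual embedding, which helps when the pretrained model's coverage of a low-resource language is thin.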
(4) The fourth and last model is for Pashto Offensive Language Detection on OSNs. In developing this model, we make two contributions to Pashto NLP: first, we address the issue of offensive content on OSNs, and second, we use this application as a case study to thoroughly investigate the sequence classification task in Pashto. While the main objective was to examine the performance of various deep learning and transfer learning models in discriminating between toxic and non-toxic Pashto text, we also propose a POS-enhanced BERT model for this task. By capturing POS information, the model yields better performance than the underlying BERT model on sequence classification tasks, and may thus partially offset the limitations of resource scarcity.

In addition to the contributions above, this study also involves pretraining the pioneering monolingual Pashto BERT model (PsBERT), as well as static word embeddings, including Word2Vec, fastText, and GloVe. To train PsBERT and the word embeddings, we compiled a Pashto text corpus comprising nearly 30 million words. Finally, all the outcomes of this study, encompassing annotated corpora, lexicons, benchmark datasets, static word embeddings, and pretrained language models, have been packaged into a toolkit named NLPashto, a standard Python library distributed publicly on GitHub and PyPI. These tools and resources, developed for the low-resource Pashto language, represent a significant milestone. As a pioneering study in Pashto NLP, this research will provide crucial support for, and accelerate, future investigations in this domain.
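The idea of enriching a BERT classifier with POS information can be illustrated with a toy sketch. The normalized tag-frequency vector and the simple concatenation below are our own stand-ins for exposition; the study's actual fusion mechanism inside the POS-enhanced BERT model may differ.

```python
def pos_feature_vector(pos_tags, tagset):
    """Normalized frequency of each POS tag in a sentence: one
    simple way to summarize POS information as a fixed-size vector."""
    counts = [pos_tags.count(t) for t in tagset]
    total = sum(counts) or 1  # guard against empty input
    return [c / total for c in counts]

def pos_enhanced_input(sentence_vec, pos_tags, tagset):
    """Concatenate the pooled sentence embedding with the POS feature
    vector before the final classification layer."""
    return sentence_vec + pos_feature_vector(pos_tags, tagset)
```

The intuition is that grammatical structure (e.g. the density of adjectives or interjections) carries signal about toxicity that a data-starved language model may not capture on its own, so injecting it explicitly can compensate for limited pretraining data.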
