{"id":267,"date":"2024-02-10T10:01:48","date_gmt":"2024-02-10T10:01:48","guid":{"rendered":"https:\/\/ijaz.me\/?p=267"},"modified":"2024-12-03T09:22:12","modified_gmt":"2024-12-03T09:22:12","slug":"correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf","status":"publish","type":"post","link":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/","title":{"rendered":"<b style=\"color: darkblue\">H. Ijazul,<\/b> Q. Weidong, G. Jie, and T. Peng, &#8220;Correction of Whitespace and Word Segmentation in Noisy Pashto Text using CRF,&#8221; Speech Communication, vol. 153, p. 102970, 2023. <b style=\"color: #b85b00\">(SCI)<\/b>"},"content":{"rendered":"\n<h5 class=\"wp-block-heading\">Abstract<\/h5>\n\n\n\n<p>Word segmentation is the process of splitting up the text into words. In English and most European languages, word boundaries are identified by whitespace, while in Pashto, there is no explicit word delimiter. Pashto uses whitespace for word separation but not consistently, and it cannot be considered a reliable word-boundary identifier. This inconsistency makes Pashto word segmentation unique and challenging. Moreover, Pashto is a low-resource, non-standardized language with no established rules for the correct usage of whitespace, which leads to two typical spelling errors: space-omission and space-insertion. These errors significantly affect the performance of the word segmenter. This study aims to develop a state-of-the-art word segmenter for Pashto, with a proofing tool to identify and correct the position of spaces in noisy text. The conditional random field (CRF) algorithm is used to train two machine learning models for these tasks. To train the models, we developed a text corpus of nearly 3.5 million words, annotated for the correct positions of spaces and explicit word boundary information using a lexicon-based technique, and then manually checked for errors. 
The experimental results of the model are very satisfactory, where the F1-scores are 99.2% and 96.7% for the proofing model and word segmenter, respectively.<\/p>\n\n\n\n<p>Full Paper: <a href=\"https:\/\/doi.org\/10.1016\/j.specom.2023.102970\">https:\/\/doi.org\/10.1016\/j.specom.2023.102970<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Abstract Word segmentation is the process of splitting up the text into words. In English and most European languages, word boundaries are identified by whitespace, while in Pashto, there is no explicit word delimiter. Pashto uses whitespace for word separation but not consistently, and it cannot be considered a reliable word-boundary identifier. This inconsistency makes [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":286,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[18,19,20,17,21,8,26,23,27],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v23.0 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>H. Ijazul, Q. Weidong, G. Jie, and T. Peng, &quot;Correction of Whitespace and Word Segmentation in Noisy Pashto Text using CRF,&quot; Speech Communication, vol. 153, p. 102970, 2023. (SCI) - Ijazul Haq<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"H. Ijazul, Q. Weidong, G. Jie, and T. Peng, &quot;Correction of Whitespace and Word Segmentation in Noisy Pashto Text using CRF,&quot; Speech Communication, vol. 153, p. 102970, 2023. 
(SCI) - Ijazul Haq\" \/>\n<meta property=\"og:description\" content=\"Abstract Word segmentation is the process of splitting up the text into words. In English and most European languages, word boundaries are identified by whitespace, while in Pashto, there is no explicit word delimiter. Pashto uses whitespace for word separation but not consistently, and it cannot be considered a reliable word-boundary identifier. This inconsistency makes [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/\" \/>\n<meta property=\"og:site_name\" content=\"Ijazul Haq\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/ijaz.phd\" \/>\n<meta property=\"article:author\" content=\"https:\/\/www.facebook.com\/ijaz.phd\" \/>\n<meta property=\"article:published_time\" content=\"2024-02-10T10:01:48+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-12-03T09:22:12+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/ws-lex-based-annotation.drawio.png\" \/>\n\t<meta property=\"og:image:width\" content=\"500\" \/>\n\t<meta property=\"og:image:height\" content=\"101\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"ijaz.me\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"ijaz.me\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/\"},\"author\":{\"name\":\"ijaz.me\",\"@id\":\"https:\/\/ijaz.me\/#\/schema\/person\/3bb27d2451420d1138b550fcb6de0c7f\"},\"headline\":\"H. Ijazul, Q. Weidong, G. Jie, and T. Peng, &#8220;Correction of Whitespace and Word Segmentation in Noisy Pashto Text using CRF,&#8221; Speech Communication, vol. 153, p. 102970, 2023. (SCI)\",\"datePublished\":\"2024-02-10T10:01:48+00:00\",\"dateModified\":\"2024-12-03T09:22:12+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/\"},\"wordCount\":231,\"publisher\":{\"@id\":\"https:\/\/ijaz.me\/#\/schema\/person\/3bb27d2451420d1138b550fcb6de0c7f\"},\"image\":{\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/ws-lex-based-annotation.drawio.png\",\"keywords\":[\"ijaz AI developer\",\"ijaz AI researcher\",\"Ijaz ul haq\",\"ijazul haq NLP\",\"ijazul haq nlp researcher\",\"NLP\",\"Pashto\",\"Text Processing\",\"Word 
Segmentation\"],\"articleSection\":[\"papers\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/\",\"url\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/\",\"name\":\"H. Ijazul, Q. Weidong, G. Jie, and T. Peng, \\\"Correction of Whitespace and Word Segmentation in Noisy Pashto Text using CRF,\\\" Speech Communication, vol. 153, p. 102970, 2023. (SCI) - Ijazul Haq\",\"isPartOf\":{\"@id\":\"https:\/\/ijaz.me\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/ws-lex-based-annotation.drawio.png\",\"datePublished\":\"2024-02-10T10:01:48+00:00\",\"dateModified\":\"2024-12-03T09:22:12+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/#primaryimage\",\"url\":\"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/ws-lex-based-annotation.drawio.png\",\"contentUrl\":\"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/ws-lex-based-annotation.drawio.png\",\"width\":500,\"height\":101},{\"@type\":\"Bre
adcrumbList\",\"@id\":\"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/ijaz.me\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"H. Ijazul, Q. Weidong, G. Jie, and T. Peng, &#8220;Correction of Whitespace and Word Segmentation in Noisy Pashto Text using CRF,&#8221; Speech Communication, vol. 153, p. 102970, 2023. (SCI)\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/ijaz.me\/#website\",\"url\":\"https:\/\/ijaz.me\/\",\"name\":\"ijazul Haq\",\"description\":\"AI Researcher\",\"publisher\":{\"@id\":\"https:\/\/ijaz.me\/#\/schema\/person\/3bb27d2451420d1138b550fcb6de0c7f\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/ijaz.me\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\/\/ijaz.me\/#\/schema\/person\/3bb27d2451420d1138b550fcb6de0c7f\",\"name\":\"ijaz.me\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/ijaz.me\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/ijaz.me\/wp-content\/uploads\/2020\/03\/img.jpg\",\"contentUrl\":\"https:\/\/ijaz.me\/wp-content\/uploads\/2020\/03\/img.jpg\",\"width\":350,\"height\":350,\"caption\":\"ijaz.me\"},\"logo\":{\"@id\":\"https:\/\/ijaz.me\/#\/schema\/person\/image\/\"},\"description\":\"Artificial Intelligence (AI) Researcher, Software Engineer, Programmer, and Entrepreneur. 
Exploring Natural Language Processing (NLP), Large Language Models (LLMs) and Computer Vision.\",\"sameAs\":[\"http:\/\/ijaz.me\",\"https:\/\/www.facebook.com\/ijaz.phd\",\"https:\/\/www.instagram.com\/ijaz.me\",\"https:\/\/www.linkedin.com\/in\/drijazulhaq\/\"],\"url\":\"https:\/\/ijaz.me\/index.php\/author\/ijaz-me\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"H. Ijazul, Q. Weidong, G. Jie, and T. Peng, \"Correction of Whitespace and Word Segmentation in Noisy Pashto Text using CRF,\" Speech Communication, vol. 153, p. 102970, 2023. (SCI) - Ijazul Haq","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/","og_locale":"en_US","og_type":"article","og_title":"H. Ijazul, Q. Weidong, G. Jie, and T. Peng, \"Correction of Whitespace and Word Segmentation in Noisy Pashto Text using CRF,\" Speech Communication, vol. 153, p. 102970, 2023. (SCI) - Ijazul Haq","og_description":"Abstract Word segmentation is the process of splitting up the text into words. In English and most European languages, word boundaries are identified by whitespace, while in Pashto, there is no explicit word delimiter. Pashto uses whitespace for word separation but not consistently, and it cannot be considered a reliable word-boundary identifier. 
This inconsistency makes [&hellip;]","og_url":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/","og_site_name":"Ijazul Haq","article_publisher":"https:\/\/www.facebook.com\/ijaz.phd","article_author":"https:\/\/www.facebook.com\/ijaz.phd","article_published_time":"2024-02-10T10:01:48+00:00","article_modified_time":"2024-12-03T09:22:12+00:00","og_image":[{"width":500,"height":101,"url":"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/ws-lex-based-annotation.drawio.png","type":"image\/png"}],"author":"ijaz.me","twitter_card":"summary_large_image","twitter_misc":{"Written by":"ijaz.me","Est. reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/#article","isPartOf":{"@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/"},"author":{"name":"ijaz.me","@id":"https:\/\/ijaz.me\/#\/schema\/person\/3bb27d2451420d1138b550fcb6de0c7f"},"headline":"H. Ijazul, Q. Weidong, G. Jie, and T. Peng, &#8220;Correction of Whitespace and Word Segmentation in Noisy Pashto Text using CRF,&#8221; Speech Communication, vol. 153, p. 102970, 2023. 
(SCI)","datePublished":"2024-02-10T10:01:48+00:00","dateModified":"2024-12-03T09:22:12+00:00","mainEntityOfPage":{"@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/"},"wordCount":231,"publisher":{"@id":"https:\/\/ijaz.me\/#\/schema\/person\/3bb27d2451420d1138b550fcb6de0c7f"},"image":{"@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/#primaryimage"},"thumbnailUrl":"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/ws-lex-based-annotation.drawio.png","keywords":["ijaz AI developer","ijaz AI researcher","Ijaz ul haq","ijazul haq NLP","ijazul haq nlp researcher","NLP","Pashto","Text Processing","Word Segmentation"],"articleSection":["papers"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/","url":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/","name":"H. Ijazul, Q. Weidong, G. Jie, and T. Peng, \"Correction of Whitespace and Word Segmentation in Noisy Pashto Text using CRF,\" Speech Communication, vol. 153, p. 102970, 2023. 
(SCI) - Ijazul Haq","isPartOf":{"@id":"https:\/\/ijaz.me\/#website"},"primaryImageOfPage":{"@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/#primaryimage"},"image":{"@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/#primaryimage"},"thumbnailUrl":"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/ws-lex-based-annotation.drawio.png","datePublished":"2024-02-10T10:01:48+00:00","dateModified":"2024-12-03T09:22:12+00:00","breadcrumb":{"@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/#primaryimage","url":"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/ws-lex-based-annotation.drawio.png","contentUrl":"https:\/\/ijaz.me\/wp-content\/uploads\/2024\/02\/ws-lex-based-annotation.drawio.png","width":500,"height":101},{"@type":"BreadcrumbList","@id":"https:\/\/ijaz.me\/index.php\/2024\/02\/10\/correction-of-whitespace-and-word-segmentation-in-noisy-pashto-text-using-crf\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/ijaz.me\/"},{"@type":"ListItem","position":2,"name":"H. Ijazul, Q. Weidong, G. Jie, and T. Peng, &#8220;Correction of Whitespace and Word Segmentation in Noisy Pashto Text using CRF,&#8221; Speech Communication, vol. 153, p. 102970, 2023. 
(SCI)"}]},{"@type":"WebSite","@id":"https:\/\/ijaz.me\/#website","url":"https:\/\/ijaz.me\/","name":"ijazul Haq","description":"AI Researcher","publisher":{"@id":"https:\/\/ijaz.me\/#\/schema\/person\/3bb27d2451420d1138b550fcb6de0c7f"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ijaz.me\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/ijaz.me\/#\/schema\/person\/3bb27d2451420d1138b550fcb6de0c7f","name":"ijaz.me","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ijaz.me\/#\/schema\/person\/image\/","url":"https:\/\/ijaz.me\/wp-content\/uploads\/2020\/03\/img.jpg","contentUrl":"https:\/\/ijaz.me\/wp-content\/uploads\/2020\/03\/img.jpg","width":350,"height":350,"caption":"ijaz.me"},"logo":{"@id":"https:\/\/ijaz.me\/#\/schema\/person\/image\/"},"description":"Artificial Intelligence (AI) Researcher, Software Engineer, Programmer, and Entrepreneur. 
Exploring Natural Language Processing (NLP), Large Language Models (LLMs) and Computer Vision.","sameAs":["http:\/\/ijaz.me","https:\/\/www.facebook.com\/ijaz.phd","https:\/\/www.instagram.com\/ijaz.me","https:\/\/www.linkedin.com\/in\/drijazulhaq\/"],"url":"https:\/\/ijaz.me\/index.php\/author\/ijaz-me\/"}]}},"_links":{"self":[{"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/posts\/267"}],"collection":[{"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/comments?post=267"}],"version-history":[{"count":10,"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/posts\/267\/revisions"}],"predecessor-version":[{"id":575,"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/posts\/267\/revisions\/575"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/media\/286"}],"wp:attachment":[{"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/media?parent=267"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/categories?post=267"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ijaz.me\/index.php\/wp-json\/wp\/v2\/tags?post=267"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}