Felermino Ali

Projects and Scholarships

Present	MOZNLP Community committed to promoting Mozambican languages digitally.
Sep 23	Google exploreCSR Award to aid higher education efforts to support students from historically marginalized groups to pursue graduate studies and research careers in computing
Oct 22 - Sep 24	Lacuna Project: Dataset Creation for Emakhuwa Language Awarded grant to create or expand machine learning datasets for low-resourced African languages
Sep 21 - Sep 24	PhD Graduate Research Scholarship Shortlisted as 1st place at "Consórcio de escolas de engenharias" of Science and Technology Foundation-PhD scholarships.
Mar 14 - Sep 16	MSc Graduate Research Scholarship SEGi University and Lurio University Scholarship

Publications

(Selected NLP-related publications)

Felermino Ali, et al. (2024). Expanding FLORES+ Benchmark for more Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation. WMT 2024 - Open Language Data Initiative shared tasks, Miami, Florida, US

As part of the Open Language Data Initiative shared tasks, we have expanded the FLORES+ evaluation set to include Emakhuwa, a low-resource language widely spoken in Mozambique. We translated the dev and devtest sets from Portuguese into Emakhuwa, and we detail the translation process and quality assurance measures used. Our methodology involved various quality checks, including post-editing and adequacy assessments. The resulting datasets consist of multiple reference sentences for each source. We present baseline results from training a Neural Machine Translation system and fine-tuning existing multilingual translation models. Our findings suggest that spelling inconsistencies remain a challenge in Emakhuwa. Additionally, the baseline models underperformed on this evaluation set, underscoring the necessity for further research to enhance machine translation quality for Emakhuwa. The data is publicly available at this link.

Felermino Ali, et al. (2024). Building Resources for Emakhuwa: Machine Translation and News Classification Benchmarks. The 2024 Conference on Empirical Methods in Natural Language Processing, Miami, US

Accepted to Main Conference

Felermino Ali, et al. (2024). Detecting Loanwords in Emakhuwa: An Extremely Low-Resource Bantu Language Exhibiting Significant Borrowing From Portuguese. LREC-COLING, Torino, Italy

The accurate identification of loanwords within a given text holds significant potential as a tool for data augmentation and for mitigating data sparsity. Such identification can improve the performance of various natural language processing tasks, particularly for low-resource languages that lack standardized spelling conventions. This research proposes a supervised method to identify loanwords in Emakhuwa borrowed from Portuguese. Our methodology is two-fold. First, we employ traditional machine learning algorithms with handcrafted features, including language-specific and similarity-based features. We build upon prior studies to extract similarity features and propose utilizing two external resources: a Sequence-to-Sequence model and a dictionary. This approach allows us to identify loanwords solely by analyzing the target word, without prior knowledge of its donor counterpart. Second, we fine-tune the pre-trained CANINE model for the downstream task of loanword detection, which achieves an F1-score of 93%. To the best of our knowledge, this study is the first of its kind focusing on Emakhuwa, and the preliminary results are promising, as they pave the way to further advancements. We make our loanword dataset and source code publicly available to foster further research.

Felermino Ali, et al. (2024). Network-based Approach for Stopwords Detection. First Workshop on NLP for Indigenous Languages of Lusophone Countries, Santiago de Compostela, Galicia

Stopword lists, an essential resource for natural language processing and information retrieval, are often unavailable for low-resource languages. Creating these lists is time-consuming and expensive, making automated stopword detection a viable alternative. This paper introduces a novel stopword detection approach that exploits the topological properties of co-occurrence networks to identify function words. By leveraging the connectivity patterns of function words in these networks, the proposed approach aims to achieve higher precision compared to traditional frequency-based methods. To assess the effectiveness of the network-based approach, we constructed co-occurrence networks for Tetun and Emakhuwa (low-resourced languages), as well as English and Portuguese. We then compared the performance of this approach with traditional frequency-based methods. The results indicate that the network-based approach consistently outperforms traditional methods, with in-degree emerging as the most reliable indicator of function words. This finding suggests promising prospects for automatically generating stopword lists in other low-resource languages, paving the way for developing natural language processing tools for these linguistic contexts.
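The in-degree idea described in the abstract above can be illustrated with a minimal sketch. It is not the paper's implementation: the network construction (a directed edge from each word to the word that immediately follows it), the whitespace tokenization, the function names, and the toy English corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def indegree_ranking(sentences):
    """Rank words by in-degree in a directed adjacency co-occurrence network.

    Assumption: an edge prev -> curr exists when curr immediately follows
    prev in some sentence; in-degree counts distinct predecessors.
    """
    predecessors = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, curr in zip(tokens, tokens[1:]):
            if prev != curr:  # ignore self-loops
                predecessors[curr].add(prev)
    indegree = {word: len(preds) for word, preds in predecessors.items()}
    # Highest in-degree first; ties broken alphabetically for determinism.
    return sorted(indegree, key=lambda w: (-indegree[w], w))


def frequency_ranking(sentences):
    """Baseline: rank words by raw token frequency."""
    counts = Counter(t for s in sentences for t in s.lower().split())
    return sorted(counts, key=lambda w: (-counts[w], w))


# Toy corpus, invented for this sketch only.
toy_corpus = [
    "the cat sat on the mat",
    "a dog saw the cat",
    "birds eat the seeds",
    "she read the book on the train",
]

top_by_indegree = indegree_ranking(toy_corpus)[:2]
top_by_frequency = frequency_ranking(toy_corpus)[:3]
print(top_by_indegree)
print(top_by_frequency)
```

On this toy corpus the two highest in-degree words are both function words, while the frequency baseline places a content word second. A corpus this small only shows the mechanics and says nothing about performance on real data.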

Shamsuddeen Hassan Muhammad, Felermino Ali, et al. (2023). AfriSenti: A Benchmark Twitter Sentiment Analysis Dataset for African Languages. 4th Workshop on African Natural Language Processing.

Africa, which is home to over 2000 languages from more than six language families, has the highest linguistic diversity among all continents. This includes 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial in enabling such research is the presence of labeled datasets by native speakers. In this paper, we introduce 14 sentiment labeled Twitter datasets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yoruba) from four language families (Afro-Asiatic, English Creole, Indo European, and Niger-Congo). We describe the data collection methodology, annotation process, and related challenges when curating each of the datasets. We also build different sentiment classification baseline models on the datasets and discuss their usefulness.

Felermino DMA Ali, Andrew Caines, Jaimito LA Malavi (2021). Towards a parallel corpus of Portuguese and the Bantu language Emakhuwa of Mozambique. AfricaNLP Workshop 2021. arXiv preprint arXiv:2104.05753

Major advancement in the performance of machine translation models has been made possible in part thanks to the availability of large-scale parallel corpora. But for most languages in the world, the existence of such corpora is rare. Emakhuwa, a language spoken in Mozambique, is like most African languages low-resource in NLP terms. It lacks both computational and linguistic resources and, to the best of our knowledge, few parallel corpora including Emakhuwa already exist. In this paper we describe the creation of the Emakhuwa-Portuguese parallel corpus, which is a collection of texts from the Jehovah's Witness website and a variety of other sources including the African Story Book website, the Universal Declaration of Human Rights and Mozambican legal documents. The dataset contains 47,415 sentence pairs, amounting to 699,976 word tokens of Emakhuwa and 877,595 word tokens in Portuguese. After normalization processes which remain to be completed, the corpus will be made freely available for research use.

Open-Source Libraries

Felermino DMA Ali (2023). EmakuwaLoans: a package for preprocessing Emakhuwa. https://github.com/felerminoali/emakhuwa-nlp/tree/master/Preprocessing/loan_detetion

This package provides a mechanism to automatically detect words in Emakhuwa borrowed from Portuguese.

Felermino DMA Ali (2023). MusaLing: Online Dictionary and spell checking for Mozambican languages. https://www.musaling.com/pt/

The first spell-checking tool and online dictionary for the Emakhuwa language.

Participation in Conferences

Felermino DMA Ali, Henrique L. C., Rui Sousa-Silva. (2023). Data Augmentation for Low-Resource Neural Machine Translation Using Loanword Spelling Inconsistencies. LITHME WG1-WG7 Joint Workshop: Bridging the gap between technology and professionals, October 28th - 30th, 2023, Budapest, Hungary.

Felermino DMA Ali, Henrique L. C., Rui Sousa-Silva. (2022). Loanword detection in Emakhuwa language. "Language in the Human-Machine Era" (LITHME), October 15th - 16th, 2022, University of Groningen, Campus Fryslân, Leeuwarden, Netherlands.

AfricAI Conference + Lacuna Fund Grantee Convening (2022), June 12-14, 2022, Kigali, Rwanda.

Felermino D. M. A. Ali, Henrique L. C., Rui Sousa-Silva. (2022). Machine Translation for Emakhuwa of Mozambique. Doctoral Symposium, EPIA 2022, 21st EPIA Conference on Artificial Intelligence, Lisbon, Portugal. (Book)

Emakhuwa is a Mozambican language that remains in the low-resource category despite being widely spoken in Mozambique (over 6 million speakers). To the best of our knowledge, no Machine Translation tools exist for Emakhuwa. However, in recent years, there has been a huge collaborative effort from African Natural Language research communities to develop Neural Machine Translation techniques adequate to African languages. This led to the development of corpora, text representation techniques, pre-trained models, and Neural Network Architectures, all of which are benchmarks for improving current Machine Translation of low-resourced languages, and in particular the African language family. Therefore, this study aims to investigate how these developments can assist Machine Translation of Emakhuwa, and to propose a suitable approach and resources for developing such a system.

Felermino, D. M. A. Ali, Andrew Caines, Jaimito L. A. Malavi (2021). Towards a parallel corpus of Portuguese and the Bantu language Emakhuwa of Mozambique. AfricaNLP Workshop: Strengthening African NLP, EACL. arXiv:2104.05753

Invited Talks and Tutorials

Aug 2023	Seminar on challenges of Artificial Intelligence in Multilingual Context (Center of African Studies of University Eduardo Mondlane) Invited speaker Felermino, D. M. A. Ali, Challenges and Opportunities for Mozambican Languages in the age of Large Language Models | Poster
Jun 2023	Workshop on Efficiency, Consistency, Collaborativeness and Quality in Translations (Radio of Mozambique) Invited 2-hour Tutorial Felermino, D. M. A. Ali, Using Computer Aided Translation Tools (CAT) on Mozambican languages | Video
26 Apr 2022	1st Seminar on Security and resiliency of communications in Mozambique Invited Talk Felermino Ali, Saide M. Saide - Smishing SMS attack detection for Mobile Money Transfer users | Video
27 - 29 Oct 2022	Workshop on Translation Technology (Center of Linguistics of Porto University 2023) Invited 3-hour Tutorial Felermino, D. M. A. Ali
05 Oct - 06 Oct 2021	XII Jornadas de Rádiodifusão (XII Radio Broadcasting Conferences) Invited Talk Felermino, D. M. A. Ali, Computational Linguistics and challenges for Mozambican languages
26 Jul - 28 Jul 2021	IndabaX Mozambique (2021) Invited Talk Felermino, D. M. A. Ali, Mozambican languages in the Natural Language Processing panorama | Video

Teaching

Teaching Assistant

2012 - 2021	Department of Computer Engineering, Faculty of Engineering, Lurio University Subjects: Object Oriented Programming, Data Structure, Database Lab, Software Engineering Lab
2013 - 2019	Faculty of Tourism Management and IT Subjects: Artificial Intelligence, Multimedia, Programming Foundation

Trainings

15 - 19 July 2024	11th International School on Deep Learning (Maia, Portugal) Summer School: 5-day event (15-19 July, 2024) covering multiple topics including: computer vision, neurosciences, speech recognition, language processing, human-computer interaction, drug discovery, health informatics, medical image analysis, recommender systems, advertising, fraud detection, robotics, games, finance, biotechnology, physics experiments, biometrics, communications, climate sciences, geographic information systems, signal processing, genomics, materials design, video technology, social systems, etc. | Certificate
5 - 7 Feb 2024	MEDCIDS - Winter School 2024 (Porto, Portugal) Winter School: 3-day event (5-7 February, 2024) Reliability and agreement studies | Certificate
24 - 29 Jul 2022	The 12th Lisbon Machine Learning (Lisbon, Portugal) Summer School: 6-day event (24-29 July, 2022) covering a range of machine learning topics, from theory to practice, that are important in solving natural language processing problems arising in different application areas | Certificate
2 June 22	NVIDIA DLI Certificate Fundamentals of Deep Learning | Certificate
3 Mar 22	Natural Language Processing Process and Generate Text (Porto, Portugal) Natural Language Processing Process and Generate Text | Certificate
July 2021	ICMC 50: Python for Natural Language Processing - Institute of Mathematics Sciences and Computation, Brazil Python for Natural Language Processing | Certificate
17 - 21 Aug 2020	Business Data Science Summer School (2020) - (Amsterdam, Netherlands) Summer School: Covering deep learning | Certificate

Education and Professional Preparation

PhD

Master of Science in Computer Science

Bachelor in Computer Engineering
Faculty of Engineering
Lurio University
Pemba, Mozambique