About Me

Felermino Ali

PhD Candidate
Laboratory of Artificial Intelligence and Computer Science
Porto University, Portugal

About me

I am a PhD student at the Faculty of Engineering of Porto University, Portugal. My research focuses on Neural Machine Translation for low-resource languages, particularly Mozambican languages. I am currently working on Emakhuwa, the most widely spoken language in Mozambique. I am passionate about AI and am involved in initiatives that advocate for dataset creation and capacity building in marginalized communities, so that they can keep up with new technological advances and contribute to the development of more inclusive tools for the good of society at large.

Research Interests

  • Machine Translation
  • Natural Language Processing
  • Low-resource Languages
  • Speech Processing
  • Deep Learning
  • Computational Linguistics

Resume

Education and Professional Preparation

  • Sep 2021 - Present

    PhD

    Laboratory of Artificial Intelligence and Computer Science
    Faculty of Engineering, Porto University
    Porto, Portugal
    Project: Machine Translation for Emakhuwa, an Extremely Low-resource Bantu Language
    Advisor: Prof. Henrique Lopes Cardoso
    Co-Advisor: Prof. Rui Sousa Silva
  • Fall 2017

    Master of Science in Computer Science

    SEGi University
    Kuala Lumpur, Malaysia
    Master's thesis: A Web Service Framework for Mining Educational Data
    Advisor: Prof. Ashley Ng Sok Choo
  • Oct 2012

    Bachelor in Computer Engineering
    Faculty of Engineering
    Lurio University
    Pemba, Mozambique

Projects and Scholarships

Present MOZNLP
Community committed to promoting Mozambican languages digitally.
Sep 23 Google exploreCSR
Award to aid higher education efforts to support students from historically marginalized groups to pursue graduate studies and research careers in computing
Oct 22 - Sep 24 Lacuna Project: Dataset Creation for Emakhuwa Language
Awarded grant to create or expand machine learning datasets for low-resourced African languages
Sep 21 - Sep 24 PhD Graduate Research Scholarship
Shortlisted as 1st place at "Consórcio de escolas de engenharias" of Science and Technology Foundation-PhD scholarships.
Mar 14 - Sep 16 MSc Graduate Research Scholarship
SEGi University and Lurio University Scholarship

Publications

(Selected NLP-related publications)
As part of the Open Language Data Initiative shared tasks, we have expanded the FLORES+ evaluation set to include Emakhuwa, a low-resource language widely spoken in Mozambique. We translated the dev and devtest sets from Portuguese into Emakhuwa, and we detail the translation process and quality assurance measures used. Our methodology involved various quality checks, including post-editing and adequacy assessments. The resulting datasets consist of multiple reference sentences for each source. We present baseline results from training a Neural Machine Translation system and fine-tuning existing multilingual translation models. Our findings suggest that spelling inconsistencies remain a challenge in Emakhuwa. Additionally, the baseline models underperformed on this evaluation set, underscoring the necessity for further research to enhance machine translation quality for Emakhuwa. The data is publicly available at this link.
Accepted to Main Conference
The accurate identification of loanwords within a given text holds significant potential as a valuable tool for addressing data augmentation and mitigating data sparsity issues. Such identification can improve the performance of various natural language processing tasks, particularly in the context of low-resource languages that lack standardized spelling conventions. This research proposes a supervised method to identify loanwords in Emakhuwa borrowed from Portuguese. Our methodology encompasses a two-fold approach. Firstly, we employ traditional machine learning algorithms incorporating handcrafted features, including language-specific and similarity-based features. We build upon prior studies to extract similarity features and propose utilizing two external resources: a Sequence-to-Sequence model and a dictionary. This innovative approach allows us to identify loanwords solely by analyzing the target word without prior knowledge about its donor counterpart. Furthermore, we fine-tune the pre-trained CANINE model for the downstream task of loanword detection, which achieves an F1-score of 93%. To the best of our knowledge, this study is the first of its kind focusing on Emakhuwa, and the preliminary results are promising, as they pave the way to further advancements. We make our loanword dataset and source code publicly available to foster further research.
Stopword lists, an essential resource for natural language processing and information retrieval, are often unavailable for low-resource languages. Creating these lists is time-consuming and expensive, making automated stopword detection a viable alternative. This paper introduces a novel stopword detection approach that exploits the topological properties of co-occurrence networks to identify function words. By leveraging the connectivity patterns of function words in these networks, the proposed approach aims to achieve higher precision compared to traditional frequency-based methods. To assess the effectiveness of the network-based approach, we constructed co-occurrence networks for Tetun and Emakhuwa (low-resourced languages), as well as English and Portuguese. We then compared the performance of this approach with traditional frequency-based methods. The results indicate that the network-based approach consistently outperforms traditional methods, with in-degree emerging as the most reliable indicator of function words. This finding suggests promising prospects for automatically generating stopword lists in other low-resource languages, paving the way for developing natural language processing tools for these linguistic contexts.
Africa, which is home to over 2000 languages from more than six language families, has the highest linguistic diversity among all continents. This includes 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial in enabling such research is the presence of labeled datasets by native speakers. In this paper, we introduce 14 sentiment labeled Twitter datasets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yoruba) from four language families (Afro-Asiatic, English Creole, Indo European, and Niger-Congo). We describe the data collection methodology, annotation process, and related challenges when curating each of the datasets. We also build different sentiment classification baseline models on the datasets and discuss their usefulness.
Major advancement in the performance of machine translation models has been made possible in part thanks to the availability of large-scale parallel corpora. But for most languages in the world, the existence of such corpora is rare. Emakhuwa, a language spoken in Mozambique, is like most African languages low-resource in NLP terms. It lacks both computational and linguistic resources and, to the best of our knowledge, few parallel corpora including Emakhuwa already exist. In this paper we describe the creation of the Emakhuwa-Portuguese parallel corpus, which is a collection of texts from the Jehovah's Witness website and a variety of other sources including the African Story Book website, the Universal Declaration of Human Rights and Mozambican legal documents. The dataset contains 47,415 sentence pairs, amounting to 699,976 word tokens of Emakhuwa and 877,595 word tokens in Portuguese. After normalization processes which remain to be completed, the corpus will be made freely available for research use.

Open-Source Libraries

This package provides a mechanism to automatically detect words in Emakhuwa borrowed from Portuguese.
The first spell-checking tools for the Emakhuwa language and an online dictionary.

Participation in Conferences

Emakhuwa is a Mozambican language in the low-resource category despite being widely spoken in Mozambique (over 6 million speakers). To the best of our knowledge, no Machine Translation tools exist for Emakhuwa. However, in recent years, there has been a huge collaborative effort from African Natural Language research communities to develop Neural Machine Translation techniques suited to African languages. This led to the development of corpora, text representation techniques, pre-trained models, and Neural Network Architectures, all of which are benchmarks for improving current Machine Translation of low-resourced languages, and in particular the African language family. Therefore, this study aims to investigate how these developments can assist Machine Translation of Emakhuwa, and to propose a suitable approach and resources for developing such a system.
Major advancement in the performance of machine translation models has been made possible in part thanks to the availability of large-scale parallel corpora. But for most languages in the world, the existence of such corpora is rare. Emakhuwa, a language spoken in Mozambique, is like most African languages low-resource in NLP terms. It lacks both computational and linguistic resources and, to the best of our knowledge, few parallel corpora including Emakhuwa already exist. In this paper we describe the creation of the Emakhuwa-Portuguese parallel corpus, which is a collection of texts from the Jehovah's Witness website and a variety of other sources including the African Story Book website, the Universal Declaration of Human Rights and Mozambican legal documents. The dataset contains 47,415 sentence pairs, amounting to 699,976 word tokens of Emakhuwa and 877,595 word tokens in Portuguese. After normalization processes which remain to be completed, the corpus will be made freely available for research use.

Invited Talks and Tutorials

Aug 2023 Seminar on challenges of Artificial Intelligence in Multilingual Context (Center of African Studies of University Eduardo Mondlane)
Invited speaker
Felermino, D. M. A. Ali, Challenges and Opportunities for Mozambican Languages in the age of Large Language Models | Poster
Jun 2023 Workshop on Efficiency, Consistency, Collaborativeness and Quality in Translations (Radio of Mozambique)
Invited 2-hour Tutorial
Felermino, D. M. A. Ali, Using Computer Aided Translation Tools (CAT) on Mozambican languages | Video
26 Apr 2022 1st Seminar on Security and Resiliency of Communications in Mozambique
Invited Talk
Felermino Ali, Saide M. Saide, Smishing SMS attack detection for Mobile Money Transfer users | Video
27 Oct - 29 Oct 2022 Workshop on Translation Technology (Center of Linguistics of Porto University 2023)
Invited 3-hour Tutorial
Felermino, D. M. A. Ali
05 Oct - 06 Oct 2021 XII Jornadas de Rádiodifusão (XII Radio Broadcasting Conferences)
Invited Talk
Felermino, D. M. A. Ali, Computational Linguistics and challenges for Mozambican languages
26 Jul - 28 Jul 2021 IndabaX Mozambique (2021)
Invited Talk
Felermino, D. M. A. Ali, Mozambican languages in the Natural Language Processing panorama | Video

Teaching

Teaching Assistant

2012 - 2021 Department of Computer Engineering, Faculty of Engineering, Lurio University
Subjects: Object Oriented Programming, Data Structure, Database Lab, Software Engineering Lab
2013 - 2019 Faculty of Tourism Management and IT
Subjects: Artificial Intelligence, Multimedia, Programming Foundation

Trainings

15 - 19 July 2024 11th International School on Deep Learning (Maia, Portugal)
Summer School: 5-day event (15-19 July, 2024) covering multiple topics, including computer vision, neurosciences, speech recognition, language processing, human-computer interaction, drug discovery, health informatics, medical image analysis, recommender systems, advertising, fraud detection, robotics, games, finance, biotechnology, physics experiments, biometrics, communications, climate sciences, geographic information systems, signal processing, genomics, materials design, video technology, social systems, etc. | Certificate
5 - 7 Feb 2024 MEDCIDS - Winter School 2024 (Porto, Portugal)
Winter School: 3-day event (5-7 February, 2024) Reliability and agreement studies | Certificate
24 - 29 Jul 2022 The 12th Lisbon Machine Learning (Lisbon, Portugal)
Summer School: 6-day event (24-29 July, 2022) covering a range of machine learning topics, from theory to practice, that are important in solving natural language processing problems arising in different application areas | Certificate
2 June 22 NVIDIA DLI Certificate
Fundamentals of Deep Learning | Certificate
3 Mar 22 Natural Language Processing Process and Generate Text (Porto, Portugal)
Natural Language Processing Process and Generate Text | Certificate
July 2021 ICMC 50: Python for Natural Language Processing - Institute of Mathematics Sciences and Computation, Brazil
Python for Natural Language Processing | Certificate
17 - 21 Aug 2020 Business Data Science Summer School (2020) - (Amsterdam, Netherlands)
Summer School: Covering deep learning | Certificate