Introduction
Formality transfer—the task of converting text between informal and formal registers while preserving its meaning—has gained considerable momentum within the field of natural language processing (NLP). Traditionally studied as a subdomain of style transfer, formality transfer is now recognized as an essential tool for enhancing the adaptability and user-responsiveness of language models across a variety of practical applications. Whether used to refine conversational agents, adjust content for different reading levels, or mediate linguistic variation in multilingual platforms, the ability to dynamically control text formality is increasingly central to user-centered NLP.
In recent years, large language models (LLMs) have emerged as the dominant paradigm in NLP, showing remarkable performance on a wide array of tasks, including machine translation (MT), text generation, summarization, and more. Given the functional overlap between formality transfer and MT—especially when transferring meaning between stylistic domains—the success of LLMs in translation tasks naturally extends to formality-sensitive transformations as well. However, this progress has largely been confined to high-resource languages like English and Chinese, where large annotated corpora and standardized benchmarks are readily available. In contrast, languages such as Arabic remain underexplored, both in terms of resources and model optimization.
This study aims to bridge that gap by offering a comprehensive evaluation of Arabic-based LLMs and their ability to perform formality transfer—specifically, the transformation of Arabic dialects (ADs) into Modern Standard Arabic (MSA). This task is not only linguistically significant but also socially and technologically relevant, as it addresses a major barrier in Arabic NLP: the wide chasm between everyday spoken language and the standardized form used in media, education, and official communication.
Arabic Formality Transfer as Translation
One of the conceptual breakthroughs in the study of formality transfer is its framing as a translation problem. Rather than seeing it merely as a matter of tone adjustment, researchers increasingly treat formality transfer as a mapping between two linguistic registers—each with its own syntax, vocabulary, and cultural norms. This perspective is particularly apt in Arabic, a language characterized by diglossia: the coexistence of formal (MSA) and informal (dialectal) varieties that differ significantly at multiple linguistic levels.
In Arabic-speaking communities, dialects are used in daily communication and vary by region, while MSA serves as the lingua franca for written and official discourse. The transformation from dialect to MSA, therefore, mirrors a translation-like task, akin to converting between two languages. This motivates the use of architectures and methods from neural machine translation (NMT), such as sequence-to-sequence (seq2seq) models, transformer architectures, and encoder-decoder frameworks.
Given the scarcity of dialect-to-MSA parallel corpora, many of the strategies developed for low-resource MT can be adapted here. This includes transfer learning, few-shot learning, and in-context learning—all of which become especially powerful when powered by LLMs pre-trained on large Arabic corpora.
Evaluating Arabic-Based LLMs for Dialect-to-MSA Translation
To address these limitations, our study undertakes a structured evaluation of several prominent LLMs in their capacity to perform Arabic formality transfer. We focus on both general-purpose and Arabic-centric models: Jais, AceGPT, ArabianGPT, and LLaMA. While LLaMA serves as a strong multilingual baseline primarily trained on English and other European languages, the remaining three models represent emerging efforts to build high-performance Arabic-first or Arabic-friendly LLMs.
Our evaluation employs four publicly available datasets:
- MADAR (Multidialectal Arabic Dialect Corpus): Contains parallel sentences across 25 Arabic dialects aligned with MSA.
- PADIC (Parallel Arabic Dialect Corpus): Focuses on Levantine and Egyptian dialects in contrast to MSA.
- MDC (Multi-Dialect Corpus): Offers a rich blend of social media, broadcast, and spontaneous speech across dialects.
- BIBLE (Arabic Bible Dialect Translation Corpus): Offers literary and standardized text aligned with various dialect renderings.
We assess performance under three learning conditions:
- Zero-shot: Testing the models with no prior task-specific tuning.
- Few-shot: Providing a limited number of in-context examples.
- In-context fine-tuning: Using small amounts of task-specific data at inference time to guide outputs.
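The three conditions above differ mainly in how the prompt presented to the model is constructed. The following sketch illustrates the idea; the instruction wording and example pairs are hypothetical, not taken from the study's actual prompts:

```python
def build_prompt(dialect_sentence, examples=None):
    """Build a zero- or few-shot prompt asking an LLM to rewrite an
    Arabic dialect sentence in Modern Standard Arabic (MSA).

    `examples` is an optional list of (dialect, msa) pairs; passing
    none of them yields a zero-shot prompt.
    """
    instruction = (
        "Rewrite the following Arabic dialect sentence "
        "in Modern Standard Arabic (MSA).\n"
    )
    shots = ""
    for src, tgt in (examples or []):
        shots += f"Dialect: {src}\nMSA: {tgt}\n\n"
    return f"{instruction}\n{shots}Dialect: {dialect_sentence}\nMSA:"

# Zero-shot: no demonstrations, only the instruction and the input.
zero_shot = build_prompt("وين رايح؟")

# Few-shot: the same prompt with a handful of worked examples prepended.
few_shot = build_prompt(
    "وين رايح؟",
    examples=[("شو بدك؟", "ماذا تريد؟")],
)
```

In-context fine-tuning extends the same pattern by drawing the demonstration pairs from a small task-specific dataset rather than fixing them by hand.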
Evaluation metrics include BLEU, COMET, ChrF1, and BERTScore, covering both surface-level and semantic fidelity of the transformations. These scores offer a balanced view of how well each model retains meaning, adjusts style, and preserves linguistic naturalness.
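To make the surface-level metrics concrete, here is a simplified sentence-level ChrF in plain Python. This is a sketch for illustration only, not the reference sacreBLEU implementation (which additionally handles corpus-level aggregation and, in chrF++, word n-grams):

```python
from collections import Counter


def char_ngrams(text, n):
    """Return a Counter of character n-grams, ignoring whitespace."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified ChrF: character n-gram F-beta score averaged
    over n-gram orders 1..max_n (recall weighted by beta)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue  # sentence shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because it matches character rather than word n-grams, ChrF is comparatively forgiving of the rich inflectional morphology of Arabic, where a single stem surfaces in many word forms; COMET and BERTScore complement it by scoring semantic similarity with learned representations.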
Key Findings
Across all scenarios, Jais and AceGPT emerged as top performers, consistently outperforming both ArabianGPT and LLaMA. Their advantage lies in domain-specific pretraining on Arabic data, which equips them to better handle morphological agreement, dialectal irregularities, and formality cues. For instance, Jais showed notable robustness in converting verb conjugations and word order patterns typical of Egyptian and Gulf dialects into MSA-standard equivalents.
In contrast, LLaMA, while impressive on English tasks, struggled with Arabic dialects—especially in zero-shot scenarios—due to its limited exposure to Arabic morphology and dialectal variation. ArabianGPT, though better aligned with Arabic, fell behind due to less sophisticated model tuning and smaller training corpora compared to Jais and AceGPT.
Few-shot and in-context fine-tuning scenarios significantly improved all models’ performance, affirming the value of minimal but targeted supervision in low-resource settings. Notably, even small in-context prompts helped models disambiguate polysemous words and adjust register-specific idioms.
Implications and Research Gaps
Our results point to several broader takeaways for the NLP community:
- Arabic-Centric Pretraining Matters: Models pre-trained on Arabic (especially dialectal and MSA variants) demonstrate significantly better understanding of intra-lingual variation and formal style norms.
- Need for Robust Benchmarks: The field urgently needs standardized datasets and shared tasks to unify evaluation protocols, similar to WMT for MT or GLUE for general language understanding.
- Sociolinguistic Dimensions: Collaborations with sociolinguists could help refine annotation criteria for what constitutes “formality” in different Arabic-speaking communities, which may differ substantially by region and context.
- Practical Applications Lag Behind: Despite technical progress, real-world implementations of Arabic formality transfer—such as adaptive educational tools, dialect-aware virtual assistants, and register-switching chatbots—are still rare.
- Ethical Considerations: As with any generative technology, automated formality transfer can introduce unintended shifts in tone, politeness, or even perceived authority. This is especially sensitive in legal, medical, or religious domains. Transparency, user control, and culturally aware design must guide future deployments.
Conclusion
Formality transfer in Arabic, particularly from dialects to MSA, represents a linguistically rich and technically demanding task. As this study illustrates, the emergence of Arabic-optimized LLMs like Jais and AceGPT signals a promising shift toward inclusive and culturally grounded language technologies. By treating formality transfer as a machine translation problem and leveraging recent advances in LLM architecture and training, we open the door to more adaptive, respectful, and intelligent language applications in the Arabic-speaking world.
Nonetheless, there is much work ahead. Building robust benchmarks, expanding parallel corpora, and fostering interdisciplinary research will be essential in unlocking the full potential of Arabic LLMs for formality-sensitive applications. As NLP becomes increasingly multilingual, ensuring that low-resource languages like Arabic are not left behind is not just a technical imperative—it is a cultural and ethical one.