Epistemic Rupture, Language Hegemony and its Disruptions within Machine Translation

CDC Research Colloquium

02. Jul

Seyi Olojo (UC Berkeley / Weizenbaum Institute)

At present, the majority of linguistic corpora within natural language processing (NLP) are in English. This over-representation of English contrasts starkly with the minimal representation of the languages of the global majority within language models. These languages, otherwise known as 'low-resource languages' (LRLs), lack the data needed for NLP systems to perform well. Within the context of machine translation (MT) systems, languages and, by extension, cultural identities get lost in translation. We examine the performance of machine translation for three Nigerian languages: Hausa, Ìgbò and Yorùbá. Drawing on 20 semi-structured interviews with Nigerian native speakers, we illustrate participants' perceptions of MT system usability within Nigerian contexts and the ideal use cases that they imagine. Participants also discuss the technical failures they observe, situating them within the complex linguistic attributes of their native tongues. We identify the prevalence of an 'Anglophone lens': ways of knowing that characterize the colonial and hegemonic power of the English language. We then discuss how the (socio)technical failures that participants observed reflect the often unseen epistemic violences that result from the operationalization of English-biased MT systems. In doing so, this paper highlights the difficulty of holistically representing the complex social, political and cultural contexts embedded within Nigerian languages. It thereby expands discourses of critique to the level of epistemology, which in turn invites a line of inquiry that questions the knowledge frameworks that inform how a 'good machine translation' practice is defined in relation to low-resourced indigenous languages.

  • 02.07. / 12-2pm
  • C40.320


  • Randi Heinrichs