Synthetic Data and the Problematic of Representation

10. Jun

Talk by Benjamin Jacobsen (University of York) 10.06. / 2 – 4 pm / Online (Zoom) / CDC Colloquium

What happens when there is not enough data to train machine learning algorithms? And what happens when the data used to train algorithmic models is not sufficiently representative of certain data attributes, such as gender or race? Algorithms and generative AI models have not only become increasingly interwoven in contemporary society, they have also been noted for their capacity to reinforce stereotypical and culturally entrenched representations in biased outputs based on characteristics such as race, class, and gender. Synthetic data have emerged, in part, as a response to this problem of representation in AI training datasets. Synthetic data embody an explicit claim to actively generate diverse data points, such as images or text data representing racialised minority populations in a healthcare dataset. This has significant and disruptive ethical implications, because synthetic data intervene in our understanding of long-standing issues such as bias, fairness, and algorithmic injustice. In this talk, drawing on the work of Jacques Derrida and Ramon Amaro, I will explore this issue of representation in synthetic data through two main ideas: imbalance and absence. In other words, I will refer to cases where the data distribution on which an algorithm is trained is seen as skewed or imbalanced and where certain classes of data are wholly absent from the data distribution. Referring to company documents as well as semi-structured interviews with AI researchers and computer scientists, I will show the tensions that emerge when synthetic data are used to address these two issues of imbalance and absence and what this says about the current landscape of AI and ethics.

Benjamin N. Jacobsen is a Lecturer in Sociology at the University of York as well as a Visiting Fellow on Professor Louise Amoore’s ‘Algorithmic Societies’ project at Durham University. His research broadly examines the ethicopolitical effects of data and machine learning algorithms on society and culture. He has published extensively on the intersection of algorithms and everyday memory practices, and his book Social Media and the Automatic Production of Memory (co-authored with Prof David Beer) was published in 2021 by Bristol University Press. Benjamin is currently examining the political implications of generative modelling and synthetic data for society, and this work has been published in journals such as Big Data & Society and Theory, Culture & Society.