
Moving back to the future of big data-driven research: reflecting on the social in genomics
M. Goisauf, K. Akyüz, et al.
This article examines how big data-driven research in genomics affects social research and theory. The authors, Melanie Goisauf, Kaya Akyüz, and Gillian M. Martin, critically reflect on how societal relations and categorizations shape genomic data, using a genome-wide association study (GWAS) on same-sex sexual behaviour as their central example to explore the link between data, theory, and societal context.
Introduction
The paper interrogates how big data–driven genomics incorporates and reproduces social categories and relations in knowledge production. Against the backdrop of postgenomic understandings of gene–environment interplay and the rise of genome editing, the authors ask how associations between biology and social phenomena (e.g., sexual orientation, educational attainment) are produced and interpreted. They argue that the social is embedded at every stage of genomics research, and that the growth of data-intensive methods risks marginalizing social theory and qualitative methodologies. Using a recent GWAS on same-sex sexual behaviour as a focal case, the paper explores how methodological choices, data infrastructures, and classificatory practices shape findings and their societal implications.
Literature Review
The paper synthesizes classic and contemporary sociological and STS scholarship on categorization and knowledge production. It draws on Bourdieu’s theory of social structure, habitus, and power relations to show how gender and class classifications are embodied and reproduced. Foucault’s concepts of the medical gaze and historical orders of knowledge illustrate how bodies are classified and governed, including historical medicalization of homosexuality. Feminist and gender theories (e.g., Butler, West & Zimmerman, Kessler & McKenna, Lorber, Connell) highlight the social construction and performativity of sex/gender and sexuality, challenging binary and heteronormative assumptions. STS work on classification systems (Bowker & Star) and the ICD shows classifications’ ethical and social consequences. Postgenomic perspectives (Fox Keller; Landecker & Panofsky; Meloni) emphasize dynamic, environmentally responsive genomes and the social as a biological signal. Novas & Rose theorize genetic risk and the emergence of ‘genetic responsibilities’ and ‘genetically at risk’ categories. The review also engages literature on geneticization of sexuality and its media dynamics (Conrad; Nelkin & Lindee; O’Riordan; Dar-Nimrod & Heine; Boysen & Vogel; Mitchell & Dezarn), debates on inclusion and race in genomics (Reardon; Bliss; Prainsack), and datafication/big data epistemologies (Savage & Burrows; Kitchin; Leonelli; Ruckenstein & Schüll; Mayer-Schönberger & Cukier).
Methodology
This is a conceptual, critical analysis grounded in Science and Technology Studies and sociology of knowledge. The authors employ a case-based approach, using the 2019 large-scale GWAS on same-sex sexual behaviour (Ganna et al. 2019b) as an illustrative example. They analyze the research process and outputs from three angles: (1) how societal relations and categorizations are inscribed into genomics research; (2) how big data–driven research shifts away from theory and methodological concept-formation; and (3) how claims of being ‘free from theory’ mask consequential choices constrained by available data infrastructures. The analysis triangulates published study materials (the Science paper, preregistration, communications), data sources (e.g., UK Biobank, 23andMe), survey items, and public/ethical debates, situating them within broader theoretical literatures. No new empirical data are collected; instead, the paper critically interrogates methodological decisions (e.g., variable operationalization, sampling frames, genomic measures), data limitations, and the social implications of knowledge production.
Key Findings
- The GWAS case shows that genomics research necessarily relies on socially constructed categories (e.g., binary sex, heteronormative assumptions), which shape operationalization and findings.
- The study by Ganna et al. (2019b) analyzed ~500,000 individuals of European ancestry and identified five genetic variants associated with self-reported same-sex sexual behaviour: two found only in males, one only in females, and two in both sexes. Together these variants explain <1% of variance, while SNP-based heritability was estimated at 8–25%. The GWAS authors state that the results cannot be used to predict sexual orientation.
- Data sourcing from biobanks entails using existing survey items. UK Biobank questions focus on sexual behaviour (e.g., “Have you ever had sexual intercourse with someone of the same sex?”) rather than desire or identity, limiting construct validity for ‘sexual orientation’.
- Sampling and cohort effects matter: in UK Biobank, reported same-sex sexual behaviour varies substantially by birth cohort, increasing several-fold from 1940 to 1970 cohorts; 23andMe’s subsample showed 18.9% reporting same-sex behaviour, potentially due to self-selection. These context-dependent patterns were not deeply problematized in the genomic analysis.
- Claims of theory-free, purely data-driven science are untenable. Choices are made at multiple levels: (a) privileging SNP genotyping data over other genomic/epigenomic variation due to infrastructure availability; (b) reducing complex social phenomena to limited survey proxies; (c) de-emphasizing original evolutionary hypotheses used in data access proposals when results do not fit.
- Big data correlations can obscure causality and social complexity; the ‘triumph of correlations’ risks re-inscribing power-laden categories, such as traditional epidemiological categories, into genomes and marginalizing qualitative, contextual knowledge.
- Broader parallels in social genomics show escalating numbers of associated variants with larger samples (e.g., educational attainment GWAS: 74 SNPs in n≈293k vs. 1,271 SNPs in n≈1.1M), illustrating polygenicity and small effect sizes, complicating deterministic narratives.
- Ethical and governance challenges persist: broad consent scope (e.g., UK Biobank’s ‘health-related research’) may not anticipate sensitive topics like sexuality; downstream uses (e.g., “How gay are you?” app) demonstrate rapid translation into individualizing, potentially discriminatory tools despite researchers’ caution and engagement efforts.
- Genome editing debates (e.g., He Jiankui; proposed edits for congenital deafness) underscore how genomic knowledge can intersect with normative ‘corrections,’ raising questions about who defines ‘editable’ conditions and on what social bases.
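As a rough illustration of the educational attainment figures cited in the findings above, the following Python sketch compares how the number of associated SNPs grew relative to sample size between the two studies. The sample sizes are the approximate values given in this summary (293k and 1.1M), and the function and variable names are hypothetical. This is simple arithmetic on the reported counts, not a reproduction of either analysis.

```python
def growth_ratio(earlier: float, later: float) -> float:
    """Return how many times larger `later` is than `earlier`."""
    if earlier <= 0:
        raise ValueError("earlier value must be positive")
    return later / earlier


# Approximate figures as reported in this summary (assumed, rounded values).
SAMPLE_SMALL, SAMPLE_LARGE = 293_000, 1_100_000
SNPS_SMALL, SNPS_LARGE = 74, 1_271

sample_growth = growth_ratio(SAMPLE_SMALL, SAMPLE_LARGE)
snp_growth = growth_ratio(SNPS_SMALL, SNPS_LARGE)

print(f"Sample size grew about {sample_growth:.1f}x")
print(f"Associated SNPs grew about {snp_growth:.1f}x")
```

On these rounded figures, the count of associated SNPs grows considerably faster than the sample size, which fits the picture of many variants with very small individual effects that only become detectable in larger samples. The ratio itself says nothing about causality or predictive power.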
Discussion
The analysis demonstrates that genomics research on socially complex phenomena is co-produced with social categories, infrastructures, and interpretive frames. Addressing the research question—how social categorizations are incorporated into genomic knowledge production—the paper shows that the GWAS operationalization of ‘same-sex sexual behaviour’ embeds binary sex and heteronormative assumptions, and that cohort effects and social context influence self-reports. The findings challenge the notion of ‘end of theory’: data are never neutral; they reflect prior theoretical, methodological, and infrastructural choices. The implications are significant: re-inscribing social categories into biological data can both mirror and reshape social dynamics, potentially reinforcing inequalities and fueling geneticization in public imaginaries. The authors argue for re-centering social theory and methodological reflexivity in data-intensive genomics to avoid oversimplification, misinterpretation, and harmful applications (e.g., stigmatizing apps, justificatory uses in policy or editing). Interdisciplinary collaboration and ethical engagement beyond procedural compliance are needed to contextualize correlations, interrogate measures, and recognize the limits of prediction for complex social behaviours.
Conclusion
The paper contributes a critical, theoretically informed reflection on big data–driven genomics, showing that claims of theory-free analysis are untenable and that social categories are integral to knowledge production and translation. Using the GWAS on same-sex sexual behaviour as a lens, it cautions against the marginalization of social science and the oversimplification of complex social phenomena through statistical correlations. The authors propose orienting future work around three interrelated dimensions: (1) recognizing the contingency and societal impact of research choices; (2) embracing ethical responsibilities that extend beyond procedural review to substantive understanding of social contexts; and (3) fostering interdisciplinarity to integrate multiple lenses (ELSI, STS, sociology, genetics, social genomics). They call for a renewed, multi-perspectival framework that does justice to the intertwined social and biological processes, guiding more responsible research design, analysis, communication, and governance.
Limitations
The analysis centers on a single illustrative case (the 2019 GWAS on same-sex sexual behaviour), which limits generalizability across socio-genomic research. The paper is conceptual and does not present new empirical data; it relies on secondary sources, published materials, and theoretical synthesis. The authors note that while their discussion flags the urgent need for a new framework, it does not itself provide a comprehensive, universally applicable one.