How ethics combine with big data: a bibliometric analysis
M. Kuc-Czarnecka and M. Olczyk
Marta Kuc-Czarnecka and Magdalena Olczyk conducted a bibliometric analysis revealing that the exploration of ethical concerns in Big Data is surprisingly sparse in scientific literature, showing slow growth with a primary focus on health and technology.
Introduction
Big Data has rapidly emerged as a major research and practice domain due to the volume, velocity, variety, and veracity of data, yet ethical implications have been less frequently examined in tandem with technical advances. The paper asks how ethical concerns about Big Data have developed in the scholarly literature and what domains they concentrate on. The authors focus on Big Data within the broader landscape of data science (data analytics, machine learning, and Big Data) and pose the central question: what ethical issues arise in the use and analysis of Big Data, and how are these reflected in publication patterns? Motivated by concerns such as bias, opacity, privacy, manipulation, and cyberwarfare, the study aims to map and quantify the evolution of the ethics–Big Data nexus using bibliometric methods. This fills a gap in existing research, where ethics and Big Data are often treated separately despite a growing need for integrated understanding.
Literature Review
The paper situates its inquiry within discussions on the ethical implications of Big Data highlighted by Boyd and Crawford (2012), O’Neil (2016), and others, including concerns about discrimination, bias, labor practices, consumer exploitation, cyberwarfare, and democratic processes (e.g., election interference). It references debates on algorithmic harms, digital authoritarianism, autonomous weapons, and the rapid translation of speculative scenarios into reality. Public-health controversies (e.g., vaccination debates and COVID-19 misinformation) exemplify the salience of Big Data ethics in public life. Foundational works include Helbing and Balietti (2011) on ethical standards for socio-economic data, Boyd and Crawford (2012) on privacy and research ethics in social media, and subsequent papers addressing online data use for social research. The literature indicates that, despite extensive technical studies, integrated ethical analyses remain comparatively sparse and are often concentrated in biomedical and health contexts.
Methodology
Design: Bibliometric analysis focusing on quantitative citation features to trace the development of ethics in Big Data.
Data source: Web of Science Core Collection (WoS). Search conducted July 6, 2020, for Topic: "Big Data" AND "Ethic*" across all categories, covering 1900–July 2020. Topic search fields include title, abstract, author keywords, and Keywords Plus.
Sample: 892 records; 75 countries; 2696 authors; 1257 institutions; 591 journals; 19 languages. Total global citations (GCS) = 8621; total local citations (LCS) within the constructed database = 798.
Approach: Three bibliometric methods were applied:
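1) Descriptive analysis: Trends in publications over time, with local and global citation metrics kept distinct. The Local Citation Score (LCS) counts the citations a paper receives from other papers in the field-specific database, and Local Cited References (LCR) counts the references a paper makes to papers in that database, whereas the Global Citation Score (GCS) counts all of a paper's citations in WoS. Emphasis is placed on local metrics to assess intra-field development.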
2) Network-citation analysis: HistCite (2005) used to generate historiographs showing citation relationships among most-cited works in the local database (optimal around top 40 items; figure presented for top 25). Nodes represent papers (size reflects local citations), with directed edges indicating citation links and vertical time ordering.
3) Co-occurrence term analysis: VOSviewer (Van Eck & Waltman) used to perform NLP-based extraction of noun phrases from titles, abstracts, and keywords. From 156,555 unique terms, a minimum occurrence threshold of 20 yielded 172 terms. VOSviewer’s relevance scoring selected the top 60%, resulting in 103 terms; non-germane terms (e.g., place names, general statistical/time/quantity terms) were removed. Mapping employed SMACOF-based stress minimization and clustering via a resolution parameter γ to form thematic clusters. Distances reflect co-occurrence strength.
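The term-selection step described in item 3 can be sketched in Python. This is a minimal illustration, not VOSviewer's implementation: the function name `select_terms`, the toy documents, and the relevance scores passed in are all hypothetical, and VOSviewer computes its own relevance scores internally. The sketch only shows the two filters named above, a minimum occurrence threshold followed by keeping the top share of terms by relevance.

```python
from collections import Counter


def select_terms(doc_terms, relevance, min_occurrences=20, keep_fraction=0.6):
    """Filter terms by document frequency, then keep the most relevant share.

    doc_terms: list of sets, the noun phrases found in each document.
    relevance: dict mapping term -> relevance score (assumed to be given).
    """
    # Count each term once per document (binary counting is an assumption).
    counts = Counter(term for terms in doc_terms for term in set(terms))
    frequent = [t for t, n in counts.items() if n >= min_occurrences]
    # Rank the surviving terms from most to least relevant.
    ranked = sorted(frequent, key=lambda t: relevance.get(t, 0.0), reverse=True)
    keep = int(len(ranked) * keep_fraction)  # round down
    return ranked[:keep]


# Toy data (hypothetical): three documents and made-up relevance scores.
docs = [
    {"privacy", "consent", "year", "governance"},
    {"privacy", "consent", "year", "governance"},
    {"privacy", "patient"},
]
scores = {"privacy": 0.9, "consent": 0.8, "governance": 0.7, "year": 0.1}
selected = select_terms(docs, scores, min_occurrences=2, keep_fraction=0.5)
print(selected)
```

Applied to the reported counts, `int(172 * 0.6)` gives 103, which matches the number of terms retained after relevance scoring.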
Database choice rationale: WoS preferred over Scopus due to concerns about Scopus citation matching and duplicates; acknowledged WoS limitations include undercoverage in social sciences/humanities, limited books, and underrepresentation of non-English journals.
Outputs: Time-series of publications, author and paper rankings by LCS, historiograph of key works, and term co-occurrence cluster map identifying thematic hotspots.
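The difference between local and global citation counts, which underlies the LCS rankings, can be illustrated with a short Python sketch. The record layout, paper identifiers, and citation numbers below are invented for illustration and do not come from the study's data; HistCite derives these scores from WoS export files rather than from a dictionary like this one. For each paper in a collection, the sketch counts how often it is cited by other papers in the same collection (LCS) and how many of its own references point into the collection (LCR), while GCS is carried along as the database-wide total.

```python
def local_scores(collection):
    """Return (lcs, lcr) dictionaries for a collection of paper records.

    collection: dict mapping paper id -> {"cites": [ids], "gcs": int}.
    """
    ids = set(collection)
    lcs = {pid: 0 for pid in collection}
    lcr = {}
    for pid, record in collection.items():
        # Keep only references that point to other papers in the collection.
        local_refs = {ref for ref in record["cites"] if ref in ids and ref != pid}
        lcr[pid] = len(local_refs)
        for ref in local_refs:
            lcs[ref] += 1
    return lcs, lcr


# Hypothetical toy collection; "X9" stands for a paper outside the collection.
collection = {
    "P1": {"cites": ["X9"], "gcs": 50},
    "P2": {"cites": ["P1", "X9"], "gcs": 12},
    "P3": {"cites": ["P1", "P2"], "gcs": 3},
}
lcs, lcr = local_scores(collection)
# Rank papers by local citation score, highest first.
ranking = sorted(collection, key=lambda pid: lcs[pid], reverse=True)
for pid in ranking:
    print(pid, "LCS:", lcs[pid], "LCR:", lcr[pid], "GCS:", collection[pid]["gcs"])
```

In this toy case a paper can have a large GCS and still a small LCS, which is why the two scores are reported separately.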
Key Findings
- Corpus characteristics: 892 documents; 75 countries; 2696 authors; 1257 institutions; 591 journals; 19 languages; GCS 8621; LCS 798.
- Dispersion: Publications are spread thinly across outlets, averaging about 1.5 papers per journal (892 papers in 591 journals); authorship is moderately concentrated, at about 4.5 authors per journal as reported.
- Temporal trends: The earliest key contribution dates to 2011 (Helbing & Balietti), and the foundational 2012 paper by Boyd & Crawford catalyzed the field. Publication counts grew slowly, with the highest annual totals in 2018 (173 papers) and 2019 (196). For 2020, 107 papers had appeared by the search date of July 6.
- Author impact (LCS): Top authors include Crawford (181 LCS; 7 publications), Boyd (160; 2), Floridi (90; 5), Mittelstadt (89; 5), Vayena (25; 11). Despite productivity, some authors (e.g., Vayena) have lower average LCS per paper, suggesting early-stage field dynamics. A gender tally reported 429 local citations for female scholars versus 211 for male scholars among top-cited items.
- Most-cited works (LCS): 1) Boyd & Crawford (2012) on Big Data ethical questions in social media (LCS 160); 2) Mittelstadt & Floridi (2015) on biomedical Big Data ethics (LCS 71); 3) Mittelstadt et al. (2016) mapping ethics of algorithms (LCS 18); others include works on journalism/data science ethics, market research, patient experience commodification, and health analytics.
- Citation network: Historiograph shows Boyd & Crawford (2012) as the pivotal node influencing subsequent landmark works; Mittelstadt & Floridi (2015) form the second major hub, especially in biomedical contexts.
- Thematic structure (co-occurrence analysis): Three major clusters:
• Legal/governance cluster: governance, regulation, law, rights (including internet-sourced private data issues).
• Scientific/knowledge-sharing cluster: ethics in data sharing, access to knowledge, research outputs.
• Medical/AI cluster: ethics in medicine, healthcare, and artificial intelligence.
High-frequency terms include science, health, medicine, governance, artificial intelligence, and knowledge.
- Domain emphasis: Ethical discussions on Big Data are concentrated in biomedical and health domains; fewer contributions from economics, management, business, sociology, and political science.
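The dispersion and citation ratios implied by the corpus counts can be recomputed directly. The Python sketch below uses only the totals reported above; the variable names are illustrative, and the local share of citations is a derived figure that is not stated in the findings.

```python
# Corpus totals as reported in the findings.
papers, journals, authors = 892, 591, 2696
gcs_total, lcs_total = 8621, 798

papers_per_journal = papers / journals
authors_per_journal = authors / journals
# Share of all WoS citations that come from within the collection (derived).
local_share = lcs_total / gcs_total

print(f"papers per journal:  {papers_per_journal:.2f}")
print(f"authors per journal: {authors_per_journal:.2f}")
print(f"local share of citations: {local_share:.1%}")
```

Both per-journal figures are ratios of corpus totals. Because each author is counted once across the whole corpus, the second ratio is not the mean number of authors on a paper.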
Discussion
The bibliometric evidence indicates that while Big Data has attracted vast technical attention, ethical analysis remains comparatively underrepresented and grows slowly. The dominance of a few foundational works (especially Boyd & Crawford, 2012; Mittelstadt & Floridi, 2015) suggests that key conceptual frameworks in privacy, governance, and biomedical ethics structured the field’s evolution. The co-occurrence clusters corroborate that ethical debates are anchored in legal/governance issues, scientific data-sharing norms, and especially medical/AI applications, reinforcing the observation that biomedical contexts drive much of the ethical discourse. The gap in social sciences, economics, and politics highlights underexplored areas such as fairness in markets, policy impacts, and societal power dynamics of data-driven systems. Emerging societal challenges (e.g., COVID-19 contact tracing, misinformation) further elevate the urgency of integrated ethical scrutiny, implying that future research and policy should broaden beyond biomedical contexts while leveraging robust, transparent methodological and governance frameworks.
Conclusion
Ethics-focused research on Big Data is relatively scarce and concentrates heavily in health and biomedical fields. A small set of seminal works has shaped subsequent scholarship, with citation networks revealing Boyd & Crawford (2012) as foundational and Mittelstadt & Floridi (2015) as a key biomedical anchor. Thematic mapping identifies legal/governance, scientific data-sharing, and medical/AI as core clusters. The authors note limited representation from economics (12 papers), management (42), business (38), sociology (27), and political science (20), indicating important opportunities for expansion. The study contributes by quantifying publication patterns, mapping influential works, and delineating thematic hotspots at a moment of accelerated relevance (e.g., COVID-19). Future research should broaden disciplinary engagement, deepen analysis of socio-economic and political implications, and develop adaptable ethical frameworks suitable for rapidly evolving data practices.
Limitations
- Database coverage: Reliance on WoS, which underrepresents social sciences/humanities, non-English journals, and books/book chapters, may bias results.
- Citation-based measures: Citation practices vary across fields and over time, so LCS and GCS may not fully capture influence or quality; a paper with a high GCS but a low LCS may be widely cited overall while having little intra-field impact.
- Search strategy constraints: Topic search limited to terms "Big Data" and "Ethic*" may omit relevant works using different terminology.
- Bibliometric scope: Focus on quantitative patterns cannot substitute for in-depth qualitative assessments of ethical arguments or practices.