Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality
F. Dell'Acqua, E. McFowland III, et al.
Discover how GPT-4 reshapes real-world consulting: in a pre-registered experiment with 758 consultants, AI access led to 12.2% more tasks completed, 25.1% faster completion, and over 40% higher quality on tasks within AI’s “jagged technological frontier,” while tasks outside that frontier fared worse.
~3 min • Beginner • English
Introduction
The paper investigates how AI—specifically GPT-4—affects knowledge worker performance within a "jagged technological frontier" where AI excels at some tasks but fails at others of seemingly similar difficulty. Using randomized controlled field experiments with highly skilled consultants, the authors examine when AI boosts productivity and quality and when it degrades performance. The context is the rapid rise of capable LLMs (e.g., ChatGPT) that overlap with professional tasks such as writing, analysis, and creativity, coupled with their opacity and unpredictable failure points. The purpose is to quantify AI’s impact on complex, realistic workflows and to understand how professionals can navigate this uneven frontier. The study is important for organizations and policymakers because LLMs can be adopted with minimal technical investment, potentially transforming high-end knowledge work while introducing risks where AI is outside its capability frontier.
Literature Review
The paper situates LLMs within broader AI and organization literature. Prior work on earlier AI generations suggested nonroutine, hard-to-codify tasks were shielded from automation (Autor et al., 2003; Acemoglu & Restrepo, 2019), and technology often impacted lower-skilled occupations (Goldin & Katz, 1998). The release of ChatGPT shifted the discussion: LLMs demonstrated unexpected proficiency in creative, analytical, and professional examinations (Kung et al., 2023; Girotra et al., 2023; Boussioux et al., 2023), suggesting higher exposure for creative, well-paid occupations (Eloundou et al., 2023; Felten et al., 2023). Early studies found productivity gains in writing and programming (Noy & Zhang, 2023; Peng et al., 2023) and idea generation (Girotra et al., 2023; Boussioux et al., 2023). Yet LLMs are opaque, can hallucinate, and have unclear failure points, raising issues of trust, validation, and appropriate collaboration (Lebovitz et al., 2021, 2022; Glikson & Woolley, 2020; Anthony et al., 2023; Raisch & Krakowski, 2021). The authors synthesize these strands to motivate examining AI’s uneven capabilities and their organizational implications via field experiments.
Methodology
The authors conducted two pre-registered randomized experiments (IRB-approved; OSF pre-registration) with 758 Boston Consulting Group individual-contributor consultants (about 7% of the firm’s global cohort). The study had three phases: (1) an enrollment survey capturing demographics (gender, location, tenure, native language), psychological traits (Big Five, innovativeness, self-perceived creativity, paradox mindset), and openness to technology; (2) a main experimental phase with a pre-task survey, a baseline assessment task (no AI) for within-subject skill measurement, and an experimental task under randomized AI conditions; and (3) a post-task survey and qualitative interviews. Participants were stratified by gender, location, tenure, innovativeness, and native English status across the two experiments and treatments. Incentives included office-contribution recognition affecting bonuses, additional recognition for the top 20%, and gifts for the top 5%.

The two experiments modeled realistic consulting workflows: an inside-the-frontier task (creative product innovation) and an outside-the-frontier task (business case analysis combining quantitative data with interview insights). In both, participants first completed a similar assessment task without AI to establish baseline skill and were then assigned to one of three conditions: Control (no AI), GPT Only (GPT-4 via a company platform using the OpenAI API), or GPT + Overview (GPT-4 plus a prompt-engineering overview with videos and documents). The platform mirrored ChatGPT and logged all prompts and responses (GPT-4 version as of end of April 2023; default system prompt and temperature). Timing: the inside-the-frontier assessment allowed 30 minutes and the experimental task 90 minutes; each outside-the-frontier task allowed 60 minutes, and participants could finish early. For the inside-the-frontier experimental task, participants had to remain for the full duration, so timing analyses focused on time to reach the final question.

The inside-the-frontier experimental task comprised 18 subtasks covering creativity, analysis, writing, and persuasion around designing and launching a niche-market shoe; each question was graded by two human graders (BCG/MBA) and also by GPT-4, then aggregated into composite quality scores. The outside-the-frontier task was crafted so that GPT-4 would produce a plausible but incorrect analysis unless interview insights were carefully integrated with the quantitative data; the primary outcome was the binary correctness of the strategic recommendation, alongside recommendation quality scores (1–10) assigned by human graders using a rubric. Additional analyses drew on prompt logs, retainment (copy-paste similarity via Restricted Damerau-Levenshtein distance), and semantic diversity measured with the Universal Sentence Encoder.
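The retainment measure is described here only at a high level. As a minimal sketch, not the authors' code, copy-paste similarity could be computed with the Restricted Damerau-Levenshtein (optimal string alignment) distance between a GPT output and a participant's final answer, normalized to [0, 1]; the normalization by the longer string's length is an assumption for illustration.

```python
def rdl_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance:
    counts insertions, deletions, substitutions, and adjacent transpositions."""
    la, lb = len(a), len(b)
    d = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(la + 1):
        d[i][0] = i
    for j in range(lb + 1):
        d[0][j] = j
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[la][lb]


def retainment(gpt_output: str, final_answer: str) -> float:
    """Similarity in [0, 1]; values near 1 suggest GPT text retained nearly verbatim.
    Dividing by the longer string's length is an illustrative normalization choice."""
    longest = max(len(gpt_output), len(final_answer), 1)
    return 1.0 - rdl_distance(gpt_output, final_answer) / longest
```

Under these assumptions, `retainment(gpt_text, final_text)` close to 1 would indicate that a participant kept GPT-4's output largely unedited in the final submission.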
Key Findings
Inside the frontier (creative product development):
- Quality gains: human-graded composite quality increased by 42.5% for GPT + Overview (coefficient 1.746 over a control mean of 4.099) and 38% for GPT Only (1.556); GPT-graded quality increased by 18.6% (1.349) and 16.8% (1.216), respectively (Table 1).
- Completion: task completion rose by about 12.2% (control mean 82.4%; GPT + Overview ~93%; GPT Only ~91%) (Table 2).
- Speed: time to reach the final question fell by 22.5% (−1,129 s) for GPT + Overview and 27.6% (−1,388 s) for GPT Only versus control (Table 3).
- Distributional effects: bottom-half-skill participants improved 43% over their baseline, while top-half participants improved 17% (Figure 5).
- Idea diversity: AI use raised average quality but reduced semantic variability across subjects; simulated GPT-only responses showed the highest similarity, indicating a homogenization risk (Figure 6; Appendix D). A sketch of this measure follows the list.
- Training effect: the prompt-engineering overview modestly outperformed GPT Only and increased retainment (copy-paste) behavior; higher retainment correlated with higher performance (Appendix C).

Outside the frontier (business case analysis):
- Correctness harms: AI reduced correctness relative to control by roughly 19 percentage points on average (control 84.4% vs. GPT + Overview ~60% and GPT Only ~70.6%) (Figure 7; Table 4: −24.5 pp and −13.9 pp, respectively).
- Speed: despite lower correctness, AI users were faster: GPT + Overview −689 s (~30% faster) and GPT Only −407 s (~18% faster) versus control (Table 5).
- Recommendation quality: even when wrong, AI users produced higher-quality recommendations (1–10 scale): GPT + Overview +25.1% (coefficient 1.475), GPT Only +17.9% (1.046); effects remained positive for both correct and incorrect subsets (Table 6; Figure 8).

Human-AI practices:
- Two successful navigation patterns emerged: “Centaurs” strategically split tasks between human and AI strengths, while “Cyborgs” integrate tightly at the subtask level, iteratively validating, editing, and directing the AI (Appendix E).
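The idea-diversity finding relies on embedding-based similarity. As a hedged sketch of one way to quantify homogenization, not the authors' pipeline, the mean pairwise cosine similarity of Universal Sentence Encoder embeddings across participants' responses can be compared between conditions, with higher values indicating less semantic diversity; the TF-Hub module choice and preprocessing below are assumptions.

```python
import numpy as np
import tensorflow_hub as hub

# Universal Sentence Encoder from TF-Hub (illustrative choice; the paper's exact
# embedding setup and preprocessing are not reproduced here).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def mean_pairwise_cosine(responses: list[str]) -> float:
    """Mean pairwise cosine similarity across responses.
    Higher values indicate more homogeneous (less semantically diverse) ideas."""
    vecs = embed(responses).numpy()                      # shape: (n_responses, 512)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize rows
    sims = vecs @ vecs.T                                 # cosine similarity matrix
    iu = np.triu_indices(len(responses), k=1)            # unique unordered pairs
    return float(sims[iu].mean())
```

Computing this statistic separately for the Control, GPT Only, and GPT + Overview groups and comparing the means would mirror the kind of comparison summarized in Figure 6, under these assumptions.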
Discussion
Findings demonstrate a jagged AI capability frontier that creates large, uneven impacts on professional work. Within the frontier, GPT-4 substantially boosts quality, speed, and completion, particularly leveling up lower-performing professionals. Outside the frontier, reliance on AI without rigorous validation decreases accuracy, even as communication quality and speed improve—highlighting the need for expert judgment, active interrogation, and verification. The observed Centaur and Cyborg practices illuminate effective modes of human-AI collaboration: strategic delegation to AI strengths (e.g., writing) versus deep integration through iterative prompts, validation, persona assignment, and modularization. Organizationally, adoption decisions should shift from “AI or not” to workflow-level design—identifying which tasks are inside/outside the frontier and configuring human-AI roles accordingly. The reduction in idea diversity suggests potential homogenization risks; firms may need diverse AI models or human-only phases to preserve exploratory variability depending on their innovation goals. More broadly, responsible AI use and training are critical, especially for high-stakes tasks; as the frontier evolves, organizations must invest in navigation skills, validation protocols, and role reconfiguration to harness AI’s benefits while mitigating errors.
Conclusion
The paper contributes the concept of a jagged technological frontier and causal field evidence on AI’s heterogeneous effects in high-end knowledge work. GPT-4 can dramatically improve productivity and quality for tasks inside its capability frontier, with especially strong benefits for lower-performing professionals, while degrading correctness outside the frontier if used uncritically. The authors identify emergent human-AI collaboration patterns—Centaur and Cyborg—that may guide effective integration strategies. As LLM capabilities expand, impacts will grow but remain uneven; organizations should redesign workflows, institute validation practices, and build skills to navigate the frontier. Future research should deepen understanding of prompting quality versus retainment, measure mechanisms behind homogenization and creativity trade-offs, and rigorously evaluate how Centaur/Cyborg practices influence outcomes across domains and as models improve.
Limitations
- Generalizability: participants were high-human-capital consultants at a single firm (BCG), limiting external validity to other professions or organizational contexts.
- Task design: the outside-the-frontier task was intentionally crafted so GPT-4 would err; results may depend on the specific failure modes and may change as models improve.
- Evolving technology: LLM capabilities and best-use practices are changing rapidly, making the frontier time-sensitive.
- Measurement constraints: inside-the-frontier tasks enforced a fixed duration, so timing improvements were measured as time to the final question rather than total task time.
- Grading and rubrics: human graders (BCG/MBA) and GPT-4 grading may introduce biases; GPT-4 was the more lenient grader.
- Retainment inference: copy-paste similarity (RDL) cannot fully disentangle high-quality iterative prompting from mere adoption; the mechanisms behind improved performance with higher retainment remain unclear.
- Idea diversity: the semantic-similarity analysis suggests homogenization but does not isolate best-idea quality or individual heterogeneity in AI use.
- Language and stratification: although randomization was stratified by native English status and other factors, residual confounding may remain in collaborative behaviors and outcomes.