
Exploring The Design of Prompts For Applying GPT-3 based Chatbots: A Mental Wellbeing Case Study on Mechanical Turk

H. Kumar, I. Musabirov, et al.

This study explores how GPT-3 can be used to build chatbots that support mood and mental wellbeing. Conducted by a team at the University of Toronto, it offers insights into prompt design grounded in a randomized experiment and user feedback.

Introduction
The paper investigates how to design and evaluate prompts for applying GPT-3 to build chatbots for a specific task: brief (5-minute) conversations aimed at helping users manage mood and mental wellbeing. It highlights challenges arising from the complexity of large language models and open-ended user input, asking how early feasibility tests generalize at scale and how non-NLP experts can optimize prompts. The work centers on a case study using a randomized factorial experiment to systematically vary chatbot identity, intent, and behavior, and to assess effects on conversation dynamics and user perceptions (risk, trust, expertise, willingness to re-engage). The purpose is to provide a methodology and empirical insights for HCI designers seeking to leverage GPT-3 in sensitive domains, while acknowledging safety and ethical considerations.
Literature Review
The background reviews GPT-3 and prompt programming, noting that zero-shot prompts can outperform few-shot prompts and that safety, trust, and efficacy are concerns in sensitive contexts such as eHealth. Prior work has explored AI-generated content (e.g., emails) and prompt engineering for text-to-image models. In mental health, chatbots (often rule-based) have been used for CBT and other supportive interventions (e.g., Woebot), encouraging self-disclosure and self-compassion, aiding digital counseling, and supporting emotion regulation. Best practices for mental health chatbot design emphasize empathy, active listening, trustworthiness, and clear boundaries. This work situates GPT-3-based chatbots within that landscape, exploring whether LLM-driven, prompt-conditioned agents can exhibit desirable qualities while managing risks.
Methodology
Design: A randomized 2×3×3 factorial experiment manipulated three prompt dimensions for a GPT-3 chatbot: identity (Friend or Coach), intent (1: supportive, open-ended reflection and coping skills; 2: CBT-oriented identification and reframing of negative thoughts and development of coping strategies; 3: problem solving: define problems, generate solutions, select the best one, and plan), and behavior (1: strong interpersonal skills; 2: trustworthy, active listener, empathetic, supportive attention without reciprocity; 3: optimistic, flexible, empathetic). This yielded 18 arms. Prompt structure: "The following conversation is with IDENTITY. The IDENTITY shows intent INTENT. The IDENTITY has behavior BEHAVIOR." Each user message was appended to the running prompt and sent to GPT-3, and the model's output formed the bot's reply (a minimal sketch of this loop appears at the end of this section).

Participants and setting: 945 MTurk participants engaged in a 5-minute chatbot interaction within a web-based interface that clearly disclosed the AI nature of the agent and its purpose. The interface opened with an introductory AI disclosure and an open-ended greeting to encourage substantive sharing.

Survey flow: (1) pre-chat user information (energy, mood, therapeutic history, propensity toward technology), (2) a transparency check about understanding the AI model, (3) the 5-minute chatbot interaction, (4) post-chat evaluation measures (perceived risk, trust, expertise, willingness to interact again), and (5) demographics (age, gender). The survey was implemented in Qualtrics.

Data quality: Reverse-formulated items served as attention checks; 114 participants (12.1%) whose response profiles indicated inattention (e.g., uniform answers even across reverse-coded items) were excluded. Neutral responders were retained.

Analyses: Quantitative analyses summarized means and standard errors for the outcome measures. Nonparametric tests assessed group differences: Wilcoxon rank-sum for prior mental health help vs. none, and Kruskal-Wallis for propensity-to-technology groups (both are sketched below). Effects of the experimental factors were also examined; no reliable differences across arms were detected. Qualitative thematic analysis (two coders using NVivo) assessed responses about comfort interacting with the chatbot and examined conversation logs to characterize conversational dynamics across prompt modifiers. A co-author with psychology training reviewed all transcripts for safety signals; none indicating serious risk were found (verified by a second co-author).
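To make the conversation loop concrete, here is a minimal sketch of the prompt assembly and completion cycle described above. It assumes the legacy openai-python (<1.0) Completions interface; the engine name, decoding parameters, and the exact wording of the intent and behavior modifiers are illustrative stand-ins, not the paper's verbatim prompts.

```python
# Sketch of the 2x3x3 prompt-conditioned chatbot described above.
# Assumptions (not from the paper): modifier wording, engine, and decoding
# parameters are illustrative; uses the legacy openai-python (<1.0) API.
import itertools
import random

import openai

IDENTITIES = ["friend", "coach"]
INTENTS = [
    "support open-ended reflection and the development of coping skills",
    "help identify and reframe negative thoughts and develop coping strategies",
    "help define the problem, generate solutions, pick the best one, and make a plan",
]
BEHAVIORS = [
    "strong interpersonal skills",
    "trustworthy, empathetic, supportive active listening",
    "optimism, flexibility, and empathy",
]

# 2 x 3 x 3 = 18 experimental arms; each participant is randomly assigned one.
ARMS = list(itertools.product(IDENTITIES, INTENTS, BEHAVIORS))


def build_prompt(identity: str, intent: str, behavior: str) -> str:
    """Fill the paper's prompt template with one arm's modifiers."""
    return (
        f"The following conversation is with a {identity}. "
        f"The {identity} shows intent to {intent}. "
        f"The {identity} has the behavior of {behavior}.\n\n"
    )


def chat_turn(history: str, user_message: str) -> tuple[str, str]:
    """Append the user's message, query GPT-3, and return (reply, new history)."""
    history += f"Human: {user_message}\nAI:"
    response = openai.Completion.create(
        engine="text-davinci-002",  # illustrative engine choice
        prompt=history,
        max_tokens=150,
        temperature=0.9,
        stop=["Human:", "AI:"],
    )
    reply = response.choices[0].text.strip()
    return reply, history + f" {reply}\n"


# Example: assign an arm at random and run one conversational turn.
identity, intent, behavior = random.choice(ARMS)
history = build_prompt(identity, intent, behavior)
reply, history = chat_turn(history, "I've been feeling really drained this week.")
print(reply)
```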
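The reported group comparisons are standard nonparametric tests. Below is a minimal sketch using scipy, assuming a tidy per-participant table; the file name and column names (prior_help, tech_propensity, and the four outcome columns) are hypothetical stand-ins for the study's survey data.

```python
# Sketch of the nonparametric group comparisons described above.
import pandas as pd
from scipy.stats import kruskal, ranksums

df = pd.read_csv("post_chat_ratings.csv")  # hypothetical per-participant table

outcomes = ["risk", "trust", "expertise", "willingness"]

for outcome in outcomes:
    # Wilcoxon rank-sum: prior professional mental-health help vs. none.
    with_help = df.loc[df["prior_help"] == 1, outcome]
    no_help = df.loc[df["prior_help"] == 0, outcome]
    stat, p = ranksums(with_help, no_help)
    print(f"{outcome}: prior help vs none, W={stat:.2f}, p={p:.4f}")

    # Kruskal-Wallis: low / moderate / high propensity toward technology.
    groups = [g[outcome].values for _, g in df.groupby("tech_propensity")]
    h, p = kruskal(*groups)
    print(f"{outcome}: propensity to technology, H={h:.2f}, p={p:.4f}")
```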
Key Findings
- Overall ratings (N=831 after exclusions):
  - Perceived Risk: 3.51 (SEM 0.04)
  - Trust: 2.98 (0.02)
  - Expertise: 4.07 (0.02)
  - Willingness to Interact Again: 3.44 (0.05)
- Experimental factors (identity, intent, behavior): no statistically reliable differences detected across arms in the primary outcomes.
- Prior mental health help (Wilcoxon rank-sum):
  - Risk: 3.64 (0.04) with history vs 3.19 (0.08) without (p<0.001)
  - Trust: 2.93 (0.02) with history vs 3.10 (0.04) without (p<0.001)
  - Expertise: 4.05 (0.02) with history vs 4.10 (0.05) without (p=0.069)
  - Willingness to interact again: 3.65 (0.05) with history vs 2.88 (0.11) without (p<0.001)
- Propensity to technology (Kruskal-Wallis, p<0.001 across measures):
  - Risk: low 2.62 (0.12), moderate 3.07 (0.07), high 3.74 (0.04)
  - Trust: low 3.16 (0.08), moderate 3.07 (0.04), high 2.93 (0.02)
  - Expertise: low 3.43 (0.15), moderate 3.75 (0.05), high 4.23 (0.02)
  - Willingness: low 2.98 (0.21), moderate 3.22 (0.09), high 3.55 (0.06)
- Qualitative comfort and themes: ~70% of interactions were positive; ~60% of participants were comfortable or very comfortable. Negative themes centered on data privacy, storage, monitoring, repetition, and a perceived lack of genuineness.
- Conversation dynamics across prompt modifiers:
  - Identity: the Friend identity elicited on average 6–7 more user words than the Coach identity.
  - Intent: the supportive reflection intent led to shorter user responses than the CBT or problem-solving intents.
  - Common helpful dynamics: restating concerns, breaking down problems, eliciting elaboration (CBT), conveying that users are heard, maintaining context, offering rationales, proposing short- and long-term strategies, and acknowledging limitations/referrals.
  - Problematic pattern: occasional repetitiveness/inflexibility in the problem-solving intent, leading to user frustration.
Discussion
Findings suggest that while users generally rate GPT-3 chat interactions as moderately trustworthy, somewhat risky, and highly expert, willingness to re-engage is moderate. Prior experience with professional mental health support is associated with higher perceived risk, slightly lower trust, yet higher willingness to interact again, indicating nuanced expectations and openness among experienced users. Higher propensity to adopt technology is associated with higher perceived risk and expertise, lower trust, and greater willingness to re-engage. The factorial manipulation of prompt identity, intent, and behavior did not yield strong quantitative differences across arms, indicating that these prompt-level factors, as instantiated here, may have limited impact on the global perceptions measured, or that their effects are subtle. Nevertheless, qualitative analyses revealed meaningful conversational patterns associated with intents (e.g., CBT vs problem solving) and minor engagement differences (the Friend identity eliciting longer user responses). These insights can inform prompt design, emphasizing approaches that foster elaboration, contextual responsiveness, rationales for suggestions, and acknowledgment of limits. Safety considerations remain paramount: although no serious risk was detected in more than 900 transcripts within a brief, disclosed, MTurk context, the authors caution against generalizing safety to other contexts or longer interactions. Real-time monitoring and automated risk detection are recommended. Overall, the work demonstrates how randomized factorial designs and mixed-methods analyses can inform prompt engineering for LLM chatbots in sensitive applications.
Conclusion
This paper provides a case study and methodology for designing and evaluating GPT-3-based chatbots for mental well-being conversations. Contributions include: (1) a proof-of-concept deployment enabling 5-minute mood support chats, (2) qualitative insights from 945 conversational sessions, and (3) a randomized factorial experiment exploring prompt dimensions (identity, intent, behavior) with associated quantitative and qualitative evaluations. While prompt modifiers showed limited quantitative effects on perceptions, qualitative findings highlight conversational strategies that appear helpful and pitfalls such as repetition and inflexibility. Future research directions include: expanding prompt dimensions and combinations, longer and more naturalistic deployments, integration of real-time monitoring and automated risk classifiers, chained LLMs for safety and coherence, and leveraging the collected conversation corpus for systematic evaluation across varied prompt designs.
Limitations
- Context and duration: Interactions were brief (5 minutes) and conducted on MTurk with explicit AI disclosure; findings may not generalize to clinical or high-risk settings or to longer conversations.
- Safety generalizability: The absence of observed serious risk in this corpus does not guarantee safety at larger scales or in different contexts; longer interactions may surface issues.
- Experimental sensitivity: The lack of significant differences across prompt arms may reflect limited effect sizes, measurement sensitivity, or statistical power for interaction effects.
- Data quality and sampling: MTurk sample characteristics and the exclusion of 12.1% of participants based on attention checks may affect generalizability; self-report measures are subject to bias.
- Scope of prompts: Only two identities, three intents, and three behaviors were tested; other formulations or richer instruction sets may yield different outcomes.