Creating Expressive Social Robots That Convey Symbolic and Spontaneous Communication

E. Fernández-Rodicio, Á. Castro-González, et al.

Robots are increasingly used in human-facing tasks. This paper models robot expressiveness as two independently generated dimensions, symbolic and spontaneous. It combines predefined multimodal expressions with modulation strategies to convey mood and emotions, and evaluates whether these strategies improve user perception and convey recognizable affective states.

Introduction

The paper addresses how social robots can naturally communicate with humans by combining two components of communication proposed by Buck and VanLear: symbolic (task-oriented message content) and spontaneous (emotional/motivational display). The authors identify three problems: (1) designing and synchronizing multimodal symbolic expressions across a robot’s actuators; (2) continuously expressing the robot’s internal affective state (mood and emotions), not just through punctual gestures; and (3) fusing symbolic and spontaneous components so task-related content remains recognizable while the robot’s internal state is perceivable. The work proposes an approach that decouples symbolic and spontaneous dimensions, enabling developers to focus on task-related gestures while the system conveys affect via modulation. The study’s purpose is to evaluate, via subjective user questionnaires (warmth, competence, discomfort), the impact of adding spontaneous communication and fusing it with symbolic communication on users’ perception of a social robot. The contributions include: a method grounded in communication theory to separately generate symbolic and spontaneous dimensions; an expressiveness architecture using predefined multimodal gestures with dynamic adaptation and modulation strategies; and user experiments assessing perceptual effects.

Literature Review

Expressiveness generation is categorized into approaches using predefined gesture libraries and those generating gestures at runtime. Runtime generation often maps speech audio/text to body motions via neural architectures (e.g., bi-directional LSTMs, encoder–decoder, adversarial methods), sometimes conditioned on image schemas, prosody, sentiment, affect, or speaker style (Hasegawa, Kucherenko, Ginosar, Yoon, Ravenet, Spitale & Matarić, Zabala, Qi, Ahuja, Fares). Predefined-expression approaches model and synchronize multimodal behaviors and may adapt length/amplitude or select variants by mood/emotion (Meena, Xu, Glas, Ribeiro’s SERA, Groechel AR arms, Gomez). Multimodality typically includes speech and motions, plus gaze/facial expressions and LEDs; some add touch screens or breathing interfaces. Adaptability in generated methods arises from conditioning on prosody/content; predefined methods introduce modulation parameters (e.g., speed/amplitude) or emotion-specific variants. Several works modulate gestures by affect (mood/emotion), user personality, identity, or speaker style. Compared to prior art, the present work uniquely decouples symbolic and spontaneous dimensions within a multimodal framework, modulates predefined expressions via parameter-based and profile-based strategies, integrates paraphrasing and automatic non-verbal gesture prediction, and supports interface-specific modulation profiles for mood/emotions without requiring separate gesture variants for each affect.

Methodology

Platform: Mini, a tabletop social robot with 5 DoF (neck: 2; shoulders: 2; base: 1), colored LED heart, OLED eyes, and a touch screen. Software architecture includes input/output modules, context memory, applications, a Decision-Making System, and the Expression Manager.
Expression Manager: Orchestrates expressiveness via (1) interface conflict tracking; (2) priority queues (high/medium/low) to schedule/cancel expressions; (3) modulation of expressiveness to display internal affective state; (4) translation of expression actions into actuator commands. It consists of: Expression Scheduler (plans execution, resolves conflicts, handles activation/cancellation), Expression Executor (loads and configures gestures), and Interface Players (per actuator: five joints, eyes, LEDs in heart/cheeks, voice/ETTS, touch screen), implemented with ROS ActionLib for enable/disable, feedback, and cancellation.
Gesture modeling: Gestures are state machines using a modified FlexBE enabling parallel execution, streamlined GUI defaults, and a common gesture template. Expressions can also be generated dynamically from action lists at runtime.
Symbolic adaptation techniques:
(1) Dynamic reconfiguration: applications can replace specific actions’ parameters within a gesture (e.g., utterance content, final motor positions) to adapt to context (time of day, user identity) without adding/removing actions; standard action naming eases reuse.
(2) Paraphrasing: two modules, namely a multilingual pipeline translating Spanish↔English and paraphrasing via transformer models (PMO-T5, Parrot, PEGASUS, GPT-3) with Google/DeepL/Argos translation, and a user-adaptive zero-shot GPT-4 prompt-based paraphraser tailored to user profile (name, age).
(3) Automatic gesture prediction: token classification over the speech transcript using fine-tuned BERT/DistilBERT/RoBERTa to label token segments with gesture types (21 classes), then select suitable predefined gestures whose durations fit speech segments.
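As a rough illustration of the last step of automatic gesture prediction, the sketch below picks, for each labeled speech segment, a predefined gesture of the predicted type whose duration fits the segment. The `Gesture` and `Segment` types, the example gesture names and labels, and the rule of preferring the longest gesture that fits are assumptions made for illustration; the summary does not say how the authors' system ranks candidates.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Gesture:
    name: str
    gesture_type: str  # one of the predicted gesture classes
    duration: float    # seconds


@dataclass(frozen=True)
class Segment:
    gesture_type: str  # label assigned by the token classifier
    start: float       # seconds from the start of the utterance
    end: float


def pick_gesture(segment: Segment, library: List[Gesture]) -> Optional[Gesture]:
    """Return a gesture of the segment's type that fits its duration, if any."""
    available = segment.end - segment.start
    candidates = [
        g for g in library
        if g.gesture_type == segment.gesture_type and g.duration <= available
    ]
    # Assumed ranking rule: prefer the gesture that fills most of the segment.
    return max(candidates, key=lambda g: g.duration, default=None)


def plan_gestures(segments: List[Segment],
                  library: List[Gesture]) -> List[Tuple[float, str]]:
    """Pair each labeled segment with (start_time, gesture_name) when one fits."""
    plan = []
    for seg in segments:
        gesture = pick_gesture(seg, library)
        if gesture is not None:
            plan.append((seg.start, gesture.name))
    return plan


if __name__ == "__main__":
    # Hypothetical gesture library and classifier output.
    library = [
        Gesture("nod_short", "affirmation", 0.8),
        Gesture("nod_long", "affirmation", 1.6),
        Gesture("wave", "greeting", 2.0),
    ]
    segments = [
        Segment("greeting", 0.0, 1.5),     # shorter than the only greeting gesture
        Segment("affirmation", 1.5, 2.7),  # about 1.2 s available
    ]
    print(plan_gestures(segments, library))
```

With this hypothetical library, the greeting segment gets no gesture because the only greeting gesture is longer than the segment, while the affirmation segment is paired with the shorter nod.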
Spontaneous (affect) expression: Affect generation blends moods and emotions in separate valence–arousal spaces (−100 to 100). Emotions (joy, sadness, anger, surprise) have continuous intensities (0–100) and are short-lived and high-intensity; moods are discrete (happy, anxious, bored, relaxed), longer-term, and lower-intensity. The Expression Manager fuses affect into expressiveness with two modulation strategies:
(a) Parameter-based modulation: global speed and amplitude adjustments (seven levels: big/medium/small decrease; normal; small/medium/big increase), applied across interfaces (speed affects voice prosody rate, joint speed, LED blinking speed, and eye blinking speed; amplitude affects voice volume/pitch and LED brightness). Values are computed by Vf = Vi + s*(Vl − Vi), with s ∈ {0.33, 0.66, 1} and experimentally defined limits for naturalness.
(b) Profile-based modulation: handcrafted interface-specific profiles mapping each internal state (four emotions, four moods, neutral) to parameter settings (e.g., pitch, prosody rate, gaze direction, blink frequency, posture, LED color/intensity/heart rate), loaded by Players and applied when expressions lack explicit parameter values.
Integration of DL modules: Paraphrasing and gesture prediction can run locally or on an external server (Intel i9-10900K; 2× RTX 3090; 64 GB RAM) via socket connections. The Scheduler ensures paraphrasing precedes gesture prediction to maintain timing coherence.
Evaluation: (1) Performance: on Mini (Intel i5-3550, 4 cores @ 3.3 GHz, 16 GB RAM, Ubuntu 16.04), response time was measured for unimodal gestures and for a complex multimodal gesture (20 actions) across interfaces; resource usage was measured when all interfaces act simultaneously; results were averaged over 10 repetitions and compared against human interaction thresholds (0.25 s, 0.38 s, 1 s).
(2) User Study 1: video-based quiz game interaction with two conditions: Neutral (no affect, emotional expressions replaced by neutral, constant speech prosody/pitch/volume) vs Expressive (profile-based affect conveying moods/emotions; emotional speech; moods change from neutral to happy after first correct answer; emotions decay across explanations); participants complete demographics and RoSAS (warmth, competence, discomfort) plus mood perception questions; randomized assignment. Pre-evaluation validated affect recognition with separate videos for moods, emotions, and punctual emotional expressions.
(3) User Study 2: video-based landmark guessing game; Neutral condition repeats identical feedback gestures and transition utterances with normal speed/amplitude; Expressive condition adapts feedback via parameter-based modulation: congratulate gestures increase speed/amplitude with successive correct answers; regret gestures decrease them with wrong answers; transitions adapted to game context; RoSAS administered with randomized assignment.
Statistical analysis used Shapiro–Wilk for normality, transformations as needed, independent samples t-tests with Levene’s tests, and ANCOVA with covariates (familiarity with technology/robotics, willingness to interact/own a robot).
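The parameter-based modulation rule given in this section, Vf = Vi + s*(Vl − Vi), can be read as moving a parameter from its initial value Vi toward a limit Vl by a fraction s. The sketch below applies it to a single parameter. It assumes that Vl is the upper limit for increases and the lower limit for decreases, and that the factors 0.33, 0.66, and 1 correspond to small, medium, and big changes; the `Level` enum, the `modulate` function, and the numeric limits in the example are hypothetical.

```python
from enum import Enum


class Level(Enum):
    """The seven modulation levels; the sign gives the direction of change."""
    BIG_DECREASE = -3
    MEDIUM_DECREASE = -2
    SMALL_DECREASE = -1
    NORMAL = 0
    SMALL_INCREASE = 1
    MEDIUM_INCREASE = 2
    BIG_INCREASE = 3


# Step factors s from the text, keyed by the magnitude of the level
# (assumed mapping: small -> 0.33, medium -> 0.66, big -> 1).
STEP = {1: 0.33, 2: 0.66, 3: 1.0}


def modulate(v_initial: float, level: Level, lower: float, upper: float) -> float:
    """Apply Vf = Vi + s * (Vl - Vi) to one speed or amplitude parameter."""
    if not lower <= v_initial <= upper:
        raise ValueError("initial value must lie within [lower, upper]")
    if level is Level.NORMAL:
        return v_initial
    # Assumption: Vl is the upper limit for increases, the lower one for decreases.
    v_limit = upper if level.value > 0 else lower
    s = STEP[abs(level.value)]
    return v_initial + s * (v_limit - v_initial)


if __name__ == "__main__":
    # Hypothetical joint-speed parameter: nominal 1.0, limits 0.5 and 2.0.
    for level in Level:
        print(level.name, round(modulate(1.0, level, 0.5, 2.0), 3))
```

Because s never exceeds 1, the modulated value always stays between the initial value and the chosen limit, which is one way to keep every level inside the experimentally defined range.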

Key Findings

System performance: RAM usage ~2.2–3.3% and ~42.2% of one CPU core during worst-case multimodal execution; unimodal expressions met reaction time thresholds (≤0.25–0.38 s); complex multimodal gesture averaged ~0.82 s, meeting the 1 s naturalness threshold.
Affect recognition (pre-evaluation, n=55): Moods: neutral correctly identified by 44%; happy by 49%. Emotions: joy 51%; sadness 71%; anger 31% (40% selected 'none'); surprise 24% (45% 'none'). Punctual emotional expressions recognition higher: anger 92%; joy 94%; sadness 69%; surprise 94%.
User Study 1 (n=83; neutral vs expressive affect): No significant differences between conditions for RoSAS warmth, competence, or sqrt(discomfort); competence ratings did not differ, confirming H1.4; H1.1–H1.3 not supported.
User Study 2 (n=69; neutral vs expressive contextual modulation): Significant increase in warmth under the expressive condition compared to neutral (t(65) = −2.173, p = 0.033), supporting H2.1 and H2.2. No overall significant differences for competence and discomfort (H2.4 supported; H2.3 not supported). ANCOVA indicated marginal differences for competence when controlling for familiarity with robotics (F = 3.845, p = 0.067, η² = 0.051). Subgroup analyses showed significant competence differences for participants with mid-high/high willingness to own a robot (t(29) = −3.122, p = 0.004) and marginal differences for those with mid-high/high familiarity with technology.
Overall, modulation strategies can convey recognizable affect and, when adapted to interaction context (parameter-based plus dynamic reconfiguration), significantly improve perceived warmth.

Discussion

The findings indicate that the proposed modulation and expressiveness framework can make a robot’s internal state perceivable and improve user perception, particularly warmth, when expressiveness is adapted to interaction context. The pre-evaluation confirmed recognition of moods and emotions above chance, with punctual emotional expressions particularly salient. In the first study, the absence of significant differences may stem from Mini’s inherently warm baseline design, short scenario duration with rapid affect changes, and possibly high-intensity affect swings tuned for video clarity rather than natural progression. In contrast, the second study demonstrated that contextual adaptation of symbolic and spontaneous components—via speed/amplitude modulation tied to performance and dynamic reconfiguration of utterances—positively influences perceived warmth without altering competence or discomfort. Subgroup effects suggest that participants more interested in robots or familiar with technology are more sensitive to expressiveness changes in competence ratings. These results support the decoupled architecture: developers can focus on symbolic content while the system handles spontaneous affect through modulation, achieving a better balance between task clarity and social expressiveness. The importance of context-aware adaptation emerges as a key factor in enhancing social perception.

Conclusion

The paper presents a decoupled, multimodal expressiveness architecture for social robots that separately models symbolic communication (task-related messages) and spontaneous communication (affect display). Gestures are implemented as modified FlexBE state machines, scheduled with priority and interface conflict management, and executed via interface-specific Players. To overcome repetitiveness and scalability limits of predefined gestures, the system integrates three adaptation techniques: dynamic reconfiguration of gesture actions, parameter-based global speed/amplitude modulation, and profile-based interface-specific affect modulation. It further incorporates two paraphrasing modules (variability and user-adaptive) and an automatic gesture prediction tool to select non-verbal accompaniments for speech. Evaluations show acceptable performance and demonstrate that affect can be recognized, and that contextual modulation significantly increases perceived warmth. Future work includes long-term, in-person studies; richer checks for compatibility among simultaneous expressions; extending modulation to more internal states and automating parameter adaptation; and simplifying gesture authoring while preserving developer control.

Limitations

The evaluations are video-based, which may underrepresent nuances of real interactions compared to in-person studies. Expressiveness generation is highly platform-specific, complicating objective comparisons across systems. CPU usage indicates room for performance optimization under heavy multimodal loads. Parameter-based modulation currently requires manual control from applications; automating adaptation to internal and contextual factors would improve usability. Profile-based modulation is presently tailored to affect (mood/emotion) and handcrafted profiles may not scale well if the number of internal states grows. The fusion method assumes coherence between task-related content and internal state, which may need broader architectural support.
