Commonsense-Aware Prompting for Controllable Empathetic Dialogue Generation
Y. Liu and H. Kilicoglu
The paper addresses the problem that pre-trained language models (PLMs) for dialogue (e.g., T5, GPT-2/BART) often struggle to comprehend commonsense knowledge and emotions in conversational contexts, limiting empathetic response quality. Since human empathy relies on implicit commonsense reasoning, the authors investigate how to incorporate social commonsense into PLMs for empathetic dialogue generation and how to provide controllability during generation. They propose a framework that augments dialogue history with verbalized commonsense in prompts and applies a strategy-driven control mechanism (future discriminator) to steer responses. The work aims to improve emotional awareness and controllability in empathetic dialogue systems, which are valuable for domains such as healthcare and mental health support.
Prior empathetic dialogue generation often used non-PLM encoder–decoder models, jointly modeling cognitive (commonsense) and affective (emotion) aspects of empathy (e.g., Sabour et al., 2021; Li et al., 2022 with graph-based commonsense and VAD). These approaches, trained from scratch, may have limited generalization versus large PLMs. Prompt-based methods offer a way to inject external knowledge directly into PLMs via input augmentation. In controllable dialogue generation, prior works have guided responses using dialogue acts (Xu et al., 2018), topics (Wang et al., 2020), and external knowledge (Wu et al., 2021) to reduce hallucinations. Plug-and-play control methods such as PPLM (Dathathri et al., 2020) and Future Discriminators (FUDGE; Yang and Klein, 2021) enable attribute-conditioned generation without retraining large LMs, but have been little explored in dialogue. This study builds on these directions by combining commonsense-aware prompting (via COMET-ATOMIC 2020) and strategy-controlled generation (via FUDGE) for empathetic dialogue.
Task: Given dialogue history C = [X1, ..., XT−1], generate the next empathetic response Y = XT. The approach comprises three components: (1) commonsense prompting, (2) dialogue strategy prediction, and (3) FUDGE-controlled response generation.
Commonsense prompting: The authors use COMET-BART, trained on COMET-ATOMIC 2020, to infer social commonsense entailments for each utterance in the dialogue history across 10 social relation types (xEffect, xReact, xIntent, xNeed, xReason, xWant, oEffect, oReact, oWant, and xAttr). Each predicted relation–entailment tuple (ri,j, ei,j) is verbalized into natural language using templates (e.g., ([xReact], depressed) → “As a result, PersonX feels depressed.”), following Hosseini et al. (2022). The verbalized sentences are concatenated to the dialogue history as additional prompt context for the generation model.
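The verbalization step amounts to a template lookup per relation type. A minimal sketch follows; only the xReact template is taken from the paper's example, and the remaining templates are illustrative stand-ins, not the authors' actual template set:

```python
# Sketch of template-based verbalization of COMET relation tuples.
# Only the xReact template mirrors the paper's example; the others are
# hypothetical placeholders in the same style.
TEMPLATES = {
    "xReact": "As a result, PersonX feels {}.",
    "xIntent": "Because PersonX wanted {}.",
    "xWant": "As a result, PersonX wants {}.",
    "xNeed": "Before that, PersonX needed {}.",
    "xEffect": "As a result, PersonX {}.",
    "xAttr": "PersonX is seen as {}.",
    "xReason": "Because {}.",
    "oReact": "As a result, others feel {}.",
    "oWant": "As a result, others want {}.",
    "oEffect": "As a result, others {}.",
}

def verbalize(relation: str, entailment: str) -> str:
    """Turn a (relation, entailment) tuple into a natural-language sentence."""
    return TEMPLATES[relation].format(entailment.strip())

def build_prompt(history: list, tuples: list) -> str:
    """Concatenate verbalized commonsense sentences to the dialogue history,
    forming the augmented prompt fed to the generation model."""
    commonsense = " ".join(verbalize(r, e) for r, e in tuples)
    return " ".join(history) + " " + commonsense
```

For example, `build_prompt(["I lost my job."], [("xReact", "depressed")])` appends “As a result, PersonX feels depressed.” to the dialogue context.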
Dialogue strategy predictor: The system conditions generation on a dialogue strategy (e.g., providing suggestions, self-disclosure, affirmation and reassurance). Two methods are explored: (a) a separate text classification model trained end-to-end on dialogue histories to predict the next strategy; (b) a joint model that shares the encoder of the generation LM (BART/COMET-BART) and adds a classification head, trained with a combined loss of generation (LM) and strategy prediction, weighting the strategy loss by a factor α. The encoder [CLS] representation serves as the context vector for strategy classification in the joint setup.
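The joint objective described above can be sketched as L = L_gen + α·L_strategy. The toy cross-entropy and the α value below are illustrative, not taken from the paper:

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example (numerically stable toy version)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]

def joint_loss(gen_loss, strategy_logits, strategy_label, alpha=0.5):
    """Combined objective of the joint model: L = L_gen + alpha * L_strategy.
    alpha weights the strategy-prediction loss; its value here is illustrative."""
    return gen_loss + alpha * cross_entropy(strategy_logits, strategy_label)
```

When the strategy head is confident and correct, the strategy term vanishes and the combined loss reduces to the generation loss alone.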
FUDGE future discriminator: To enforce strategy control during decoding, a future discriminator is trained to predict the strategy label from any prefix of a target utterance. Concretely, for target sequence X = [x1, …, xn], an LSTM classifier is trained to predict the ground-truth strategy from each prefix [x1, …, xt], optimizing the sum of log-likelihoods over t=1..n. At inference, the discriminator scores partial generations and reweights token probabilities from the LM to bias decoding toward the desired strategy category.
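At each decoding step, FUDGE scores a candidate token by adding the LM's log-probability for the token to the discriminator's log-probability that the desired strategy holds once the token is appended. A toy sketch of this reranking, where the discriminator is a stand-in callable rather than a trained LSTM:

```python
def fudge_rerank(lm_logprobs, disc_logprob_fn, prefix, strategy, top_k=3):
    """Pick the next token by FUDGE-style reranking.

    lm_logprobs:     dict mapping token -> log P_LM(token | prefix)
    disc_logprob_fn: callable(extended_prefix, strategy) -> log P(strategy | extended_prefix)
                     (a stand-in for the trained future discriminator)
    """
    # Keep only the LM's top-k candidates, as FUDGE does for efficiency.
    candidates = sorted(lm_logprobs, key=lm_logprobs.get, reverse=True)[:top_k]
    scored = {
        tok: lm_logprobs[tok] + disc_logprob_fn(prefix + [tok], strategy)
        for tok in candidates
    }
    return max(scored, key=scored.get)
```

With a discriminator that strongly favors one continuation, the reranked choice can override the LM's top token, which is how decoding is biased toward the target strategy.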
Experimental setup: Dataset: ESConv (1,053 multi-turn dialogues; 31,410 utterances) containing help-seeker/provider conversations with annotated causes and dialogue strategies. Models: BART-large (baseline) and COMET-BART (pretrained for commonsense reasoning), both fine-tuned for response generation with or without commonsense prompts. Control: FUDGE is used with oracle strategies and with predicted strategies (either separate classifier or joint classifier). Evaluation metrics: BLEU (B-2, B-4), ROUGE-L, and BERTScore.
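For reference, BLEU-2 follows the standard definition: a brevity penalty times the geometric mean of unigram and bigram modified precision. A single-reference, sentence-level sketch (the paper presumably computes corpus-level scores with a standard library):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, ref, n):
    """Clip each hypothesis n-gram count by its count in the reference."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return clipped / max(sum(hyp_counts.values()), 1)

def bleu2(hyp, ref):
    """Sentence-level BLEU-2 against a single reference:
    brevity penalty * geometric mean of 1-gram and 2-gram precision."""
    p1 = modified_precision(hyp, ref, 1)
    p2 = modified_precision(hyp, ref, 2)
    if p1 == 0 or p2 == 0:
        return 0.0
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(0.5 * (math.log(p1) + math.log(p2)))
```

An exact match scores 1.0; a hypothesis shorter than the reference is discounted by the brevity penalty even when every n-gram matches.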
- Commonsense prompting improves generation: Augmenting dialogue history with verbalized COMET-ATOMIC 2020 social commonsense improves automatic metrics for both BART and COMET-BART. COMET-BART outperforms vanilla BART with or without explicit commonsense prompts, suggesting benefits from its commonsense pretraining and better comprehension of prompted knowledge.
- FUDGE control improves performance when given correct strategies: With oracle strategies, FUDGE-controlled decoding improves metrics over plain LM decoding. Reported oracle results (Table 4):

  | Model | B-2 | B-4 | ROUGE-L | BERTScore |
  |---|---|---|---|---|
  | BART + oracle | 5.02 | 1.32 | 18.34 | 90.59 |
  | COMET-BART + oracle | 3.83 | 0.75 | 12.04 | 90.34 |
  | BART + oracle + FUDGE | 7.46 | 2.01 | 21.18 | 91.11 |
  | COMET-BART + oracle + FUDGE | 5.76 | 1.78 | 20.82 | 90.66 |

  These gains indicate effective enforcement of strategy-conditioned responses.
- Strategy prediction is a bottleneck: Using a separate strategy classifier yielded poorer strategy accuracy and degraded generation performance compared to oracle control, indicating that end-to-end controllability hinges on reliable strategy prediction.
- Joint training can help vanilla BART: Finetuning BART with a combined generation+strategy loss improved generation (without FUDGE) despite decreased strategy accuracy, suggesting better adaptability for producing strategy-specific responses. This effect was not observed with COMET-BART.
The results support the central hypothesis that integrating social commonsense into PLMs and enforcing strategy-level control improve empathetic response generation. Commonsense prompts help models infer and reflect users’ emotions and likely intents, enhancing empathy-aligned content. FUDGE effectively steers decoding toward targeted strategies, improving alignment and fluency when provided with correct strategy labels. However, practical deployment requires accurate strategy prediction; otherwise, control may misguide generation. The joint training observation for BART indicates a trade-off: weaker classifier accuracy but better generation quality, implying that sharing representations and jointly optimizing can make the generator more receptive to strategy cues even if explicit classification is imperfect. COMET-BART’s consistent advantage suggests that pretraining on commonsense reasoning equips models to better utilize knowledge prompts. Overall, the study clarifies that knowledge prompting and plug-and-play control are complementary: prompts inform content, while FUDGE enforces stylistic/strategic conformity.
The paper presents a framework for controllable empathetic dialogue generation that combines commonsense-aware prompting (via verbalized COMET-ATOMIC 2020 knowledge) with strategy-conditioned decoding using a future discriminator (FUDGE). Experiments on ESConv show that: (1) commonsense prompting improves response quality; (2) FUDGE enhances controllability and performance when strategy labels are accurate; and (3) joint training of generation and strategy prediction can benefit vanilla BART. The approach is plug-and-play and applicable to off-the-shelf generative LMs, with or without finetuning. Future work includes improving dialogue strategy prediction, exploring richer prompting methods beyond template verbalization, and further evaluating human-perceived empathy and support quality.
- Strategy prediction accuracy is limited when using a separate classifier, which degrades end-to-end controlled generation performance; the method performs best with oracle strategies.
- Results rely on a single dataset (ESConv); generalizability to other domains and dialogue settings is not established.
- Evaluation uses automatic metrics (BLEU, ROUGE-L, BERTScore); human evaluations of empathy and appropriateness are not reported.
- FUDGE introduces additional training and inference components (a future discriminator) and its effectiveness depends on the accuracy of the targeted attribute (strategy) and classifier calibration.
- Commonsense prompting uses fixed templates and a selected set of 10 social relations; other relations, prompting strategies, or template quality may affect performance.