Linguistics and Languages
Sound symbolic associations: evidence from visual, tactile, and interpersonal iconic perception of Mandarin rimes
Y. Li and X. Jiang
The study investigates how Mandarin rimes map onto perceptual and social dimensions (visual, tactile, interpersonal) via sound symbolism, which posits non-arbitrary relations between phonemes and meanings. It addresses two core questions: (1) Which acoustic cues (including segmental and suprasegmental) are crucial for visual (shape, size, brightness, thickness), tactile (roughness, weight, temperature, hardness), and interpersonal (politeness, friendliness, authoritativeness, indifference) iconic perceptions? (2) What mechanisms underlie these associations, and how do low-level perceptual mappings interact with high-level interpersonal judgments (e.g., via mediation/suppression)? The work emphasizes underexplored features in Mandarin—formant transitions in diphthongs and nasal codas—and controls for listener sex effects in social evaluation. It aims to compare cue importance across modalities and test theoretical accounts such as language pattern and shared-property (e.g., Transitivity Proposal).
Prior research has shown robust sound-symbolism effects such as bouba/kiki, often described with phonological features (front–back, height) but under-specifying the acoustic parameters that listeners perceive. Acoustic features like formants and pitch are fundamental to hearing and potentially more cross-linguistically general. For visual mappings, F1/F2 relate to perceived size and brightness, F2/F3 to shape, and pitch to size. Tactile mappings are less consistent: some studies link round/curve-associated sounds to smoothness, others link central vowels or fricatives to roughness, and duration and loudness can modulate tactile imagery (e.g., long vowels increase imageability; high pitch is associated with dryness, roughness, hardness, and lightness; quieter sounds with smoothness, softness, and lightness).
Emotional and interpersonal dimensions also show sound-symbolic links: articulation can embody valence (e.g., /i/ and smiling), and higher-order emotional factors (valence, activity, potency) organize many associations. Social attitudes have been studied mainly via prosody rather than segmental phonemes, though phonemes can index intentions or personality traits.
Mechanistically, cross-modal correspondences may arise from statistical regularities in the environment, language patterns (co-occurrence within a linguistic community), or shared properties (perceptual/emotional mediators). The Emotion Mediation Hypothesis posits emotional factors as mediators; the Transitivity Proposal allows mediating dimensions (including low-level percepts) to produce A–C links via A–B and B–C associations. However, transitivity can appear inconsistent, potentially reflecting suppression effects in which indirect effects oppose direct effects, hinting at interacting mechanisms.
Participants: 40 native Mandarin-speaking college students (21 female; mean age 21.15, SD 1.81). Dialect exposure varied, but all received extensive Mandarin education; mean self-rated proficiency: hearing 8.4, speaking 8.0, reading 8.3, writing 7.8 (10-point scale). Five had vocal training (mean 11.2 months); 15 had public speaking/hosting/broadcasting training (mean 3.67 months). Headphones were required. Ethics approval obtained and informed consent collected.
Stimuli and recording: Two native male Mandarin speakers (ages 38 and 25; different dialect backgrounds but native Mandarin) recorded 35 rimes (simple, compound, and nasal rimes) in level tone (tone 1) at 44.1 kHz in Praat, produced neutrally. Each rime was repeated twice and concatenated to form an item (e.g., /a/-/a/), yielding 70 audio stimuli per questionnaire. The rime set included monophthongs, diphthongs, and nasal rimes with alveolar (/n/) or velar (/ŋ/) codas (triphthongs were later excluded from analyses). Intensity was scaled to ~70 dB.
Procedure: Three online questionnaires (QuestionPro) assessed: (1) visual dimensions (spiky–round, small–large, bright–dark, thin–thick) with reference images and labels; (2) tactile dimensions (smooth–rough, light–heavy, cold–hot, hard–soft) with images and labels; (3) interpersonal attitudes (polite–rude, friendly–hostile, encouraging–authoritative, passionate–indifferent) with validated textual scenarios/labels (pretest compatibility ratings all >4/5). Participants were instructed to focus on auditory features. Order: visual → tactile → interpersonal to reduce cognitive load.
Acoustic feature extraction: Steady-state vowel portions were annotated, and mean F0, F1–F4, and duration were extracted in Praat. For diphthongs, features were extracted separately for each component, and formant transitions ΔFi were computed (following minus preceding component; ΔF1 was later removed due to collinearity). For nasal rimes, formants were computed on the vocalic portion, and nasality was coded categorically (zero nasal, alveolar nasal, velar nasal).
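The transition computation can be sketched in a few lines of Python. This is illustrative only: the function name and the formant values are hypothetical, not the study's measurements.

```python
# Sketch: diphthong formant transitions (ΔFi = following minus preceding
# component), as in the feature-extraction step. Values are illustrative.

def formant_transitions(preceding, following):
    """Return dF2-dF4 (Hz) given per-component formant dicts.

    dF1 is computed like the others but dropped, mirroring the paper's
    removal of ΔF1 to reduce collinearity.
    """
    deltas = {f"dF{i}": following[f"F{i}"] - preceding[f"F{i}"]
              for i in range(1, 5)}
    deltas.pop("dF1")  # removed to reduce collinearity
    return deltas

# e.g. /ai/: low-vowel start (high F1, mid F2) -> high front end (low F1, high F2)
ai = formant_transitions(
    preceding={"F1": 800, "F2": 1300, "F3": 2500, "F4": 3500},
    following={"F1": 350, "F2": 2200, "F3": 2900, "F4": 3700},
)
```

A large positive dF2, as here, corresponds to the kind of front-rising transition that the LMEM results associate with spikier, rougher percepts.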
Data analysis: Triphthong trials were excluded. Analyses were run separately for monophthong sets (with and without nasal codas) and diphthong sets (with and without nasal codas). Linear mixed-effects models (LMEMs; lmerTest) predicted ratings with fixed effects F0, F1–F4, duration, and nasality (plus ΔF2–ΔF4 for diphthongs), with listener sex as a covariate and random intercepts for participant and item. Predictors were rescaled (F0 and F1–F4 divided by 1000; duration in seconds). VIFs were checked, and ΔF1 was removed to reduce collinearity. Twelve LMEMs were run per set (four dimensions each for visual, tactile, and interpersonal ratings), with Benjamini–Hochberg FDR correction applied across all 252 p-values (FDR thresholds 0.05/0.01).
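The FDR step can be illustrated with a minimal pure-Python Benjamini–Hochberg implementation (a sketch; the study used R tooling, and the function name and example p-values here are made up):

```python
# Sketch of Benjamini-Hochberg FDR correction, as applied across the
# 252 LMEM p-values at FDR levels 0.05 and 0.01.

def bh_reject(pvals, q=0.05):
    """Return a boolean list: which p-values are significant at FDR q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= (k/m) * q ...
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank
    # ... then reject all hypotheses up to and including rank k.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject
```

Note the step-up logic: a p-value can be rejected even if it misses its own threshold, provided a larger p-value below it in rank passes.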
Machine learning (XGBoost): Ratings were recoded to binary ({-2,-1}→0, {1,2}→1; 0 excluded), with the same acoustic predictors as features (excluding listener sex). Models used 5-fold cross-validation repeated 10 times; chance baselines (50.8–66%) reflected class imbalance. One-sample t-tests (BH-corrected) evaluated above-chance accuracy. Cross-dimension generalization was tested by training on one dimension/aspect and testing on others (details in Supplement).
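The recoding and the imbalance-driven chance baseline can be sketched as follows (the ratings shown are invented examples, not the study's data):

```python
# Sketch: binarize 5-point ratings for classification ({-2,-1}->0, {1,2}->1,
# neutral 0 dropped) and derive the chance baseline from class imbalance.

def recode(ratings):
    """Drop neutral ratings and map negatives to 0, positives to 1."""
    return [0 if r < 0 else 1 for r in ratings if r != 0]

def chance_baseline(labels):
    """Majority-class proportion: accuracy of always guessing the mode."""
    p1 = sum(labels) / len(labels)
    return max(p1, 1 - p1)

labels = recode([-2, -1, 0, 1, 2, 2, -1, 0])   # -> [0, 0, 1, 1, 1, 0]
```

Comparing classifier accuracy against this majority-class baseline (rather than a flat 50%) is what yields the 50.8–66% chance levels reported for the imbalanced dimensions.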
Mediation analysis: Tested interactions among dimensions using lavaan and mediation packages. Focused on shared predictor nasality and its relation to light–heavy (tactile) and interpersonal (polite–rude, friendly–hostile) for diphthongs. Nasality recoded numerically: zero=-1, alveolar=0, velar=1. Duration included for light–heavy per LMEMs. Compared two models for each pair (weight↔attitude): (a) nasality→weight (mediator)→attitude (with direct path), versus (b) nasality→attitude (mediator)→weight; model fit via AIC/BIC. Additional visual–tactile mediations reported in Supplement.
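The suppression pattern being tested (indirect and direct effects with opposite signs) can be sketched with a toy product-of-coefficients mediation on synthetic data. All coefficients below are invented to mimic the reported sign pattern, not the study's estimates, and plain OLS stands in for the lavaan/mediation machinery.

```python
# Sketch: suppression in mediation. Nasality (X) raises perceived weight (M),
# heaviness raises rudeness (b > 0), but nasality directly lowers rudeness
# (c' < 0), so the indirect effect (ACME) opposes the direct effect (ADE).
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.choice([-1.0, 0.0, 1.0], size=n)       # zero / alveolar / velar coding
M = 0.4 * X + rng.normal(0, 1, n)              # nasality -> weight (a > 0)
Y = -0.3 * X + 0.2 * M + rng.normal(0, 1, n)   # direct path negative, b positive

def ols(y, *cols):
    """Least-squares slopes (intercept dropped)."""
    A = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

(a,) = ols(M, X)             # X -> M
cprime, b = ols(Y, X, M)     # X -> Y, controlling for M
acme = a * b                 # indirect effect (positive: heavier -> ruder)
ade = cprime                 # direct effect (negative: nasals -> politer)
total = acme + ade           # total effect: the two paths partially cancel
```

In this linear setting the total effect decomposes additively, matching the reported pattern where Total = ACME + ADE (e.g., -0.1865 = 0.0274 + (-0.2139) for weight–politeness).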
LMEMs (monophthongs):
- Visual mappings: spiky–round depended on F2 (β≈-0.944, p<0.001), with lower F2 (backer quality) perceived rounder. Size depended on F1 (β≈1.590, p<0.001) and duration (β≈3.156, p<0.001): lower vowels/longer duration perceived larger. Brightness depended on F1 (β≈-1.763, p<0.001) and F2 (β≈-0.601, p<0.001): front and low vowels perceived brighter.
- Tactile mappings: temperature (cold–hot) influenced by F2 (β≈-0.371, p<0.001): front vowels associated with coldness. Weight (light–heavy) influenced by nasality: velar nasal heavier than zero nasal (β≈0.595, p<0.001); alveolar vs velar contrasted with alveolar lighter than velar (t=-3.52, p<0.001).
- Interpersonal mappings: Listener sex affected polite–rude (β≈-0.252, p=0.003): female listeners rated monophthongs as politer. Other acoustic predictors largely non-significant.
LMEMs (diphthongs):
- Visual mappings: spiky–round influenced by initial F2 (β≈-0.867, p<0.001) and ΔF2 (β≈-0.929, p<0.001): lower initial F2 and smaller F2 transition associated with roundness (e.g., /ua/ rounder than /ia/; /au/ rounder than /ai/).
- Tactile mappings: smooth–rough affected by F2 (β≈0.439, p<0.001), ΔF2 (β≈0.395, p=0.002), and duration (β≈2.066, p=0.003): backer starts, smaller transitions, and shorter duration perceived smoother. Weight increased with velar nasals (β≈0.355, p<0.001) and longer duration (β≈2.509, p<0.001).
- Interpersonal mappings: Nasality predicted politeness and friendliness: compared to zero nasal, alveolar nasals perceived politer (β≈-0.709, p<0.001) and friendlier (β≈-0.447, p<0.001); velar nasals also politer (β≈-0.563, p=0.001). Alveolar vs velar not significantly different (t=-0.39, p=0.69).
Machine learning:
- Within-dimension: 23/24 models achieved above-chance accuracy (chance 50.8–66%); exception: smooth–rough for diphthongs.
- Cross-dimension: Visual models (monophthongs) predicted weight; some (size, brightness) predicted interpersonal (politeness and/or authoritativeness). Tactile models generalized best: light–heavy and cold–hot predicted 3 of 4 visual dimensions; all tactile dimensions predicted at least one interpersonal dimension, with smooth–rough and light–heavy predicting 3 of 4 interpersonal dimensions. Interpersonal-trained models predicted some visual and tactile dimensions, with stronger touch–attitude links (especially light–heavy ↔ interpersonal) than vision–attitude links.
Mediation (diphthongs):
- Model comparisons favored weight-as-mediator models: Weight–politeness (a) AIC=8077.27, BIC=8108.56 vs (b) AIC=8113.28, BIC=8144.57; Weight–friendliness (a) AIC=7893.11, BIC=7924.40 vs (b) AIC=7931.95, BIC=7963.24.
- Significant suppression effects: For weight–politeness (a), ACME=0.0274 (p<2e-16), ADE=-0.2139 (p<2e-16), Total=-0.1865 (p<2e-16), indicating nasality’s indirect path via heaviness increased perceived rudeness (positive ACME) while the direct path increased perceived politeness (negative ADE). For weight–friendliness (a), ACME=0.0292 (p<2e-16), ADE=-0.0891 (p=0.008), Total=-0.0598 (p=0.080), a similar suppression pattern. This supports interacting mechanisms: a shared-property (tactile weight) pathway counteracting a direct nasality–attitude mapping.
Findings address the core questions by isolating acoustic cues for multi-modal sound symbolism in Mandarin and revealing how low-level tactile perception interacts with high-level interpersonal judgments. Visual judgments relied primarily on basic vowel formants (F1, F2), while tactile judgments additionally depended on duration and nasality. Interpersonal judgments showed limited sensitivity to acoustic variation, with nasality and listener sex being the main factors. Diphthong-specific formant transitions (ΔF2) significantly contributed to roundness and smoothness, highlighting the importance of dynamic spectral cues in Mandarin.
Machine learning validated that a compact acoustic feature set captures most human judgments across modalities and revealed asymmetric generalization: tactile dimensions (especially weight) generalized best to interpersonal dimensions, whereas visual dimensions had more limited generalization to attitudes. Mediation analyses demonstrated suppression: the light–heavy pathway partially countered direct nasality-to-attitude effects. This supports the coexistence and interaction of mechanisms—shared-property/transitivity (low-level mediator) and language pattern/social inferencing (direct path). The results extend theoretical accounts by showing that cross-modal links can interfere as well as reinforce, refining the Transitivity Proposal for sound symbolism and emphasizing the role of dynamic formant cues and nasality in Mandarin rimes.
This study simultaneously compared visual, tactile, and interpersonal sound-symbolic associations for Mandarin rimes and identified underexplored acoustic determinants. Beyond F1/F2, diphthong formant transitions (ΔF2) and nasal codas robustly shaped iconic perception: ΔF2 predicted roundness and smoothness; nasals increased perceived heaviness and enhanced perceived politeness and friendliness. Machine learning models achieved above-chance classification within and across dimensions, underscoring the sufficiency of acoustic cues. Mediation analyses revealed that perceived weight acts as a suppressor between nasality and politeness/friendliness, providing evidence that shared-property mechanisms can interfere with direct mappings, suggesting coexistence and interaction among mechanisms (e.g., transitivity and language pattern accounts). Future work should manipulate prosody and individual acoustic cues, include more diverse speakers and stimuli, test ordering effects, incorporate intensity, and explore dialectal influences to further generalize and refine these findings.
- The suppression finding rests on model comparisons and should be interpreted cautiously; a follow-up study using a similar questionnaire order did not replicate the suppression, warranting tests with reversed ordering.
- Stimuli were emotionally neutral; emotional prosody could change mediation roles (e.g., affective factors as mediators/suppressors).
- Natural productions entail co-varying acoustic parameters; although LMEMs control for confounds, future work should use controlled manipulations (e.g., isolate F1) and analyze intensity (scaled out here) given its known effects.
- Only two male speakers recorded stimuli; broader speaker sets (sex-balanced) are needed to assess generalizability.
- Homogeneous rime-based materials may bias responses; adding consonantal fillers could increase ecological variability.
- Participants had varied dialect exposure; systematic comparisons across dialect groups and modeling dialectal acoustic distances could clarify dialect effects.