Personalized quantification of facial normality: a machine learning approach
O. Boyaci, E. Serpedin, et al.
The study addresses the problem of defining and quantifying individualized facial normality for patients with congenital or acquired facial deformities. Traditional assessments (expert ratings, anthropometry, stereophotogrammetry, crowdsourcing, PROMs, eye-tracking) are not benchmarked to a patient’s own theoretical norm and are difficult to apply clinically. Population averages are ill-suited due to individual variability (gender, age, race) and admixture. The authors aim to (1) place an individual’s facial appearance numerically along their own continuum of normality and (2) objectively quantify changes effected by reconstructive interventions. They propose a machine learning framework that generates a patient-specific normalized analogue of a given face and measures distances between the raw and normalized images to predict human judgments of normality.
Prior work has explored determinants of facial beauty (proportion, symmetry, averageness), emotion communication and perception, personality inference, morphing, regions of attraction, machine recognition, and population norms. Existing clinical assessment methods (expert ratings, landmark measurements, stereophotogrammetry, crowd surveys, PROMs, eye-tracking) do not provide a personalized norm reference. Previous uses of GANs in ophthalmology include unsupervised anomaly detection in OCT and synthesis of postoperative appearance in thyroid-associated ophthalmopathy; however, these did not perform realistic facial normalization or quantify pre–post differences, and they suffered from limited data and low realism. Conventional "face normalization" in recognition focuses on alignment, illumination, and pose, not on removing deformity. No prior system realistically normalizes faces by eliminating deformity and then measures the variance between raw and normalized images for objective assessment.
Overview: Four-step pipeline: (1) image preprocessing, (2) image normalization, (3) feature extraction in image and latent spaces, and (4) prediction of human normality scores using regression.
Data and human ratings: Two datasets were used. Private training data: 150 images (50 raw deformity, 50 normalized outputs, 50 random StyleGAN faces) rated on a 1–7 Likert scale (1 most deformed, 7 most normal) by 80 raters (mean age 33.96, 60% male). Public test data: 60 images (30 raw, 30 random) rated by 50 raters (mean age 27.5, 46% male). For analyses of pre–post change sensitivity, 25 pre/post pairs were also examined.
Preprocessing: Faces were detected with dlib and 68-point facial landmarks were extracted. Images were centered, cropped, and aligned using the eye-to-eye and eye-to-mouth vectors, then masked by filling the convex polygon of the outer facial landmarks (OpenCV). The mask border was enlarged by 50 pixels. Image resolution was set to 1024×1024×3 to match StyleGAN output.
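As a rough illustration of the alignment step, the Python sketch below derives the eye-to-eye and eye-to-mouth vectors from a 68-point landmark array. It assumes dlib's usual 0-indexed layout (points 36-41 and 42-47 for the two eyes, 48-67 for the mouth); the function name `alignment_vectors` and the synthetic landmarks are hypothetical, and the summary does not specify how the authors turned these vectors into a crop.

```python
import numpy as np

# Assumed dlib 68-point layout (0-indexed): two eye groups and the mouth group.
EYE_A = slice(36, 42)
EYE_B = slice(42, 48)
MOUTH = slice(48, 68)

def alignment_vectors(landmarks):
    """Return eye midpoint, eye-to-eye vector, eye-to-mouth vector, roll angle (deg)."""
    pts = np.asarray(landmarks, dtype=float)
    if pts.shape != (68, 2):
        raise ValueError("expected a (68, 2) array of (x, y) landmarks")
    eye_a = pts[EYE_A].mean(axis=0)          # centre of the first eye
    eye_b = pts[EYE_B].mean(axis=0)          # centre of the second eye
    mouth = pts[MOUTH].mean(axis=0)          # centre of the mouth points
    eye_mid = (eye_a + eye_b) / 2.0
    eye_to_eye = eye_b - eye_a
    eye_to_mouth = mouth - eye_mid
    # In-plane rotation needed to make the eye line horizontal.
    roll_deg = float(np.degrees(np.arctan2(eye_to_eye[1], eye_to_eye[0])))
    return eye_mid, eye_to_eye, eye_to_mouth, roll_deg

# Synthetic, already upright face: eyes level, mouth straight below the eye midpoint.
demo = np.zeros((68, 2))
demo[EYE_A] = (100.0, 120.0)
demo[EYE_B] = (200.0, 120.0)
demo[MOUTH] = (150.0, 220.0)
eye_mid, eye_to_eye, eye_to_mouth, roll_deg = alignment_vectors(demo)
```

For an upright face the roll angle is zero and the eye-to-mouth vector points straight down the image; a tilted face would yield a nonzero angle that a rotation could undo before cropping.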
Normalization: Defined as a two-objective optimization in StyleGAN latent space W to generate a normalized image x_norm that removes anomalies while preserving identity. Objective: minimize λ_sim L_S(x_raw, G(w)) + λ_avg L_A(w, w_avg), where L_S (similarity) preserves unique facial features, and L_A (averageness) pulls w toward the population average w_avg. The generator G is StyleGAN; w ∈ R^{1×512} is tiled across 18 layers for generation. Tested similarity losses L_S: pixel (L_pix), structural (L_str), perceptual (L_perc; LPIPS). Tested averageness losses L_A: mean absolute error (L_mae), mean squared error (L_mse), mean exponential error (L_mee). Empirically, L_perc combined with L_mae produced the most realistic normalized images, preserving gender, age, and pose. Hyperparameters: λ_sim=10, λ_avg=100; initialization w=w_avg; iterations n=500 (no improvement beyond this); w components constrained to [−1,1], with out-of-range values reset uniformly within the range to stay on the normal manifold. The best w* over iterations is returned; x_norm=G(w*). Iterative progression demonstrates convergence from setting coarse attributes (skin tone, pose) to aligning age/gender and finally refining distinctive details.
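The structure of this optimization can be sketched on a toy problem. The Python example below is not the authors' implementation: it replaces StyleGAN with a hypothetical random linear map `A`, uses pixel-wise MSE as a stand-in for L_perc, and picks an illustrative learning rate. Only the weights λ_sim=10 and λ_avg=100, the start at w_avg, the 500 iterations, the reset of out-of-range components, the best-w bookkeeping, and the 18-layer tiling follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, PIXELS, LAYERS = 512, 64, 18
A = rng.standard_normal((PIXELS, DIM))   # hypothetical linear "generator", not StyleGAN

def generate(w):
    """Toy stand-in for G(w)."""
    return A @ w

def objective(w, x_raw, w_avg, lam_sim=10.0, lam_avg=100.0):
    l_sim = np.mean((generate(w) - x_raw) ** 2)   # stand-in for L_perc (LPIPS)
    l_avg = np.mean(np.abs(w - w_avg))            # L_mae averageness term
    return lam_sim * l_sim + lam_avg * l_avg

def normalize(x_raw, w_avg, n_iter=500, lr=2e-3, lam_sim=10.0, lam_avg=100.0):
    w = w_avg.copy()                              # initialization w = w_avg
    best_w, best_loss = w.copy(), objective(w, x_raw, w_avg, lam_sim, lam_avg)
    for _ in range(n_iter):
        residual = generate(w) - x_raw
        grad = (lam_sim * 2.0 / PIXELS) * (A.T @ residual) \
             + (lam_avg / DIM) * np.sign(w - w_avg)   # subgradient of the MAE term
        w = w - lr * grad
        out = np.abs(w) > 1.0                     # components outside [-1, 1] ...
        w[out] = rng.uniform(-1.0, 1.0, size=int(out.sum()))  # ... are redrawn uniformly
        loss = objective(w, x_raw, w_avg, lam_sim, lam_avg)
        if loss < best_loss:                      # keep the best w seen so far
            best_w, best_loss = w.copy(), loss
    return best_w, best_loss

# Synthetic "raw face": a latent that deviates from the average.
w_avg = rng.uniform(-0.2, 0.2, DIM)
w_true = np.clip(w_avg + 0.3 * rng.standard_normal(DIM), -1.0, 1.0)
x_raw = generate(w_true)
initial_loss = objective(w_avg, x_raw, w_avg)
w_star, best_loss = normalize(x_raw, w_avg)
w_tiled = np.tile(w_star, (LAYERS, 1))            # same w repeated for all 18 layers
```

Raising `lam_avg` relative to `lam_sim` keeps more components pinned at w_avg, while lowering it lets the result track the raw image more closely; this is the trade-off the two weights control.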
Latent representation of raw face w_raw: To compute distances in latent space, w_raw was obtained by repeating the optimization with λ_avg=0 (using only L_S) for each raw image.
Feature extraction: In image space X, distances between x_raw and x_norm were computed using LPIPS (d_xperc), a dissimilarity derived from multiscale structural similarity (d_xstr = (1−MSSIM)/2), PSNR (d_xpsnr), and pixel-wise metrics (MAE, MSE, RMSE, log-cosh). In latent space W, distances between w_raw and w_norm were computed using Bray–Curtis (d_wbc), Canberra, Chebyshev, Manhattan, correlation (d_wcorr), cosine, Euclidean, Mahalanobis, and Minkowski.
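Several of these distances are simple enough to write out directly. The following Python sketch computes the pixel-wise metrics, PSNR, the (1−MSSIM)/2 mapping, and a subset of the latent distances with NumPy. The function names and toy inputs are hypothetical, images are assumed to be scaled to [0, 1], and LPIPS and MSSIM themselves are not implemented here (the MSSIM value is taken as an input).

```python
import numpy as np

def image_distances(x_raw, x_norm, max_val=1.0):
    """Pixel-wise distances between two images of equal shape."""
    diff = np.asarray(x_raw, dtype=float) - np.asarray(x_norm, dtype=float)
    mse = float(np.mean(diff ** 2))
    return {
        "mae": float(np.mean(np.abs(diff))),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "logcosh": float(np.mean(np.log(np.cosh(diff)))),
        # PSNR is undefined for identical images; report infinity in that case.
        "psnr": float("inf") if mse == 0 else float(10 * np.log10(max_val ** 2 / mse)),
    }

def structural_dissimilarity(mssim):
    """Map an MSSIM value in [-1, 1] to a dissimilarity in [0, 1]."""
    return (1.0 - mssim) / 2.0

def latent_distances(w_raw, w_norm):
    """A subset of the latent-space distances between two vectors."""
    u = np.asarray(w_raw, dtype=float)
    v = np.asarray(w_norm, dtype=float)
    uc, vc = u - u.mean(), v - v.mean()
    return {
        "braycurtis": float(np.abs(u - v).sum() / np.abs(u + v).sum()),
        "correlation": float(1.0 - uc @ vc / (np.linalg.norm(uc) * np.linalg.norm(vc))),
        "cosine": float(1.0 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v))),
        "manhattan": float(np.abs(u - v).sum()),
        "euclidean": float(np.linalg.norm(u - v)),
        "chebyshev": float(np.abs(u - v).max()),
    }

# Toy inputs: two flat images differing by 0.1, and two parallel latent vectors.
img = image_distances(np.full((4, 4, 3), 0.5), np.full((4, 4, 3), 0.4))
lat = latent_distances([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The toy latent pair shows why several distances are worth keeping side by side: the vectors are perfectly correlated, so the correlation and cosine distances are zero, while Bray–Curtis and the norm-based distances still register the difference in magnitude.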
Feature selection: Correlations among 16 features and with human ratings were analyzed. Extremely randomized trees (ERT; 1000 trees) estimated feature importance; a 5% threshold selected candidates. Selected: d_xperc, d_xstr, d_xpsnr, d_wbc, d_wcorr, plus (initially) d_wcan and d_wcos. Due to high collinearity (d_wcan with d_wbc, r=0.99; d_wcos with d_wcorr, r=0.98), d_wcan and d_wcos were removed, yielding 5 final features: d_xperc, d_xstr, d_xpsnr, d_wbc, d_wcorr.
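The two-stage selection (importance threshold, then collinearity pruning) can be illustrated as follows. This Python sketch takes importance scores as given rather than fitting extremely randomized trees. The greedy rule of keeping the more important member of a collinear pair, the 0.95 correlation cutoff, the feature names, and the data are all assumptions made for illustration; the summary does not state how the authors chose which member of each pair to drop.

```python
import numpy as np

def select_features(features, names, importance, min_importance=0.05, r_max=0.95):
    """Keep features above the importance threshold, then prune collinear ones."""
    corr = np.corrcoef(np.asarray(features, dtype=float), rowvar=False)
    # Stage 1: importance threshold (5% in the text above).
    candidates = [i for i in range(len(names)) if importance[i] >= min_importance]
    # Stage 2: visit candidates from most to least important and keep one only
    # if it is not too strongly correlated with anything already kept.
    kept = []
    for i in sorted(candidates, key=lambda i: -importance[i]):
        if all(abs(corr[i, j]) <= r_max for j in kept):
            kept.append(i)
    return [names[i] for i in sorted(kept)]

# Synthetic data: f_c is a near-copy of f_a, and f_d has negligible importance.
rng = np.random.default_rng(0)
f_a = rng.standard_normal(200)
f_b = rng.standard_normal(200)
f_c = f_a + 0.01 * rng.standard_normal(200)
f_d = rng.standard_normal(200)
X = np.column_stack([f_a, f_b, f_c, f_d])
selected = select_features(X, ["f_a", "f_b", "f_c", "f_d"], [0.40, 0.30, 0.28, 0.02])
```

Here `f_d` falls below the importance threshold and `f_c` is dropped as a near-duplicate of `f_a`, which mirrors how d_wcan and d_wcos were removed in favor of d_wbc and d_wcorr.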
Regression and optimization: Tested regressors: Linear, Huber, Support Vector Regressor (SVR), Ridge, Lasso, and Multi-Layer Perceptron (MLP). Hyperparameters optimized with TPE (hyperopt) and 5-fold cross-validation on training data; validation on held-out 20%. Final MLP: one hidden layer with 6 neurons, ELU activation, MSE loss, batch size 20, 300 epochs. Additional stabilization: generate K candidate x_norm via K-nearest neighbors in LPIPS space of normalization outputs; compute features for each and average across K. Empirically, K=5 maximized performance with diminishing returns beyond.
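To give a sense of how small the final regressor is, the sketch below builds a forward pass for a network with 5 inputs, one hidden layer of 6 ELU units, and a single output, and feeds it features averaged over K=5 candidates. It is an assumption-laden illustration in plain NumPy rather than the Keras model used in the study: the weights are random and untrained, the linear output unit is assumed, the class name `TinyMLP` is hypothetical, and the candidate features are synthetic.

```python
import numpy as np

def elu(z, alpha=1.0):
    """Exponential linear unit: z for z > 0, alpha * (exp(z) - 1) otherwise."""
    z = np.asarray(z, dtype=float)
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

class TinyMLP:
    """5 -> 6 (ELU) -> 1 network with random, untrained weights."""

    def __init__(self, n_in=5, n_hidden=6, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = 0.1 * rng.standard_normal((n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = 0.1 * rng.standard_normal((n_hidden, 1))
        self.b2 = np.zeros(1)

    def n_params(self):
        return sum(p.size for p in (self.w1, self.b1, self.w2, self.b2))

    def predict(self, x):
        hidden = elu(np.atleast_2d(x) @ self.w1 + self.b1)
        return (hidden @ self.w2 + self.b2).ravel()   # assumed linear output unit

# Synthetic features for K=5 normalized candidates of one face (5 features each).
rng = np.random.default_rng(1)
candidate_features = rng.random((5, 5))
averaged = candidate_features.mean(axis=0)            # average across the K candidates
model = TinyMLP()
score = model.predict(averaged)                       # meaningless until trained
```

With these layer sizes the network has 5×6 + 6 + 6×1 + 1 = 43 trainable parameters, which is consistent with describing it as a compact regressor.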
Implementation: Python 3.6; scikit-learn, SciPy, Keras, TensorFlow; run on Intel i9-8950HK CPU 2.90GHz with NVIDIA GeForce RTX 2070 GPU.
- The proposed system accurately predicts human ratings of facial normality. On the public test set, the MLP regressor achieved Pearson correlation R ≈ 0.90 and mean absolute error (MAE) ≈ 0.57 on a 1–7 scale. The Linear, Huber, SVR, Ridge, and Lasso models reached similar correlations (R of 0.87–0.90) but higher MAE.
- Averaging features across multiple normalized outputs improved performance up to K=5, after which gains diminished.
- Human rating variability (MAD) was lower at score extremes and higher in the mid-range: for training data, MAD 0.80 when mean score <3, 1.19 when 3<mean<5, and 0.68 when mean>5; similar pattern in test data (0.89, 1.25, 0.52, respectively). Machine errors (MAE) were more evenly distributed across the range on test data (~0.53–0.56).
- The normalization procedure visually corrected anomalies while preserving identity across diverse diagnoses (e.g., cleft lip, facial palsy, Treacher Collins, craniofacial dysostosis, neurofibromatosis, vascular anomalies, nasal deformity, jaw asymmetry).
- Compared to a prior anchor-based LPIPS approach, the personalized normalization approach showed much higher sensitivity to surgical changes across 25 pre–post pairs: mean detected difference 48.87% (vs 2.69%) and postoperative images labeled as improved 96% of the time (vs 76%).
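The evaluation quantities reported above (Pearson R, MAE, and the share of postoperative images scored as improved) are straightforward to compute. The Python sketch below uses small synthetic score lists, not data from the study, and the function names are hypothetical; treating "improved" as a strictly higher postoperative score is also an assumption.

```python
import numpy as np

def pearson_r(y_true, y_pred):
    """Pearson correlation between human and predicted scores."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def mean_absolute_error(y_true, y_pred):
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

def improved_rate(pre_scores, post_scores):
    """Fraction of pre/post pairs whose postoperative score is strictly higher."""
    pre = np.asarray(pre_scores, dtype=float)
    post = np.asarray(post_scores, dtype=float)
    return float(np.mean(post > pre))

# Synthetic ratings on the 1-7 scale: predictions are offset by a constant 0.5.
human = [1.5, 3.0, 4.5, 6.0]
predicted = [2.0, 3.5, 5.0, 6.5]
r = pearson_r(human, predicted)
mae = mean_absolute_error(human, predicted)
rate = improved_rate([2.0, 3.0, 4.0, 5.0], [3.0, 2.5, 5.0, 6.0])
```

The toy example has a perfect correlation but an MAE of 0.5, which shows why the two metrics are reported together: regressors with nearly identical R can still differ in how far their predictions sit from the human scores.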
By generating a personalized normalized counterpart for each face, the model enables direct measurement of the deviation from an individual’s own theoretical norm, aligning closely with how clinicians assess faces. This approach overcomes limitations of population averages and prior anchor-based comparisons that dilute localized anomalies with global differences (age, gender, race). Feature distances computed between raw and normalized images in both image and latent spaces capture clinically meaningful deviations, and the MLP regressor integrates them to accurately approximate human judgments. The framework mirrors the surgeon’s cognitive process—detect anomalies, assess severity, envision correction—achieving robust performance across a wide spectrum of facial presentations. The results suggest utility for objective surgical planning, patient education, and outcome measurement, addressing gaps in current subjective and non-personalized methods.
The study introduces the first machine learning system that realistically normalizes individual facial images by removing anomalous features while preserving identity, and quantifies deviation from normality to predict human ratings. The method combines StyleGAN-based latent optimization with perceptual and structural similarity measures, and a compact MLP regressor using five key features. It achieves high agreement with human scoring (R≈0.90, MAE≈0.57) and demonstrates strong sensitivity to pre–post surgical changes. Future work includes deployment in clinical workflows and on handheld devices (e.g., smartphones) to provide an objective, accessible tool for surgical planning, patient counseling, and outcome evaluation. Further research may expand datasets, refine normalization stability, and explore debiasing strategies for human rating-derived labels.
- Input image requirements: sufficient resolution (generally >256×256) and proper orientation; the face detector relies on clear visibility of key facial structures.
- Stochastic variability in normalization output; mitigated by averaging features across multiple (K) normalized candidates, but still present.
- Training labels are human ratings, which may embed cognitive biases related to age, race, and gender; the model may inherit these biases.
- Dependence on StyleGAN’s training data distribution; potential bias (e.g., female-leaning population average) necessitated algorithmic compensation during optimization.
- Computational demands of GAN-based latent optimization and perceptual metrics may constrain real-time use unless the pipeline is further engineered for speed.