Explaining use and non-use of policy evaluations in a mature evaluation setting

Political Science

V. Pattyn and M. Bouterse

This article examines evaluation utilization in policy-making, particularly within the Dutch Ministry of Foreign Affairs' highly mature evaluation department. It uncovers surprising insights about the factors that influence the use of evaluations, revealing the crucial role of policymaker involvement and optimal timing while questioning the expected impact of political salience. This research was conducted by Valérie Pattyn and Marjolein Bouterse.
Introduction

The paper investigates why policy evaluations are or are not used in organizations with high evaluation maturity, where many commonly cited facilitators (e.g., quality assurance, established processes) are already in place. Despite strong institutionalization, substantial research waste persists. The study asks which factors, and combinations of factors, explain instrumental evaluation use in such mature settings. The Dutch Ministry of Foreign Affairs’ Policy and Operations Evaluation Department (IOB) serves as a typical high-maturity case. The authors conduct a systematic inquiry, aiming to identify critical change-makers for minimizing evaluation waste, and to contribute to broader evidence-use research by examining causal complexity using Qualitative Comparative Analysis (QCA).

Literature Review

A structured review of evaluation-use scholarship identified a wide range of facilitators and barriers grouped into categories: involvement and interaction between policymakers and evaluators; political context; timing; key attributes of the evaluation/report; evaluator characteristics; policymaker characteristics; and organizational characteristics. The literature is fragmented, with uneven empirical support across factors, variable conceptualizations, and limited attention to barriers or to interactions among factors. Many studies lack clarity on the type of use considered. The review underscores the need to analyze combinations of conditions and their conjunctural, equifinal, and asymmetric causal roles in explaining instrumental use, particularly in mature evaluation environments.

Methodology

Case and setting: The study examines evaluations conducted by or for IOB, an internally independent unit within the Dutch Ministry of Foreign Affairs that is recognized for high-quality evaluation practice and an explicit focus on use.

Data collection: In June–July 2016 the authors selected the 20 most recent evaluations sent to Parliament between 2013 and 2016; two were dropped because respondents were unavailable, leaving 18 for in-depth analysis. Evaluations typically lasted about 12 months (range: 5–24 months) and were public. They comprised 7 policy reviews (systematic reviews), 5 program/activity evaluations of (semi-)public organizations, and 6 thematic/regional policy evaluations; only one concerned ongoing policy. Ten addressed international development and the remainder foreign affairs/human rights. Multiple sources were triangulated per evaluation: document analysis (terms of reference, interim and final reports, ministerial responses), at least two interviews (the IOB inspector/researcher and the primary policy contact), and surveys of other involved policymakers (e.g., reference group members or those drafting policy responses).

Outcome and factors: Instrumental use was defined as the evaluation significantly influencing at least one major policy decision (termination/continuation, a substantial strategic or operational change, or a major funding change) during or shortly after the evaluation. A broad initial set of organizational-level factors was derived from the literature; many showed little variation (e.g., frequency, formality, and timing of contacts; report quality and readability; evaluator credibility), reflecting IOB's maturity, and were excluded. Some factors (e.g., feasibility of recommendations; initiator of contact) could not be reliably measured.

QCA approach: Crisp-set QCA was employed to analyze necessary and sufficient conditions, following Basurto and Speer's stepwise calibration of qualitative data. Given 18 cases, the model was limited to four conditions selected for explanatory potential and variability: political salience; timing in parallel with policy formulation; novelty of knowledge to main policymakers; and policymakers' interest. Calibration rules:
  • Politically salient: high agenda priority according to both evaluators and policymakers, or perceived sensitivity.
  • Timely: the evaluation overlapped with the drafting or revision of major policy measures.
  • Novel knowledge: main policymakers reported that the evaluation provided new information.
  • Interest: main policymakers presented or pitched findings to all relevant staff, or proposed at least one evaluation question.
Conditions and the outcome were coded 1/0 per evaluation. Analyses of necessity and sufficiency were conducted, reporting the conservative solution (no assumptions about logical remainders).
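The crisp-set calibration rules above can be sketched as a simple coding function. This is an illustrative Python sketch, not the authors' instrument: the field names (`priority_evaluators`, `pitched_findings`, etc.) are hypothetical stand-ins for the qualitative indicators the study describes.

```python
# Minimal sketch of the crisp-set calibration rules described above.
# Field names are hypothetical; the study used stepwise qualitative
# calibration (Basurto & Speer), not this exact data layout.

def calibrate(case: dict) -> dict:
    """Code one evaluation's four conditions as 1/0 per the stated rules."""
    return {
        # salient if both sides give it high agenda priority, or it is sensitive
        "SALIENCE": int((case["priority_evaluators"] and case["priority_policymakers"])
                        or case["perceived_sensitive"]),
        # timely if the evaluation overlapped drafting/revision of major measures
        "TIMING": int(case["overlaps_policy_formulation"]),
        # novel if main policymakers reported receiving new information
        "NOVELTY": int(case["reported_new_information"]),
        # interested if policymakers pitched findings or proposed a question
        "INTEREST": int(case["pitched_findings"] or case["proposed_question"]),
    }

example = {
    "priority_evaluators": True, "priority_policymakers": False,
    "perceived_sensitive": False, "overlaps_policy_formulation": True,
    "reported_new_information": True, "pitched_findings": False,
    "proposed_question": True,
}
print(calibrate(example))
# {'SALIENCE': 0, 'TIMING': 1, 'NOVELTY': 1, 'INTEREST': 1}
```

Note that salience requires agreement from both evaluators and policymakers (or perceived sensitivity), so the example case codes 0 on salience despite the evaluators' high priority.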

Key Findings
  • Only 5 of 18 evaluations (27.8%) exhibited instrumental use despite the mature evaluation setting.
  • Necessary conditions for use: timing in parallel with policy formulation and clear interest by main policymakers were present in all five used cases (necessity consistency 1.00). Both are relatively trivial necessary conditions, however, because they were also common among non-used cases (necessity coverage: timing 0.56; interest 0.42).
  • Sufficient configuration for use: TIMING ∧ NOVEL KNOWLEDGE ∧ INTEREST → USE, with solution consistency 1.00 and solution coverage 0.80 (4 of the 5 used evaluations shared this combination; none of the non-used did).
  • Political salience did not appear in the sufficient path for use and did not function as a necessary condition.
  • No necessary conditions for non-use were identified. Three sufficient paths for absence of use covered 10 of 13 non-used evaluations (solution consistency 1.00; coverage 0.77):
    1. Absence of timing ∧ absence of novel knowledge.
    2. Political salience present ∧ absence of novel knowledge ∧ absence of interest.
    3. Absence of timing ∧ presence of interest.
  • Interpretation: Appropriate timing and policymaker interest are necessary but not sufficient; novelty of knowledge is the differentiator that, in combination with timing and interest, leads to instrumental use. Political salience matters little for use in this mature setting.
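The consistency and coverage figures reported above can be reproduced with a short sketch of the standard crisp-set QCA formulas (sufficiency consistency = cases with the configuration that show the outcome ÷ all cases with the configuration; coverage = outcome cases explained ÷ all outcome cases). The truth table below is toy data constructed only so the metrics match the reported figures; it is not the study's actual 18 cases.

```python
# Toy crisp-set data: 18 hypothetical cases (T=timing, N=novel knowledge,
# I=interest, USE=instrumental use). Constructed only so the parameters of
# fit match the figures reported in the article -- NOT the real case data.
cases = (
    # 5 used cases: all timely with interested policymakers; 4 also novel
    [dict(T=1, N=1, I=1, USE=1)] * 4 + [dict(T=1, N=0, I=1, USE=1)]
    # 13 non-used cases (4 timely, 7 with interest, none with T*N*I)
    + [dict(T=1, N=0, I=1, USE=0)] * 2
    + [dict(T=1, N=1, I=0, USE=0), dict(T=1, N=0, I=0, USE=0)]
    + [dict(T=0, N=1, I=1, USE=0)] * 2
    + [dict(T=0, N=0, I=1, USE=0)] * 3
    + [dict(T=0, N=0, I=0, USE=0)] * 4
)

def consistency_suff(cases, cond):
    """Share of cases with the configuration that also show the outcome."""
    x = [c for c in cases if cond(c)]
    return sum(c["USE"] for c in x) / len(x)

def coverage_suff(cases, cond):
    """Share of outcome cases that the configuration accounts for."""
    y = [c for c in cases if c["USE"]]
    return sum(1 for c in y if cond(c)) / len(y)

def necessity(cases, cond):
    """(consistency, coverage) of a condition as necessary for the outcome."""
    y = [c for c in cases if c["USE"]]
    x = [c for c in cases if cond(c)]
    overlap = sum(1 for c in y if cond(c))
    return overlap / len(y), overlap / len(x)

def path(c):  # the sufficient configuration TIMING * NOVELTY * INTEREST
    return bool(c["T"] and c["N"] and c["I"])

print(round(consistency_suff(cases, path), 2))           # 1.0
print(round(coverage_suff(cases, path), 2))              # 0.8
print(round(necessity(cases, lambda c: c["T"])[1], 2))   # 0.56
print(round(necessity(cases, lambda c: c["I"])[1], 2))   # 0.42
```

In this toy table the sufficient path covers 4 of the 5 used cases and never appears among non-used cases, while timing and interest hold in all used cases but also in many non-used ones, which is exactly the "necessary but trivial" pattern the study reports.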

Discussion

The findings address the central question by demonstrating that instrumental use in a mature evaluation system depends on conjunctural causation: appropriate timing aligned with policy formulation and policymakers' interest are necessary conditions, but only when the evaluation also delivers novel knowledge do they become sufficient. This underscores how engagement (interest) can shape evaluation questions and increase the likelihood of generating actionable, new insights. Timing and interest may reinforce each other: ongoing policy design or revision creates incentives for policymakers to engage and steer evaluative inquiries toward decision-relevant issues. Political salience neither promotes nor hinders use in this context, challenging prior research and shifting attention to levers organizations can actually control: process design (engagement) and scheduling (timing). For non-use, multiple pathways exist, highlighting causal asymmetry: the absence of use does not mirror the presence-side logic. Notably, without novel knowledge and appropriate timing, use is unlikely; even with interest, poor timing constrains uptake. These insights suggest that in mature systems, where quality and credibility are baseline constants, the levers for improving use lie in aligning evaluation cycles with policy windows and fostering intentional engagement to surface new, decision-relevant knowledge.

Conclusion

In a high-maturity evaluation environment, only a minority of evaluations achieve instrumental use. The study shows that while timing parallel to policy formulation and policymakers’ interest are necessary, instrumental use materializes when evaluations also provide novel knowledge. Political salience is largely inconsequential for use in this setting. Practically, this implies prioritizing engagement strategies that allow policymakers to pose targeted questions and scheduling evaluations to coincide with policy (re)design. Conceptually and methodologically, the QCA results affirm equifinality and causal asymmetry in evidence utilization. Future research should test these configurations in other mature and less mature contexts, explore differing knowledge regimes, broaden the set of potential conditions (including political actors’ perspectives), and examine other policy domains beyond development cooperation. Further work could unpack how credibility and reliability judgments form within varying institutional cultures and how managerial attitudes mediate translation of interest into action.

Limitations
  • Small-N design (18 cases) constrained the number of conditions included in QCA, limiting model complexity and potentially omitting relevant factors.
  • Focus on organizational-level factors; individual-level characteristics and some organizational variables with minimal variance (e.g., contact routines, quality, credibility) were excluded, which may matter in less mature settings.
  • Measurement challenges led to excluding factors such as feasibility of recommendations and initiator of contact; recall issues and overlap with outcome impeded reliable coding.
  • Analysis centered on immediate or end-of-cycle instrumental use; later-stage uses were not assessed.
  • Perspectives primarily from civil servants; politicians’ views and broader political variables were not incorporated.
  • Field specificity (development cooperation/foreign affairs) and the Dutch consensus-style knowledge regime may limit generalizability.
  • The study did not delve into how evaluators’ perceived reliability/credibility is constructed across different knowledge regimes.