Scalable watermarking for identifying large language model outputs

Computer Science


S. Dathathri, A. See, et al.

Discover how Sumanth Dathathri and colleagues at Google DeepMind have tackled the challenge of identifying AI-generated content with SynthID-Text, a watermarking scheme that preserves the quality of synthetic text while maintaining detection accuracy and speed.
Introduction

Large language models (LLMs) are increasingly used to generate synthetic text across many domains, making it difficult to distinguish AI-generated content from human-written text. Reliable identification and attribution of LLM outputs are critical for safe and responsible deployment, especially given risks in education, software development, and web content. Existing approaches include: (1) retrieval-based systems that store all generated outputs for later matching, which raise privacy, coordination, and scalability concerns; (2) post hoc detection methods that classify text as human- or machine-generated using statistical features or learned classifiers, but which suffer from high computational costs, domain shift, and bias (e.g., higher false-positive rates for non-native speakers), and degrade as LLMs improve; and (3) text watermarking methods that embed signals during generation (generative), after the fact via edits (edit-based), or via training-data triggers (data-driven). Edit-based and data-driven approaches can leave artefacts or only trigger under specific prompts, limiting broad attribution. The research question is how to design a scalable, production-ready generative watermark that preserves text quality, offers high detectability under realistic conditions, and imposes minimal computational overhead. This work proposes SynthID-Text, a generative watermarking scheme based on a novel Tournament sampling algorithm, configurable for non-distortionary (quality-preserving) or distortionary (higher-detectability) operation, which integrates with speculative sampling to meet production efficiency requirements.

Literature Review

Prior work on AI text identification spans retrieval-based logging, which requires centralized storage and access to all model outputs, and post hoc detection using statistical cues or learned classifiers. The latter faces practical limitations: inconsistent performance across domains and languages, higher false-positive rates for certain populations, and the need for continual retraining as LLMs evolve. Text watermarking offers an alternative. Edit-based methods apply rule-based transformations (e.g., synonym substitution or special Unicode characters), but may introduce perceptible artefacts. Data-driven watermarking uses trigger phrases learned during training to flag misuse, but only marks outputs when specific triggers are present and is aimed more at misuse detection than broad attribution. Generative watermarks embed statistical signatures by altering the sampling process without retraining the model and enable efficient detection without access to the LLM. Prior generative approaches include Gumbel-based sampling and list-based methods such as Soft Red List, with theoretical analyses relating watermark strength to properties of the underlying distribution (e.g., entropy). There remains a need for schemes that (a) rigorously preserve quality (non-distortion), (b) improve detectability across entropy regimes, and (c) scale within production systems using speculative sampling.

Methodology

SynthID-Text is a generative watermarking scheme comprising three components: (1) a random seed generator, (2) a sampling algorithm, and (3) a scoring function for detection.

Random seed generator: a sliding-window hash of the most recent H tokens (H = 4 in experiments), combined with a secret watermarking key, produces a per-step seed r_t.

Sampling algorithm (Tournament sampling): at each generation step, multiple independent pseudorandom watermarking functions g_i(x, r) are computed over candidate tokens x, parameterized by the seed r. Candidates are over-generated from the LLM distribution p_LLM(x_t | x_<t) and an m-layer tournament is run: tokens are randomly paired in each layer, and within each pair the token with the higher g_i score advances (ties broken randomly). After m layers, the final winner is emitted as the next token. The number of layers m controls watermark strength and variance reduction; the paper commonly uses m = 30.

Configurations:
  • Non-distortionary mode: with two competitors per match and appropriate repeated-context masking, Tournament sampling is constructed to be single-sequence non-distortionary (the induced distribution over sequences matches the original LLM on average over seeds), preserving text quality with a slight reduction in inter-response diversity.
  • Distortionary mode: using more than two competitors per match increases watermark strength (better detectability) but is token-level distortionary, trading some quality for detectability.

Scoring function (detection): given a text x_1..x_T and the key, the detector recomputes the seeds r_t and g_i values, then computes the mean g-score across tokens and watermark functions; watermarked text is expected to score higher. Longer texts and higher-entropy generation settings improve detection power, while lower-entropy settings reduce it. Additional scoring variants, including a learned Bayesian detector, are described for improved performance, especially with speculative sampling.
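The generation side can be sketched as follows. This is a minimal illustration, not the paper's implementation: the SHA-256-based seed and g functions, the helper names, and the small m are assumptions for readability (with two competitors per match, an m-layer tournament consumes 2^m candidates, so the paper's m = 30 relies on an efficient vectorized equivalent rather than literal over-generation):

```python
import hashlib
import random

def seed_from_context(context, key, window=4):
    # Sliding-window hash of the last `window` tokens plus the secret key.
    # The exact hash construction here is an illustrative assumption.
    data = repr((key, tuple(context[-window:]))).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def g(token, seed, layer):
    # Pseudorandom watermarking function g_layer(token, r) -> {0, 1}.
    data = repr((seed, layer, token)).encode()
    return hashlib.sha256(data).digest()[0] & 1

def tournament_sample(tokens, probs, seed, m=3, rng=random):
    # Over-generate 2^m candidates i.i.d. from the LLM distribution, then run
    # an m-layer tournament: in each pair, the higher g-score advances.
    candidates = rng.choices(tokens, weights=probs, k=2 ** m)
    for layer in range(1, m + 1):
        winners = []
        for a, b in zip(candidates[::2], candidates[1::2]):
            ga, gb = g(a, seed, layer), g(b, seed, layer)
            if ga != gb:
                winners.append(a if ga > gb else b)
            else:
                winners.append(rng.choice([a, b]))  # ties broken randomly
        candidates = winners
    return candidates[0]  # tournament winner becomes the next token
```

In a real decoder the `probs` would come from the model's softmax at step t, and `seed_from_context` would be re-run at every step so that the detector can reproduce the same seeds from the text alone.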
Computational scalability: Tournament sampling modifies only the sampling layer and can be implemented efficiently (including vectorized forms), incurring negligible latency relative to model inference. Integration with speculative sampling: The authors propose two algorithms to combine generative watermarking with speculative sampling: (a) high-detectability watermarked speculative sampling (preserves detectability, may reduce efficiency), and (b) fast watermarked speculative sampling (for non-distortionary watermarks, preserves acceptance rate and efficiency, with some detectability reduction mitigated by a learned Bayesian scoring function).
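The basic mean-g scoring function described above can be sketched as follows; the hash-based `_seed` and `_g` helpers are illustrative stand-ins mirroring the generation side, not the paper's exact constructions, and they must match whatever the generator used:

```python
import hashlib

def _seed(context, key, window=4):
    # Must reproduce the generator's sliding-window seed exactly.
    data = repr((key, tuple(context[-window:]))).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def _g(token, seed, layer):
    data = repr((seed, layer, token)).encode()
    return hashlib.sha256(data).digest()[0] & 1

def mean_g_score(tokens, key, m=30, window=4):
    # Mean g-value over all positions with a full context window and all m
    # tournament layers: ~0.5 for unwatermarked text, higher for watermarked
    # text, with longer texts giving a sharper separation.
    total = count = 0
    for t in range(window, len(tokens)):
        seed = _seed(tokens[:t], key, window)
        for layer in range(1, m + 1):
            total += _g(tokens[t], seed, layer)
            count += 1
    return total / count
```

Detection needs only the token sequence and the secret key, not the LLM itself, which is what makes scoring cheap to run at scale.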

Key Findings
  • Quality preservation at scale: in a live production experiment on Gemini (formerly Bard) with approximately 20 million responses, the thumbs-up rate differed by +0.01% (watermarked higher) and the thumbs-down rate by −0.02% (watermarked lower) compared with unwatermarked responses; both differences were statistically insignificant within 95% CIs, indicating negligible impact on user-perceived quality.
  • Human preference study: side-by-side ratings on 3,000 ELI5 prompts for Gemma 7B-IT showed no significant differences across grammaticality/coherence, relevance, correctness, helpfulness, and overall quality.
  • Capability metrics: no significant changes in perplexity or standard automated benchmarks between non-distortionary SynthID-Text and unwatermarked models.
  • Improved detectability (non-distortionary): for Gemma 7B-IT on ELI5 prompts, SynthID-Text achieved a higher TPR at a fixed FPR of 1% than Gumbel sampling at the same text lengths; gains are larger in lower-entropy settings (e.g., lower temperatures). Selective prediction can achieve TPR = 95% at FPR = 1% by abstaining on uncertain cases.
  • Improved detectability-quality trade-off (distortionary): distortionary SynthID-Text outperformed Soft Red List in TPR at 1% FPR versus quality degradation (log perplexity), offering a more favorable trade-off.
  • Diversity: while both non-distortionary SynthID-Text and Gumbel sampling reduce inter-response diversity, SynthID-Text provides a better diversity/detectability trade-off.
  • Minimal latency overhead: on Gemma 7B-IT served on 4 TPU v5e chips, baseline decoding took 15.527 ms/token; with 30-layer Tournament sampling it took 15.615 ms/token (+0.57%). Gumbel sampling added +0.26% and Soft Red List +0.28%. The relative overhead shrinks further for larger models.
  • Speculative sampling compatibility: fast watermarked speculative sampling with Gemma 7B-IT (target) and Gemma 2B-IT (draft) with a 3-token lookahead preserved the acceptance rate and overall latency when using non-distortionary SynthID-Text, matching the theoretical guarantees.
  • Multilingual consistency: SynthID-Text detection is consistent across languages (per supplementary results).

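TPR at a fixed FPR, the detection metric used throughout these findings, can be computed from detector scores as follows. This is a generic sketch of the metric, not the paper's evaluation code; the score distributions in the usage example are synthetic:

```python
import numpy as np

def tpr_at_fpr(human_scores, watermarked_scores, fpr=0.01):
    # Set the decision threshold at the (1 - fpr) quantile of scores on
    # human text, so roughly fpr of human texts are flagged, then measure
    # the fraction of watermarked texts whose score exceeds that threshold.
    threshold = np.quantile(human_scores, 1.0 - fpr)
    return float(np.mean(np.asarray(watermarked_scores) > threshold))

# Synthetic example: watermarking shifts the mean detector score upward.
rng = np.random.default_rng(0)
human = rng.normal(0.50, 0.02, 10_000)    # unwatermarked scores (~0.5)
marked = rng.normal(0.56, 0.02, 10_000)   # watermarked scores, shifted up
print(f"TPR at 1% FPR: {tpr_at_fpr(human, marked):.2f}")
```

The fixed-FPR framing matters operationally: the threshold is calibrated on human text so that false accusations stay at a known, low rate, and detectability is then reported as how much watermarked text still clears that bar.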
Discussion

The study addresses the core challenge of reliably attributing LLM-generated text without degrading user experience or incurring prohibitive computational costs. By introducing Tournament sampling within SynthID-Text, the authors demonstrate that generative watermarking can embed a robust statistical signature while maintaining the LLM’s original distribution over outputs in a non-distortionary configuration. Empirically, this yields superior detectability relative to established baselines (Gumbel sampling for non-distortionary and Soft Red List for distortionary settings) across models and temperatures, with detectability improving with text length and entropy. Quality neutrality is validated at scale through a live production A/B test with ~20 million Gemini interactions and a controlled human evaluation, and by unchanged perplexity and benchmark metrics. The scheme integrates with speculative sampling—common in production—to maintain throughput and latency when using non-distortionary watermarking. Together, these results establish the practical viability of watermarking for large-scale LLM deployments, enabling accountability and content provenance with minimal operational trade-offs. The work also clarifies non-distortion definitions and shows how to configure watermark strength versus quality, offering practitioners a tunable framework for diverse application needs.

Conclusion

SynthID-Text introduces a scalable, production-ready generative watermark for LLMs based on Tournament sampling. It provides: (1) configurable non-distortion guarantees that preserve text quality, (2) improved detectability over state-of-the-art baselines in both non-distortionary and distortionary regimes, (3) negligible latency overhead, and (4) compatibility with speculative sampling for high-throughput deployment. Deployed in Gemini and Gemini Advanced, SynthID-Text constitutes, to the authors’ knowledge, the first large-scale production deployment of a generative text watermark. Future work includes strengthening robustness to editing/paraphrasing and adversarial attacks (stealing, spoofing, scrubbing), extending scoring functions and detection under low-entropy regimes, improving diversity-preserving configurations, and broadening cross-language evaluation and interoperability across heterogeneous providers.

Limitations
  • Coordination requirement: generative watermarking only identifies content from systems that adopt it; it cannot detect outputs from actors who do not implement watermarking.
  • Open-source challenge: enforcing watermarking across decentralized, open-source deployments is difficult.
  • Vulnerability to attacks: watermarks can be weakened by edits, paraphrasing, and adversarial attacks such as stealing, spoofing, and scrubbing; robustness remains an active research area.
  • Entropy dependence: detectability decreases when the LLM distribution has low entropy (e.g., highly deterministic responses) and improves with longer texts and higher entropy.
  • Diversity trade-offs: even non-distortionary configurations can reduce inter-response diversity; stronger non-distortionary guarantees may reduce detectability and increase computational complexity.
  • Complementarity: generative watermarking is not a complete solution; post hoc detection or retrieval-based methods may still be needed to detect unwatermarked AI text.