Mention versus Action: Guiding safety policy for content related to harmful topics

Authors

  • Hadas Kotek, Apple Inc.
  • Leon Gatys, Apple Inc.
  • Margit Bowler, Apple Inc.
  • Yu'an Yang, Apple Inc.
  • Shruti Palaskar, Apple Inc.
  • Ciro Sannino, Apple Inc.
  • Gunnar Lund, Apple Inc.
  • Joseph Cheng, Apple Inc.
  • Robert Daland, Apple Inc.
  • Charlie Maalouf, Apple Inc.
  • Jeffrey Bigham, Apple Inc.

DOI:

https://doi.org/10.3765/plsa.v11i1.6080

Keywords:

use/mention, responsible AI, harmfulness

Abstract

As generative AI systems are integrated into high-stakes domains, designing safety policies that accurately distinguish harmful from harm-free content has become a central challenge. A natural starting point is the use/mention distinction from linguistics and philosophy of language: content that merely mentions a harmful topic is typically less harmful than content that uses it to express harm. However, we argue that this binary is insufficient as a basis for responsible AI policy. We propose a mention vs. action framework that extends the use/mention distinction along two additional dimensions: the level of gratuitous detail and the discourse-level contribution of the content. Together, these dimensions ground safety assessments in narrowly scoped, operationalizable criteria rather than opaque intent judgments. We demonstrate the framework through case studies in instruction-following, jailbreak attempts, and image content moderation, showing its applicability across modalities and its practical value for safety policy development.

Published

2026-06-19

How to Cite

Kotek, Hadas, Leon Gatys, Margit Bowler, Yu'an Yang, Shruti Palaskar, Ciro Sannino, Gunnar Lund, et al. 2026. “Mention Versus Action: Guiding Safety Policy for Content Related to Harmful Topics”. Proceedings of the Linguistic Society of America 11 (1): 6080. https://doi.org/10.3765/plsa.v11i1.6080.