Mention versus Action: Guiding safety policy for content related to harmful topics
DOI: https://doi.org/10.3765/plsa.v11i1.6080
Keywords: use/mention, responsible AI, harmfulness
Abstract
As generative AI systems are integrated into high-stakes domains, designing safety policies that accurately distinguish harmful from harm-free content has become a central challenge. A natural starting point is the use/mention distinction from linguistics and philosophy of language: content that merely mentions a harmful topic is typically less harmful than content that uses it to express harm. However, we argue that this binary is insufficient as a basis for responsible AI policy. We propose a mention vs. action framework that extends the use/mention distinction along two additional dimensions: the level of gratuitous detail and the discourse-level contribution of the content. Together, these dimensions ground safety assessments in narrowly scoped, operationalizable criteria rather than opaque intent judgments. We demonstrate the framework through case studies in instruction-following, jailbreak attempts, and image content moderation, showing its applicability across modalities and its practical value for safety policy development.
License
Copyright (c) 2026 Hadas Kotek, Leon Gatys, Margit Bowler, Yu'an Yang, Shruti Palaskar, Ciro Sannino, Gunnar Lund, Joseph Cheng, Robert Daland, Charlie Maalouf, Jeffrey Bigham

This work is licensed under a Creative Commons Attribution 4.0 International License.
Published by the LSA with permission of the author(s) under a CC BY 4.0 license.
