Finding Ways to Credibly Signal the Benignness of AI Development and Deployment is an Urgent Priority
Without compelling evidence of AI development and deployment being benign, distrust and corner-cutting will predominate.
Introduction
A simplified view of AI policy is that there are two “arenas” with distinct challenges and opportunities. In the domestic arena, governments aim to support AI innovation within their borders and ensure that it is widely beneficial to their citizens, while simultaneously preventing that innovation from harming citizens’ interests (safety, human rights, economic well-being, etc). In the international arena, governments aim to leverage AI to improve their national security, while minimizing the extent to which these efforts cause other governments to fear for their security (unless this is the explicit intent in a given context, e.g., deterrence or active conflict).
For example, in the domestic arena, a government might require that companies assess the degree to which a language model is capable of aiding in the creation of biological weapons, and share information about the process of red teaming that model and mitigating such risks. Governments could also require documentation of known biases in the model. In each case, the government would be attempting to ensure, and produce evidence, that an AI system is benign, in the sense of not posing threats to the citizens of that country (although in some cases the government may also be mitigating risks to citizens of other countries, particularly for catastrophic risks that spill across borders or in the case of companies with a global user base). Citizens may also demand evidence that a government’s own use of AI is benign — e.g., not infringing on citizens’ privacy — or that appropriate precautions are taken in high-stakes use cases, or one part of government might demand this of another part of government.
In the international arena, a military might build an autonomous weapon system and attempt to demonstrate that it will be used for defensive purposes only. A military might also state that it will not use AI in certain contexts like nuclear command and control.
In all of these cases, outside observers might be justifiably skeptical that these claims are true, or will stay true over time, absent evidence. The common theme here is that many parties would benefit from it being possible to credibly signal the benignness of AI development and deployment. Those parties include organizations developing and deploying AI (who want to be trusted), governments (who want citizens to trust them and the services they use), commercial or national competitors of a given company/country (who want to know that precautions are being taken and that they don’t need to cut corners in order to keep up), etc. By credibly signaling benignness, I mean demonstrating that a particular AI system, or an organization developing or deploying AI systems, is not a significant danger to third parties. Domestically, governments should seek to ensure that their agencies’ use of AI as well as private actors’ use of AI is benign. Internationally, militaries should seek to credibly signal benignness and should demand the same of their adversaries, again at least outside of the context of deterrence or active conflict.
When I say that an AI system or organization is shown to not pose a significant danger to third parties, I mean that outsiders should have high confidence that:
The AI developer or deployer will not accidentally cause catastrophic harms, enable others to cause catastrophic harms, or have lax enough security to allow others to steal their IP and then cause catastrophic harms.
The AI developer or deployer’s statements about evaluation and mitigation of risks, including catastrophic and non-catastrophic risks, are complete and accurate.
The AI developer or deployer’s statements regarding how their technology is being used (and ways in which its use is restricted) are complete and accurate.
The AI developer or deployer is not currently planning to use their capabilities in a way that intentionally causes harm to others, and if this were to change, there would be visible signs of the change far enough in advance for appropriate actions to be taken.
Unfortunately, it is inherently difficult to credibly signal the benignness of AI development and deployment (at a system level or an organization level), both because of AI’s status as a general purpose technology and because the information required to demonstrate benignness may compromise security, privacy, or intellectual property. This makes the research, development, and piloting of new ways to credibly signal the benignness of AI development and deployment, without causing other major problems in the process, an urgent priority.
Precedents and related concepts
Existing efforts in AI policy, such as the White House voluntary commitments, could be read as efforts to get industry to achieve and communicate benignness, but the credibility part has not yet been achieved, as shown by public distrust in AI as well as policymakers’ strong interest in going further than voluntary commitments.
Efforts have also begun to establish “rules of the road” for military AI in order to reduce the extent to which military AI efforts might inadvertently increase the risk of spiraling conflict. The idea of credible signaling of benignness is closely related to and draws inspiration from related concepts/efforts such as verification in arms control, confidence-building measures, challenges in third party auditing of AI development and deployment, etc.
So if there are all these related threads, why didn’t I just use an existing term? Perhaps I’m mistaken in my choice, but the alternatives I considered didn’t seem to quite capture what I wanted to say. E.g., I worked on a paper about the idea that trustworthy AI development is about others being able to verify your claims. I still think that’s a useful framing, and it informs some of the ideas below, but I’m interested in a specific subset of claims here, and sometimes verifiability may be too high a bar. I also considered terms like “credibly signaling safety” or “demonstrable safety” or “verifiable safety,” but safety, at least in the sense it’s often used in an AI context to refer to avoidance of accidental harms, is too narrow a concept for what I mean here, which includes malicious uses.
If we don’t prioritize research, development, pilots, and policies aimed at making credible signaling of benignness a solved problem, there will be more corner-cutting on safety by actors trying to avoid being taken advantage of and outcompeted by others, whether those others are militaries or the leading companies of another country. Conversely, having a mature body of research and practice on credible signals of benignness could give everyone more breathing room on safety, since it’s easier to play it safe when you know others are doing so.
It’s hard to credibly signal the benignness of AI development and deployment, but not obviously impossible
AI being a general purpose technology implies, among other things, that a given system can be used in ways that both do and don’t harm people outside a given organization. Even for narrower technologies like cars, it can be hard to verify safety and other aspects of benignness – consider the Volkswagen emissions scandal, for example. This is true for a variety of reasons, including information asymmetry between different stakeholders as well as incentives to cut corners and the complexity of even relatively narrow technologies.
But I think we should still try hard to solve it, and there are encouraging historical precedents. In one of the most contentious domains in history, for example – nuclear weapons – investment in research, development, and diplomacy allowed for carefully designed on-site inspections by adversaries to take place at weapon facilities. This provided assurance that arms control treaties were being followed. Similarly, inspections of chemical factories routinely take place in order to ensure that chemical weapons are not being produced, and this is done in a way that protects other important interests like intellectual property and innovation.
A few factors point towards the tractability of this problem, including the fact that AI requires trackable inputs, that there are physical aspects of many ways that AI could cause harm that can also be tracked (e.g. weapons platforms, DNA synthesis technology), and that people are the ones (currently) making decisions about the development and deployment of AI – and people typically respond to incentives. Plus, we can learn from the lessons of all previous technologies.
We’re not starting from a blank slate and the problem is not obviously impossible, but we need to move quickly since AI capabilities are improving quickly.
A rough plan for making progress
Below is how I think the problem should be tackled:
We need to define what information is needed in order to convey benignness without unduly compromising intellectual property, privacy, or security.
There’s active research and public discussion about what kind of information is needed in order for external researchers to understand the properties of frontier AI models, and to scope out what confidence-building measures might look like in an AI context. There is a lot of research on methods for specifying AI system behavior in a way that avoids harm, and it is thus likely that a big part of signaling benignness at the AI system level will be sharing information about the rules the AI system is expected to follow, and why the organization in question is confident that they will be adhered to.
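To make the system-level idea slightly more concrete, here is a minimal, purely illustrative Python sketch of what a machine-readable statement of the rules an AI system is expected to follow might look like. The field names and example entries are my own invention, not an existing standard or any organization’s actual format.

```python
from dataclasses import dataclass, field

@dataclass
class BehaviorRule:
    """One rule the deployed system is expected to follow, plus the evidence offered for it."""
    rule_id: str
    description: str                                    # e.g. "refuses detailed bioweapon synthesis instructions"
    evidence: list[str] = field(default_factory=list)   # pointers to evals, red-team reports, etc.

@dataclass
class BenignnessStatement:
    """Hypothetical machine-readable summary a developer might publish about a system."""
    system_name: str
    version: str
    rules: list[BehaviorRule]
    known_limitations: list[str]

# Entirely illustrative example:
statement = BenignnessStatement(
    system_name="example-model",
    version="2024-06",
    rules=[
        BehaviorRule(
            rule_id="R1",
            description="Refuses requests for step-by-step biological weapon production guidance",
            evidence=["red-team report (held by developer)", "third-party bio-risk evaluation"],
        )
    ],
    known_limitations=["documented demographic biases in hiring-related prompts"],
)
print(statement.system_name, len(statement.rules))
```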
The clause about “without unduly compromising intellectual property, privacy, or security” is the basic challenge of arms control which has historically motivated (and still motivates) substantial investment in related processes and technologies. It’s also the basic challenge of domestic regulation, given that companies worry about hurting themselves commercially, reputationally, legally, etc. by sharing too much.
There are various angles of attack on this. It’s worth noting upfront that companies, militaries, etc. have many priorities beyond signaling benignness, so there may be low-hanging fruit here, where external demands for the information above could drive quick wins. But at some point that low-hanging fruit will be plucked, and more innovative approaches will be needed: neutral third party auditors who attest that certain claims are true without sharing all of the underlying evidence, regulatory requirements for certain information to be shared (so that companies don’t have to put themselves at commercial risk by acting unilaterally), iterative sharing of information in a tit-for-tat fashion as trust is built up over time rather than all at once, etc.
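As a hedged illustration of the first of those approaches, here is a toy Python sketch of an auditor attesting that a claim is true while only committing to, not revealing, the underlying evidence. It sketches the information flow rather than a real protocol design; the key, claim text, and function names are invented, and a real system would use asymmetric signatures rather than the shared-key MAC used here for brevity.

```python
import hashlib
import hmac

# Toy illustration of "attesting to a claim without sharing the underlying evidence":
# the auditor publishes the claim, a hash commitment to the evidence, and an attestation
# tag. A real scheme would use proper digital signatures (so verifiers don't need the
# auditor's secret key) and standardized formats.

AUDITOR_SECRET_KEY = b"hypothetical-auditor-signing-key"  # stand-in for a real private key

def commit_to_evidence(evidence: bytes) -> str:
    """Publish only a hash of the evidence, so it can be checked later if ever disclosed."""
    return hashlib.sha256(evidence).hexdigest()

def auditor_attest(claim: str, evidence_commitment: str) -> str:
    """The auditor binds the claim to the evidence commitment with a keyed MAC."""
    message = f"{claim}|{evidence_commitment}".encode()
    return hmac.new(AUDITOR_SECRET_KEY, message, hashlib.sha256).hexdigest()

def verify_attestation(claim: str, evidence_commitment: str, attestation: str) -> bool:
    """Check the attestation (here, anyone holding the key; with signatures, anyone at all)."""
    expected = auditor_attest(claim, evidence_commitment)
    return hmac.compare_digest(expected, attestation)

# The full red-team report stays private; only the claim, commitment, and attestation circulate.
evidence = b"full red-team report contents (kept private)"
commitment = commit_to_evidence(evidence)
attestation = auditor_attest("Model X passed bio-misuse evaluation suite v1", commitment)
print(verify_attestation("Model X passed bio-misuse evaluation suite v1", commitment, attestation))  # True
```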
Much depends on how exactly the definition of benignness above is operationalized – e.g., how much uncertainty about the adequacy of safety or security precautions an outside observer is willing to tolerate, etc. There is likely not a single answer here. But we can develop a much richer vocabulary and set of possible answers for different contexts.
We need effective compute governance to ensure that large-scale illicit projects aren’t possible – otherwise large organizations (e.g. competing governments and militaries) will always be viewed with suspicion even if some of their specific projects are shown to be benign.
As discussed at length elsewhere, I and various other people think that compute is central to AI governance. Scaling up computing power is a big reason for the recent progress in AI capabilities, so having a good handle on compute stocks/flows/usage can help calibrate on where major AI risks are most likely to arise.
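To illustrate the kind of check a compute-reporting regime might involve, here is a toy Python sketch that flags training runs whose estimated compute exceeds a reporting threshold. The widely used rule of thumb of roughly 6 FLOP per parameter per training token, and the specific threshold value, are illustrative assumptions rather than a description of any actual regulation.

```python
# Toy sketch: estimate training compute and flag runs above a reporting threshold.
# Both the 6*N*D approximation and the threshold value are illustrative assumptions.

REPORTING_THRESHOLD_FLOP = 1e26  # hypothetical threshold for mandatory reporting

def estimated_training_flop(num_parameters: float, num_training_tokens: float) -> float:
    """Rough rule-of-thumb estimate: ~6 FLOP per parameter per training token."""
    return 6.0 * num_parameters * num_training_tokens

def requires_report(num_parameters: float, num_training_tokens: float) -> bool:
    """Would this (hypothetical) run need to be reported under the illustrative threshold?"""
    return estimated_training_flop(num_parameters, num_training_tokens) >= REPORTING_THRESHOLD_FLOP

# Example: a 500B-parameter model trained on 30T tokens.
print(estimated_training_flop(5e11, 3e13))  # ~9e25 FLOP
print(requires_report(5e11, 3e13))          # False under this illustrative threshold
```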
It may be advantageous from the perspective of signaling benignness if the most dangerous AI activities were to happen in jurisdictions with strong rule of law and strong transparency. There are some subtleties around physical location vs. legal jurisdiction; the benefits of economies of scale need to be weighed against other considerations, such as countries’ legitimate desire to have sovereignty over their data and compute; and, for now at least, it is hard to prove that a country will not renege on a promise to provide long-term access to cloud-based AI capabilities.
There are various tricky aspects of this. Biden’s executive order requiring information from AI developers about their safety mitigations may soon be legally challenged, and Trump has indicated his intent to reverse it if elected. The Bureau of Industry and Security, the agency at the US Department of Commerce that should be tracking illicit use of GPUs overseas, works on a shoestring budget, leaving journalists to partially fill in the gaps. Algorithmic progress is occurring quickly enough that there are big questions around the right compute thresholds for reporting, risk assessment, etc., not to mention that China is actively trying to decouple and build up its own semiconductor supply chain (since it’s increasingly getting cut off from the globalized one). We need a strategy that takes more advantage of the currently super-concentrated supply chain, which there’s no excuse for not tracking closely, while planning ahead for a future of growing bifurcation.
We need a combination of existing ingredients like national technical means of verification, cooperative verification (possibly done bilaterally or multilaterally), confidence-building measures, and domestic regulation, as well as new ones like global whistleblowing processes.
It’s unclear what the exact recipe is here, but there’s certainly much to be learned from what’s already happening in the US, EU, China, etc. around regulation, as well as the long history of arms control.
But we also need to think about whether there are missing ingredients, and I think there’s at least one – a global system of incentivizing whistleblower reports on false claims of benignness. The idea is that if, anywhere in the world, people are skirting the rules related to benign AI (e.g., there is a secret datacenter over the reporting threshold that was cobbled together from black market GPUs, a lab is deceptively choosing which model to expose to third party auditors, or a military isn’t following the principles it said it would follow, such as ensuring a human in the loop on lethal decision-making), then someone should be well compensated for making that known to the relevant authorities.
There are precedents here, like SEC whistleblower bounties. Ideally, as with financial fraud, the resulting equilibrium is not that there’s a lot of bad activity and a lot of bounties being paid, but rather that the existence of the program makes it clear that you’re not going to get away with it for long, so you shouldn’t try. A system like this needs to be designed carefully: it should not encourage the sharing of sensitive IP or the proliferation of potentially dangerous technologies; it should not encourage people to create and then report on bad behavior, or to sow disinformation; and it should not put whistleblowers at great risk (particularly in the context of military AI).
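As a purely illustrative sketch of how those design constraints might show up in practice, here is a minimal Python example of a triage record and triage step for incoming reports. All field names and rules are hypothetical, invented only to make the constraints concrete: evidence containing sensitive IP handled separately, reporter identity escrowed, and uncorroborated reports not immediately acted on.

```python
from dataclasses import dataclass
from enum import Enum, auto

class TriageOutcome(Enum):
    NEEDS_INVESTIGATION = auto()
    REJECTED_INSUFFICIENT_DETAIL = auto()

@dataclass
class WhistleblowerReport:
    """Illustrative structure for an incoming report; all field names are invented."""
    report_id: str
    claim_summary: str                # what rule or commitment is allegedly being violated
    jurisdiction: str
    contains_sensitive_ip: bool       # if True, evidence goes to a vetted handler rather than circulating
    reporter_identity_escrowed: bool  # identity held separately to protect the reporter

def triage(report: WhistleblowerReport, corroborating_sources: int) -> TriageOutcome:
    """Toy triage logic: require some corroboration before opening an investigation."""
    if not report.claim_summary.strip():
        return TriageOutcome.REJECTED_INSUFFICIENT_DETAIL
    if corroborating_sources == 0:
        # A single uncorroborated report is not acted on immediately, which also dampens
        # incentives to manufacture or stage violations just to collect a bounty.
        return TriageOutcome.REJECTED_INSUFFICIENT_DETAIL
    return TriageOutcome.NEEDS_INVESTIGATION

report = WhistleblowerReport("r-001", "Undeclared datacenter above reporting threshold", "unspecified", True, True)
print(triage(report, corroborating_sources=2))  # TriageOutcome.NEEDS_INVESTIGATION
```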
We need research and development aimed at making #1-3 easier.
As with various other aspects of AI policy, more research is needed, including in technical areas related to verifiable safety claims but also work on distilling lessons from history in other technological domains as well as contemporary policy practices like whistleblower bounties for fraud. There may also be valuable work related to building evaluations for benignness – these may in some respects be quite similar to existing efforts like red teaming and dangerous capability evaluations, but I expect there to be differences in some cases (e.g., I’m not aware of publicly reported evaluations of military systems). There also could be much more research on appropriate tiers of security against model weight theft. The latter may be particularly critical not just for ensuring compliance with domestic regulations but also as a part of proposed plans for distributing the benefits of AI more widely.
Different lines of research make sense for different contexts (e.g. building confidence in military vs. corporate compliance with their commitments), and in particular for different kinds of harms. The evidence needed to show, e.g., that a system has been rigorously tested for, and designed in a way that minimizes, harmful biases will be quite different from the evidence needed to show that a country is not hiding big datacenters underground. But I think there are common themes, just as there were lessons learned and shared between, e.g., implementation of nuclear arms control and the implementation of the Chemical Weapons Convention.
Research on these topics can culminate in various outputs, such as actionable lessons learned from different domains, designs for different institutions (e.g. a process for triaging whistleblower claims across borders and securely handling related information), ways of incorporating these ideas into the implementation of the EU AI Act and other near-term policy windows, etc.
We need to run high-profile pilots to build confidence that the overall recipe works and to speed up the research and development discussed in #4.
I’m often inspired by the US-Soviet Joint Verification Experiment, which was essentially an effort to empirically test whether it’d actually be possible to detect underground nuclear explosions. This involved a lot of dialogue and negotiation from scientists on both sides.
We may eventually get to a point where militaries directly test one another’s AI systems, given that analogous things have happened in a nuclear context. This could make sense in order to clarify the intent behind certain projects or systems, and avoid inflated perceptions of malicious intent leading to a war that both sides would prefer to avoid. But this will require stronger foundations than we have today. In the meantime, we can start with pilots focused on signaling the benignness of industry systems and organizations. There is already a lot of innovation in new testing arrangements that can be built on here, as well as a growing ecosystem of third party testing organizations that could perform some of the work discussed here. But notably the bar for benignness described above is far higher than has been achieved by any pre-deployment testing that has occurred in industry to date.
Caveats and conclusion
To be clear, I don’t think that credibly signaling the benignness of AI development and deployment is a complete solution for AI policy.
I deliberately defined benignness fairly narrowly, such that it does not rule out an AI developer or deployer causing some kind of (real or perceived) harm to another party. For example, workers should in fact consider AI deployment a potential threat to their economic wellbeing in some cases, but including economic harms in this framework would make it too broad to be useful (indeed, it’s pushing up against that possibility already by combining the domestic and international arenas, as well as safety, security, and misuse). Similarly, at an international level, Country A could perceive Country B’s AI development and deployment to be harmful in that it sets up Country B to outcompete Country A economically, but I consider this kind of relative economic harm to also be out of scope.
It may also turn out to be the case that militaries develop the capacity to signal benignness, but ultimately choose not to use that capacity for whatever reason. They may choose to use AI in a conflict, which of course throws the idea of benignness out the window, or they may conclude that there is value in actively increasing the perceived threat of a particular AI system or a military for the purpose of deterrence. It may also be the case that decisively signaling one’s benignness requires foregoing flexibility (i.e. binding one’s hands) in a way that is unacceptable for various reasons to various parties.
And importantly, benignness will often be a lower bar than what we should want from such a game-changing technology as AI. At least for some developers and deployers of AI, we should demand not just benignness but also that the technology or organization in question actively benefits society widely.
Those caveats aside, I think it would be a colossal mistake not to invest substantial effort in figuring out how to credibly signal benignness, and only then take stock of where things stand. So I am actively thinking about ways to speed things up, and if you’re interested in helping with this effort or any of the other project ideas I discuss here, please get in touch via this Google Form.
Thank you to Michael C. Horowitz, Gillian Hadfield, Shahar Avin, George Gor, Larissa Schiavo, Vishal Maini, Heidy Khlaaf, and Gustavs Zilgalvis for helpful comments on earlier versions of this blog post, though none of them necessarily endorse any of the content here and all remaining errors are my own.