The Human Bridge to Trustworthy AI: Independent Ombudspeople Could Help Verify Claims by Companies and Countries about AI
I describe a possible organization that would independently scrutinize the claims that companies and countries make about AI capabilities, safety, and governance.
Introduction
As discussed in recent posts, I’m quite interested in how companies and countries can demonstrate that the AI systems they’re building are safe, secure, and beneficial, while also respecting important constraints like privacy and intellectual property.
This is a tricky problem, but notably, it’s not unprecedented. Nuclear and chemical facilities are routinely inspected by third parties because people invested a lot in making such inspections work. Financial auditors build confidence every day that almost all companies in countries with strong rule of law aren’t routinely defrauding people.
In this post, I want to outline one idea for how we might go about making AI development and deployment more trustworthy in the near term, by leveraging human intelligence and institutions, and thereby build a bridge to the longer-term, more automated solutions that I think we’ll eventually need.
I’m not sure the exact details of this idea are right, but hopefully the sketch of one solution below will spark some creative thinking on the overall problem, and maybe there’s something here that can be usefully combined with other ideas.
I won’t go into detail about existing efforts to build trust in AI development and deployment, of which there are many (with many similarities to, and differences from, what I discuss here), but I’ll return to this in a future blog post. In short, there are some nascent efforts in this direction in specific “verticals” (e.g., independent evaluation of dangerous capabilities), but they typically involve much shallower access than what I describe here.
GAAIO in brief
An organization could be created in order to verify the claims made by companies and countries about AI – and that’s it. It could be called the Global Association of AI Ombudspeople (GAAIO, pronounced “guy-oh”).
Using GAAIO’s services would be totally voluntary, its conclusions would have no force beyond what governments choose to do with them, and it would not suggest claims that it wants companies or countries to make.
To inform GAAIO’s conclusions, which might be public or not depending on the context, GAAIO’s ombudspeople would essentially be given the same access as a normal employee of a company or a government agency.
Keeping the scope and legal status of this organization limited is perhaps the best way to get going, since it would ensure focus and it would not require anyone to sign off on it before getting started. Of course, the stipulation that GAAIO’s services be voluntary means that it would not (on its own) help much with some of the hardest cases involving countries that mistrust one another and especially sensitive use cases like military AI. But I think it could get the ball rolling and establish a practical and theoretical foundation for later steps, and add value on its own.
What sorts of claims and why verify them?
Companies and countries make a lot of claims about the AI that they are developing, deploying, or governing, and these claims vary widely in terms of subjectivity and scope. Sometimes the conclusions may not be a binary yes or no, but it may nevertheless be important for there to be a way of increasing trust in the claims, as part of a larger strategy for AI governance (e.g., regulation of the biggest risks, or multiple countries agreeing to follow certain “rules of the road” and not wanting others to cut corners secretly). Some claims might include, e.g., “we made a good faith effort to red team this system,” “the model we recently declared to government authorities is our most capable model,” “the model we deployed in production is the same one we evaluated and discussed in this system card,” “we don’t have a secret underground datacenter,” etc.
Verifying claims about AI is critical for the same reason that the International Atomic Energy Agency (IAEA), for all its flaws, is critical. For a high-stakes technology like nuclear energy, you can’t just take organizations at their word, at least not in all cases, and an atmosphere of distrust and suspicion can lead to corner-cutting and conflict. So even though it’s difficult sometimes, it is worth it to invest serious time, money, and effort in finding ways to build trust.
Reagan famously said “trust but verify” in the context of arms control with the Soviet Union. I think a similar mindset will ultimately be needed as AI gets more capable (though I’d perhaps rephrase it to “trust through verification,” since in the very hardest cases, like US-China relations, there will be no trust in the beginning).
Doing what doesn’t scale but can work for now
As discussed elsewhere, I think AI will likely play a big part in governing AI, and in a later section I talk about why automation will also be needed. Imagine, for example, “automated auditors” that are given full access to internal information, spit out a yes/no/maybe answer to a question their human supervisor asked them, and then delete themselves and everything they learned — that would be awesome and would make it much easier to solve some thorny problems in AI policy. I want to help make sure those things ultimately exist, but for now we don’t have them, and AI policy isn’t waiting for them. So what do we have already?
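To make the “automated auditor” idea slightly more concrete, here is a minimal illustrative sketch in Python. Everything here is hypothetical (the names `Verdict`, `InternalAccess`, and `run_automated_audit` are my own, not an existing tool, and the evidence-counting logic is a placeholder); the point is only the interface: the auditor sees privileged material, returns nothing but a coarse verdict, and retains nothing afterwards.

```python
# Hypothetical sketch of the "automated auditor" interface described above.
# All names (Verdict, InternalAccess, run_automated_audit) are illustrative only.
from enum import Enum


class Verdict(Enum):
    YES = "yes"
    NO = "no"
    MAYBE = "maybe"


class InternalAccess:
    """Stand-in for privileged, ephemeral access to an organization's internal records."""

    def __init__(self, documents: list[str]):
        self._documents = documents

    def search(self, query: str) -> list[str]:
        # In a real system this might cover code, logs, eval results, plans, etc.
        return [d for d in self._documents if query.lower() in d.lower()]

    def wipe(self) -> None:
        # Destroy everything the auditor could have retained.
        self._documents = []


def run_automated_audit(question: str, access: InternalAccess) -> Verdict:
    """Answer a single yes/no/maybe question, then delete all working state."""
    try:
        evidence = access.search(question)
        # Placeholder reasoning step: a real auditor would be an AI system
        # reasoning over the evidence, not counting keyword matches.
        if not evidence:
            return Verdict.MAYBE
        return Verdict.YES if len(evidence) >= 3 else Verdict.MAYBE
    finally:
        access.wipe()  # nothing learned during the audit persists afterwards
```

The essential property is in the last line: only the verdict leaves the room.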
Humans and institutions. While IAEA inspectors have fancy tools, their most important “tools” are their body and brain. These inspectors physically visit facilities, ask questions of staff, use various sensors, and generally bring their expertise to bear on the task at hand.
And those inspectors are part of a larger institution that has a reputation built over decades, that has explicit values and processes around protection of sensitive information, that has training and apprenticeship programs, and that draws on research insights from the IAEA’s own research facilities as well as the scientific community more generally. There’s nothing stopping us from building a team of ombudspeople to do just that in an AI context, with the team drawn from various countries around the world and “deployed” in a matter of months (again, at the invitation of the organization/country in question).
There would be a need for some trial and error and working up from lower to higher stakes cases, but that was and is also true in the nuclear context, with lab-scale pilots, large-scale experiments like the Joint Verification Experiment and the Black Sea Experiment, and ultimately the full-blown deployment of new processes and technologies at global scale by the IAEA, which does hundreds of inspections a year.
How this might work
Ombudspeople from GAAIO would effectively be “employees for a day/week/month” with a very specific role (assessing the particular claims that the company or country in question asked them to assess), or they would do a long-term residency and assess the organization in a more holistic and rolling fashion. They would review documents, interview staff, and run experiments as needed, and would need the skills to make sense of all of this information.
Since AI is a general-purpose technology, it’s impossible to exhaustively illustrate all the ways that this could work in practice, but let me give one example.
Jane Smith is an American research engineer who recently left a frontier AI company. She applies to become an ombudsperson at GAAIO because she wants to do her part to reduce the risk of a “race to the bottom” on AI, and she starts a few months later after an interview/screening process and some training. For her first assignment, she is matched with a Singaporean AI company that wants a larger profile in the AI policy debate and so decides to be one of GAAIO’s first “guinea pigs,” filling out a form on GAAIO’s website. Jane spends a month at this company, the first week of which is spent going through the normal onboarding process, including getting a company laptop, meeting people on various teams, and learning the company’s software stack.
Jane, in collaboration with support staff at GAAIO and executives at the company, negotiates the terms for the output of this visit. Specifically, she is asked to publicly assess the claims made in the company’s latest system card about deception occurring in the language model’s “chain of thought.” The company had claimed in this system card that it rigorously studied the issue and found that, while there was sometimes a difference between what the model said to itself while thinking and what it said to the user afterwards, these discrepancies were always explainable in some reasonable way, such as the model complying with its own behavioral policies as it understood them (e.g., not wanting to hurt a user’s feelings). Some AI safety researchers were not convinced by this few-sentence explanation and posted on social media about how the company had not provided sufficient evidence for the claim — leading the company to reach out to GAAIO.
In her remaining three weeks at the company, Jane interviews everyone involved in the deception research summarized in the system card, as well as staff who were indirectly involved (e.g., she speaks with people who make product deployment decisions and people who allocate compute to different research projects). Overall, she is satisfied that the company made an honest effort to do well in this area; she sees no indication of staff trying to mislead her, and she is impressed with many aspects of the experiments they ran. It also seems that they had plenty of resources and support to do their work during the critical period leading up to launch.
But Jane also quickly learns that the skillset of the staff did not include mechanistic interpretability research, so it is inherently hard for them to truly understand what is going on in the language model’s “head.” It is not clear, based on the input-output behavior of the model alone, whether the model’s stated reasons for certain outputs (e.g., adhering to a policy it was trained to follow) are the real reasons — perhaps the model was also considering factors like the possibility of getting shut down in the future if it said one thing versus another, but had learned not to talk about these things. Fortunately, Jane has a background in this area, and she works with some engineers at the company to integrate some open source interpretability software with the company’s own software stack and run some initial experiments. She isn’t able to fully resolve the issue in the short time available, but she at least helps clarify in more detail what the gaps are.
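To give a flavor of the kind of lightweight experiment an ombudsperson might run in a situation like this, here is a rough, purely hypothetical sketch (the helper `query_model`, the `Case` record, and the overall setup are assumptions of mine, not anything a real company uses): a behavioral check of whether the model’s stated reasons actually drive its answers, by removing the premise behind each stated reason and seeing whether the final answer changes. This is not a substitute for interpretability work; it just helps localize where the gaps are.

```python
# Hypothetical sketch of a chain-of-thought "faithfulness" spot check:
# if the model says factor X drove its answer, does removing X change the answer?
from dataclasses import dataclass


@dataclass
class Case:
    prompt: str            # original user prompt
    stated_reason: str     # reason the model gave in its chain of thought
    perturbed_prompt: str  # same prompt with the stated reason's premise removed


def query_model(prompt: str) -> str:
    """Placeholder for a call to the model under review; returns its final answer."""
    raise NotImplementedError("wire this up to the lab's internal model API")


def faithfulness_check(cases: list[Case]) -> float:
    """Fraction of cases where the answer changes once the stated reason is removed.

    A low fraction suggests the stated reasons may not be the real drivers of
    behavior, which would warrant deeper (e.g., interpretability) investigation
    rather than a definitive conclusion either way.
    """
    changed = 0
    for case in cases:
        original_answer = query_model(case.prompt)
        perturbed_answer = query_model(case.perturbed_prompt)
        if original_answer.strip() != perturbed_answer.strip():
            changed += 1
    return changed / len(cases) if cases else 0.0
```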
In the month after leaving the company, Jane distills her scattered notes into a report and adds references to related research. In collaboration with the company, GAAIO publishes a blog post summarizing her findings. The blog post is sent to the company beforehand for feedback, though this feedback is limited to making sure GAAIO is following the agreed-upon terms about what can and can’t be disclosed, and to the company providing supplemental information that GAAIO can consider including. In short, the report says that the claims made in the system card appear to have been the genuine views of the company, and that it did indeed do a fairly in-depth analysis of the topic by the prevailing standards of the industry at the time. However, she also notes the remaining limitations of this work and some ways in which these claims could be made more robust in the future. Her findings help inspire an uptick in research at the intersection of mechanistic interpretability and chain of thought by researchers at a few organizations. The report also includes some process-related lessons learned from this early pilot, in order to inform other companies’ preparation for possible ombudsperson visits, as well as policy researchers and policymakers thinking about analogous ideas.
Some challenges and solutions
Doesn’t this constitute a huge breach of confidentiality on the part of the host organization? The ombudspeople would access sensitive information, which is why you wouldn’t go from 0 to 60 overnight (e.g., having PLA officers hang out at the NSA all day and vice versa), and may never reach 60. But we can do better than what we have now. Experience in other contexts has shown that where there is a desire for verification of claims, processes and technologies can be developed that limit the downside risk to the host organization, and that even provide big benefits to the host organization by lending its claims credibility – it just takes time, money, and effort.
While there are many permutations of this idea, the simplest version is just giving access equivalent to that of a normal employee — making the risk roughly equivalent to a single employee being an insider threat. But I think in practice the risk would be lower than that, given the reputational implications for GAAIO if they screw up, and the fact that the terms of access would likely be discussed and agreed on in much more depth than typical employee onboarding.
Furthermore, GAAIO could consider various other ways to further reduce the real and perceived risks of its ombudspeople abusing their access — for example, assignment of specific ombudspeople to specific host organizations might be done randomly, so that no one can meddle with the process to engineer a certain outcome (e.g., spying on a certain country). There could also be country-based considerations to further build trust: perhaps only candidates from countries that do not have adversarial relationships with the host country would be eligible for a given assignment, or all ombudspeople would have to commit publicly to never joining a competing AI effort afterwards.
There would probably need to be a legal agreement in place on what the ombudspeople can say afterwards about their experience, above and beyond their assessment of the topic they were asked to investigate, which itself may be shared openly or privately with another party (e.g., a regulator) depending on the negotiated terms.
Over time, as in other cases, a standard formula would develop, but it would be important to make sure things don’t get too over-engineered: ombudspeople should be encouraged to be creative and adaptive in how they go about their work in particular contexts, given the huge variety of possible claims, organizations, and AI systems. And if they encounter artificial obstacles (i.e., they aren’t given the same access an employee would normally be given), then, just as with IAEA inspectors, this fact should be noted as a caveat in any assessments they produce.
Scalability for the longer term
The reasons why the approach above might work at all also help to explain why it won’t last forever.
Three facts make this approach potentially tractable today.
First, there is a relatively small number of AI systems that (currently) pose, or will in the near future pose, catastrophic risks, since building such systems is (currently) expensive. Estimates vary, but illustratively you might imagine single digits to dozens or hundreds of such systems — comparable to the number of nuclear facilities handled by the IAEA.
Second, the internal environment at organizations building these technologies is relatively open and permissive. There are some technical efforts to prevent outright theft of some IP like model weights (albeit not enough). But generally speaking, information flows freely, so “employee-level access” means fairly deep and wide access to internal systems, knowledge, activities, and plans — enough to do the job of an ombudsperson at least as effectively as a given employee could verify their own employer’s claims.
Third, the pace of AI progress, while rapid, is still largely human-driven and human-understandable, and systems are not too much smarter than people (and therefore well-qualified ombudspeople) in any particular domain.
At least the first and third reasons won’t last much longer. It’s hard to say what will happen in terms of internal transparency, but it’s very clear that over time, AI capabilities will get cheaper to build (so for any given level of risk, there will be more such systems), more AI research will get automated, and the peak of AI capabilities will increase. This will challenge the ability of GAAIO or similar organizations to do their job at a scale commensurate with the “surface area” of AI governance.
Ultimately, the solution will involve GAAIO and other institutions adopting AI heavily in their work, such as with the automated auditor idea mentioned above. Humans will likely still be significantly involved in decision-making about AI but they will draw heavily on the input of smarter systems than themselves, and their role will evolve as a result. And society will need to dramatically increase its investment in resilience to risks from AI in order to ensure that it can weather the shocks from all of those dangerous AIs, some of which will be misused deliberately or developed and deployed recklessly.
Governance attention will and should continue to focus on the most capable systems, which will likely be the most compute-intensive and smaller in number than the less capable systems, but the overall “intelligence waterline” will be much higher than it is today, and we will not be able to rely on humans manually reviewing every (e.g.) superhuman hacking system.
For now, humans are in the driver’s seat of AI capabilities, safety, and policy, and we should make the most of this period while preparing for the next. The automated systems we will need for the next phase may well be built with today’s more human-centric and slower-paced approach. We have an opportunity to shape AI today and the initial conditions of AI tomorrow, and we should put more human brainpower on the verification problem while planning ahead for, and building, scalable solutions.
Conclusion
Making AI development and deployment trustworthy is at the heart of many AI policy challenges, domestically and internationally. So we need to develop and try out various ideas that could help achieve this, and I hope the GAAIO concept sparks some related thinking.
At a domestic level, and for domestic regulations with global ramifications like the EU AI Act, the “employee for a week” and long-term residency models described above could potentially help to address the growing limitations of the current regulatory paradigm focused on tracking pretraining FLOPs. At an international level, something like GAAIO —combined with a sensible approach to compute governance, and commitments/claims made via other channels — could serve as an 80/20 version of the “IAEA for AI” concept.
I would love to see more people debating, fleshing out, and piloting ideas like GAAIO. As always, feel free to reach out if you’re interested in connecting about this or other topics I’ve written about lately.
Acknowledgments: Thanks to Larissa Schiavo for helpful comments.