The Launch of AVERI
And a new megapaper on frontier AI auditing with coauthors at dozens of organizations
I’m super excited to share that the non-profit I cofounded last year – the AI Verification and Evaluation Research Institute (AVERI) – has officially launched!
You can find some initial coverage of our work in a Fortune article by Jeremy Kahn here.
Quick Summary
AVERI is a 501(c)(3), US-based non-profit think tank that is trying to figure out how to make third-party auditing of frontier AI effective and universal.
A bit more verbosely, we are trying to envision, enable, and incentivize frontier AI auditing, defined as rigorous third-party verification of frontier AI developers’ safety and security claims, and evaluation of their systems and practices against relevant standards, based on deep, secure access to non-public information.
This is important for a bunch of reasons discussed in the paper linked below, and some not covered in it. I’ll try to give some intuition for why I personally think it’s so important, but critically, I think there can be consensus on the need for frontier AI auditing even among people with very different views on things like existential risks from AI, the pace of AI progress, etc. In short, whether AI is a normal technology or an unusually dangerous one, it doesn’t make sense for companies to check their own homework on safety and security.
(Note: AVERI has three syllables and sounds like “A,” then “ver” as in “vertical,” then “E”).
Our website has more information about who we are and what we’re up to, and we have more exciting publications coming in the next few weeks.
And here’s the megapaper outlining what frontier AI auditing is, why it’s important, and ways to make it happen.
Personal Context
From a personal perspective, AVERI is the culmination of many things I’ve learned and thought about over the past 14 years since I decided to focus on AI policy.
Frontier AI auditing ties together many themes from my various publications: the idea that safety and security are key factors to manage when developing AI, and among the biggest barriers to widespread beneficial deployment; the idea that there’s value in paying particular attention to the most capable AI models and systems; the idea that trust is a key factor in making AI go well, including trust that the various competitors in an AI “race” will follow a common set of standards (so no one can get ahead by cutting corners); and so on.
Some clues to the vision for AVERI can be found (e.g.) in this earlier megapaper, where we proposed third-party auditing as a mechanism for enabling more trustworthy AI development, and we made the connection to the AI arms race issue.
I hope AVERI will make a meaningful contribution to unlocking the great future that I think is plausible if we play our cards right and learn to confidently deploy AI across a huge number of beneficial (and often high-stakes) applications. And I hope it helps to prevent some really, really, really bad outcomes, as well as just normal bad outcomes. I hope it helps governments to ensure accountability for AI-related risk creation within their borders, and defuse the worst risks associated with an “AI arms race.”
In addition to building on my research interests and experience, I hope that AVERI will also benefit from the practical experiences I had at OpenAI, and everything I know about how the AI industry works from following its evolution very closely for quite some time.
Frontier AI Auditing is a No-Brainer
As I said, I don’t think that bringing about frontier AI auditing requires consensus on all things AI, nor agreeing with exactly how I framed things above. If it did, I’d be pretty pessimistic about frontier AI auditing happening! Instead, I think it’s hard but doable.
Here’s why, in short (again, read the paper for more): frontier AI auditing would make a lot of practical sense even if AI were simply a “normal technology,” and it is insanely important if you think AI is something more special than that.
If you think AI is “just” a normal technology like, say, electricity or the Internet… well, those were very big deals, and entire industries cropped up to address the risks associated with each of them. And these industries were not the result of electricity or Internet doomers, but of people trying to help manufacturers create safe electrical products, or trying to prevent your credit card details from being stolen online. Viewed in that light, frontier AI auditing is “just” what we should be doing as a matter of course for something that is driving massive economic transformation, and which many users and enterprise procurement decision-makers worry will delete their files, hallucinate legal cases, and so on.
But let’s say you think that AI is a bigger deal than that… well, then AI not even having the level of scrutiny we apply to normal technologies (as we argue in the megapaper) is totally insane.
Let’s be a bit more quantitative about this, with help from some vibe coding. I asked all your favorite AI systems to come up with a bunch of different ways of scoring technologies based on the third-party scrutiny applied to them, and to estimate how this changed over time, whether increases were caused by specific incidents, what the maximum number of fatalities from an incident was at each point in time, etc.
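To give a flavor of this exercise, here’s a minimal sketch in Python – with the big caveat that the rubric dimensions, the 0–2 scoring, and the data points are all invented placeholders for illustration, not the actual methodology or results:

```python
# Illustrative only: a made-up rubric for scoring third-party scrutiny of a
# technology on a 0-10 scale, in the spirit of the exercise described above.
from dataclasses import dataclass

@dataclass
class ScrutinySnapshot:
    technology: str
    year: int
    # Each dimension scored 0-2 (hypothetical rubric; the five sum to 0-10).
    independent_testing: int      # third parties test products before release
    mandatory_reporting: int      # incidents must be reported to a regulator
    auditor_accreditation: int    # the auditors themselves are certified
    access_depth: int             # auditors get non-public information
    pre_deployment_approval: int  # government sign-off before deployment

    def score(self) -> int:
        """Total third-party scrutiny on a 0-10 scale."""
        return (self.independent_testing + self.mandatory_reporting
                + self.auditor_accreditation + self.access_depth
                + self.pre_deployment_approval)

# Purely illustrative data points:
snapshots = [
    ScrutinySnapshot("commercial aviation", 2020, 2, 2, 2, 2, 2),
    ScrutinySnapshot("frontier AI", 2015, 0, 0, 0, 0, 0),
    ScrutinySnapshot("frontier AI", 2025, 1, 1, 0, 1, 0),
]
for s in snapshots:
    print(f"{s.technology} ({s.year}): {s.score()}/10")
```

Note that any rubric like this bakes in judgment calls – for instance, here the maximum score requires pre-deployment government approval, which is exactly the kind of detail I caution about below.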
At some point I may do a more rigorous version of this analysis – and for a less quantitative but still very valuable qualitative discussion, again, see the paper. But I’ll briefly summarize some conclusions that I think are pretty robust across different ways of measuring scrutiny, defining technologies, etc.
Scrutiny applied to AI is way below that applied to other technologies that – even if you totally ignore the doomers – could put far fewer lives at risk than AI in a single incident.
Scrutiny typically either jumps several points on any 10-point scale after an incident that kills a bunch of people, or increases very slowly over decades.
Scrutiny applied to AI has increased quickly in recent years – there is now something, rather than approximately nothing as was the case a decade ago.
Don’t take the details of this too seriously – again, read the paper instead, especially Section 4 and the appendices, and/or look out for a future blog post on this topic. Among other issues with this particular graph, you can only get the highest score with government approval before deployment, which I think will rarely if ever be the right approach for AI.
So, we are building a technology that is widely expected to be at least the next major driver of economic growth and change, and putting it into basically every part of society… and maybe it could also kill everyone or enable a permanent dictatorship, depending on who you ask. Given that, I think it makes pretty good sense to have guardrails for the most capable AI systems, and to make sure that it isn’t just the companies certifying that the guardrails are solid. That’s what AVERI is about.
We’re not starting from scratch, but we have a long way to go (again, see the paper).
[Figure adapted from Our World in Data; the original version says “The world” in each case where I added text.]
An Early Preview of AVERI’s Work and Style
As I said, we aim to envision, enable, and incentivize frontier AI auditing. We’re not trying to be the auditors exactly (though we do have some exciting audit pilot project plans, where our goal is to go from 0 to 1 rather than 1 to 100).
We think of ourselves as a think tank that is more action-oriented than usual, but still ultimately focused on understanding and explaining things. We are a “research institute,” as the name says.
So what kind of research do we have planned? I’ll give a few quick examples, and we’ll have much more to share in the coming weeks and months. But to be clear, all of these are illustrative: we will pivot over time, and have already pivoted several times, because a lot has changed in just the past year when it comes to these topics. Many non-profits and companies are cropping up, projects are spinning up and down, and the technology is not sitting still either.
Gold standard articulation
To be super lame for a sec, my high school quote was something along the lines of “a problem well-stated is half-solved.” (There seem to be various versions of this quote; that version is attributed to Charles Kettering.) That’s why we (and a lot of people at various other organizations) spent a bunch of time on the frontier AI auditing paper.
But the paper provides only a vision; you need much more detail to have actual auditing standards. Consider an analogy from financial auditing: the Public Company Accounting Oversight Board (PCAOB) is a non-profit created after the early 2000s Enron (etc.) accounting scandals. PCAOB’s job is to audit the (financial) auditors and drive quality improvements across the industry. Auditing the auditors is one of the things we talk about in the paper, and something we will likely be doing more of ourselves (though using “soft pressure” rather than formal authorities).
AVERI is a founding member of the AI Evaluator Forum, and some AVERI staff helped write their first standard, AEF-1. This is a great start, but we think of it as an early “bronze standard.” We need a more ambitious and more detailed “gold standard” for frontier AI audits – one that is specific enough to be actionable but that can also be tailored to different contexts. Under what conditions should companies give auditors access to information about training processes? What are the best practices for doing that securely (e.g., should such discussions happen on-site? Should auditors be issued company laptops when they’re doing related work, so they piggyback on corporate IT security investments)? These are just a few examples of the things we want to help work through in partnership with others across the ecosystem.
Audit research and engineering
It’s one thing to say that we need various nice things – it’s another to make them technically and economically feasible. AVERI is interested in that as well, and we have some amazingly talented people working on just that: building tools that can be used in audits and working with companies to conduct pilot projects.
We want frontier AI auditing to be rigorous and scalable. The highest quality audits will always require some amount of money and time because of the computing power and skilled labor involved, but we’ll need to make things pretty efficient in order to address the growing range of “risk surfaces” (again, even if you assume a relatively low pace and ceiling for AI progress). And we need standardized frameworks so that results can be compared across audits.
I’ll just give one example of this kind of thinking, which we say more about on our website. My colleague Sean McGregor recently put out a paper and platform called BenchRisk, which is essentially a benchmark for benchmarks. The cool thing about BenchRisk is that it could be applied to internal evaluations at companies by auditors given access to this non-public information, allowing those auditors to attest to the quality of a company’s benchmarks without exposing the company’s data.
Ultimately, we need to both protect sensitive information and give third parties very high confidence that risks are being managed. BenchRisk and similar ideas can help address the “transparency-security tradeoff” that we discuss further in the paper, and which requires creative thinking to navigate.
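For intuition about how this can work, here’s a toy sketch of the attestation pattern: the auditor computes a quality score over the company’s private benchmark and publishes only aggregate results. To be clear, the item fields and scoring criteria below are invented placeholders, not BenchRisk’s actual rubric:

```python
# Toy sketch: attest to benchmark quality without publishing the benchmark.
# Field names and scoring criteria are invented for illustration.
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    prompt: str
    reference_answer: str
    labeler_agreement: float  # fraction of labelers agreeing on the answer

def benchmark_quality(items: list[BenchmarkItem]) -> float:
    """Toy quality score: penalize tiny benchmarks and noisy labels."""
    if not items:
        return 0.0
    size_factor = min(len(items) / 500, 1.0)  # is the benchmark big enough?
    mean_agreement = sum(i.labeler_agreement for i in items) / len(items)
    return round(size_factor * mean_agreement, 3)

def audit_attestation(private_items: list[BenchmarkItem]) -> dict:
    """What the auditor publishes: aggregate statistics, never the items."""
    return {
        "quality_score": benchmark_quality(private_items),
        "n_items": len(private_items),
    }

# The company shares private_items with the auditor under NDA; only the
# attestation below ever becomes public.
print(audit_attestation([
    BenchmarkItem("2+2?", "4", 1.0),
    BenchmarkItem("Capital of France?", "Paris", 0.9),
]))
```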
Policy and advocacy
As I said above, third-party assessment of AI contains multitudes – there’s much more to be done, but also a lot of great work happening today, and we want to incentivize more companies to engage with the third-party ecosystem, and to do so in a serious way (not just as a checkbox exercise).
We need to increase demand for audits enough to grow the ecosystem and mitigate risks at a sufficient pace, but we also need to make sure that auditors aren’t churning out zillions of low-grade audits to meet unrealistic deadlines and turn a profit. We are interested in both private policies (e.g., insurance pricing that takes auditing into account; enterprise procurement policies; investor due diligence on safety and security) and public policies (e.g., legal safe harbors for auditors that find things that companies don’t want to hear about; requirements for auditing as a component of government procurement policies in domains like health and defense).
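To make the insurance example concrete, here’s one entirely hypothetical way an underwriter might fold audit status into pricing – the loading factor and discount are made-up numbers, not anything from AVERI or a real insurer:

```python
# Hypothetical sketch: an audit-adjusted insurance premium for an AI
# deployment. The loading factor and discount are invented for illustration.
def annual_premium(expected_loss: float,
                   audited: bool,
                   audit_discount: float = 0.25,
                   loading: float = 1.4) -> float:
    """Premium = expected loss x loading, discounted if independently audited."""
    base = expected_loss * loading
    return base * (1 - audit_discount) if audited else base

print(annual_premium(100_000, audited=False))  # 140000.0
print(annual_premium(100_000, audited=True))   # 105000.0
```

The point isn’t the specific numbers – it’s that pricing like this gives companies a direct financial reason to submit to serious audits.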
In the near-term, we’ll be doing a mix of two things on this front.
We’ll be working with various stakeholders to analyze the current lay of the land in the US – where there are many closely related bills floating around, and a few somewhat related laws going into force – and in the EU – where the General-Purpose AI Code of Practice needs to be implemented in a way that actually mitigates risks without burdening companies with paperwork that misses the mark on risk management.
We’ll also be thinking about new policy proposals that address specific challenges identified in the paper, such as quality control for audits via a PCAOB-for-AI that combines benefits of the private sector, such as better-than-government salaries to attract top AI talent, with those of the public sector, such as democratic accountability and the ability to impose fines or pull necessary accreditations.
AVERI’s Team and Supporters
AVERI is a small team of amazing people, backed by a much larger team of amazing people.
I’m honored to get to work with Sean McGregor, Carly Tryens, George Balston, Grace Werner, A. Feder Cooper, and Stone Addington every day at AVERI. I’m also honored to be advised by Dean Ball, David Duvenaud, Bri Treece, Gillian Hadfield, and Jose Hernandez-Orallo, and held accountable by AVERI’s other board members, Max Henderson and Mike McCormick. You can learn more about all of these folks here.
AVERI’s funding comes from Halcyon Futures, Fathom, Coefficient Giving, Geoff Ralston, Craig Falls, Good Forever Foundation, AI Underwriting Company, Sympatico Ventures, and several non-executive employees and alums of frontier AI companies. No donor makes up a majority of our funding. We’ve also been offered API credits from Amazon, Anthropic, Google DeepMind, OpenAI, and Thinking Machines Lab.
We’re hiring, and we could hire more amazing people with additional donations, so please reach out about either of those if you’re interested, or if you just want to talk about any and all aspects of making frontier AI auditing a thing!