To De-Risk AI, The Government Must Accelerate Knowledge Production
A guest post on addressing non-existential AI risk scenarios
This week’s post is a guest essay by my friend Greg Fodor, a former colleague, brilliant programmer, and most recently the brains behind shogtongue. Greg has been reading, hacking, and tweeting in the AI space for some time now, and while he’s not an X-risk doomer he is, like me, worried about more moderately bad AI risk scenarios.
Greg’s proposal for addressing these risks, though, is the opposite of the standard doomer insistence that we immediately halt all AI research. Instead, Greg is more in the “Manhattan Project for explainability and alignment” camp — an approach he calls knowledge accelerationism, or k/acc.
One of the many things I like about this essay is the way Greg distinguishes foundational knowledge from incremental knowledge as two types of knowledge with different AI risk profiles. I have been sitting on a draft of an essay that attempts to do something very similar, so I may restart it and build on Greg’s thinking, here.
There’s a lot to chew on in this piece, so I hope you all enjoy it as much as I did.
tl;dr: significant near term AI risk is real and comes from the capacity for imagined ideas, good and evil, to be autonomously executed on by agentic AI, not emergent superintelligent aliens. To de-risk this, we need to align AI quickly, which requires producing new knowledge. To accelerate the production of this knowledge, the government should abandon decelerationist policies and incentivize incremental alignment R&D by AI companies. And, critically, a new public/private research institution should be formed that grants privileged, fully funded investigators multi-year funding cycles with total scientific freedom and access to all state-of-the-art artificial intelligence systems operating under US law to maximize AI as a force multiplier in their research.
If you want to just read the proposals, jump to the last section.
The arrival of increasingly capable AI models has led to a fever pitch of clamoring for regulation, training pauses, and other centralized government interventions to try to de-risk the technology by slowing down its development. This essay suggests that knowledge accelerationism (aka k/acc), not capability deceleration, ought to be the goal of any government interventions, and outlines a specific set of proposals on how to do this.
Many are claiming to know what risks AI presents and our approximate odds of surviving these risks. Superintelligence, runaway intelligence explosions, self-improving systems, and so on. While these are interesting and fun thought experiments, I believe the most entertaining thought experiments are the least likely. The most urgent matter is de-risking whatever first crossover lies ahead for us — the earliest threshold where AI yields a sudden jump in risk of fully destabilizing society.
This first such jump will not be emergent superintelligence or any other presently unknowable phenomena rooted in assumptions that are at best weak conjectures based on little more than speculative game theory. The first jump into a high-risk regime will come from a rapid shift, already underway, in the nature and scope of individuals who can have outsized impact on the world: the shift from “hackers” to “ideators.” This is already happening right now — and is a shift we should be celebrating. But it places us in a new risk regime, one which we can actually point at, so let’s do so.
In this essay, I will first articulate an argument of AI risk for those skeptics who have been led to discount it by the absurd and dangerous proposals by some rationalists. Then I will argue that accelerated knowledge production is what is necessary to deal with it, not decelerationism. And finally, I will outline a series of proposals intended to maximize the likelihood the requisite knowledge will be produced in time.
If you’re already convinced significant AI risk is real and imminent, you can jump directly to the argument for the proposals (“How to accelerate knowledge production”) or just the summarized proposals themselves in the last section.
The time of the ideator
The age of the hacker is ending, the time of the ideator has come. AI is a ratchet which collapses the capital and resource consolidation needed to execute on ideas. Up to now, ideas have been easy, execution has been hard. Soon, ideas will be hard, and execution will be easy. Neural networks will continue to underperform humans at autonomous divergent reasoning — at fully generating new, viable ideas not in their training set. But they will increasingly outperform humans at autonomous planning and execution on ideas given to them which turn out to be viable.
If this is correct, the future is not one of black hat and white hat hackers duking it out to make dents in the universe, the arena which has largely defined the Information Age. The active player characters of the future, of the Intelligence Age, are black hat and white hat ideators conjuring up new, viable ideas they then hand off to sovereign, autonomous AI to execute.
Viable ideas are not necessarily ideas you or I would endorse or be pleased to see implemented. Viable ideas, in this case, are simply any ideas that can be made to work: ideas which can be executed on and will have sufficient reach to visibly impact the world. There have been many viable ideas in history that are extremely bad past the point of meaningful contention. The most devastating tragedies in history are often rooted in an idea motivated by evil that also turned out to be viable.
White hat ideators will conjure up viable good ideas. Ideas that create wealth. That expand freedom. That reduce suffering. That accelerate positive sum games.
Black hat ideators will conjure up viable bad ideas. Ideas that destroy wealth. That consolidate power. That harm others. That entrench zero-sum games.
The black hat ideators’ ability to execute on bad ideas will always be constrained to a degree by restraining forces, but AI systems present a radical shift in their favor. For an ever-increasing number of such ideas, they will be able to act as lone wolves, with no co-conspirators, and with many such ideas being delegated for execution in parallel with only a marginal increase in risk, cost, or time for each such idea attempted.
The key capability threshold to consider in this for gauging the risk of AI systems is minimally-viable harm delivery by sovereign AI agents.
To understand this concept, an analogy:
Imagine you are trying to shoot a target, but you have limited vision. Someone hands you a shotgun. You turn to aim, based on intuition, where you think the target is. It might be too far away, or you might not be aiming well - you’re uncertain. You pull the trigger, the gun fires, and you hear nothing. You’re handed a better shotgun. You point in the same direction, and pull the trigger again. Each improved shotgun fires more bullets, farther, and with a wider radius. Eventually you’re handed a shotgun where, when you pull the trigger, you can hear one of your bullets hit your target. You don’t have to change anything, you just keep pulling the trigger over and over pointed in the same direction, and, eventually, the target will be destroyed. Most of your bullets miss, but it doesn’t matter: a small percentage inflict damage. The shotgun you are holding at this point has passed the minimum viable harm delivery threshold: you have proven that given your own personal capacity and talents (ie, your intuition of where to point), along with that version of the shotgun, you now have sufficient “reach” to harm your selected target. You did not have to expand your talents past that point — you can even have “no technical ability” — you just needed sufficiently capable technology for your existing abilities to gain reach.
The minimally-viable harm delivery threshold is the threshold where ideas become reachable by black hat ideators who feed viable, bad ideas to a sovereign state-of-the-art AI agent — agents that can execute them in a plausibly deniable, low-risk way, in parallel. Given an idea, the agent first spikes out its reach (ie, proving it has the capacity to hit the target.) Then, if it does, the ideator can just metaphorically pull the trigger until some other force stops them.
This capability isn’t far off for many such viable, bad ideas today: it’s imminent. It seems highly likely that one such meaningful threshold of minimally-viable harm delivery will be crossed within the next generation of open-source LLM models, or if, for example, GPT-4 level capabilities can presently be reached with existing open-source models via fine-tuning.
This is not about species-level existential risk, outsmarting superintelligence, or any other fantastical theories put forward by rationalists built up from a rickety foundation of assumptions formed out of little more than their own imaginations. Those theories are dividing smart people into silly tribes like “AI doomers” and “e/acc” and causing us to ignore the much less controversial problem sitting directly ahead of us.
The liberating potential of AI systems is that anyone’s viable ideas become more and more possible for them to execute on independently, quickly, and with low cost. This positive technological miracle can have a negative sign put in front of it for certain bad ideas. We must acknowledge this other side of the double-edged sword exists. It’s imminent, and it’s real, and it’s important to talk about it without rationalists or AI ethicists derailing the conversation into a 40-page essay about the instrumental convergence of alien minds or how GPT-4 is actually a white supremacist.
Importantly, we don’t even need a shared definition of “harm” or “bad” to talk about this constructively. Some people feel very uneasy about the kinds of viable ideas that can be already executed today by AI systems — ideas they would deem “bad.” Others are not concerned about these at all. This is OK, we don’t have to agree on what constitutes a bad idea that we would prefer not be executed on to try to agree on the path to potential solutions. All we need to agree on is the ratchet is real and is turning in one direction, and that there will eventually be a point where some ideas executed on by black hat ideators will deserve a negative sign from all of us.
You don’t have to think this problem is solvable or worth solving. You can think the tradeoffs of proposed solutions are not worth it, or any proposed solutions would be a complete disaster in practice, or that solutions exist. That’s also OK. The first goal is to acknowledge a real problem we can all see, one that is imminent, that isn’t couched in speculation and metaphysical navel gazing about Von Neumann probes. A ratchet we all see turning, and one that we should agree inexorably leads to a double-edged sword that we know will start cutting fingers as much as clearing terrain.
If you do not think the regime change of ideators becoming the ones who chart humanity’s collective future presents real risk of surprising, negative consequences, you can stop reading here. If you agree and want to at least consider the potential solution space, read on.
Kinds of problems
Now that we can see a genuine, imminent, non-speculative problem regarding AI risk, we need to understand what kind of problem it is.
Problems are always solved by producing new knowledge and then applying it. There are at least two kinds of knowledge one can produce: incremental knowledge or foundational knowledge. One way to think about problems is which of these two kinds of knowledge are necessary to solve it.
The best known method to produce incremental knowledge towards solving a problem is by ‘working the problem.’ This is applied science and engineering. This is peer-reviewed, NSF-funded scientific research. This is shipping products to users. This is making contact with reality at the frontier of where the problem is manifesting itself. This is what Richard Hamming asked his colleagues: if you’re not working on the most important problem in your field, why not? This is what OpenAI is attempting with their alignment R&D.
Working the problem is a reliable approach for producing incremental knowledge that can solve problems adjacent to ones already solved. It works well if you have a high confidence that following the evidence where it leads will land on a solution. It works if the consensus view around how to attack the problem is mostly correct. It involves testing, iterating, and steady improvement on top of prior foundational knowledge. You still need to make conjectures to choose your next move, but they are not risky conjectures: they seem like they are correct guesses, and they usually are.
However, some problems cannot be solved this way. These problems require new foundational, transformative knowledge. Unfortunately, you can neither hill climb your way to such knowledge nor predict if and when it will be discovered. Only once this knowledge is discovered can you then work the problem using incremental methods. Without the right foundational knowledge, working the problem incrementally won’t work.
An example of a problem which required new foundational knowledge to solve was instant person-to-person communication, from the vantage point of a person in the year 1700. It turned out, the necessary knowledge was mastery of electromagnetism and the construction of electronic relays, but it was impossible to predict this knowledge was needed as well as the odds of it being produced. Guessing if such instant communication was impossible or just around the corner was uncertain enough to be best considered unknowable at the time.
Producing new foundational, transformative knowledge is a lot harder than producing incremental knowledge. It’s not something we are very good at today, at all. (We used to be better at it — more on that later.)
So, given these two kinds of knowledge, what kind of knowledge is necessary for the problem of de-risking the effects of black hat ideators using sovereign AI to execute their viable, bad ideas?
Can we hill climb to a solution?
To consider whether mitigating the harmful effects of black hat ideators is amenable to incremental knowledge production, ie, by ‘working the problem,’ we can consider a case of maximum difficulty which is also obviously reachable.
Remember: the capabilities ratchet is turning, and with each click, the scope of viable ideas which can be executed on autonomously by sovereign, agentic AI systems increases.
Assuming AI systems will, at a minimum, converge on near human-level capacity for executing ideas (which seems reasonable, post-transformers) eventually there will be an idea that meets the following criteria:
it was previously only capable to be executed on by a narrow cohort of humans, who either never considered it or rejected it as a bad idea
but crosses over into being possible for a sovereign, agentic AI to execute in general for anyone who chooses to ask it to
it is universally agreed upon to be catastrophically bad among all humans except a few black hat ideators
one such black hat ideator conceives it and chooses to execute it, someone who would have been incapable of executing on it without AI
Unfortunately, if we wished to prevent execution of this bad idea, the problem converges onto the general AI alignment problem. If such black hat ideators exist and have the ability to request execution by a fully sovereign AI system, the AI system itself must be the thing to reject such bad ideas when called on to execute them. Alternatively, we would need to pre-empt the emergence of individuals who would choose to execute on such bad ideas. This would imply we’d need to solve the alignment problem for human beings themselves, which is an equally daunting task.
So, we start from a clear, imminent problem — an inexorable sign bit flip, the black mirror image of the leverage granted to white hat ideators by AI. From this, we reasonably guess we’ll need new foundational knowledge to avert disaster. As long as AI systems continue to increase in their capability to execute on ideas, we’ll need to solve the general AI alignment problem, which doesn’t seem amenable to a solution by mere incremental knowledge production alone.
Crucially: none of this argument depends on the assumption that neural networks will reach superhuman intelligence. It relies upon a weaker and much more widely accepted assumption that neural networks will become increasingly capable of executing on viable ideas, using the Internet, on par with more and more capable humans. This assumption is correct in most futures where AI advancements continue at a reasonable pace.
So, to stand a chance of de-risking the first crossover into high-risk AI we must accelerate knowledge production of both kinds: incremental, and foundational. Incremental knowledge can help in the short term, but new foundational knowledge is likely the only way to solve the general problem of AI alignment — a problem we now can see needs to be solved not because of the potential for emergent superintelligence, but more acutely because we’ll soon hit the first catastrophic bad idea conceivable by any arbitrary black hat ideator that can be autonomously executed on by agentic AI.
How to accelerate knowledge production
The focus of this essay is about what government interventionism we ought to pursue, so I am going to scope the discussion of how to accelerate knowledge production to motivate such interventions.
Accelerating production of incremental knowledge
The first kind of knowledge production we ought to try to accelerate is incremental knowledge: knowledge discoverable by intentionally “working the problem” of AI alignment with real systems making contact with reality. I think there ought to be two goals of any such government intervention: incentivizing AI developers to do this kind of work, and ensuring any useful knowledge they discover is disseminated widely and not held back as IP or trade secrets.
OpenAI is, in my view, is exemplifying the kind of behavior we want here. They are working the problem on real systems, and they are sharing their work in a risk-savvy way. You can disagree with their risk assessment, and many of their assumptions, but they are acting thoughtfully and meaningfully working the problem. For example, they have published their approach to RLHF as an alignment mechanism, using neural networks themselves for interpretability, and have open sourced their evaluation framework to lower the burden of pre-requisites for companies to begin working the problem of AI alignment.
So what government interventions could incentivize other AI companies to act similarly in working the problem of AI alignment, and sharing their work?
Some ideas:
Tax benefits for AI companies that dedicate some % of their R&D budgets to alignment oriented research and development and disclose their results.
Prioritized treatment in processing intellectual property claims (patents, litigation, etc) if companies show they are releasing IP determined to be primarily useful for aligning AI systems.
Public reporting requirements for sufficiently large companies to disclose the alignment R&D they are performing and their roadmap to disclose their findings to the public. Failing to file such reports would exclude them from being candidates for these kinds of programs.
I do not think it is a good idea to enable the government to penalize AI companies for not doing R&D on alignment — it would be an easy path to regulatory capture and other abuses. If companies choose to focus elsewhere, this is fine, they ought to be able to continue to operate under the status quo. But the government should create additional incentives for companies to prioritize alignment R&D programs.
The net effect of this would be a more rapid convergence on the actual ceiling of alignment capabilities that can be reached today absent the production of non-incremental, foundational knowledge.
If this ceiling turns out to be high, these interventions alone could significantly reduce AI risk. At a minimum, it would help craft the next proposal, which is intended to accelerate the production of foundational knowledge.
Accelerating production of foundational knowledge
Foundational knowledge production is much harder to accelerate. First, it’s impossible to have high certainty what research paths will lead to it. Second, it’s very hard to know if you’re even making any progress toward discovering it. The moment before foundational knowledge appears is often one at the end of a long, fruitless slog.
In Scientific Freedom, Donald Braben makes the case that our research institutions once understood at least one formula for accelerating the production of foundational knowledge, but over time have stopped applying it.
To produce foundational knowledge, according to Braben, you must grant extraordinary individuals a long funding commitment subject to minimal constraints on their research.
For example, if you were seeking advances in fundamental physics, you would seek out extraordinary physicists whose talents are clear but whose research interests lie outside of the prevailing conceptions of physics. Braben argues that the miraculous advances we saw in physics by the “Planck Club” during the early 20th century were largely due to those extraordinary individuals being left alone to do this kind of open-ended research.
Once selected, investigators would have a commitment of funding for many years, and the oversight of their work would be focused on ensuring they are able to be productive. Oversight would not be about directing their research or front-running its outcomes.
In this approach, it is essential there is no pre-conceived notion about what foundational knowledge will be produced, or, if a specific problem is the point of interest, what specific kinds of solutions we ought to expect to emerge. Investigators merely need to clear the bar of pursuing lines of inquiry plausibly useful for the problem at hand, and once that is cleared, a long-term commitment is made and the researcher has full autonomy.
Braben argues that our institutions have forgotten the difference in methods between incremental knowledge production and foundational knowledge production and that our present-day scientific institutions have biased rewards and incentives towards purely incrementalist approaches. The compounded effect decades later is there are few areas where we ought to expect these institutions to produce surprising new foundational knowledge in the way they once did.
Some institutions, like HARC, BP Venture Research, the Arc Institute have attempted to re-create these methods of long-term funding of investigators directly in the hope of producing new foundational knowledge in various fields. And, to his credit, Yudkowsky’s MIRI is the closest first order approximation of such an investigator-focused institution in the realm of aligning AI systems.
This institution’s criteria for investigators must avoid the pitfall identified by Braben of front-running the solution space. MIRI, for example, presumes mathematical logic and game theory is the path to solving AI alignment. This kind of narrow assumption is antithetical to this method of setting the stage for the production of foundational knowledge. It must be avoided if we are to expect any such knowledge to be produced.
But this is an essay about government interventionism in aligning AI. Should the government set up such an institution? In my opinion, no. But I do think we need a new, independent institution whose mission is foundational research that could plausibly assist with the alignment problem, one that takes this approach of long-term funding of extraordinary individuals with minimal constraints.
Foundational knowledge production is an AI problem
Assuming such an institution were created that funds investigators to perform alignment research, does the government have any role?
I believe there is a specific kind of government intervention which would radically improve the odds of success of such an institution successfully producing new foundational knowledge, while also being low risk in terms of potential abuse or capture by the government.
To see why, we must first notice that we are in a very strange situation: AI technology itself acts as a knowledge production accelerator. Investigators themselves will have higher success rates if they have access to state-of-the-art artificial intelligence.
This insight has immense implications.
First, it implies that capability decelerationist and knowledge accelerationist policy goals are in direct conflict with one another.
If we know we must produce new foundational knowledge to de-risk AI, but AI itself improves the success rate or shortens the timeline towards producing it, decelerating AI non-uniformly across the world would risk scenarios where the knowledge we needed could have been produced in time, but wasn’t. If we artificially slow down the technology among the actors who would apply it in this way, it means the remaining actors who won’t will barge ahead and we’ll hit the risk regime crossovers with less expected foundational knowledge than we would have otherwise.
If you buy that foundational knowledge production is essential to de-risking AI, uneven AI deceleration could greatly increase AI risk! Given there is no way to evenly decelerate AI across all global actors, despite claims to the contrary relying upon many layers of magical thinking, this is an important realization. It implies we should absolutely stop asking the government to focus on decelerating AI, and pivot to leverging the power of the state to accelerate the investigators in pursuit of new foundational knowledge that can plausibly help with AI alignment.
Second, the insight that AI itself accelerates knowledge production points to a role the government can play in improving the odds for investigators: it can ensure all investigators have access to state-of-the-art AI!
How would this work?
To maximize their odds of success, investigators will need full access to the very edge of the frontier of artificial intelligence. This means:
Access to all frontier models and datasets, by all companies
Documentation, code, and other resources to enable their use
A large pool of compute cycles on state-of-the-art hardware to use them
These are not things we ought to expect all companies and entities working on AI to voluntarily give up.
So, here is my proposal:
Investigators in the research institute are signing up for something akin to a “tour of duty.” They would have the full commitment of several years of funding, but in exchange for this commitment, they must complete their term and would be barred from re-entering industry for some reasonable grace period after their term is complete. They would also need to be highly vetted, with full background checks, and would also be potentially subject to some level of personal surveillance on the part of the institute.
In turn, to maximize their odds of success, they are afforded access to all state-of-the-art AI. The investigators have full autonomy to request anything they want from any AI company, such as access to inference hardware, data, models, and so on. Leaking any information outside of their investigation team would be a serious criminal offense. If a company does not comply with an investigator’s request, it would be reviewed by the institute, and if deemed reasonable (under a structured framework) the government can be called in to enforce it, levying fines or other penalties for non-compliance.
The intended effect of this is investigators are operating at the frontier of knowledge production capacity offered by AI systems, a frontier the public does not necessarily have access to, nor does any single company have access to for that matter. (This is why they cannot re-enter industry immediately after their term.) By keeping the government enforcement as an option of last resort called upon by the institution itself, it minimizes the risk the government will abuse its power here. The government’s enforcement is always localized: it directs a specific entity to yield specific things to specific investigators.
Proposals, summarized
To summarize: if we wish to de-risk AI, we must accelerate the production of both incremental and foundational knowledge. The need to de-risk sovereign agentic AI is real and significant, and doesn’t require taking on any assumptions about intelligence explosions, recursive self-improvement, or alien superintelligence.
This essay puts forward the following proposals for government interventions:
The government must incentivize the creation of alignment R&D programs within AI companies and their dissemination of new knowledge around solving AI alignment. They can do this by introducing tax incentives, IP incentives, reporting requirements, and so on.
A new AI alignment research institution should be created via a public and private consortium, focused on directly funding extraordinary investigators for long periods with minimal constraints. Once funded, investigators should have full autonomy and scientific freedom, and any oversight should not involve influencing their research.
This institution should be obligated to eventually publish all of the work of its investigators to the public.
Investigators will engage in a several-year tour-of-duty, with a funding commitment and, crucially, full access to all state-of-the-art artificial intelligence technology, not for auditing or investigating it, but for using it as a force multiplier in their work, across all entities operating under US law, backed by state enforcement. In exchange for this privilege, they cannot re-enter industry for some grace period.
Finally, the government should abandon policies which intend to decelerate AI technology in the West. These goals are in direct conflict with the goal of accelerating knowledge production, and would increase the risk of an AI catastrophe given capabilities would push forward but knowledge production around alignment may halt entirely.
I have not seen the above proposals presented anywhere in the present discourse regarding government interventions with AI technology, and I hope this essay was useful towards disseminating and motivating them.
Excellent essay! Thank you so much for clearly dissecting the problems at hand and offering very practical solution proposals. Moreover the importance of this essay cannot be overstated in terms of relocating the problem of AI alignment from the realms of „esoteric existentialism“ (however probable one might think such scenarios to be) to very real and certainly highly probable scenarios in the not too distant future.
A really good bit on "foundational knowledge". It definitely produced its fair share of quacks and pseudoscientists back in the day (and in Germany or Russia), but there does seem to be a palpable reduction in the rate of breakthroughs (unless mankind is nearing the exhaustion of the laws of physics to discover).
Is Greg Fodor's argument have anything to do with Mo Gawdat's? Gawdat's seems to be roughly about curating training data... (Apologies if I'm talking nonsense, my IQ is way too low for these matters.)