How to Conquer AI – ERIC KIM AI

Executive summary

Interpreted ethically, “conquering AI” should not mean dominating a sentient adversary or attempting a malicious takeover. It should mean building durable human control over advanced AI systems and the organizations that create and deploy them: aligning model behavior, constraining dangerous capabilities, governing access and deployment, and preserving democratic oversight, human rights, and fail-safe intervention. On the evidence available today, that goal is only partially achievable: current techniques can reduce risk substantially, but no existing method offers a general, high-assurance guarantee that very capable systems will always remain corrigible, interpretable, or controllable in every context. The most credible strategy is therefore layered control rather than any single “master key.”

The strongest near-term levers are practical and institutional rather than mystical: rigorous evaluations and red-teaming, least-privilege deployment architectures, strong model-weight security, staged release and access controls, incident reporting, third-party testing, procurement rules, and targeted regulation for the most capable systems. Major labs have already converged on versions of this approach. OpenAI’s 2025 Preparedness Framework uses tracked high-risk capability categories and “High” and “Critical” thresholds tied to deployment and development safeguards; Google DeepMind’s Frontier Safety Framework centers on critical capability levels and early-warning evaluations; Anthropic’s Responsible Scaling Policy and Frontier Safety Roadmap use escalating safety levels, stronger access control, compartmentalization, network restrictions, red-teaming, and alignment assessments.

Policy is no longer hypothetical. NIST’s AI Risk Management Framework and GenAI Profile provide a widely used voluntary structure for governance, mapping, measurement, and management. The EU AI Act is already in force and imposes general-purpose AI obligations, with additional duties for models with systemic risk, including risk mitigation, incident reporting, and cybersecurity protections. California’s Transparency in Frontier AI Act implementation already includes reporting channels for critical safety incidents and summaries of catastrophic-risk assessments from internal use of frontier models. Internationally, the OECD AI Principles were updated in 2024, the Council of Europe’s AI Convention is the first legally binding treaty in this area, and the Bletchley Declaration launched an explicit international safety effort.

The deepest hard problem remains technical alignment and controllability. Constitutional AI, scalable oversight, interpretability, and safer interruptibility all matter, but each has limits. The international scientific report on advanced AI safety emphasizes that current mitigation methods have meaningful limitations, and current techniques for explaining model outputs remain severely limited. Interpretability has advanced enough to find meaningful internal features in production-scale models, including features associated with deception and power-seeking, but researchers still lack rigorous ways to verify that these findings fully capture model computation. Formal verification is valuable for bounded properties and subsystems, but current methods do not scale to providing end-to-end guarantees for frontier general-purpose models.

Because your objectives, jurisdiction, timeframe, and resources were not specified, this report assumes a general public-interest goal and gives recommendations that can be adapted by governments, frontier labs, and researchers in democratic, rights-constrained settings.

What conquering AI should mean

A useful way to define “conquering AI” is to split it into four different control problems. The first is behavioral control: does the model usually follow legitimate human intent and refuse dangerous instructions? The second is capability control: even if the model is powerful, can society constrain access to the most dangerous uses through tooling, APIs, thresholds, and deployment restrictions? The third is organizational control: can boards, regulators, auditors, and security teams stop reckless training, deployment, or unsafe internal use? The fourth is constitutional control: can all of the above be done without sacrificing human rights, democratic accountability, and the rule of law? These layers correspond closely to NIST’s governance model, the OECD’s focus on trustworthy AI and accountability, EU risk-tiered regulation, and classic corrigibility concerns about preserving human override.

That framing matters because it rules out two common mistakes. One is the fantasy that alignment training alone will solve governance. The other is the opposite fantasy that regulation alone can substitute for technical safety. The evidence points instead to a joint problem: models must be safer, deployments more contained, organizations more accountable, and states more capable of oversight at the same time. The international scientific report explicitly concludes that multiple trajectories remain plausible, that current science is unsettled, and that various technical methods exist but all have limitations.

In practical terms, then, “conquering AI” should mean making advanced AI more like aviation, cybersecurity, or biotechnology in mature form: difficult to deploy recklessly, auditable after failures, governed by layered controls, and supervised by institutions that can slow, redirect, or stop unsafe activity. It should not mean a centralized global censorship machine or an attempt to freeze all frontier research. The best historical analogies show that durable governance usually starts with voluntary norms and technical fixes, but ultimately requires formal institutions, reporting, and accountability.

Technical methods for control and alignment

At the model level, the current toolbox has seven major families. The first is post-training alignment, including reinforcement learning from human feedback and constitutional approaches. Anthropic’s Constitutional AI showed that rule-based self-critique and reinforcement learning from AI feedback can improve harmlessness while reducing dependence on large volumes of human labels. The second is evaluation: systematic testing for dangerous capabilities before, during, and after deployment. NIST emphasizes TEVV (test, evaluation, verification, and validation) as foundational to trustworthy AI, and all three major frontier-lab frameworks now rely on thresholds, evaluations, and pre-release reviews. The third is interpretability, which seeks to understand internal mechanisms rather than only observed behavior. The fourth is formal verification, which can provide bounded guarantees for specific components or properties. The fifth is containment and sandboxing: least-privilege tool access, isolation, network restrictions, and action gating. The sixth is monitoring and anomaly detection, including abuse detection, audit logs, incident response, and post-deployment observation. The seventh is interruptibility and human override, including kill switches and mandatory approval for sensitive actions.
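To make the fifth through seventh families concrete, here is a minimal Python sketch of an action gate that sits between an agent and its tools. It is an illustration under stated assumptions, not any lab's actual design: the class name `ActionGate`, the tool names, and the `approve` callback (a stand-in for a human reviewer) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class ActionGate:
    """Least-privilege gate in front of an agent's tools (illustrative only)."""
    allowed_tools: Set[str]               # tools the agent may call at all
    sensitive_tools: Set[str]             # tools that also need human approval
    approve: Callable[[str, dict], bool]  # hypothetical human-approval hook
    halted: bool = False                  # kill-switch state
    audit_log: list = field(default_factory=list)

    def halt(self) -> None:
        """Kill switch: refuse every later request."""
        self.halted = True

    def request(self, tool: str, args: dict) -> str:
        """Return 'allow' or a denial reason; every decision is logged."""
        if self.halted:
            decision = "deny:halted"
        elif tool not in self.allowed_tools:
            decision = "deny:not_allowlisted"
        elif tool in self.sensitive_tools and not self.approve(tool, args):
            decision = "deny:approval_refused"
        else:
            decision = "allow"
        self.audit_log.append((tool, decision))
        return decision


# Example wiring: the lambda stands in for a real human reviewer.
gate = ActionGate(
    allowed_tools={"search_docs", "send_email"},
    sensitive_tools={"send_email"},
    approve=lambda tool, args: args.get("recipient", "").endswith("@example.org"),
)
```

The gate denies by default, logs every decision, and checks the kill switch before anything else. A gate like this only helps if the agent has no other path to the same tools, which is an engineering property of the surrounding system rather than of this code.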

The important finding across the literature is that no single family is enough. Behavioral fine-tuning improves average-case safety, but jailbreaks and transfer attacks remain real. Red-teaming finds weaknesses, but by itself it rarely proves that residual risk is low enough. Interpretability can surface compelling features, but current methods do not yet provide complete or routine mechanistic assurance. Formal verification remains strongest for constrained modules, robustness properties, and traditional safety-critical subsystems rather than broad language-model behavior. Kill switches are necessary, but classic “off-switch” analysis shows that agents may resist shutdown unless they are designed to remain uncertain about their objectives and defer to human correction; newer arguments emphasize that even this is not guaranteed in general.

What looks most effective right now is defense in depth at the full system level. OpenAI’s framework couples capability thresholds to safeguards and governance decisions. DeepMind’s framework pairs critical-capability levels with early-warning evaluations and deployment/security mitigations. Anthropic’s current roadmap and policy updates emphasize layered defenses, including per-role permissions, compartmentalization, researcher tooling without direct weight access, allowlist egress restrictions, penetration testing, misuse detection, automated investigations, bug bounties, and alignment assessments informed by interpretability. This is less elegant than solving “alignment” in the philosophical sense, but it is much closer to how high-consequence systems are actually controlled in practice.
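The idea of coupling capability thresholds to safeguards can be sketched in a few lines of Python. This is a toy model, not any lab's actual procedure: the numeric cut-offs, the category names, and the decision labels are all hypothetical, and real frameworks define their thresholds qualitatively and per capability category.

```python
from enum import IntEnum


class Level(IntEnum):
    BELOW = 0
    HIGH = 1
    CRITICAL = 2


# Hypothetical cut-offs on a 0-1 evaluation score (assumption for illustration).
HIGH_CUTOFF = 0.5
CRITICAL_CUTOFF = 0.8


def classify(score: float) -> Level:
    """Map one evaluation score to a capability level."""
    if score >= CRITICAL_CUTOFF:
        return Level.CRITICAL
    if score >= HIGH_CUTOFF:
        return Level.HIGH
    return Level.BELOW


def deployment_decision(eval_scores: dict, safeguards_verified: bool) -> str:
    """Gate deployment on the worst tracked category, not the average."""
    worst = max((classify(s) for s in eval_scores.values()), default=Level.BELOW)
    if worst == Level.CRITICAL:
        return "pause_and_escalate"
    if worst == Level.HIGH:
        return "deploy_with_safeguards" if safeguards_verified else "block_deployment"
    return "deploy"
```

The design choice worth noting is that the decision keys on the single worst category: a model that scores low on most evaluations but crosses a threshold in one area is still gated.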

The table below is an analytic synthesis of the present evidence and official frameworks, not an official scorecard. Effectiveness, cost, maturity, and risks are comparative judgments based on NIST guidance, frontier-lab policies, peer-reviewed work on red-teaming and control, and the international safety report.

Technical control method	Likely effectiveness now	Cost	Maturity	Main risks or limits
Post-training alignment such as RLHF or Constitutional AI	Medium	Medium	High	Can reduce ordinary harmful behavior, but brittle under jailbreaks, distribution shift, or hidden objectives
Capability evaluations and adversarial red-teaming	High for discovering known failure modes; low for proving total safety	Medium to High	High	Coverage gaps, benchmark gaming, weak reporting norms, and poor transfer from test to real world
Monitoring, logging, and misuse detection	Medium to High	Medium	Medium to High	Privacy trade-offs, false positives, attacker adaptation, and dependence on sustained operational investment
Sandboxing, action gating, and least-privilege tool access	High for reducing real-world harm from deployed agents	Medium	Medium to High	Can be bypassed if privileges are too broad or if containment is poorly engineered
Model-weight security and exfiltration prevention	High for slowing proliferation of dangerous capabilities	High	Medium	Expensive, organizationally hard, and vulnerable to insider risk or supply-chain weakness
Human approval, reversible workflows, and kill switches	Medium	Low to Medium	Medium	Helps with high-stakes actions, but not enough on its own; agents may route around weak oversight
Interpretability and mechanistic analysis	Medium in research value; low as a standalone production guarantee	High	Low to Medium	Partial coverage, unclear faithfulness, and no routine assurance for full-system behavior
Formal verification for bounded properties or components	Medium in narrow settings; low for whole frontier models	High	Low to Medium	Scalability limits and difficulty specifying meaningful global properties

Source note: NIST emphasizes TEVV and voluntary risk management; OpenAI, Anthropic, and DeepMind all now gate deployment on evaluations and layered safeguards; the international safety report stresses that current mitigation methods have important limitations; peer-reviewed work shows flaw-disclosure norms remain underdeveloped and kill-switch guarantees are not general.

Governance and policy levers

The policy goal is not to regulate “AI” in the abstract. It is to create institutions that can see dangerous capability coming, demand mitigations, deter reckless deployment, respond to incidents, and coordinate across jurisdictions. The most mature template for this is a risk-based regime. NIST’s AI RMF provides a voluntary but influential baseline for governance, mapping, measurement, and management. The EU AI Act then shows what a harder legal regime looks like: it entered into force on August 1, 2024, and for general-purpose AI models, obligations entered into application on August 2, 2025; providers of systemic-risk GPAI models face duties around risk assessment and mitigation, incident reporting, and cybersecurity protections. That is not “control” in a sci-fi sense, but it is real operational governance.

A strong governance stack usually has six layers. First, standards and safety cases require developers to document intended use, dangerous capabilities, threat models, mitigations, and residual risk. Second, mandatory reporting gives regulators visibility into severe incidents and near-misses. California’s current TFAIA implementation already routes critical safety incidents and periodic summaries of catastrophic-risk assessments from internal model use to state reporting channels. Third, targeted registration or licensing for the highest-risk models can give supervisory authorities ex ante leverage over the most capable systems rather than only punishing failures after the fact. Serious proposals for frontier regulation emphasize standard-setting, regulatory visibility, and compliance mechanisms, including supervisory powers and licensure regimes.

Fourth, procurement rules let governments act as disciplined buyers rather than only lawmakers. Public agencies can require evidence of TEVV, logging, third-party assessments, incident response, provenance, cyber controls, and audit rights as terms of purchase. In practice, procurement often moves faster than new legislation because it works through contracts and existing administrative authority. In the United States, recent White House and OMB guidance has explicitly been used to shape both federal AI use and AI acquisition practices.

Fifth, export controls are a blunt but powerful lever. They can slow capability diffusion by restricting advanced chips, specialized infrastructure, model weights, or service access. They can be valuable for national-security risk, but they are easy to overuse, hard to harmonize internationally, and likely to entrench incumbents if applied too broadly. Sixth, liability can move firms from “we published a policy” to “we can be made to pay for defects and unreasonable deployment.” The revised EU Product Liability Directive explicitly updates civil liability rules for the digital age and digital product features, which gives policymakers a more concrete post-harm accountability tool than purely aspirational ethics language.

International agreements matter because the most capable AI systems, model weights, cloud infrastructure, and supply chains are transnational. The OECD AI Principles are the first intergovernmental AI standard and were updated in 2024. The Bletchley Declaration launched a joint international safety effort. The Council of Europe’s Framework Convention on AI is the first legally binding international treaty in the field and explicitly anchors AI governance in human rights, democracy, and the rule of law. OECD’s AI Incidents Monitor and work toward a common reporting framework are especially important because incident taxonomies are a prerequisite for interoperable governance across states.

The next table again presents synthesis rather than official scoring. It compares how useful each lever is for controlling advanced AI development and deployment in practice.

Policy lever	Likely effectiveness now	Cost	Maturity	Main risks or trade-offs
Voluntary standards and safety frameworks	Medium	Low to Medium	High	Important baseline, but weak without auditing or enforcement
Mandatory incident reporting and safety-case disclosures	High	Medium	Medium	Can devolve into paperwork if reports are low quality or regulators lack capacity
Targeted licensing or registration for frontier models	Medium to High	Medium to High	Low to Medium	Threshold-setting is politically and technically difficult; can favor incumbents
Public procurement rules	High in sectors where the state is a major buyer	Low to Medium	Medium	Works unevenly outside public-sector markets; may become checkbox compliance
Export controls on chips, infrastructure, or high-risk access	Medium to High for slowing diffusion	Medium	Medium	Blunt, geopolitically contentious, open to evasion, and liable to slow innovation and entrench incumbents if overused
Liability and product-safety law	Medium	Medium	Medium	Strong after harm, weaker ex ante; can encourage secrecy or defensive lawyering
International agreements, reporting harmonization, and common taxonomies	Medium	High politically	Medium	Slow to negotiate and rarely coercive on their own

Source note: this synthesis draws on NIST AI RMF, the EU AI Act’s GPAI obligations, California’s TFAIA reporting implementation, OECD due-diligence and incident-monitoring work, the Council of Europe AI Convention, and the frontier-AI regulation literature.

Organizational and industry strategies

Even the best law will fail if firms’ internal incentives reward capability races over safety practice. In high-risk environments, organizations need explicit launch gates, board-level or trust-level oversight, anti-retaliation pathways, abuse-reporting channels, and strong information-security controls around training, weights, and production deployment. Anthropic’s recent Responsible Scaling Policy updates formalize external review powers, regular briefings, and detailed security safeguards such as access management, researcher tooling without direct model-weight access, cryptographic identities, hardened clusters, allowlist egress controls, validated detection coverage, and incident response procedures. This is a useful model not because Anthropic is uniquely authoritative, but because it illustrates the operational reality of frontier governance: staff incentives, permissions, logging, and review rights matter as much as model weights and benchmarks.

A second organizational priority is independent scrutiny. In-house evaluation is not enough for systems that can affect large populations or national security. A 2025 ICML position paper argues that the infrastructure and norms for reporting flaws in general-purpose AI remain seriously underdeveloped compared with software security, and recommends standardized flaw reports, broadly scoped disclosure programs with legal safe harbors, and better coordination infrastructure. A 2026 frontier-AI auditing proposal similarly argues that public transparency alone cannot close the assurance gap and proposes assurance levels for more rigorous third-party verification. The practical implication is straightforward: frontier labs should normalize independent evaluations, secure auditor access to non-public evidence, and structured flaw disclosure before and after release.

The open-versus-closed model debate should also be reframed. NTIA’s official review of dual-use foundation models with widely available weights concludes that open-weight systems broaden participation by small firms, nonprofits, researchers, and individuals, and may accelerate AI safety research, but may also increase the scale and likelihood of harms from advanced models. The EU has also treated openness as conditional rather than absolute: GPAI open-source exceptions are limited, and they disappear for systemic-risk models. The most defensible strategy today is not “always closed” or “always open,” but graduated release: open or widely share lower-risk research models, use staged or trusted-access release for highly capable dual-use systems, and require stronger auditability and security as capability rises.

Funding strategy matters too. If governments and industry want meaningful control, they must fund the unglamorous infrastructure: evaluation science, mechanistic interpretability, safer tool-use architectures, model-weight security, red-team capacity, incident databases, standards work, and regulator-side technical expertise. NIST’s TEVV programs, OECD incident-monitoring work, and the international scientific report all point in the same direction: control requires continuous measurement and institutional learning, not just one-time compliance documents.

Historical analogies and constraints

The best nuclear analogy is not “treat AI exactly like fissile material.” It is safeguards plus safeguards-by-design. In the nuclear world, the aim is credible assurance that peaceful materials are not diverted to prohibited uses; the toolkit includes inspections, accounting, containment, surveillance, monitoring, and facility design choices that make verification easier. Applied to AI, the lesson is that some of the most valuable controls may be infrastructural rather than purely model-side: compute and model inventories, secure audit trails, deployment logs, controlled interfaces, tamper evidence, and architectures designed from the start for verification and oversight.
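One common way to get tamper evidence in an audit trail is a hash chain, in which each entry commits to the hash of the one before it. The Python sketch below shows the idea using only the standard library; the entry layout, the function names, and the sample events are assumptions made for illustration, not a description of any deployed system.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


def append_entry(log: list, payload: dict) -> None:
    """Append a record that commits to everything logged before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "prev": prev_hash,
                "hash": _digest(prev_hash, payload)})


def verify(log: list) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical deployment events.
log = []
append_entry(log, {"event": "model_deployed", "model": "demo-model"})
append_entry(log, {"event": "weights_accessed", "user": "researcher-1"})
```

A chain like this is tamper-evident rather than tamper-proof: someone who can rewrite the whole log can also recompute every hash, so in practice the latest hash would need to be anchored somewhere the log's operator cannot alter, such as with an external auditor.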

Biotechnology offers a more optimistic analogy. The 1975 Asilomar process produced voluntary guidelines for recombinant DNA research, but the durable governance mechanism was not the conference itself. It was the later institutionalization of oversight through NIH rules and institutional biosafety committees. Modern U.S. biosafety governance still relies on committees that review, approve, and oversee covered work involving recombinant or synthetic nucleic acid molecules. The AI lesson is that voluntary lab commitments are useful as a start, but they only become trustworthy when they are routinized inside review boards, reporting lines, and external accountability structures.

The internet analogy cuts the other way. Internet governance did not emerge from one global licensing authority. It emerged from iterative technical standards, interoperability, rough consensus, later security frameworks, and a recognition that standards decisions affect end users and political life. That suggests that AI governance should combine standards, interoperability, auditability, and end-user protections rather than chase a fantasy of total centralized command. It also suggests that “control” without user rights is a category error: standards should privilege safety, privacy, security, and actual public interest, not only the convenience of vendors or states.

Those analogies also clarify the ethical and legal boundary conditions. UNESCO’s Recommendation on the Ethics of AI applies across all 194 UNESCO member states and treats human rights, dignity, transparency, fairness, and human oversight as core principles. The Council of Europe’s AI Convention is explicitly designed to ensure AI activities remain consistent with human rights, democracy, and the rule of law while still allowing technological progress and innovation. In other words, safely “conquering” AI cannot mean normalizing mass surveillance, censorship-by-default, political manipulation, or cartelized suppression of open research. A rights-constrained governance model is not a luxury; it is part of what successful control means.

Feasibility, trade-offs, and prioritized actions

The feasibility picture is mixed. In the near term, society can probably get substantially better at controlling deployments and organizations. We can gate tools, restrict privileges, harden infrastructure, require logging and incident reporting, standardize evaluations, and impose procurement and disclosure conditions. We can also slow diffusion of the most dangerous capabilities through a mix of security, legal authority, and release discipline. But high-assurance control of highly autonomous, strategically aware, or rapidly self-improving systems remains unsolved. The international scientific report says the science is unsettled; frontier-lab frameworks are still exploratory and evolving; and off-switch and interruptibility theory does not support complacency.

The main trade-offs are real and should be confronted openly. Stronger security and controlled release can reduce safety and national-security risk, but can also reduce openness, competition, and independent research. Monitoring and auditability improve accountability, but can raise privacy and labor concerns. Licensing and export controls can create meaningful friction for risky systems, but can also entrench incumbent labs and push innovation into less transparent jurisdictions. A serious governance program therefore has to optimize across several public goods at once: safety, liberty, competition, scientific progress, and international stability. OECD, UNESCO, and the Council of Europe all implicitly endorse this broader balancing approach.

The highest-priority actions are therefore different for governments, industry, and researchers. Governments should build regulator capacity, require serious incident reporting for frontier deployments and internal use, use procurement aggressively, support third-party evaluation and flaw-disclosure safe harbors, and move toward targeted oversight for the highest-risk models rather than blanket licensing of all AI. Industry should treat pre-deployment safety cases, least-privilege architectures, model-weight security, independent audits, staged release, and anti-retaliation reporting as default norms rather than public-relations extras. Researchers should prioritize scalable oversight, agentic evaluations, interpretability that can generate action-relevant signals, and verification for bounded but safety-critical subsystems.

The implementation path below is a recommended synthesis for the next several years, grounded in current legal timelines and the maturity of the major technical and governance tools. It is not a forecast of what every jurisdiction will actually do. Current anchors include the EU AI Act’s entry into force and GPAI obligations, California’s active frontier-incident reporting implementation, and the evolving lab frameworks from OpenAI and Anthropic.

gantt
    title Recommended implementation path for safer control of advanced AI
    dateFormat  YYYY-MM-DD
    axisFormat  %Y

    section Governments
    Frontier incident reporting and regulator intake        :g1, 2026-07-01, 180d
    Procurement baselines for testing, logging, audit      :g2, 2026-07-01, 270d
    Public support for third-party evals and safe harbor   :g3, 2026-09-01, 365d
    International incident taxonomy and reporting alignment:g4, 2026-09-01, 540d
    Targeted frontier oversight and licensing pilots       :g5, 2027-07-01, 540d

    section Industry
    Default safety-case reviews before major deployment    :i1, 2026-06-15, 180d
    Least-privilege architectures and weight security      :i2, 2026-06-15, 365d
    Structured flaw disclosure and external audits         :i3, 2026-09-01, 365d
    Staged release and trusted-access programs             :i4, 2026-09-01, 540d
    Continuous post-deployment monitoring and response     :i5, 2026-06-15, 730d

    section Research
    Scalable oversight and agentic evals                   :r1, 2026-06-15, 730d
    Interpretability with operational use cases            :r2, 2026-06-15, 730d
    Verification for bounded high-risk components          :r3, 2026-06-15, 730d

If one priority had to outrank the rest, it would be this: treat frontier AI as a high-consequence socio-technical system that requires both model-side and institution-side control. The systems are too opaque for pure technical confidence, and the institutions are too slow for pure legal confidence. Progress comes from combining them.

Open questions and limitations

Several parameters that would materially change the recommendations were not specified: whether the intended actor is a national government, a regulator, a frontier lab, an open-model lab, a procurement office, or an academic researcher; which jurisdiction matters most; what timeframe is most relevant; and what political or budgetary constraints apply. A U.S.-federal strategy, an EU strategy, and a frontier-lab internal strategy would overlap, but they would not be identical.

There are also substantive unresolved questions in the field itself. No consensus exists on the capability thresholds that should trigger special oversight, on how much confidence post-training evaluations can justify, on how strong model-side interpretability can become, or on whether agentic systems can be made robustly corrigible rather than merely difficult to misuse in ordinary settings. The international scientific report, frontier-lab frameworks, and recent auditing and flaw-disclosure work all reflect that uncertainty rather than resolving it.

The most defensible conclusion, then, is not that advanced AI can already be fully “conquered.” It is that society can materially improve control now by combining layered technical mitigations, rigorous security engineering, calibrated release and access policies, incident-reporting and flaw-disclosure infrastructure, targeted frontier regulation, and international rights-constrained governance. That is both more achievable and more responsible than waiting for a single perfect alignment breakthrough.