This document describes the methodology behind the AI Consciousness Tracker's Consciousness Index. It is intended as a reference for readers who want to evaluate the index's conclusions against its assumptions, disagree with specific choices in an informed way, and build their own assessments using the same framework. We state our choices explicitly, cite our sources, quantify our uncertainty where we can, and flag the parts we cannot.
The methodology paper is a standard genre in risk-assessment literature: it documents how a score is constructed so that the score can be contested, replicated, or extended. We follow that convention here. Every factor definition, weight, threshold, and data source used by the live tracker is described below.
1. WHAT THE INDEX IS AND IS NOT
The Consciousness Index is a composite indicator of conditions relevant to AI risk, not a forecast. It produces a single number on a 0.00–1.00 scale along with a qualitative status label. The number is a communication convenience; the underlying structure is the five-factor decomposition described in Section 3.
Several negative framings are important to establish upfront because they distinguish the CI from adjacent concepts.
The CI is not a probability. A composite score of 0.77 does not mean a 77% chance of anything in particular. Probability statements about unprecedented events are epistemically fraught for reasons documented in Tetlock's superforecasting research and in the long literature on calibration. The CI deliberately avoids asserting probabilities it cannot justify. It tracks conditions that are relevant to risk.
The CI is not a scientific measurement. There is no ground truth against which to validate the index. It is a structured editorial assessment, transparent about its assumptions, and it should be read as such. It is closer in spirit to the Bulletin of the Atomic Scientists' Doomsday Clock than to a natural-science measurement.
The CI is not a policy instrument. It does not prescribe action. Users may reasonably look at the same number and reach different conclusions about what to do. The tracker's purpose is to make the underlying factors legible, not to dictate a response.
The CI is not static. Factor values, weights, uncertainty ranges, and thresholds can all be updated as evidence accumulates. The version in the live tracker is the current best assessment; historical values are preserved in the source and viewable through the comparative timeline.
2. THEORETICAL FOUNDATIONS
The CI synthesizes three lineages from the AI risk literature.
Bostrom's capability-control framing. Superintelligence argues that the core of AI risk is the gap between the growing capability of AI systems and the mechanisms available for controlling them. Control includes alignment (getting the system to pursue the intended goal), corrigibility (preserving human ability to correct the system), and governance (the institutional frameworks within which development happens). The CI's Model Capability and Regulatory Gap factors directly operationalize this gap.
Russell's provably-beneficial AI framing. Human Compatible argues that the field's current approach, in which AI systems optimize fixed objectives, is structurally unsafe. Safety requires uncertainty about the objective itself, so that systems defer to human preferences and remain correctable. The CI's Incident Frequency factor tracks the empirical consequences of the current fixed-objective paradigm: the rate at which deployed systems produce outcomes their designers did not intend.
Ord's competitive-dynamics framing. The Precipice identifies competitive pressure between developers as a first-order amplifier of existential risk. Racing dynamics shrink the safety budget and erode voluntary restraint, converting latent risk into realized risk. The CI's Corporate Race factor operationalizes this directly.
The fifth factor—Awareness Gap—is not directly sourced from any one author but reflects the consistent finding in the AI safety literature that democratic accountability and public understanding are prerequisites for effective governance. Without adequate public awareness, regulatory and corporate dynamics operate without meaningful check.
3. THE FIVE FACTORS
Each factor is scored on a 0–100 scale. Higher scores indicate conditions more consistent with risk escalation. Each factor has a documented uncertainty range, a weight in the composite, and a set of proxy metrics that inform the assessment.
3.1 Model Capability (weight 25%, uncertainty ±6)
Definition: The aggregate capability of frontier AI models across reasoning, coding, scientific, creative, and multi-step task domains. This factor is meant to track whether AI systems are approaching or exceeding the capability thresholds at which self-improvement and autonomous operation become plausible.
Proxy metrics: MMLU composite, HumanEval pass@1, ARC-AGI, MATH benchmark, GPQA Diamond, and SWE-bench Verified. Benchmarks are tracked through aggregators such as Epoch AI and Papers With Code leaderboards.
Why 25% weight: Model capability is the foundational variable. Without capability, the other factors are either inert (regulation becomes technical decoration) or redundant (incidents decline as systems become less consequential). This factor receives the joint-highest weight because raw capability is what the other factors are amplifying or dampening. The Corporate Race factor receives the same weight for reasons explained in Section 3.4.
Uncertainty ±6: Benchmark scores are relatively well-measured, but aggregating across benchmarks and translating to a single 0–100 number introduces interpretive error. The ±6 range reflects typical disagreement between different aggregation schemes on the same underlying data.
3.2 Regulatory Gap (weight 20%, uncertainty ±10)
Definition: The distance between the rate at which AI capability is advancing and the rate at which enforceable governance is being established. Higher scores indicate a larger gap.
Proxy metrics: Ratio of nations with binding frontier-AI legislation to those without, average lag (in months) between capability deployment and corresponding regulatory response, coverage of international treaty mechanisms, and enforcement action count. Data is drawn from the OECD AI Policy Observatory, EU AI Act enforcement tracker, and the Stanford HAI AI Index Report.
Why 20% weight: Governance is a critical mediator between capability and outcome, but it operates with a significant lag and has limited effect on determined actors in an international context. The 20% weight reflects the judgment that regulation matters but cannot, by itself, close a sufficiently large capability gap.
Uncertainty ±10: Regulatory effectiveness is notoriously hard to measure. A statute on the books is not the same as enforcement, and enforcement varies across jurisdictions. The ±10 range reflects this larger assessment uncertainty.
3.3 Incident Frequency (weight 15%, uncertainty ±12)
Definition: The rate at which documented AI incidents, harms, and near-misses are being recorded across sectors. This factor grounds the index in empirical evidence rather than pure forecast.
Proxy metrics: Entries added to the AI Incident Database per quarter, AIAAIC Repository logs, NIST AI RMF incident reports, and major media-documented incidents. Counts are normalized against a 2020 baseline to produce a 0–100 scale.
Why 15% weight: Incidents are the most objective data available but also the most lagging indicator. By the time a harm is documented, the underlying conditions have already been in place. A moderate weight reflects the value of empirical grounding without overstating the predictive weight of lagging data.
Uncertainty ±12: Incident databases undercount significantly. Corporate and governmental incidents are often not reported, and reporting thresholds change over time. The ±12 range is larger than the ±10 assigned to Regulatory Gap because the underlying data is noisier.
3.4 Corporate Race (weight 25%, uncertainty ±8)
Definition: The intensity of competitive pressure between frontier AI labs. This factor operationalizes Ord's competitive-dynamics amplifier.
Proxy metrics: Release cadence of frontier models, disclosed departures from safety teams, venture capital funding velocity, public statements indicating competitive urgency, and the gap between internal safety assessments and public deployment decisions. Data sources include Epoch AI compute trends, PitchBook AI funding data, and primary lab communications.
Why 25% weight: This is the factor that converts potential risk into realized risk. Capability alone is not dangerous if it is developed under conditions of adequate caution. Capability developed under competitive pressure that erodes caution is dangerous. The Corporate Race factor receives the joint-highest weight because it is the primary mechanism by which the Bostrom capability-control gap expands in practice.
Uncertainty ±8: Competitive intensity is harder to measure than benchmark capability but easier than regulatory effectiveness. Public signals (releases, departures, funding) are relatively observable; internal safety culture and the actual internal cost of deployment decisions are not.
3.5 Awareness Gap (weight 15%, uncertainty ±15)
Definition: The gap between risk levels assessed by AI researchers and risk levels understood by the general public and policymakers. Higher scores indicate larger gaps.
Proxy metrics: Comparison of expert surveys with public polling on AI risk, media coverage accuracy assessments, policymaker AI literacy surveys, and public search trend analysis.
Why 15% weight: Awareness is a slow-moving variable with cumulative effects. It influences the ceiling of what regulatory and civic responses are possible but does not directly cause harm or capability. The 15% weight reflects this secondary-but-important role.
Uncertainty ±15: The largest uncertainty among the five factors. Public attitudes on AI are rapidly shifting, survey methodology is inconsistent across sources, and the conceptual framing of "AI risk" varies enough that survey questions that appear to measure the same thing often do not.
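The weights and uncertainty ranges stated in Sections 3.1 through 3.5 can be collected in one table and sanity-checked. The following is a minimal sketch, not the tracker's source: the `Factor` class and the `FACTORS` tuple are hypothetical names introduced here for illustration, and only the numbers come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Factor:
    name: str
    weight: float       # share of the composite, as a fraction of 1
    uncertainty: float  # half-width of the assessed range, in points


# Values copied from Sections 3.1-3.5; the container itself is an assumption.
FACTORS = (
    Factor("Model Capability", 0.25, 6),
    Factor("Regulatory Gap", 0.20, 10),
    Factor("Incident Frequency", 0.15, 12),
    Factor("Corporate Race", 0.25, 8),
    Factor("Awareness Gap", 0.15, 15),
)

# The default weights should account for the whole composite.
total_weight = sum(f.weight for f in FACTORS)

# The factor with the widest assessed range.
widest = max(FACTORS, key=lambda f: f.uncertainty)

print(f"total weight: {total_weight:.2f}")
print(f"widest range: {widest.name} (±{widest.uncertainty})")
```

A check like this is useful mainly when experimenting with alternative weightings, where it is easy to end up with weights that no longer sum to 1.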
4. THE COMPOSITE FORMULA
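The composite score is a weighted linear sum of the five factor scores. With weights $w_i$ expressed as fractions that sum to 1 and factor scores $v_i$ on the 0–100 scale:
$$CI_{\text{raw}} = \sum_{i=1}^{5} w_i \cdot v_i$$
$$CI_{\text{normalized}} = CI_{\text{raw}} / 100$$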
The raw composite is a number between 0 and 100; the normalized index is the same number expressed on a 0–1 scale for display. Both are reported in the tracker interface. The linear-sum model is the simplest structure that preserves the ability to talk about individual factors while still producing an aggregate. We considered and rejected multiplicative models and nonlinear weighting for reasons discussed in Section 7.
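As an illustration of the weighted linear sum, here is a short sketch using the default weights from Section 3. The factor values in `example_values` are hypothetical inputs chosen for the example, not the tracker's current assessments, and the function and key names are assumptions introduced here.

```python
# Default weights from Section 3, as fractions that sum to 1.
DEFAULT_WEIGHTS = {
    "model_capability": 0.25,
    "regulatory_gap": 0.20,
    "incident_frequency": 0.15,
    "corporate_race": 0.25,
    "awareness_gap": 0.15,
}


def composite(values, weights=DEFAULT_WEIGHTS):
    """Return (raw, normalized): raw on 0-100, normalized on 0-1."""
    missing = set(weights) - set(values)
    if missing:
        raise ValueError(f"missing factor values: {sorted(missing)}")
    for name in weights:
        if not 0 <= values[name] <= 100:
            raise ValueError(f"{name} must be within 0-100")
    raw = sum(weights[name] * values[name] for name in weights)
    return raw, raw / 100


# Hypothetical factor scores, for illustration only.
example_values = {
    "model_capability": 80,
    "regulatory_gap": 70,
    "incident_frequency": 60,
    "corporate_race": 90,
    "awareness_gap": 70,
}

raw, normalized = composite(example_values)
print(f"raw={raw:.1f} normalized={normalized:.2f}")
```

With these inputs the five weighted terms are 20, 14, 9, 22.5, and 10.5, so the raw composite is 76 and the normalized index is 0.76.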
Confidence intervals on the composite are computed by propagating the per-factor uncertainty through the weights. Given that each factor $F_i$ has an assessed value $v_i$ and an uncertainty $u_i$, the composite range is:
$$CI_{\text{low}} = \sum_i w_i \cdot \max(0,\ v_i - u_i) / 100$$
$$CI_{\text{high}} = \sum_i w_i \cdot \min(100,\ v_i + u_i) / 100$$
This is not a formal confidence interval in the statistical sense; it is a range that reflects the assessed uncertainty of each factor. Readers familiar with error propagation will recognize this as a worst-case additive bound rather than a root-sum-square combination. We use the worst-case bound because the underlying uncertainties are not statistical noise but editorial judgment ranges, and they should not be assumed independent.
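The difference between the worst-case bound and a root-sum-square combination can be made concrete with a small sketch. It assumes the lower bound mirrors the upper one, clipping at 0 with max(0, v − u); the factor values are hypothetical, and the root-sum-square figure is shown only for comparison, since the index does not use it.

```python
import math

# (weight, uncertainty) per factor, from Section 3.
FACTORS = {
    "model_capability": (0.25, 6),
    "regulatory_gap": (0.20, 10),
    "incident_frequency": (0.15, 12),
    "corporate_race": (0.25, 8),
    "awareness_gap": (0.15, 15),
}


def composite_range(values):
    """Worst-case additive range on the normalized 0-1 scale."""
    low = sum(w * max(0.0, values[n] - u) for n, (w, u) in FACTORS.items())
    high = sum(w * min(100.0, values[n] + u) for n, (w, u) in FACTORS.items())
    return low / 100, high / 100


def rss_half_width():
    """Root-sum-square half-width; valid only for independent errors."""
    return math.sqrt(sum((w * u) ** 2 for w, u in FACTORS.values())) / 100


# Hypothetical factor scores, for illustration only.
example_values = {
    "model_capability": 80,
    "regulatory_gap": 70,
    "incident_frequency": 60,
    "corporate_race": 90,
    "awareness_gap": 70,
}

low, high = composite_range(example_values)
worst_case = (high - low) / 2
rss = rss_half_width()
print(f"range: {low:.4f} to {high:.4f}")
print(f"half-width: worst-case {worst_case:.4f}, root-sum-square {rss:.4f}")
```

For these inputs no factor hits the 0 or 100 clip, so the worst-case half-width is simply the weighted sum of the uncertainties (0.0955), a little more than twice the root-sum-square figure (about 0.043). The gap is the price of not assuming independence.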
5. STATUS LEVELS
The raw composite score (0–100 scale) maps to six status labels with the following thresholds (as implemented in the tracker's source):
- DORMANT — composite < 30
- MONITORING — 30 ≤ composite < 50
- ELEVATED — 50 ≤ composite < 65
- ACCELERATING — 65 ≤ composite < 80
- CRITICAL — 80 ≤ composite < 90
- CONVERGENCE — composite ≥ 90
These thresholds are not derived from data; they are chosen to partition the 0–100 range into categories whose labels communicate the intended interpretation. They should be treated as linguistic compression of the underlying score, not as validated phase transitions.
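The threshold table translates directly into a lookup. The sketch below assumes the input is the raw composite on the 0–100 scale; `status_label` and `THRESHOLDS` are hypothetical names, not identifiers from the tracker's source.

```python
# Lower bounds (inclusive) for each label, highest first.
THRESHOLDS = (
    (90, "CONVERGENCE"),
    (80, "CRITICAL"),
    (65, "ACCELERATING"),
    (50, "ELEVATED"),
    (30, "MONITORING"),
)


def status_label(raw_composite):
    """Map a raw composite (0-100) to its status label."""
    if not 0 <= raw_composite <= 100:
        raise ValueError("raw composite must be within 0-100")
    for floor, label in THRESHOLDS:
        if raw_composite >= floor:
            return label
    return "DORMANT"  # anything below 30


print(status_label(77))  # a normalized index of 0.77 corresponds to raw 77
```

Because the lower bounds are inclusive, a composite of exactly 65 is ACCELERATING and 64.9 is ELEVATED, which illustrates how a small change near a boundary flips the label without any meaningful change in conditions.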
6. PERSPECTIVES AND USER OVERRIDES
The tracker provides five preset weight profiles—"perspectives"—that users can switch between without editing any values. These are documented in the live tracker and in the source. They represent different legitimate priority structures, not different objective facts:
- Default (0.25 / 0.20 / 0.15 / 0.25 / 0.15): as described above.
- AI Researcher (0.35 / 0.10 / 0.10 / 0.30 / 0.15): prioritizes capability growth and competitive dynamics; deprioritizes regulation and incidents because the researcher lens focuses on technical progress rather than institutional response.
- Policymaker (0.15 / 0.35 / 0.15 / 0.15 / 0.20): prioritizes governance and awareness; reflects the lens in which regulatory and civic variables are the primary levers.
- Military Strategist (0.30 / 0.10 / 0.20 / 0.20 / 0.20): prioritizes capability and incidents; reflects the lens of a threat analyst assessing adversary capabilities and operational failures.
- Civil Liberties (0.10 / 0.25 / 0.15 / 0.20 / 0.30): prioritizes awareness, regulation, and race dynamics; reflects the lens of a civil-society analyst concerned with democratic control.
Applied to identical factor values, the profiles generally produce different composite scores. The tracker treats none as privileged. The default is the editorial baseline for the main display, but any perspective can be selected and its composite read from the interface. This is a deliberate design commitment: the index is a framework for disciplined thinking, not a single verdict.
In addition to presets, users can directly adjust any factor value via sliders in the main interface. Adjustments persist in local storage and are fully reversible.
7. ALTERNATIVES CONSIDERED AND REJECTED
Several more elaborate models were considered during the design of the index. We rejected them for reasons documented here.
Multiplicative composition. A model of the form CI = f1 · f2 · ... would amplify situations in which every factor is bad simultaneously. This captures the intuition that risks compound. We rejected it because any single factor at zero would force the composite to zero: one healthy domain would erase serious problems in the other four, which misrepresents how risk actually behaves.
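The zero-factor problem is easy to demonstrate. The sketch below compares the linear sum with one possible multiplicative form, in which each factor is first scaled to 0–1 and the product is scaled back to 0–100. That particular form is an assumption made for illustration, and the factor values are hypothetical.

```python
import math

DEFAULT_WEIGHTS = (0.25, 0.20, 0.15, 0.25, 0.15)


def linear(values, weights=DEFAULT_WEIGHTS):
    """Weighted linear sum on the 0-100 scale."""
    return sum(w * v for w, v in zip(weights, values))


def multiplicative(values):
    """Product of factors scaled to 0-1, returned on the 0-100 scale."""
    return 100 * math.prod(v / 100 for v in values)


# Four factors at 90, Incident Frequency (third position) at 0.
values = (90, 90, 0, 90, 90)

linear_score = linear(values)
product_score = multiplicative(values)
print(f"linear={linear_score:.1f} multiplicative={product_score:.1f}")
```

The linear model still reports 76.5 because the four high factors carry 85% of the weight, while the product collapses to 0.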
Nonlinear thresholds. A model with sharp transitions at specific factor values would better capture phase-transition dynamics. We rejected it because the evidence for specific thresholds is weak, and arbitrary thresholds would add to the false-precision problem without adding to accuracy.
Bayesian credence with explicit priors. A probabilistic model with documented priors would be more rigorous in principle. We rejected it because the underlying uncertainties are not well-modeled as probabilities, the base rates are not meaningfully estimable, and the output would still be an editorial assessment dressed in a more formal-looking notation. Tetlock's work is directly relevant: unconditional probability estimates for unprecedented events are typically no better than well-reasoned qualitative assessments.
Algorithmic factor scoring. A fully automated pipeline that computed factor scores from live data would be reproducible and auditable. We rejected it as the single scoring mechanism because the factors involve judgment calls that do not reduce cleanly to measurable inputs (what counts as "regulatory response"? what counts as "competitive pressure"?). Automation is used to gather proxy data; the final factor scores remain editorial.
8. KNOWN LIMITATIONS
We believe in stating limitations clearly. The following are the ones we consider most important.
Linearity assumption. The weighted-linear-sum model assumes that factors contribute additively, with no interaction terms: each factor's effect on the composite is the same regardless of the others' values. In reality they interact nonlinearly. A high Corporate Race score is more dangerous when Model Capability is also high than when it is low. The linear model misses this. We judge the error to be acceptable for the index's purpose (condition communication) but not for more rigorous uses.
Assessor bias. The factor scores are editorial judgments. Different assessors with different priors would produce different scores from the same underlying evidence. We mitigate this by documenting reasoning, providing proxy sources, and making all values adjustable by users. We do not claim to have eliminated assessor bias.
False precision. A composite of 0.77 feels more authoritative than a narrative assessment. Users may treat the number as more precise than it is. The live tracker mitigates this with confidence intervals and explicit "this is not a prediction" language, but the risk is real and cannot be fully eliminated.
Lagging indicators. Several proxy metrics (incident counts, regulatory actions) lag the underlying reality by weeks or months. The index cannot detect sudden changes in conditions until the proxies update.
Coverage gaps. The five factors do not exhaust the domain. Supply chain dependencies, hardware concentration, geopolitical dynamics, and cultural readiness are all relevant to AI risk but are not directly modeled. We address some of these in adjacent tracker features (the Global Threat Map, the Containment Readiness tracker, the Historical Analogies section) but they are not inputs to the CI itself.
Static historical data. The historical time series embedded in the tracker source is a snapshot. Values from 2020 onward are our current reconstruction of past conditions, not a live feed. Updates to these historical values happen when we revise assessments, which is itself an editorial process.
9. COMPARISON TO OTHER RISK INSTRUMENTS
Several other organizations maintain risk indicators relevant to AI. The CI occupies a distinct niche in this ecosystem.
The Bulletin of the Atomic Scientists' Doomsday Clock is the closest structural analog. It is expert-assessed and qualitative, compresses its judgment into a single symbolic value (the time remaining to midnight), and covers existential risk broadly. It updates on a slower cadence (roughly annually) and covers multiple threats. The CI updates more frequently, is narrower (AI-specific), and exposes the underlying factors rather than presenting only a summary.
Stanford HAI's AI Index Report is more comprehensive and more data-driven than the CI, covering technical progress, economic impact, and policy across several hundred pages annually. It is descriptive rather than prescriptive: it documents trends without asserting that they are dangerous. The CI takes editorial positions that the AI Index does not.
Research centers such as the Future of Humanity Institute and the Centre for the Study of Existential Risk, along with individual researchers, publish aggregated expert-elicitation estimates of the probability of AI catastrophe. These are more rigorous than the CI for their narrow question (probability) but less useful for the CI's question (what are current conditions?). The two approaches are complementary.
The AI Incident Database and AIAAIC Repository catalog individual incidents but do not aggregate them into an index. The CI uses their data as an input to its Incident Frequency factor.
The CI's design choices are justified by its intended audience and use case: a general reader who wants to understand current AI risk conditions quickly, with the option to drill into the underlying structure, and who should not be given a false sense of precision about the future. It is less rigorous than academic risk estimates and more structured than journalistic coverage—a bridge between them.
10. VERSION HISTORY AND CHANGE LOG
The CI is a versioned artifact. The current live version is 0.1. Substantive changes to the methodology (adding or removing a factor, changing a weight by more than 0.05, changing a threshold, or adjusting an uncertainty range by more than 3 points) are treated as new versions and noted here when they occur.
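The versioning rule can be expressed as a simple comparison between two methodology snapshots. This is a sketch under stated assumptions: the snapshot layout (a `factors` mapping of name to weight and uncertainty, plus a `thresholds` tuple) and the function name are hypothetical, and a small tolerance is used so that floating-point noise does not turn a change of exactly 0.05 into a new version.

```python
WEIGHT_LIMIT = 0.05      # weight changes larger than this are substantive
UNCERTAINTY_LIMIT = 3    # uncertainty changes larger than this are substantive
EPSILON = 1e-9           # tolerance for floating-point comparison


def is_substantive_change(old, new):
    """Return True if the change from old to new warrants a new version."""
    # Adding or removing a factor.
    if set(old["factors"]) != set(new["factors"]):
        return True
    # Any change to a status threshold.
    if old["thresholds"] != new["thresholds"]:
        return True
    for name, (old_w, old_u) in old["factors"].items():
        new_w, new_u = new["factors"][name]
        if abs(new_w - old_w) > WEIGHT_LIMIT + EPSILON:
            return True
        if abs(new_u - old_u) > UNCERTAINTY_LIMIT + EPSILON:
            return True
    return False


# Hypothetical two-factor snapshots, for illustration only.
v_old = {"factors": {"a": (0.60, 6), "b": (0.40, 10)}, "thresholds": (30, 50, 65, 80, 90)}
v_minor = {"factors": {"a": (0.62, 8), "b": (0.38, 10)}, "thresholds": (30, 50, 65, 80, 90)}
v_major = {"factors": {"a": (0.70, 6), "b": (0.30, 10)}, "thresholds": (30, 50, 65, 80, 90)}

print(is_substantive_change(v_old, v_minor))
print(is_substantive_change(v_old, v_major))
```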
Changes to the underlying factor values (as evidence accumulates) are not versioned; they are continuous updates that reflect the best current assessment. The historical time series in the tracker source preserves prior assessments so that changes over time are visible.
Readers who want to build on or disagree with the methodology are encouraged to cite the specific version they are engaging with. This document, and the tracker source code on GitHub, are the authoritative references.