Every Statistic Is Earned, Not Assumed

A complete explanation of the systems, algorithms, and editorial standards behind every number in the Lighthouse database.

Last reviewed March 2026 · Version 2.4 · Independent audit scheduled for Q3 2026
4.6M+ verified statistics
A–D evidence grades
Full citation chains
Continuously updated
50+ industries covered

The 6-Stage Autonomous Pipeline

Every statistic in Lighthouse passes through a 6-stage deterministic verification pipeline before publication. No editorial shortcuts. No LLM-assigned grades.

Stage One
Collect

Continuous monitoring of 23,000+ live intelligence feeds: government statistical APIs, academic RSS feeds, industry report publishers, regulatory filings, and verified social signals. New data is ingested automatically across 13 languages, including from 27 government statistical APIs.

23K+ sources monitored
Stage Two
Extract

Named entity recognition and statistical pattern detection identify numeric claims, their context, units, and methodology signals. Each claim is tagged with an extraction confidence score before entering the grading queue.

Sub-100ms per stat
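
As a rough illustration of statistical pattern detection, the sketch below pulls numeric claims with a unit out of free text and keeps a window of surrounding context. It is a simplified stand-in, not the Lighthouse extractor: the regular expression, the `Claim` type, the `extract_claims` function, and the 40-character context window are all assumptions, and the extraction confidence score is not modelled.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern: a number (optionally with thousands separators or
# decimals) followed by a unit. Real extraction would also rely on NER.
NUMERIC_CLAIM = re.compile(
    r"(?P<value>\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?)"
    r"\s*(?P<unit>%|percent\b|million\b|billion\b)",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class Claim:
    value: float
    unit: str
    context: str


def extract_claims(text: str, window: int = 40) -> list:
    """Return numeric claims with their unit and nearby context."""
    claims = []
    for match in NUMERIC_CLAIM.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        claims.append(
            Claim(
                value=float(match.group("value").replace(",", "")),
                unit=match.group("unit").lower(),
                context=text[start:end].strip(),
            )
        )
    return claims
```

Bare numbers without a unit, such as years, are deliberately skipped in this sketch so that they are not mistaken for statistics.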
Stage Three
Deduplicate

Cross-source deduplication at scale eliminates redundancy before grading begins. Content-hashing identifies identical claims across multiple sources. Matching claims accumulate corroboration weight rather than creating duplicate entries.

Content-hash dedup
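
The idea of content-hash deduplication can be sketched as follows. This is a minimal illustration under stated assumptions: the claim fields (`metric`, `value`, `unit`, `source`), the normalisation rules, and the choice to count distinct sources as corroboration weight are hypothetical and not taken from the Lighthouse implementation.

```python
import hashlib


def claim_key(metric: str, value: float, unit: str) -> str:
    """Hash a normalised form of the claim so identical claims collide."""
    metric_norm = " ".join(metric.lower().split())  # collapse case and spacing
    canonical = f"{metric_norm}|{value:.6g}|{unit.strip().lower()}"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(claims: list) -> dict:
    """Merge identical claims; each distinct source adds corroboration weight."""
    merged = {}
    for claim in claims:
        key = claim_key(claim["metric"], claim["value"], claim["unit"])
        entry = merged.setdefault(
            key,
            {"metric": claim["metric"], "value": claim["value"],
             "unit": claim["unit"], "sources": set()},
        )
        entry["sources"].add(claim["source"])
    for entry in merged.values():
        # Assumption: weight is simply the number of distinct sources.
        entry["corroboration_weight"] = len(entry["sources"])
    return merged
```

Two feeds reporting the same figure therefore produce one entry with a weight of 2 rather than two rows.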
Stage Four
Grade

A 6-dimension deterministic scoring algorithm evaluates: source credibility, methodology rigor, sample adequacy, data recency, corroboration breadth, and independence scoring. Output: an A–D evidence grade plus a full confidence score breakdown. Zero editorial opinion involved.

6-dimension formula
Stage Five
Verify

A 10-step verification pipeline runs checks that include value precision (0.1% tolerance), source liveness (HTTP validation), staleness (90/180-day thresholds), cross-platform consistency, and contradiction detection. Every step is formula-based and fully reproducible.

2+ sources required
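
Two of the checks named above, value precision and the two-source minimum, are simple enough to sketch. The code below is an illustration only: `values_agree`, `run_checks`, and `StepResult` are hypothetical names, and treating the 0.1% tolerance as relative to the larger of the two values (the behaviour of `math.isclose`) is an assumption. Source liveness and the remaining steps are omitted.

```python
import math
from dataclasses import dataclass

REL_TOLERANCE = 0.001  # the 0.1% tolerance quoted in the text
MIN_SOURCES = 2        # the "2+ sources required" gate


@dataclass(frozen=True)
class StepResult:
    step: str
    passed: bool
    detail: str


def values_agree(reported: float, reference: float) -> bool:
    """True when the two values differ by at most 0.1% (relative)."""
    return math.isclose(reported, reference, rel_tol=REL_TOLERANCE)


def run_checks(reported: float, source_values: list) -> list:
    """Run both gates and return a pass/fail trace with the triggering values."""
    mismatches = [v for v in source_values if not values_agree(reported, v)]
    return [
        StepResult("value_precision", not mismatches,
                   f"mismatching source values: {mismatches}"),
        StepResult("min_sources", len(source_values) >= MIN_SOURCES,
                   f"{len(source_values)} source(s) found"),
    ]
```

Returning a result per step, rather than a single boolean, keeps the specific values that caused a failure available for later inspection.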
Stage Six
Publish

Only statistics passing all quality gates are published. Every stat carries a full 6-layer provenance chain and a citation safety classification: Safe to Cite, Use with Context, Use with Hedging, or Do Not Cite. Citation-ready within hours of source publication.

Citation-ready within hours

The A–D Evidence Grade System

Every published statistic is assigned one of four evidence grades based on a deterministic rubric — never editorial opinion. Grades reflect the quality, verifiability, and corroboration of the underlying research.

A
High Confidence

Large-sample primary research, peer-reviewed methodology, 3+ independent sources, published within 18 months. Safe to cite without qualification in any context.

B
Verified

Industry study with documented methodology, 2+ independent sources, adequate sample size, published within 24 months. Reliable for most citation contexts.

C
Provisional

Single credible source or cross-source consensus without a primary study. Auto-generated hedging language provided. Suitable for directional use with appropriate caveats.

D
Low Confidence

Unverified origin, insufficient corroboration. Flagged as Do Not Cite. Included in the database for completeness and research context only.
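
The four grade descriptions above can be read as an ordered rule set. The sketch below encodes that reading; it is a simplification for illustration, since published grades come from the 6-dimension scoring formula rather than from boolean flags. The `Evidence` fields and the `assign_grade` function are hypothetical, and "large-sample" is approximated by the `adequate_sample` flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    peer_reviewed_primary: bool   # primary research with peer-reviewed methodology
    documented_methodology: bool
    adequate_sample: bool
    credible_source: bool
    independent_sources: int
    age_months: int


def assign_grade(e: Evidence) -> str:
    """Apply the rubric top-down; the first matching rule wins."""
    if (e.peer_reviewed_primary and e.adequate_sample
            and e.independent_sources >= 3 and e.age_months <= 18):
        return "A"
    if (e.documented_methodology and e.adequate_sample
            and e.independent_sources >= 2 and e.age_months <= 24):
        return "B"
    # Single credible source, or cross-source consensus without a primary study.
    if e.credible_source or e.independent_sources >= 2:
        return "C"
    return "D"
```

Because the rules are checked in order, a statistic that narrowly misses Grade A on recency can still qualify for Grade B.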


Confidence Score Decomposition

Every grade is computed across exactly 6 weighted dimensions. The weights are defined by formula and applied consistently across all statistics. LLMs are never the source of a numeric score.

Source Credibility
Methodology Rigor
Sample Adequacy
Data Recency
Corroboration Breadth
Independence Scoring

Bar widths are illustrative of relative weighting. Exact dimension weights are published in the technical appendix.
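
A weighted six-dimension score of this kind can be sketched as below. The weights shown are placeholders invented for the example, not the published values, which live in the technical appendix. The function name `confidence_breakdown` and the 0 to 1 scale for each dimension are also assumptions.

```python
# Hypothetical weights for illustration only; they are NOT the Lighthouse
# weights. They sum to 1 so the total stays on the same 0-1 scale.
WEIGHTS = {
    "source_credibility": 0.25,
    "methodology_rigor": 0.20,
    "sample_adequacy": 0.15,
    "data_recency": 0.15,
    "corroboration_breadth": 0.15,
    "independence": 0.10,
}


def confidence_breakdown(scores: dict) -> tuple:
    """Return (total, per-dimension contributions) for scores in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    contributions = {name: WEIGHTS[name] * scores[name] for name in WEIGHTS}
    return sum(contributions.values()), contributions
```

Returning the per-dimension contributions alongside the total is what makes a confidence breakdown possible: the same inputs always produce the same numbers, and each one can be inspected.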


Four Citation Safety Classifications

Every published statistic receives a citation safety classification that tells you exactly how and when it is safe to reference the data — with no ambiguity.

Safe to Cite
Grade A or B, no risk flags

Cite with full confidence. Complete provenance chain is available. Suitable for published reports, enterprise presentations, and journalism.

Use with Context
Grade C, high confidence score

Reliable for directional use. Note the source limitation when citing. Recommended for internal research and supporting evidence.

Use with Hedging
Grade C, lower confidence score

Auto-generated disclaimer language is provided with every statistic. Use as supporting context, not as a primary claim.

Do Not Cite
Grade D or flagged

Insufficient provenance to support citation. Available for internal research purposes only. Not suitable for any public-facing use.
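
The mapping from grade and confidence to a citation safety classification can be written as a small decision function. This is a sketch under assumptions: the text does not state where "high" and "lower" confidence divide for Grade C, so the `high_confidence_cutoff` of 0.7 is a placeholder, and `classify` is a hypothetical name.

```python
def classify(grade: str, confidence: float, flagged: bool = False,
             high_confidence_cutoff: float = 0.7) -> str:
    """Map an evidence grade and confidence score to a citation class."""
    if flagged or grade == "D":
        return "Do Not Cite"
    if grade in ("A", "B"):
        return "Safe to Cite"
    if grade == "C":
        # The cutoff is an assumed value; the source does not publish it here.
        if confidence >= high_confidence_cutoff:
            return "Use with Context"
        return "Use with Hedging"
    raise ValueError(f"unknown grade: {grade!r}")
```

Checking the risk flag first means a flagged statistic is never classified as Safe to Cite, whatever its grade.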


The 6-Layer Provenance Chain

Every published statistic carries a 6-layer immutable audit trail. Each layer is append-only — no record can be modified or deleted, only superseded by a new version.

Layer 1
Source Artifact

The original document — content-hashed and stored permanently. URL, retrieval timestamp, full-text snapshot, and Terms of Service compliance status are recorded at ingestion.

Layer 2
Extraction Event

Records which extraction model and prompt version were used, the raw model response, and the structured output. Every extraction is traceable to a specific model version and run configuration.

Layer 3
Version History

Every correction creates an immutable version record with a full diff. No data is overwritten. Previous versions remain accessible with an explanation of what changed and why.
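
An append-only version history with diffs can be sketched in a few lines. This is an in-memory illustration, not the Lighthouse storage layer: the `VersionHistory` and `Version` names are hypothetical, and a production system would enforce immutability at the database level rather than in application code.

```python
import difflib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a stored version cannot be mutated
class Version:
    number: int
    payload: str      # canonical JSON snapshot of the statistic
    reason: str       # explanation of what changed and why
    diff: tuple       # unified diff against the previous version
    recorded_at: str


class VersionHistory:
    """Corrections are appended as new versions; nothing is overwritten."""

    def __init__(self) -> None:
        self._versions = []

    def append(self, record: dict, reason: str) -> Version:
        payload = json.dumps(record, sort_keys=True, indent=2)
        previous = self._versions[-1].payload if self._versions else ""
        diff = tuple(difflib.unified_diff(
            previous.splitlines(), payload.splitlines(), lineterm=""))
        version = Version(len(self._versions) + 1, payload, reason, diff,
                          datetime.now(timezone.utc).isoformat())
        self._versions.append(version)
        return version

    def all(self) -> tuple:
        return tuple(self._versions)  # read-only view of every version
```

The class exposes no update or delete method, so the only way to change a statistic is to supersede it with a new version.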

Layer 4
Quality Trace

The full pipeline trace — pass/fail status for each of the 10 verification steps, the specific values that triggered any failure, and the precise reason the statistic was accepted or rejected.

Layer 5
Dependency Graph

Upstream sources this statistic corroborates, and downstream derived statistics that depend on it. Any change to a root stat automatically propagates confidence recalculations across its dependency tree.
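
Propagating a change through a dependency tree amounts to finding every downstream statistic and recalculating each one after its inputs. The sketch below shows one way to compute that order, assuming the graph is acyclic. The `recalculation_order` function and the `dependents` mapping (statistic ID to the IDs derived directly from it) are hypothetical names for illustration.

```python
from graphlib import TopologicalSorter


def recalculation_order(root: str, dependents: dict) -> list:
    """List every stat downstream of root, each after the stats it depends on."""
    # Step 1: collect everything reachable from the changed root stat.
    affected, stack = set(), [root]
    while stack:
        current = stack.pop()
        for child in dependents.get(current, ()):
            if child not in affected:
                affected.add(child)
                stack.append(child)

    # Step 2: order the affected stats so inputs are recalculated first.
    predecessors = {node: set() for node in affected}
    for parent, children in dependents.items():
        for child in children:
            if child in affected and parent in affected:
                predecessors[child].add(parent)
    return list(TopologicalSorter(predecessors).static_order())
```

A derived statistic with two changed inputs is therefore recalculated once, after both inputs, rather than once per input.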

Layer 6
Lineage Score

A 0–100 completeness score reflecting the depth and integrity of the provenance chain. Higher lineage scores indicate more complete audit trails and are used to prioritise quality assurance review.

200M+ lineage rows in append-only tables
100M+ citation chain edges mapped
6 interconnected immutable layers
18+ months to build this infrastructure

How Data Freshness Works

Confidence scores are not static. Lighthouse applies a time-aware staleness model that automatically downgrades confidence as data ages — with topic-specific thresholds for breaking news versus evergreen benchmarks.

90-Day Threshold — Confidence Downgrade

Statistics that have not been refreshed by a new primary source within 90 days receive an automatic confidence penalty. The grade may drop by one level if no corroborating update is found. A staleness flag is added to the citation display.

180-Day Threshold — Publication Block

Statistics older than 180 days without a refreshed source are blocked from new citations and marked as potentially stale. They remain in the database with full provenance history but are excluded from top-of-page results.

Topic-Aware Evergreen Exceptions

Structural benchmarks — such as average human reading speed or long-run global literacy rates — are classified as evergreen and exempt from standard staleness thresholds. Evergreen classification is assigned by formula, not editorial discretion.
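
The staleness rules above reduce to a short, time-aware function. The sketch below is an illustration under assumptions: the names `freshness_status` and `downgrade` are hypothetical, the exact boundary handling (more than 90 or 180 days) is inferred from the wording, and the thresholds are parameters so that topic-specific values can be passed in.

```python
from datetime import date

GRADES = ["A", "B", "C", "D"]


def freshness_status(last_refreshed: date, today: date, evergreen: bool = False,
                     downgrade_after: int = 90, block_after: int = 180) -> str:
    """Return 'fresh', 'downgraded', or 'blocked' for a statistic."""
    if evergreen:
        return "fresh"  # evergreen benchmarks are exempt from the thresholds
    age_days = (today - last_refreshed).days
    if age_days > block_after:
        return "blocked"
    if age_days > downgrade_after:
        return "downgraded"
    return "fresh"


def downgrade(grade: str) -> str:
    """Drop a grade by one level, stopping at D."""
    return GRADES[min(GRADES.index(grade) + 1, len(GRADES) - 1)]
```

For example, a Grade B statistic last refreshed 120 days ago would return "downgraded", and `downgrade("B")` gives the Grade C it may fall to if no corroborating update is found.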


See the methodology in action

Browse 4.6 million+ evidence-graded statistics — each one with a full provenance chain, citation safety classification, and confidence breakdown available on demand.