Methodology
OpenGov combines official government data with machine analysis to make congressional activity accessible. This page explains exactly where the data comes from, what we do to it, and where the gaps are.
Where the Data Comes From
Vote totals, sponsor lists, and bill status come from congress.gov and GovInfo; campaign contribution totals come from FEC.gov; executive actions come from the Federal Register API.
What We Cover
119th Congress (2025-2026): Full coverage — all bills, votes, and members.
118th Congress (2023-2024): Partial — bills and votes loaded, analysis coverage varies.
Scope: Federal only. State legislatures are on the roadmap but not yet available.
How Often It Updates
Bill data is synced daily via GovInfo sitemaps to capture new and amended legislation; executive actions come from the Federal Register API (see the classification pipeline below for details).
What's Machine-Generated vs Raw
Some data on OpenGov comes directly from government sources with no processing. Other data is generated by AI models. This table shows exactly which is which.
| Data Type | Source | Method |
|---|---|---|
| Vote totals, sponsor lists, bill status | congress.gov / GovInfo | Raw -- no AI processing |
| FEC contribution totals | FEC.gov | Raw -- no AI processing |
| Plain-English bill summaries | LLM-generated | Claude (Haiku) via Claude CLI. ~98% of 119th Congress bills covered. |
| Stance classification (expand/restrict/etc.) | LLM-generated | Claude (Haiku). 15-verb constrained vocabulary. ~98% coverage. |
| Policy direction labels | LLM-generated | Claude (Haiku). Binary expand/restrict classification. ~94% coverage. |
| Position synthesis and themes | LLM-generated | Claude (Opus). Aggregated from member voting + sponsorship patterns. |
| Bill similarity scores | Vector embeddings | all-MiniLM-L6-v2 (384d), cosine similarity via Neo4j vector index |
| Bill connection paths | Graph traversal | Neo4j shortestPath algorithm, no AI |
| Social post issue tagging | LLM-generated | Claude (Haiku) with keyword fallback |
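To make the similarity and path rows concrete, here is a minimal sketch using the Neo4j Python driver. The index name `bill_embeddings`, the `Bill` label, the `bill_id` property, and the connection details are assumptions, not the production schema:

```python
from neo4j import GraphDatabase

# Connection details are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Top-k nearest bills by cosine similarity, via a Neo4j 5.x vector index.
# 'bill_embeddings' is a hypothetical index name over 384-d MiniLM vectors.
SIMILAR_QUERY = """
CALL db.index.vector.queryNodes('bill_embeddings', $k, $embedding)
YIELD node, score
RETURN node.bill_id AS bill, score
ORDER BY score DESC
"""

# Shortest connection path between two bills: pure graph traversal, no AI.
PATH_QUERY = """
MATCH (a:Bill {bill_id: $src}), (b:Bill {bill_id: $dst})
MATCH p = shortestPath((a)-[*..6]-(b))
RETURN [n IN nodes(p) | n.bill_id] AS path
"""

def similar_bills(embedding: list[float], k: int = 10):
    with driver.session() as session:
        return session.run(SIMILAR_QUERY, k=k, embedding=embedding).data()

def connection_path(src: str, dst: str):
    with driver.session() as session:
        return session.run(PATH_QUERY, src=src, dst=dst).data()
```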
How Bills Are Classified
Bills flow through a 5-stage pipeline that combines official government taxonomy with semantic search and quality validation.
Official Data Ingestion
Bill text, metadata, and legislative subjects are downloaded from GovInfo. Executive actions come from the Federal Register API. Data is synced daily via sitemaps to capture new and amended legislation.
12,300+ bills from the 119th Congress across all 8 bill types (H.R., S., H.J.Res, S.J.Res, H.Con.Res, S.Con.Res, H.Res, S.Res).
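For illustration, a minimal sketch of a sitemap-driven daily sync. The sitemap URL and the change-detection logic are assumptions, not the production pipeline:

```python
import requests
import xml.etree.ElementTree as ET

# Hypothetical sitemap URL -- the real GovInfo layout may differ.
SITEMAP_URL = "https://www.govinfo.gov/sitemap/BILLS_119_sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def changed_since(last_sync: str) -> list[str]:
    """Return package URLs whose <lastmod> is newer than the last sync date.

    ISO-8601 date strings compare correctly as plain strings.
    """
    root = ET.fromstring(requests.get(SITEMAP_URL, timeout=30).content)
    return [
        url.findtext("sm:loc", namespaces=NS)
        for url in root.findall("sm:url", NS)
        if url.findtext("sm:lastmod", default="", namespaces=NS) > last_sync
    ]

# Run daily: fetch only bills added or amended since the previous sync.
new_or_amended = changed_since("2026-04-08")
```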
CRS Classification
Each bill is classified using its Congressional Research Service (CRS) policy area -- the same taxonomy used by the Library of Congress. CRS policy areas are mapped to our 42 tracked issues through a hand-curated configuration with disambiguation rules for overlapping areas.
33 CRS policy areas mapped to 11 themes and 42 issues. When a CRS area covers multiple issues (e.g., "Crime and Law Enforcement" spans gun rights, public safety, and criminal justice), legislative subjects and title keywords disambiguate the classification.
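A toy fragment of what such a hand-curated mapping can look like, using the "Crime and Law Enforcement" example. The issue slugs and keywords below are illustrative, not the real 42-issue configuration:

```python
# Illustrative fragment of a CRS-to-issue mapping with disambiguation rules.
# Issue names and keywords are made up; the real config covers 33 CRS areas.
CRS_MAP = {
    "Agriculture and Food": {"default": "agriculture"},  # unambiguous: 1-to-1
    "Crime and Law Enforcement": {                       # ambiguous: needs rules
        "default": "criminal-justice",
        "disambiguate": {
            "gun-rights": ["firearm", "second amendment", "concealed carry"],
            "public-safety": ["police funding", "emergency response"],
        },
    },
}

def classify(policy_area: str, subjects: list[str], title: str) -> str:
    """Map a bill's CRS policy area to a tracked issue.

    Legislative subjects and title keywords break ties when one CRS
    area spans several issues.
    """
    entry = CRS_MAP.get(policy_area)
    if entry is None:
        return "unclassified"
    text = " ".join(subjects + [title]).lower()
    for issue, keywords in entry.get("disambiguate", {}).items():
        if any(kw in text for kw in keywords):
            return issue
    return entry["default"]
```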
Semantic Search & Ranking
Bill text is chunked and embedded using sentence-transformers (all-MiniLM-L6-v2). For each issue, a hybrid search combines vector similarity with keyword matching, scoped to bills in the relevant CRS policy areas.
384-dimensional embeddings, top-100 chunk retrieval per issue, keyword boosting for domain-specific terms.
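A simplified sketch of the hybrid scoring, assuming chunks are pre-filtered to the issue's CRS policy areas and carry precomputed embeddings; the +0.1-per-term keyword boost is an illustrative value:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

def hybrid_search(issue_query: str, chunks: list[dict], boost_terms: list[str],
                  top_k: int = 100) -> list[dict]:
    """Score bill-text chunks by vector similarity plus keyword boosting.

    Each chunk dict is assumed to carry 'text' and a precomputed
    'embedding' tensor, already scoped to the issue's CRS policy areas.
    """
    q = model.encode(issue_query, convert_to_tensor=True)
    for chunk in chunks:
        sim = util.cos_sim(q, chunk["embedding"]).item()
        boost = 0.1 * sum(term in chunk["text"].lower() for term in boost_terms)
        chunk["score"] = sim + boost
    return sorted(chunks, key=lambda c: c["score"], reverse=True)[:top_k]
```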
Heat Score & Coverage Bypass
A heat score (0-15) measures each bill's legislative momentum -- factoring in cosponsor count, committee advancement, floor votes, and enactment. High-heat bills that semantic search missed are injected into results, ensuring legislatively important bills are never overlooked.
4-stage retrieval: CRS graph scoping, vector search, heat-score bypass (CRS-scoped), cross-CRS keyword bypass. Simple resolutions (commemorative/symbolic) are excluded from bypass to maintain precision.
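The four input signals and the 0-15 range come from the description above; the point values in this sketch are assumptions, shown only to illustrate the shape of the computation:

```python
def heat_score(bill: dict) -> int:
    """Assemble a 0-15 legislative-momentum score.

    The weights below are illustrative assumptions; only the range and
    the four signals (cosponsors, committee, votes, enactment) are from
    the pipeline description.
    """
    score = 0
    score += min(bill.get("cosponsors", 0) // 25, 4)        # up to 4 pts
    score += 3 if bill.get("reported_by_committee") else 0  # committee advancement
    score += 4 if bill.get("floor_votes", 0) > 0 else 0     # reached the floor
    score += 4 if bill.get("enacted") else 0                # signed into law
    return min(score, 15)
```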
Validation Against Golden Sets
Retrieval quality is measured against curated golden sets -- hand-verified lists of bills that must appear (recall) and must not appear (precision) for each issue. This provides objective, reproducible quality metrics.
42 golden sets with 5-18 must-find bills each. Overall: 94% recall, 92% rejection rate. 23 issues achieve 100% recall.
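Evaluating a retrieval run against a golden set reduces to two set operations. A minimal sketch with made-up bill IDs:

```python
def evaluate(retrieved: set[str], must_find: set[str], must_reject: set[str]) -> dict:
    """Recall over must-find bills, rejection rate over must-not-find bills."""
    recall = len(retrieved & must_find) / len(must_find)
    rejection = len(must_reject - retrieved) / len(must_reject)
    return {"recall": recall, "rejection_rate": rejection}

# Example: a golden set with 5 must-find and 4 must-reject bills.
metrics = evaluate(
    retrieved={"hr-1", "hr-2", "s-3", "s-9"},
    must_find={"hr-1", "hr-2", "s-3", "s-4", "s-5"},
    must_reject={"hr-7", "hr-8", "s-9", "s-10"},
)
# -> {'recall': 0.6, 'rejection_rate': 0.75}
```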
Neutrality
All generated text follows a strict neutrality policy: numbers and comparisons only, never judgment adjectives. We do not say a bill is "good" or "bad" -- we show what it does and let you decide.
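One way to enforce such a policy mechanically is a lint pass over generated text before publication. This sketch uses a small hypothetical banned-word list, not the actual policy rules:

```python
# Hypothetical neutrality lint: flag judgment adjectives in generated text.
# The banned list is a small illustrative sample, not the real policy.
BANNED = {"good", "bad", "dangerous", "landmark", "radical", "commonsense"}

def neutrality_violations(text: str) -> list[str]:
    """Return any judgment adjectives found in a generated summary."""
    words = {w.strip(".,;:!?").lower() for w in text.split()}
    return sorted(words & BANNED)

assert neutrality_violations("H.R. 1 raises the cap from $10k to $20k.") == []
```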
How We Order Things
The order in which candidates and issues appear on screen can imply preference or importance, even unintentionally. We take this seriously. Here's how we handle it.
Candidates
Candidate lists are randomized per session. Each time you visit, candidates appear in a different order. No candidate gets persistent top placement. We use a session-based seed so the order stays consistent while you browse (no jarring re-shuffles), but changes when you return later.
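A minimal sketch of session-seeded ordering; the session identifier source is an assumption, and any stable per-session value would work:

```python
import random

def session_order(candidates: list[str], session_id: str) -> list[str]:
    """Shuffle candidates deterministically within one session.

    Seeding with the session ID keeps the order stable while the user
    browses, but yields a fresh order on the next visit.
    """
    rng = random.Random(session_id)   # same session -> same order
    shuffled = candidates[:]
    rng.shuffle(shuffled)
    return shuffled

# Same session, same order; a new session re-randomizes.
assert session_order(["A", "B", "C"], "sess-1") == session_order(["A", "B", "C"], "sess-1")
```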
Issues
Issue lists are ordered by Congressional activity — how many bills and votes are happening on that topic right now. If you've selected demographics (like "renter" or "parent"), issues that affect people like you are shown first. We never editorially pick which issues are more important.
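A sketch of that two-level ordering; the field names (`bills`, `votes`, `affects`) are assumptions:

```python
def order_issues(issues: list[dict], user_demographics: set[str]) -> list[dict]:
    """Sort issues by congressional activity, demographic matches first.

    Each issue dict is assumed to carry 'bills' and 'votes' counts plus
    an 'affects' set of demographic tags (e.g., {"renter", "parent"}).
    """
    def key(issue: dict):
        matches_user = bool(issue["affects"] & user_demographics)
        activity = issue["bills"] + issue["votes"]
        return (not matches_user, -activity)  # matches first, then most active

    return sorted(issues, key=key)
```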
What We Don't Do
We don't sort candidates by party, fundraising, poll numbers, or incumbency status. We don't prioritize issues by controversy or newsworthiness. We don't use engagement metrics or clicks to reorder content. Every ordering decision is either randomized, data-driven, or personalized by your own selections.
Types of Evidence
When we show what a candidate has done on an issue, each piece of evidence is labeled by its source type. Not all evidence carries the same weight — a floor vote is an official action, while a campaign statement is a promise. We show you the difference so you can judge for yourself.
| Source Type | What It Means | Strength |
|---|---|---|
| Floor Vote | The candidate voted YES or NO on a bill in the Senate or House. This is an official, recorded action. | Official record |
| Bill Sponsored | The candidate sponsored or co-sponsored a bill. Sponsorship indicates active support for the legislation. | Official record |
| Campaign Site | A statement from the candidate's official campaign website, extracted verbatim. | Candidate's own words |
| Social Media | A post from the candidate's verified social media accounts (Bluesky, YouTube, press releases). | Candidate's own words |
When we have no evidence for a candidate on a topic, we say so explicitly: "No public record found." This does not mean the candidate has no position — it means we haven't found evidence in our sources yet.
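Internally this amounts to evidence records labeled by source type. A sketch of the shape, with names that are assumptions rather than the real schema:

```python
from dataclasses import dataclass
from enum import Enum

class SourceType(Enum):
    FLOOR_VOTE = "floor_vote"
    BILL_SPONSORED = "bill_sponsored"
    CAMPAIGN_SITE = "campaign_site"
    SOCIAL_MEDIA = "social_media"

# Each source type maps to the strength label shown in the table above.
STRENGTH = {
    SourceType.FLOOR_VOTE: "official record",
    SourceType.BILL_SPONSORED: "official record",
    SourceType.CAMPAIGN_SITE: "candidate's own words",
    SourceType.SOCIAL_MEDIA: "candidate's own words",
}

@dataclass
class Evidence:
    candidate_id: str
    issue: str
    source_type: SourceType
    excerpt: str  # verbatim statement or vote description

    @property
    def strength(self) -> str:
        return STRENGTH[self.source_type]

# Absence of evidence is stated explicitly, never inferred as a position.
def render(evidence: list[Evidence]) -> str:
    return "No public record found." if not evidence else f"{len(evidence)} items"
```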
How Suggested Topics Are Generated
When you see suggested topics on a race page, those are generated from what's actually happening in Congress — issues with the most bill introductions, committee activity, and floor votes in the current session.
We display topics as neutral labels (e.g., "Immigration" not "Border crisis") to avoid framing bias. The labels come from our standardized issue taxonomy, not editorial choices.
If you've selected demographics, suggested topics are re-ranked to show issues that affect people like you first. This changes the order, not the content — you can still see all topics by scrolling.
Known Limitations
- **Bill gap (Jan 2 - Apr 8, 2026):** Bills introduced between January 2 and April 8, 2026 are being backfilled due to a sync pipeline issue. Everything before January 2 is complete.
- **Member bios:** Not yet populated for all members. Coming soon.
- **Graph features:** "Bridge senators" and "voting blocs" are in development. Current versions have methodological issues and are not publicly surfaced.
- **X/Twitter integration:** Not yet available due to API cost constraints. Bluesky, YouTube, and Senate press releases are covered.
- **State legislatures:** Federal only. State coverage is on the roadmap.
How to Report a Data Issue
If you find incorrect data -- a wrong vote count, a misclassified bill, a broken link -- please open an issue on our GitHub repository or email [email protected]. We take data accuracy seriously and will investigate every report.