GitHub hosted database of activity #190186
Replies: 3 comments
If you want a SQL-queryable database of almost all public GitHub activity, you should check out the GitHub Archive on Google BigQuery. GitHub pushes their event timeline (stars, forks, PRs, issues) to a public dataset there. Since it’s in BigQuery, you can write standard SQL to track "stale PRs" or "reviewer velocity" across the entire ecosystem. It's exactly the "rolling archive" you're looking for! Another route specifically for project health stats like you mentioned is OSSInsight.io. They actually use that same data to build the exact timeseries charts and contributor rankings you described. That said, having a native "GitHub Insights SQL Console" built directly into the platform would be incredible for maintainers who don't want to leave the site to check their project's health. Given your work on Superset, you're in a unique position to show GitHub what that integration could look like. Maybe there's a world where a "Superset-lite" dashboard becomes a first-class citizen in the Insights tab! |
You're definitely not alone in wanting this: a queryable, time-series view of GitHub activity would unlock a ton of useful insights.

Short answer: there isn't a single official "SQL over all GitHub activity" service from GitHub, but there are a few things that get close depending on what you need.

What exists today

GH Archive on BigQuery
- Pros: historical and time-series friendly
- Limitations: event-based (not full relational state)

Official REST/GraphQL APIs
- Not designed for large-scale analytics

Roll your own warehouse
- Ingest GitHub Archive plus the APIs into your own store. That's basically the "roll your own warehouse" approach.

Why GitHub doesn't offer this (yet)

My guess (based on how their APIs are structured): the cost of serving global analytical queries at scale.

A pretty solid stack would be:
- GitHub Archive (BigQuery) → base event stream

That gets you ~80–90% of what you're describing.

Your idea is actually very compelling. An official "GitHub Analytics API / warehouse layer" would be huge for maintainers, and Superset would honestly be a great fit on top of that.

TL;DR: no official hosted SQL database today, but GH Archive on BigQuery gets close. Would love to see what you build if you go down this route; feels like there's a real gap here 👍
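The "roll your own warehouse" ingestion step can be sketched in a few lines. GH Archive publishes one gzipped file per hour with one JSON event per line; this sketch builds a tiny in-memory hour file instead of fetching a real one, so the download step and real filenames are deliberately left out:

```python
import gzip
import io
import json

def parse_hour_file(gz_bytes):
    """Yield events from a gzipped newline-delimited JSON hour file,
    the format GH Archive uses for its hourly dumps."""
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

# Build a tiny hypothetical hour file in memory (real files come from
# gharchive.org and contain hundreds of thousands of events per hour).
sample = "\n".join(json.dumps(e) for e in [
    {"type": "WatchEvent", "repo": {"name": "apache/superset"}},
    {"type": "PushEvent", "repo": {"name": "apache/superset"}},
]).encode("utf-8")
hour_file = gzip.compress(sample)

events = list(parse_hour_file(hour_file))
```

From here the warehouse part is just loading `events` into whatever store you query with SQL.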
GitHub doesn't offer a public SQL endpoint for all activity. The closest official source is the GitHub Archive Program, which partners with GH Archive to provide a public dataset of public events via Google BigQuery. You can query it directly there. I've used GH Archive's BigQuery public dataset for similar analytics. The schema is well-documented and updated hourly. For your specific use cases - like tracking stale PRs or reviewer response times - you'd write SQL queries against the If you want to build a Superset dashboard, connect Superset to BigQuery and point it at this dataset. The data is free but subject to BigQuery's query costs on large scans. GH Archive also open-sources their ETL pipeline on GitHub if you want to replicate or contribute. That might be the best way to "help" if you're looking to improve the underlying infrastructure. So: no hosted GitHub SQL DB, but BigQuery + GH Archive is the functional equivalent for public data. |
Topic area: Question
I'm probably posting this in the wrong discussion category, but I'm hard pressed to find a better one. Seems feed-adjacent at least.
I, and various other project maintainers, have often wanted access to a SQL-queryable database of all GitHub activity (sure, for our own repos, but why not all of them?).

I happen to work on Apache Superset, and have seen various sites that do a good job (some better than others) of maintaining a rolling archive of all GitHub activity. One such site implements a bunch of stuff on top of this. It's lovely, and while I could build such a thing for my own uses, this feels like a valuable service GitHub itself should/could be operating (or likely already has, internally).
Allowing us random users to hit such a DB (with proper keys, rate limits, etc.) would enable ALL SORTS of ways to track our projects... a timeseries chart of GitHub stars... knowing which non-bot PRs are the "most stale" at all times... or which reviewers have the fastest/slowest response times, or carry the most burden at any given time. There are a million stats/charts to be built from such a dataset, and I know many people who would be all over it. I'd even be happy to bring the Superset community to the table if anyone wants to explore various ways to utilize and visualize the data for our community and others.
Does such a thing exist that I'm not aware of? Could it? Can I/we help? :D
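For concreteness, here's the kind of stat I mean, sketched in Python over hypothetical event records shaped like GH Archive's `WatchEvent` (the event GitHub emits when someone stars a repo; the sample data is made up):

```python
from collections import Counter

# Hypothetical GH Archive-style events for a couple of days.
EVENTS = [
    {"type": "WatchEvent", "repo": "apache/superset",
     "created_at": "2024-06-01T10:00:00Z"},
    {"type": "WatchEvent", "repo": "apache/superset",
     "created_at": "2024-06-01T23:59:00Z"},
    {"type": "WatchEvent", "repo": "apache/superset",
     "created_at": "2024-06-02T01:00:00Z"},
    {"type": "ForkEvent", "repo": "apache/superset",
     "created_at": "2024-06-02T02:00:00Z"},
]

def stars_per_day(events, repo):
    """Daily star counts: bucket WatchEvents by the date prefix of created_at."""
    days = Counter()
    for e in events:
        if e["type"] == "WatchEvent" and e["repo"] == repo:
            days[e["created_at"][:10]] += 1
    return dict(sorted(days.items()))

series = stars_per_day(EVENTS, "apache/superset")
```

That dict is exactly the series you'd feed a timeseries chart; the hosted version would just be a GROUP BY over the same events.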