GitHub hosted database of activity #190186
Replies: 3 comments
If you want a SQL-queryable database of almost all public GitHub activity, you should check out the GitHub Archive on Google BigQuery. GitHub pushes their event timeline (stars, forks, PRs, issues) to a public dataset there. Since it’s in BigQuery, you can write standard SQL to track "stale PRs" or "reviewer velocity" across the entire ecosystem. It's exactly the "rolling archive" you're looking for! Another route specifically for project health stats like you mentioned is OSSInsight.io. They actually use that same data to build the exact timeseries charts and contributor rankings you described. That said, having a native "GitHub Insights SQL Console" built directly into the platform would be incredible for maintainers who don't want to leave the site to check their project's health. Given your work on Superset, you're in a unique position to show GitHub what that integration could look like. Maybe there's a world where a "Superset-lite" dashboard becomes a first-class citizen in the Insights tab! |
You're definitely not alone in wanting this: a queryable, time-series view of GitHub activity would unlock a ton of useful insights.

Short answer: there isn't a single official "SQL over all GitHub activity" service from GitHub, but there are a few things that get close depending on what you need.

What exists today

GH Archive on BigQuery
- Pros: historical and time-series friendly
- Limitations: event-based (not full relational state)

Official REST/GraphQL APIs
- Not designed for large-scale analytics

Roll your own warehouse
- Ingest GitHub Archive plus the APIs into your own store. That's basically the "roll your own warehouse" approach.

Why GitHub doesn't offer this (yet)

My guess (based on how their APIs are structured): the cost of serving global analytical queries at scale.

A pretty solid stack would be:
- GitHub Archive (BigQuery) → base event stream

That gets you ~80–90% of what you're describing.

Your idea is actually very compelling. An official "GitHub Analytics API / warehouse layer" would be huge for maintainers, and Superset would honestly be a great fit on top of that.

TL;DR: no official hosted SQL database today, but GH Archive on BigQuery gets close. Would love to see what you build if you go down this route; feels like there's a real gap here 👍
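The "roll your own warehouse" ingestion step can be sketched in a few lines. GH Archive publishes one gzipped file per hour with one JSON event per line; this sketch builds a tiny in-memory hour file instead of fetching a real one, so the download step and real filenames are deliberately left out:

```python
import gzip
import io
import json

def parse_hour_file(gz_bytes):
    """Yield events from a gzipped newline-delimited JSON hour file,
    the format GH Archive uses for its hourly dumps."""
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

# Build a tiny hypothetical hour file in memory (real files come from
# gharchive.org and contain hundreds of thousands of events per hour).
sample = "\n".join(json.dumps(e) for e in [
    {"type": "WatchEvent", "repo": {"name": "apache/superset"}},
    {"type": "PushEvent", "repo": {"name": "apache/superset"}},
]).encode("utf-8")
hour_file = gzip.compress(sample)

events = list(parse_hour_file(hour_file))
```

From here the warehouse part is just loading `events` into whatever store you query with SQL.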
GitHub doesn't offer a public SQL endpoint for all activity. The closest official source is the GitHub Archive Program, which partners with GH Archive to provide a public dataset of public events via Google BigQuery. You can query it directly there. I've used GH Archive's BigQuery public dataset for similar analytics. The schema is well-documented and updated hourly. For your specific use cases - like tracking stale PRs or reviewer response times - you'd write SQL queries against the If you want to build a Superset dashboard, connect Superset to BigQuery and point it at this dataset. The data is free but subject to BigQuery's query costs on large scans. GH Archive also open-sources their ETL pipeline on GitHub if you want to replicate or contribute. That might be the best way to "help" if you're looking to improve the underlying infrastructure. So: no hosted GitHub SQL DB, but BigQuery + GH Archive is the functional equivalent for public data. |
Topic area: Question
I'm probably posting this in the wrong discussion category, but I'm hard pressed to find a better one. Seems feed-adjacent at least.
I, and various other project maintainers, have often wanted access to a SQL-queryable database of all GitHub activity (sure, for our own repos, but why not all of them?).

I happen to work on Apache Superset, and have seen various sites that do a good job (some better than others) of maintaining a rolling archive of all GitHub activity. One such site implements a bunch of stuff on top of this. It's lovely, and while I could build such a thing for my own uses, this feels like a valuable service GitHub itself should/could be operating (or likely already has, internally).
Allowing us random users to hit such a DB (with proper keys, rate limits, etc.) would enable ALL SORTS of ways to track our projects... a timeseries chart of GitHub stars... knowing which non-bot PRs are the "most stale" at all times... or which reviewers have the fastest/slowest response times, or carry the most burden at any given time. There are a million stats/charts to be built from such a dataset, and I know many people who would be all over it. I'd even be happy to bring the Superset community to the table if anyone wants to explore various ways to utilize and visualize the data for our community and others.
Does such a thing exist that I'm not aware of? Could it? Can I/we help? :D
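For concreteness, here's the kind of stat I mean, sketched in Python over hypothetical event records shaped like GH Archive's `WatchEvent` (the event GitHub emits when someone stars a repo; the sample data is made up):

```python
from collections import Counter

# Hypothetical GH Archive-style events for a couple of days.
EVENTS = [
    {"type": "WatchEvent", "repo": "apache/superset",
     "created_at": "2024-06-01T10:00:00Z"},
    {"type": "WatchEvent", "repo": "apache/superset",
     "created_at": "2024-06-01T23:59:00Z"},
    {"type": "WatchEvent", "repo": "apache/superset",
     "created_at": "2024-06-02T01:00:00Z"},
    {"type": "ForkEvent", "repo": "apache/superset",
     "created_at": "2024-06-02T02:00:00Z"},
]

def stars_per_day(events, repo):
    """Daily star counts: bucket WatchEvents by the date prefix of created_at."""
    days = Counter()
    for e in events:
        if e["type"] == "WatchEvent" and e["repo"] == repo:
            days[e["created_at"][:10]] += 1
    return dict(sorted(days.items()))

series = stars_per_day(EVENTS, "apache/superset")
```

That dict is exactly the series you'd feed a timeseries chart; the hosted version would just be a GROUP BY over the same events.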