About This Visualization
This bar chart race shows the popularity of programming languages on GitHub over time. Watch as languages compete for dominance across different metrics including commits, issues, pull requests, repositories, and stars.
Use the controls to adjust the visualization:
- Switch between different metrics using the tabs
- Change between yearly and monthly views
- Adjust how many languages are shown
- Use the replay button to restart the animation
Data Sources
Data is extracted from the GitHub Archive and represents activity across public repositories from 2011 to the present.
Data last updated: April 2025
How I Built This Project
The GitHub Language Trends project uses the GitHub Archive dataset as its primary source, supplemented with data from the GitHub GraphQL API for repository languages. My data collection pipeline follows these steps:
- Historical Data Extraction: Downloading terabytes of GitHub event data from GitHub Archive, targeting specific event types (Push, PullRequest, Issues, Create, Fork, and Watch)
- Language Resolution: Using GitHub's GraphQL API to resolve the primary language of each event's repository
- Incremental Updates: Running bi-monthly jobs to incorporate the latest GitHub activity (semi-implemented).
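The extraction step can be sketched as follows. The URL pattern matches GH Archive's public hourly dumps; the helper name and the event filter are my own illustration of the pipeline, not its actual code:

```python
from datetime import datetime, timedelta

ARCHIVE_URL = "https://data.gharchive.org/{stamp}.json.gz"  # GH Archive hourly dumps

# Event types the pipeline keeps (the "type" field in each archived event)
KEPT_EVENTS = {"PushEvent", "PullRequestEvent", "IssuesEvent",
               "CreateEvent", "ForkEvent", "WatchEvent"}

def hourly_urls(start, end):
    """Yield one archive URL for every hour in [start, end)."""
    t = start
    while t < end:
        # GH Archive file names do not zero-pad the hour: 2015-01-01-0.json.gz
        yield ARCHIVE_URL.format(stamp=f"{t:%Y-%m-%d}-{t.hour}")
        t += timedelta(hours=1)
```

Each downloaded hour is filtered down to `KEPT_EVENTS` before insertion, which keeps the working set well below the raw multi-terabyte scale.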
The most challenging aspect was designing a multithreaded, asynchronous script to query the GraphQL API quickly enough to resolve the primary languages of the initial ~500 million repositories once the event data had been downloaded.
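One key trick for staying within API limits is packing many repositories into a single GraphQL request with field aliases. A minimal sketch (the function name and alias scheme are mine; `repository` and `primaryLanguage` are real fields in GitHub's GraphQL schema):

```python
def batch_language_query(repos):
    """Build one GraphQL query that resolves primaryLanguage for many
    repositories at once, using aliases (r0, r1, ...) so a single HTTP
    request can cover on the order of 100 repos."""
    parts = []
    for i, full_name in enumerate(repos):
        owner, name = full_name.split("/", 1)
        parts.append(
            f'r{i}: repository(owner: "{owner}", name: "{name}") '
            "{ primaryLanguage { name } }"
        )
    return "query { " + " ".join(parts) + " }"
```

Many worker threads can then fire these batched queries concurrently, which is what makes ~500 million repositories tractable within rate limits.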
The animated bar chart race visualization was built using D3.js version 7, with several key considerations:
- Programming Language Representation: Ensuring the dataset accurately reflects the usage of programming languages in a given time period.
- Interactive Elements: Adding controls for users to explore different metrics and time periods
- Language Color Schemes: Building a large table of distinct colors so that up to 20 languages can be shown at once without any two shades appearing too similar.
The animation has been tuned to prioritize visual appeal, accounting for the animation speed of different time intervals.
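The color-table idea can be sketched as below; the hex values and the helper function are illustrative, not the project's actual palette:

```python
# Illustrative excerpt of a fixed language -> color table
LANGUAGE_COLORS = {
    "JavaScript": "#f1e05a",
    "Python": "#3572a5",
    "Java": "#b07219",
    "Ruby": "#701516",
}
# Fallback colors cycled for languages without a fixed entry
FALLBACK_PALETTE = ["#4e79a7", "#f28e2b", "#e15759", "#76b7b2", "#59a14f"]

def color_for(language, colors_on_screen):
    """Pick a stable color for a known language, otherwise the first
    fallback color not already used by one of the (at most 20) bars."""
    if language in LANGUAGE_COLORS:
        return LANGUAGE_COLORS[language]
    for color in FALLBACK_PALETTE:
        if color not in colors_on_screen:
            return color
    return "#cccccc"  # last resort if every fallback is taken
```

Keeping known languages on fixed colors means a bar never changes hue mid-animation, while the fallback pass guards against two on-screen bars sharing a shade.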
Building this visualization presented several interesting challenges:
- Data Volume: Processing over 1TB of raw GitHub event data required distributed data management techniques
- Event/Repository Cardinality: Reconciling GitHub's old Timeline API events with the current REST API events, and determining which combination of columns makes a record unique in each.
- Data Parsing: To prioritize speed, the virtual machine at times used 20 cores to download and parse archives in parallel.
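The parallel download-and-parse stage can be sketched like this; the field access follows GH Archive's current JSON layout, while the function names and kept columns are my own simplification:

```python
import gzip
import io
import json
from concurrent.futures import ProcessPoolExecutor

def parse_archive(raw_gz, kept_types=("PushEvent", "WatchEvent")):
    """Parse one gzipped GH Archive hour: keep only tracked event types and
    project each event down to the columns that get stored."""
    rows = []
    with gzip.open(io.BytesIO(raw_gz), "rt", encoding="utf-8") as f:
        for line in f:  # one JSON event per line
            event = json.loads(line)
            if event.get("type") in kept_types:
                rows.append((event["type"], event["repo"]["name"],
                             event["created_at"]))
    return rows

def parse_all(hourly_blobs, workers=20):
    """Fan the hourly files out across processes (the VM used up to 20 cores)."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return [row for rows in pool.map(parse_archive, hourly_blobs)
                for row in rows]
```

Note that older Timeline-era events nest repository information differently, which is exactly the cardinality problem described above.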
The most difficult challenge occurred the first time I parsed the downloaded data. I had parsed everything into a single table; migrating events into their respective tables consumed all the RAM I could offer (64 GiB) and crashed every other VM. (A lesson that only needed to be taught once.)
Hosting the Database
Infrastructure Overview
The GitHub Language Trends database architecture consists of:
- Primary Storage: ClickHouse columnar database running on an Ubuntu 24.04 headless VM (Proxmox base OS)
- VM Specifications: 32 GiB RAM and 20 cores, optimized for analytical workloads
- Table Structure: ReplacingMergeTree engine to deduplicate data and protect event uniqueness
- Data Pipeline: ClickHouse aggregations exported to CSV files for efficient web client consumption
The language for each event is resolved by looking up the primary language of the repository named in the event. The pre-aggregated CSV files enable smooth animation without real-time database queries during rendering.
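A sketch of what that setup looks like in ClickHouse SQL, held here as Python constants; the table and column names are illustrative, not the project's actual schema:

```python
# Hypothetical DDL for one event table. ReplacingMergeTree collapses rows
# that share the full ORDER BY key, which protects event uniqueness even
# after a partial re-download.
PUSH_EVENTS_DDL = """
CREATE TABLE IF NOT EXISTS push_events (
    repo_name   String,
    actor_login String,
    created_at  DateTime,
    push_id     UInt64
)
ENGINE = ReplacingMergeTree
ORDER BY (repo_name, push_id, created_at)
"""

# Aggregation exported for the chart: events per language per month, joined
# through a repo -> primary-language lookup table, written straight to CSV.
MONTHLY_EXPORT_SQL = """
SELECT l.language, toStartOfMonth(e.created_at) AS month, count() AS events
FROM push_events AS e
INNER JOIN repo_languages AS l ON l.repo_name = e.repo_name
GROUP BY l.language, month
ORDER BY month, events DESC
INTO OUTFILE 'push_events_monthly.csv'
FORMAT CSVWithNames
"""
```

Putting every deduplicating column in the ORDER BY key is what lets a re-downloaded hour be inserted blindly: duplicate rows merge away in the background.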
Development Timeline
March 2025
Project inception: Initial data research and backend planning.
April 1 -- April 7, 2025
Data verification: Testing the scope and potential of GitHub Archive data, using Google's BigQuery.
Backend database: Research to find a good backend database which could work with the large dataset.
April 8 -- April 14, 2025
ClickHouse exploration: Settled on ClickHouse because it could handle the bulk GitHub Archive dataset. Learned to create and manipulate tables; its syntax and aggregation functions are similar to BigQuery's.
Data download: Worked with ChatGPT and Claude to build a multithreaded shell script that downloads and parses data into the ClickHouse DB.
April 15 -- April 21, 2025
GitHub API data: Built a multithreaded Python script to retrieve the primary language for each repository.
Data aggregation: Built scripts to query data and join tables to count programming languages per GitHub event.
April 22 -- April 28, 2025
Primary keys: Analyzed raw data to determine which combined columns act as a primary key for each event type.
Data re-download: Identified a lack of uniqueness in some events; had to add additional columns to event tables and redownload ALL data.
April 29 -- May 5, 2025
Frontend development: Tested D3.js visualizations and interactive components.
Querying duration: Querying the primary language of all ~500 million repositories took an estimated 16 days.
May 6 -- May 12, 2025
Updated aggregation: Changed aggregation query to include additional language attributes from old Timeline API data between 2012 and 2014.
Frontend development: Added custom filters and an additional D3.js line-graph visualization.
May 13 -- May 16, 2025
Frontend development: Created a percent-change line-graph visualization and refined data filtering.
Website integration: The frontend visualization page was built separately from my personal website, so its CSS initially conflicted with the site's styles.
Future Plans
Automatic updates: Update the dataset bi-monthly (semi-implemented) and query languages for any newly missing repositories (not implemented).
Language alteration: Record each repository's language with a query-date timestamp, to capture a repository's primary language changing over time.