Dataset

GitHub Contributions

Open-source contribution data derived from the Google BigQuery GitHub public dataset, with synthetic regressors and outcome.


Bipartite incidence pattern for GitHub Contributions

Graph Summary

Sparse cross-language/project structure makes this a difficult real benchmark.

The primary benchmark graph is built from the unique (id1, id2) pairs. The figure shows a binned sparsity pattern of the bipartite incidence block, with both partitions relabeled to contiguous integer identifiers. Up to sign, this block is the off-diagonal part of the corresponding graph Laplacian.
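The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_bipartite_block` and the toy rows are hypothetical, and the page does not say how the published graph was actually produced.

```python
def build_bipartite_block(pairs):
    """Relabel both partitions to contiguous integers and keep unique edges.

    `pairs` is an iterable of (id1, id2) values, one per data row.
    Returns the two label maps and the sorted list of unique (i, j) edges,
    i.e. the nonzero positions of the bipartite incidence block.
    """
    id1_index = {}
    id2_index = {}
    edges = set()
    for a, b in pairs:
        # setdefault assigns the next free integer on first sight of a label
        i = id1_index.setdefault(a, len(id1_index))
        j = id2_index.setdefault(b, len(id2_index))
        edges.add((i, j))  # repeated rows collapse into a single edge
    return id1_index, id2_index, sorted(edges)


# Toy rows (made up for illustration, not taken from the dataset)
rows = [("u1", "r1"), ("u1", "r1"), ("u2", "r1"), ("u3", "r2")]
id1_index, id2_index, edges = build_bipartite_block(rows)
```

On the toy rows, four rows reduce to three unique edges, which mirrors how the 548,843 rows of the dataset reduce to 41,751 unique edges.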

548,843 rows

41,751 unique edges

27,970 id1 levels

24,358 id2 levels

13,232 components

52,328 nodes
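The node count is the sum of the two partitions (27,970 + 24,358 = 52,328). The component count can be reproduced from the unique edges with a union-find pass. The sketch below is a minimal version that assumes edges are given as contiguous integer pairs (i, j), with i indexing id1 levels and j indexing id2 levels; the function name `count_components` is hypothetical.

```python
def count_components(n1, n2, edges):
    """Count connected components of a bipartite graph with union-find.

    Nodes 0..n1-1 are id1 levels; nodes n1..n1+n2-1 are id2 levels.
    """
    parent = list(range(n1 + n2))

    def find(x):
        # Path halving keeps the trees shallow
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n1 + n2  # every node starts as its own component
    for i, j in edges:
        ri, rj = find(i), find(n1 + j)
        if ri != rj:
            parent[ri] = rj
            components -= 1  # each successful merge removes one component
    return components


# Toy graph (made up): 3 id1 levels, 2 id2 levels, 3 unique edges
toy_components = count_components(3, 2, [(0, 0), (1, 0), (2, 1)])
```

As a sanity check on the published figures, a graph always has at least nodes minus edges components, and 52,328 - 41,751 = 10,577 is indeed below the reported 13,232.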

Downloads: CSV (ZIP), Stata (DTA)

Variables

Columns in the clean CSV:

id1, id2, id3, t, len_commit_msg, num_files, x1, x2, y

The v1 graph uses id1 and id2. Additional identifier-like columns available for richer specifications: id3, t.
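As a rough illustration of how the v1 graph columns might be pulled from the clean CSV, the sketch below reads an in-memory sample with the column names listed above. The sample values are made up, the comma delimiter is an assumption, and `read_pairs` is a hypothetical helper; to read the real file, pass an open file handle instead of the `io.StringIO` object.

```python
import csv
import io

# Made-up sample rows using the documented column names
SAMPLE = """id1,id2,id3,t,len_commit_msg,num_files,x1,x2,y
a,r1,k1,2015,40,2,0.1,0.3,1.2
a,r1,k2,2016,12,1,0.4,0.2,0.7
b,r2,k1,2016,80,5,0.9,0.5,2.1
"""


def read_pairs(handle):
    """Return the (id1, id2) pair of every row, in file order."""
    reader = csv.DictReader(handle)
    return [(row["id1"], row["id2"]) for row in reader]


pairs = read_pairs(io.StringIO(SAMPLE))
unique_pairs = set(pairs)  # the v1 graph keeps one edge per distinct pair
```

A richer specification would read id3 or t in the same way and add them as further identifier dimensions.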

Source Notes

Google BigQuery public dataset `bigquery-public-data:github_repos`.

Google BigQuery public GitHub dataset documentation.

Historical Benchmark

2017 SEC benchmark timings for this dataset, in seconds:

Method	Citation	Seconds
MAP-Aitken	(Guimaraes 2012)	268.7
MAP-SD	(Gaure 2013)	110.1
MAP-CG-Sym	(Correia 2016)	114.4
MAP+Prune	(Correia 2016)	127.1
LSMR	(Gomez 2016)	169.6
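For a quick comparison, the timings in the table can be expressed relative to the fastest method. The snippet below only restates the table values and does the arithmetic; the variable names are illustrative.

```python
# Seconds, copied from the table above
timings = {
    "MAP-Aitken": 268.7,
    "MAP-SD": 110.1,
    "MAP-CG-Sym": 114.4,
    "MAP+Prune": 127.1,
    "LSMR": 169.6,
}

# Method with the smallest wall-clock time
fastest = min(timings, key=timings.get)

# Slowdown factor of each method relative to the fastest one
relative = {method: seconds / timings[fastest] for method, seconds in timings.items()}

for method, factor in sorted(relative.items(), key=lambda item: item[1]):
    print(f"{method}: {factor:.2f}x")
```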