Dataset
GitHub Contributions
Open-source contribution data derived from the Google BigQuery GitHub public dataset, with synthetic regressors and outcome.
Graph Summary
Sparse cross-language/project structure makes this a difficult real benchmark.
The primary benchmark graph is built from the unique (id1, id2) pairs in the data. The figure shows a binned sparsity pattern of the bipartite incidence block, with both partitions relabeled to contiguous integer identifiers. Up to sign, this block is the off-diagonal part of the corresponding graph Laplacian.
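The construction described above can be sketched in plain Python. This is a minimal illustration, not the pipeline used to produce the figure. The function names, the toy identifiers, and the bin count are all hypothetical, and the binning rule used for the actual figure is not stated in this document. The sketch assumes the usual convention L = D - A, under which the off-diagonal block of the Laplacian is the negative of the bipartite block and the degrees sit on the diagonal.

```python
from collections import Counter


def build_bipartite_block(pairs):
    """Deduplicate (id1, id2) pairs and relabel each side to 0..n-1."""
    unique_pairs = sorted(set(pairs))
    # Contiguous integer labels, assigned in sorted order of the raw identifiers.
    left = {v: i for i, v in enumerate(sorted({a for a, _ in unique_pairs}))}
    right = {v: j for j, v in enumerate(sorted({b for _, b in unique_pairs}))}
    edges = [(left[a], right[b]) for a, b in unique_pairs]
    return edges, len(left), len(right)


def degrees(edges, n_left, n_right):
    """Diagonal of the Laplacian: number of distinct neighbours per node."""
    d_left = [0] * n_left
    d_right = [0] * n_right
    for i, j in edges:
        d_left[i] += 1
        d_right[j] += 1
    return d_left, d_right


def binned_pattern(edges, n_left, n_right, bins=2):
    """Count nonzeros of the block in each (row_bin, col_bin) cell."""
    counts = Counter()
    for i, j in edges:
        counts[(i * bins // n_left, j * bins // n_right)] += 1
    return counts


# Toy data standing in for the id1 and id2 columns (hypothetical values).
pairs = [("u1", "r1"), ("u1", "r2"), ("u2", "r2"), ("u1", "r1"), ("u3", "r3")]
edges, n_left, n_right = build_bipartite_block(pairs)
d_left, d_right = degrees(edges, n_left, n_right)
pattern = binned_pattern(edges, n_left, n_right)
```

The duplicate pair in the toy input is dropped before relabeling, so each edge appears once in the block. For data at the scale of this benchmark, a sparse matrix library would be the more practical choice than Python lists.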
Variables
Columns in the clean CSV:
The v1 graph uses id1 and id2. Additional identifier-like columns available for richer specifications: id3, t.
Source Notes
Google BigQuery public dataset `bigquery-public-data:github_repos`.
Google BigQuery public GitHub dataset documentation.
Historical Benchmark
2017 SEC benchmark timings for this dataset, in seconds:
| Method | Citation | Seconds |
|---|---|---|
| MAP-Aitken | (Guimaraes 2012) | 268.7 |
| MAP-SD | (Gaure 2013) | 110.1 |
| MAP-CG-Sym | (Correia 2016) | 114.4 |
| MAP+Prune | (Correia 2016) | 127.1 |
| LSMR | (Gomez 2016) | 169.6 |