HDFE benchmark collection

Datasets for benchmarking high-dimensional fixed-effect algorithms and their implementations

A curated collection of datasets (CSV, Stata) for evaluating algorithms that absorb millions of fixed effects across multiple levels.

Download all CSVs Download all Stata files

15 datasets

5,977,832 rows

4,886,226 unique graph edges

2,995,103 graph nodes

Goal

Modern empirical work often needs regressions of the form y = Xb + Da + e, where D is not one fixed effect but a collection of high-dimensional indicator matrices: workers and firms, borrowers and months, students and teachers, senders and receivers, etc. The coefficients b on X are usually the object of interest, but the nuisance fixed effects a can dominate the computational problem.

The standard trick is Frisch-Waugh-Lovell residualization: partial out the fixed effects from y and each column of X, then run the low-dimensional regression. For one fixed effect this is just demeaning. For two or more fixed effects it becomes an iterative numerical problem whose difficulty is governed less by the number of rows than by the connectivity of the underlying graph.
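As a minimal sketch of the iteration described above, the following Python code residualizes one variable against two fixed effects by alternating group demeaning. The function names, tolerance, and stopping rule are illustrative assumptions, not the method of any particular package; production implementations add acceleration, weights, and better convergence checks.

```python
from collections import defaultdict


def demean_by(values, ids):
    """Subtract the group mean of `values` within each level of `ids`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(values, ids):
        sums[group] += value
        counts[group] += 1
    return [value - sums[group] / counts[group]
            for value, group in zip(values, ids)]


def residualize_two_way(values, id1, id2, tol=1e-10, max_iter=10_000):
    """Alternate demeaning by id1 and id2 until a sweep changes nothing.

    Returns the residualized values and the number of sweeps performed.
    `tol` and `max_iter` are illustrative defaults, not recommendations.
    """
    current = list(values)
    for sweep in range(1, max_iter + 1):
        updated = demean_by(demean_by(current, id1), id2)
        change = max(abs(a - b) for a, b in zip(updated, current))
        current = updated
        if change < tol:
            return current, sweep
    return current, max_iter


# Toy complete 2x2 design: y is purely additive in the two effects,
# so the residuals should all be zero.
id1 = [0, 0, 1, 1]
id2 = [0, 1, 0, 1]
y = [11.0, 21.0, 12.0, 22.0]  # (1 or 2) + (10 or 20)
residuals, sweeps = residualize_two_way(y, id1, id2)
print(residuals, sweeps)
```

On a complete, balanced design like this toy example, one sweep already removes both effects and the second sweep only confirms convergence. On weakly connected data the same loop can need many more sweeps, which is the behavior this collection is meant to expose.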

Econometric motivation: The same model can be easy or hard depending on how units are matched. A balanced worker-firm panel and a fragmented worker-firm panel may have similar row counts but very different identifying variation and convergence behavior.

Numerical motivation: Alternating projections, conjugate-gradient acceleration, pruning, and sparse solvers all exploit the same geometry in different ways. Poorly connected graphs can make naive methods look fine on toy examples and fail on realistic ones.

Benchmark motivation: Synthetic complete or uniform graphs are useful but not enough. Real anonymized networks have components, bottlenecks, degree heterogeneity, and sparse bridges. Those features are the most interesting when testing HDFE implementations.

The figures below illustrate the sparsity patterns of the different datasets. They show the binned bipartite incidence blocks induced by the primary two-way fixed effects, i.e. the off-diagonal structure of the graph Laplacian. Dense, compact patterns are usually easy. Long thin structures, many separated blocks, or weak bridges are warning signs: they indicate poor algebraic connectivity, large finite condition numbers, and slow residualization for some algorithms.
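A cheap first diagnostic for the "many separated blocks" case is to count the connected components of the bipartite graph formed by the two identifiers. In a two-way model each component contributes one redundant fixed-effect level, so the count also matters for degrees of freedom. The union-find sketch below is only an illustration: the function name is hypothetical, and it says nothing about weak bridges or algebraic connectivity within a component.

```python
def count_components(id1, id2):
    """Count connected components of the bipartite graph with edges (id1, id2)."""
    parent = {}

    def find(node):
        # Walk up to the root, then compress the path behind us.
        root = node
        while parent[root] != root:
            root = parent[root]
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    for a, b in zip(id1, id2):
        # Tag each level with its side so equal labels on the two sides
        # are never confused with each other.
        u, v = ("id1", a), ("id2", b)
        parent.setdefault(u, u)
        parent.setdefault(v, v)
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    return len({find(node) for node in parent})


# Toy example: levels a and b share y, while c is matched only to z.
toy_id1 = ["a", "a", "b", "c"]
toy_id2 = ["x", "y", "y", "z"]
n_components = count_components(toy_id1, toy_id2)
print(n_components)
```

With the component count in hand, the number of identifiable two-way fixed-effect levels is the number of id1 levels plus the number of id2 levels minus the number of components.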

Datasets

Credit, Small Subsample

A small borrower-period panel derived from the credit dataset. It is useful as a fast smoke test for HDFE implementations.

real-anonymized credit2 extra ids

19,094 rows

19,094 edges

601 id1 levels

140 id2 levels

CSV ZIP Stata DTA

Credit

Borrower-month observations from a credit panel, with synthetic regressors and outcome added for benchmarking.

real-anonymized credit extra ids

516,810 rows

516,810 edges

13,946 id1 levels

140 id2 levels

CSV ZIP Stata DTA

Soccer

Player co-appearance data constructed from soccer match rosters, with synthetic regressors and outcome.

real-derived soccer

73,487 rows

73,355 edges

528 id1 levels

521 id2 levels

CSV ZIP Stata DTA

Synthetic Complete

A complete bipartite synthetic benchmark where every unit in the first partition is matched to every unit in the second partition.

synthetic synthetic-complete

500,000 rows

500,000 edges

1,000 id1 levels

500 id2 levels

CSV ZIP Stata DTA

Synthetic Uniform, Easy

Uniform random matching synthetic benchmark with easy connectivity.

synthetic synthetic-uniform-easy

500,000 rows

475,566 edges

216,007 id1 levels

20 id2 levels

CSV ZIP Stata DTA

Synthetic Uniform, Hard

Uniform random matching synthetic benchmark with weaker connectivity.

synthetic synthetic-uniform-hard

500,000 rows

499,992 edges

122,702 id1 levels

122,655 id2 levels

CSV ZIP Stata DTA

Synthetic Uniform, Harder

Uniform random matching synthetic benchmark with the weakest v1 uniform connectivity.

synthetic synthetic-uniform-harder

500,000 rows

499,997 edges

216,007 id1 levels

216,046 id2 levels

CSV ZIP Stata DTA

Synthetic Assortative Matching

Synthetic CEO-firm panel with turnover, retirement, poaching, and assortative matching between CEO and firm types.

synthetic synthetic-assortative extra ids

499,155 rows

196,245 edges

126,101 id1 levels

94,933 id2 levels

CSV ZIP Stata DTA

Synthetic Zigzag

A small synthetic graph designed to create a path-like hard connectivity pattern.

synthetic synthetic-zigzag

10,002 rows

10,000 edges

5,000 id1 levels

5,001 id2 levels

CSV ZIP Stata DTA
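The generator behind the Synthetic Zigzag dataset is not described here. Purely as an illustration, one construction that yields a path-like bipartite graph with the same level and edge counts (though not the two extra rows) links each id1 level to two consecutive id2 levels. The function name and layout below are hypothetical.

```python
def zigzag_edges(n_id1):
    """Build a path-like bipartite graph: id1 level i meets id2 levels i and i + 1.

    This is an assumed construction for illustration, not the dataset's
    actual generator.
    """
    id1, id2 = [], []
    for i in range(n_id1):
        id1 += [i, i]
        id2 += [i, i + 1]
    return id1, id2


zz_id1, zz_id2 = zigzag_edges(5000)
print(len(set(zz_id1)), len(set(zz_id2)), len(set(zip(zz_id1, zz_id2))))
```

Because every level touches at most two levels on the other side, information about the fixed effects can only travel one step along the path per sweep, which is why graphs of this shape are hard for alternating-projection methods.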

Enron Email

Email sender-recipient graph derived from the SNAP Enron email network.

real-public enron

367,662 rows

367,662 edges

36,692 id1 levels

36,692 id2 levels

CSV ZIP Stata DTA

GitHub Contributions

Open-source contribution data derived from the Google BigQuery GitHub public dataset, with synthetic regressors and outcome.

real-public github extra ids

548,843 rows

41,751 edges

27,970 id1 levels

24,358 id2 levels

CSV ZIP Stata DTA

Patent Citations

Patent citation graph derived from the SNAP patent citation dataset.

real-public patents

500,008 rows

500,008 edges

101,837 id1 levels

362,995 id2 levels

CSV ZIP Stata DTA

Workers

Employer-employee matched data with anonymized firm and worker identifiers plus job-title and year fields.

real-anonymized workers extra ids

504,315 rows

253,929 edges

28,864 id1 levels

218,390 id2 levels

CSV ZIP Stata DTA

Schools

Student-teacher longitudinal data with anonymized identifiers and synthetic regressors/outcome.

real-anonymized schools extra ids

413,444 rows

406,998 edges

206,722 id1 levels

11,960 id2 levels

CSV ZIP Stata DTA

Directors

Firm-director matched records with synthetic regressors and outcome.

real-anonymized directors

525,012 rows

524,819 edges

395,925 id1 levels

400,850 id2 levels

CSV ZIP Stata DTA

Additional Fixed-Effect Dimensions

The canonical v1 figure for every dataset uses id1 and id2, because two-way fixed effects have a direct graph/Laplacian interpretation. Several datasets also include other identifier-like columns that can support richer three-way or multi-way benchmarks.

Dataset	Primary graph	Additional identifier-like columns
Credit, Small Subsample	id1, id2	ruc, t
Credit	id1, id2	ruc, t
Soccer	id1, id2	None in the clean v1 CSV
Synthetic Complete	id1, id2	None in the clean v1 CSV
Synthetic Uniform, Easy	id1, id2	None in the clean v1 CSV
Synthetic Uniform, Hard	id1, id2	None in the clean v1 CSV
Synthetic Uniform, Harder	id1, id2	None in the clean v1 CSV
Synthetic Assortative Matching	id1, id2	year, ceo_type, firm_type
Synthetic Zigzag	id1, id2	None in the clean v1 CSV
Enron Email	id1, id2	None in the clean v1 CSV
GitHub Contributions	id1, id2	id3, t
Patent Citations	id1, id2	None in the clean v1 CSV
Workers	id1, id2	id3, id4
Schools	id1, id2	id3
Directors	id1, id2	None in the clean v1 CSV
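To check a downloaded file against the row, edge, and level counts listed for each dataset, or to see how many distinct cells a three-way specification would involve, a small summary like the following may help. It is a sketch using only the Python standard library: the function name is hypothetical, the in-memory sample stands in for a real CSV, and only the column names id1, id2, and id3 come from the table above.

```python
import csv
import io


def summarize_ids(csv_file, id_columns=("id1", "id2")):
    """Count rows, distinct identifier tuples, and levels per identifier column."""
    levels = {column: set() for column in id_columns}
    tuples_seen = set()
    n_rows = 0
    for row in csv.DictReader(csv_file):
        n_rows += 1
        key = tuple(row[column] for column in id_columns)
        tuples_seen.add(key)
        for column, value in zip(id_columns, key):
            levels[column].add(value)
    return {
        "rows": n_rows,
        "unique_tuples": len(tuples_seen),  # graph edges when two columns are used
        "levels": {column: len(seen) for column, seen in levels.items()},
    }


# Tiny in-memory stand-in for a downloaded CSV (made-up values).
sample = "id1,id2,id3\n1,10,7\n1,10,8\n2,10,7\n2,11,7\n"
two_way = summarize_ids(io.StringIO(sample))
three_way = summarize_ids(io.StringIO(sample), id_columns=("id1", "id2", "id3"))
print(two_way)
print(three_way)
```

For a real file, pass an open file handle instead of the `io.StringIO` object, for example `open(path, newline="")` with `path` pointing at one of the extracted CSVs.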

Historical Benchmark

The table below reproduces the initial benchmark timings from this project, in seconds (see the presentation link below). As of early 2026, they illustrate the central lesson of the collection: no single method wins uniformly, and the hard cases are the ones with weak graph connectivity.

Dataset	MAP-Aitken	MAP-SD	MAP-CG-Sym	MAP+Prune	LSMR	Fastest
Synthetic-complete	2.3	2.5	3.5	7.5	2.8	MAP-Aitken (2.3s)
Synthetic-unif-easy	7.4	5.0	4.9	16.8	10.4	MAP-CG-Sym (4.9s)
Synthetic-unif-hard	19.0	21.8	16.1	22.7	32.1	MAP-CG-Sym (16.1s)
Synthetic-unif-harder	83.5	49.8	47.2	50.0	124.2	MAP-CG-Sym (47.2s)
Synthetic-assortative	320.6	108.7	101.8	73.6	206.6	MAP+Prune (73.6s)
Credit	15.1	20.5	14.2	28.2	17.3	MAP-CG-Sym (14.2s)
Enron	51.4	38.1	29.7	31.5	51.0	MAP-CG-Sym (29.7s)
Schools	221.5	79.7	61.7	116.6	132.0	MAP-CG-Sym (61.7s)
GitHub	268.7	110.1	114.4	127.1	169.6	MAP-SD (110.1s)
Workers	484.1	146.0	169.6	646.1	356.3	MAP-SD (146.0s)
Patents	574.2	191.7	188.5	172.3	490.6	MAP+Prune (172.3s)
Directors	688.8	215.6	258.6	240.7	894.1	MAP-SD (215.6s)

To recreate comparable timings with current Stata and current packages, download benchmark_reghdfe_variants.do and run it after downloading the Stata files. For the original motivation, examples, and discussion behind these benchmarks, see the 2017 SEC presentation: Linear Models with Multi-Way Fixed Effects.