Gtools
There was a section in my presentation about `gtools`, which is the suite of commands I coded for working with big data more efficiently. While `gtools` still has to work within the constraints of Stata, its underlying code is in C and is much faster than Stata in many places, to the point where some big data tasks are much less of a bottleneck.
- The original impetus for `gtools` was writing a faster `collapse`. While in Stata 17/MP `collapse` has caught up to `gcollapse`, the package has expanded well beyond this original idea, and there is a lot of functionality that remains much faster than any other Stata program available. (Even in the case of `collapse`, `gcollapse` offers the `merge` and `merge replace` options, for example.)
- Please visit gtools.readthedocs.io for detailed documentation and examples. Below I reproduce some of the tables with an overview of how `gtools` compares to other Stata commands, which you can also find on the official site.
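
As a rough illustration of the `merge` option mentioned above, here is a minimal sketch using the `auto` dataset that ships with Stata. The dataset and the variable names `mean_price` and `med_price` are choices made for this example only, and it assumes `gtools` is already installed (e.g. via `ssc install gtools`).

```stata
* Assumes gtools is installed (e.g. ssc install gtools)
sysuse auto, clear

* Without merge, gcollapse behaves like collapse and replaces the data
* in memory with one row per group. With merge, the group statistics
* are instead added as new variables and every original row is kept.
gcollapse (mean) mean_price = price (p50) med_price = price, by(foreign) merge

* Each car now carries the mean and median price of its own group
list make foreign price mean_price med_price in 1/5
```

Adding `replace` alongside `merge` should let the merged statistics overwrite variables that already exist, which is what the `merge replace` combination refers to.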
Gtools commands with a Stata equivalent
| Function | Replaces | Speedup (IC / MP) | Unsupported | Extras |
|---|---|---|---|---|
| gcollapse | collapse | -0.5 to 2 (Stata 17+); 4 to 100 (Stata 16 and earlier) | | Quantiles, merge, labels, nunique, etc. |
| greshape | reshape | 4 to 20 / 4 to 15 | "advanced syntax" | `fast`, spread/gather (tidyr equiv) |
| gegen | egen | 9 to 26 / 4 to 9 (+,.) | labels | Weights, quantiles, nunique, etc. |
| gcontract | contract | 5 to 7 / 2.5 to 4 | | |
| gisid | isid | 8 to 30 / 4 to 14 | `using`, `sort` | `if`, `in` |
| glevelsof | levelsof | 3 to 13 / 2 to 7 | | Multiple variables, arbitrary levels |
| gduplicates | duplicates | 8 to 16 / 3 to 10 | | |
| gquantiles | xtile | 10 to 30 / 13 to 25 (-) | | `by()`, various (see usage) |
| | pctile | 13 to 38 / 3 to 5 (-) | | Ibid. |
| | _pctile | 25 to 40 / 3 to 5 | | Ibid. |
| gstats tab | tabstat | 10 to 50 / 5 to 30 (-) | See remarks | various (see usage) |
| gstats sum | sum, detail | 10 to 20 / 5 to 10 | See remarks | various (see usage) |
(+) The upper end of the speed improvements is for quantiles (e.g. median, iqr, p90) and few groups. Weights have not been benchmarked.
(.) Only gegen group was benchmarked rigorously.
(-) Benchmarks computed 10 quantiles. When computing a large number of quantiles (e.g. thousands), `pctile` and `xtile` are prohibitively slow due to the way they are written; in that case `gquantiles` is hundreds or thousands of times faster, but this is an edge case.
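
To show what using these commands in place of their Stata equivalents can look like, here is a small sketch on the `auto` dataset. The dataset and the new variable names (`grp_id`, `price_decile`, `rep_levels`) are assumptions made for illustration; see the official documentation for the full set of options.

```stata
* Assumes gtools is installed (e.g. ssc install gtools)
sysuse auto, clear

* gegen in place of egen: group identifier for foreign x rep78
gegen grp_id = group(foreign rep78)

* gisid in place of isid: errors out if make does not uniquely identify rows
gisid make

* glevelsof in place of levelsof: store the distinct values in a local
glevelsof rep78, local(rep_levels)
display "`rep_levels'"

* gquantiles with the xtile option in place of xtile: price deciles
gquantiles price_decile = price, xtile nq(10)
```

In most cases the switch amounts to prefixing the command with `g`; `gquantiles` is the exception, since one command covers `xtile`, `pctile`, and `_pctile` and the behavior is selected with an option.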
Extra commands
| Function | Similar (SSC/SJ) | Speedup (IC / MP) | Notes |
|---|---|---|---|
| fasterxtile | fastxtile | 20 to 30 / 2.5 to 3.5 | Allows `by()` |
| | egenmisc (SSC) (-) | 8 to 25 / 2.5 to 6 | |
| | astile (SSC) (-) | 8 to 12 / 3.5 to 6 | |
| gstats hdfe | (.) | | Allows weights, `by()` |
| gstats winsor | winsor2 | 10 to 40 / 10 to 20 | Allows weights |
| gunique | unique | 4 to 26 / 4 to 12 | |
| gdistinct | distinct | 4 to 26 / 4 to 12 | Also saves results in matrix |
| gtop (gtoplevelsof) | groups, select() | (+) | See table notes (+) |
| gstats range | rangestat | 10 to 20 / 10 to 20 | Allows weights; no flex stats |
| gstats transform | | | Various statistical functions |
(-) `fastxtile` from egenmisc and `astile` were benchmarked against `gquantiles, xtile` (`fasterxtile`) using `by()`.
(+) While similar to the user command 'groups' with the 'select' option, `gtoplevelsof` does not really have an equivalent. It is several dozen times faster than 'groups, select', but that command was not written with the goal of gleaning the most common levels of a varlist. Rather, it has a plethora of features and that one is somewhat incidental. As such, the benchmark is not equivalent and `gtoplevelsof` does not attempt to implement the features of 'groups'.
(.) Other than the dated 'hdfe' command, I do not know of a Stata command that residualizes variables from a set of fixed effects. The 'hdfe' command, as far as I can tell, morphed into the 'reghdfe' package; the latter, however, is a fully-functioning regression command, while 'gstats hdfe' only residualizes a set of variables.
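
For a feel of a few of the extra commands in the table above, here is a hedged sketch, again on the `auto` dataset. The variable name `price_q` and the suffix `_w5` are hypothetical choices for this example, and the options shown are only a small subset of what each command accepts.

```stata
* Assumes gtools is installed (e.g. ssc install gtools)
sysuse auto, clear

* fasterxtile with by(): price quartiles computed within each level of foreign
fasterxtile price_q = price, by(foreign) nq(4)

* gstats winsor: winsorize price at the 5th and 95th percentiles;
* the result is stored in a new variable named with the given suffix
gstats winsor price, cuts(5 95) suffix(_w5)

* gtop (gtoplevelsof): show the most common levels of rep78
gtop rep78
```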