Gtools
There was a section in my presentation about gtools, the suite of commands I wrote for working with big data more efficiently. While gtools still has to work within the constraints of Stata, its underlying code is in C and is much faster than Stata in many places, to the point where some big data tasks are much less of a bottleneck.
- The original impetus for gtools was writing a faster collapse. While in Stata 17/MP, collapse has caught up to gcollapse, the package has expanded well beyond this original idea, and there is a lot of functionality that remains much faster than any other Stata programs available. (Even in the case of collapse, gcollapse offers the merge and merge replace options, for example.)
- Please visit gtools.readthedocs.io for detailed documentation and examples. Below I reproduce some of the tables with an overview of how gtools compares to other Stata commands, which you can also find on the official site.
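As a rough illustration of the merge option mentioned above, here is a minimal sketch using the auto dataset that ships with Stata. It assumes gtools is already installed; the new variable names (mean_price, med_price) are made up for the example and are not from the documentation.

```stata
* Load an example dataset bundled with Stata
sysuse auto, clear

* Used like collapse: the data in memory is replaced by one row per group
preserve
gcollapse (mean) price mpg, by(foreign)
list
restore

* With merge, the group statistics are added as new variables to the
* original data instead of replacing it (names here are illustrative)
gcollapse (mean) mean_price = price (median) med_price = price, by(foreign) merge
```

With plain collapse, getting the same result would typically take a preserve/collapse/merge round trip or a separate egen call per statistic.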
Gtools commands with a Stata equivalent
| Function | Replaces | Speedup (IC / MP) | Unsupported | Extras |
|---|---|---|---|---|
| gcollapse | collapse | -0.5 to 2 (Stata 17+); 4 to 100 (Stata 16 and earlier) | | Quantiles, merge, labels, nunique, etc. |
| greshape | reshape | 4 to 20 / 4 to 15 | "advanced syntax" | fast, spread/gather (tidyr equiv) |
| gegen | egen | 9 to 26 / 4 to 9 (+,.) | labels | Weights, quantiles, nunique, etc. |
| gcontract | contract | 5 to 7 / 2.5 to 4 | | |
| gisid | isid | 8 to 30 / 4 to 14 | using, sort | if, in |
| glevelsof | levelsof | 3 to 13 / 2 to 7 | | Multiple variables, arbitrary levels |
| gduplicates | duplicates | 8 to 16 / 3 to 10 | | |
| gquantiles | xtile | 10 to 30 / 13 to 25 (-) | | by(), various (see usage) |
| | pctile | 13 to 38 / 3 to 5 (-) | | Ibid. |
| | _pctile | 25 to 40 / 3 to 5 | | Ibid. |
| gstats tab | tabstat | 10 to 50 / 5 to 30 (-) | See remarks | various (see usage) |
| gstats sum | sum, detail | 10 to 20 / 5 to 10 | See remarks | various (see usage) |
(+) The upper end of the speed improvements is for quantiles (e.g. median, iqr, p90) and few groups. Weights have not been benchmarked.
(.) Only gegen group was benchmarked rigorously.
(-) Benchmarks computed 10 quantiles. When computing a large
number of quantiles (e.g. thousands) pctile and xtile are prohibitively
slow due to the way they are written; in that case gquantiles is hundreds
or thousands of times faster, but this is an edge case.
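For reference, the 10-quantile case that was benchmarked looks roughly like the sketch below. It assumes gtools is installed and uses the bundled auto dataset; the output variable names (price_dec, price_dec_by) are hypothetical.

```stata
* Load an example dataset bundled with Stata
sysuse auto, clear

* Decile categories of price over the whole sample, analogous to xtile
gquantiles price_dec = price, xtile nquantiles(10)

* Same categorization computed separately within each group via by(),
* which is one of the extras listed in the table above
gquantiles price_dec_by = price, xtile nquantiles(10) by(foreign)
```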
Extra commands
| Function | Similar (SSC/SJ) | Speedup (IC / MP) | Notes |
|---|---|---|---|
| fasterxtile | fastxtile | 20 to 30 / 2.5 to 3.5 | Allows by() |
| | egenmisc (SSC) (-) | 8 to 25 / 2.5 to 6 | |
| | astile (SSC) (-) | 8 to 12 / 3.5 to 6 | |
| gstats hdfe | (.) | | Allows weights, by() |
| gstats winsor | winsor2 | 10 to 40 / 10 to 20 | Allows weights |
| gunique | unique | 4 to 26 / 4 to 12 | |
| gdistinct | distinct | 4 to 26 / 4 to 12 | Also saves results in matrix |
| gtop (gtoplevelsof) | groups, select() | (+) | See table notes (+) |
| gstats range | rangestat | 10 to 20 / 10 to 20 | Allows weights; no flex stats |
| gstats transform | | | Various statistical functions |
(-) fastxtile from egenmisc and astile were benchmarked against
gquantiles, xtile (fasterxtile) using by().
(+) While similar to the user command 'groups' with the 'select'
option, gtoplevelsof does not really have an equivalent. It is several
dozen times faster than 'groups, select', but that command was not written
with the goal of gleaning the most common levels of a varlist. Rather, it
has a plethora of features and that one is somewhat incidental. As such, the
benchmark is not equivalent and gtoplevelsof does not attempt to implement
the features of 'groups'.
(.) Other than the dated 'hdfe' command, I do not know of a Stata command that residualizes variables from a set of fixed effects. The 'hdfe' command, as far as I can tell, morphed into the 'reghdfe' package; the latter, however, is a fully functioning regression command, while 'gstats hdfe' only residualizes a set of variables.