Planning ahead
Sorting and by
For example, if you will be doing many operations by group, sorting the data and working on it by group all at once will make each operation much faster. (NB: While gtools functions are also faster on sorted data, part of the point of gtools is to obviate the need for a sort, and the speed gain is often larger without sorting.)
set seed 1729
clear
set obs 10000000
gen x = rnormal()
gen y = rnormal()
gen g = mod(_n, 100)   // 100 groups
gen r = runiform()
sort r                 // shuffle so the data start unsorted
set rmsg on            // print timings after each command
It should be clear that this is inefficient:
qui {
    bys g: gen a = sum(x)            // sorts by g
    bys g: replace a = a[_N] / _N    // group means of x
    sort r                           // back to the shuffled order
    gen b = .
    bys g: replace b = max(y, b[_n - 1])   // sorts by g again
    bys g: replace b = b[_N]               // group max of y
    sort r
    gen c = .
    bys g: replace c = min(y, c[_n - 1])   // and again
    bys g: replace c = c[_N]               // group min of y
    sort r
}
However, we might accidentally end up doing a version of this if we're not deliberate about doing similar operations together. A better way would be to sort once, do all the by-group work, and only then restore the original order:
drop a b c
qui {
    sort g                          // one sort up front
    by g: gen a = sum(x)
    by g: replace a = a[_N] / _N
    gen b = .
    by g: replace b = max(y, b[_n - 1])
    by g: replace b = b[_N]
    gen c = .
    by g: replace c = min(y, c[_n - 1])
    by g: replace c = c[_N]
    sort r                          // restore the original order once, at the end
}
Of course, gegen is faster in this case if you care about leaving the data in its original state:
drop a b c
gegen a = mean(x), by(g)
gegen b = max(y), by(g)
gegen c = min(y), by(g)
But you should know two things. First, this is an inefficient gtools solution; we should be using the merge option of gcollapse:
drop a b c
gcollapse (mean) a=x (max) b=y (min) c=y, by(g) merge
Second, if we don't benchmark the sorts, the individual series of by operations in Stata are faster than even this gcollapse statement. The reason is that gcollapse has to group the data internally, whereas the by statement relies on the sort for that; so it's not a one-to-one comparison, but if your data will be sorted anyway, you should know that sometimes relying on by can be the fastest solution!
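To see this for yourself, here is a minimal timing sketch (assuming the simulated data from above are still in memory; the timer slot numbers and the variable name a2 are arbitrary):

drop a b c
timer clear
sort g                            // pay for the sort up front
timer on 1                        // time only the by operations
qui by g: gen a = sum(x)
qui by g: replace a = a[_N] / _N
timer off 1
timer on 2                        // time the equivalent gcollapse
qui gcollapse (mean) a2=x, by(g) merge
timer off 2
timer list

Note that slot 1 excludes the sort, which is exactly the caveat above: whether by wins depends on whether the sort counts against it.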
Pre-computing variables
You should pre-compute variables that will be re-used instead of creating them on the fly. For example:
gen byte ind = ...
program1 if ind == 1
program2 if ind == 1
gen var = ...
forvalues i = 3 / 7 {
    program`i' var
}
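As a concrete (hypothetical) illustration using the simulated variables from earlier, evaluating a shared condition once into a byte indicator means each later command only tests that indicator instead of re-evaluating the full expression over every observation:

* slow: the full expression is re-evaluated by every command
summarize y if inlist(g, 1, 3, 5) & x > 0
regress y x if inlist(g, 1, 3, 5) & x > 0

* faster: evaluate it once, then reuse the cheap indicator
gen byte ind = inlist(g, 1, 3, 5) & x > 0
summarize y if ind == 1
regress y x if ind == 1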
Very long operations
Sometimes a program that takes a long time to run is inevitable:
- Run overnight or over a break. (So the program does not compete for computing time or your own time.)
- Include checkpoints:
    - Do not write a single function to do all your work.
    - Group tasks into programs, and save your data along the way.
    - Print messages along your program to tell you where you are (you can check the log while the program executes).

This program groups execution and has checkpoints and log messages throughout:
program part1
    display "part 1, task 1"
    * ...
    display "part 1, task 2"
    * ...
end

program part2
    display "part 2, task 1"
    * ...
    display "part 2, task 2"
    * ...
end

part1
save checkpoint1.dta, replace
display "finished part 1"

part2
save checkpoint2.dta, replace
display "finished part 2"
We can even be fairly sophisticated about it. The snippet below can scale to multiple parts, for instance:
program main
    syntax, [cached]
    local nparts  = 2
    local startat = 1
    if "`cached'" == "cached" {
        forvalues part = 1 / `nparts' {
            cap confirm file checkpoint`part'.dta
            local startat = cond(`startat' == `part' & _rc == 0, `part' + 1, `startat')
        }
    }
    forvalues part = 1 / `nparts' {
        if `startat' <= `part' {
            part`part'
            save checkpoint`part'.dta, replace
            display "finished part `part'"
        }
        else if `startat' == (`part' + 1) {
            display "loading cached part `part' results"
            use checkpoint`part'.dta, clear
        }
    }
    if `startat' > `nparts' {
        disp "Nothing to do with option -cached-; all checkpoints already exist"
    }
end
program part1
    display "part 1, task 1"
    * ...
    display "part 1, task 2"
    * ...
end

program part2
    display "part 2, task 1"
    * ...
    display "part 2, task 2"
    * ...
end
main, cached
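Note the behavior: called with the cached option, main skips any part whose checkpoint file is already on disk, loads the latest checkpoint it needs, and resumes from there; called as plain main, it recomputes every part from scratch and overwrites the checkpoints.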