Improve your Algorithms

A parallelized and extremely efficient loop is slower than the equivalent vector operation.
The fastest regression with a full set of fixed-effect indicators is slower than reghdfe.
Code tailored to your specific task is faster than general-purpose code.

A simple example that I think most people will find intuitive is computing a leave-one-out mean. One definition is

\[ \bar{x}_{-i} = \dfrac{\sum_{j \ne i} x_j}{\sum_{j \ne i} 1} \]

The corresponding code is

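clear
* rmsg prints the time each command takes
set rmsg on
set obs 10000
gen x = runiform()
gen x_loo = .
* For each observation, average x over every other observation
forvalues i = 1 / `=_N' {
    qui sum x if _n != `i', meanonly
    qui replace x_loo = r(mean) in `i'
}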

With only 10,000 observations this already takes a few seconds. Another definition, and the one I think most people will have used, is

\[ \bar{x}_{-i} = \dfrac{(\sum_j x_j) - x_i}{(\sum_j 1) - 1} \]

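* One pass for the total, then subtract each observation's own value
qui sum x, meanonly
gen x_loo2 = (r(sum) - x) / (r(N) - 1)
* Compare with a tolerance: the two formulas can differ by floating-point rounding
assert reldif(x_loo, x_loo2) < 1e-6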
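
To see why the two definitions agree, split each full sum into the excluded term and everything else. Here $N$ is my shorthand for the number of observations; it is not part of the definitions above.

\[ \sum_{j \ne i} x_j = \Big(\sum_j x_j\Big) - x_i \qquad \text{and} \qquad \sum_{j \ne i} 1 = \Big(\sum_j 1\Big) - 1 = N - 1 \]

The difference is in the amount of work. The loop recomputes the sum from scratch for each of the $N$ observations, so it does on the order of $N^2$ work (about $10^8$ evaluations for $N = 10{,}000$). The second version makes one pass to get the total and one pass to generate the variable, which is on the order of $N$.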

clear
set obs 10000000
gen x = runiform()
qui sum x, meanonly
gen x_loo2 = (r(sum) - x) / (r(N) - 1)

Even with 10M observations, this method gives the answer in a fraction of a second. In this case it would not have mattered how fast you made your loop: the faster algorithm wins every time.
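
The same trick should carry over to a leave-one-out mean within groups. The sketch below is my own illustration rather than part of the example above: the grouping variable g and the names sum_x, n_g, and x_loo_g are hypothetical, and I assume singleton groups should be left missing, since their leave-one-out mean is undefined.

```stata
clear
set obs 10000
* Hypothetical grouping variable taking values 1 to 10
gen g = ceil(runiform() * 10)
gen x = runiform()
* Group total and group size, each computed once per group
bysort g: egen double sum_x = total(x)
by g: gen long n_g = _N
* Subtract each observation's own value; singleton groups stay missing
gen double x_loo_g = (sum_x - x) / (n_g - 1) if n_g > 1
```

As in the ungrouped case, the totals are computed once and reused, so no observation-level loop is needed.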