Writing my first Stata plugin: A real world use case
I’m not the biggest fan of Stata. Though I use it every day for RA work, and Stata does shine when all you want to do is explore one data set (or a series of data sets that are easy to merge), it’s become increasingly apparent over time that whenever I want to do something complex or computationally intensive, it pales.
Mata makes many of the rougher corners of Stata rather bearable. However, optimizing Stata for a speedy run is really difficult. Enter Stata plugins.
What are Stata plugins?
A Stata plugin is pre-recompiled code, written in C or C++, that can interact with Stata using the “Stata Plugin Interface (SPI).” Stata provides a C source file and header that allows a C program to interact with Stata’s data sets and matrices.
The implementation is relatively crude. Stata can write/read to/from C one observation at a time from/to existing variables and matrices. Stata has pretty good documentation for the functionality of their plugins, and I will not repeat all of it here.
A simple hello world program would be as follows:
- Get a C compiler (
gcc
, the GNU Compiler Collection, is standard and should be included on Mac and Linux; look into MinGW if you are on Windows to usegcc
). - Download the Stata Splugin Interface (SPI) version 2 (Stata <= 13) or 3 (Stata >= 14).
- Version 2: stplugin.c, stplugin.h
- Version 3: stplugin.c, stplugin.h
- Create hello.c (Note you should have stplugin.h and stplugin.c in the same directory).
- Now from the command line, run
- Last, from Stata navigate to your working directory and run
Is it worth the hassle?
Stata says that it’s only worth it if you are replacing a lot of interpreted ado-code and the task is not very complex. Though I agree on the latter (complex tasks will likely take more time to code in C than the time they will save) I strongly disagree on the former.
Perhaps most people realize this, but my understanding of for loops in Stata is that they are run as if you printed each block within the for loop however many times you tell it to execute. Thus
may look like three lines of code, but it’s really equivalent to 1000. The reason I started using Stata plugins was to speed up a simulation. The C code is longer and the base case are only a handful of lines in Stata, but it’s painfully slow because the bulk of the computation takes place inside a loop that does a simulation.
Below I document a real-world use case where C was 50 times faster than Stata, so for me the work was definitely worthwhile.
A real world use case
Several of the projects I work on are Randomized Control Trials. It is standard to conduct a power analysis for such projects in order to put together a proposal, etc. Having a well-powered trial is essential for the success of an RCT.
Since RCTs can give you a truly independent treatment variable, we can recover the treatment effect via simple OLS. Though parametric methods are well known and widely used to estimate power under this setup, they rely on strong assumptions. When clustering or stratification are involved, specially when the number of clusters is not very large, parametric calculations can be inaccurate.
One suggestion I got was to simulate power. I won’t outline the full rationale (read about it here), but the crux of the idea is to simulate a large number of coefficients $b$ for the equation:
\[Y_{ij} = a + b T_{ij} + g X_{ij} + e_{ij}\]where at each step of the simulation, $T_{ij}$ is simulated so that there are $NP$ individuals/clusters in treatment and $(1 - P)N$ in control. Since treatment is assigned randomly, the resulting distribution is a sample of the true distribution of $b$ under the null $H_0: b = 0$.
This does not tell us anything about power by itself, but the confidence interval can be used as the basis of an iterative procedure to simulate power. Hence coding the simulation efficiently is crucial.
Why write a plugin?
The problem above is actually very simple to implement. In pseudo-code:
Though simple, doing this efficiently is impossible in Stata. There are three prominent issues:
-
There is no way to shuffle a vector in Stata. That’s not a thing. Variables all exist in relation to each other and sorting one variable randomly will sort the entire data set. Shuffling an entire data set is much slower than shuffling one vector.
-
There is no way to compute just the regression coefficients in Stata. Stata’s
regress
computes a host of things along with the least squares solution This adds unnecessary overhead. (I have asked about how to do this before; the suggestion I got was to runquietly regress, notable
which just controls what Stata outputs, not what it computes). -
It’s not obvious how to store the results, specially with Stata/IC. Though the buffer versions of Stata should be able to handle most simulations after setting a larger
matsize
, the fact matrix sizes are capped (and in Stata/IC capped at 800), makes the function difficult to code.
The solution in Stata would look like this
This is a hugely inefficient program!
Wait, can’t Mata handle these things?
Right, Mata is the elephant in the room. If you don’t know, Mata is a programming language that is shipped with every version of Stata and it can interact with Stata relatively easily. If you have ever used an object-oriented programming language then you will recognize Mata as a more standard programming language than Stata.
Mata does afford us some efficiency, but not a lot (yes, I know about
.mlib
files and that technically Stata compiles mata into bytecode
when read into memory, but I have never found the speed improvement to
be significant).
In this case, Mata will run faster largely because it can
-
Shuffle just a single vector.
-
Get the OLS coefficients without any additional computations.
The implementation is very straightforward:
There are two problems:
-
Matrix operations in Mata are not terribly fast (certainly not compered to a compiled language like C or even a JIT-compiled language like Julia). Yes, I know Mata uses LAPACK and BLAS underneath, but it’s still largely an interpreted language.
-
There is no reason to run the loop sequentially! It is conceptually trivial to parallelize the loop. Granted, parallelism is not trivial but the fact it cannot be done, even in Stata/MP, is frustrating.
How does the solution in C look like?
C is certainly harder to write, and Stata’s primitive interaction with C makes it so getting the results back from C is annoying. BUT there is a MASSIVE speed improvement. For this particular use case it’s an order of magnitude (around 10x) over Mata (and that implementation was already faster than Stata).
Writing the wrapper for this is not too hard thanks to the GNU Scientific Library. Noting the 0-based indexing:
Save the code to pluginSimci.c
. To compile pluginSimci.plgin
, on top
of gcc
and the SPI, you will need
Again, you should have stplugin.c
and stplugin.h
in the same directory. Now on Linux/Unix, run
Depending on your system, you may also need to add -std=c99
as a flag and point to the location of the libgsl*so
files. For instance, I regularly SSH into a RedHat server, and to compile I ran
To compile in other system, you should consult Stata’s documentation. Once compiled, from Stata:
Timing performance
I don’t really know of good Stata tools to profile performance. However, it’s not too hard to time how long a command takes to run. I wrote a simple wrapper for it, which we use with the programs above:
In the example Mata ran 5x vs Stata and the plugin ran 10x vs Mata. For a 50x speed improvement, I’d say the hassle was worth it! My real-world use-case for this was power simulations for a cluster-randomized trial where the underlying clusters were comprised of 200k observations overall. Running this in Mata took on the order of days. To have it run in hours was a massive boon.