We know the big differences between trying out a calculation on a back of an envelope and writing a rigorous proof. Likewise, there are differences between prototyping in a Matlab script and writing a reliable pre that supports a reproducible research project. In a 15-minute presentation for a introductory workshop on scientific software at UW Madison math department, I showcased my own workflow on a simple example of plotting several solutions to an ODE in Matlab.
Repository for the workshop can be obtained bygit clone https://bitbucket.org/mbudisic/workshopworkflow.git
Let’s give ourselves a problem: given an ODE, run several of its initial conditions, and plot them on a graph.
Here’s an example of a solution:
% onescript.m ic = [0 1; -1 -1; -.5, .5]; A = [0,1; -5,-2]; f = @(t,x)A*x; T = [0,5]; for k = 1:size(ic,1) [t,x] = ode45(f, T, ic(k,:) ); plot( t, x(:,2), '-o'); hold all; end
Typing this into Matlab and running it indeed solves the problem that we set out to solve. However, science is more than just finding one solution to the problem that we are trying to solve. We should also ensure that our work is well explained, transparent, and reproducible (by others or by ourselves, a few months down the line). Most scientists with a “wet lab” training have been taught to keep track of experimental protocols, decisions that worked, or didn’t work, and document their progress. Computational scientists are rarely deliberately taught these skills: most of us learned to code in a class or two, and learned to write larger projects through a sink-or-swim attitude that permeates (at least) the early years of grad school.
What’s wrong with our earlier script? Well, for one, we don’t really know its history, how it was built — we just have a script that we hope runs. Digging deeper and running it reveals that every time we want to plot the image, Matlab needs to re-run its ODE solver. This is not a problem in this case, but for more demanding computations it would be. If we look at the code, variables are named unintuitively — what is ‘f’, what is ‘A’? There’s no documentation either. Finally, if we want to change the way things are plotted, we have to dig through the script, find the line that does the plotting and then tinker with it, all the while re-running everything to test whether our graphs look any better.
Let’s take a page from software engineering, where thinking about, building, maintaining, and reverse-engineering large software codes is the bread-and-butter. “Best Practices for Scientific Computing”, a PLoS One paper, brings a list of dos-and-donts that can help you convert your code prototype into a reliable code.
“Why do I care?”, you say, “I don’t write any large projects myself! I’m a mathematician, not a programmer!”. This may be the case, but some of this practices are very simple to adopt, yet make writing even the simplest codes more robust. Additionally, many of these tips apply to writing papers in LaTeX, and don’t tell me mathematicians don’t code their papers in LaTeX. As I go through my workflow, I’ll refer to some of these tips for easier orientation.
First, to help maintain the history of the file, we use version control (Tip 3 in BPSC). If you ever played a long video game in which you saved your progress before going through a tough spot, and then backed out to it until you made through without a scratch, then you used a basic version control. Version control tools, like Git and Mercurial, allow you to do so for your code, your papers, and basically anything that can be saved as a file (although you might skip applying them to large datafiles, images, and such).
In my daily work, I use “git”. To start a git repository for your project, type
$ git init
in the folder where your project resides. Let’s add the script that we created before and then “commit” our changes to the repository, saving our progress.
$ git add onescript.m $ git commit -m "onescript.m added"
Our progress has now been saved and we can easily back out to it (using
git checkout command) if we need to later.
Splitting into modules
Parts of the code responsible for distinct functions should be split into separate functions, or files (Tip 4b in BPSC). For example, here we can identify that generating trajectories and visualizing them are two separate tasks, so we will create two empty files for storing their functions:
$ touch generate.m $ touch plotall.m
Since functions are now separate, we need a top-level function to string them together. This is sometimes referred to as the “harness”, “build” or “run” script. (Tip 2c in BPSC) Let’s create an empty file for it, and then save our progress (commit to the git repository)
$ touch run.m $ git add generate.m plotall.m run.m $ git commit -m "Created empty function file"
Notice that we didn’t really make huge progress yet, but we are still saving – it’s better to save often (Tip 3a in BPSC) then realizing that a big block of changes has to be saved in one big commit.
Saving intermediate data
Let’s first write the run script to figure out how our program should look like. The simplest version could be:
% run.m data = generate; % generate trajectories plotall(data); % plot trajectories
If we stopped here, we wouldn’t be doing much aside from splitting our code up. We can use the fact that functions of the code are split up to invoke only parts that are needed. E.g., if we are re-running a particular data run, we can load the trajectories from the disk, instead of computing them using ODE solver.
function run(name) matfile = [name '.mat']; if ~exist(matfile, 'file') data = generate(5); % Generate 5 trajectories save(matfile,'-struct','data') % Save data from this run. else data = load(matfile); % Load data from this run. end plotall(data);
Save the progress:
$ git add run.m $ git commit -m "Added saving and loading data from disk."
In simple files like this, it’s easy to read the code a year after and realize what it’s doing. But even slightly more complicated code will require a bit more documentation. Start documenting for your own sake as you code. Furthermore, strive for the code to be self-documenting: variables and functions should have sensible names so that it most of the purpose can be understood from reading the code. For everything else, there are comments and separate documentation.
For example, we can expand our function to look like:
function run(name) %RUN Visualize a data run. % % RUN(name) If name.mat exists, load the data and visualize them. % Otherwise, generate a data set, store it as name.mat and visualize it. % % See also: generate, plotall, onescript % % Form the appropriate extension. matfile = [name '.mat']; if ~exist(matfile, 'file') data = generate(5); % Generate 5 trajectories data.README = ['This is the dataset for an ODE run named ' name]; save(matfile,'-struct','data') % Save data from this run. disp(['Saved: ' data.README]); else data = load(matfile); % Load data from this run. disp(['Loaded: ' data.README]); end % Visualize data. plotall(data);
MATLAB and some other programming languages, e.g., Python, use special comment lines to generate documentation for the file.
The first comment line in a Matlab file, called H1 line, is the comment that Matlab prints when you issue
>> help run
Git commits provide further description of the history of the file. For example, typing
$ git log
$ git log run.m
gives you, respectively, the full history of the repository, or just those commits that included the file run.m.
Collaboration and code-borrowing
There are several benefits of having a responsible workflow while writing your code.
First, perhaps obviously by now, is that your own work is streamlined and better documented so you can work efficiently (and that’s a good thing!)
Second, it makes you less “ashame” of your code, and therefore more likely to share it publicly. For example, here are my Github repositories. Now, sharing the code is not only an altruistic goal, but a part of transparent research. Would you accept a theorem from a paper that just says “trust me, I can prove this?” Of course not. The same should hold true for code we use to generate our numerical results. By coding responsibly in the first place you can cut time between having a working code, and publishing it online.
Finally, adopting the use of version control, debugging, and documentation in an everyday life makes them easier and easier to use. Then, when someone else shares their own code online, you have the tools to access it, modify it, use it. For example, MATLAB Central (Mathworks’ code-sharing website) now integrates with git. Overleaf (an online collaborative LaTeX editor) also allows (multiple) users to pull and push from the same repository, seamlessly merging the changes into a paper.
If you are looking for something worthwhile to “
git pull“, get yourself
export_fig which is one of the best user-written pieces of Matlab software. In short, it creates publication-level images from your Matlab graphs.
Git it by:
$ git clone https://github.com/altmany/export_fig
and then add the resulting folder into your Matlab path
>> addpath() >> savepath
and enjoy pretty images:
Validation and testing
In Matlab, Python and similar “duck-typed” languages, a common source of errors stems from functions not being precise about the type of its arguments. Furthermore, aside from “unsigned integers”, programming languages rarely have facilities for specifying that a number in a desired range.
It is prudent to ensure that the arguments really are what we expect them to be, and emit an error otherwise (Tip 5a) In Matlab, using “validateattributes” is an extremely useful function designed exactly for this purpose.
If you open the “generate.m” file from the repository, you will see the line
which (as the comments say), check that the input is a positive, finite integer. If you try to pass 7.3 to the function, you get
>> generate(7.3) Error using generate (line 11) Expected input to be integer-valued.
Now, when you are designing a massive piece of code, it might emit an error from its depths and sometimes it can be challenging to write out all the variables at the very time an error occurred. For this reason, use a debugger (Tip 5d). It is a tool that can stop your program just at the moment it breaks down and allows you to examine the guts of your code as the offending line sees it.
In Matlab, such “breakpoints” are extremely easy to set up. We can just say “stop whenever you encounter an error”:
>> dbstop if error
If we again type
generate(7.3) , Matlab prevents the program from exiting after it emits the error, and allows us to both see the offending code line, and type commands into Matlab’s command line as if we are executing code line-by-line.
Here is a short list of resources where you can learn more about responsible scientific computing:
- Software Carpentry – plenty of tutorials, presentations, and “best practices”, from a group promoting software engineering-like practices in scientific computing
- Best Practices for Scientific Computing, from PLoS One (2014) — a journal paper giving a good overviews of basic best practices
- Matlab’s Advanced Software Development documentation
- Git Tutorial for Scientists and Soloists