Scientific software workflow

We know the big differences between trying out a calculation on a back of an envelope and writing a rigorous proof. Likewise, there are differences between prototyping in a Matlab script and writing a reliable pre that supports a reproducible research project. In a 15-minute presentation for a introductory workshop on scientific software at UW Madison math department, I showcased my own workflow on a simple example of plotting several solutions to an ODE in Matlab.

Repository for the workshop can be obtained by

git clone https://bitbucket.org/mbudisic/workshopworkflow.git

Let’s give ourselves a problem: given an ODE, run several of its initial conditions, and plot them on a graph.

Here’s an example of a solution:

% onescript.m
ic = [0 1; -1 -1; -.5, .5];
A = [0,1; -5,-2];
f = @(t,x)A*x;
T = [0,5];

for k = 1:size(ic,1)
[t,x] = ode45(f, T, ic(k,:) );

plot( t, x(:,2), '-o');
hold all;
end

Typing this into Matlab and running it indeed solves the problem that we set out to solve. However, science is more than just finding one solution to the problem that we are trying to solve. We should also ensure that our work is well explained, transparent, and reproducible (by others or by ourselves, a few months down the line). Most scientists with a “wet lab” training have been taught to keep track of experimental protocols, decisions that worked, or didn’t work, and document their progress. Computational scientists are rarely deliberately taught these skills: most of us learned to code in a class or two, and learned to write larger projects through a sink-or-swim attitude that permeates (at least) the early years of grad school.

What’s wrong with our earlier script? Well, for one, we don’t really know its history, how it was built — we just have a script that we hope runs. Digging deeper and running it reveals that every time we want to plot the image, Matlab needs to re-run its ODE solver. This is not a problem in this case, but for more demanding computations it would be. If we look at the code, variables are named unintuitively — what is ‘f’, what is ‘A’? There’s no documentation either. Finally, if we want to change the way things are plotted, we have to dig through the script, find the line that does the plotting and then tinker with it, all the while re-running everything to test whether our graphs look any better.

Let’s take a page from software engineering, where thinking about, building, maintaining, and reverse-engineering large software codes is the bread-and-butter. “Best Practices for Scientific Computing”, a PLoS One paper, brings a list of dos-and-donts that can help you convert your code prototype into a reliable code.

“Why do I care?”, you say, “I don’t write any large projects myself! I’m a mathematician, not a programmer!”. This may be the case, but some of this practices are very simple to adopt, yet make writing even the simplest codes more robust. Additionally, many of these tips apply to writing papers in LaTeX, and don’t tell me mathematicians don’t code their papers in LaTeX. As I go through my workflow, I’ll refer to some of these tips for easier orientation.

Version control

First, to help maintain the history of the file, we use version control (Tip 3 in BPSC). If you ever played a long video game in which you saved your progress before going through a tough spot, and then backed out to it until you made through without a scratch, then you used a basic version control. Version control tools, like Git and Mercurial, allow you to do so for your code, your papers, and basically anything that can be saved as a file (although you might skip applying them to large datafiles, images, and such).

In my daily work, I use “git”. To start a git repository for your project, type

$ git init

in the folder where your project resides. Let’s add the script that we created before and then “commit” our changes to the repository, saving our progress.

$ git add onescript.m
$ git commit -m "onescript.m added"

Our progress has now been saved and we can easily back out to it (using git checkout command) if we need to later.

Splitting into modules

Parts of the code responsible for distinct functions should be split into separate functions, or files (Tip 4b in BPSC). For example, here we can identify that generating trajectories and visualizing them are two separate tasks, so we will create two empty files for storing their functions:

$ touch generate.m
$ touch plotall.m

Since functions are now separate, we need a top-level function to string them together. This is sometimes referred to as the “harness”, “build” or “run” script. (Tip 2c in BPSC) Let’s create an empty file for it, and then save our progress (commit to the git repository)

$ touch run.m
$ git add generate.m plotall.m run.m
$ git commit -m "Created empty function file"

Notice that we didn’t really make huge progress yet, but we are still saving – it’s better to save often (Tip 3a in BPSC) then realizing that a big block of changes has to be saved in one big commit.

Saving intermediate data

Let’s first write the run script to figure out how our program should look like. The simplest version could be:

% run.m
data = generate; % generate trajectories
plotall(data); % plot trajectories

If we stopped here, we wouldn’t be doing much aside from splitting our code up. We can use the fact that functions of the code are split up to invoke only parts that are needed. E.g., if we are re-running a particular data run, we can load the trajectories from the disk, instead of computing them using ODE solver.

function run(name)
matfile = [name '.mat'];

if ~exist(matfile, 'file')
data = generate(5); % Generate 5 trajectories
save(matfile,'-struct','data') % Save data from this run.
else
data = load(matfile); % Load data from this run.
end
plotall(data);

Save the progress:

$ git add run.m
$ git commit -m "Added saving and loading data from disk."

Documenting

In simple files like this, it’s easy to read the code a year after and realize what it’s doing. But even slightly more complicated code will require a bit more documentation. Start documenting for your own sake as you code. Furthermore, strive for the code to be self-documenting: variables and functions should have sensible names so that it most of the purpose can be understood from reading the code. For everything else, there are comments and separate documentation.

For example, we can expand our function to look like:

function run(name)
%RUN Visualize a data run.
%
% RUN(name) If name.mat exists, load the data and visualize them.
% Otherwise, generate a data set, store it as name.mat and visualize it.
%
% See also: generate, plotall, onescript
%

% Form the appropriate extension.
matfile = [name '.mat'];

if ~exist(matfile, 'file')
data = generate(5); % Generate 5 trajectories
data.README = ['This is the dataset for an ODE run named ' name];
save(matfile,'-struct','data') % Save data from this run.
disp(['Saved: ' data.README]);
else
data = load(matfile); % Load data from this run.
disp(['Loaded: ' data.README]);
end

% Visualize data.
plotall(data);

MATLAB and some other programming languages, e.g., Python, use special comment lines to generate documentation for the file.

The first comment line in a Matlab file, called H1 line, is the comment that Matlab prints when you issue

>> help run

Git commits provide further description of the history of the file. For example, typing

$ git log

or

$ git log run.m

gives you, respectively, the full history of the repository, or just those commits that included the file run.m.

Collaboration and code-borrowing

There are several benefits of having a responsible workflow while writing your code.

First, perhaps obviously by now, is that your own work is streamlined and better documented so you can work efficiently (and that’s a good thing!)

Second, it makes you less “ashame” of your code, and therefore more likely to share it publicly. For example, here are my Github repositories. Now, sharing the code is not only an altruistic goal, but a part of transparent research. Would you accept a theorem from a paper that just says “trust me, I can prove this?” Of course not. The same should hold true for code we use to generate our numerical results. By coding responsibly in the first place you can cut time between having a working code, and publishing it online.

Finally, adopting the use of version control, debugging, and documentation in an everyday life makes them easier and easier to use. Then, when someone else shares their own code online, you have the tools to access it, modify it, use it. For example, MATLAB Central (Mathworks’ code-sharing website) now integrates with git. Overleaf (an online collaborative LaTeX editor) also allows (multiple) users to pull and push from the same repository, seamlessly merging the changes into a paper.

If you are looking for something worthwhile to “git pull“, get yourself export_fig which is one of the best user-written pieces of Matlab software. In short, it creates publication-level images from your Matlab graphs.

Git it by:

$ git clone https://github.com/altmany/export_fig

and then add the resulting folder into your Matlab path

>> addpath()
>> savepath

and enjoy pretty images:

A run of the ODE system, saved by export_fig.

A run of the ODE system, saved by export_fig.

Validation and testing

In Matlab, Python and similar “duck-typed” languages, a common source of errors stems from functions not being precise about the type of its arguments. Furthermore, aside from “unsigned integers”, programming languages rarely have facilities for specifying that a number in a desired range.

It is prudent to ensure that the arguments really are what we expect them to be, and emit an error otherwise (Tip 5a) In Matlab, using “validateattributes” is an extremely useful function designed exactly for this purpose.

If you open the “generate.m” file from the repository, you will see the line

validateattributes(N, {'numeric'},{'integer','positive','finite'});

which (as the comments say), check that the input is a positive, finite integer. If you try to pass 7.3 to the function, you get

>> generate(7.3)
Error using generate (line 11)
Expected input to be integer-valued.

Now, when you are designing a massive piece of code, it might emit an error from its depths and sometimes it can be challenging to write out all the variables at the very time an error occurred. For this reason, use a debugger (Tip 5d). It is a tool that can stop your program just at the moment it breaks down and allows you to examine the guts of your code as the offending line sees it.

In Matlab, such “breakpoints” are extremely easy to set up. We can just say “stop whenever you encounter an error”:

>> dbstop if error

If we again type generate(7.3) , Matlab prevents the program from exiting after it emits the error, and allows us to both see the offending code line, and type commands into Matlab’s command line as if we are executing code line-by-line.

When

When “dbstop if error” is issued, Matlab halts the execution exactly at the time of error. Behavior is turned off by “dbclear if error”.

Further reading

Here is a short list of resources where you can learn more about responsible scientific computing:

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s