Friday, December 4, 2009

The need to publish scientific datasets and code

One of the interesting angles of the so-called CLIMATEGATE scandal, where files and e-mails of climate change researchers were stolen and selectively published, is the READ_ME file documenting someone's work trying to read old data files and get models to work. Whatever ethical problems the e-mails reveal -- and I agree there are some -- the file is solid documentation that these are real scientists grappling with problems that are all too familiar to me as an astronomer, especially because I have programmed in IDL and Fortran. It seems physicists and software developers have the same reaction. Here's a sample from the file (after the excerpt I'll sketch what the "regridding" in item 4 actually means):

4. Successfully ran the IDL regridding routine quick_interp_tdm.pro (why IDL?! Why not F90?!) to produce '.glo' files.

5. Currently trying to convert .glo files to .grim files so that we can compare with previous output. However the progam suite headed by globulk.f90 is not playing nicely - problems with it expecting a defunct file system (all path widths were 80ch, have been globally changed to 160ch) and also no guidance on which reference files to choose. It also doesn't seem to like files being in any directory other than the current one!!

6. Temporarily abandoned 5., getting closer but there's always another problem to be evaded. Instead, will try using rawtogrim.f90 to convert straight to GRIM. This will include non-land cells but for comparison purposes that shouldn't be a big problem... [edit] noo, that's not gonna work either, it asks for a 'template grim filepath', no idea what it wants (as usual) and a serach for files with 'grim' or 'template' in them does not bear useful fruit. As per usual. Giving up on this approach altogether.
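For readers who haven't met the term, "regridding" just means interpolating irregularly sampled values -- say, temperature anomalies measured at scattered weather stations -- onto a regular latitude/longitude grid. Here is a minimal Python sketch of the idea; the made-up station data, the half-degree grid spacing, and the linear interpolation method are all my illustrative assumptions, not the actual algorithm in quick_interp_tdm.pro:

    import numpy as np
    from scipy.interpolate import griddata

    # Hypothetical scattered station data: (lon, lat) positions and a
    # temperature anomaly at each station. Real inputs would come from files.
    np.random.seed(0)
    station_lon = np.random.uniform(-180.0, 180.0, 500)
    station_lat = np.random.uniform(-90.0, 90.0, 500)
    anomaly = np.random.normal(0.0, 1.0, 500)  # degrees C, invented

    # Target regular grid at 0.5-degree spacing (an assumption, chosen only
    # because gridded climate products often use it).
    grid_lon, grid_lat = np.meshgrid(np.arange(-179.75, 180.0, 0.5),
                                     np.arange(-89.75, 90.0, 0.5))

    # Interpolate station values onto the grid. With linear interpolation,
    # grid cells outside the convex hull of the stations come back as NaN.
    gridded = griddata(points=np.column_stack([station_lon, station_lat]),
                       values=anomaly,
                       xi=(grid_lon, grid_lat),
                       method="linear")

    print(gridded.shape)  # (360, 720): one value per half-degree cell

The real routine surely does more than this (land masks, distance weighting, months with missing stations), but this is the core operation the READ_ME's author was struggling to verify.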

I have any number of text files that look like this, though I doubt I am quite as good at documenting each step anymore, since I haven't worked on big projects lately. One of the worst aspects is trying to figure out what someone else really did -- whether it's regridding temperature data stored in binary files (as in the stolen file), or seeing how the 2MASS pipeline really calculates PSF and aperture photometry, and where the aperture corrections come from (as in my life as a Staff Scientist at IPAC). More generally, NASA and astronomers everywhere have historically had a lot of trouble preserving old data, since file formats were not standardized decades ago. The days when the data were a photographic plate that could be preserved in a vault for a century are long gone. In short, the READ_ME file is a snapshot of what many scientists' work is like.
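For non-astronomers: aperture photometry is conceptually simple -- sum the pixels in a circle around the star, subtract the sky background estimated from a surrounding annulus, and scale up by an "aperture correction" for the starlight that falls outside the circle. The Python sketch below shows the idea; the radii and the hard-coded correction factor are illustrative assumptions, not the 2MASS pipeline's actual parameters (in a real pipeline the correction is measured from bright stars, which is exactly the kind of provenance question I was chasing at IPAC):

    import numpy as np

    def aperture_photometry(image, x0, y0, r_ap=5.0, r_in=8.0, r_out=12.0,
                            ap_corr=1.1):
        # r_ap:        radius of the source aperture, in pixels
        # r_in, r_out: inner/outer radii of the sky annulus
        # ap_corr:     multiplicative aperture correction (assumed value)
        y, x = np.indices(image.shape)
        r = np.hypot(x - x0, y - y0)

        # Median of the annulus gives a per-pixel sky level that resists
        # contamination from neighboring stars.
        sky = np.median(image[(r >= r_in) & (r <= r_out)])

        # Background-subtracted flux in the aperture, corrected for the
        # fraction of the star's light falling outside r_ap.
        flux = (image[r <= r_ap] - sky).sum()
        return ap_corr * flux

    # Tiny demo: a fake Gaussian star on a flat sky in a 64x64 image.
    yy, xx = np.indices((64, 64))
    star = 1000.0 * np.exp(-0.5 * ((xx - 32)**2 + (yy - 32)**2) / 2.0**2)
    print(aperture_photometry(star + 50.0, 32, 32))

PSF photometry replaces the circular sum with a fit of the point-spread function to each star, but the bookkeeping questions -- which radii, which correction, measured from what -- are the same, and without the code you can't answer them.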

Also interesting is what this reveals about how scientists should publish today. My friend David Hogg is a great advocate for publishing the code along with data and model results in journal articles. I think he's correct, and astronomical (and climate!) journals should work to encourage this in the future. I'm on the Spitzer Science Users Panel, and in a recent meeting we recommended that the Spitzer Science Center release all the computer code used to process Spitzer Space Telescope data. Although this pipeline will not compile or run on other computers, it does provide precise documentation of what was really done, and if some poor future grad student finds him- or herself recreating it, at least he or she has a better chance of getting the answer right in less time. Furthermore, it may be useful as a model for other projects. Given the public controversy over global warming, publishing the full model codes and datasets needs to be encouraged. It may reveal some sloppy comments or even mistakes, but in the long run it will benefit the scientific community and the broader public.