EcoPress

Repeatable and transparent data analysis: making the leap from Excel to Python (with tutorial)

2355590274_a8fea439dd_z

Green tree python. Image courtesy of Sebastian Niedlich, via Flickr under the Creative Commons attribution license.

By John Field

For most of my time in graduate school, I avoided learning a programming language like the plague.  I did all my analyses in spreadsheets and sometimes Access, and my figures in Excel.  Finally things got to a point where the scale of the analysis I wanted to do got so big that there wasn’t any alternative to writing a program to automate it.  After a lot of research and consultations I decided to teach myself Python via the Software Carpentry series of online tutorials, as recommended to me by Andrew Tredennick.  And boy, I wish I had done it sooner- getting into basic programming for data analysis was much easier that I had feared, and has really opened up a new world for me.

“…Excel…is still notorious for errors relating to copying & pasting data or specifying cell ranges…In contrast, careful code-based analysis is highly consistent with the scientific method”

But to back up a bit, why might you want to start coding your data analyses in Python, or any other language, in the first place?  For analyses that you repeat often or that you eventually intend to publish, there are several reasons to move beyond point-and-click software like Excel.  In many cases the data might be too big to treat any other way.  Recently several high-profile journals have published articles (for example, here and here) about how computing skills can be essential to tackle many modern scientific problems and the ‘big data’ that often underlie them, but graduate training and peer review processes have been slow to adapt to this new world.  Even if you are at a scale to sneak by with Excel, it is still notorious for errors relating to copying & pasting data or specifying cell ranges, sometimes with big consequences.

In contrast, careful code-based analysis is highly consistent with the scientific method:

  • repeatability– coding your analysis doesn’t mean that you won’t make mistakes (my first drafts are always full of errors!), but it means that any mistakes will happen consistently every time you run an analysis, and thus can be traced back and corrected.
  • hypotheses testing– it might take longer to write the code initially than to do a point-and-click analysis, but the amount of time to run the second or third or forth analysis drops to zero (hopefully).  This makes it that much more likely that you’ll repeat the analysis for different datasets in the future, or modify your script slightly to look for different effects in the existing dataset (though be careful with that!).
  • transparency and sharing– it’s much easier to hand off a piece of code to a collaborator or reviewer, rather than a long set of instructions describing a series of copy & paste operations in Excel.  This improves your ability to collaborate and have others help quality control your work.

Why choose Python over other languages like PERL, R, or MATLAB?  For one, Python is open source and free (unlike MATLAB), and gaining popularity (unlike PERL).  If you have a Mac running OSX Tiger or more recent, you already have a copy installed, and it’s on many networked Unix/Linux machines out there.  Like R, it’s a high-level language that’s interpreted, so you don’t have to mess around with compiling your code every time you want to test it.  I’ve heard that it’s better than R at navigating through networks, manipulating files, and talking to the computer’s operating system so you can automate running other programs from within your script.  Python uses the indentation of each line of code to define where loops and functions start and end, so it’s straightforward to read and error-check- I’ve never had to spend much time chasing down hard-to-find stray brackets that cause stuff to break!  And while I can’t personally vouch for Python’s ability to do fancy statistical analyses like R does, check out this blog if that’s of interest to you.

So how do you proceed?

  • Download the program.  Even if you do already have a copy installed, it’s a good idea to download a separate Python distribution where someone has packaged in a bunch of useful modules already- the Canopy Enthought Python Distribution (EPD) is a good one to start with.
  • Start going through the Software Carpentry Python tutorial, and the Shell tutorial as well if you work on a Mac or one of the NREL network machines.
  • As questions come up, try searching or posting on Stack Overflow– chances are good someone has had the same question in the past.
  • If you find yourself getting into more complex programs, do yourself a favor and download an Integrated Development Environment (IDE) like PyScripter or PyCharm (that’s what I use).  These are programs to help you keep track of variables and data across multiple files, and help you catch bugs in real time as you are coding.
  • Learn how to do version control in GitHub or some version of SVN so you can keep tabs on your code as it evolves over time (a lot of IDEs will have this built in).

For those interested in learning more, below is a tutorial illustrating a simple script that covers many of the potential benefits of code-based data analysis mentioned above.  This script reads in a directory full of DayCent model output .lis files (DayCent is a biogeochemical model that runs on a daily time step; .lis output files are generated in the form of space delimited data with variable outputs in columns, timesteps in rows), plots a subset of the data, and saves a summary of it to file.  It then archives all output to a separate time-stamped directory along with a copy of itself, to help you remember what you did and when you did it, and hopefully be able to re-run it later if need be.

"""This is your script's 'docstring', a place to write a general
description of what your code does, for example:
This script analyzes a set of DayCent .lis file output, computes
average rates of change in soil carbon, plots all change in soil
carbon vs. time, and archives all results.
"""

# import statements allow you to access useful functions defined
# in other modules outside this one
import os
import glob
import shutil
import datetime
import matplotlib.pyplot as plt

# You should define functions for operations that will be repeated
# multiple times or re-used in different modules.  These functions
# won't be run until they are called in the main body of the code
# below.  Each function can have its own 'docstring' description.
def lis_reader(lis_fpath, time_column, soc_column, head_skip,
               tail_skip):
    """This function reads a DayCent .lis output file and returns
    lists of time elapsed and change in soil organic carbon, as
    well as the average rate of change. It's good practice to
    describe all arguments (inputs) to your function, including
    the data-type (string, integer, etc.) like this:
    Args-
        lis_fpath (str):  file/path to .lis file to be read
        time_column (int):  'time' column in the .lis file.
            Note that python starts counting from 0, not 1!
        soc_column (int):  'soc' column in the .lis file.
        head_skip (int):  number of header lines to skip over
            before starting to read data
        tail_skip (int):  number of lines to skip over at end
            of file (DayCent .lis files always include an extra
            repeated line!)
    """
    # This opens the target .lis file as an object called
    # 'lis_file' and reads its data line-by-line, selectively
    # saving time data and soc data into lists of floating-point
    # (i.e. decimal) values, skipping header rows as needed and
    # trimming redundant data at the end.
    lis_file = open(lis_fpath, 'r')
    time_values = []
    soc_values = []
    for i in range(head_skip):
        next(lis_file)
    for line in lis_file:
        time_values.append(float(line.split()[time_column]))
        soc_values.append(float(line.split()[soc_column]))
    for j in range(tail_skip):
        del time_values[-1]
        del soc_values[-1]

    # Next, we note the initial soil organic level and simulation
    # starting time and re-compute our 'soc_values' and
    # 'time_values' lists as relative changes.
    initial_time = time_values[0]
    initial_soc = soc_values[0]
    for k in range(len(time_values)):
        time_values[k] = time_values[k]-initial_time
        soc_values[k] = soc_values[k]-initial_soc

    # Now we compute the average rate of change in soil organic
    # carbon by subtracting the first soc value from the last
    # soc value, and dividing by the time elapsed
    soc_rate = (soc_values[-1] - soc_values[0]) / time_values[-1]

    # Finally, we send our results back to the main program from
    # which this function was called
    return time_values, soc_values, soc_rate

# This is where the main body of the code starts- when you run the
# script it starts executing line-by-line from this point.  This
# first statement prints out the contents of the docstring for the
# user's reference.
print __doc__
print

# We'll want to save all of our average soc change rate results
# in a comma-separated-value file format.  We'll start by defining
# a text string we'll use to hold that data, beginning with a
# header of column labels separated by commas and the new-line
# character (\n) at the end.
results_string = "file, soc change rate (gC/y) \n"

# Next, we prompt the user for the location of the DayCent output
# data they wish to analyze.  For each .lis file found there we
# run our lis_reader() function, add the results to a plot using
# the .lis file name as a label, and write the results at the end
# of our results text string.
data_path = raw_input("Please enter the file path to your data: ")
for f in glob.glob(os.path.join(data_path, '*.lis')):
    # First, note the name of the .lis file being read
    file_name = f.split("/")[-1]
    # In my .lis files, time is reported in the 1st column and soc in
    # the 10th column, and I have 3 header lines and 1 redundant
    # trailing line I need to skip:
    time, soc, rate = lis_reader(f, 0, 9, 3, 1)
    # This adds my results as an x-y series to a plot
    plt.plot(time, soc, label=file_name)
    # Display the results to the screen, and add the data to the
    # results string
    print file_name, "average rate of SOC change: ", rate, "gC/y"
    results_string += "%s, %f \n" % (file_name, rate)

# After having lopped through and analyzed all the data, we create
# a time-stamped results directory within our data directory....
time_stamp = datetime.datetime.now().strftime("%Y-%m-%d_%H.%M")
results_path = data_path+"/"+time_stamp
os.mkdir(results_path)
# ... add a title, legend, and axis labels to our plot and save...
plt.title("DayCent-simulated change in soil organic carbon,\n\
conversion of Louisiana cropland/rangeland to switchgrass")
plt.legend(fontsize=10, loc=5)
plt.xlabel("Elapsed time (y)")
plt.ylabel("Change in SOC (gC)")
plt.savefig(results_path+"/plot.png")
# ... save our average soc rate change data in .csv format...
results_file = results_path+"/results.csv"
output = open(results_file, "w")
output.write(results_string)
output.close()
# ... and archive a copy of the analysis script itself for future
# reference.
shutil.copy(__file__, results_path)
print
# THE END!

Here’s what the output looks like on the command line:

nrel-105-184:blog_python_code johnfield$ python daycent_analysis.py
This is your script’s ‘docstring’, a place to write a general
description of what your code does, for example:
This script analyzes a set of DayCent .lis file output, computes
average rates of change in soils carbon, plots all results, and
archives all results and associated meta-data.

Please enter the file path to your data: ./DayCent_output
graze_C_bioenergy.lis average rate of SOC change:  -27.5594 gC/y
graze_SiL_bioenergy.lis average rate of SOC change:  -44.5013 gC/y
graze_SL_bioenergy.lis average rate of SOC change:  -25.9306 gC/y
till_C_bioenergy.lis average rate of SOC change:  19.5517 gC/y
till_SiL_bioenergy.lis average rate of SOC change:  15.1144 gC/y
till_SL_bioenergy.lis average rate of SOC change:  8.23603 gC/y

nrel-105-184:blog_python_code johnfield$

And here’s the resulting figure:

Yup, you should definitely be cultivating your bioenergy feedstocks on previously-cropped lands, rather than converting rangeland...

Yup, you should definitely be cultivating your bioenergy feedstocks on previously-cropped lands, rather than converting rangeland…

So there you have it:  easy, repeatable, self-documenting data analysis including a figure in less than 130 lines of code (and most of those are comments)!

10 comments

  1. Melannie Hartman

    Thanks John! Your post is timely. Just this afternoon I started looking into using SciPy to process NetCDF files. But I didn’t quite know where to start. This helps a lot!

    • Atanga

      Hi Melannie. This script is very helpful but after a while trying to adopt it for my daycent .lis files, it does not work for me. I have no prior experience in the python language and I am using python 3.5. I have tried the script in both python 2.7 and 3.5 in vain. Please help me out.

  2. Pingback: EcoPress guest post on basic data analysis in Python | Energy and the Future

  3. Tanga

    am getting this message from using this code. Can some one help?
    print __doc__
    ^
    SyntaxError: Missing parentheses in call to ‘print’

  4. Atanga

    I have been trying to adopt this script for my daycent .lis output files for sometime now in vain. Please can you help me out? I am using python 3.5 and I have no prior programming language experience. This script helps me a lot to understand the language faster. Hoping for your kind reply.

    • Hello, I apologize for the slow reply, I just received notification of your comments recently! The error you are getting is related to the version of Python- version 3 made some big changes to basic commands like the ‘print’ and ‘raw_input’ statement that are not backwards-compatible with version 2.7 that is commonly used by those running on Macs and among many scientific module developers. Anyway, below is a slightly-modified version that works for me when I run it with Python version 3.3.2, which should be very very similar to your version- let me know (john.L.field@colostate.edu) if it works for you!

      “””This is your script’s ‘docstring’, a place to write a general
      description of what your code does, for example:
      This script analyzes a set of DayCent .lis file output, computes
      average rates of change in soils carbon, plots all results, and
      archives all results and associated meta-data.
      “””

      # import statements allow you to access useful functions defined
      # in other modules outside this one
      import os
      import glob
      import shutil
      import datetime
      import matplotlib.pyplot as plt

      # You should define functions for operations that will be repeated
      # multiple times or re-used in different modules. These functions
      # won’t be run until they are called in the main body of the code
      # below. Each function can have its own ‘docstring’ description.
      def lis_reader(lis_fpath, time_column, soc_column, head_skip,
      tail_skip):
      “””This function reads a DayCent .lis output file and returns
      lists of time elapsed and change in soil organic carbon, as
      well as the average rate of change. It’s good practice to
      describe all arguments (inputs) to your function, including
      the data-type (string, integer, etc.) like this:
      Args-
      lis_fpath (str): file/path to .lis file to be read
      time_column (int): ‘time’ column in the .lis file.
      Note that python starts counting from 0, not 1!
      soc_column (int): ‘soc’ column in the .lis file.
      head_skip (int): number of header lines to skip over
      before starting to read data
      tail_skip (int): number of lines to skip over at end
      of file (DayCent .lis files always include an extra
      repeated line!)
      “””
      # This opens the target .lis file as an object called
      # ‘lis_file’ and reads its data line-by-line, selectively
      # saving time data and soc data into lists of floating-point
      # (i.e. decimal) values, skipping header rows as needed and
      # trimming redundant data at the end.
      lis_file = open(lis_fpath, ‘r’)
      time_values = []
      soc_values = []
      for i in range(head_skip):
      next(lis_file)
      for line in lis_file:
      time_values.append(float(line.split()[time_column]))
      soc_values.append(float(line.split()[soc_column]))
      for j in range(tail_skip):
      del time_values[-1]
      del soc_values[-1]

      # Next, we note the initial soil organic level and simulation
      # starting time and re-compute our ‘soc_values’ and
      # ‘time_values’ lists as relative changes.
      initial_time = time_values[0]
      initial_soc = soc_values[0]
      for k in range(len(time_values)):
      time_values[k] = time_values[k]-initial_time
      soc_values[k] = soc_values[k]-initial_soc

      # Now we compute the average rate of change in soil organic
      # carbon by subtracting the first soc value from the last
      # soc value, and dividing by the time elapsed
      soc_rate = (soc_values[-1] – soc_values[0]) / time_values[-1]

      # Finally, we send our results back to the main program from
      # which this function was called
      return time_values, soc_values, soc_rate

      # This is where the main body of the code starts- when you run the
      # script it starts executing line-by-line from this point. This
      # first statement prints out the contents of the docstring for the
      # user’s reference.
      print(__doc__)
      print()

      # We’ll want to save all of our average soc change rate results
      # in a comma-separated-value format file. We’ll start by defining
      # a text string we’ll use to hold that data, beginning with a
      # header of column labels separated by commas and the new-line
      # character (\n) at the end.
      results_string = “file, soc change rate (gC/y) \n”

      # Next, we prompt the user for the location of the DayCent output
      # data they wish to analyze. For each .lis file found there we
      # run our lis_reader() function, add the results to a plot using
      # the file name as a label, and add the results to our text string.
      data_path = input(“Please enter the file path to your data: “)
      for f in glob.glob(os.path.join(data_path, ‘*.lis’)):
      # First, note the name of the .lis file being read
      file_name = f.split(“/”)[-1]
      # In my .lis files, time is reported in the 1st column and soc in
      # the 10th column, and I have 3 header lines and 1 redundant
      # trailing line I need to skip:
      time, soc, rate = lis_reader(f, 0, 9, 3, 1)
      # This adds my results as an x-y series to a plot
      plt.plot(time, soc, label=file_name)
      # Display the results to the screen, and add the data to the
      # results string
      print(file_name, “average rate of SOC change”, rate, “gC/y”)
      results_string += “%s, %f \n” % (file_name, rate)

      # After having lopped through and analyzed all the data, we create
      # a time-stamped results directory within our data directory….
      time_stamp = datetime.datetime.now().strftime(“%Y-%m-%d_%H.%M”)
      results_path = data_path+”/”+time_stamp
      os.mkdir(results_path)
      # … add a title, legend, and axis labels to our plot and save…
      plt.title(“DayCent-simulated change in soil organic carbon,\n\
      conversion of Louisiana cropland/rangeland to switchgrass”)
      plt.legend(fontsize=10, loc=5)
      plt.xlabel(“Elapsed time (y)”)
      plt.ylabel(“Change in SOC (gC)”)
      plt.savefig(results_path+”/plot.png”)
      # … save our average soc rate change data in .csv format…
      results_file = results_path+”/results.csv”
      output = open(results_file, “w”)
      output.write(results_string)
      output.close()
      # … and archive a copy of the analysis script itself for future
      # reference.
      shutil.copy(__file__, results_path)
      # THE END!

      • Oops, it looks like my code indentation gets all messed up when I post it as a comment… How about this- start with the original code in the post above, and try making the following changes by hand:
        Line 77: {print __doc__} change to {print(__doc__)}
        Line 78: {print} change to {print()}
        Line 92: {data_path = raw_input(“Please enter the file path to your data: “)} change to {data_path = input(“Please enter the file path to your data: “)}
        Line 104: { print file_name, “average rate of SOC change: “, rate, “gC/y”} change to { print(file_name, “average rate of SOC change: “, rate, “gC/y”)}
        Line 127: {print} change to {print()}

    • How about this- start with the original code in the post above, and try making the following changes by hand:
      Line 77: {print __doc__} change to {print(__doc__)}
      Line 78: {print} change to {print()}
      Line 92: {data_path = raw_input(“Please enter the file path to your data: “)} change to {data_path = input(“Please enter the file path to your data: “)}
      Line 104: { print file_name, “average rate of SOC change: “, rate, “gC/y”} change to { print(file_name, “average rate of SOC change: “, rate, “gC/y”)}
      Line 127: {print} change to {print()}

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: