Monday, February 11, 2013

Installing a full stack Python data analysis environment on OSX

Installing the latest numpy and scipy libraries takes some effort because they depend on Fortran libraries, while Apple is (was? ...) mostly neglecting all Fortran issues. Furthermore, when trying to compile these libraries for Python 3k and 64 bit, the headaches multiply to the point that the author has to strongly discourage using them with Py3k on OSX. Getting NumPy to work on 64 bit with Py3k while playing nicely with the various "high-level" libraries and the R environment can become a nightmare on a Mac. If you really need Py3k but are married to OSX, you are probably better off installing a virtual machine with your favorite Linux distro than trying to get this ensemble to work natively. In other words, OSX is not a good platform for scientific computing, and life is easier if you stick with what is readily available, unless you prefer spending your time on tinkering over productivity. You have been warned…

This guide will outline how to install the following Python software stack:
  • NumPy (1.6.2)
  • SciPy (0.10.1)
  • MatPlotLib (1.2.0)
  • IPython (0.13.2)
  • Scikit-Learn (0.13)
  • RPy2 (2.3.2)
Optional instructions to install the two "new kids on the block", Pandas and StatsModels, are provided, too. To get a 64 bit version of all this software installed on OSX 10.6 through 10.8 that does work "out of the box" (although admittedly not optimally due to the Fortran issues!) without too much of a hassle, follow these steps:

Preparatory Setup

It is assumed you are using distribute and pip to install Python packages. This means you need to have the following setup done already:

curl -O http://python-distribute.org/distribute_setup.py
sudo python distribute_setup.py
curl -O https://raw.github.com/pypa/pip/master/contrib/get-pip.py
sudo python get-pip.py


Furthermore, we are assuming a 64 bit build of Python 2.7 as the target environment. If you only need/want to use a stack compiled for a 32 bit architecture, simpler paths than the one laid out here might work as well.
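Whether the interpreter you are about to build on really is a 64 bit build can be checked from Python itself. The following is a minimal sketch using only the standard library; the function name pointer_bits is made up for this illustration.

```python
import struct
import sys


def pointer_bits():
    """Return the width of a C pointer in bits for the running interpreter."""
    # struct.calcsize("P") is the size of a pointer in bytes.
    return struct.calcsize("P") * 8


print("Python %d.%d, %d bit build"
      % (sys.version_info[0], sys.version_info[1], pointer_bits()))
```

If this reports 32 on a machine where you expected 64, the rest of this guide is likely targeting a different Python than the one you are running.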

NumPy

To have any chance of success, download and install a version of numpy that is not too new (currently, 1.6 works, while numpy-dev is at 1.8), built for the latest version of OSX and the oldest still supported version of Python (currently, 2.7).

The "newest" NumPy package found to work was numpy-1.6.2 for Python 2.7; with any newer package, the post-installation check in the Python interpreter:

>>> import numpy; numpy.test('full')

did not pass without errors. What you never want to see are errors directly related to the Fortran compiler. These probably mean that you have your own version of Fortran installed; the best remedy in that case is to remove it, after which the tests should pass.

SciPy

Again, fetch a version where not too many tests fail (even on Ubuntu LTS 12.04, the tests "test_io.test_imread" and "test_expon" are known to fail and are considered a non-issue). On OSX 10.7 with Python 2.7, it is possible to install the 0.10.1 package and the final

>>> import scipy; scipy.test()

check passes with "only" 9 failures. If you use newer versions, more tests will fail. In general, these two core libraries are the hardest part, and it is essential to get NumPy in particular installed correctly for everything else to work.
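Because a handful of failures is expected, it can help to compare the failure count of a test run against a known baseline rather than eyeballing the output. The sketch below assumes the nose-style summary line (for example "FAILED (failures=9)") printed at the end of a run; the helper name failure_count and the sample strings are invented for illustration and are not real output from this setup.

```python
import re


def failure_count(summary_line):
    """Return failures + errors reported in a nose-style summary line."""
    # A bare "OK" or "OK (SKIP=...)" line means nothing failed.
    if summary_line.strip().startswith("OK"):
        return 0
    # Collect every NAME=NUMBER pair inside the parentheses.
    counts = dict(re.findall(r"(\w+)=(\d+)", summary_line))
    return int(counts.get("failures", 0)) + int(counts.get("errors", 0))


# Hypothetical summary lines in the format nose prints after a run.
print(failure_count("FAILED (KNOWNFAIL=12, SKIP=31, failures=9)"))
print(failure_count("OK (SKIP=31)"))
```

Anything above the baseline you recorded for a known-good install is then a reason to look at the individual failures.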

MatPlotLib

The next step is the installation of matplotlib. There are pre-compiled OSX packages for Python 2.7 available, and the latest version (1.2.0 at the time of this writing) should work without any trouble. To ensure the installation worked, try this in the Python interpreter:

>>> from pylab import *; plot([1,2,3]); show()

and you should see a plot with a straight diagonal. To ensure you have the right library, also check:

>>> import matplotlib; matplotlib.__version__

and you should see the version number you were trying to install.
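If no window appears, or you are working over a remote shell, show() is of little use. A hedged alternative is to render the same plot to a file with the non-interactive Agg backend. The function name and output path below are placeholders chosen for this sketch, and the function simply reports False if matplotlib cannot be imported.

```python
import os
import tempfile


def render_test_plot(path):
    """Draw the [1, 2, 3] diagonal to a PNG file; return True on success."""
    try:
        import matplotlib
    except ImportError:
        return False  # matplotlib is not installed for this interpreter
    matplotlib.use("Agg")  # file-only backend, no window is opened
    import matplotlib.pyplot as plt

    fig = plt.figure()
    fig.add_subplot(111).plot([1, 2, 3])
    fig.savefig(path)
    plt.close(fig)
    return os.path.getsize(path) > 0


target = os.path.join(tempfile.mkdtemp(), "diagonal.png")
print(render_test_plot(target), target)
```

Opening the resulting PNG should show the same straight diagonal as the interactive check.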

IPython

First of all, a different readline installation is necessary:

sudo easy_install-2.7 readline

Note that readline has to be installed using easy_install, not pip! Now the default installation route should work, and we can simply do:

sudo pip install ipython[zmq,qtconsole,notebook,test]

To make sure the installation worked, execute the newly installed iptest script.

Scikit-Learn

This again is pretty straightforward. Do:

sudo pip install scikit-learn

nosetests sklearn --exe

This nosetest will produce one (and only one) error: "Split arrays or matrices into random train and test subsets". But according to the developer, this is a non-issue and can be ignored.
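Beyond the test suite, a tiny end-to-end fit is a quick way to confirm that scikit-learn and the NumPy underneath it cooperate. This is only a sketch: the function name sklearn_smoke_test is invented here, the data is a toy line y = x, and the function returns None when scikit-learn is not importable.

```python
def sklearn_smoke_test():
    """Fit y = x on three points and return the prediction for x = 3."""
    try:
        from sklearn.linear_model import LinearRegression
    except ImportError:
        return None  # scikit-learn is not installed for this interpreter
    model = LinearRegression()
    model.fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])
    return float(model.predict([[3.0]])[0])


result = sklearn_smoke_test()
if result is None:
    print("scikit-learn not importable")
else:
    print("prediction for x = 3: %.3f" % result)
```

A prediction very close to 3 suggests the linear algebra routines behind the library are wired up correctly.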

Last, if you are interested in two more experimental and novel Python libraries that attempt to remove the need for R (and/or rpy), you might want to install Pandas and StatsModels. If you prefer non-experimental, production-stable libraries, you are probably better advised to use R and RPy2, as RPy ("version 1") often tends to have issues.

Pandas

(Python Data Analysis Library) Again, the default installation route should work:

sudo pip install pandas

To ensure the library is operational, run (should not produce any errors):

nosetests pandas
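For a faster check than the full test suite, a few lines in the interpreter are enough to see that a DataFrame can be built and aggregated. The following is a minimal sketch with made-up column names and values; the function name pandas_smoke_test is hypothetical, and it returns None when pandas is not importable.

```python
def pandas_smoke_test():
    """Build a small DataFrame and return the mean of its "y" column."""
    try:
        import pandas as pd
    except ImportError:
        return None  # pandas is not installed for this interpreter
    frame = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]})
    return float(frame["y"].mean())


mean_y = pandas_smoke_test()
print("pandas not importable" if mean_y is None else "mean of y: %.1f" % mean_y)
```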

StatsModels

As with Pandas, we can use the "default installation pathway", but need to first install an undocumented dependency for this module (patsy):

sudo pip install patsy
sudo pip install statsmodels

To check the installation worked, open a Python interpreter session and do:

>>> import statsmodels.api as sm
>>> sm.test()

Here, several tests seem to be failing, and it is not clear at all whether this is expected. StatsModels has several hundred open issues and should probably be considered very experimental at this stage.

RPy2

Again, the standard installation works (assuming you have R itself installed already, at least!):

sudo pip install rpy2

To ensure the install worked, run the tests as:

python -m 'rpy2.tests'

You should not be seeing any problems.

E voilà - you now have a fully functioning environment for running all kinds of statistical data analyses and machine learning algorithms!
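As a last sanity check, the installed versions can be compared against the ones listed at the top of this guide. The script below is a sketch, not part of any of the libraries: the EXPECTED table and the helper installed_version are written for this post, and a module that fails to import for any reason is simply reported as missing.

```python
import importlib

# Versions named at the top of this guide, keyed by import name.
EXPECTED = {
    "numpy": "1.6.2",
    "scipy": "0.10.1",
    "matplotlib": "1.2.0",
    "IPython": "0.13.2",
    "sklearn": "0.13",
    "rpy2": "2.3.2",
}


def installed_version(module_name):
    """Return the module's __version__, "unknown", or None if not importable."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # A broken build can fail with errors other than ImportError.
        return None
    return getattr(module, "__version__", "unknown")


for name in sorted(EXPECTED):
    found = installed_version(name)
    status = "missing" if found is None else found
    print("%-12s wanted %-8s found %s" % (name, EXPECTED[name], status))
```

A mismatch is not necessarily a problem, but it tells you which package to look at first if one of the test suites above misbehaves.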