Getting Started With SciPy

user-agent-graph

SciPy is a collection of Python packages used for data analysis and similar scientific pursuits. This document is a rough braindump of what it takes to install it in a virtual environment on Linux, since I have to do it fairly frequently and usually forget some of the steps.

Basic Installation

First install some OS packages.

# It's way too hard to get these working in a virtualenv
sudo aptitude install python-gtk2 python-qt4
# Dependencies of some things we're going to build
sudo aptitude install libblas-dev liblapack-dev
# Many things will want to compile Fortran.
sudo aptitude install gfortran

Next build the virtual environment.

Before you do this, make sure to configure the pip cache. This saves you sloooowly redownloading everything if the build doesn’t work the first time.

$ cat ~/.pip/pip.conf
[global]
download_cache = ~/var/cache/pip

Then you can install.

virtualenv ~/var/venvs/pylab
~/var/venvs/pylab/bin/pip install numpy 
~/var/venvs/pylab/bin/pip install scipy
~/var/venvs/pylab/bin/pip install pandas

If it doesn’t work, google the error and you’ll almost certainly find that a dependency is missing.
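A quick way to confirm the install is to import each package with the virtualenv's interpreter and print its version. The following is a minimal sketch; the helper name installed_versions is made up for illustration, and it assumes each package exposes a __version__ attribute (NumPy, SciPy and pandas all do).

```python
import importlib


def installed_versions(names):
    """Return {name: version string, or None if the import fails}."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # A missing package, or one whose compiled parts failed to load.
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


# Report on the three packages installed above.
for name, version in sorted(installed_versions(["numpy", "scipy", "pandas"]).items()):
    print("%s: %s" % (name, version or "NOT INSTALLED"))
```

Save it under any name you like and run it with ~/var/venvs/pylab/bin/python so that it checks the virtualenv rather than the OS Python.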

Here is what is in my venv. (I could not get pip freeze to work while the global site-packages were enabled, but you can touch the no-global-site-packages.txt file described below to put it back and then it will work.)

bunker@normandie:~/var/venv/pylab$ ./bin/pip freeze -v
Jinja2==2.8
MarkupSafe==0.23
Pygments==2.0.2
argparse==1.2.1
backports.ssl-match-hostname==3.4.0.2
certifi==2015.9.6.2
decorator==4.0.4
funcsigs==0.4
functools32==3.2.3.post2
ipykernel==4.1.1
ipython==4.0.0
ipython-genutils==0.1.0
jsonschema==2.5.1
jupyter-client==4.1.1
jupyter-core==4.0.6
matplotlib==1.4.3
mistune==0.7.1
mock==1.3.0
nbconvert==4.0.0
nbformat==4.0.1
nose==1.3.7
notebook==4.0.6
numpy==1.10.1
pandas==0.17.0
path.py==8.1.2
pbr==1.8.1
pexpect==4.0.1
pickleshare==0.5
ptyprocess==0.5
pyparsing==2.0.3
python-dateutil==2.4.2
pytz==2015.6
pyzmq==14.7.0
qtconsole==4.1.0
scipy==0.16.0
simplegeneric==0.8.1
six==1.10.0
terminado==0.5
tornado==4.2.1
traitlets==4.0.0
wsgiref==0.1.2

Using OS Python for Some Things

I briefly mentioned using the OS packages for Python's Qt and GTK bindings. If you forgot to do this, you can easily turn your virtualenv into a non-isolated one by removing the following empty file in your virtualenv:

rm ~/var/venvs/pylab/lib/python2.7/no-global-site-packages.txt

You can put it back again to reverse the operation.
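If you lose track of which state a virtualenv is in, you can test for the marker file from Python. This is only a sketch for the older virtualenv layout described here; the function names are made up for illustration, and the path pattern is taken from the rm command above.

```python
import os
import sys


def marker_path(venv_dir, version=None):
    """Path of the marker file that keeps a virtualenv isolated."""
    if version is None:
        # Default to the version of the interpreter running this script.
        version = "%d.%d" % sys.version_info[:2]
    return os.path.join(venv_dir, "lib", "python" + version,
                        "no-global-site-packages.txt")


def is_isolated(venv_dir, version=None):
    """True if the marker exists, i.e. global site-packages are hidden."""
    return os.path.exists(marker_path(venv_dir, version))


print(is_isolated(os.path.expanduser("~/var/venvs/pylab"), "2.7"))
```

It prints True while the virtualenv is isolated and False once the file has been removed (or if the path does not exist at all).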

If you don’t mind messing with symlinks, you can instead use ln -s to link just the directories you want from the global site-packages into the virtualenv.

Using Graphical Consoles

Install some more things:

~/var/venvs/pylab/bin/pip install ipython
~/var/venvs/pylab/bin/pip install notebook
~/var/venvs/pylab/bin/pip install qtconsole

You can pretty much explore these and see what works best for you.

IPython is the standard command-line shell, with colours, completion and similar. If you have no graphical environment you can still write graphs to a file. An image viewer like feh is useful for looking at these.
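Writing a graph to a file without a display comes down to picking a non-interactive matplotlib backend before pyplot is imported. A minimal sketch, assuming matplotlib is installed in the virtualenv; the function name, sample values and output file name are made up for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, must be set before importing pyplot
import matplotlib.pyplot as plt


def save_line_plot(values, path):
    """Plot values as a line and write the figure to path as a PNG."""
    fig, ax = plt.subplots()
    ax.plot(values)
    fig.savefig(path)
    plt.close(fig)  # release the figure's memory
    return path


out_path = save_line_plot([1, 4, 9, 16],
                          os.path.join(tempfile.gettempdir(), "example.png"))
print(out_path)
```

The printed path can then be opened with feh or any other image viewer.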

If you install pylab then using --pylab inline will give you a bunch of MATLAB-like imports.

Notebook gives you an HTML interface in your browser and shareable notebooks.

Qtconsole is the same as IPython but has graphical output. This is particularly useful because graphs show up inline when you generate them.

Missing Things

I did not install pylab since I didn’t think I needed it, so the pip freeze output above does not include it. Pylab gives you a MATLAB-like environment, and since I’m not familiar with MATLAB (and merely want to write scripts that produce pretty graphs) I didn’t see a need for it. The --pylab argument for ipython needs pylab.

When I installed it later, I found that I had to configure my PATH using the activate script from virtualenv or it wouldn’t find certain dependencies. Cython was also necessary.

~/var/venvs/pylab/bin/pip install pylab 

Example Script

Here’s the very quickly thrown together script which produced the attached image:

import numpy as np
# If no X11 DISPLAY then we need to set a backend or it won't generate the png
# import matplotlib
# matplotlib.use('Agg')
import matplotlib.pyplot as plt
import pandas as pd

import re
import woothee
import datetime
from dateutil.parser import parse as parse_date

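# One named group per field of the Apache combined log format.
# Only 'time', 'request' and 'agent' are used below.
parts = [
    r'(?P<host>\S+)',        # host %h
    r'\S+',                  # ident %l (unused)
    r'(?P<user>\S+)',        # user %u
    r'\[(?P<time>.+)\]',     # time %t
    r'"(?P<request>.*)"',    # request "%r"
    r'(?P<status>[0-9]+)',   # status %>s
    r'(?P<size>\S+)',        # size %b (careful, can be '-')
    r'"(?P<referrer>.*)"',   # referrer "%{Referer}i"
    r'"(?P<agent>.*)"',      # user agent "%{User-agent}i"
]
pattern = re.compile(r'\s+'.join(parts)+r'\s*\Z')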

count = 0
agents = {}
times = {}

# I guess the problem here is you have to re-parse this massive set of data
# rather than just filtering it down in memory. I could probably have stored
# records in the data frame.
with open("raw") as io:
    for line in io:
        count += 1
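        match = pattern.match(line)
        if match is None:
            # Skip lines that are not in the expected log format.
            continue
        hit = match.groupdict()

        # Split off the timezone offset before parsing the timestamp.
        time, zone = hit['time'].split(" ")
        time = datetime.datetime.strptime(time, "%d/%b/%Y:%H:%M:%S")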

        # Skip static assets; str.startswith accepts a tuple of prefixes.
        static_prefixes = (
            "GET /images/",
            "GET /product_thumb.php",
            "GET /js/",
            "GET /css/",
            "GET /scripts/",
        )
        if hit['request'].startswith(static_prefixes):
            continue

        agent = hit['agent']
        agent_details = woothee.parse(hit['agent'])

        agent_name = agent_details['name']
        if agent_name == "UNKNOWN":
            if agent.startswith("curl/") and 'criteo' in agent:
                agent_name = "Criteo Curl"
            elif agent.startswith("python-requests/"):
                agent_name = "Probably ESCIA"
            else:
                agent_name = agent

        times.setdefault(time, {})
        times[time].setdefault(agent_name, 0)
        times[time][agent_name] += 1

        agents.setdefault(agent_name, {})
        agents[agent_name].setdefault(time, 0)
        agents[agent_name][time] += 1

        # if count > 1000:
        #     break

start_date = min(times.keys())
end_date = max(times.keys())

date = start_date.date()

start_date = datetime.datetime.combine(date, datetime.time(12, 43, 0))
end_date = datetime.datetime.combine(date, datetime.time(13, 23, 0))
dates = pd.date_range(start_date, end_date, freq='S')

tses = {}
sums = []

# There may be easier ways to do this but they are not obvious from a quick scan
# of the docs.
for name in agents:
    ser = pd.Series(agents[name], index=dates)
    sums.append((name, ser.sum()))
    tses[name] = ser

sums.sort(key=lambda elt: elt[1], reverse=True)
biggest = [name for (name, _) in sums[0:10]]

# or tses.keys() for all.
columns = biggest

# Label is so huge it's impossible to read.
df = pd.DataFrame(tses, index=dates, columns=columns)

# This does return a value, not change state. Don't get confused!
df = df.fillna(0)

# It does look somewhat crappy... but there are many other functions to help.
df.plot(figsize=(24, 8), stacked=True)

# Uses confusing global state but it works.
plt.legend(loc=2, prop={'size':6})
plt.savefig("out.png")
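
The comments in the script wonder whether the records could have been stored in a data frame, and whether there is an easier way to build the per-agent series. One possible approach, sketched below on a few made-up records rather than the real log, is to put one row per hit into a DataFrame and let groupby and unstack do the counting. The sample data and variable names are hypothetical, and this is not what produced the image above.

```python
import pandas as pd

# Hypothetical sample of already-parsed hits: (timestamp, agent name).
records = [
    ("2015-10-20 12:43:00", "Googlebot"),
    ("2015-10-20 12:43:00", "Googlebot"),
    ("2015-10-20 12:43:01", "Chrome"),
    ("2015-10-20 12:43:03", "Googlebot"),
]
hits = pd.DataFrame(records, columns=["time", "agent"])
hits["time"] = pd.to_datetime(hits["time"])

# Count hits per (second, agent), then pivot the agents into columns.
counts = hits.groupby(["time", "agent"]).size().unstack(fill_value=0)

# Reindex onto a full per-second range so quiet seconds show up as zero.
full_range = pd.date_range(counts.index.min(), counts.index.max(), freq="s")
counts = counts.reindex(full_range, fill_value=0)

# Keep only the busiest agents, ordered by total hits.
top = counts.sum().sort_values(ascending=False).head(10).index
counts = counts[top]

print(counts)
```

Because the raw hits stay in the hits frame, a different filter or time window only needs a new selection on that frame instead of another pass over the log file. The resulting counts frame can be plotted with counts.plot(stacked=True) in the same way as df above.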