code and stuff: 2009

I recently read a super great lecture that was featured at PyCon this year. It's very much a follow-up to the same speaker's super great lecture from last year's PyCon, so you should probably read that one first if you haven't already.

It's a great introduction to these funny things in Python called coroutines, which are sort of like generators, but backwards. I don't want to repeat what the lecture says, because that's not useful, but nothing else I write here will make sense unless you have either read the first twenty or so slides (they go by quick, so maybe you should just do that) or read the following summary:

Generators are functions that generate data. Great. Glad we got that out of the way. More seriously, they're usually replacements for functions that would otherwise create and return a list of data. The reason you'd want to do this is that a generator has the power to create one item at a time and return it.

This is a fantastic performance optimization, because suddenly, instead of building up a whole list in memory inside this generator function (when it's just going to get passed back anyway), you pop back an item at a time, and let the caller deal with them. Naturally, the caller can just collect them in a list if it wants to (this is a popular, but slightly wasteful, use of generators), or it can do some processing on each element as it comes back. You can imagine, at least intuitively, that this can speed things up, if only because it has the nice property of smoothing out the computation (much like turning up the timer frequency in your kernel): that is, instead of having big chokepoints in your call stack, data just flows right through quite easily.
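To make the list-versus-generator point concrete, here's a minimal sketch. The names squares_list and squares_gen are made up for illustration and aren't from the lecture; the size comparison at the end is only a rough indicator of what each object holds on to.

```python
import sys

def squares_list(n):
    """Build the whole result in memory, then hand it back at once."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result

def squares_gen(n):
    """Hand back one value at a time; nothing is stored between items."""
    for i in range(n):
        yield i * i

# Both produce the same values...
assert squares_list(5) == list(squares_gen(5))

# ...but the list holds 100000 items, while the generator object only
# remembers where it is in the loop.
big_list = squares_list(100000)
big_gen = squares_gen(100000)
print(sys.getsizeof(big_list) > sys.getsizeof(big_gen))  # True
```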

Where it gets Way Cool is when you realize that you can call a generator from another generator. Now, you can build up chains of generators instead of chains of list-processing functions, and suddenly, all your data zooms up the call stack instead of stopping at each frame and waiting for the other elements to get processed.
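If you want to see the "zooming" for yourself, here's a rough sketch of a three-stage chain that records the order in which things happen. The functions numbers, doubled and total, and the log list, are all invented for this illustration.

```python
log = []

def numbers(n):
    """Source: yields 0..n-1 and notes each item as it is produced."""
    for i in range(n):
        log.append('numbers %d' % i)
        yield i

def doubled(source):
    """Middle stage: a generator that consumes another generator."""
    for x in source:
        log.append('doubled %d' % x)
        yield 2 * x

def total(source):
    """Sink: an ordinary function that pulls items through the chain."""
    acc = 0
    for x in source:
        acc += x
    return acc

result = total(doubled(numbers(3)))
print(result)  # 6
for entry in log:
    print(entry)
```

The log comes out interleaved (numbers 0, doubled 0, numbers 1, doubled 1, and so on): each item travels the whole chain before the next one is even produced.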

So, the model for generators is that you have some source of data that generates items, like a file:


def cat(*filenames):
    for filename in filenames:
        with open(filename, 'r') as f:
            for line in f:
                yield line

Then you write functions that consume this data, by iterating over its contents:


def grep(pattern, *filenames):
    for filename in filenames:
        for line in cat(filename):
            if pattern in line:
                print(line, end='')  # each line already ends with a newline

While this is a simple (and incomplete) example, you get the idea. Now imagine grep has 'yield line' in place of the print call, so that yet another function can iterate over its output. The end result is that you have a source of data, several layers of processing, and a sink (in this case, 'cat' is the source and 'grep' is the sink), and the data flows from the source to the sink. Note that the source is at the bottom of the call stack, and the sink is at the top (if you have your stack correctly oriented).
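Here's roughly what the "grep yields instead of printing" version could look like. To keep it runnable without any files lying around, the source reads from a list of strings; cat_lines, grep_gen, upper and show are hypothetical names for this sketch only.

```python
def cat_lines(lines):
    """Stand-in for cat: a source that yields lines one at a time."""
    for line in lines:
        yield line

def grep_gen(pattern, lines):
    """Middle stage: pass along only the lines that contain pattern."""
    for line in lines:
        if pattern in line:
            yield line

def upper(lines):
    """Another middle stage, just to show that stages stack up."""
    for line in lines:
        yield line.upper()

def show(lines):
    """Sink: the top of the stack, pulling everything through."""
    for line in lines:
        print(line, end='')

text = ['spam and eggs\n', 'just eggs\n', 'more spam\n']
show(upper(grep_gen('spam', cat_lines(text))))
```

Read the last line inside out: cat_lines is the source at the bottom, show is the sink at the top, and every stage in between is just a for loop with a yield in it.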

For coroutines, this is reversed. Without delving into the details, all you do is flip the stack, so that at the top, you have the source, pushing data down through your pipeline, and at the bottom, you have a sink. The best part is that coroutines can do anything generators can, and the even bester part is that they can do more.

In the interest of completeness, here's the same example as above, with coroutines (just so you can get your head around the difference):


def cat(filenames, *targets):
    # sadly, we cannot do the same *filenames as above, but it's not a huge deal
    # to pass in a list, is it?

    for filename in filenames:
        with open(filename, 'r') as f:
            for line in f:
                for target in targets:
                    target.send(line)
    # looks about the same, right?  and look at that sexy, sexy multiplexing
    # action in the previous two lines!

import sys

def grep(pattern, out=sys.stdout):  # just added that out parameter for kicks
    # ignore the following, it's just boilerplate that we need for every coroutine
    try:
        while True:

            # okay start paying attention here

            line = (yield)  # this is where we get data from our caller
            if pattern in line:
                # line already ends with a newline, so don't add another
                print(line, file=out, end='')  # or target.send(line)

    # also ignore this bit for the same reason
    except GeneratorExit:
        # this grep is a sink with no targets of its own, so there is
        # nothing to close here; a stage with targets would close them
        return

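# usage:

# a coroutine has to be advanced to its first yield before anything can be
# sent to it, hence the next() calls below

# equivalent to $ cat foo.txt bar.txt | grep spam
spam = grep('spam')
next(spam)
cat(['foo.txt', 'bar.txt'], spam)

# I don't even think there's a simple equivalent on the shell (let me know if there is)
spam, eggs = grep('spam'), grep('eggs', out=sys.stderr)
next(spam)
next(eggs)
cat(['foo.txt', 'bar.txt'], spam, eggs)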
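One detail that's easy to trip over: a freshly created coroutine hasn't run any code yet, so it isn't sitting at a yield and can't receive anything. It has to be advanced once (with next(), or equivalently send(None)) before the first real send(). The sketch below shows both the failure and the fix; collector and received are names I made up for it.

```python
def collector(into):
    """A bare coroutine: receives items via send() and stores them."""
    while True:
        item = (yield)
        into.append(item)

received = []

# Sending to a coroutine that has not been started yet is an error.
fresh = collector(received)
try:
    fresh.send('too early')
except TypeError as err:
    print('not primed yet:', err)

# Advance to the first yield, and then send() works as expected.
co = collector(received)
next(co)
co.send('a')
co.send('b')
co.close()  # raises GeneratorExit inside the coroutine, ending it

print(received)  # ['a', 'b']
```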

So why bother flipping everything around? To answer that, we must examine why people usually prefer a push model to a pull model. Generators pull. Typically, people (webadmins) hate pull, because users always turn their update frequency up and hammer servers. In this case though, we don't care (our users, if we're a generator-admin, know everything we know, because the interpreter manages control flow and all that magic business). What we do care about is multiplexing.

Think about your call tree now. Each function call has only one caller, but many potential callees. In a generator, the data flows from the callees, through the generator, to the caller. Note that data can only get more concentrated. With coroutines, data flows from the caller, through the coroutine, to any or all of the callees! Suddenly we have multiplexing! Before you complain: we can still combine data, because a coroutine is still an object we can pass around and have multiple functions send to (just be careful about overloading it). The lecture covers the details, and repeating them here would be redundant.
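Here's a small sketch of both directions at once: one source fanning out to two branches, and both branches feeding the same sink. Everything in it (tee, tag, collect, and this particular priming decorator built on functools.wraps) is my own illustration, not code from the lecture.

```python
import functools

def coroutine(fun):
    """Advance a new coroutine to its first yield so it can accept send()."""
    @functools.wraps(fun)
    def wrapped(*args, **kwargs):
        gen = fun(*args, **kwargs)
        next(gen)
        return gen
    return wrapped

@coroutine
def tee(*targets):
    """Fan out: forward every item to each target."""
    while True:
        item = (yield)
        for t in targets:
            t.send(item)

@coroutine
def tag(label, target):
    """Mark each item with a label before passing it on."""
    while True:
        item = (yield)
        target.send((label, item))

@coroutine
def collect(into):
    """Sink: append everything received to a list."""
    while True:
        into.append((yield))

merged = []
sink = collect(merged)                        # one sink, shared (fan-in)
source = tee(tag('a', sink), tag('b', sink))  # one source, two branches (fan-out)

for n in (1, 2):
    source.send(n)

print(merged)  # [('a', 1), ('b', 1), ('a', 2), ('b', 2)]
```

Sharing the sink works here because each branch finishes its send() before the next one starts. Python refuses to send to a generator that is already in the middle of running (it raises ValueError), so a pipeline that loops back into itself won't work.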

So here we are! IF YOU CAN HEAR ME ABOVE, AND YOU ALREADY READ THE LECTURES, THIS IS THE PART YOU SHOULD SKIP TO. Now you know enough about coroutines and why they are cool, so I can show you the totally bitchin', chrome-plated-with-pinstripes coroutines I wrote to demonstrate how awesome both they, and I, are:


#!/usr/bin/python3

import sys

def coroutine(fun):
    """This manages a funky detail of coroutines that is dealt with very well
    in the lectures, in exactly this way."""

    def wrapped(*args, **kwargs):
        gen = fun(*args, **kwargs)
        gen.send(None)  # prime
        return gen
    # normalize
    wrapped.__dict__.update(fun.__dict__)
    wrapped.__name__ = fun.__name__
    wrapped.__doc__ = fun.__doc__
    return wrapped

@coroutine
def identity(*ts):
    """The identity function.  Good for multiplexing, I guess, if you want to
    write coroutines that don't multiplex, but still have that
    functionality."""

    try:
        while True:
            # In the examples below, this bit inside the while loop is what
            # you're supposed to pay attention to.  The try/except is just to
            # deal with generator.close(), and you have to do it, but it's zero
            # thinking time.

            elt = (yield)
            for t in ts:
                t.send(elt)

    except GeneratorExit as e:
        for t in ts:
            t.close()
        return

@coroutine
def grep(pat, *ts):
    """For each sequence sent, if 'pat in sequence', send sequence to each
    target.  Note that this works for lists, dicts, strings, and anything else
    for which 'in' has meaning."""

    try:
        while True:

            seq = (yield)
            if pat in seq:
                for t in ts:
                    t.send(seq)

    except GeneratorExit as e:
        for t in ts:
            t.close()
        return

@coroutine
def format(fmt, *ts):
    """Assume anything sent is a tuple, and do string formatting with fmt
    on it, then send to all targets."""

    try:
        while True:

            tpl = (yield)
            s = fmt % tpl
            for t in ts:
                t.send(s)

    except GeneratorExit as e:
        for t in ts:
            t.close()
        return

def prepend(pfx, *ts):
    """A special case of format.  Good for debugging I guess."""

    return format(pfx + '%s', *ts)

def append(sfx, *ts):
    """Forgot to add newlines?  append('\\n', ...) is here to help!"""

    return format('%s' + sfx, *ts)

@coroutine
def foldl(fun, *ts, init=None):
    """Here we are...functional programming!

    If this is not the traditional definition of foldl (and foldr below), I
    apologize wholeheartedly.  Whatever, it's kind of cool.
    
    The idea here is that it accumulates everything sent, and when the data
    ends, that is, when someone calls generator.close(), we send the
    accumulated value to all the targets."""

    try:
        while True:

            elt = (yield)
            if init is None:
                init = elt
            else:
                init = fun(init, elt)

    except GeneratorExit as e:
        for t in ts:
            t.send(init)
            t.close()
        return

@coroutine
def foldr(fun, *ts, init=None):
    """Same as foldl but fun is applied backwards.  Kinda silly I guess."""

    try:
        while True:

            elt = (yield)
            if init is None:
                init = elt
            else:
                init = fun(elt, init)

    except GeneratorExit as e:
        for t in ts:
            t.send(init)
            t.close()
        return

def sum(*ts):
    """A simple use of foldl (or foldr if you prefer)."""

    return foldl(lambda x, y: x + y, *ts)

def prod(*ts):
    """Another simple use of foldl/foldr."""

    return foldl(lambda x, y: x * y, *ts)

def push_list(lst, *ts):
    """Push a list through targets."""

    for elt in lst:
        for t in ts:
            t.send(elt)
    else:
        for t in ts:
            t.close()

def stdin(*ts):
    """This function acts like a source for data.  Pretty simple."""

    for line in sys.stdin:
        for t in ts:
            t.send(line)
    else:
        for t in ts:
            t.close()

@coroutine
def stdout():
    """Prints each line it receives to stdout.  Nice for setting up unix-like
    pipelines."""

    try:
        while True:

            print((yield), file=sys.stdout, end='')

    except GeneratorExit as e:
        return

@coroutine
def stderr():
    """Prints each line it receives to stderr.  Nice for setting up unix-like
    pipelines."""

    try:
        while True:

            print((yield), file=sys.stderr, end='')

    except GeneratorExit as e:
        return

A nice sample usage of this (assuming you saved the code above as coroutines.py) is the following:


>>> from coroutines import append, push_list, stdout, sum
>>> push_list(range(100), sum(append('\n', stdout())))
4950
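The interesting thing in that pipeline is that the fold stays silent until the data ends: nothing reaches its target until someone calls close() on it. Here's that behaviour in isolation, as a self-contained sketch. The fold and collect coroutines are simplified stand-ins I wrote for this demo (one target instead of many, and an explicit flag instead of using None as the "nothing yet" marker), so treat them as an approximation of foldl above rather than a copy.

```python
import functools

def coroutine(fun):
    """Advance a new coroutine to its first yield so it can accept send()."""
    @functools.wraps(fun)
    def wrapped(*args, **kwargs):
        gen = fun(*args, **kwargs)
        next(gen)
        return gen
    return wrapped

@coroutine
def collect(into):
    """Sink: append everything received to a list."""
    while True:
        into.append((yield))

@coroutine
def fold(fun, target):
    """Accumulate everything sent; emit the result only when closed."""
    acc = None
    seen_any = False
    try:
        while True:
            item = (yield)
            acc = fun(acc, item) if seen_any else item
            seen_any = True
    except GeneratorExit:
        # close() was called on us: flush the result, then pass the
        # close along so the stages below can finish up too
        if seen_any:
            target.send(acc)
        target.close()

results = []
adder = fold(lambda x, y: x + y, collect(results))

for n in range(100):
    adder.send(n)
print(results)  # [] (nothing has come out yet)

adder.close()
print(results)  # [4950]
```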

So there you have it. My night of fun with coroutines. Hope you enjoyed it!

Thursday, April 2, 2009

Python coroutines