Saturday 18 August 2012

SWIG and binary strings

Introduction

This post explains a few methods for passing binary strings to a C function and getting a binary string back to Python (via arguments or a struct).

The test function we are wrapping reverses strings. This can be done using two arrays (with ptr pointing to the end of the output string):
//copy the input into the output, back to front
for (i=0; i<size_in; i++)
{
    *(ptr-i) = *(s_in+i);
}
Or, we can reverse the string in place. In this case, ptr points to the end of the input string, and we swap *(ptr-i) and *(s_in+i) through a temporary variable. Because of this, we only iterate over the first half of the input array:
for (i=0; i<size_in/2; i++)
{
    //swap the i-th element from each end
    temp = *(ptr-i);
    *(ptr-i) = *(s_in+i);
    *(s_in+i) = temp;
}
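
For reference, here is how the first loop might sit in a complete function, with the output buffer allocated inside (a sketch only; the actual rev.c in the repository may differ):

#include <stdlib.h>

void reverse(unsigned char *s_in, size_t size_in,
             unsigned char **s_out, size_t *size_out)
{
    size_t i;
    unsigned char *ptr;

    //allocate the output buffer (freed later by the wrapper, see the cstring.i section)
    *s_out = (unsigned char *)malloc(size_in);
    if (*s_out == NULL)
    {
        *size_out = 0;
        return;
    }
    *size_out = size_in;

    //ptr points to the last byte of the output buffer
    ptr = *s_out + (size_in - 1);

    for (i=0; i<size_in; i++)
    {
        *(ptr-i) = *(s_in+i);
    }
}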
Complete code: rev.c and rev.h. setup.py compiles the string-based Python wrapper (using the rev.i SWIG interface file), whereas setup_num.py compiles a Numpy-based wrapper (both the in/out and in-place versions, using revnum.i). The test.py script exercises the wrappers.

Passing built-in strings

Passing binary strings from Python to C

SWIG has a built-in multi-argument typemap for that:
%apply (char *STRING, size_t LENGTH) { (unsigned char *s_in, size_t size_in) };
Notice the char *STRING, size_t LENGTH on one side and unsigned char *s_in, size_t size_in on the other: there is no unsigned char *STRING typemap, so the plain char version is applied to the unsigned arguments.
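
For reference, a minimal interface file built around this typemap might look as follows (a sketch; the actual rev.i is in the repository):

%module rev

%{
#include "rev.h"
%}

%apply (char *STRING, size_t LENGTH) { (unsigned char *s_in, size_t size_in) };

%include "rev.h"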

Passing strings from C to Python via a structure

A structure is created which holds both an unsigned char *data pointer and an int size.
See this link for a detailed example. Briefly, the method relies on a structure typedef as in:
typedef struct binary_data {
    int size;
    unsigned char* data;
} binary_data;
and in the interface file:
%typemap(out) binary_data {
    $result = PyString_FromStringAndSize((char *)$1.data, $1.size);
}
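
For illustration, a C function returning such a structure could look like this (get_data is a hypothetical name, not part of the linked example):

binary_data get_data(void)
{
    //static storage: the typemap copies the bytes into a new Python string
    static unsigned char payload[] = {0x00, 0xff, 0x42};
    binary_data out;

    out.size = sizeof(payload);
    out.data = payload;
    return out;
}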

Passing strings from C to Python via arguments

This method relies on the cstring.i library, pulled in with %include "cstring.i" in the interface file:
%cstring_output_allocate_size(unsigned char **s_out, size_t *size_out, free(*$1)); 
None of these methods is Python-specific, so they should also work when exchanging strings with other target languages.

Using Numpy.i

This one, on the other hand, is Python-specific. Wrapper code generated by numpy.i only copies arrays when needed, so data exchange between C and Numpy arrays should be at least as fast as using Python strings, and often faster. We'll check that later. I've explained elsewhere (here and here) how the numpy.i interface works, so let's get right to the point. I'll use:
  • Input:
    unsigned char* IN_ARRAY1, int DIM1
  • Output:
    unsigned char** ARGOUTVIEWM_ARRAY1, int* DIM1
  • Inplace:
    unsigned char* INPLACE_ARRAY1, size_t DIM1
This is the SWIG interface for passing data in/out as arguments:
%apply (unsigned char* IN_ARRAY1, int DIM1) {(unsigned char *s_in, size_t size_in)}

%apply (unsigned char** ARGOUTVIEWM_ARRAY1, int* DIM1) {(unsigned char **s_out, size_t *size_out)}

void reverse(unsigned char *s_in, size_t size_in, unsigned char **s_out, size_t *size_out);

And this one (using inline code) is for reversing arrays in place.
%apply (unsigned char* INPLACE_ARRAY1, size_t DIM1) {(unsigned char *s_in, size_t size_in)}

%inline %{

void inplace(unsigned char *s_in, size_t size_in)
{
    size_t i;
    unsigned char temp, *ptr = NULL;
    ptr = s_in + (size_in - 1);

    #pragma omp parallel for \
        default(shared) private(i,temp)

    for (i=0 ; i<size_in/2; i++)
    {
        temp = *(ptr-i);
        *(ptr-i) = *(s_in+i);
        *(s_in+i) = temp;
    }
}

%}

The code uses OpenMP (hence the pragma), but this isn't strictly necessary.

Putting it all together

Again, all the code is located in my ezwidget repository. A small Python script (test.py) reverses a 5 MB string 1001 times (an odd count, so that the in-place version ends up with the string actually reversed).

Results follow:
string version: 1001 times took 5.02 seconds
olleHolleH
numpy version: 1001 times took 2.72 seconds
olleHolleH
inplace version: 1001 times took 1.40 seconds
olleHolleH
As expected, the Numpy version is faster than passing strings to C functions. This may not hold for non-contiguous arrays, or when type conversion or conversion to and from strings is involved.

The in-place version really flies, as it bypasses creating temporary arrays and reuses the memory allocated for the input array instead. Again, this may not be the case for non-contiguous arrays.

Strings can be converted to Numpy arrays and back using:
  • arr = numpy.fromstring(s,numpy.uint8)
  • s = arr.tostring()

That's it for today! Improvements and comments very welcome.

Monday 4 June 2012

Practical example of optimising an algorithm written in C using OpenMP

Introduction

Here we take a look at the optimisation of a cool image processing algorithm, John Gauch's Weighted Region Ranking, using OpenMP. Basically, wrr is an adaptive contrast enhancement algorithm which maximises the contrast locally, for each pixel, relative to its neighbours, up to a defined extent.

John Gauch puts it better than I could:
There are a number of non-linear methods for enhancing the details in images. One approach I have investigated is a variation of ahe called weighted region ranking. Here a circular neighborhood about each point is searched to count the number of points lighter and darker than the central point. This gives us the rank of each pixel which we display as an output image. The size of the search region and the amount of histogram smoothing prior to ranking control the amount of enhancement performed.
I used SWIG and numpy.i to wrap the code for Python. The single-threaded and multi-threaded code were tested using the same simple testing script.

The 512x512 16 bit image I chose for testing is a very nice picture of the M51 galaxy in the astronomical FITS format, as downloaded from this SAO Telescope Data Center page (original source: Kitt Peak National Observatory).

FITS data

M51, histogram equalized image
My purpose was to get a nice 16 bit image to work with, not to delve into the FITS format. In any case, PyFITS was most useful, and I've included a little Python script to convert FITS to 16 bit compressed TIFF (using pylibtiff) and also save an 8 bit equalized image in both PNG and JPEG.

Reformatting wrr.c for wrapping with SWIG/numpy.i

I didn't change much in the original wrr.c file, just the function prototype and one helper function to create the table of line pointers. This is what I really like about working with SWIG: The Python logic (contained in wrr.i) is kept separate from the actual code (wrr.c).

Modified wrr function prototype:

void wrr(int *ImageIn, int Ydim, int Xdim, int nbits, float BlurStdDev, float WeightStdDev, int Radius, int Step, int **ImageOut, int *YdimOut, int *XdimOut);
  • int *ImageIn, int Ydim, int Xdim will be the 32 bit numpy array data coming from Python
  • nbits (subject to change) is the number of bits of useful information in the 32 bit output array (so up to 31 as I used signed integer arrays)
  • BlurStdDev, WeightStdDev, Radius as defined on the wrr page
  • Step indicates the step size over the region of interest for calculating the pixel rank. For a 40x40 neighbourhood for example, a step size of 4 means only 100 equidistant pixels are considered instead of the 40x40 = 1600 pixels. This may introduce some artifacts, so use with caution.
  • int **ImageOut, int *YdimOut, int *XdimOut is wrapped by SWIG into a memory-managed numpy array (see the sketch after this list). The memory allocated for the array will be automatically de-allocated once the numpy array is destroyed, as outlined here.
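In wrr.i, this relies on numpy.i's 2D array typemaps. The mappings would look something like this (a sketch; the actual interface file is in the repository and may differ):

%apply (int* IN_ARRAY2, int DIM1, int DIM2) {(int *ImageIn, int Ydim, int Xdim)}

%apply (int** ARGOUTVIEWM_ARRAY2, int* DIM1, int* DIM2) {(int **ImageOut, int *YdimOut, int *XdimOut)}

void wrr(int *ImageIn, int Ydim, int Xdim, int nbits, float BlurStdDev, float WeightStdDev, int Radius, int Step, int **ImageOut, int *YdimOut, int *XdimOut);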
Just as a historic side note, when I first experimented with the algorithm back in 1995, the machine I needed to run the algorithm on was a 486 DX2-66. I used a stripped down version of the Region Ranking algorithm which I compiled in DPMI mode using DJGPP. Even with the stepping trick, processing a 640x640 digital Chest X-Ray image still took 15 seconds.

Line pointer table creation:

Creating these tables greatly improves the readability of C code. Instead of array[y*width+x], one can write line_array[y][x]. The following code is used to create the line_array table (myalloc.c):

#include <stdlib.h>
#include <errno.h>

void **make_strides(void *Image, int Ydim, int Xdim, size_t size)
{
    char *ptr;
    void **out = NULL;
    size_t stride;
    int i;

    //Image may be NULL if an upstream allocation failed
    if (Image == NULL)
    {
        errno = ENOMEM;
        goto end;
    }

    //assumes pointers are all the same size.
    out = (void **)malloc(Ydim*sizeof(void *));
    if (out == NULL)
    {
        errno = ENOMEM;
        goto end;
    }

    //stride is the image line size in bytes
    stride = size * Xdim;

    //we're using a char pointer for 1 byte increments.
    ptr = (char *)Image;

    for (i=0; i<Ydim; i++)
    {
        out[i] = (void *)ptr;
        //now we can increment ptr in bytes using stride
        ptr += stride;
    }

end:
    return out;
}

Hopefully the code is simple enough to understand. If I missed something, don't hesitate to point it out!

A typical call to make_strides would therefore look like:

    /* Allocate weight array */
    weight_size = (2 * Radius + 1);
    ImageWeight = (float *)malloc(weight_size * weight_size * sizeof(float));
    Weight = (float **)make_strides(ImageWeight, weight_size, weight_size, sizeof(float));
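
When the weight array is no longer needed, both allocations have to be released separately, since make_strides only allocates the table of line pointers, not the data it points into:

    free(Weight);
    free(ImageWeight);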

OK, time to check what wrr does on our M51 image.

Single thread version

The BlurStdDev value comes in handy for removing the background noise. You will want it pretty high, although there's a definite computational cost associated with it.

Here we check increasing values of Radius (5,10,20,40).
Other parameters: BlurStdDev = 100.0, WeightStdDev = 20.0, Step = 1


[Output images for Radius = 5, 10, 20 and 40]
Processing time increases with Radius. A simple test conducted on an Intel Core2 Quad Q8300 @ 2.50GHz, gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5), Ubuntu 12.04 LTS, 64 bit, produces the following timings (YMMV):
  • Radius 5: took 0.23s
  • Radius 10: took 0.83s
  • Radius 20: took 3.10s
  • Radius 40: took 11.96s

Multi-threaded version (using OpenMP)

From the OpenMP webpage:
The OpenMP API supports multi-platform shared-memory parallel programming in C/C++ and Fortran. The OpenMP API defines a portable, scalable model with a simple and flexible interface for developing parallel applications on platforms from the desktop to the supercomputer. 
I'm by no means an expert on OpenMP, so take what I say with a fistful of salt. This is what I've had to do to integrate pragma-based OpenMP statements in the wrr function:
  • I added an #include <omp.h> statement to wrr.c
  • I added "-fopenmp" to the compile_args list, and "-lgomp" to the extra_link_args list in setup.py
  • We have four nested loops in wrr (y, x, dy, dx). I found that multi-threading the outer loop gave the fastest code. As far as I understand, each line in the output image is then produced by a different thread.
To expand on the last point, this is the OpenMP statement I used:

    if (BlurStdDev <= 1)
    {
        #pragma omp parallel for        \
            default(shared) private(x,dy,dx,Area,Below,Center,Temp)

        for (y = 0; y < Ydim; y++)
            for (x = 0; x < Xdim; x++)

The most difficult part (at least at the very basic level I'm using OpenMP at) is to identify which variables are shared between threads and which are private to each thread. In this case, the inner loop indices and the temporary variables are private; everything else (the data and the line pointer tables) is shared. The outer loop index y takes care of itself: OpenMP automatically makes the iteration variable of a parallel for loop private.

In the second test, we need to add Blur as one of the private variables:

    /* Handle slower case with histogram blurring */
    else
    {
        #pragma omp parallel for        \
            default(shared) private(x,dy,dx,Area,Below,Center,Temp, Blur)

        for (y = 0; y < Ydim; y++)
            for (x = 0; x < Xdim; x++)
            {

... and that's it! Time to check the multi-threaded versions:
  • Radius 5: took 0.09s
  • Radius 10: took 0.31s
  • Radius 20: took 1.17s
  • Radius 40: took 4.59s
Speed-up achieved for Radius=40: The multi-threaded version (on 4 cores) is 2.6x faster than the single threaded one. There's probably room for improvement, but it's a start!

EDIT

To get a better idea of the code speed-up (wall time) and thread creation impact (CPU clock time), I wrote a small test case in C (use the Makefile).

Depending on what's running in the background, output will look something like this:

./test_wrr -n 5


Testing the single-threaded version:
  iteration 1/5, 0.794s
  iteration 2/5, 0.792s
  iteration 3/5, 0.791s
  iteration 4/5, 0.791s
  iteration 5/5, 0.791s
  mean for 5 repeats: 0.792s


Testing the multi-threaded version:
  iteration 1/5, 0.242s
  iteration 2/5, 0.238s
  iteration 3/5, 0.243s
  iteration 4/5, 0.244s
  iteration 5/5, 0.240s
  mean for 5 repeats: 0.241s


Multi-threaded version speed-up: 3.28x
Thread creation impact: 11.65%

As always, code is on the EZwidgets SVN.


Do comment if you see any ways of improving the code. Thanks!

Sunday 6 May 2012

First look at Riverbank SIP (Python bindings for C and C++ libraries)

This is where I take a step back and start looking at improving my rapid-prototyping pipeline of Python, numpy and C. I'm looking into replacing SWIG with SIP.

I've posted two tutorials about generating Python bindings using SWIG and numpy.i; these describe a lot of the programming I do.
Using SWIG, I can write code which is completely free from any Python/C constructs and easily generate usable bindings. The second tutorial shows automatic memory deallocation when the Python arrays are deleted. This makes my code truly "pythonic".

On the Python side of things, a wealth of libraries (Matplotlib, PIL, VTK, etc.) allows me to load various datasets (from databases, various microscopy image formats, synthetic data...), test the code and display useful data. Also, if needed, I can use my code in a native application.

Some simple GCC optimisations have really made my code fly:
  • native CPU optimisation enabled by -march=native
  • auto-vectorisation, enabled at -O3 (and detailed by -ftree-vectorizer-verbose=2)
  • OpenMP multithreading enabled by -fopenmp requires a bit more work but is well worth the trouble (more about this here).
So back to bindings, and let me reiterate that both SWIG and numpy.i have been good and loyal friends over the years. I have never had to worry about my bindings until some recent work which is mostly Python but repeatedly calls a few scattered functions written in C. In this case, it's worth trying to optimise the bindings as well as the code, so it's time to look at some alternatives to SWIG. I found some anecdotal evidence that I should be looking into SIP as a replacement for SWIG.

My first baby step was to adapt SWIG's example.fact() to SIP. Result code is here.

EDIT: As a second baby step, I wrote a C++ example. Result code is here.

Next step will be to convert numpy.i to SIP. Maybe start with IN_ARRAY1, try to stress the binding code and compare to SWIG.

Stay tuned! And of course, if you want to share your own experience with SWIG / SIP / Cython, I'd love to hear about it.