Saturday 18 August 2012

SWIG and binary strings

Introduction

This post explains a few methods for passing binary strings to a C function and getting a binary string back to Python (via arguments or a struct).

The test function we are wrapping reverses strings. This can be done using two arrays (with a ptr pointing to the end of the output string):
for (i=0; i<size_in; i++)
{
    *(ptr-i) = *(s_in+i);
}
Or, we can reverse the string in place. In this case, ptr points to the end of the input string, and we swap *(ptr-i) and *(s_in+i) using a temporary variable. Because each swap puts two bytes in their final position, we only iterate over half the input array:
for (i=0; i<size_in/2; i++)
{
    temp = *(ptr-i);
    *(ptr-i) = *(s_in+i);
    *(s_in+i) = temp;
}
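To check the index arithmetic without compiling anything, both loops can be mirrored in pure Python. This is only a sketch: the names reverse_copy and reverse_inplace are made up for illustration and are not part of rev.c, and bytes/bytearray objects stand in for the C buffers.

```python
def reverse_copy(s_in: bytes) -> bytes:
    """Two-array version: write each input byte to the mirrored output slot."""
    size_in = len(s_in)
    s_out = bytearray(size_in)
    last = size_in - 1            # plays the role of ptr (end of the output)
    for i in range(size_in):
        s_out[last - i] = s_in[i]
    return bytes(s_out)


def reverse_inplace(buf: bytearray) -> None:
    """In-place version: swap mirrored pairs, so half the length is enough."""
    size_in = len(buf)
    last = size_in - 1            # plays the role of ptr (end of the input)
    for i in range(size_in // 2):
        temp = buf[last - i]
        buf[last - i] = buf[i]
        buf[i] = temp


data = b"Hello\x00World"          # embedded NUL byte: this is binary data
copy_result = reverse_copy(data)

work = bytearray(data)
reverse_inplace(work)
```

For an odd length the middle byte is never touched by the in-place loop, which is fine because it is already where it belongs.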
The complete code is in rev.c and rev.h. setup.py compiles the string-based Python wrapper (using the rev.i SWIG interface file), whereas setup_num.py compiles a Numpy-based wrapper (both the in/out and in-place versions, using revnum.i). The test.py file exercises the wrappers.

Passing built-in strings

Passing binary strings from Python to C

SWIG has a built-in typemap for that:
%apply (char *STRING, size_t LENGTH) {
    ( unsigned char *s_in, size_t size_in) }
Note that the typemap is named (char *STRING, size_t LENGTH) even though it is applied to (unsigned char *s_in, size_t size_in): there is no unsigned char *STRING typemap.
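The reason a (pointer, length) pair is needed at all is that binary data may contain NUL bytes, which would end an ordinary C string early. The following sketch shows the effect using ctypes from the standard library; it is only an illustration of the problem and has nothing to do with the code SWIG generates.

```python
import ctypes

payload = b"ab\x00cd"                         # 5 bytes, NUL in the middle

# create_string_buffer copies the bytes and appends a terminating NUL.
buf = ctypes.create_string_buffer(payload)

# Treated as a NUL-terminated C string, the data stops at the first NUL.
as_c_string = buf.value

# With an explicit length, all 5 bytes are read back.
with_length = ctypes.string_at(buf, len(payload))
```

Here as_c_string only holds the first two bytes, while with_length holds the full payload.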

Passing strings from C to Python via a structure

A structure is created that holds both a pointer to the data (unsigned char *data) and its length (int size).
See this link for a detailed example. Briefly, the method relies on a structure typedef such as:
typedef struct binary_data {
    int size;
    unsigned char* data;
} binary_data;
and in the interface file:
%typemap(out) binary_data {
    /* Python 2 API; under Python 3 use PyBytes_FromStringAndSize instead. */
    $result = PyString_FromStringAndSize((const char *)$1.data, $1.size);
}
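What the typemap does is copy exactly size bytes starting at data into a new Python string. The same idea can be sketched with ctypes; the BinaryData class and the to_bytes helper below are hypothetical mirrors of the typedef written for this example, not something SWIG produces.

```python
import ctypes


class BinaryData(ctypes.Structure):
    # Hypothetical ctypes mirror of the binary_data typedef above.
    _fields_ = [
        ("size", ctypes.c_int),
        ("data", ctypes.POINTER(ctypes.c_ubyte)),
    ]


def to_bytes(bd: BinaryData) -> bytes:
    # Same idea as the typemap: copy exactly `size` bytes starting at `data`.
    return ctypes.string_at(bd.data, bd.size)


payload = b"\x01\x00\x02\xff"

# Backing storage for the struct; it must outlive `bd`.
storage = (ctypes.c_ubyte * len(payload)).from_buffer_copy(payload)
bd = BinaryData(len(payload),
                ctypes.cast(storage, ctypes.POINTER(ctypes.c_ubyte)))

result = to_bytes(bd)
```

Because the length travels with the pointer, the NUL byte in the middle of the payload survives the round trip.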

Passing strings from C to Python via arguments

This method relies on the cstring.i helper.
%cstring_output_allocate_size(unsigned char **s_out, size_t *size_out, free(*$1)); 
Apart from the %typemap(out) above, which calls the Python C API directly, none of these helpers is Python specific, and they should also work when exchanging strings with other languages.

Using Numpy.i

This one, on the other hand, is Python specific. Wrapper code generated by numpy.i copies arrays only when needed, so data exchange between C and Numpy arrays should be at least as fast as using Python strings. We'll check that later. I've explained elsewhere (here and here) how the numpy.i interface works, so let's get right to the point. I'll use:
  • Input:
    unsigned char* IN_ARRAY1, int DIM1
  • Output:
    unsigned char** ARGOUTVIEWM_ARRAY1, int* DIM1
  • Inplace:
    unsigned char* INPLACE_ARRAY1, size_t DIM1
This is the SWIG interface for passing data in/out as arguments:
%apply (unsigned char* IN_ARRAY1, int DIM1) {(unsigned char *s_in, size_t size_in)}

%apply (unsigned char** ARGOUTVIEWM_ARRAY1, int* DIM1) {(unsigned char **s_out, size_t *size_out)}

void reverse(unsigned char *s_in, size_t size_in, unsigned char **s_out, size_t *size_out);

And this one (using inline code) is for reversing arrays in place.
%apply (unsigned char* INPLACE_ARRAY1, size_t DIM1) {(unsigned char *s_in, size_t size_in)}

%inline %{

void inplace(unsigned char *s_in, size_t size_in)
{
    size_t i;
    unsigned char temp, *ptr = NULL;
    ptr = s_in + (size_in - 1);

    #pragma omp parallel for \
        default(shared) private(i,temp)

    for (i=0 ; i<size_in/2; i++)
    {
        temp = *(ptr-i);
        *(ptr-i) = *(s_in+i);
        *(s_in+i) = temp;
    }
}

%}

The code uses OpenMP (hence the pragma statement), but this isn't strictly necessary: when OpenMP is not enabled at compile time, the pragma is ignored and the loop runs serially.
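The in-place typemap lets the C function work directly on the array's own memory instead of building a new result. A loose standard-library analogy (not numpy.i itself) is the difference between a memoryview, which shares the underlying buffer, and a bytes copy, which does not.

```python
original = bytearray(b"abcdef")

view = memoryview(original)      # shares memory with `original`, no copy
copy = bytes(original)           # independent snapshot of the current content

# Write the reversed bytes through the view: `original` changes too ...
view[:] = original[::-1]

# ... while the snapshot taken earlier still holds the old content.
```

After this runs, original holds the reversed bytes and copy still holds the initial ones.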

Putting it all together

All the code is located over on my ezwidget repository. A small Python script (test.py) reverses a 5 MB string 1001 times. The count is odd so that the in-place version ends up with a reversed string rather than the original one.

Result follows:
string version: 1001 times took 5.02 seconds
olleHolleH
numpy version: 1001 times took 2.72 seconds
olleHolleH
inplace version: 1001 times took 1.40 seconds
olleHolleH
As expected, the Numpy version is faster than passing strings to C functions. This may not be the case for non-contiguous arrays, or when a type conversion or a conversion to and from strings is involved.

The in-place version really flies as it bypasses creating temporary arrays and uses the memory allocated for the input array instead. Again, when using non-contiguous arrays, this may not be the case.
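The odd iteration count matters for the in-place variant: reversing twice restores the original, so only an odd number of passes leaves the buffer reversed. Below is a minimal sketch of such a timing loop. It is not the actual test.py; the run helper is hypothetical and bytearray.reverse() stands in for the wrapped C function.

```python
import time


def reverse_inplace(buf: bytearray) -> None:
    buf.reverse()                # stand-in for the wrapped C function


def run(n_iter: int, payload: bytes) -> bytes:
    """Reverse a private copy of `payload` in place n_iter times."""
    buf = bytearray(payload)
    for _ in range(n_iter):
        reverse_inplace(buf)
    return bytes(buf)


payload = b"Hello" * 2

start = time.perf_counter()
odd_result = run(1001, payload)  # odd count: ends up reversed
elapsed = time.perf_counter() - start

even_result = run(1000, payload) # even count: back to the original

print("inplace version: 1001 times took %.2f seconds" % elapsed)
print(odd_result.decode())
```

The timings printed by this sketch say nothing about the C wrappers; only the odd/even behaviour carries over.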

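Strings can be converted to Numpy arrays and back using:
  • arr = numpy.fromstring(s,numpy.uint8)
  • s = arr.tostring()
Recent NumPy releases have deprecated (and in some cases removed) these two calls for binary data; numpy.frombuffer(s, numpy.uint8) and arr.tobytes() are the replacements. Note that frombuffer returns a read-only view when s is an immutable string, so take a .copy() before handing the array to the in-place wrapper.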

That's it for today! Improvements and comments very welcome.