Wednesday 19 February 2014

A manual acquisition script for Micro-Manager

Edits


  • 2014-12-24: Code clean-up. Tested with upcoming MM 1.4.20. Contrast limits (autostretch / fixed / full scale) are now properly saved and displayed in ImageJ / Fiji. Channel group now read from the Multi-D Acquisition panel, no need to mess with the source code any more.
  • 2014-12-05: I have edited the Beanshell script to make it work in Micro-Manager 1.4.19

Introduction

This Beanshell script for Micro-Manager implements multi-channel acquisition for manual microscopes. It is based on manualAcq.bsh by Nenad Amodaj (Feb 2008) and Nico Stuurman (April 2008).

Script dialog showing channel, exposure time, image directory and image prefix

The script dialog lets the user know which filter to switch to next (including brightfield if needed) and the resulting images are saved as multi-tiff stacks. Most of the settings (channel names and colours, acquisition times, root directory) can now be changed through the Multi-D acquisition panel instead of being hard-coded as in the original script.

Setup

Step 1: Download from GitHub

For now, the Beanshell script is available from this link.

Step 2: Hardware configuration

The script relies on the DemoHub / DemoWheel for the filter names. So, in addition to your camera (here I have a Ximea greyscale camera), you will need to add the DemoCamera DHub and select DWheel.


Then on step 5 of the hardware configuration wizard, change the channel names to the filters available on your microscope. Add a brightfield channel if available. Here for instance, I have renamed states 0,1,2,3 to Brightfield, DAPI, FITC, TRITC.


Save your configuration and you're done with the wizard.

Step 3: Channel group

Create a group called "Channel" which uses "DWheel-Label". The exact name of the group isn't important:


This is in addition to the groups necessary to control your camera (e.g. bit depth, gain, etc...). Save the config when you're done.

At this stage, you can edit the config file manually, and comment out any unused DWheel states:

# Preset: State-9
#ConfigGroup,Channel,State-9,DWheel,Label,State-9

# Preset: State-8
#ConfigGroup,Channel,State-8,DWheel,Label,State-8

If you do, don't forget to restart Micro-Manager when you are done.

Step 4: Multi-D acquisition setup

Micro-Manager's Multi-D acquisition setup dialog now provides an easy way to interact with the manual acquisition script. You will need to select "Channel" as the Channel group (or whatever name you gave it) and add the Channels in the order you want to acquire them.

Here for example, the order in which the channels are acquired is Brightfield, DAPI, FITC, TRITC. With this config, I will be doing my hunting and focussing using Brightfield, then select the filters as instructed by the script.

As a side note, the script recognises which channels are in use, as well as the exposure times (although these can also be changed from the script if needed). The "Use?" column, as well as the buttons "Up" and "Down" provide an easy way to quickly modify an existing configuration. These configurations can also be saved.

A working channel group with defined channels

An important note: you need to "Close" the Multi-D acquisition dialog to make the new settings available within the rest of Micro-Manager.

Procedure

For now, launch the script through the script panel. Images will be saved in the "directory root" + "prefix" directory. The directory root is the one defined in the Multi-D acquisition panel. The name prefix, however, cannot yet be read from the Multi-D acquisition panel. By default, the prefix is "image".

Note 1: If the combined directory name already exists, the prefix will be incremented until an empty slot is found. This stem is then used to prefix each of the created files within that directory.
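
As an illustration only, here is a minimal Python sketch of how such an increment scheme could work (the function name and exact rule are mine, not the script's, which is written in Beanshell):

import os

def next_free_prefix(root, prefix="image"):
    # try "image", then "image1", "image2", ... until no such directory exists
    name, n = prefix, 1
    while os.path.exists(os.path.join(root, name)):
        name = prefix + str(n)
        n += 1
    return name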

Note 2: The prefix can be changed at any time and a valid prefix will be created. Each time the prefix is changed, the image counter is reset.

Note 3: The exposure time can be changed and the current channel will be held until the right exposure is found. Press OK again (without having changed the exposure time) to cycle to the next channel.



Cycling through the channels.

Error handling

The following errors are handled appropriately:
  • No channel available or no channel in use in Multi-D acquisition: The user is prompted to add channels.
  • A directory of that name / prefix already exists: In this case, a new prefix is calculated.
  • The output directory cannot be created or a file of that name exists but is not a directory: The user is prompted to change either the image directory or image prefix. Only the last channel needs re-acquiring (when you press OK).

Converting the stacks to RGB images

If you need to convert these stacks to RGB images, you can do it in ImageJ. For each of the channels, adjust the brightness / contrast (Image -> Adjust -> Brightness/Contrast -> Auto). Then Image -> Color -> Make Composite, and save the image in an appropriate format.

Sample image

Three channels (b,g,r).
The raw image is available from here: image45_0002_MMStack.ome.tif

Bugs / future improvements

  • For versions of Micro-Manager prior to 1.4.17, uncomment the line gui.initializeAcquisition(acqName, ...
  • The initial prefix is hardcoded to read "image". I do not know how to read the prefix from the Multi-D Acquisition Panel.
  • The album which shows the composite image should really be cleared when a new image is being acquired.
  • An easy way to launch the script is needed, i.e. not from the script panel but from its own toolbar or from the ImageJ toolbar. Working on that.
  • The min/max values for each channel need a bit of work. At the moment, CHANNEL_MINS is set to 0 and CHANNEL_MAXES to the maximum bit depth.

Additional Notes

Closing a window without prompting for save

This is something that took me a bit of time to work out. If you have the title of the window (here we use acqName), do this:

// get a handle on the acquisition and disable the save prompt
mmAcq = gui.getAcquisition(acqName);
mmAcq.promptToSave(false);
// the window now closes without asking whether to save
gui.closeAcquisitionWindow(acqName);

You can also check that the window actually exists before trying to close it:

if (gui.acquisitionExists(acqName)) { ... }

Blanking a (tagged) image

I'm halfway there with this one. From a tagged image, you can create an ImageJ ImageProcessor, then use its set() method to blank the image with set(0).

ImageProcessor ip = ImageUtils.makeProcessor(img);
ip.set(0);

... however, I haven't yet found how to get tagged images from an acquisition window. Hence, in my program, I store the tagged images in an ArrayList as I snap them.

Tuesday 18 June 2013

New numpy.i typemaps for working with lists of 2-D and 3-D arrays.

Introduction

In pull request #3451 I mentioned two new typemaps for working with lists of 2-D arrays (and 3-D arrays), either as input arguments or as in-place arguments.

DATA_TYPE** IN_ARRAY3, DIM_TYPE DIM1, DIM_TYPE DIM2, DIM_TYPE DIM3
DATA_TYPE** INPLACE_ARRAY3, DIM_TYPE DIM1, DIM_TYPE DIM2, DIM_TYPE DIM3
DATA_TYPE** IN_ARRAY4, DIM_TYPE DIM1, DIM_TYPE DIM2, DIM_TYPE DIM3,
  DIM_TYPE DIM4
DATA_TYPE** INPLACE_ARRAY4, DIM_TYPE DIM1, DIM_TYPE DIM2, DIM_TYPE DIM3,
  DIM_TYPE DIM4

When working with lists of arrays, using these new typemaps instead of the existing ones avoids first copying the data into a single contiguous block of memory.

The following, for example, is very costly (both in memory allocation and in the time the operation takes):

a = numpy.ones((1024,1024),numpy.uint8)
la = [a]*100
b = numpy.mean(numpy.array(la,float),axis=0).astype(numpy.uint8) 
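
For scale: the temporary float64 array materialised by numpy.array(la, float) alone weighs in at 100 x 1024 x 1024 x 8 bytes, i.e. roughly 800 MB, before numpy.mean even runs:

import numpy

a = numpy.ones((1024, 1024), numpy.uint8)
la = [a] * 100

# the intermediate float64 block created before the mean is taken
assert numpy.array(la, float).nbytes == 100 * 1024 * 1024 * 8   # ~800 MB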

Unfortunately, wrapping a C function with the usual
DATA_TYPE* IN_ARRAY3, DIM_TYPE DIM1, DIM_TYPE DIM2, DIM_TYPE DIM3
will not speed things up much. A contiguous memory block of size DIM1*DIM2*DIM3 must still be allocated.

For this particular operation however, contiguity between slices is not necessary. If the slices themselves are contiguous, all that is needed is a list of pointers to the individual slices and the slice dimensions.

Limitations

For both the INPUT and INPLACE datatypes, the arrays in the list must all have the same type and size. For in-place operation, the slices must also be contiguous.
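
A rough Python equivalent of the checks performed on the list (my own illustration; the real checks live in the generated C wrapper code):

def check_stack(la, inplace=False):
    ref = la[0]
    for a in la:
        if a.dtype != ref.dtype or a.shape != ref.shape:
            raise ValueError("all arrays must have the same type and size")
        if inplace and not a.flags['C_CONTIGUOUS']:
            raise ValueError("in-place operation requires contiguous slices")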

What happens in the background

IN_ARRAY3 / IN_ARRAY4 typemaps

When a list of 2-D or 3-D arrays (slices / cubes) is sent as an argument to the SWIG wrapped function, the following happens:
  • We check that the Python object is a list.
  • Two arrays are created:
    • object_array is an array of pointers, with a maximum size of n, the number of elements in the list.
    • is_new_object_array records whether a new contiguous object needed creating (a rough Python model of this bookkeeping follows the list).
  • **object_array, dim1 (number of slices), dim2, dim3 (height and width of the slices) are passed to the C function. For lists of 3-D arrays, dim2, dim3, dim4 are the depth, height and width of the cubes.
  • After completion of the wrapped function, the destructor loops through object_array and new_object_array and destroys temporary objects (object_array[i] != NULL && new_object_array[i] == 1). The destructor may also be called if an error occurred (one of the slices being a different size, for example).
  • new_object_array is freed.
  • object_array is freed.
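
To make this bookkeeping concrete, here is a rough Python model of the object_array / is_new_object_array pair (an analogy only; the real code is the C generated by the typemaps):

import numpy

def gather_slices(la):
    object_array, is_new_object_array = [], []
    for obj in la:
        arr = numpy.ascontiguousarray(obj)       # copies only when needed
        object_array.append(arr)
        is_new_object_array.append(arr is not obj)
    # object_array plays the role of the C array of slice pointers;
    # is_new_object_array flags the temporaries to destroy afterwards
    return object_array, is_new_object_array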

INPLACE_ARRAY3 / INPLACE_ARRAY4 typemaps

Almost the same process as before, but without the creation of temporary contiguous arrays (so no new_object_array is needed). If one of the slices / cubes is a different size or isn't contiguous, we have an error state and object_array is freed.

An example, processing video sequences

The reason I implemented these new typemaps was to process large lists of image arrays loaded from video sequences. Potentially, thousands of frames.

OpenCV has all the tools necessary for reading frames from a video file. Here is some code to store frames (as grayscale) into a list:
import sys
import cv2
import mytest

fn = r"ATM.mpg"
vidcap = cv2.VideoCapture(fn)

max_count = 50
count = 0
stack = []

success, image = vidcap.read()
while success:
    # convert to grayscale straight away (see the side note below)
    stack.append(mytest.grayscale(image))
    count += 1
    if count >= max_count:
        break
    success, image = vidcap.read()

if count == 0:
    sys.exit(1)
Just as a side note, frames are immediately converted from RGB to grayscale using the grayscale function wrapped in mytest (see mytest.c).

For testing, I used the ATM sequence available from http://ivylab.kaist.ac.kr/demo/vs/dataset.htm (many thanks to KAIST for making these available!).

IN_ARRAY3 ** example code

Let's get back to our original example. We want to accelerate
b = numpy.mean(numpy.array(la, float),axis=0).astype(numpy.uint8)
for a list of 2-D arrays.
Here's what I wrote (see mytest.c):
void average(unsigned char **ArrayIn, int Zdim, int Ydim, int Xdim, unsigned char **pArrayOut, int *YdimOut, int *XdimOut)
{
    int z,i;

    unsigned char *ArrayOut = NULL;
    float temp;

    //allocating memory for the output image
    if (*pArrayOut == NULL) {
        *pArrayOut = (unsigned char *)calloc(Ydim*Xdim,sizeof(unsigned char));
    }
    ArrayOut = *pArrayOut;

    if (ArrayOut == NULL) {
        errno = ENOMEM;
        goto end;
    }

    #pragma omp parallel for        \
        default(shared) private(z,temp)

    for (i = 0; i < Ydim*Xdim; i++)
    {
        temp = 0;
        for (z = 0; z < Zdim; z++)
        {
            temp += ArrayIn[z][i];
        }
        temp = temp / Zdim;
        if (temp > 255)
            temp = 255;

        ArrayOut[i] = (unsigned char)temp;
    }

end:
    *pArrayOut = ArrayOut;
    *YdimOut = Ydim;
    *XdimOut = Xdim;
}

This, together with the typemap definition in mytest.i, is all the code needed! The new IN_ARRAY3 typemap takes care of generating an **ArrayIn, which is then passed on to the average function (together with Zdim, Ydim, Xdim).

As mentioned previously, the output array is of type ARGOUTVIEWM_ARRAY2, which means the array we created inside the C function will be freed when the Python numpy array is destroyed. These will be available as standard in the next version of numpy.i.
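
Assuming the wrapper collapses the dimension arguments in the usual numpy.i fashion, calling average from Python should then look something like this:

import mytest

# stack is the list of equal-size uint8 frames built earlier;
# the averaged 2-D uint8 array comes back as the return value
b = mytest.average(stack)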

INPLACE_ARRAY3 ** example code

If, for some reason, we wanted to subtract this averaged image from each frame in the stack (as an in-place operation), here is what we would do:
void bsub(unsigned char **ArrayInpl, int Zdim, int Ydim, int Xdim, unsigned char *ArrayIn2, int Ydim2, int Xdim2)
{
    int i,z;

    if (Ydim!=Ydim2 || Xdim!=Xdim2) {
        errno = E2BIG;
        goto end;
    }

    #pragma omp parallel for        \
        default(shared) private(i)

    for (z = 0; z < Zdim; z++)
    {
        for (i = 0; i< Ydim*Xdim; i++)            
            ArrayInpl[z][i] = abs(ArrayInpl[z][i]-ArrayIn2[i]);
    }

end:
    return;
}
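
From Python, and under the same assumptions about argument collapsing, the in-place call might then be:

import mytest

# b is the averaged image computed above; every frame in stack is
# replaced by abs(frame - b), in place
mytest.bsub(stack, b)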

As usual, all the code is available from the EZwidgets googlecode SVN.

Wrapping up

The mytest library can be compiled using the provided setup_mytest.py with the following command:
>> python setup_mytest.py build_ext --inplace

and some tests are performed with test_sequence.py:
>> python test_sequence.py

loaded 50 (h576 x w704) frames
testing mytest average...
  mytest average took 0.01 seconds, sum=51430105

testing numpy.average . . . . .
  numpy.mean took 2.45 seconds, sum=51430105

testing inplace background subtraction...
  bsub took 0.02 seconds (inplace op, done once)

saving images...

Particularly telling is the difference between averaging using the mytest function and using numpy.mean.

The average image generated from the 50 first frames and abs(frame 33 - average) follow.



If you have any comments, please head back to pull request #3451. Thank you.

Saturday 18 August 2012

SWIG and binary strings

Introduction

This post explains a few methods for passing binary strings to a C function and getting a binary string back to Python (via arguments or a struct).

The test function we are wrapping reverses strings. This can be done using two arrays (with a ptr pointing to the end of the output string):
for (i=0; i<size_in; i++)
{
    *(ptr-i) = *(s_in+i);
}
Or we can reverse the string in place: here ptr points to the end of the input string, and we swap *(ptr-i) and *(s_in+i) through a temporary variable. Because of this, we only iterate over half the input array:
for (i=0; i<size_in/2; i++)
{
    temp = *(ptr-i);

    *(ptr-i) = *(s_in+i);
    *(s_in+i) = temp;
}
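
For reference, here are the same two strategies in pure Python (slicing for the copy-based version, a half-length swap loop for the in-place version):

s = bytearray(b"Hello")

# copy-based reversal
out = s[::-1]

# in-place reversal, swapping over half the array
for i in range(len(s) // 2):
    s[i], s[-1 - i] = s[-1 - i], s[i]
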
Complete code: rev.c and rev.h - setup.py compiles the string-based Python wrapper (using the rev.i SWIG interface file), whereas setup_num.py compiles a Numpy-based wrapper (both the in/out and in-place versions, using revnum.i). The test.py file helps test the wrappers.

Passing built-in strings

Passing binary strings from Python to C

SWIG has a helper typemap for that:
%apply (char *STRING, size_t LENGTH) {
    ( unsigned char *s_in, size_t size_in) }
Notice the char *STRING, size_t LENGTH vs unsigned char *s_in, size_t size_in. There is no unsigned char *STRING.

Passing strings from C to Python via a structure

A structure is created, which holds both a char *buffer and a size_t size.
See this link for a detailed example. Briefly, the method relies on a structure typedef as in:
typedef struct binary_data {
    int size;
    unsigned char* data;
} binary_data;
and in the interface file:
%typemap(out) binary_data {
    $result = PyString_FromStringAndSize($1.data,$1.size);
}

Passing strings from C to Python via arguments

This method relies on the cstring.i helper.
%cstring_output_allocate_size(unsigned char **s_out, size_t *size_out, free(*$1)); 
None of these are Python specific and should work when exchanging strings with other languages.
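
Putting the input and output typemaps together, the Python-side call might look like this (assuming the module built from rev.i is simply called rev):

import rev

s_out = rev.reverse("Hello")
# s_out == "olleH"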

Using Numpy.i

This one, on the other hand, is Python specific. Wrapper code generated by numpy.i only copies arrays when needed. Data exchange between C and Numpy arrays should therefore be at least as fast as, or faster than, using Python strings. We'll check that later. I've explained elsewhere (here and here) how the numpy.i interface works, so let's get right to the point. I'll use:
  • Input:
    unsigned char* IN_ARRAY1, int DIM1
  • Output:
    unsigned char** ARGOUTVIEWM_ARRAY1, int* DIM1
  • Inplace:
    unsigned char* INPLACE_ARRAY1, size_t DIM1
This is the SWIG interface for passing data in/out as arguments:
%apply (unsigned char* IN_ARRAY1, int DIM1) {(unsigned char *s_in, size_t size_in)}

%apply (unsigned char** ARGOUTVIEWM_ARRAY1, int* DIM1) {(unsigned char **s_out, size_t *size_out)}

void reverse(unsigned char *s_in, size_t size_in, unsigned char **s_out, size_t *size_out);

And this one (using inline code) is for reversing arrays in place.
%apply (unsigned char* INPLACE_ARRAY1, size_t DIM1) {(unsigned char *s_in, size_t size_in)}

%inline %{

void inplace(unsigned char *s_in, size_t size_in)
{
    size_t i;
    unsigned char temp, *ptr = NULL;
    ptr = s_in + (size_in - 1);

    #pragma omp parallel for \
        default(shared) private(i,temp)

    for (i=0 ; i<size_in/2; i++)
    {
        temp = *(ptr-i);
        *(ptr-i) = *(s_in+i);
        *(s_in+i) = temp;
    }
}

%}

The code uses OpenMP (hence the pragma statement) but this of course isn't strictly necessary.
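
Assuming the numpy-based module is called revnum (built from revnum.i), exercising the in-place version could look like:

import numpy
import revnum

arr = numpy.fromstring("Hello", numpy.uint8)   # string -> uint8 array
revnum.inplace(arr)                            # reversed in place
# arr.tostring() == "olleH"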

Putting it all together

Again, all the code is located over on my ezwidget repository: a small Python script (test.py) reverses a 5MB string 1001 times (an odd count, so that the in-place version actually leaves the string reversed).

Result follows:
string version: 1001 times took 5.02 seconds
olleHolleH
numpy version: 1001 times took 2.72 seconds
olleHolleH
inplace version: 1001 times took 1.40 seconds
olleHolleH
As expected, the Numpy version is faster than passing strings to C functions. This may not be the case for non-contiguous arrays, or when type changes or conversions to and from strings are involved.

The in-place version really flies as it bypasses creating temporary arrays and uses the memory allocated for the input array instead. Again, when using non-contiguous arrays, this may not be the case.

Strings can be converted to Numpy arrays and back using:
  • arr = numpy.fromstring(s,numpy.uint8)
  • s = arr.tostring()

That's it for today! Improvements and comments very welcome.

Monday 4 June 2012

Practical example of optimising an algorithm written in C using OpenMP

Introduction

Here we take a look at optimising a cool image processing algorithm, John Gauch's Weighted Region Ranking (wrr), using OpenMP. Basically, wrr is an adaptive contrast enhancement algorithm which maximises the contrast locally, for each pixel, in relation to its neighbours, up to a defined extent.

John Gauch puts it better than I could:
There are a number of non-linear methods for enhancing the details in images. One approach I have investigated is a variation of ahe called weighted region ranking. Here a circular neighborhood about each point is searched to count the number of points lighter and darker than the central point. This gives us the rank of each pixel which we display as an output image. The size of the search region and the amount of histogram smoothing prior to ranking control the amount of enhancement performed.
I used SWIG and numpy.i to wrap the code for Python. The single-threaded and multi-threaded code were tested using the same simple testing script.

The 512x512 16 bit image I chose for testing is a very nice picture of the M51 galaxy in the astronomical FITS format, as downloaded from this SAO Telescope Data Center page (original source: Kitt Peak National Observatory).

FITS data

M51, histogram equalized image
My purpose was to get a nice 16 bit image to work with, not to delve into the FITS format. In any case, PyFITS was most useful, and I've included a little Python script to convert FITS to 16 bit compressed TIFF (using pylibtiff) and also to save an 8 bit equalized image in both PNG and JPEG.

Reformatting wrr.c for wrapping with SWIG/numpy.i

I didn't change much in the original wrr.c file, just the function prototype and one helper function to create the table of line pointers. This is what I really like about working with SWIG: The Python logic (contained in wrr.i) is kept separate from the actual code (wrr.c).

Modified wrr function prototype:

void wrr(int *ImageIn, int Ydim, int Xdim, int nbits, float BlurStdDev, float WeightStdDev, int Radius, int Step, int **ImageOut, int *YdimOut, int *XdimOut);
  • int *ImageIn, int Ydim, int Xdim will be the 32 bit numpy array data coming from Python
  • nbits (subject to change) is the number of bits of useful information in the 32 bit output array (so up to 31 as I used signed integer arrays)
  • BlurStdDev, WeightStdDev, Radius as defined on the wrr page
  • Step indicates the step size over the region of interest for calculating the pixel rank. For a 40x40 neighbourhood for example, a step size of 4 means only 100 equidistant pixels are considered instead of the 40x40 = 1600 pixels. This may introduce some artifacts, so use with caution.
  • int **ImageOut, int *YdimOut, int *XdimOut is wrapped by SWIG into a memory managed numpy array. The memory allocated for the array will be automatically de-allocated once the numpy array is destroyed, as outlined here.
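
After wrapping, the numpy.i typemaps collapse the dimension arguments, so the Python-side call should reduce to something like this sketch (module name and exact signature assumed from the prototype above):

import numpy
import wrr

img = numpy.zeros((512, 512), numpy.int32)   # placeholder 32 bit image
# arguments: image, nbits, BlurStdDev, WeightStdDev, Radius, Step;
# the output image comes back as the return value
out = wrr.wrr(img, 16, 100.0, 20.0, 40, 1)
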
Just as a historic side note, when I first experimented with the algorithm back in 1995, the machine I needed to run the algorithm on was a 486 DX2-66. I used a stripped down version of the Region Ranking algorithm which I compiled in DPMI mode using DJGPP. Even with the stepping trick, processing a 640x640 digital Chest X-Ray image still took 15 seconds.

Line pointer table creation:

Creating these tables greatly improves the readability of C code. Instead of array[y*width+x], one can write line_array[y][x]. The following code is used to create the line_array table (myalloc.c):

void **make_strides(void *Image, int Ydim, int Xdim, size_t size) 
{
    char *ptr;
    void **out = NULL;
    size_t stride;
    int i;

    if (Image == NULL)
    {
        errno = ENOMEM;
        goto end;
    }

    //assumes pointers are all the same size.
    out = (void **)malloc(Ydim*sizeof(void *));
    if (out == NULL)
    {
        errno = ENOMEM;
        goto end;
    }

    //stride is image line size bytes
    stride = size * Xdim;
    
    //we're using a char pointer for 1 byte increments.
    ptr = (char *)Image;

    for (i=0; i<Ydim; i++)
    {
        out[i] = (void *)ptr;
        //now can increment ptr in bytes using stride
        ptr+=stride;
    }

end:
    return out;
}

Hopefully the code is simple enough to understand. If I missed something, don't hesitate to point it out!

A typical call to make_strides would therefore look like:

    /* Allocate weight array */
    weight_size = (2 * Radius + 1);
    ImageWeight = (float *)malloc(weight_size * weight_size * sizeof(float));
    Weight = (float **)make_strides(ImageWeight, weight_size, weight_size, sizeof(float));

OK, time to check what wrr does on our M51 image.

Single-threaded version

The BlurStdDev value comes in handy for removing the background noise. You will want that pretty high, although there's a definite computational cost associated with it.

Here we check increasing values of Radius (5,10,20,40).
Other parameters: BlurStdDev = 100.0, WeightStdDev = 20.0, Step = 1


Output images for Radius = 5, 10, 20 and 40.
Processing time increases with Radius. A simple test conducted on an Intel Core2 Quad Q8300 @ 2.50GHz, gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5), Ubuntu 12.04 LTS, 64 bit produces the following. YMMV:
  • Radius 5: took 0.23s
  • Radius 10: took 0.83s
  • Radius 20: took 3.10s
  • Radius 40: took 11.96s

Multi-threaded version (using OpenMP)

From the OpenMP webpage:
The OpenMP API supports multi-platform shared-memory parallel programming in C/C++ and Fortran. The OpenMP API defines a portable, scalable model with a simple and flexible interface for developing parallel applications on platforms from the desktop to the supercomputer. 
I'm by no means an expert on OpenMP, so take what I say with a fistful of salt. This is what I've had to do to integrate pragma-based OpenMP statements in the wrr function:
  • I added an #include <omp.h> statement to wrr.c
  • I added "-fopenmp" to the compile_args list, and "-lgomp" to the extra_link_args list in setup.py
  • We have 4 nested loops in wrr (y, x, dy, dx). I found that multi-threading the outer loop gave the fastest code. As far as I understand, each line in the output image is produced by a different thread.
To expand on the last point, this is the OpenMP statement I used:

    if (BlurStdDev <= 1)
    {
        #pragma omp parallel for        \
            default(shared) private(x,dy,dx,Area,Below,Center,Temp)

        for (y = 0; y < Ydim; y++)
            for (x = 0; x < Xdim; x++)

The most difficult part (at least at the very basic level I'm using OpenMP at) is to identify which variables are shared and which are private to each thread. In this case, the inner loop indices and temporary variables are private (the outer loop index y is made private automatically by OpenMP). Everything else, the data and the line pointer tables, is shared.

In the second test, we need to add Blur as one of the private variables:

    /* Handle slower case with histogram blurring */
    else
    {
        #pragma omp parallel for        \
            default(shared) private(x,dy,dx,Area,Below,Center,Temp, Blur)

        for (y = 0; y < Ydim; y++)
            for (x = 0; x < Xdim; x++)
            {

... and that's it! Time to check the multi-threaded versions:
  • Radius 5: took 0.09s
  • Radius 10: took 0.31s
  • Radius 20: took 1.17s
  • Radius 40: took 4.59s
Speed-up achieved for Radius=40: The multi-threaded version (on 4 cores) is 2.6x faster than the single-threaded one. There's probably room for improvement, but it's a start!

EDIT


To get a better idea of the code speed-up (wall time) and thread creation impact (CPU clock time), I wrote a small test case in C (use the Makefile).

Depending on what's running in the background, output will look something like this:

./test_wrr -n 5


Testing the single-threaded version:
  iteration 1/5, 0.794s
  iteration 2/5, 0.792s
  iteration 3/5, 0.791s
  iteration 4/5, 0.791s
  iteration 5/5, 0.791s
  mean for 5 repeats: 0.792s


Testing the multi-threaded version:
  iteration 1/5, 0.242s
  iteration 2/5, 0.238s
  iteration 3/5, 0.243s
  iteration 4/5, 0.244s
  iteration 5/5, 0.240s
  mean for 5 repeats: 0.241s


Multi-threaded version speed-up: 3.28x
Thread creation impact: 11.65%



As always, code is on the EZwidgets SVN.


Do comment if you see any ways of improving the code. Thanks!

Sunday 6 May 2012

First look at Riverbank SIP (Python bindings for C and C++ libraries)

This is where I take a step back and start looking at improving my rapid-prototyping pipeline of Python, numpy and C. I'm looking into replacing SWIG with SIP.

I've posted two tutorials about generating Python bindings using SWIG and numpy.i; these describe a lot of the programming I do. Using SWIG, I can write code which is completely free from any Python/C constructs and easily generate usable bindings. The second tutorial shows automatic memory deallocation when the Python arrays are deleted. This makes my code truly "pythonic".

On the Python side of things, a wealth of libraries (Matplotlib, PIL, VTK, etc...) allows me to load various datasets (from databases, various microscopy image formats, synthetic data...), test the code and display useful data. Also if needed, I can use my code in a native application.

Some simple GCC optimisations have really made my code fly:
  • native CPU optimisation enabled by -march=native
  • auto-vectorisation, enabled at -O3 (and detailed by -ftree-vectorizer-verbose=2)
  • OpenMP multithreading enabled by -fopenmp requires a bit more work but is well worth the trouble (more about this here).
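
For reference, here is a minimal sketch of how those flags might be passed to a distutils build (extension and file names are placeholders):

from distutils.core import setup, Extension

ext = Extension("example",
                sources=["example.c", "example_wrap.c"],
                extra_compile_args=["-O3", "-march=native", "-fopenmp"],
                extra_link_args=["-lgomp"])

setup(name="example", ext_modules=[ext])
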
So back to bindings, and let me reiterate that both SWIG and numpy.i have been good and loyal friends over the years. I have never had to worry about my bindings until some recent work which is mostly Python but repeatedly calls a few scattered functions written in C. In this case, it's worth trying to optimise the bindings as well as the code, so it's time to look at some alternatives to SWIG. I found some anecdotal evidence that I should be looking into SIP as a replacement for SWIG.

My first baby step was to adapt SWIG's example.fact() to SIP. Result code is here.

EDIT: As a second baby step, I wrote a C++ example. Result code is here.

Next step will be to convert numpy.i to SIP. Maybe start with IN_ARRAY1, try and stress the binding code and compare to SWIG.

Stay tuned! And of course, if you want to share your own experience with SWIG / SIP / Cython, I'd love to hear about it.