Monday, January 2, 2012

Irken status, LLVM bitcode

This is just a quick note that the Irken project has not died, only gone into a temporary hibernation.
When in the course of history it becomes necessary to earn a living...

Anyway, I have some tangentially related comments about LLVM.
My current employer is interested in maybe doing a little LLVM JIT work, wherein my experience with compilers may prove useful.  A nice side effect for me is that I finally get to play around a bit with LLVM, something which I carefully avoided while working on Irken.

My first reaction was to pick up llvm-py, which looked like a fairly complete interface to the various libraries that make up the LLVM system.

Big Mistake.  It doesn't build any more, even though its last release is only a little over a year old.  That release was for LLVM 2.8rc2.  I am now running 3.0.

Here's the problem: the LLVM APIs are constantly changing.  And that's a good thing, really.  But since those C++ APIs are the only well-supported interface to the system, anything built on top of them is chasing a moving target.

There are three main options for a compiler writer who wants to target LLVM.  The first is to use the C++ API (or you could use the completely undocumented and incomplete C interface).

The second option is to write LLVM assembly.  This is not a bad option, but it will be slower, because LLVM has to parse the assembly text before it can do anything with it.

The third is to write LLVM 'bitcode'.  I thought this was probably the right approach.  Bitcode is a binary representation of the LLVM IR language, so it seems likely to change much more slowly than the library interfaces.

The problem is that the person who designed the bitcode file format was on meth at the time.  This has got to be one of the most evil formats I've ever seen.  I think they were trying to make the resulting file as small as possible, at the cost of having to count every bit.  (Perhaps an early LLVM project flew on a space probe?)  Just to give you an idea, 3- and 4-bit integers (not aligned on any boundary) are a common field type.  And not just that, but some of the fields are variably sized, starting at 2 bits and going up.  Symbols are represented using 6-bit characters.  Also, the concept of 'bitstream' is taken very literally: the bits in a file are viewed as if they were one large integer, starting from the LSB.  This is actually an elegant approach, but it is completely different from every other file format or network protocol.  I had to continually resist using common parsing patterns with it.  There's also a compression-like method of 'abbreviating' common record types.  And after all that work to save bits, at the end of every block the bit stream is aligned to 32 bits!

Ok, got that off my chest.
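
To make the complaints above a bit more concrete, here is a minimal Python sketch of the reading side: the file treated as one large integer read from the LSB, unaligned fixed-width fields, variable-width (VBR) fields, 6-bit characters, and the 32-bit alignment at the end of a block.  The class and method names (BitReader, read_fixed, read_vbr, read_char6, align32) are made up for illustration and are not taken from any LLVM tool.  The VBR chunk layout and the 6-bit character table follow the LLVM bitcode documentation as far as I can tell; blocks, records, and abbreviations are not handled at all.

```python
import string

# 6-bit character table, assumed from the LLVM bitcode docs:
# 'a'-'z' are 0-25, 'A'-'Z' are 26-51, '0'-'9' are 52-61, '.' is 62, '_' is 63.
CHAR6 = string.ascii_lowercase + string.ascii_uppercase + string.digits + "._"


class BitReader:
    """Reads fields from a byte string viewed as one large integer, LSB first."""

    def __init__(self, data: bytes):
        # Little-endian conversion puts bit 0 of byte 0 at bit 0 of the integer.
        self.value = int.from_bytes(data, "little")
        self.nbits = len(data) * 8
        self.pos = 0  # current position, counted in bits

    def read_fixed(self, width: int) -> int:
        """Read an unsigned fixed-width field of `width` bits."""
        if self.pos + width > self.nbits:
            raise EOFError("read past end of bitstream")
        result = (self.value >> self.pos) & ((1 << width) - 1)
        self.pos += width
        return result

    def read_vbr(self, width: int) -> int:
        """Read a variable-width field made of `width`-bit chunks.

        The top bit of each chunk says whether another chunk follows;
        the remaining width-1 bits carry data, least significant chunk first.
        """
        if width < 2:
            raise ValueError("VBR chunks need at least 2 bits")
        more = 1 << (width - 1)
        result = 0
        shift = 0
        while True:
            chunk = self.read_fixed(width)
            result |= (chunk & (more - 1)) << shift
            if not chunk & more:
                return result
            shift += width - 1

    def read_char6(self) -> str:
        """Read one 6-bit character."""
        return CHAR6[self.read_fixed(6)]

    def align32(self) -> None:
        """Skip ahead to the next 32-bit boundary (as at the end of a block)."""
        self.pos = (self.pos + 31) & ~31


if __name__ == "__main__":
    # The bitcode magic number: 'B', 'C', then four 4-bit fields.
    r = BitReader(b"BC\xc0\xde")
    print(chr(r.read_fixed(8)), chr(r.read_fixed(8)))
    print([hex(r.read_fixed(4)) for _ in range(4)])
```

Converting the whole file to a single Python integer is wasteful for large inputs, but it mirrors the "one large integer" view of the format directly, which is the point of the sketch.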

My sense is that nobody uses the bitcode format because of this (the only other implementation I could find was in Haskell, and was also impenetrable because Everything Is A Monad, You Know).  Most people just succumb to the pressure to write against the C++ API, and will then spend the rest of their lives reading the llvm-dev list to keep track of the changes.

I have mostly decoded the format, and I think I could actually output it for a compiler.
But I think I'll split the difference and start out by outputting LLVM assembly.  If the day comes when I need to shave some of the 10MB off the resulting .so file, and maybe speed things up, I'll revisit the bitcode issue.
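
As a rough illustration of what "outputting LLVM assembly" means for a compiler back end, the sketch below builds the text of a tiny function as a plain Python string.  The helper name emit_add_function is hypothetical, and a real code generator would walk its own IR rather than hard-code instructions; the point is only that the output is ordinary text, with no dependency on the LLVM libraries at all.

```python
def emit_add_function(name: str = "add") -> str:
    """Return LLVM assembly text for a function that adds two i32 values."""
    lines = [
        f"define i32 @{name}(i32 %a, i32 %b) {{",
        "entry:",
        "  %sum = add i32 %a, %b",
        "  ret i32 %sum",
        "}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the assembly; it could equally be written to a .ll file.
    print(emit_add_function(), end="")
```

The resulting text can be saved to a .ll file and handed to LLVM's own tools (llvm-as, for instance) to check that it parses.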

In the meantime I've had a whack at using Cython's C++ support to make a minimal interface to LLVM where I can feed either LLVM assembly or bitcode to the JIT and run it.

1 comment:

  1. Heh, another issue I forgot to mention: because the API is changing so much, it's actually quite dangerous to look for any information about LLVM anywhere on the net other than the official LLVM site. If you start grepping around for function names, you'll end up looking at 8-month-old web pages that describe how it *used* to be done. So lots of tutorials and documentation are useless as well.