Monday, May 23, 2011

git, github, clang

I've imported my svn history into git, and published the whole thing on github:

https://github.com/samrushing/irken-compiler/

Still trying to grok git and github, there'll probably be a stall for a week or so.

In the meanwhile, I have interesting news:  I was able to take out all the 'nested functions' which relied on a gcc extension.  I resisted this for a long time because early testing had shown that declaring the 'registers' as local register variables in the main function was a huge win.

After a quick test the other day, I found that moving everything out to global variables has no visible impact on performance.  I don't know if this is because of improvements in gcc & llvm, or if I just wasn't trying a large enough program.

That means I'm now able to compile the output with clang, rather than relying on llvm+dragonegg, which can be much more annoying to build, install, and use.

The good news continues: although the performance of the generated code is identical (maybe a tiny bit faster), the compilation time has been cut quite significantly.  On my machine an LLVM -O2 compile dropped from 18m to 8m.  The -O0 compile time is even sweeter, about 3s, which is actually faster than Irken itself, which hovers right around 5s to self-compile.  So a roughly 10s edit-compile cycle will definitely speed the development of Irken up.

Now the bad news.  The no-nested-funs version causes 'as' on snow leopard to take a lunch break.  It actually nails the CPU for 10 or 15 minutes before finally emerging as if nothing was wrong.  No problems on FreeBSD 7.3 with gcc-4.5.3, though.  This could be a significant annoyance, since I expect many people trying out Irken would prefer to do so with a stock compiler.

More bad news:  clang-2.9 crashes when fed Irken's self-compiled output.  I was going to file a bug against it, but when I checked out the svn version and built it the problem was gone.  Sigh.

This is why I'm trying to figure out how to make branches on github, since it'd be nice to maintain both versions for a while until these issues are ironed out.  I have made a local branch called 'no-nested-funs', I just haven't figured out how to publish it on github.

Sunday, May 15, 2011

nested datatypes, 2-3 trees, numerical encoding

This paper by Hinze & Paterson discusses the use of 'nested datatypes' to implement 2-3 trees, but in a way that uses the type system to automatically handle the invariants.  A blog post from Mark Chu-Carroll is a bit easier to follow.  Here's his example of a sequence encoded as a 2-3 tree (in Haskell):


data Tree t = Zero t | Succ (Tree (Node t))
data Node s = Node2 s s | Node3 s s s

Each level of the tree contains sub-trees that are 'one less' via the numeric encoding scheme.

This sent me back to Okasaki's book, where I'm finally starting to grok those last two chapters.  This bizarre mingling of numerical encoding and data structures lights a fire in the parts of my brain dedicated to data structures.  If feel as if I could automatically see the relationship between red-black trees, 2-3 trees, and binary number representations.... if only I could double my IQ.



I have this cool puzzle called the 'Spin-Out', which is actually a physical demonstration of the properties of Gray Codes.  It seems to me that Gray Codes and balanced trees should be a natural fit.


Unfortunately, I can't experiment with any of these ideas in Irken (or OCaml, or SML) because the nested datatypes require polymorphic recursion.

Type inference with polymorphic recursion is undecidable.  Haskell allows it only if the user provides a manual type signature.  I'm guessing this is impossible (i.e., unsafe) with SML/OCaml because of the presence of assignment?  Could such a feature safely be used with functions that the compiler can verify are pure?