Tuesday, May 18, 2010

The Advantages of Tagged Immediates

First, a definition. An 'immediate' is a value that can fit in a register. That is, a non-pointer value. In lisp, these are things like ints, bools, characters. Most lisp implementations use tag bits to identify these objects at runtime, and Irken has inherited this design.

The advantage is that the garbage collector and runtime can tell pointer from non-pointer, and can print out any type without consulting some table generated by the compiler.

The disadvantage is mostly with integers - Irken uses a one-bit tag, the lowest bit. Any value with the lowest bit set is an integer immediate. To get its value you merely shift right. This hurt a lot in the 16-bit world, a little in the 32-bit world, and I will venture to say not at all in a 64-bit world.

Bools, characters, unit types, etc... are represented by a value having the second bit set.

Pointer types are any value that has the lower two bits cleared; i.e., 32-bit aligned addresses. Very neat how that works out.

Now, the hardcore ML guys (like Mlton and OCaml) have pretty much eliminated all tagging, and use unboxed tuples whenever possible (i.e., a datatype containing three values will consist of exactly three words in the heap, no type+length word). They do this because they can, and I suppose to be more competitive with C.

My feeling is that the advantages of run-time tagging outweigh the disadvantages, especially when debugging a new compiler. I've had enough experience grepping through core dumps of python processes that I really appreciate having a heap image that can be combed through, analyzed by a coroner. [In fact, at IronPort we wrote an entire suite of tools for doing post-mortem analysis on python core files. Sadly they still haven't been open-sourced].

While hacking up the first test VM, I noticed another neat hack that claws back some of the disadvantage of boxes and tags.

Imagine you have a datatype like this:

(datatype key
  (:number int)
  (:letter char))

It would be rather inefficient to put 'keys' into one-tuples, only to have a tagged immediate inside. So Irken actually detects this case, and eliminates one-tuples like this whenever possible. At run-time, a 'key' will consist of either one or the other immediate type, distinguished by the tag that's already on these values.

The hack won't work for cases where you include more than one use of the same base type, for example:

(datatype key
  (:number int)
  (:funny int)
  (:letter char))

Since there's no way at runtime to distinguish the two integer alternatives. (But you'll still get it for the 'letter' alternative).

This hack works really well for a VM, where you need exactly this kind of datatype to represent the union of all possible types.

No comments:

Post a Comment