Friday, May 21, 2010

A possible approach to writing an LLVM backend

Playing around with dragonegg, I've found that I can get pretty detailed LLVM IR/asm output, and I can even make sense of how the C output by Irken is getting translated. Seems like this could be a handy way of getting started on writing an LLVM backend. Not that I'm thinking of doing that. Must... resist...

Here's the count-to-a-million test code in Scheme/Irken:

(let loop ((n 1000000))
  (if (zero? n)
      "done"
      (loop (- n 1))))

Here's the C output:

FUN_loop_8:
    r0 = varref (0,0);
    if PXLL_IS_TRUE(PXLL_TEST(unbox(r0)==0)) {
      r0 = (object*) &constructed_0;
      PXLL_RETURN(0);
    } else {
      r0 = varref (0,0);
      r1 = (object *) 3;
      r0 = box(unbox(r0)-unbox(r1));
      lenv[2] = r0;
      goto FUN_loop_8;
    }
    PXLL_RETURN(0);
  L1:
  r1 = allocate (TC_CLOSURE, 2);
  r1[1] = &&FUN_loop_8; r1[2] = lenv;
  r0[2] = r1;
  r0 = allocate (TC_TUPLE, 2);
  r1 = (object *) 2000001;
  r0[2] = r1;
  r1 = top[2];
  r0[1] = r1[2]; lenv = r0; goto FUN_loop_8;

And the relevant LLVM IR:

"<bb 12>":                                        ; preds = %"<bb 17>", %"<bb 13>", %"<bb 24>"
  %D.5560_160 = phi i8* [ inttoptr (i64 2000001 to i8*), %"<bb 24>" ], [ %D.5560_160, %"<bb 13>" ], [ %4, %"<bb 17>" ] ; <i8*> [#uses=3]
  %3 = icmp ult i8* %D.5560_160, inttoptr (i64 2 to i8*) ; <i1> [#uses=1]
  br i1 %3, label %"<bb 13>", label %"<bb 17>"

"<bb 13>":                                        ; preds = %"<bb 12>"
  %D.4871_48 = load i8** %2, align 8              ; <i8*> [#uses=1]
  indirectbr i8* %D.4871_48, [label %"<bb 12>", label %Lreturn]

"<bb 17>":                                        ; preds = %"<bb 12>"
  %n.85_162 = ptrtoint i8* %D.5560_160 to i64     ; <i64> [#uses=1]
  %D.5572_16323 = add i64 %n.85_162, -2           ; <i64> [#uses=1]
  %D.5579_167 = or i64 %D.5572_16323, 1           ; <i64> [#uses=1]
  %4 = inttoptr i64 %D.5579_167 to i8*            ; <i8*> [#uses=2]
  store i8* %4, i8** %11, align 8
  br label %"<bb 12>"

Tuesday, May 18, 2010

the pattern-matching parser, mostly done

I have a nearly finished python-like-language parser, written using pattern matching.

The lexer is generated as a DFA by parse/lexer.py.

The parser tables are generated by a set of tools (including Jason Evans' Parsing.py module) from a simplified version of the Python 1.5.2 Grammar.

The DFA engine is now in a separate file, lib/lexer.scm.

Here's the parser so far.

The lexer tokens are fed through a one-token buffer that generates INDENT and DEDENT tokens when necessary. This is done by the top-level 'parse' function.
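
To make the mechanism concrete, here's a rough sketch in C of how such a one-token buffer can work. This is not the actual code from the parser (which is written in Irken); the token struct, next_raw_token, and the per-token column field are all assumptions for illustration, and blank lines, EOF dedents, and inconsistent-dedent errors are left out.

/* hypothetical sketch: wrap the raw DFA lexer and synthesize INDENT/DEDENT
   tokens by tracking a stack of indentation columns. */

enum kind { T_NEWLINE, T_INDENT, T_DEDENT, T_NAME, T_NUMBER, T_EOF };

typedef struct { enum kind kind; int col; } token;

extern token next_raw_token (void);     /* the underlying DFA lexer (assumed) */

static int indents[64] = {0};           /* stack of active indentation columns */
static int depth = 0;
static int pending_dedents = 0;
static int pending_indent = 0;
static token held;
static int have_held = 0;

token next_token (void)
{
  if (pending_dedents) { pending_dedents--; return (token){T_DEDENT, 0}; }
  if (pending_indent)  { pending_indent = 0; return (token){T_INDENT, 0}; }
  if (have_held)       { have_held = 0; return held; }
  token t = next_raw_token();
  if (t.kind != T_NEWLINE)
    return t;
  /* peek at the first token on the next line and compare indentation */
  held = next_raw_token();
  have_held = 1;
  if (held.col > indents[depth]) {
    indents[++depth] = held.col;
    pending_indent = 1;
  } else {
    while (held.col < indents[depth]) {
      depth--;
      pending_dedents++;
    }
  }
  return t;   /* the NEWLINE itself; any INDENT/DEDENTs come out on later calls */
}

With the buffer doing this bookkeeping, the parser never has to think about whitespace at all - INDENT and DEDENT arrive like any other token.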

Following that are the expr datatypes, print functions, and finally the parsing functions themselves.

Here's a typical parsing function:

(define p-while-stmt
  ;; while_stmt: 'while' test ':' suite ['else' ':' suite]
  (_ test _ (item:nt _ body) (item:nt _ else)) -> (expr:while (p-expr test) (p-suite body) (p-else else))
  x -> (perror "p-while-stmt" x))

Oh, and here's some sample output.

The Advantages of Tagged Immediates

First, a definition. An 'immediate' is a value that can fit in a register. That is, a non-pointer value. In lisp, these are things like ints, bools, characters. Most lisp implementations use tag bits to identify these objects at runtime, and Irken has inherited this design.

The advantage is that the garbage collector and runtime can tell a pointer from a non-pointer, and can print out a value of any type without consulting some table generated by the compiler.

The disadvantage is mostly with integers - Irken uses a one-bit tag, the lowest bit. Any value with the lowest bit set is an integer immediate. To get its value you merely shift right, so ints give up one bit of range. This hurt a lot in the 16-bit world, a little in the 32-bit world, and I will venture to say not at all in a 64-bit world.

Bools, characters, unit types, etc... are represented by a value having the second bit set.

A pointer is any value with the lower two bits cleared, i.e., a 32-bit (4-byte) aligned address. Very neat how that works out.
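
To make that concrete, here's roughly what the scheme looks like in C. The names box and unbox match what shows up in the generated C, but these particular definitions are my reconstruction, not the actual Irken runtime headers.

/* a minimal sketch of the tagging scheme (assumed definitions) */

#include <stdint.h>

typedef void * object;

/* integers: lowest bit set, value in the remaining bits */
#define box(n)          ((object) ((((intptr_t)(n)) << 1) | 1))
#define unbox(p)        (((intptr_t)(p)) >> 1)
#define is_int(p)       (((intptr_t)(p)) & 1)

/* other immediates (bools, characters, unit, ...): second bit set */
#define is_immediate(p) (((intptr_t)(p)) & 3)

/* pointers: lower two bits clear, i.e. 4-byte-aligned addresses */
#define is_pointer(p)   ((((intptr_t)(p)) & 3) == 0)

For example, under this encoding box(1) is 3 and box(1000000) is 2000001 - exactly the kind of tagged constants that show up in the generated C.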

Now, the hardcore ML guys (like MLton and OCaml) have pretty much eliminated all tagging, and use unboxed tuples whenever possible (i.e., a datatype containing three values will consist of exactly three words in the heap, no type+length word). They do this because they can, and I suppose to be more competitive with C.

My feeling is that the advantages of run-time tagging outweigh the disadvantages, especially when debugging a new compiler. I've had enough experience grepping through core dumps of Python processes that I really appreciate having a heap image that can be combed through, analyzed by a coroner. [In fact, at IronPort we wrote an entire suite of tools for doing post-mortem analysis on Python core files. Sadly, they still haven't been open-sourced.]

While hacking up the first test VM, I noticed another neat hack that claws back some of the disadvantage of boxes and tags.

Imagine you have a datatype like this:

(datatype key
  (:number int)
  (:letter char))

It would be rather inefficient to put 'keys' into one-tuples, only to have a tagged immediate inside. So Irken actually detects this case, and eliminates one-tuples like this whenever possible. At run-time, a 'key' is simply one immediate or the other, distinguished by the tag that's already on those values.

The hack won't work for cases where you include more than one use of the same base type, for example:

(datatype key
  (:number int)
  (:funny int)
  (:letter char))

There's no way at runtime to distinguish the two integer alternatives, since they carry the same tag. (But you'll still get the optimization for the 'letter' alternative.)

This hack works really well for a VM, where you need exactly this kind of datatype to represent the union of all possible types.
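
In C terms - and using the assumed macros from the tagging sketch above, so again hypothetical, not the actual runtime or VM code - matching on a 'key' after the one-tuple elimination boils down to nothing more than the tag checks that were already there:

#include <stdio.h>
#include <stdint.h>

typedef void * object;

#define is_int(p)  (((intptr_t)(p)) & 1)
#define unbox(p)   (((intptr_t)(p)) >> 1)

/* hypothetical: dispatch on a 'key' value that is either an int immediate
   (the :number alternative) or a char immediate (the :letter alternative) */
void print_key (object k)
{
  if (is_int (k))
    printf ("(key:number %ld)\n", (long) unbox (k));
  else
    printf ("(key:letter ...)\n");   /* the only other possibility; decoding
                                        the char tag is left out since I'm
                                        guessing at the exact encoding */
}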

Monday, May 17, 2010

Getting Comfortable with Type Inference

Now that I've had many months of experience working with a type-inferred language, I'm beginning to notice a change in the way I work.

Statically-typed functional languages have a well-deserved reputation for spitting out obtuse, nearly useless type errors. Irken is no exception. In fact, it's probably much worse than most languages because a type error dumps you into a Python traceback. Also, it doesn't bother to track source line number information, so you're really on your own.

Over the past month or two, especially since I added pattern matching, I've noticed that my 'type accuracy' is improving. The chances that something will be correctly typed the first time I try to compile it are higher than they were. As I become familiar with the type system, and pattern matching, Irken has been training me. It's as if a type error (and the resulting jaunt through pdb) is a slap on the nose. A clean compile is like a nice treat. Good human.

This has changed my edit-compile loop a little - I tend to compile more often. If I've only changed one or two lines since my last compile, I'm pretty sure I know where the typo is.

This makes me wonder if one of ML's problems is related to this. Old-timers don't need to spend all that effort making a "Microsoft Bob" type error reporter. They see "itypes.TypeError: ((()->wta), (wjx->vpq))" and they know they forgot to pass an argument to that function call they just added.

Newcomers, though. They get that same feeling they had the first time they struggled with a C or Ada compiler.

When/if I rewrite Irken in itself, I think I'll try to do a better job. In "HM(X) Type Inference is CLP(X) Solving", Sulzmann and Stuckey argue that a constraint-based solver can do a better job of error reporting; so here's hoping I can figure that out when the time comes.

(p.s. I have to admit that OCaml's reporting is pretty snazzy, though)

Tuesday, May 11, 2010

the temptations of LLVM

I'm still plugging away at the parser and VM for the python-like-language.

But in another window I have the language reference for LLVM open, and as I browse through it I'm becoming increasingly distracted. There are many wonderful snacks - things like add with overflow detection, intrinsic functions like memmove, nice floating-point and vector instruction support, atomic swap operations, etc...

An interesting aspect of LLVM is that it doesn't really support a register model. Instead, it uses SSA, which means that you just keep inventing names for each result. Now, this will either be a perfect fit for Irken's CPS-based compiler, or it will make an LLVM backend impossible. I haven't decided which yet.

A long time ago, CPS and SSA were proven to be equivalent. From what I can tell, the compiler world has moved on and pretends that CPS never existed.

Monday, May 3, 2010

A statically typed VM?

Trying to wrap my head around this idea... The VM I'm playing around with right now is dynamically typed - but its implementation is statically typed. I started thinking about making a bytecode implementation of Irken itself (rather than a Python-like language).

But I'm not sure it even makes sense to try to write a static VM in a static language, because you'd have to do the equivalent of 'void*' casting everywhere. (Of course, this is what you do when you write a VM in C, but it feels like a much more natural thing to do in C 8^).
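
For what it's worth, here's the kind of thing I mean, sketched in C with a made-up opcode (not Irken's actual VM): every stack slot is an untyped word, and each opcode just casts and hopes.

#include <stdint.h>

typedef void * object;

static object stack[256];
static int sp = 0;

/* hypothetical ADD opcode: both operands are assumed to be tagged ints,
   and nothing in the types enforces it - that's the 'void*' casting at work */
static void op_add (void)
{
  intptr_t b = ((intptr_t) stack[--sp]) >> 1;
  intptr_t a = ((intptr_t) stack[--sp]) >> 1;
  stack[sp++] = (object) (((a + b) << 1) | 1);
}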

I was curious to see how OCaml dealt with this issue... and was quite surprised to find a VM written in C.