Monday, July 26, 2010

new 'defmacro'

Here's a taste of what the new macros look like. It's just like syntax-rules, but without the redundant 'define-syntax'/'syntax-rules' layers. I've also lost a layer of parens by using something closer to the Irken pattern-matching syntax.

;; -*- Mode: Irken -*-

;; derived expressions

(defmacro and
  (and)                 -> #t
  (and test)            -> test
  (and test1 test2 ...) -> (if test1 (and test2 ...) #f)
  )

;; *not* the same as a Scheme <or>, this returns a boolean.
(defmacro or
  (or) -> #f
  (or test) -> test
  (or test1 test2 ...) -> (if test1 #t (or test2 ...))
  )

(defmacro let
  ;; normal <let> here, we just rename it to our core
  ;; binding construct, <let_splat>
  (let ((name val) ...) body1 body2 ...)
  -> (let_splat ((name val) ...) body1 body2 ...)

  ;; named let
  (let tag ((name val) ...) body1 body2 ...)
  -> (letrec ((tag (lambda (name ...) body1 body2 ...)))
       (tag val ...))
  )

Wednesday, July 21, 2010

The Macro Wars

Back in the 80's I worked mostly in Common Lisp, and therefore learned the relatively simple 'defmacro'. Over the years of exposure to Scheme, I was often curious about the more sophisticated macro systems (I think first described in R4RS, but rarely implemented). Most Scheme systems still used a backquote-style macro facility.

I never tried the 'hygienic' macro systems, partly because I didn't really care about the issue they're trying to solve, but mostly because they're just dauntingly complex. One reason is that people still seem to have not settled on one standard system. Maybe R6RS has?

If you want a taste, look over the PLT/Racket documentation here and here.

This discussion covers the strain between the two camps pretty well.

I've not completely decided yet whether I'm going to add macros to Irken. I'm coding something up now to try out. If it's a win, it stays in. But I'm leaning toward something like 'syntax-rules', with a simplified syntax. Pattern-matching will be a natural fit with the rest of Irken. [Why does scheme require both 'define-syntax' and 'syntax-rules' as two separate forms, repeating the name of the macro yet again?]

Tuesday, June 29, 2010

The Evil Value Restriction and Lisp Macros

One of the more annoying features of the HM type system is how it interacts poorly with imperative features. Every few weeks I completely forget these limitations and try to do something that in any other language would be completely natural, and get bit again.

The heart of the problem is exposed by the idea of using a mutable list datatype. When you create an empty list ('nil'), it has type "list 'a" - in other words, 'nil' can take on any type you like.

In Irken, as long as you stick with the datatype constructors themselves, this works. You can create lists of different types right next to each other and have all the type safety you like.

But as soon as you try to wrap that up - for example, into a stack ADT - you hit the value restriction.

In OCaml, this problem manifests as 'weak' type variables - e.g., " list '_a ", which just records the fact that '_a is currently unknown, and will instantiate to only one type.

One reason this keeps biting me unexpectedly is that I've changed the compiler to (by default) only type the program after inlining. This make copies of smaller polymorphic functions/interfaces, thereby side-stepping the problem with the value restriction - instead of making one copy of a function that works at multiple types, inlining causes that code to be duplicated at each site. I only notice VR problems when I force it to type twice, or turn off inlining.

I'm wondering: as a practical matter, maybe I should just embrace this. I'm not qualified to invent a new type system that's going to fix all this. But a 'stupid' substitution system (like C++ templates or lisp macros) makes the problem go away, and that's good enough for me.

AFAIK none of the ML's have a macro system. Is this perhaps a side-effect of ML fans' universal hatred of Lisp?

How might macros interact with HM? For example, could we design a 'polymorphic class' that would essentially copy the code at each use site?

Friday, May 21, 2010

A possible approach to writing an LLVM backend

Playing around with dragonegg, I've found that I can get pretty detailed llvm-ir/asm output, and I can even make sense of how the C output by Irken is getting translated. Seems like this could be a handy way of getting started on writing an llvm backend. Not that I'm thinking of doing that. Must... resist...

Here's the count-to-a-million test code in Scheme/Irken:

(let loop ((n 1000000))
  (if (zero? n)
      "done"
      (loop (- n 1))))

Here's the C output:

FUN_loop_8:
    r0 = varref (0,0);
    if PXLL_IS_TRUE(PXLL_TEST(unbox(r0)==0)) {
      r0 = (object*) &constructed_0;
      PXLL_RETURN(0);
    } else {
      r0 = varref (0,0);
      r1 = (object *) 3;
      r0 = box(unbox(r0)-unbox(r1));
      lenv[2] = r0;
      goto FUN_loop_8;
    }
    PXLL_RETURN(0);
  L1:
  r1 = allocate (TC_CLOSURE, 2);
  r1[1] = &&FUN_loop_8; r1[2] = lenv;
  r0[2] = r1;
  r0 = allocate (TC_TUPLE, 2);
  r1 = (object *) 2000001;
  r0[2] = r1;
  r1 = top[2];
  r0[1] = r1[2]; lenv = r0; goto FUN_loop_8;

And the relevant LLVM IR:

"<bb 12>":                                        ; preds = %"<bb 17>", %"<bb 13>", %"<bb 24>"
  %D.5560_160 = phi i8* [ inttoptr (i64 2000001 to i8*), %"<bb 24>" ], [ %D.5560_160, %"<bb 13>" ], [ %4, %"<bb 17>" ] ; <i8*> [#uses=3]
  %3 = icmp ult i8* %D.5560_160, inttoptr (i64 2 to i8*) ; <i1> [#uses=1]
  br i1 %3, label %"<bb 13>", label %"<bb 17>"

"<bb 13>":                                        ; preds = %"<bb 12>"
  %D.4871_48 = load i8** %2, align 8              ; <i8*> [#uses=1]
  indirectbr i8* %D.4871_48, [label %"<bb 12>", label %Lreturn]

"<bb 17>":                                        ; preds = %"<bb 12>"
  %n.85_162 = ptrtoint i8* %D.5560_160 to i64     ; <i64> [#uses=1]
  %D.5572_16323 = add i64 %n.85_162, -2           ; <i64> [#uses=1]
  %D.5579_167 = or i64 %D.5572_16323, 1           ; <i64> [#uses=1]
  %4 = inttoptr i64 %D.5579_167 to i8*            ; <i8*> [#uses=2]
  store i8* %4, i8** %11, align 8
  br label %"<bb 12>"

Tuesday, May 18, 2010

the pattern-matching parser, mostly done

I have a nearly finished python-like-language parser, written using pattern matching.

The lexer is generated as a DFA by parse/lexer.py

The parser tables are generated by a set of tools (including Jason Evans' Parsing.py module) from a simplified version of the Python 1.5.2 Grammar.

The DFA engine is now in a separate file, lib/lexer.scm.

Here's the parser so far.

The lexer tokens are fed through a one-token buffer that generates INDENT and DEDENT tokens when necessary. This is done by the top-level 'parse' function.

Following that are the expr datatypes, print functions, and finally the parsing functions themselves.

Here's a typical parsing function:

(define p-while-stmt
  ;; while_stmt: 'while' test ':' suite ['else' ':' suite]
  (_ test _ (item:nt _ body) (item:nt _ else)) -> (expr:while (p-expr test) (p-suite body) (p-else else))
  x -> (perror "p-while-stmt" x))

Oh, and here's some sample output.

The Advantages of Tagged Immediates

First, a definition. An 'immediate' is a value that can fit in a register. That is, a non-pointer value. In lisp, these are things like ints, bools, characters. Most lisp implementations use tag bits to identify these objects at runtime, and Irken has inherited this design.

The advantage is that the garbage collector and runtime can tell pointer from non-pointer, and can print out any type without consulting some table generated by the compiler.

The disadvantage is mostly with integers - Irken uses a one-bit tag, the lowest bit. Any value with the lowest bit set is an integer immediate. To get its value you merely shift right. This hurt a lot in the 16-bit world, a little in the 32-bit world, and I will venture to say not at all in a 64-bit world.

Bools, characters, unit types, etc... are represented by a value having the second bit set.

Pointer types are any value that has the lower two bits cleared; i.e., 32-bit aligned addresses. Very neat how that works out.

Now, the hardcore ML guys (like Mlton and OCaml) have pretty much eliminated all tagging, and use unboxed tuples whenever possible (i.e., a datatype containing three values will consist of exactly three words in the heap, no type+length word). They do this because they can, and I suppose to be more competitive with C.

My feeling is that the advantages of run-time tagging outweigh the disadvantages, especially when debugging a new compiler. I've had enough experience grepping through core dumps of python processes that I really appreciate having a heap image that can be combed through, analyzed by a coroner. [In fact, at IronPort we wrote an entire suite of tools for doing post-mortem analysis on python core files. Sadly they still haven't been open-sourced].

While hacking up the first test VM, I noticed another neat hack that claws back some of the disadvantage of boxes and tags.

Imagine you have a datatype like this:

(datatype key
  (:number int)
  (:letter char))

It would be rather inefficient to put 'keys' into one-tuples, only to have a tagged immediate inside. So Irken actually detects this case, and eliminates one-tuples like this whenever possible. At run-time, a 'key' will consist of either one or the other immediate type, distinguished by the tag that's already on these values.

The hack won't work for cases where you include more than one use of the same base type, for example:

(datatype key
  (:number int)
  (:funny int)
  (:letter char))

Since there's no way at runtime to distinguish the two integer alternatives. (But you'll still get it for the 'letter' alternative).

This hack works really well for a VM, where you need exactly this kind of datatype to represent the union of all possible types.

Monday, May 17, 2010

Getting Comfortable with Type Inference

Now that I've had many months of experience working with a type-inferred language, I'm beginning to notice a change in the way I work.

Statically-typed functional languages have a well-deserved reputation for spitting out obtuse, nearly useless type errors. Irken is no exception. In fact, it's probably much worse than most languages because a type error dumps you into a Python traceback. Also, it doesn't bother to track source line number information, so you're really on your own.

Over the past month or two, especially since I added pattern matching, I've noticed that my 'type accuracy' is improving. The chances that something will be correctly typed the first time I try to compile it are higher than they were. As I become familiar with the type system, and pattern matching, Irken has been training me. It's as if a type error (and the resulting jaunt through pdb) is a slap on the nose. A clean compile is like a nice treat. Good human.

This has changed my edit-compile loop a little - I tend to compile more often. If I've only changed one or two lines since my last compile, I'm pretty sure I know where the typo is.

This makes me wonder if one of ML's problems is related to this. Old-timers don't need to spend all that effort making a "Microsoft Bob" type error reporter. They see "itypes.TypeError: ((()->wta), (wjx->vpq))" and they know they forgot to pass an argument to that function call they just added.

Newcomers, though. They get that same feeling they had the first time they struggled with a C or Ada compiler.

When/if I rewrite Irken in itself, I think I'll try to do a better job. In "HM(X) Type Inference is CLP(X) Solving", Sulzmann and Stuckey argue that a constraint-based solver can do a better job of error reporting; so here's hoping I can figure that out when the time comes.

(p.s. I have to admit that OCaml's reporting is pretty snazzy, though)