The Alien Tongue: 2010

Sunday, December 19, 2010

self-hosting

Just a quick note to explain the relative silence here recently. I've started rewriting Irken in itself, originally just as an exercise, but now with a little enthusiasm. Unfortunately there's the danger of second-system syndrome, which I'm trying to keep under control by focusing on eliminating design flaws, and not introducing new features.

Currently, I have the reader, the transformer, and the macro packages written. Next up are the pattern match compiler and the typing engine. Should be pretty smooth sailing after that.

how about a pattern-matching version of "let loop"?

I tend to use the "named let" idiom a lot, to me it's the most general form of looping construct, and is more readable than artificial constructs like 'do'.

I've begun to pine for a pattern-matching version of that construct, though. Not sure how well it would fly, but it would nicely generalize pattern matching. The reason I like 'named let' is because it lets you compactly introduce a recursive function while keeping the initial arguments in a place that makes the whole construct easier to understand.

Here's what the 'length' function looks like right now:

(define (length l)
  (define fun
    acc ()        -> acc
    acc (hd . tl) -> (fun (+ 1 acc) tl)
  (fun 0 l)))

And here's what it might look like with a pattern-matching 'named let' variant:

(define (length l)
  (let loop (0 l)
    acc ()        -> acc
    acc (hd . tl) -> (loop (+ 1 acc) tl)))

Visually I think it should be easy to distinguish from a normal let by the lack of extra levels of parens in the 'binding' spot... and this is in addition to the obvious -> symbols.

One bad side-effect of this is that let could no longer be done with defmacro (as in lib/derived.scm), and would require a special hack in the parser...

Wednesday, November 3, 2010

proper datatypes for bool and symbol

I had previously been tagging user-immediates in a way that wasted tag-space and didn't match up with the obvious implementation used for things like bools and chars.

This is now fixed. A nice side-effect of this change is that matches against bools are now detected as complete or incomplete.

Incomplete matches are now detected earlier in the compiler, in a way that makes it easier to figure out the code that actually triggered it.

I've also, for fun, started writing the lisp reader in irken itself.

Tuesday, October 26, 2010

equirecursive types

Just a quick status update. As my 'user' program has become larger, the type solver has begun to slow down horribly. I'm on the wrong end of some horrible O(x) where x is probably n^4 or worse.

So I decided to try an experiment, going back to the classic algorithm W. This only took a couple of days! However. One thing I didn't realize I was getting for 'free' from the constraint-based solver was recursive types. Now I'm forced to actually understand the concept.

I see two potential outcomes: 1) I grok recursive types well enough to fix my unifier, and the compiler gets much much faster; or 2) I rewrite the constraint solver to be more efficient. It's possible that I understand things well enough now to be able to do so.

If anyone out there can explain how to do unification with equirecursive types, I'd love to have the help!

Sunday, September 19, 2010

first draft of a language tutorial

I've slapped together a tutorial/language document, I'd love to get some feedback on it. In order to save time and space, it is aimed at programmers that are already familiar with Lisp/Scheme; mostly it introduces the MLish concepts and unique quirks of Irken.

The Irken Language.

Feedback of any kind would be much appreciated!

Monday, July 26, 2010

new 'defmacro'

Here's a taste of what the new macros look like. It's just like syntax-rules, but without the redundant 'define-syntax'/'syntax-rules' layers. I've also lost a layer of parens by using something closer to the Irken pattern-matching syntax.

;; -*- Mode: Irken -*-

;; derived expressions

(defmacro and
  (and)                 -> #t
  (and test)            -> test
  (and test1 test2 ...) -> (if test1 (and test2 ...) #f)
  )

;; *not* the same as a Scheme <or>, this returns a boolean.
(defmacro or
  (or) -> #f
  (or test) -> test
  (or test1 test2 ...) -> (if test1 #t (or test2 ...))
  )

(defmacro let
  ;; normal <let> here, we just rename it to our core
  ;; binding construct, <let_splat>
  (let ((name val) ...) body1 body2 ...)
  -> (let_splat ((name val) ...) body1 body2 ...)

  ;; named let
  (let tag ((name val) ...) body1 body2 ...)
  -> (letrec ((tag (lambda (name ...) body1 body2 ...)))
       (tag val ...))
  )

Wednesday, July 21, 2010

The Macro Wars

Back in the 80's I worked mostly in Common Lisp, and therefore learned the relatively simple 'defmacro'. Over the years of exposure to Scheme, I was often curious about the more sophisticated macro systems (I think first described in R4RS, but rarely implemented). Most Scheme systems still used a backquote-style macro facility.

I never tried the 'hygienic' macro systems, partly because I didn't really care about the issue they're trying to solve, but mostly because they're just dauntingly complex. One reason is that people still seem to have not settled on one standard system. Maybe R6RS has?

If you want a taste, look over the PLT/Racket documentation here and here.

This discussion covers the strain between the two camps pretty well.

I've not completely decided yet whether I'm going to add macros to Irken. I'm coding something up now to try out. If it's a win, it stays in. But I'm leaning toward something like 'syntax-rules', with a simplified syntax. Pattern-matching will be a natural fit with the rest of Irken. [Why does scheme require both 'define-syntax' and 'syntax-rules' as two separate forms, repeating the name of the macro yet again?]

Tuesday, June 29, 2010

The Evil Value Restriction and Lisp Macros

One of the more annoying features of the HM type system is how it interacts poorly with imperative features. Every few weeks I completely forget these limitations and try to do something that in any other language would be completely natural, and get bit again.

The heart of the problem is exposed by the idea of using a mutable list datatype. When you create an empty list ('nil'), it has type "list 'a" - in other words, 'nil' can take on any type you like.

In Irken, as long as you stick with the datatype constructors themselves, this works. You can create lists of different types right next to each other and have all the type safety you like.

But as soon as you try to wrap that up - for example, into a stack ADT - you hit the value restriction.

In OCaml, this problem manifests as 'weak' type variables - e.g., " list '_a ", which just records the fact that '_a is currently unknown, and will instantiate to only one type.

One reason this keeps biting me unexpectedly is that I've changed the compiler to (by default) only type the program after inlining. This make copies of smaller polymorphic functions/interfaces, thereby side-stepping the problem with the value restriction - instead of making one copy of a function that works at multiple types, inlining causes that code to be duplicated at each site. I only notice VR problems when I force it to type twice, or turn off inlining.

I'm wondering: as a practical matter, maybe I should just embrace this. I'm not qualified to invent a new type system that's going to fix all this. But a 'stupid' substitution system (like C++ templates or lisp macros) makes the problem go away, and that's good enough for me.

AFAIK none of the ML's have a macro system. Is this perhaps a side-effect of ML fans' universal hatred of Lisp?

How might macros interact with HM? For example, could we design a 'polymorphic class' that would essentially copy the code at each use site?

Friday, May 21, 2010

A possible approach to writing an LLVM backend

Playing around with dragonegg, I've found that I can get pretty detailed llvm-ir/asm output, and I can even make sense of how the C output by Irken is getting translated. Seems like this could be a handy way of getting started on writing an llvm backend. Not that I'm thinking of doing that. Must... resist...

Here's the count-to-a-million test code in Scheme/Irken:

(let loop ((n 1000000))
  (if (zero? n)
      "done"
      (loop (- n 1))))

Here's the C output:

FUN_loop_8:
    r0 = varref (0,0);
    if PXLL_IS_TRUE(PXLL_TEST(unbox(r0)==0)) {
      r0 = (object*) &constructed_0;
      PXLL_RETURN(0);
    } else {
      r0 = varref (0,0);
      r1 = (object *) 3;
      r0 = box(unbox(r0)-unbox(r1));
      lenv[2] = r0;
      goto FUN_loop_8;
    }
    PXLL_RETURN(0);
  L1:
  r1 = allocate (TC_CLOSURE, 2);
  r1[1] = &&FUN_loop_8; r1[2] = lenv;
  r0[2] = r1;
  r0 = allocate (TC_TUPLE, 2);
  r1 = (object *) 2000001;
  r0[2] = r1;
  r1 = top[2];
  r0[1] = r1[2]; lenv = r0; goto FUN_loop_8;

And the relevant LLVM IR:

"<bb 12>":                                        ; preds = %"<bb 17>", %"<bb 13>", %"<bb 24>"
  %D.5560_160 = phi i8* [ inttoptr (i64 2000001 to i8*), %"<bb 24>" ], [ %D.5560_160, %"<bb 13>" ], [ %4, %"<bb 17>" ] ; <i8*> [#uses=3]
  %3 = icmp ult i8* %D.5560_160, inttoptr (i64 2 to i8*) ; <i1> [#uses=1]
  br i1 %3, label %"<bb 13>", label %"<bb 17>"

"<bb 13>":                                        ; preds = %"<bb 12>"
  %D.4871_48 = load i8** %2, align 8              ; <i8*> [#uses=1]
  indirectbr i8* %D.4871_48, [label %"<bb 12>", label %Lreturn]

"<bb 17>":                                        ; preds = %"<bb 12>"
  %n.85_162 = ptrtoint i8* %D.5560_160 to i64     ; <i64> [#uses=1]
  %D.5572_16323 = add i64 %n.85_162, -2           ; <i64> [#uses=1]
  %D.5579_167 = or i64 %D.5572_16323, 1           ; <i64> [#uses=1]
  %4 = inttoptr i64 %D.5579_167 to i8*            ; <i8*> [#uses=2]
  store i8* %4, i8** %11, align 8
  br label %"<bb 12>"

Tuesday, May 18, 2010

the pattern-matching parser, mostly done

I have a nearly finished python-like-language parser, written using pattern matching.

The lexer is generated as a DFA by parse/lexer.py

The parser tables are generated by a set of tools (including Jason Evans' Parsing.py module) from a simplified version of the Python 1.5.2 Grammar.

The DFA engine is now in a separate file, lib/lexer.scm.

Here's the parser so far.

The lexer tokens are fed through a one-token buffer that generates INDENT and DEDENT tokens when necessary. This is done by the top-level 'parse' function.

Following that are the expr datatypes, print functions, and finally the parsing functions themselves.

Here's a typical parsing function:

(define p-while-stmt
  ;; while_stmt: 'while' test ':' suite ['else' ':' suite]
  (_ test _ (item:nt _ body) (item:nt _ else)) -> (expr:while (p-expr test) (p-suite body) (p-else else))
  x -> (perror "p-while-stmt" x))

Oh, and here's some sample output.

The Advantages of Tagged Immediates

First, a definition. An 'immediate' is a value that can fit in a register. That is, a non-pointer value. In lisp, these are things like ints, bools, characters. Most lisp implementations use tag bits to identify these objects at runtime, and Irken has inherited this design.

The advantage is that the garbage collector and runtime can tell pointer from non-pointer, and can print out any type without consulting some table generated by the compiler.

The disadvantage is mostly with integers - Irken uses a one-bit tag, the lowest bit. Any value with the lowest bit set is an integer immediate. To get its value you merely shift right. This hurt a lot in the 16-bit world, a little in the 32-bit world, and I will venture to say not at all in a 64-bit world.

Bools, characters, unit types, etc... are represented by a value having the second bit set.

Pointer types are any value that has the lower two bits cleared; i.e., 32-bit aligned addresses. Very neat how that works out.

Now, the hardcore ML guys (like Mlton and OCaml) have pretty much eliminated all tagging, and use unboxed tuples whenever possible (i.e., a datatype containing three values will consist of exactly three words in the heap, no type+length word). They do this because they can, and I suppose to be more competitive with C.

My feeling is that the advantages of run-time tagging outweigh the disadvantages, especially when debugging a new compiler. I've had enough experience grepping through core dumps of python processes that I really appreciate having a heap image that can be combed through, analyzed by a coroner. [In fact, at IronPort we wrote an entire suite of tools for doing post-mortem analysis on python core files. Sadly they still haven't been open-sourced].

While hacking up the first test VM, I noticed another neat hack that claws back some of the disadvantage of boxes and tags.

Imagine you have a datatype like this:

(datatype key
  (:number int)
  (:letter char))

It would be rather inefficient to put 'keys' into one-tuples, only to have a tagged immediate inside. So Irken actually detects this case, and eliminates one-tuples like this whenever possible. At run-time, a 'key' will consist of either one or the other immediate type, distinguished by the tag that's already on these values.

The hack won't work for cases where you include more than one use of the same base type, for example:

(datatype key
  (:number int)
  (:funny int)
  (:letter char))

Since there's no way at runtime to distinguish the two integer alternatives. (But you'll still get it for the 'letter' alternative).

This hack works really well for a VM, where you need exactly this kind of datatype to represent the union of all possible types.

Monday, May 17, 2010

Getting Comfortable with Type Inference

Now that I've had many months of experience working with a type-inferred language, I'm beginning to notice a change in the way I work.

Statically-typed functional languages have a well-deserved reputation for spitting out obtuse, nearly useless type errors. Irken is no exception. In fact, it's probably much worse than most languages because a type error dumps you into a Python traceback. Also, it doesn't bother to track source line number information, so you're really on your own.

Over the past month or two, especially since I added pattern matching, I've noticed that my 'type accuracy' is improving. The chances that something will be correctly typed the first time I try to compile it are higher than they were. As I become familiar with the type system, and pattern matching, Irken has been training me. It's as if a type error (and the resulting jaunt through pdb) is a slap on the nose. A clean compile is like a nice treat. Good human.

This has changed my edit-compile loop a little - I tend to compile more often. If I've only changed one or two lines since my last compile, I'm pretty sure I know where the typo is.

This makes me wonder if one of ML's problems is related to this. Old-timers don't need to spend all that effort making a "Microsoft Bob" type error reporter. They see "itypes.TypeError: ((()->wta), (wjx->vpq))" and they know they forgot to pass an argument to that function call they just added.

Newcomers, though. They get that same feeling they had the first time they struggled with a C or Ada compiler.

When/if I rewrite Irken in itself, I think I'll try to do a better job. In "HM(X) Type Inference is CLP(X) Solving", Sulzmann and Stuckey argue that a constraint-based solver can do a better job of error reporting; so here's hoping I can figure that out when the time comes.

(p.s. I have to admit that OCaml's reporting is pretty snazzy, though)

Tuesday, May 11, 2010

the temptations of LLVM

I'm still plugging away at the parser and vm for the python-like-language.

But in another window I have the language reference for LLVM open, and as I browse through it I'm becoming increasingly distracted. There are many wonderful snacks - things like add with overflow detection, intrinsic functions like memmove, nice floating-point and vector insn support, atomic swap operations, etc...

An interesting aspect of LLVM is that it doesn't really support a register model. Instead, it uses SSA, which means that you just keep inventing names for each result. Now, this will either be a perfect fit for Irken's CPS-based compiler, or it will make it impossible to do. I haven't decided yet.

A long time ago, CPS and SSA were proven to be equivalent. From what I can tell, the compiler world has moved on and pretends that CPS never existed.

Monday, May 3, 2010

A statically typed VM?

Trying to wrap my head around this idea... The VM I'm playing around with right now is dynamically typed - but its implementation is statically typed. I started thinking about making a bytecode implementation of Irken itself (rather than a Python-like language).

But I'm not sure it even makes sense to try to write a static VM in a static language, because you'd have to do the equivalent of 'void*' casting everywhere. (Of course, this is what you do when you write a VM in C, but it feels a much more natural thing to do in C 8^).

I was curious to see how OCaml dealt with this issue... and was quite surprised to find a VM written in C.

Monday, April 26, 2010

tak20 benchmark runs in the VM

My favorite benchmark now runs in the VM, since I now have all three types of procedure invocation implemented: tail calls, tail recursive calls, and normal invocation. The last is the most complex, because it actually involves building a 'stack frame' and saving/restoring registers:

(datatype vmcont
  (:nil)
  (:k (vmcont) (lenv) int (vector (object)))
  )

Take a look at the current VM implementation, in Irken.

So a 'VM continuation' consists of the pointer to the next frame, the lexical environment, the PC, and a vector of saved registers.

The 'INVOKE' cps insn is turned into two VM insns, 'CALL' and 'POP'.

Even without any serious tweaking, the tak20 benchmark seems to run about the same speed as in python. But I plan to make specialized versions of CALL and POP for different numbers of saved registers. This will eliminate loops in the insns and thus get rid of allocation for the insn closures.

It's very rewarding to finally bring all this work together. Putting together a functioning VM is the very tip of an iceberg I've spent 2+ years building.

I'm even thinking about documenting the Irken language. Yeah, I know, crazy thoughts.

Next step: start building the python-like language. This would involve rewriting the byte-code compiler in Irken. OR, I could continue the nearly trivial task of supporting Irken as a byte-coded language, and start rewriting everything in Irken. As we all know, only the most fashionable and respectable languages are written in themselves. It's tempting!

Thursday, April 22, 2010

dragon egg

If you've ever tried to even read the instructions for building llvm-gcc (much less try to build it), you understand why this is good news.

http://dragonegg.llvm.org/

Sounds like gcc has gained a plugin architecture, and the llvm folks are putting together a plugin, 'DragonEgg', that uses LLVM as the back end.

Why do I care about this?

1) Clang still does not support lexical functions, which are critically important to Irken's performance.
2) The version of llvm-gcc distributed with Apple's XCode uses an older version of llvm that implements the address-of-label extension with a hack so horrible I would be embarrassed to describe it to you.

first loop in the vm!

I've implemented enough instructions in the VM that I can now do a loop (a.k.a. a tail-recursive function call).

I still need to work on the VM insns, there's a lot of allocation taking place that shouldn't... also, there's range-checking on each and every vector access.. even including stuff like the register file. Given those caveats, it's only about 4X slower than a similar python loop. I'm pretty sure I should be able to make it faster than python, we'll see.

--- cps ---
   0 = new_env [] 1
   -   push_env [0] None
   1 = close [] 'loop_0'
       0 = lit [] 3
       1 = varref [] ((0, 0), False, {n_1})
       0 = primop [0, 1] ('%=',)
       0 = test [0] None
           0 = lit [] 1
               return [0] None
           0 = varref [] ((0, 0), False, {n_1})
           1 = lit [] 2
           0 = primop [0, 1] ('%-',)
               tr_call [0] (1, )
           return [0] None
   -   store_tuple [1, 0] (0, 1, 1)
   0 = new_env [] 1
   1 = lit [] 0
   0 = store_tuple [1, 0] (0, 1, 1)
   1 = varref [] ((0, 0), True, {loop_0})
       invoke_tail [1, 0]

Monday, April 5, 2010

stirrings of a vm

I'm finally beginning to think seriously about a VM. It's been a long road.

My approach, as usual, is maximally lazy. For now, I've pushed the parser to the side while I figure out what the byte code compiler will put out. That of course depends on what the VM will look like.

To begin, I'll simply re-use the Irken compiler, which already compiles a lambda-calculus based language into a register-based CPS instruction stream.

Here's what it looks like so far.
This bytecode is consumed by the VM, which works in two phases. First, it loads the bytecode into two vectors, one holding instructions, one holding data. (Note: the insn vector is typed as "vector((->undefined))"). Then it simply invokes the first insn. See vm/vm.scm for details.

The vector of insns allows us to build a 'threaded' VM: each insn is implemented as a no-argument function. The last thing each insn does is fetch and execute the next insn.

(define (insn-literal)
  (set! regs[vm-data[pcd]] vm-data[(+ pcd 1)])
  (set! pcd (+ pcd 2))
  ((next-insn)))

"((next-insn))" compiles to a bit of C that looks like this:

lenv = r0[2]; goto *r0[1];

This has the potential to be very fast.

My current stumbling block is where/how to straddle the divide between the typed and untyped universe. The VM is written in a strongly typed language, but I want it to implement an untyped language. In order to do that I need to be able to store arbitrary objects into the VM registers. I'm thinking of a couple of different approaches right now.

1) Add an 'object' type to the Irken compiler - this will act like a normal declared datatype, but when you pattern match or vcase it, the compiler will emit code to detect our run-time tags instead. I'll probably need some way of casting things to the object type.

or

2) *Remove* run-time tagging from Irken. This would make Irken even more like OCaml. The advantages should be obvious. The disadvantage would be I would have to implement some kind of stack map and teach the garbage collector how to use it. Also, Irken would become much more difficult to debug. But I could then implement the VM using fully declared and 'normal' datatypes without adding an extra layer of type tagging.

Friday, March 26, 2010

pattern matching: red-black trees

This is based on Okasaki's "Purely Functional Data Structures", Chapter 3.3.

(include "lib/core.scm")
(include "lib/random.scm")

(datatype T
  (:E)
  (:R (T 'a) 'a (T 'a))
  (:B (T 'a) 'a (T 'a))
  )

(define (T/insert root < k)

  (define lbalance
    (T:R (T:R a x b) y c) z d -> (T:R (T:B a x b) y (T:B c z d))
    (T:R a x (T:R b y c)) z d -> (T:R (T:B a x b) y (T:B c z d))
    a x b                     -> (T:B a x b)
    )

  (define rbalance
    a x (T:R (T:R b y c) z d) -> (T:R (T:B a x b) y (T:B c z d))
    a x (T:R b y (T:R c z d)) -> (T:R (T:B a x b) y (T:B c z d))
    a x b                     -> (T:B a x b)
    )

  (define (ins n)
    (match n with
      (T:E)
      -> (T:R (T:E) k (T:E))

      (T:R l k2 r)
      -> (cond ((< k k2)
                (T:R (ins l) k2 r))
               ((< k2 k)
                (T:R l k2 (ins r)))
               (else n))

      (T:B l k2 r)
      -> (cond ((< k k2)
                (lbalance (ins l) k2 r))
               ((< k2 k)
                (rbalance l k2 (ins r)))
               (else n))
      ))

  (let ((s (ins root)))
    (match s with
      (T:R l k r) -> (T:B l k r)
      (T:B _ _ _) -> s
      (T:E s)     -> (impossible)
      )))

(define (print-spaces n)
  (let loop ((n n))
    (cond ((> n 0)
           (print-string "  ")
           (loop (- n 1))))))

(define (print-item x d)
  (print-spaces d)
  (printn x))

(define T/print
  d (T:E)       -> #u
  d (T:R l x r) -> (begin (T/print (+ d 1) l) (print-item x d) (T/print (+ d 1) r))
  d (T:B l x r) -> (begin (T/print (+ d 1) l) (print-item x d) (T/print (+ d 1) r))
  )

(define (n-random n)
  (let loop ((n n)
             (t (T:E)))
    (if (= n 0)
        t
        (loop (- n 1) (T/insert t < (random))))))

(T/print 0 (n-random 20))

parsing with pattern matching

Parsing with pattern matching is pretty sweet.

Here's a test grammar:

# test a simple predicate language
expr: predicate | atom
list: expr ("," expr)*
predicate: NAME "(" list ")"
atom: NAME | NUMBER | STRING

And the pattern-matching code to parse it into an AST:

(datatype atom
  (:name string)
  (:number int)
  (:string string)
  )

(datatype expr
  (:pred string (list (expr)))
  (:atom (atom))
  )

(define p-atom
  (item:t 'NAME x)   -> (atom:name x)
  (item:t 'NUMBER x) -> (atom:number (string->int x))
  (item:t 'STRING x) -> (atom:string x)
  _ -> (error "p-atom")
  )

(define p-list3
  (list:cons (item:t 'comma _) 
  (list:cons x 
  (list:nil))) -> (p-expr x)
  _ -> (error "p-list3")
  )

(define p-list2
  (item:nt 'list_c_1_s1 (list:cons x y)) -> (list:cons (p-list3 y) (p-list2 x))
  (item:nt 'list_c_1_s1 (list:nil))      -> (list:nil)
  _ -> (error "p-list2")
  )

(define p-list
  (item:nt 'list (list:cons expr 
                 (list:cons rest 
                 (list:nil)))) -> (list:cons (p-expr expr) (p-list2 rest))
  _ -> (error "p-list")
  )

(define p-expr2
  (item:nt 'predicate
           (list:cons (item:t 'NAME name)
           (list:cons _ ;; lparen
           (list:cons args
           (list:cons _ ;; rparen
           (list:nil))))))
  -> (expr:pred name (p-list args))

  (item:nt 'atom (list:cons x (list:nil))) -> (expr:atom (p-atom x))
  y -> (error y)
  )

(define p-expr
  (item:nt 'expr (list:cons x (list:nil))) -> (p-expr2 x)
  _ -> (error "p-expr")
  )

The patterns are a little noisy due to my lack of a special syntax for lists. I'm still on the fence about whether that's a good idea.

Thursday, March 25, 2010

new release with pattern matching

I've checked in the pattern matching code. There are 10 or so new tests for it, but none of them test it especially heavily. I'll probably rewrite lib/frb.scm (the red-black tree code) using it as the next 'test'.

http://www.nightmare.com/rushing/irken/irken.100325.tar.gz

Tuesday, March 23, 2010

impressed with llvm, again

Testing out pattern matching against constants, I wrote the following silly function:

(define thing
  0 1 -> 2
  0 y -> 3
  1 0 -> 4
  1 x -> (+ x 5)
  y z -> (+ y z)
  )
(thing 10 20)

Which turned into a nice set of if statements, doing 4 tests in total. Not bad. But I was curious to see what kind of asm LLVM would generate for it.

_vm:
Leh_func_begin7:
        pushq   %rbp
Llabel13:
        movq    %rsp, %rbp
Llabel14:
        movq    _heap0(%rip), %rax
        movq    $772, (%rax)
        movq    $10, 8(%rax)
        movq    $10, 16(%rax)
        movq    $1, 24(%rax)
        movl    $61, %eax
        popq    %rbp
        ret
Leh_func_end7:

Huh. So that bit there at the very end, "movl $61, %eax". Yeah. That's the answer, "30". Even with all the tagging/untagging, registers, embedded if statements, etc... LLVM still optimized the entire program away to a constant. Nice.

Monday, March 22, 2010

pattern matching!

I managed to ignore the subject of pattern matching for a loooong time. I wanted to see how reasonable it was to build a language using 'case'-type statements. I viewed my ability to code up something relatively complex like red-black trees as proof that they're not absolutely needed.

However, now that I've seen that it's relatively easy to add (given the existing vcase support), I'm beginning to warm up. Look over these alternate versions of 'reverse-onto'.

The problem that forced me to finally look into it was the parser. In order to walk over complex trees of tokens - keyed by symbols - I was writing some very hairy 5-levels-deep vcase code. And I remembered how much easier this kind of task is in a language like Erlang. If you look through the code for ejabberd, for example, you might be confused hunting for the xml/xmpp parser, until you realize that it's embedded in the definition of the functions themselves, using pattern matching.

As you might be able to tell from the example given, the constructor syntax is still pretty wordy. There's no easy fix for this problem, which is caused by the requirements of type inference. Most ML-like languages require that constructors are globally unique. I find this abhorrent, so instead I'm requiring the name of the datatype to be present, e.g. "list:cons".

There may be ways to make this less wordy. It would be perfectly reasonable to define the list constructors as 'C' and 'N'. (for 'cons' and 'nil', respectively). A tree datatype could use 'N' and 'E' (for 'node' and 'empty'). This would still be pretty readable.

SML seems to use its module system to make it possible to use single-letter abbreviations like this.

One other area of inequality that really bugs me: the way lists are treated specially in ML dialects. It would be really nice for the user to be able to define their own extensions in a lisp-macro-like way. Something to think about.

Saturday, March 20, 2010

The Implementation of Functional Programming Languages

I'm working on adding pattern matching to Irken. Digging through the OCaml source, I found a reference to this little gem:

"The Implementation of Functional Programming Languages" (Simon Peyton Jones, published by Prentice Hall, 1987.)

The book is available for free here: http://research.microsoft.com/en-us/um/people/simonpj/papers/slpj-book-1987/

Chapters 4 and 5 cover exactly what I need for pattern matching - in fact, it appears that their algorithm transforms a set of patterns into a tree of 'case' statements where their 'case' construct exactly matches my current 'vcase'. Nice!

How is it that after so many years of trudging through this space, I never knew about this book?

Saturday, March 13, 2010

recent changes

I've restored the ability to gc in the middle of a function. This required coming up with a bit of a trick to copy registers in and out of gc:
```
void gc_regs_in (int n) {
 switch (n) {
 case 4: heap1[6] = r3;
 case 3: heap1[5] = r2;
 case 2: heap1[4] = r1;
 case 1: heap1[3] = r0;
}}
```

Literal data built at compile-time. Big win for programs with lots of pre-initialized data (like lexer and parser tables).

Verify the output of tests.

Filled out the type declaration syntax (type variables, recursive types, etc...) and unified it with the cexp syntax. (Well, to the extent that they use a single parser now rather than two different ones).

Friday, March 12, 2010

syntax for literals

In Scheme/Lisp, literal lists are expressed using the quote syntax: '(1 2 3), which is expanded by the reader to (QUOTE 1 2 3). In Lisp, quoted list structures are important, especially for macros (which I'm deliberately trying not to support in order to avoid becoming dependent on the lisp syntax).

I need a literal syntax for Irken, and until now I've used QUOTE, but it's causing me a lot of trouble now that I'm using algebraic datatypes.

To put this in terms that a C programmer can understand, this is a literal array of integers:

  int thing[] = {1, 2, 3, 4, 5};

Without a literal syntax, you'd have to do this:

  int * thing = malloc (sizeof(int) * 5);
  thing[0] = 1;
  ...

The 'list' datatype is declared like any other - it's not built in to the compiler (though I could certainly do that):

(datatype list
 (:nil)
 (:cons 'a (list 'a))
)

The compiler will translate '(1 2) into:
(QUOTE (list:cons 1 (list:cons 2 (list:nil)))).

But this leaves all the other datatypes with no convenient syntax - and worse - with no way to identify or express literals. This has turned into a big problem for the parser and lexer, where I really need to build the literals at compile-time (otherwise I get huge executables that do nothing but build data structures).

Right now I'm playing with this syntax:

(literal (tree:node (tree:leaf 1) (tree:node (tree:leaf 2) (tree:leaf 3))))

The grammar for 'constructed' literals is:

literal ::= (<constructor> <literal> ... ) | (vector <literal> ...) | immediate

The question is - should I retain QUOTE just for lists? I hate having two syntaxes for two very similar concepts. All the ML-like languages have a special syntax for lists, which is probably an indicator that it's important. On the other hand - the special syntax might encourage people to misuse the list datatype simply because it's more convenient - e.g., you might build a binary tree out of lists (which wastes a lot of heap) rather than declaring a proper datatype.

Thursday, March 4, 2010

Parser's Working

Ok, got the parser rewritten using the new datatypes.

If you're bored, and would like to know how small a lexer and parser engine can be, take a look at tests/t_parse.scm.

Note that both the lexer and the parser are bare-bones engines that are using tables generated by other tools written in Python. In this case, a DFA lexer written by me, and an LR(1) parser generator written by Jason Evans.

Wednesday, March 3, 2010

polymorphic vs normal variants

Just a quick summary of the pros and cons of each alternative.

Polymorphic Variants

Pros:

Outrageously polymorphic - use them any way you like, anywhere
No declarations needed.
The runtime can associate certain meanings with the tags on polymorphic variants. This makes them act more like lisp type tags. For example, the runtime can identify a list and correctly print it out, rather than {u0 1 {u0 2 {u0 3 {u1}}}}. [This is probably addressed in ML by treating lists specially, rather than having the 'basis' declare the list type.]

Cons:

Each variant consumes precious tag space. Since type tags are stored in a byte, and in particular a byte with the lower two bits zeroed, there are only about 59 such tags left.
Lots of work for the type solver, because of the overly detailed types...
...thus the compiler runs slower

'Normal' Variants.

Pros:

Cleanly declared datatypes. If we're really planning on doing systems programming with this, then declared datatypes are probably non-negotiable.
Tag values are consumed only within a datatype - so rather than having 59 total different types, we can have an unlimited number of types, each of which can have up to 59 different variants. (To clarify this with an example, the list datatype uses the tags 0 and 1, because it has two variants, 'nil', and 'cons'). A tree datatype will also use tags starting at 0, since the type system guarantees that a tree will never be confused with a list at runtime.

Cons:

The runtime has no idea what it's printing out, so you can get impenetrable output like this: {u0 16 {u0 1 0} {u0 15 {u0 1 0} {u0 14 {u0 1 0}...

Datatype declarations are going to scare non-ML people off, which is a shame. Here's what the syntax currently looks like:

(datatype item
  (:nt symbol (list (item 'a)))
  (:t symbol)
  )

(datatype stack
  (:empty)
  (:elem (item 'a) int (stack 'a))
  )

Sadly, I've chosen the same solution as OCaml - to leave both kinds in. If I could collect up the nerve to remove a lot of hard-earned code, I'd probably remove polymorphic variants.

Tuesday, March 2, 2010

status update

Finally posted a new tarball today. Lots of changes over the past few weeks:

'datatype' syntax and 'normal' variants (though I've left in polymorphic variants as a 'stealth' feature)
more tests, and run_tests.py now verifies their output
'let_reg' will store bindings into registers when certain conditions are met - this really cuts down on pointless allocation for local variables.

I feel like this is getting much closer to 'useful' status, and will hopefully begin work on a VM within the next couple of weeks.

Here's the tarball.

Friday, February 12, 2010

Temporarily Giving up on Polymorphic Variants

Polymorphic Variants finally reached the point where they were causing more problems than they were worth. So I've temporarily put them aside... leaving support in the compiler, but I'm adding normal datatype/sum/union declarations back in.

The types are too expressive (see previous post)
Since the types are so complex, the solver gets hit with constraints leading to tens of thousands of type variables. This is mostly triggered by large initialization expressions like parser tables. Waiting 20 seconds to type a 40K file is not cool.
I need to get back to work, i.e. I need to make some progress.

So after plugging away for a week or so, I've now run into a new problem - the value restriction. Something I've put off dealing with for a long time. Turns out the problem is classically triggered by type constructors for things like empty lists. For example, this types without any problems:

(datatype list
  (nil)
  (cons 'a list)
)

(let ((l0 (list:nil)))
  (set! l0 (list:cons 34 l0))
  (list:cons #f l0)
)

[Note that the last line glues together a "list(bool)" onto a "list(int)"]

The solution is to restrict the polymorphism of assigned variables, hence the Value Restriction.

-Sam

Friday, February 5, 2010

restraining polymorphic variants

Since I started playing with polymorphic variants, I've repeatedly stumbled into a strange problem with them. They're too expressive!

Polymorphic variants are the flip side of polymorphic records. This is a major feature of my type system that allows you to use any label anywhere, in any way, without having to declare it first. I really like this ability, especially in the context of Irken, because it has a very Pythonic, dynamic-language feel to it.

As an example, here's some code that creates a lisp-like list:


(:cons 1 (:cons 2 (:cons 3 (:nil)))

That's it. Symbols starting with colons are constructors. No datatype declaration. Completely type safe. You just use constructors however you like. OCaml has the same feature:


# `Cons (1, `Cons (2, `Cons 3, `Nil));;
- : [> `Cons of int * [> `Cons of int * [> `Cons of int ] * [> `Nil ] ] ] =
`Cons (1, `Cons (2, `Cons 3, `Nil))

But if you look at the type that OCaml assigns to the result you see the seed of the problem. Watch this:


# `Cons (1, `Cons ("hello", `Nil));;
- : [> `Cons of int * [> `Cons of string * [> `Nil ] ] ] =
`Cons (1, `Cons ("hello", `Nil))

See? You could call that the datatype of lists that start with an int followed by a string.

With a complex data structure like a red-black tree, you get types like this:


rsum(rlabel('purple', pre(product(rsum(rlabel('empty', pre(kb), rlabel('purple',
 pre(jw), rlabel('red', pre(gz), rdefault(abs()))))), rsum(rlabel('empty', pre(nq),
 rlabel('purple', pre(nl), rlabel('red', pre(ko), rdefault(abs()))))), jc, jd)), 
rlabel('empty', pre(product()), rlabel('red', pre(product(rsum(rlabel('purple', 
pre(product(hj, hk, hl, hm)), bgc)), rsum(rlabel('purple', pre(product(hn, ho, hp,
 hq)), bgp)), hh, hi)), rdefault(abs())))))

What the hell is that?? Well, it's a type built during the compilation of one of the tree-balancing functions; a reflection of the matching structure used to examine the tree a few levels deep. Think of it as a template that sits on top of a red-black tree.

Here, the compiler is building incredibly detailed, complex types and checking them for me. What I would prefer is for it to automatically recognize recursive datatypes when I use them - without have to declare them. Or, I would like the compiler to keep variants monomorphic within a single type.

This would involve a loss of some expressive power. But how often would you want to use variants polymorphically within the same object? And if you needed that, why wouldn't you use a different label?