Friday, March 26, 2010

pattern matching: red-black trees

This is based on Okasaki's "Purely Functional Data Structures", Chapter 3.3.

(include "lib/core.scm")
(include "lib/random.scm")

(datatype T
  (:E)
  (:R (T 'a) 'a (T 'a))
  (:B (T 'a) 'a (T 'a))
  )

(define (T/insert root < k)

  (define lbalance
    (T:R (T:R a x b) y c) z d -> (T:R (T:B a x b) y (T:B c z d))
    (T:R a x (T:R b y c)) z d -> (T:R (T:B a x b) y (T:B c z d))
    a x b                     -> (T:B a x b)
    )

  (define rbalance
    a x (T:R (T:R b y c) z d) -> (T:R (T:B a x b) y (T:B c z d))
    a x (T:R b y (T:R c z d)) -> (T:R (T:B a x b) y (T:B c z d))
    a x b                     -> (T:B a x b)
    )

  (define (ins n)
    (match n with
      (T:E)
      -> (T:R (T:E) k (T:E))

      (T:R l k2 r)
      -> (cond ((< k k2)
                (T:R (ins l) k2 r))
               ((< k2 k)
                (T:R l k2 (ins r)))
               (else n))

      (T:B l k2 r)
      -> (cond ((< k k2)
                (lbalance (ins l) k2 r))
               ((< k2 k)
                (rbalance l k2 (ins r)))
               (else n))
      ))

  (let ((s (ins root)))
    (match s with
      (T:R l k r) -> (T:B l k r)
      (T:B _ _ _) -> s
      (T:E s)     -> (impossible)
      )))

(define (print-spaces n)
  (let loop ((n n))
    (cond ((> n 0)
           (print-string "  ")
           (loop (- n 1))))))

(define (print-item x d)
  (print-spaces d)
  (printn x))

(define T/print
  d (T:E)       -> #u
  d (T:R l x r) -> (begin (T/print (+ d 1) l) (print-item x d) (T/print (+ d 1) r))
  d (T:B l x r) -> (begin (T/print (+ d 1) l) (print-item x d) (T/print (+ d 1) r))
  )

(define (n-random n)
  (let loop ((n n)
             (t (T:E)))
    (if (= n 0)
        t
        (loop (- n 1) (T/insert t < (random))))))

(T/print 0 (n-random 20))

parsing with pattern matching

Parsing with pattern matching is pretty sweet.

Here's a test grammar:

# test a simple predicate language
expr: predicate | atom
list: expr ("," expr)*
predicate: NAME "(" list ")"
atom: NAME | NUMBER | STRING


And the pattern-matching code to parse it into an AST:

(datatype atom
  (:name string)
  (:number int)
  (:string string)
  )

(datatype expr
  (:pred string (list (expr)))
  (:atom (atom))
  )

(define p-atom
  (item:t 'NAME x)   -> (atom:name x)
  (item:t 'NUMBER x) -> (atom:number (string->int x))
  (item:t 'STRING x) -> (atom:string x)
  _ -> (error "p-atom")
  )

(define p-list3
  (list:cons (item:t 'comma _) 
  (list:cons x 
  (list:nil))) -> (p-expr x)
  _ -> (error "p-list3")
  )

(define p-list2
  (item:nt 'list_c_1_s1 (list:cons x y)) -> (list:cons (p-list3 y) (p-list2 x))
  (item:nt 'list_c_1_s1 (list:nil))      -> (list:nil)
  _ -> (error "p-list2")
  )

(define p-list
  (item:nt 'list (list:cons expr 
                 (list:cons rest 
                 (list:nil)))) -> (list:cons (p-expr expr) (p-list2 rest))
  _ -> (error "p-list")
  )

(define p-expr2
  (item:nt 'predicate
           (list:cons (item:t 'NAME name)
           (list:cons _ ;; lparen
           (list:cons args
           (list:cons _ ;; rparen
           (list:nil))))))
  -> (expr:pred name (p-list args))

  (item:nt 'atom (list:cons x (list:nil))) -> (expr:atom (p-atom x))
  y -> (error y)
  )

(define p-expr
  (item:nt 'expr (list:cons x (list:nil))) -> (p-expr2 x)
  _ -> (error "p-expr")
  )

The patterns are a little noisy due to my lack of a special syntax for lists. I'm still on the fence about whether that's a good idea.

Thursday, March 25, 2010

new release with pattern matching

I've checked in the pattern matching code. There are 10 or so new tests for it, but none of them test it especially heavily. I'll probably rewrite lib/frb.scm (the red-black tree code) using it as the next 'test'.

http://www.nightmare.com/rushing/irken/irken.100325.tar.gz

Tuesday, March 23, 2010

impressed with llvm, again

Testing out pattern matching against constants, I wrote the following silly function:
(define thing
  0 1 -> 2
  0 y -> 3
  1 0 -> 4
  1 x -> (+ x 5)
  y z -> (+ y z)
  )
(thing 10 20)
Which turned into a nice set of if statements, doing 4 tests in total. Not bad. But I was curious to see what kind of asm LLVM would generate for it.
_vm:
Leh_func_begin7:
        pushq   %rbp
Llabel13:
        movq    %rsp, %rbp
Llabel14:
        movq    _heap0(%rip), %rax
        movq    $772, (%rax)
        movq    $10, 8(%rax)
        movq    $10, 16(%rax)
        movq    $1, 24(%rax)
        movl    $61, %eax
        popq    %rbp
        ret
Leh_func_end7:

Huh. So that bit there at the very end, "movl $61, %eax". Yeah. That's the answer, "30". Even with all the tagging/untagging, registers, embedded if statements, etc... LLVM still optimized the entire program away to a constant. Nice.

Monday, March 22, 2010

pattern matching!

I managed to ignore the subject of pattern matching for a loooong time. I wanted to see how reasonable it was to build a language using 'case'-type statements. I viewed my ability to code up something relatively complex like red-black trees as proof that they're not absolutely needed.

However, now that I've seen that it's relatively easy to add (given the existing vcase support), I'm beginning to warm up. Look over these alternate versions of 'reverse-onto'.

The problem that forced me to finally look into it was the parser. In order to walk over complex trees of tokens - keyed by symbols - I was writing some very hairy 5-levels-deep vcase code. And I remembered how much easier this kind of task is in a language like Erlang. If you look through the code for ejabberd, for example, you might be confused hunting for the xml/xmpp parser, until you realize that it's embedded in the definition of the functions themselves, using pattern matching.

As you might be able to tell from the example given, the constructor syntax is still pretty wordy. There's no easy fix for this problem, which is caused by the requirements of type inference. Most ML-like languages require that constructors are globally unique. I find this abhorrent, so instead I'm requiring the name of the datatype to be present, e.g. "list:cons".

There may be ways to make this less wordy. It would be perfectly reasonable to define the list constructors as 'C' and 'N'. (for 'cons' and 'nil', respectively). A tree datatype could use 'N' and 'E' (for 'node' and 'empty'). This would still be pretty readable.

SML seems to use its module system to make it possible to use single-letter abbreviations like this.

One other area of inequality that really bugs me: the way lists are treated specially in ML dialects. It would be really nice for the user to be able to define their own extensions in a lisp-macro-like way. Something to think about.

Saturday, March 20, 2010

The Implementation of Functional Programming Languages

I'm working on adding pattern matching to Irken. Digging through the OCaml source, I found a reference to this little gem:

"The Implementation of Functional Programming Languages
" (Simon Peyton Jones, published by Prentice Hall, 1987.)

The book is available for free here: http://research.microsoft.com/en-us/um/people/simonpj/papers/slpj-book-1987/

Chapters 4 and 5 cover exactly what I need for pattern matching - in fact, it appears that their algorithm transforms a set of patterns into a tree of 'case' statements where their 'case' construct exactly matches my current 'vcase'. Nice!

How is it that after so many years of trudging through this space, I never knew about this book?

Saturday, March 13, 2010

recent changes

  1. I've restored the ability to gc in the middle of a function. This required coming up with a bit of a trick to copy registers in and out of gc:
    void gc_regs_in (int n) {
    switch (n) {
    case 4: heap1[6] = r3;
    case 3: heap1[5] = r2;
    case 2: heap1[4] = r1;
    case 1: heap1[3] = r0;
    }}


  2. Literal data built at compile-time. Big win for programs with lots of pre-initialized data (like lexer and parser tables).

  3. Register bindings for variables in leaf positions. This will cut down on lots of heap allocation

  4. Verify the output of tests.

  5. Filled out the type declaration syntax (type variables, recursive types, etc...) and unified it with the cexp syntax. (Well, to the extent that they use a single parser now rather than two different ones).

Friday, March 12, 2010

syntax for literals

In Scheme/Lisp, literal lists are expressed using the quote syntax: '(1 2 3), which is expanded by the reader to (QUOTE 1 2 3). In Lisp, quoted list structures are important, especially for macros (which I'm deliberately trying not to support in order to avoid becoming dependent on the lisp syntax).

I need a literal syntax for Irken, and until now I've used QUOTE, but it's causing me a lot of trouble now that I'm using algebraic datatypes.

To put this in terms that a C programmer can understand, this is a literal array of integers:

  int thing[] = {1, 2, 3, 4, 5};


Without a literal syntax, you'd have to do this:

  int * thing = malloc (sizeof(int) * 5);
thing[0] = 1;
...

The 'list' datatype is declared like any other - it's not built in to the compiler (though I could certainly do that):
(datatype list
(:nil)
(:cons 'a (list 'a))
)
The compiler will translate '(1 2) into:
(QUOTE (list:cons 1 (list:cons 2 (list:nil)))).

But this leaves all the other datatypes with no convenient syntax - and worse - with no way to identify or express literals. This has turned into a big problem for the parser and lexer, where I really need to build the literals at compile-time (otherwise I get huge executables that do nothing but build data structures).

Right now I'm playing with this syntax:

(literal (tree:node (tree:leaf 1) (tree:node (tree:leaf 2) (tree:leaf 3))))
The grammar for 'constructed' literals is:

literal ::= (<constructor> <literal> ... ) | (vector <literal> ...) | immediate

The question is - should I retain QUOTE just for lists? I hate having two syntaxes for two very similar concepts. All the ML-like languages have a special syntax for lists, which is probably an indicator that it's important. On the other hand - the special syntax might encourage people to misuse the list datatype simply because it's more convenient - e.g., you might build a binary tree out of lists (which wastes a lot of heap) rather than declaring a proper datatype.

Thursday, March 4, 2010

Parser's Working

Ok, got the parser rewritten using the new datatypes.

If you're bored, and would like to know how small a lexer and parser engine can be, take a look at tests/t_parse.scm.

Note that both the lexer and the parser are bare-bones engines that are using tables generated by other tools written in Python. In this case, a DFA lexer written by me, and an LR(1) parser generator written by Jason Evans.

Wednesday, March 3, 2010

polymorphic vs normal variants

Just a quick summary of the pros and cons of each alternative.

Polymorphic Variants

Pros:
  1. Outrageously polymorphic - use them any way you like, anywhere
  2. No declarations needed.
  3. The runtime can associate certain meanings with the tags on polymorphic variants. This makes them act more like lisp type tags. For example, the runtime can identify a list and correctly print it out, rather than {u0 1 {u0 2 {u0 3 {u1}}}}. [This is probably addressed in ML by treating lists specially, rather than having the 'basis' declare the list type.]
Cons:
  1. Each variant consumes precious tag space. Since type tags are stored in a byte, and in particular a byte with the lower two bits zeroed, there are only about 59 such tags left.
  2. Lots of work for the type solver, because of the overly detailed types...
  3. ...thus the compiler runs slower
'Normal' Variants.

Pros:
  1. Cleanly declared datatypes. If we're really planning on doing systems programming with this, then declared datatypes are probably non-negotiable.
  2. Tag values are consumed only within a datatype - so rather than having 59 total different types, we can have an unlimited number of types, each of which can have up to 59 different variants. (To clarify this with an example, the list datatype uses the tags 0 and 1, because it has two variants, 'nil', and 'cons'). A tree datatype will also use tags starting at 0, since the type system guarantees that a tree will never be confused with a list at runtime.
Cons:
  1. The runtime has no idea what it's printing out, so you can get impenetrable output like this: {u0 16 {u0 1 0} {u0 15 {u0 1 0} {u0 14 {u0 1 0}...
  2. Datatype declarations are going to scare non-ML people off, which is a shame. Here's what the syntax currently looks like:
    (datatype item
    (:nt symbol (list (item 'a)))
    (:t symbol)
    )

    (datatype stack
    (:empty)
    (:elem (item 'a) int (stack 'a))
    )
Sadly, I've chosen the same solution as OCaml - to leave both kinds in. If I could collect up the nerve to remove a lot of hard-earned code, I'd probably remove polymorphic variants.

Tuesday, March 2, 2010

status update

Finally posted a new tarball today. Lots of changes over the past few weeks:

  1. 'datatype' syntax and 'normal' variants (though I've left in polymorphic variants as a 'stealth' feature)
  2. more tests, and run_tests.py now verifies their output
  3. 'let_reg' will store bindings into registers when certain conditions are met - this really cuts down on pointless allocation for local variables.
I feel like this is getting much closer to 'useful' status, and will hopefully begin work on a VM within the next couple of weeks.

Here's the tarball.