11 April 2008

WPF Pointers

I recently went to a session at Microsoft Belgium for ISVs, where Tim Sneath gave a packed session about WPF, answering some questions but mainly giving a wealth of pointers to the WPF-related information out there. Enjoy!

WPF/Silverlight reference applications

Silverlight in financials

Family.Show

NBC Olympics in Silverlight (Tim showed a demo, but it was local on his machine - I don't think it is publicly available at the time of this writing)

Hard Rock memorabilia (Tim also showed an application for actually making the Deep Zoom effect that you can see on the site; I guess it's included with Silverlight 2 somehow?)

WPF Performance

WPF performance blog

WPF Performance white paper

Optimizing WPF performance on MSDN

WPF Performance suite download

WPF controls and dev tools

Kevin's bag-o-tricks

Kaxaml (XamlPad, but better)

Blendables (the only link in this list that is not free, but there are some nice samples on the blog, so it's worth a visit even if you're not spending)

Snoop

Mole

Neat Trick

Did you know that an .xps file is actually just a zip file with various files in it, and that one of those files contains a XAML dialect? This is a really handy way to export clip art from Office. Tim just took a random piece of Office clip art, printed it to XPS, renamed the file extension to .zip, unpacked it and pasted part of one of the files into XamlPad. One or two adjustments for fonts later, and the clip art displayed in XamlPad. Nice to keep in mind!
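
For the curious, here is a minimal F# sketch of that trick. It uses System.IO.Packaging from WindowsBase.dll (an .xps file is an OPC package, so you don't even need the rename-to-.zip step when doing it from code); the file path and the content-type check are just illustrative assumptions.

open System.IO
open System.IO.Packaging   // WindowsBase.dll

// Open the .xps file as a package and print the parts containing the
// fixed-page markup (the XAML dialect), ready to paste into XamlPad.
let dumpFixedPages (xpsPath: string) =
    use package = Package.Open(xpsPath, FileMode.Open, FileAccess.Read)
    for part in package.GetParts() do
        if part.ContentType.Contains("fixedpage") then
            use reader = new StreamReader(part.GetStream())
            printfn "=== %s ===" (part.Uri.ToString())
            printfn "%s" (reader.ReadToEnd())

dumpFixedPages @"C:\temp\clipart.xps"   // hypothetical path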


09 April 2008

Parsing Dot with F#: Part 2 - The lexer

Now that the preliminaries are over, we can get on with the first part of the fslex/fsyacc tandem: the lexer. The lexer's responsibility is to break the long input text up into tokens that are more manageable for the parser, which in turn will produce the actual abstract syntax tree. So, instead of having to deal with the string "digraph MyGraph {  0 [label="a string_{}"] }", the lexer turns it into a list of tokens, where each token can be attributed with some information. For this example, the output of our lexer is the following list of tokens: DIGRAPH ID["MyGraph"] LACC ID["0"] LPAREN ID["label"] ASSIGN ID["a string_{}"] RPAREN RACC. As you can see, the only token with an attribute is ID, which carries all kinds of string names. Notice also that an ID string can contain "control characters" like { and }. The lexer's main job is to filter those kinds of things out, so the parser's job becomes much easier.
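
For reference, the token type that makes this possible is just an ordinary discriminated union. It is generated by fsyacc from the parser definition (more on that in the next post), but a sketch of roughly what it looks like for our tokens is:

// A sketch of the token type; in reality fsyacc generates it from the
// %token declarations in the parser definition of the next post.
type token =
    | DIGRAPH | GRAPH | NODE | EDGE      // keywords
    | DIEDGE | UNEDGE                    // -> and --
    | LACC | RACC | LPAREN | RPAREN      // { } [ ]
    | SEMI | COMMA | ASSIGN              // ; , =
    | ID of string                       // names, numbers and quoted strings
    | EOF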

You can tell fslex how to tokenize a string by providing it with a list of rules. Each rule is a series of regular expression/action pairs, each producing an (attributed) token. These rules are specified in a special format, which the fslex tool pre-processes into an actual F# code file. I'm not going to explain that in detail; there are other sources for that.

Our lexer file, parser.fsl, starts with some preliminaries:

{
open System
open DotParser
open Lexing
}

let digit = ['0'-'9']
let whitespace = [' ' '\t' ]
let newline = ('\n' | '\r' '\n')
let letter = [ 'a'-'z' 'A'-'Z' '_']

Notice here that, besides the System and Lexing namespaces, the DotParser namespace is opened. That's because the type that defines the tokens (actually a regular discriminated union type) is defined in the fsyacc file; we'll come back to that in the next post. Then some basic regular expressions, useful for almost any lexer, are defined. If you're at all familiar with regular expressions, this stuff should be easy. Notice that these are just regular let bindings in F#: fslex translates this file to actual F# code, so you're free to use F# constructs as you see fit.

Our actual lexer consists of two rules. This is the first:

rule token = parse
| whitespace     { token lexbuf }
| ('\r' | '\t')  { token lexbuf }
| "digraph"      { DIGRAPH }
| "node"         { NODE }
| "edge"         { EDGE }
| "graph"        { GRAPH }
| "->"           { DIEDGE }
| "--"           { UNEDGE }
| "\""           { ID ( (string lexbuf.StartPos "" lexbuf) ) }
| '{'            { LACC }
| '}'            { RACC }
| '['            { LPAREN }
| ']'            { RPAREN }
| ';'            { SEMI }
| ','            { COMMA }
| '='            { ASSIGN }
| letter(letter|digit)*                { ID (lexeme lexbuf) }
| ['-']?('.'digit+ | digit+('.'digit*)? ) { ID (lexeme lexbuf) }
| newline        { lexbuf.EndPos <- lexbuf.EndPos.NextLine; token lexbuf }
| eof            { EOF }

On the left, written like a pattern match, are the regular expressions that will be matched against the input string; on the right are the tokens that are returned. As you can see in the line for lexing IDs, you can access the current lexeme, i.e. the part of the input string currently being lexed, by calling lexeme lexbuf. Skipping the current lexeme is achieved simply by calling token lexbuf again. It's also useful to keep track of the line number for debugging later; that's what the action in the newline pattern does.

One gotcha here is that the order of these regular expressions matters, not only for lexing (a line higher up takes precedence over a lower line), but also for type inference. The above rule needed no type annotations, but try placing the newline rule at the top: after pre-processing, F# will complain during compilation that it needs some type annotations. It's not a big deal, but where the order doesn't matter for lexing, an order that avoids type annotations is nicer and more flexible to change later.

One line in the above sample may puzzle you:

| "\""            { ID ( (string lexbuf.StartPos "" lexbuf) ) }

That's because it calls a second rule, called string, which lexes a string, i.e. something between double quotes ("). We switch to this second rule when we encounter an opening double quote in the input. Multiple rules like this are useful when the input contains different contexts in which different lexing rules apply. For dot, a string can contain characters such as brackets, arrows and semicolons that have a particular meaning elsewhere in the dot file. But between double quotes the lexer is in another context, where these characters lose their special meaning and should simply be added to a string that is part of the name of some entity. So once we encounter a " in the main rule, we start lexing with the second rule; once the second rule finds the closing ", we return the string lexed so far and continue in the main rule:

and string pos s = parse
| "\\"    { string pos s lexbuf }
| "\r"    { string pos s lexbuf }
| "\""    { s }
| "\n"    { lexbuf.EndPos <- lexbuf.EndPos.NextLine;
            string pos s lexbuf }
| eof     { failwithf "end of file in string started at or near %A" pos }
| _       { string pos (s + (Lexing.lexeme lexbuf)) lexbuf }

As you can see, the string function takes the position pos of the start of the string (for debugging) and accumulates the string in its parameter s. If the third pattern matches, the rule is finished: we've encountered the closing ". The last pattern matches anything not matched by the previous lines, so we add it to the s accumulator.

The other patterns, the first, second and fourth, match control characters that are simply skipped; in the case of a newline we also update the line number. I did this because graphviz has the annoying habit of inserting a newline after a certain number of characters on one line. Presumably this is for readability, but why anyone would want to read dot files is beyond me. Anyway, Graphviz inserts a \ followed by \r in such cases, and this is what the first two rules filter out. This saved quite some pain afterwards when interpreting IDs: "230,345" is a much easier string to interpret than "230,3\\r45", especially when you don't know if, when or where graphviz will insert a line break. Yes, it actually does this in the middle of numbers.

By the way, another typical example where multiple rules are useful is parsing comments; there too the normal lexing rules do not apply. In fact, I adapted the above string function from a similar function designed to parse comments in the excellent Expert F# book.
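
As an illustration (this is a sketch only, not part of my dot lexer), a rule for skipping /* ... */ comments could look much like the string rule, and would be triggered from the token rule in the same way the string rule is:

(* Sketch: hook it into the main rule with something like
   | "/*"   { comment lexbuf; token lexbuf }                  *)
and comment = parse
| "*/"    { () }
| newline { lexbuf.EndPos <- lexbuf.EndPos.NextLine; comment lexbuf }
| eof     { failwith "end of file in comment" }
| _       { comment lexbuf }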

That's it for the lexer. Paste all the code from this post together and you should have a complete .fsl file. It's not compilable yet, I'm afraid: you need the token definitions, which are produced by the parser (remember the open DotParser?).
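
Once everything does compile (that is, once the parser from the next post provides the token type), you can exercise the lexer on its own with something like the sketch below. The module name DotLexer and the use of Lexing.from_string are assumptions on my part; adjust them to whatever your fslex output and F# version provide.

// Sketch: tokenize a small dot fragment and print the resulting tokens.
let tokenize (input: string) =
    let lexbuf = Lexing.from_string input
    let rec loop acc =
        match DotLexer.token lexbuf with
        | DotParser.EOF -> List.rev (DotParser.EOF :: acc)
        | tok -> loop (tok :: acc)
    loop []

tokenize "digraph MyGraph { 0 [label=\"a string\"] }"
|> List.iter (fun tok -> printfn "%A" tok)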

You've made it halfway! Next post: the parser.


02 April 2008

Parsing Dot with F#: Part 1

A while back I decided to write a parser for dot, the language used by Graphviz. This is my first real project both in F# and in parsing. I learned the basics from the excellent Expert F# book.

I'll try to explain my solution, using fslex and fsyacc, in the order that I tackled the problem. There are a few more basic examples out there, explaining what parsing is, what a lexer and a parser are, and so on. However, it seems the examples given are always small, typically parsing some expressions. Graphviz's dot definitely has more of a real-world flavor, and I'll present it as a real-world example of using fslex and fsyacc, without explaining much about those tools per se.

This is how I see this mini-series play out:

  • Simplified dot grammar and abstract syntax tree
  • The lexer
  • The parser
  • Putting it all together

This post is part 1.

Graphviz

Graphviz is a command line tool that takes a description of a graph as input, and outputs a description (or an image) of a layout of the graph. For example, as input you can give:

digraph G {
0 [label="Type1<>", shape=box];
1 [label="Type1<Type2>", shape=box];
2 [label="T", shape=box];
3 [label="Type2", shape=box];
4 [label="Type2[], Type2*, Type2&", shape=box];
5 [label="#Type2", shape=box];
6 [label="Type2.Type3", shape=box];
0 -> 1 [ label="MakeGenericType(Type2)"];
0 -> 2 [ label="GetGenericArguments()"];
1 -> 0 [ label="GetGenericTypeDefinition()"];
1 -> 3 [ label="GetGenericArguments()"];
2 -> 0 [ label="DeclaringType"];
3 -> 4 [ label="MakeArrayType(), MakePointerType(), MakeByRefType()"];
3 -> 6 [ label="GetNestedType(Type3.Name)"];
4 -> 3 [ label="GetElementType()"];
5 -> 3 [ label="BaseType"];
6 -> 3 [ label="DeclaringType"];
}

Basically, it's just a sequence of nodes and edges annotated with attributes. In the above example only label and shape are used, but dot supports many, many more. Given such a file, Graphviz outputs:

digraph G {
    node [label="\N"];
    graph [bb="0,0,1028,328"];
    0 [label="Type1<>", shape=box, pos="502,302", width="1.03", height="0.50"];
    1 [label="Type1<Type2>", shape=box, pos="380,210", width="1.61", height="0.50"];
    2 [label=T, shape=box, pos="625,210", width="0.75", height="0.50"];
    3 [label=Type2, shape=box, pos="568,118", width="0.78", height="0.50"];
    4 [label="Type2[], Type2*, Type2&", shape=box, pos="249,26", width="2.61", height="0.50"];
    5 [label="#Type2", shape=box, pos="703,210", width="0.92", height="0.50"];
    6 [label="Type2.Type3", shape=box, pos="784,26", width="1.42", height="0.50"];
    0 -> 1 [label="MakeGenericType(Type2)", pos="e,322,216 465,301 381,298 181,288 161,266 156,259 156,252 161,246 170,235 252,223 312,217", lp="263,256"];
    0 -> 2 [label="GetGenericArguments()", pos="e,623,228 539,298 561,293 589,284 607,266 615,258 619,247 621,237", lp="708,256"];
    1 -> 0 [label="GetGenericTypeDefinition()", pos="s,465,293 455,290 430,284 403,275 394,266 385,256 381,240 380,228", lp="504,256"];
    1 -> 3 [label="GetGenericArguments()", pos="e,540,121 394,192 405,179 420,163 437,154 467,137 504,128 531,123", lp="528,164"];
    2 -> 0 [label=DeclaringType, pos="s,539,300 548,300 628,295 785,284 802,266 807,259 807,252 802,246 781,222 691,237 661,228 658,227 655,226 652,225", lp="863,256"];
    3 -> 4 [label="MakeArrayType(), MakePointerType(), MakeByRefType()", pos="e,155,36 540,117 432,114 50,100 32,82 26,75 27,68 32,62 40,53 93,44 145,37", lp="251,72"];
    3 -> 6 [label="GetNestedType(Type3.Name)", pos="e,733,40 596,101 616,89 645,73 671,62 688,54 707,48 725,43", lp="781,72"];
    4 -> 3 [label="GetElementType()", pos="s,540,110 532,107 515,100 495,92 480,82 470,75 472,67 461,62 441,51 390,42 343,36", lp="557,72"];
    5 -> 3 [label=BaseType, pos="e,595,136 676,192 655,178 625,157 602,141", lp="692,164"];
    6 -> 3 [label=DeclaringType, pos="s,596,117 605,117 687,113 875,103 894,82 915,58 873,43 835,34", lp="956,72"];
}

The basic structure of the file is the same: a sequence of nodes and edges, but Graphviz has added position, height, width and other layout info. This is actually the file that was used to draw the graph in my previous post.

The dot grammar

When parsing using fslex and fsyacc, the first thing you should find or make is a grammar of the thing you're trying to parse. Everything else sort of flows from there. Luckily, the complete dot grammar can be found here. I thought it was a bit complicated for my purposes, so I simplified the grammar a bit:

(in the following, terminals are shown in bold. Literal characters are given in single quotes. Parentheses ( and ) indicate grouping when needed. Square brackets [ and ] enclose optional items. Vertical bars | separate alternatives.)

graph    :    digraph [ ID ] '{' stmt_list '}'
stmt_list:    [ stmt [ ';' ] [ stmt_list ] ]
stmt     :    node_stmt
         |    edge_stmt
         |    attr_stmt /*defines a default attribute*/
attr_stmt:    (graph | node | edge) attr_list
attr_list:    '[' attr  [ ',' ] [ attr_list ] ']' 
attr     :    ID '=' ID
edge_stmt:    node_id -> node_id [ attr_list ]
node_stmt:    node_id [ attr_list ]
node_id  :    INT

If you compare with the original dot grammar, I made the following simplifications:

  • No node ports
  • No sub-graphs
  • No short definition of multiple edges (a -> b -> c)
  • No HTML IDs
  • Only digraphs

Basically this grammar says that a graph is "digraph name { a bunch of node, edge or default statements }". We've already seen node and edge statements; a default statement just sets an attribute on all the nodes or edges that follow it. It is overridden by an attribute of the same name on a node or edge itself, or by a new default statement.

The abstract syntax tree

Based on that grammar, I came up with the following abstract syntax tree using F# discriminated unions:

#light

open List
open System
open System.Collections.Generic

type Attributes = Dictionary<string,string>

type Element = 
    | Node of string * Attributes
    | Edge of string * string * Attributes
    | GraphAttributeList of Attributes
    | NodeAttributeList of Attributes
    | EdgeAttributeList of Attributes

type Graph = Graph of List<Element>

It's easiest to read this from bottom to top (unfortunately it needs to be defined the other way round, otherwise F# has difficulty parsing it). A graph is a list of elements. An element can be a node, an edge, or a default attribute list for the graph, the subsequent nodes or the subsequent edges. Each of these elements can have a number of attributes, which are simply represented as a Dictionary. The abstract syntax tree was fairly straightforward to build from the grammar; I expect the same for any well-written grammar.
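
To make the shape of the AST concrete, here is a sketch of how a tiny graph, say digraph G { 0 [label="a"]; 0 -> 1 }, would be represented as a value of these types (the attrs helper and the example value are purely for illustration):

// A small helper, for illustration only, to build an attribute dictionary.
let attrs pairs =
    let d = new Attributes()
    List.iter (fun (k, v) -> d.[k] <- v) pairs
    d

// digraph G { 0 [label="a"]; 0 -> 1 } as an AST value:
let elements = new List<Element>()
elements.Add (Node ("0", attrs [ ("label", "a") ]))
elements.Add (Edge ("0", "1", attrs []))
let example = Graph elements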

The abstract syntax tree is the parser's interface to the outside, so it's important to think about how you're going to use the parser when making decisions about its representation. For example, I took care not to use any F#-specific types in the above AST definition, so that client assemblies in other languages would not need to reference F#-specific assemblies. On that topic, don't forget to compile your F# assemblies with the --standalone flag, otherwise client assemblies will still need some F#-specific libraries (e.g. discriminated unions implement IStructuralHash, so clients would also need to know about that interface).

Another issue you should think about is how 'deep' you want to parse. For example, it would be possible to define separate cases for each of the different types of attributes that Graphviz can generate. That would also allow us to parse some of the attribute values further (e.g. the list of position coordinates could be parsed into a list of tuples). However, given the large number of Graphviz dot attributes and how frequently they change, I chose not to take this route.

Next episode: the lexer!