Saturday, July 17, 2010

Implementing a simple preprocessor in F# using FsLex

Introduction

In this article, I will show how two different lexers can work together to create a tokenizer with preprocessing functionality.

This is not a perfect solution for implementing the kind of preprocessor used in programming languages like C or C++, but for simpler tasks it can provide an elegant solution, much simpler than what you can achieve with a single lexer/parser pair. In effect, it creates a preprocessing lexer -> tokenizing lexer -> parser pipeline.

This article assumes that the reader is familiar with FsLex and FsYacc.

The problem domain

First, let me describe the problem I have and the need for a preprocessor. In my project, code is auto-generated from a single definitions file describing the classes and attributes of the domain. Such a file could look like this:

    entity User
        field ID int
        field FirstName string[250]
        field LastName string[250]
        field Email string[250]
        generator DomainObject .\MyProject.Domain
        generator DataAccess .\MyProject.DataAccess
        generator TableScript .\DatabaseScripts\Schemas\dbo\Tables
    end

This file states that I have a User in my domain model; that I need a domain object (a C# class) generated to implement the user; that I need a data access object that can save and load a user to/from the database; and that I need a table script that can create a user table in the database. The line

        generator DomainObject .\MyProject.Domain

defines a type of code generator (DomainObject) and a destination folder (.\MyProject.Domain) where the generated file will be saved.

The three different code generators take care of generating a great deal of trivial code for me, allowing me to concentrate on the important part, domain logic.

Now let's add another domain object to the script

    entity User
        field ID int
        field FirstName string[250]
        field LastName string[250]
        field Email string[250]
        generator DomainObject .\MyProject.Domain
        generator DataAccess .\MyProject.DataAccess
        generator TableScript .\DatabaseScripts\Schemas\dbo\Tables
    end
    entity UserLog
        field ID int
        field UserID int
        field Action string[250]
        field ActionDate datetime
        generator DomainObject .\MyProject.Domain
        generator DataAccess .\MyProject.DataAccess
        generator TableScript .\DatabaseScripts\Schemas\dbo\Tables
    end

Now we see some duplication appearing: the same destination folders are being used for the different code generators. If the project is restructured and files are moved to new folders, the same change will have to be carried out in many places. Adding the ability to define variables removes this duplication.

    @DomainOutputFolder=.\MyProject.Domain
    @DataAccessOutputFolder=.\MyProject.DataAccess
    @TableOutputFolder=.\DatabaseScripts\Schemas\dbo\Tables
    entity User
        field ID int
        field FirstName string[250]
        field LastName string[250]
        field Email string[250]
        generator DomainObject $(DomainOutputFolder)
        generator DataAccess $(DataAccessOutputFolder)
        generator TableScript $(TableOutputFolder)
    end
    entity UserLog
        field ID int
        field UserID int
        field Action string[250]
        field ActionDate datetime
        generator DomainObject $(DomainOutputFolder)
        generator DataAccess $(DataAccessOutputFolder)
        generator TableScript $(TableOutputFolder)
    end

The output folders are now stored in variables, so changing an output folder is a lot easier. It also gives a better overview of what code is generated and where, since the destination folders are declared at the beginning of the file.

The domain specification script without variables is the format that my tokenizing lexer and parser expect, so the job of the preprocessor is to transform the latter script (with variables) into the former script (without variables).

Implementing the Preprocessor

Normally, when generating a lexer/parser pair, the lexer emits a series of tokens that the parser recognizes and uses to build up an abstract syntax tree. But there is no restriction on what type of data the lexer must return. It can return integers, strings, or whatever you desire. You could even get the lexer to return an abstract syntax tree directly (with the result that it becomes completely impossible to understand).
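As a contrived illustration (not part of this project), an FsLex rule could just as well return plain integers. The sketch below assumes the usual header that opens Microsoft.FSharp.Text.Lexing:

rule digitRunLength = parse
| ['0'-'9']+  { (LexBuffer<_>.LexemeString lexbuf).Length (* length of the digit run *) }
| _           { 0 (* any other single character *) }
| eof         { -1 (* end of input *) }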

My solution revolves around a lexer that reads character data and returns character data. For the output of the preprocessing lexer to be used as input to the tokenizing lexer, the result must be wrapped in something the tokenizing lexer understands. In this example, I have wrapped the preprocessing lexer in a TextReader implementation.

I will start by showing how this is actually used in the project, as the context makes the solution easier to understand. First, here is the parsing routine without the preprocessor (I have used the --unicode flag for FsLex, which generates a lexer that operates on char data rather than byte data):

let reader = new System.IO.StreamReader (filename)
let lexBuffer = LexBuffer<char>.FromTextReader reader
Parser.start Lexer.token lexBuffer

Here is the parsing routine with the preprocessor added

let reader = new System.IO.StreamReader (filename)
let preprocessingReader = new PreprocessingTextReader(reader)
let lexBuffer = LexBuffer<char>.FromTextReader preprocessingReader
Parser.start Lexer.token lexBuffer

Notice that PreprocessingTextReader is simply an application of the decorator pattern: it transforms an input text stream, which means the preprocessor is usable in other contexts as well.
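For example, nothing prevents you from running the preprocessor on its own, without a parser behind it. A minimal sketch (the file name definitions.txt is just a placeholder):

let expandWholeFile (path: string) =
    use reader = new System.IO.StreamReader(path)
    use preprocessingReader = new PreprocessingTextReader(reader)
    // ReadToEnd is inherited from TextReader and is served one character
    // at a time by the overridden Read() shown later in this article.
    preprocessingReader.ReadToEnd()

printfn "%s" (expandWholeFile "definitions.txt")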

The solution thus has two components: a preprocessing lexer, and a specialized TextReader class that uses the preprocessing lexer to deliver the processed result.

Preprocessor.fsl

Let's first look at the preprocessing lexer, which makes up the core of the preprocessor. This is the component that actually converts the input stream.

The lexer output is a char array, so the action for every matched pattern should return a char array.

First I will show the whole file, and then go into detail on each element. As I said, I assume that the reader is familiar with FsLex and FsYacc, so I will not describe FsLex basics here.

{ 
module Preprocessor
open System.Collections.Generic
open Microsoft.FSharp.Text.Lexing

let variables = new Dictionary<string, string>()

let lexeme = Lexing.LexBuffer<_>.LexemeString

let parseConstant lexbuf =
    let input = lexeme lexbuf
    // Extract "a=b" in "@a=b"
    let s = input.Substring(1)
    let parts = s.Split('=')
    variables.Add(parts.[0], parts.[1])

let resolveConstant lexbuf =
    let input = lexeme lexbuf
    // Extract "xyz" in "$(xyz)"
    let variable = input.Substring(2, input.Length - 3)
    variables.[variable].ToCharArray()
}

let char        = ['a'-'z' 'A'-'Z']   
let identifier = char*
let nonNewlines = [^ '\r' '\n']*

rule preProcess = parse
| eof                           { [||] }
| "@"identifier"="(nonNewlines) { parseConstant lexbuf; [||] }
| "$("identifier")"             { resolveConstant lexbuf }
| _                             { lexbuf.Lexeme }

In order to implement the variables, I resort to an imperative programming style and introduce static mutable data in the lexer, in the form of a dictionary that holds variable names and their values.

let variables = new Dictionary<string, string>()

Following this are two helper functions. The first adds a variable to the dictionary. The second retrieves a variable's value from the dictionary and returns it as a char array.

let parseConstant lexbuf =
    let input = lexeme lexbuf
    // Extract "a=b" in "@a=b"
    let s = input.Substring(1)
    let parts = s.Split('=')
    variables.Add(parts.[0], parts.[1])

let resolveConstant lexbuf =
    let input = lexeme lexbuf
    // Extract "xyz" in "$(xyz)"
    let variable = input.Substring(2, input.Length - 3)
    variables.[variable].ToCharArray()

Then come the lexer rules. The first rule is pretty simple: the end-of-file indicator simply returns an empty array.

rule preProcess = parse
| eof                           { [||] }

The second rule identifies a variable definition. When it matches, the previously shown helper function parseConstant is called. Note that using multiple lexer rules (states) could have made this implementation cleaner, e.g. by eliminating the need to split strings and strip the @ character in parseConstant, but this implementation doesn't go down that road. The return value is an empty array.

| "@"identifier"="(nonNewlines) { parseConstant lexbuf; [||] }

The third rule is the one that actually looks up the variable, again using the previously defined helper function.

| "$("identifier")"             { resolveConstant lexbuf }

And last, any other lexeme is passed through untouched. Because the lexer is generated with the --unicode flag, lexbuf.Lexeme is already a char array and can therefore be returned directly.

| _                             { lexbuf.Lexeme }
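To see the rule in action without wiring it up to the parser, you can drive it directly from a string. The following is only a sketch of such a test (the helper name expand is mine, not part of the project):

let expand (input: string) =
    let lexbuf = Microsoft.FSharp.Text.Lexing.LexBuffer<char>.FromString input
    let output = System.Text.StringBuilder()
    // Keep pulling char arrays out of the preprocessing rule until the
    // lexer reports that the end of the input has been passed.
    while not lexbuf.IsPastEndOfStream do
        for c in Preprocessor.preProcess lexbuf do
            output.Append c |> ignore
    output.ToString()

// expand "@Out=.\\Bin\ngenerator DomainObject $(Out)\n"
// should expand $(Out) to .\Bin while dropping the definition text itself.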

PreprocessingReader.fs

The PreprocessingTextReader class is a specialization of TextReader that uses the preprocessing lexer and exposes its output as a TextReader. Seeing how the preprocessing lexer was implemented, it is clear that the reader should function like this:

  • The reader has an internal buffer of char data read from the lexer
  • When the application reads from the reader, it will return data from the buffer
  • If there is no data in the buffer when the application reads from the reader, the reader will read data from the lexer into the buffer first.
  • When reading data from the preprocessing lexer into the buffer, the function should be able to indicate if we have hit the end of file.

The buffer is implemented as a Queue<char> (from the System.Collections.Generic namespace) and is filled by the function checkQueue(), which puts data in the queue when necessary. The function returns true if there is more data to process, and false if the queue has been emptied and the preprocessing lexer has passed the end of file.

Since performance is of no concern in this specific application, the implementation is minimal: it only overrides the Read() and Peek() functions.

module PreprocessingReader
open System
open System.IO
open System.Collections.Generic
open Microsoft.FSharp.Text.Lexing

type PreprocessingTextReader (sourceReader: StreamReader) =
    inherit TextReader()

    let reader = sourceReader
    let lexBuffer = LexBuffer<char>.FromTextReader reader
    let queue = new Queue<char>()

    // Checks if there is more data to read. If the queue is empty, it
    // will check if there is more data in the input file, and add it to
    // the queue.
    // If there is more data to process, the function returns true. If
    // all the data in the input file has been processed, the function returns false
    let rec checkQueue () =
        if queue.Count > 0 then
            true
        elif lexBuffer.IsPastEndOfStream then
            false
        else    
            let s = Preprocessor.preProcess lexBuffer
            if (s.Length = 0) then
                checkQueue()
            else
                for c in s do 
                    queue.Enqueue c
                true

    override x.Dispose(disposing) =
        if disposing then reader.Dispose()

    override x.Peek() : int =        
        if checkQueue() then
            queue.Peek() |> Convert.ToInt32
        else
            -1

    override x.Read() : int =        
        if checkQueue() then            
            queue.Dequeue() |> Convert.ToInt32
        else
            -1

Conclusion

I believe that this is an elegant solution to my problem in the context that it is given. But using it in a more complex context could cause problems.

The first problem with this code is the lack of performance optimization: the reader hands out only a single character at a time. Letting the preprocessor deliver larger chunks of data at a time (for example by also overriding Read(char[], int, int)) would probably yield better performance. But in my case it takes less than a second to execute at compile time when the input file has changed, so performance is not a problem here.

Second, the preprocessor messes up the line numbering. If you are building a compiler and want to report line numbers for compilation errors, a given line returned by the preprocessor will not necessarily have the same line number as in the original file, making it difficult for the user of that compiler to identify where in the source code the error is.

Third, there is no handling of a lookup of a variable that has not been defined; this implementation will simply fail with an exception (a KeyNotFoundException from the dictionary lookup).
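If you need a friendlier failure mode, one possible adjustment (just a sketch, not part of the implementation above) is to make resolveConstant report the offending variable name:

let resolveConstant lexbuf =
    let input = lexeme lexbuf
    // Extract "xyz" in "$(xyz)"
    let variable = input.Substring(2, input.Length - 3)
    match variables.TryGetValue variable with
    | true, value -> value.ToCharArray()
    | false, _    -> failwithf "Undefined variable: %s" variable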

So if you are building advanced parsers and compilers, or are using this to analyze text at runtime in a production environment where performance is critical, then you need to resolve these issues before using this type of implementation.

But if these issues don't apply to you, then this is a very simple, clean implementation of a preprocessor where the concerns are clearly separated. It can easily be extended to handle include files, conditional compilation, and much more. And the preprocessor can just as easily be used in a different context; it does not need to sit in front of a parser.

/Pete
