A Lexer for Haskell in Haskell

Thomas Hallgren
PacSoft
Oregon Graduate Institute,
14 January, 2002

All slides in one page
Source code

Slides

Talk at the Haskell Implementers Meeting

A Lexer for Haskell in Haskell

Thomas Hallgren,
PacSoft
Oregon Graduate Institute


Revision history

2001-11-30: original half-baked presentation at OGI
2001-12-03: 53% baked version
2002-01-14: 60% baked version presented at the Haskell Implementers Meeting
2002-01-31: 62% baked version, put on the web
2002-04-06: 63% baked version, corrected the type of popContext
2003-03-04: 64% baked version, added sizes of GHC's and NHC's lexers

Background

Starting Point

Problems for implementors

Goals

...and why not...

Existing solutions

Examples

Existing solutions

Observations

The tasks of a lexical analyzer for Haskell

The main tasks of the lexical analyzer is grouping characters into lexemes and throwing away white space.

This sounds simple enough, but there are many non-trivial subtasks:

This is why the monolithic code is so complex...

Tempting paths for an implementor

Can a lexer be split up into a number of simpler passes?
Can nested comments be removed first?
If so, the rest could be specified using regular expressions, and perhaps implemented easily using a standard tool.
Can qualified identifiers be recognized in a separate pass?
If so, the DFA produced by a lexer generator could be much smaller.
Can keywords and reserved operators be recognized in a separate pass?
If so, the DFA produced by a lexer generator could be much smaller.

Achieving the Goals

Correctness through Simplicity
Generate the lexer from a specification that is as close as possible to the specification in the Haskell 98 Report.
Efficiency
Compile the specification to an efficient Haskell program.

Specifying the lexer

The Haskell lexer is specified in Haskell!

The Haskell code is a fairly direct transcription of specification in the Haskell Report.

Structure of the new lexer

The implementation consists of the following key ingredients:
program       :: RegExp HaskellChar Token
lexerGen      :: RegExp HaskellChar Token -> HaskellSourceCode

haskellLex    :: String -> [(Token,String)]
nestedComment :: ...

addPos    :: [(Token,String)] -> [(Token,(Pos,String))]
rmSpace   :: [(Token,(Pos,String))] -> [(Token,(Pos,String))]
layoutPre :: [(Token,(Pos,String))] -> [(Token,(Pos,String))]

token     :: PM (Token,(Pos,String))
popContext:: PM ()

parseFile :: PM a -> FilePath -> String -> Either Error a
See modules HsLexerPass1, HsLexer, HsLexUtils.

Some other points worth mentioning

A quick comparison

Rough sizes

Speed?

TO DO

The End

[ Valid HTML?| Check Links]