Talk at the Haskell Implementers Meeting

A Lexer for Haskell in Haskell

Thomas Hallgren,
PacSoft
Oregon Graduate Institute

Revision history

2001-11-30: original half-baked presentation at OGI
2001-12-03: 53% baked version
2002-01-14: 60% baked version presented at the Haskell Implementers Meeting
2002-01-31: 62% baked version, put on the web
2002-04-06: 63% baked version, corrected the type of popContext
2003-03-04: 64% baked version, added sizes of GHC's and NHC's lexers

Background

Being inspired by Typing Haskell in Haskell, I thought Parsing Haskell in Haskell, in the same spirit, would also be a good idea.
I decided to start by trying to create A Lexer for Haskell in Haskell.
Most of the work was done during a warm and sunny summer week in 1999.
The final parts (nested comments and layout processing) were added last year.

Starting Point

Use Appendix B of the Haskell 98 Report as a specification.
Appendix B consists of some segments of (semi)formal notation, glued together, and clarified by informal text.

Problems for implementors

Appendix B is not self-contained. For example, you have to consult Chapter 2, which refers you to (the wrong section of) Chapter 5, to confirm that qualified names are part of the lexical syntax (i.e., that spaces are not allowed in M.x).
The original designers of the Haskell syntax decided to design a nice syntax, without restricting themselves to the current parsing technology.
Textbook methods and standard tools cannot be used out of the box...
Still, solving the problem of parsing Haskell does not seem to be seen as prestigious enough for anyone to have invested the time to do it well.

Goals

Simplicity
Correctness
Efficiency
Clarification

...and why not...

Reusability

Existing solutions

Examples

Lexer.hs (in Haskell, from the Programatica front-end, ~500 lines) (originally from the hssource library supplied with GHC)
GHC's lexer (in nonstandard Haskell, ~1300 lines)
NHC's lexer (in Haskell, ~1200 lines)
HBC's lexer (in C, ~1700 lines)
Hugs' lexer (in C, ~2000 lines)

Existing solutions

Observations

Haskell implementations use handwritten lexers.
A large monolithic chunk of code.
Difficult to verify correctness.
Difficult to adapt to new versions of Haskell (1.2, 1.3, 1.4, 98, revised 98...).
Only one of my goals is achieved: efficiency.

The tasks of a lexical analyzer for Haskell

The main tasks of the lexical analyzer are grouping characters into lexemes and throwing away white space.

This sounds simple enough, but there are many non-trivial subtasks:

Removing nested comments,
Preserving position information,
Interacting with the parser to implement the layout processing,
Recognizing string literals,
Recognizing simple identifiers,
Recognizing qualified identifiers,
Recognizing keywords and reserved operators,
...

This is why the monolithic code is so complex...

Tempting paths for an implementor

Can a lexer be split up into a number of simpler passes?

Can nested comments be removed first? If so, the rest could be specified using regular expressions, and perhaps implemented easily using a standard tool.
Can qualified identifiers be recognized in a separate pass? If so, the DFA produced by a lexer generator could be much smaller.
Can keywords and reserved operators be recognized in a separate pass? If so, the DFA produced by a lexer generator could be much smaller.
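
Nested comments are the one part of the lexical syntax that a regular expression cannot describe, because the nesting depth is unbounded and has to be counted. Removing them in a first pass is also awkward: "{-" inside a string literal does not start a comment, so such a pass would already need to recognize string literals. The following is a minimal sketch of the counting itself. The name skipNested and its type are assumptions made for illustration, not the lexer's own nestedComment function.

```haskell
-- Hypothetical helper, not the nestedComment function from the talk.
-- The argument is the input just after an opening "{-".
-- The result is the comment body (without the closing "-}") and the
-- remaining input, or Nothing if the comment is never closed.
skipNested :: String -> Maybe (String, String)
skipNested = go (1 :: Int) []
  where
    -- n is the current nesting depth, acc the body read so far (reversed).
    go _ _   []             = Nothing
    go n acc ('{':'-':rest) = go (n + 1) ('-':'{':acc) rest
    go n acc ('-':'}':rest)
      | n == 1    = Just (reverse acc, rest)
      | otherwise = go (n - 1) ('}':'-':acc) rest
    go n acc (c:rest)       = go n (c:acc) rest
```

For example, skipNested " a {- b -} c -} tail" yields the body " a {- b -} c " and the remaining input " tail".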

Achieving the Goals

Correctness through Simplicity: Generate the lexer from a specification that is as close as possible to the specification in the Haskell 98 Report.
Efficiency: Compile the specification to an efficient Haskell program.

Specifying the lexer

The Haskell lexer is specified in Haskell!

The Haskell code is a fairly direct transcription of the specification in the Haskell Report.

HaskellLexicalSyntax,
Main module of the lexer generator.
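
To give a feel for what "a fairly direct transcription" can look like, here is a minimal sketch, not the actual specification module. It defines a toy regular-expression type and writes the report's productions for integer literals with it. The type RE, the helpers str and some', and the derivative-based matches function are all assumptions made for illustration. The real lexer generator compiles its specification to a DFA instead of matching by derivatives, and digit here covers ASCII digits only.

```haskell
-- A toy regular-expression type (hypothetical; the talk's RegExp is not shown).
data RE
  = Empty                 -- matches nothing
  | Eps                   -- matches the empty string
  | Sym (Char -> Bool)    -- matches one character satisfying the predicate
  | RE :& RE              -- sequence
  | RE :| RE              -- alternative
  | Many RE               -- zero or more, {r} in the report's notation

infixr 5 :&
infixr 4 :|

-- Does the expression match the empty string?
nullable :: RE -> Bool
nullable Empty    = False
nullable Eps      = True
nullable (Sym _)  = False
nullable (a :& b) = nullable a && nullable b
nullable (a :| b) = nullable a || nullable b
nullable (Many _) = True

-- Brzozowski derivative: what remains to be matched after reading c.
deriv :: Char -> RE -> RE
deriv _ Empty    = Empty
deriv _ Eps      = Empty
deriv c (Sym p)  = if p c then Eps else Empty
deriv c (a :& b)
  | nullable a   = (deriv c a :& b) :| deriv c b
  | otherwise    = deriv c a :& b
deriv c (a :| b) = deriv c a :| deriv c b
deriv c (Many a) = deriv c a :& Many a

-- Whole-string match, used here only to try the specification out.
matches :: RE -> String -> Bool
matches r = nullable . foldl (flip deriv) r

-- Helpers: a literal string, and r{r} (one or more).
str :: String -> RE
str = foldr (\c r -> Sym (== c) :& r) Eps

some' :: RE -> RE
some' r = r :& Many r

-- Productions transcribed from the report's lexical syntax.
digit, octit, hexit :: RE
digit = Sym (`elem` ['0'..'9'])
octit = Sym (`elem` ['0'..'7'])
hexit = digit :| Sym (`elem` "abcdefABCDEF")

decimal, octal, hexadecimal, integer :: RE
decimal     = some' digit
octal       = some' octit
hexadecimal = some' hexit
integer     = decimal
           :| str "0o" :& octal       :| str "0O" :& octal
           :| str "0x" :& hexadecimal :| str "0X" :& hexadecimal
```

With these definitions, matches integer "0x1F" is True and matches integer "0o8" is False. The point of the sketch is the last block: each definition reads almost like the corresponding production in the report.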

Structure of the new lexer

The implementation consists of the following key ingredients:

program       :: RegExp HaskellChar Token
lexerGen      :: RegExp HaskellChar Token -> HaskellSourceCode

haskellLex    :: String -> [(Token,String)]
nestedComment :: ...

addPos    :: [(Token,String)] -> [(Token,(Pos,String))]
rmSpace   :: [(Token,(Pos,String))] -> [(Token,(Pos,String))]
layoutPre :: [(Token,(Pos,String))] -> [(Token,(Pos,String))]

token      :: PM (Token,(Pos,String))
popContext :: PM ()

parseFile :: PM a -> FilePath -> String -> Either Error a

See modules HsLexerPass1, HsLexer, HsLexUtils.
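
The two simplest passes can be sketched directly from their types. This is only an illustration under stated assumptions: Token and Pos are not shown in the talk, so the definitions below are hypothetical stand-ins, and the bodies of addPos and rmSpace are one plausible reading of the signatures, not the code in HsLexerPass1. Tab stops are taken to be 8 columns apart with the first column numbered 1, as in the report's layout rules.

```haskell
-- Hypothetical stand-ins for types the talk does not show.
data Token = Whitespace | Comment | Varid | Other
  deriving (Eq, Show)

type Pos = (Int, Int)   -- (line, column), both starting at 1

-- Advance a position over one character.
nextPos :: Pos -> Char -> Pos
nextPos (l, _) '\n' = (l + 1, 1)
nextPos (l, c) '\t' = (l, c + 8 - (c - 1) `mod` 8)   -- jump to next tab stop
nextPos (l, c) _    = (l, c + 1)

-- Pair every lexeme with the position of its first character.
addPos :: [(Token, String)] -> [(Token, (Pos, String))]
addPos = go (1, 1)
  where
    go _ []              = []
    go p ((t, s) : rest) = (t, (p, s)) : go (foldl nextPos p s) rest

-- Drop white space and comments, which later passes do not need.
rmSpace :: [(Token, (Pos, String))] -> [(Token, (Pos, String))]
rmSpace = filter (\(t, _) -> t `notElem` [Whitespace, Comment])
```

Because positions are attached before white space is removed, the surviving tokens keep the line and column information that layout processing needs.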

Some other points worth mentioning

The generated lexer supports Unicode.
The first passes of the lexer preserve comments and white space. This turned out to be useful when reusing the lexer in a Haskell-to-HTML converter.
How is the "maximal munch" rule implemented?
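
One common way to implement maximal munch on top of a DFA is to keep running until no transition is possible, while remembering the most recent accepting state and the input position where it was reached. The sketch below shows that idea only; it is not necessarily how the generated lexer does it. The DFA record, the toy automaton and the token names are all hypothetical.

```haskell
import Data.Char (isDigit)

-- A hypothetical DFA interface: a start state, a transition function that
-- returns Nothing when stuck, and the token (if any) accepted by a state.
data DFA s t = DFA
  { start  :: s
  , step   :: s -> Char -> Maybe s
  , accept :: s -> Maybe t
  }

-- Longest match: run as far as possible, remembering the last accepting
-- point as (token, lexeme, remaining input).
munch :: DFA s t -> String -> Maybe (t, String, String)
munch dfa input = go (start dfa) [] input Nothing
  where
    go s consumed rest best =
      let best' = case accept dfa s of
                    Just t  -> Just (t, reverse consumed, rest)
                    Nothing -> best
      in case rest of
           []     -> best'
           (c:cs) -> case step dfa s c of
                       Nothing -> best'
                       Just s' -> go s' (c : consumed) cs best'

-- Repeated longest match; stops at end of input or at the first error.
tokens :: DFA s t -> String -> [(t, String)]
tokens dfa input = case munch dfa input of
  Just (t, lexeme@(_:_), rest) -> (t, lexeme) : tokens dfa rest
  _                            -> []

-- A toy automaton for "=", "==", integers and simple floats.
data Tok = Equals | EqEq | IntLit | FloatLit
  deriving (Eq, Show)

toy :: DFA Int Tok
toy = DFA { start = 0, step = st, accept = acc }
  where
    st 0 '=' = Just 1
    st 1 '=' = Just 2
    st 0 c | isDigit c = Just 3
    st 3 c | isDigit c = Just 3
    st 3 '.' = Just 4            -- state 4 is not accepting
    st 4 c | isDigit c = Just 5
    st 5 c | isDigit c = Just 5
    st _ _   = Nothing
    acc 1 = Just Equals
    acc 2 = Just EqEq
    acc 3 = Just IntLit
    acc 5 = Just FloatLit
    acc _ = Nothing
```

On the input "1.=" the automaton reads "1." and then gets stuck in a non-accepting state, so munch falls back to the remembered match: the integer "1" with ".=" left over. On "12.5==" the tokens function yields a FloatLit for "12.5" followed by EqEq for "==", not two separate Equals tokens.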

A quick comparison

Rough sizes

The old handwritten lexer for Haskell: 664 lines

The new lexer generator: 678 lines
The lexer specification for Haskell: 193 lines

The new layout processing: 100 lines
Size of the Haskell code for the generated DFA: 5500 lines (148 states)

Speed?

The new lexer+parser seems to be 10-15% slower than the old one.
The Haskell Prelude and Standard Libraries (2943 lines) are still parsed in less than one second (on a 600MHz Pentium III).

TO DO

Convert the escapes in character and string literals.
See if the size of the DFA can be reduced by eliminating equivalent states...
Make some of the handwritten parts more readable.
Make the regular expression compiler more readable.
Update it to the revised Haskell 98 report.
...

A Lexer for Haskell in Haskell

Thomas Hallgren
PacSoft
Oregon Graduate Institute,
14 January, 2002
