The tasks of a lexical analyzer for Haskell
The main tasks of the lexical analyzer are grouping characters into lexemes
and throwing away white space.
This sounds simple enough, but there
are many non-trivial subtasks:
- Removing nested comments,
- Preserving position information,
- Interacting with the parser to implement layout processing,
- Recognizing string literals,
- Recognizing simple identifiers,
- Recognizing qualified identifiers,
- Recognizing keywords and reserved operators,
- ...
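The core task, grouping characters into lexemes, can be illustrated with a minimal sketch. The `Token` type and the `lexemes` function below are hypothetical stand-ins, far simpler than the real lexer (no operators, literals beyond integers, comments, or layout):

```haskell
import Data.Char (isAlpha, isAlphaNum, isDigit, isSpace)

-- Hypothetical token classification, not the one used in the real lexer.
data Token = Varid | IntLit | Special | Whitespace
  deriving (Show, Eq)

-- Group characters into (token, lexeme) pairs. White space is kept as a
-- token here so that a later pass can discard it.
lexemes :: String -> [(Token, String)]
lexemes []         = []
lexemes s@(c : cs)
  | isSpace c = grab Whitespace isSpace
  | isAlpha c = grab Varid (\x -> isAlphaNum x || x == '\'')
  | isDigit c = grab IntLit isDigit
  | otherwise = (Special, [c]) : lexemes cs
  where
    grab t p = let (lexeme, rest) = span p s in (t, lexeme) : lexemes rest
```

For example, `lexemes "let x = 1"` groups the input into seven lexemes, four of them white space or specials.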
This is why the monolithic code is so complex...
Tempting paths for an implementor
Can a lexer be split up into a number of simpler passes?
- Can nested comments be removed first?
- If so, the rest could be specified using regular expressions, and perhaps implemented easily using a standard tool.
- Can qualified identifiers be recognized in a separate pass?
- If so, the DFA produced by a lexer generator could be much smaller.
- Can keywords and reserved operators be recognized in a separate pass?
- If so, the DFA produced by a lexer generator could be much smaller.
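The first question can be made concrete with a sketch. The function below (the name `rmNested` is made up for illustration) strips Haskell's nested `{- ... -}` comments by tracking a nesting depth. Note the catch that makes this "tempting path" non-trivial: a naive character-level pass like this would also strip a `{-` that occurs inside a string literal.

```haskell
-- Hypothetical first pass: remove nested {- ... -} comments.
-- Caveat: this naive version does not know about string literals,
-- so "{-" inside a string would wrongly start a comment.
rmNested :: String -> String
rmNested ('{' : '-' : rest) = rmNested (skip 1 rest)
  where
    skip :: Int -> String -> String
    skip 0 s                 = s
    skip n ('{' : '-' : s)   = skip (n + 1) s
    skip n ('-' : '}' : s)   = skip (n - 1) s
    skip n (_ : s)           = skip n s
    skip _ []                = []  -- unterminated comment: drop the rest
rmNested (c : cs) = c : rmNested cs
rmNested []       = []
```

For example, `rmNested "a {- b {- c -} d -} e"` yields `"a  e"`.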
Specifying the lexer
The Haskell lexer is specified in Haskell!
The Haskell code is a fairly direct transcription of the specification in
the Haskell Report.
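To give the flavour of such a transcription, here is a hypothetical `RegExp` combinator type (the real type in the lexer sources may differ) together with the Report production `varid -> small {small | large | digit | '}` written as a value, plus a naive matcher so the sketch can be run:

```haskell
-- Hypothetical regular-expression combinators; the constructor and
-- helper names are illustrative, not taken from the actual lexer.
data RegExp c
  = Sym (c -> Bool)          -- one symbol satisfying a predicate
  | RegExp c :|: RegExp c    -- alternation
  | RegExp c :>: RegExp c    -- sequencing
  | Many (RegExp c)          -- zero or more repetitions

small, large, digit, varid :: RegExp Char
small = Sym (\c -> c >= 'a' && c <= 'z' || c == '_')
large = Sym (\c -> c >= 'A' && c <= 'Z')
digit = Sym (\c -> c >= '0' && c <= '9')

-- Transcription of: varid -> small {small | large | digit | '}
varid = small :>: Many (small :|: large :|: digit :|: Sym (== '\''))

-- Naive backtracking matcher (exponential; a real lexer generator
-- would compile the RegExp to a DFA instead).
accepts :: RegExp c -> [c] -> Bool
accepts r cs = any null (go r cs)
  where
    go (Sym p) (c : rest) | p c = [rest]
    go (Sym _) _                = []
    go (a :|: b) s              = go a s ++ go b s
    go (a :>: b) s              = concatMap (go b) (go a s)
    go (Many a) s               =
      s : [ s'' | s' <- go a s, length s' < length s
                , s'' <- go (Many a) s' ]
```

For example, `accepts varid "x1'"` holds, while `accepts varid "X"` does not.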
Structure of the new lexer
The implementation consists of the following key ingredients:
program :: RegExp HaskellChar Token
lexerGen :: RegExp HaskellChar Token -> HaskellSourceCode
haskellLex :: String -> [(Token,String)]
nestedComment :: ...
addPos :: [(Token,String)] -> [(Token,(Pos,String))]
rmSpace :: [(Token,(Pos,String))] -> [(Token,(Pos,String))]
layoutPre :: [(Token,(Pos,String))] -> [(Token,(Pos,String))]
token :: PM (Token,(Pos,String))
popContext :: PM ()
parseFile :: PM a -> FilePath -> String -> Either Error a
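To show how the first few passes plumb together, here is a self-contained sketch in which `Token`, `Pos`, and all four passes are minimal stubs with the same shapes as the signatures above; only the composition at the end is the point. The real implementations (a generated DFA lexer, genuine position tracking, layout preprocessing) are of course far more involved.

```haskell
import Data.Char (isSpace)
import Data.List (groupBy)

type Pos = (Int, Int)                    -- (line, column); stub
newtype Token = T String deriving (Show, Eq)  -- stand-in token type

-- Stub lexer: just splits runs of spaces from runs of non-spaces.
-- The real haskellLex is driven by a generated DFA.
haskellLex :: String -> [(Token, String)]
haskellLex = map (\g -> (T g, g)) . groupBy (\a b -> isSpace a == isSpace b)

-- Stub position pass: column counting on a single line, no newlines.
addPos :: [(Token, String)] -> [(Token, (Pos, String))]
addPos ts = zipWith (\col (t, s) -> (t, ((1, col), s))) cols ts
  where cols = scanl (\c (_, s) -> c + length s) 1 ts

-- Discard white-space tokens (positions of the rest are preserved).
rmSpace :: [(Token, (Pos, String))] -> [(Token, (Pos, String))]
rmSpace = filter (\(_, (_, s)) -> not (all isSpace s))

-- Layout preprocessing omitted entirely in this sketch.
layoutPre :: [(Token, (Pos, String))] -> [(Token, (Pos, String))]
layoutPre = id

-- The composition is the point: each pass is a simple list transformer.
lexerPass1 :: String -> [(Token, (Pos, String))]
lexerPass1 = layoutPre . rmSpace . addPos . haskellLex
```

With these stubs, `lexerPass1 "let x"` yields the two non-space tokens with their columns, `[(T "let",((1,1),"let")),(T "x",((1,5),"x"))]`.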
See modules
HsLexerPass1,
HsLexer,
HsLexUtils.