Some useful notes by Iavor S. Diatchki. (Converted to HTML by Thomas Hallgren)

The Lexer

We use LexerGen/HsLexerGen to generate Lexer/HsLex.hs, which is the basic lexer. Its main purpose is to define:
haskellLex :: String -> [(Token,String)]
Token is defined in Lexer/HsTokens.hs and enumerates the different kinds of token we have. The String is the text of the token.
Next comes Lexer/HsLexerPass1.hs, which makes use of the basic lexer haskellLex. Its main purpose is to define:
lexerPass0 :: String -> [(Token,(Pos,String))]
lexerPass1Only :: [(Token,(Pos,String))] -> [(Token,(Pos,String))]
lexerPass0 uses haskellLex to separate the input into tokens, and then annotates them with their positions. lexerPass1Only removes whitespace tokens from the input.
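
The two steps can be illustrated with a small sketch. This is not the code in Lexer/HsLexerPass1.hs: the three-constructor Token type and the names nextPos, addPos and dropWhite are hypothetical stand-ins, and tab handling is ignored.

```haskell
-- Hypothetical stand-in; the real Token type lives in Lexer/HsTokens.hs.
data Token = Whitespace | Varid | Reservedop
  deriving (Eq, Show)

type Pos = (Int, Int)  -- (row, column), starting at (1,1)

-- Advance a position over one character (tabs are not treated specially here).
nextPos :: Pos -> Char -> Pos
nextPos (r, _) '\n' = (r + 1, 1)
nextPos (r, c) _    = (r, c + 1)

-- Attach to every token the position of its first character.
addPos :: [(Token, String)] -> [(Token, (Pos, String))]
addPos = go (1, 1)
  where
    go _ [] = []
    go p ((t, s) : rest) = (t, (p, s)) : go (foldl nextPos p s) rest

-- Drop whitespace tokens; the remaining tokens keep their positions.
dropWhite :: [(Token, (Pos, String))] -> [(Token, (Pos, String))]
dropWhite = filter ((/= Whitespace) . fst)
```

Because positions are assigned before whitespace is dropped, the surviving tokens still carry the rows and columns they had in the original input.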

At this stage Pos is simply a pair of Ints (defined in Lexer/HsLexerPass1.hs). The format is (row,column). Positions start at (1,1).


Finally, the file Lexer/HsLexer.hs contains the real lexer that can interact with the Happy grammar. It defines:
lexer :: ((Token,(SrcLoc,String)) -> PM a) -> PM a
Here PM is the parsing monad, located in ParseMonad.hs. The lexer expects the parsing monad to have a state component of type:
type State = ([(Token,(SrcPos,String))],[Int])
The first component of the state is the list of remaining tokens. The second component is the layout context, i.e. a stack keeping track of indentations of blocks of declarations.

The lexer accesses the state with the aid of four functions:

get         :: PM State
set         :: State -> PM ()
setreturn   :: a -> State -> PM a
fail        :: String -> PM a
(setreturn is simply an optimization (is it worth it?), setreturn x s = set s >> return x )
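
To make the interface concrete, here is a minimal sketch of a monad that offers these operations. It is an assumption for illustration, not the definition in ParseMonad.hs: the representation of PM, the two-constructor Token type, the SrcPos synonym, the name failPM (used instead of fail to avoid clashing with the Prelude) and the helper nextToken are all hypothetical.

```haskell
-- Hypothetical stand-ins for the real token and position types.
data Token = Varid | GotEOF deriving (Eq, Show)
type SrcPos = (Int, Int)

-- Remaining tokens paired with the layout context (a stack of indentations).
type State = ([(Token, (SrcPos, String))], [Int])

-- A bare state-and-error monad; the real PM may be represented differently.
newtype PM a = PM { runPM :: State -> Either String (a, State) }

instance Functor PM where
  fmap f (PM m) = PM $ \s -> case m s of
    Left e        -> Left e
    Right (a, s') -> Right (f a, s')

instance Applicative PM where
  pure x = PM $ \s -> Right (x, s)
  PM mf <*> PM mx = PM $ \s -> case mf s of
    Left e        -> Left e
    Right (f, s') -> case mx s' of
      Left e         -> Left e
      Right (x, s'') -> Right (f x, s'')

instance Monad PM where
  PM m >>= k = PM $ \s -> case m s of
    Left e        -> Left e
    Right (a, s') -> runPM (k a) s'

get :: PM State
get = PM $ \s -> Right (s, s)

set :: State -> PM ()
set s = PM $ \_ -> Right ((), s)

-- Install a new state and return a value in one step.
setreturn :: a -> State -> PM a
setreturn x s = set s >> return x

-- Plays the role of fail in the list above.
failPM :: String -> PM a
failPM msg = PM $ \_ -> Left msg

-- Example client: pop the next token, leaving the layout context untouched.
nextToken :: PM (Token, (SrcPos, String))
nextToken = do
  (ts, ctx) <- get
  case ts of
    []        -> failPM "no more tokens"
    (t : ts') -> setreturn t (ts', ctx)
```

In this sketch setreturn saves nothing over writing set followed by return; whether a fused version pays off depends on how PM is actually represented.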

Currently the file ParseMonad.hs also defines:

eoftoken = (GotEOF,(eof,""))
eof = SrcLoc "?" (-1) (-1) -- hmm
(this is also used in HsLexer.hs)

SrcLoc is a type defined in ../AST/SrcLoc.hs; it represents positions that also carry a file name. The conversion from Pos to SrcLoc happens in a function parseTokens defined in ParseMonad.hs.
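
The conversion might look roughly like the sketch below. The shape of SrcLoc (a file name followed by two Ints) is only inferred from the eof value above, the order of the two Int fields is an assumption, and the names toSrcLoc and addFile are hypothetical; the real code is in parseTokens.

```haskell
type Pos = (Int, Int)  -- (row, column)

-- Assumed shape, inferred from `SrcLoc "?" (-1) (-1)`; field order is a guess.
data SrcLoc = SrcLoc FilePath Int Int
  deriving (Eq, Show)

-- Combine a file name with a bare position.
toSrcLoc :: FilePath -> Pos -> SrcLoc
toSrcLoc file (row, col) = SrcLoc file row col

-- Convert the positions of a whole token list; the token type is left abstract.
addFile :: FilePath -> [(tok, (Pos, String))] -> [(tok, (SrcLoc, String))]
addFile file ts = [ (t, (toSrcLoc file p, s)) | (t, (p, s)) <- ts ]
```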