Lexical Analysis (lexer.mll)

From a computer's point of view, even an ML program is at first just a sequence of characters. For example, the gcd program shown earlier looks like:

'l' 'e' 't' ' ' 'r' 'e' 'c' ' ' 'g' 'c' 'd' ' ' 'm' ' ' 'n' ' ' '=' ...

Since we cannot do much with such a string as it is, we first split it into tokens like:

"let" "rec" "gcd" "m" "n" "=" ...

This process is called lexical analysis.
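To make the idea concrete, here is a small sketch of what a token data type might look like in OCaml. It is illustrative only: in this compiler the real token type is generated from the declarations in parser.mly, so the constructor names below are assumptions.

(* Hypothetical token type, for illustration only; the real one is
   generated from the %token declarations in parser.mly. *)
type token =
  | LET                (* the keyword "let" *)
  | REC                (* the keyword "rec" *)
  | IDENT of string    (* identifiers such as gcd, m, n *)
  | EQUAL              (* the symbol "=" *)
  | INT of int         (* integer literals *)

(* The beginning of the gcd program, viewed as a token sequence: *)
let tokens_of_gcd_prefix =
  [LET; REC; IDENT "gcd"; IDENT "m"; IDENT "n"; EQUAL]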

Although there are many ways to perform lexical analysis, here we use ocamllex, a tool developed specifically for lexical analysis in OCaml. The relevant file is lexer.mll. We leave the details of ocamllex to its manual and explain only the highlights. We write a list of rules such as

| '-'? digit+
    { INT(int_of_string (Lexing.lexeme lexbuf)) }

in a syntax resembling pattern matching, meaning "if the input matches the regular expression '-'? digit+, return an INT token." The data type of tokens (such as INT) is defined in parser.mly, which is explained next. The expression Lexing.lexeme lexbuf can be treated as a "magic spell" that yields the string currently being matched.
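For orientation, below is a minimal, self-contained sketch of what a lexer.mll written in this style might look like. It is not the compiler's actual lexer: only a handful of rules are shown, and the token names (INT, LET, REC, EQUAL, IDENT, EOF) are assumed to match %token declarations in parser.mly.

{
  (* Header: plain OCaml code.  The token constructors are assumed
     to come from the generated Parser module. *)
  open Parser
}

let digit = ['0'-'9']
let space = [' ' '\t' '\n' '\r']

rule token = parse
| space+
    { token lexbuf }                                (* skip whitespace *)
| '-'? digit+
    { INT(int_of_string (Lexing.lexeme lexbuf)) }   (* integer literal *)
| "let"
    { LET }
| "rec"
    { REC }
| '='
    { EQUAL }
| ['a'-'z'] ['a'-'z' 'A'-'Z' '0'-'9' '_']*
    { IDENT(Lexing.lexeme lexbuf) }                 (* identifier *)
| eof
    { EOF }

Running ocamllex on such a file produces lexer.ml, which defines a function token that takes a Lexing.lexbuf and returns the next token; the parser generated by ocamlyacc calls this function each time it needs more input.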
