Name	Name	Last commit message	Last commit date
parent directory ..
bin	bin
lib	lib
test	test
README.md	README.md

Lexing

From your fork of the repository, go to the lip/ directory and create a new project with the following command:

dune init project toylexer

Then, run the following command from the lip/toylexer directory:

echo '(ocamllex lexer)' >> lib/dune

This appends one line to lib/dune, to instruct the compiler to use the ocamllex tool.

The goal of this project is to familiarize with the ocamllex lexer generator. Usually, a lexer takes as input a string of text and gives as output a stream of tokens, to be fed as input to a parser which will produce an abstract syntax tree. In this project, we will focus only on the lexer: combining it with a parser will be the goal of the next project.

You will work on the following files in the lib directory:

token.ml, defining the type of tokens
lexer.mll, defining the lexer generator
main.ml, containing utility functions to use the lexer.

The main routine in bin/main.ml reads a line from the stdin, and processes it according to the command line:

dune exec toylexer

which prints the sequence of tokens fed from stdin. The utility functions for this command are already implemented in lib/main.ml. You can test the command as follows:

echo "x=1; y=x+1" | dune exec toylexer

If everything is fine, you should see as output:

ID(x) ASSIGN CONST(1) SEQ ID(y) ASSIGN ID(x) PLUS CONST(1) EOF

which is the tokenization of the input string.

The project requires you to work at the following tasks:

Task 1

Extend the Toylexer library so that the command:

dune exec toylexer freq <n>

prints the list of the n most frequent tokens. To this purpose, you must implement the function

frequency : int -> 'a list -> ('a * int) list

in lib/main.ml. This function takes as input an integer n and a list of tokens, and outputs a list of length at most n. This list contains pairs (t,tn) where t is a token and tn is the number of its occurrences in the input. The resulting list is ordered by the number of occurrences, in descending order.

For instance:

frequency 3 [ID("x"); ASSIGN; ID("y"); SEQ; ID("x"); ASSIGN; ID("x"); PLUS; CONST("1")];;
- : (token * int) list = [(ID "x", 3); (ASSIGN, 2); (ID "y", 1)]

Note

This is a good opportunity to use the List API of OCaml's standard library: https://ocaml.org/manual/5.2/api/List.html

Task 2

Extend the lexer with the following types of tokens:

ATOK: all strings of letters and digits starting by a capital letter
BTOK: all strings of lowercase vowels
CTOK: all strings of letters containing at most one vowel
DTOK: all strings of digits possibly starting with - and possibly containing a . followed by other digits (e.g., 3.14, -7., -.3)
ETOK: all strings representing hexadecimal numbers in C syntax (i.e., strings starting with 0x or 0X, and containing hexadecimal value HH, where HH is 1 or more hex digits, '0'-'9','A'-'F','a'-'f')

If a string matches more than one of these patterns, then the assigned token must be the one highest on the list. Furthermore, all these patterns have priority over ID and CONST.

Note

The syntax of ocamllex's regular expressions is explained here: https://ocaml.org/manual/5.2/lexyacc.html#ss:ocamllex-regexp

Task 3

Implement unit tests in the test directory.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

toylexer

toylexer

README.md

Lexing

Task 1

Task 2

Task 3

Files

toylexer

Directory actions

More options

Directory actions

More options

Latest commit

History

toylexer

Folders and files

parent directory

README.md

Lexing

Task 1

Task 2

Task 3