by Tim Henderson (tadh@case.edu)
Copyright 2016, Licensed under the GPL version 3. Please reach out to me directly if you require another licensing option. I am willing to work with you.
A parser for the graphviz dot language in the Go programming language. It is intended to be stream oriented for parsing large graphs.
Dot is a relatively simple language and can be parsed with a clean separation between the lexing and parsing phases.
NODE = "node"
EDGE = "edge"
GRAPH = "graph"
DIGRAPH = "digraph"
SUBGRAPH = "subgraph"
STRICT = "strict"
LSQUARE = "["
RSQUARE = "]"
LCURLY = "{"
RCURLY = "}"
EQUAL = "="
COMMA = ","
SEMI = ";"
COLON = ":"
ARROW = "->"
DDASH = "--"
ID = ([a-zA-Z_][a-zA-Z0-9_]*)|("([^\"]|(\\.))*")|ID-HTML
COMMENT = (/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*$)
Note: The ID
token is has 3 forms:
-
The usual form as a name
[a-zA-Z_][a-zA-Z0-9_]*
-
A string,
"([^\"]|(\\.))*"
. Thus"\\\""
is valid but"\\""
is not -
A HTML string, which is non-regular:
CHAR = [^<>] ID-HTML = IdHTML IdHTML : Tag ; Tag : < Body > ; Body : CHAR Body ; | Tag Body | e // denotes epsilon, the empty string ;
Thus
<<xyz<xy>xyz><asdf>>
is valid but<<>
is not
This is the grammar as given on http://www.graphviz.org/doc/info/lang.html
graph : [ strict ] (graph | digraph) [ ID ] '{' stmt_list '}'
stmt_list : [ stmt [ ';' ] [ stmt_list ] ]
stmt : node_stmt
| edge_stmt
| attr_stmt
| ID '=' ID
| subgraph
attr_stmt : (graph | node | edge) attr_list
attr_list : '[' [ a_list ] ']' [ attr_list ]
a_list : ID '=' ID [ (';' | ',') ] [ a_list ]
edge_stmt : (node_id | subgraph) edgeRHS [ attr_list ]
edgeRHS : edgeop (node_id | subgraph) [ edgeRHS ]
node_stmt : node_id [ attr_list ]
node_id : ID [ port ]
port : ':' ID [ ':' compass_pt ]
| ':' compass_pt
subgraph : [ subgraph [ ID ] ] '{' stmt_list '}'
compass_pt : (n | ne | e | se | s | sw | w | nw | c | _)
A grammar suitable for bottom up parsing
Graphs : Graphs Graph
| Graph
;
Graph : GraphStmt
| COMMENT
;
GraphStmt : GraphType GraphBody
| GraphType ID GraphBody
| STRICT GraphType GraphBody
| STRICT GraphType ID GraphBody
;
GraphType : GRAPH
| DIGRAPH
;
GraphBody : LCURLY StmtList RCURLY
| LCURLY RCURLY
;
StmtList : Stmt StmtList
| Stmt SEMI StmtList
| Stmt
| Stmt SEMI
;
Stmt : ID EQUAL ID
| NodeStmt
| EdgeStmt
| AttrStmt
| SubGraph
| COMMENT
;
NodeStmt : NodeId
| NodeId AttrLists
;
NodeId : ID
| ID Port
;
Port : COLON ID
| COLON ID COLON ID // where second ID in "n", "ne", "e", "se", "s", "sw",
// "w", "nw", "c", "_"
;
AttrStmt : AttrType AttrLists
;
AttrType : NODE
| EDGE
| GRAPH
;
AttrLists : AttrList AttrLists
| AttrList
;
AttrList : LSQUARE AttrExprs RSQUARE
| LSQUARE RSQUARE
;
AttrExprs : AttrExpr AttrExprs
| AttrExpr
;
AttrExpr : ID EQUAL ID
| ID EQUAL ID COMMA
| ID EQUAL ID SEMI
;
EdgeStmt : EdgeReciever EdgeRHS
| EdgeReciever EdgeRHS AttrList
;
EdgeReciever : NodeId
| SubGraph
;
EdgeRHS : EdgeOp EdgeReciever EdgeRHS
| EdgeOp EdgeReciever
;
EdgeOp : ARROW // only valid for digraph
| DDASH // only valid for graph
;
SubGraph : GraphBody
| SUBGRAPH GraphBody
| SUBGRAPH ID GraphBody
;
A grammar suitable for top down (recursive descent parsing).
This should be LL(1) but I haven't taken the time to prove it. Use with caution.
Graphs : Graph Graphs
| Graph
;
Graph : GraphStmt
| COMMENT
;
GraphStmt : GraphStart GraphBody
;
GraphStart : GraphType
| GraphType ID
| STRICT GraphType
| STRICT GraphType ID
;
GraphType : GRAPH
| DIGRAPH
;
GraphBody : LCURLY Stmts RCURLY
| LCURLY RCURLY
;
Stmts : Stmt Stmts
| e
;
Stmt : Stmt' SEMI
| Stmt'
;
// The first two non-terminals are actually redundant. However, they solve
// grammatical ambiguity with respect to edges, nodes, attrs and subgraphs. By
// using them you will not need arbitrary look-ahead
Stmt' : StmtIDStart // matches things starting with ID
| StmtSubGraphStart // matches things starting with Subgraph
| EdgeStmt
| AttrStmt
| NodeStmt
| SubGraph
| COMMENT
;
StmtIDStart : ID AttrStmtCont
| ID NodeIdCont
;
AttrStmtCont : EQUAL ID
;
NodeIdCont : Port EdgeCont
| Port AttrLists
| EdgeCont
| AttrLists
;
StmtSubGraphStart : SubGraph EdgeCont
| SubGraph
;
EdgeStmt : EdgeReciever EdgeCont
;
EdgeCont : EdgeRHS AttrList
;
EdgeReciever : NodeId
| SubGraph
;
AttrStmt : AttrType AttrLists
| ID EQUAL ID
;
NodeStmt : NodeId AttrLists
;
SubGraph : SubGraphStart GraphBody;
NodeId : ID
| ID Port
;
Port : COLON ID
| COLON ID COLON ID // where second ID in "n", "ne",
// "e", "se", "s", "sw", "w", "nw",
// "c", "_"
;
AttrType : NODE
| EDGE
| GRAPH
;
AttrLists : AttrList AttrLists
| e
;
AttrList : LSQUARE AttrExprs RSQUARE
;
AttrExprs : AttrExpr AttrExprs
| e
;
AttrExpr : ID EQUAL ID
| ID EQUAL ID COMMA
| ID EQUAL ID SEMI
;
EdgeRHS : EdgeOp EdgeReciever EdgeRHS'
;
EdgeRHS' : EdgeOp EdgeReciever EdgeRHS'
| e
;
EdgeOp : ARROW // only valid for digraph
| DDASH // only valid for graph
;
SubGraphStart : SUBGRAPH GraphBody
| SUBGRAPH ID GraphBody
| e
;