-
Notifications
You must be signed in to change notification settings - Fork 9
/
CIF2-EBNF.txt
328 lines (271 loc) · 12.5 KB
/
CIF2-EBNF.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
(*
* Extended Backus-Naur Form of the CIF2 syntax and grammar
*
* The CIF2 syntax is closely related to the STAR2 syntax published
* by Spadaccini and Hall (J. Chem. Inf. Model., 2012, 52 (8),
* pp 1901-1906 DOI: 10.1021/ci300074v. The CIF1.1 syntax was derived from
* the original STAR syntax.
*
* The allowed character set is those having Unicode
* code points U+0009, U+000A, U+000D, U+0020 to U+D7FF, and
* U+E000 to U+10FFFD, less code points of the form U+xFFFE and U+xFFFF,
* where x is any hexadecimal digit (including 0). Note that the U+0007
* character used in STAR2 for escaping string terminators is not used
* in CIF2.
*
* This document follows EBNF syntax as given in ISO/IEC 14977. In
* particular, the common "+" notation is replaced by option or
* repeat brackets, e.g. "(ab)+" is replaced by "(ab), {ab}". Moreover,
* the provided EBNF applies to sequences of Unicode characters -- not
* sequences of bytes or encoded characters -- independent of any
* character encoding scheme (but see also below).
*
* This particular EBNF uses the "special sequence" mechanism to
* represent Unicode code points and code point ranges. Individual
* code points are represented in a special sequence by the form
* U+[[h]h]hhhh -- that is, "U+" followed by the 4 to 6 hexadecimal
* digits of the unsigned, 21-bit code point value. Ranges are
* represented by two such code point values separated by a hyphen
* character. Whitespace is permitted on either or both sides of
* any code point value within this special sequence formulation.
*
*
* CIF2 Grammar and Syntax
*
* CIF 2.0 is a binary format consisting of Unicode text encoded
* in UTF-8. Notwithstanding its use of (nearly) the full Unicode
* character repertoire, CIF applies only the semantics described below
* to decoded character data, especially with respect to whitespace and
* line termination. Software consuming CIF data and names generally
* ascribes additional semantics to them, however, which may include
* additional Unicode semantics.
*
* A CIF2 file consists of a sequence of data blocks separated by
* whitespace, with a required magic code as the first characters.
* Data may additionally be structured into save frames, which have
* the same form as data blocks and nest within data blocks. The data
* themselves may appear as individual (key, value) pairs, and / or they
* may be organized into tabular structures called "loops".
*
* All data names appearing directly within a given data block or save
* frame are required to be distinct with respect to a
* normalization / case-folding procedure described elsewhere.
* In the same way, data block names must be distinct within their CIF,
* and save frame names must be distinct within their immediate
* container (either a data block or a save frame).
*
* The main grammatic elements of CIFs are data block headers,
* save frame headers, data names, data values, and a few keywords.
* These all must always be separated from each other by whitespace.
* However, text fields (one kind of data value) have a line terminator as
* part of their delimiter, and that line terminator can do double duty
* to satisfy the whitespace separation requirement. Additionally,
* the requirement does not apply to separating the delimiters of
* list and table values from member values. The productions below
* explicitly include whitespace wherever it is allowed or required.
*)
(*
* This formulation of the CIF2-file production accepts the end of the
* input as whitespace for the purpose of satisfying the CIF 2.0
* requirement that the version comment be followed by whitespace.
* It incorporates the CIF 2.0 limitation (also in CIF 1.1) that no
* line of a CIF may exceed 2048 characters.
*)
CIF2-file = ( file-heading, [ line-term, [
wspace-any, data-block,
{ wspace, data-block }
], [ wspace ], [ comment ] ] )
- ( { allchars }, 2049 * char, { allchars } );
file-heading = [ ?U+FEFF? ], magic-code, { inline-wspace } ;
(*
* The "magic code" identifies the CIF version with which an instance
* document claims to comply.
*)
magic-code = '#\#CIF_2.0' ;
(*
* A datablock consists of a data heading followed by zero or more
* data items and save frames.
*)
data-block = data-heading, { block-content } ;
data-heading = data-token, container-code ;
(*
* Each element inside a data block is either data or a save frame, separated
* from the header or previous element by whitespace.
*)
block-content = wspace, ( data | save-frame ) ;
(*
* A save frame has content similar to a data block's, but it resides
* inside a data block instead of at the top level. (Save frames do not nest.)
*)
save-frame = save-heading, { frame-content }, wspace, save-token ;
save-heading = save-token, container-code;
(*
* A save frame contains a sequence of data elements, each separated from the
* header or previous element by whitespace.
*)
frame-content = wspace, data ;
(* A data block or save frame name is composed of one or more non-blank characters *)
container-code = non-blank-char, { non-blank-char } ;
(* Data appear either as key/value pairs, or within loops. *)
data = ( data-name, wspace-data-value ) | data-loop ;
(*
* A data loop consists of a loop header (the case-insensitive word "loop_"
* followed by a sequence of datanames) and then a sequence of one or more
* whitespace-separated values. Though it cannot be expressed in EBNF, CIF
* requires that a loop whose header contains N data names must contain an
* integral multiple of N data values.
*)
data-loop = loop-token, wspace, data-name, { wspace, data-name },
wspace-data-value, { wspace-data-value } ;
(*
* A dataname begins with an underscore character, and contains one or more
* additional, non-blank characters.
*)
data-name = '_' , non-blank-char, { non-blank-char } ;
(*
* A list contains zero or more whitespace-separated values. The delimiting
* brackets may optionally be separated by whitespace from the values,
* or from each other if there are no values.
*)
list = '[', [ list-values-start, { wspace-data-value } ], [ wspace ], ']' ;
list-values-start =
( wspace-any, nospace-value )
| ( wspace-any, [ comment ], text-field )
| ( [ { wspace-to-eol }, inline-wspace, { inline-wspace } ], wsdelim-string )
| ( wspace-to-eol, { wspace-to-eol }, wsdelim-string-sol ) ;
(*
* A table contains zero or more whitespace-separated key/value pairs.
* The delimiting brackets may optionally be separated by whitespace from
* the values, or from each other if there are no values.
*)
table = '{', [ wspace-any, table-entry, { wspace, table-entry } ], [ wspace ], '}' ;
(*
* Key-value pairs appearing in a table structure take the form 'key':value,
* where key must be a delimited string (but not a text block) and the value
* may be any data value. Whitespace is permitted between the colon and value.
*)
table-entry = ( quoted-string | triple-quoted-string ),
':', ( nospace-value | wsdelim-string | wspace-data-value ) ;
(*
* In most contexts, data values must be preceded by explicit and/or
* implicit whitespace. Only text fields have implicit leading whitespace.
* Additionally, the whitespace preceding an whitespace-delimited string
* affects the form that string may take.
*)
wspace-data-value =
( wspace, nospace-value )
| ( [ wspace-lines ], inline-wspace, { inline-wspace }, wsdelim-string )
| ( wspace-lines, wsdelim-string-sol )
| ( [ wspace ], [ comment ], text-field ) ;
(*
* These data values have neither implicit leading whitespace nor any special
* sensitivity to leading whitespace.
*)
nospace-value =
quoted-string
| triple-quoted-string
| list
| table ;
(*
* Whitespace-delimited strings draw from a subset of the CIF character set,
* have an even more limited first character, and may not have the same form
* as a data block header, save frame header, or any of several reserved
* words. When they appear at the start of a line, whitespace-delimited
* strings may not start with a semicolon.
*)
wsdelim-string-sol = wsdelim-string - ( ';', { non-blank-char } ) ;
wsdelim-string = ( lead-char, {restrict-char} )
- ( ( ( data-token | save-token ), { non-blank-char } ) | loop-token | global-token | stop-token ) ;
lead-char = restrict-char - ( '"' | '#' | '$' | "'" | '_' ) ;
restrict-char = non-blank-char - ( '[' | ']' | '{' | '}' ) ;
(* quote-delimited or apostrophe-delimited string *)
quoted-string = ( quote-delim, quote-content, quote-delim )
| ( apostrophe-delim, apostrophe-content, apostrophe-delim ) ;
quote-content = { char - quote-delim } ;
quote-delim = '"' ;
apostrophe-content = { char - apostrophe-delim } ;
apostrophe-delim = "'" ;
(* triple-quote-delimited and triple apostrophe-delimited strings *)
triple-quoted-string = ( quote3-delim, quote3-content, quote3-delim )
| ( apostrophe3-delim, apostrophe3-content, apostrophe3-delim ) ;
quote3-delim = '"""' ;
quote3-content = { [ '"', [ '"' ] ], not-quote, { not-quote } } ;
not-quote = allchars - '"' ;
apostrophe3-delim = "'''" ;
apostrophe3-content = { [ "'", [ "'" ] ], not-apostrophe, { not-apostrophe } } ;
not-apostrophe = allchars - "'" ;
(* text block *)
text-field = text-delim, text-content, text-delim ;
text-delim = line-term, ';' ;
text-content = { allchars } - ( { allchars }, text-delim, { allchars } ) ;
(*
* CIF keywords are case-insensitive.
*
* The global and stop tokens are part of the original STAR specification;
* they are reserved in case of future use in CIF.
*)
data-token = ( 'D' | 'd' ), ( 'A' | 'a' ), ( 'T' | 't' ), ( 'A' | 'a' ), '_';
save-token = ( 'S' | 's' ), ( 'A' | 'a' ), ( 'V' | 'v' ), ( 'E' | 'e' ), '_';
loop-token = ( 'L' | 'l' ), ( 'O' | 'o' ),( 'O' | 'o' ), ( 'P' | 'p' ) , '_' ;
global-token = ( 'G' | 'g' ), ( 'L' | 'l' ), ( 'O' | 'o' ), ( 'B' | 'b' ), ( 'A' | 'a' ), ( 'L' | 'l' ), '_' ;
stop-token = ( 'S' | 's' ), ( 'T' | 't' ), ( 'O' | 'o' ), ( 'P' | 'p' ), '_' ;
(* A single non-whitespace character *)
non-blank-char = char - inline-wspace ;
(*
* Runs of spaces, tabs, line terminators, and comments are
* required to separate many higher-level elements of this grammar. In
* most contexts such runs may not begin with comments. For the purposes
* of these particular productions, such runs *never* end with comments.
*)
(*
* a nonempty run of whitespace and possibly comments, beginning with inline
* whitespace or a line terminator; may span multiple lines
*)
wspace = ( inline-wspace | line-term ), wspace-any;
(*
* a nonempty run of whitespace and possibly comments, beginning with inline
* whitespace or a line terminator, and ending with a line terminator; may
* span multiple lines
*)
wspace-lines = [ inline-wspace, { inline-wspace }, [ comment ] ], line-term, { wspace-to-eol } ;
(*
* a possibly-empty run of whitespace and comments; may begin with a comment,
* and may span multiple lines
*)
wspace-any = { wspace-to-eol }, { inline-wspace } ;
(*
* a run of zero or more spaces and/or tabs, optionally followed by a comment,
* always terminated by a line terminator.
*)
wspace-to-eol = { inline-wspace }, [ comment ], line-term ;
(*
* A comment is a hash symbol followed by every character up to, but not
* including, the end of the line
*)
comment = '#', { char } ;
(* 'char' represents any allowed character other than line-term *)
char = allchars - line-term ;
(*
* Only ASCII space and tab characters are significant as inline
* whitespace. Unicode's classification of certain other code points as
* whitespace is not significant for the purposes of CIF.
*)
inline-wspace = ?U+0020? | ?U+0009? ;
(*
* The two-character sequence U+000D U+000A is recognized as a line
* terminator, as are each of the characters U+000A and U+000D when they
* appear outside such a sequence. Similarly to XML, CIF2 interprets
* each of these line termination sequences as the single character U+000A,
* wherever they appear in a CIF instance document, as if that
* translation were performed prior to parsing.
*)
line-term = ( ?U+000D?, [ ?U+000A? ] ) | ?U+000A? ;
(* For ease of specification we define a token for the full character set. *)
allchars = ?U+0009? | ?U+000A? | ?U+000D? | ?U+0020 - U+007E?
| ?U+00A0 - U+D7FF? | ?U+E000 - U+FDCF? | ?U+FDF0 - U+FFFD?
| ?U+10000 - U+1FFFD? | ?U+20000 - U+2FFFD? | ?U+30000 - U+3FFFD?
| ?U+40000 - U+4FFFD? | ?U+50000 - U+5FFFD? | ?U+60000 - U+6FFFD?
| ?U+70000 - U+7FFFD? | ?U+80000 - U+8FFFD? | ?U+90000 - U+9FFFD?
| ?U+A0000 - U+AFFFD? | ?U+B0000 - U+BFFFD? | ?U+C0000 - U+CFFFD?
| ?U+D0000 - U+DFFFD? | ?U+E0000 - U+EFFFD? | ?U+F0000 - U+FFFFD?
| ?U+100000 - U+10FFFD? ;