<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://docbook.org/xml/5.1/rng/docbook.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://docbook.org/xml/5.1/sch/docbook.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<article xmlns="http://docbook.org/ns/docbook"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.1">
<info>
<title>Writing Fragment Identifiers</title>
<legalnotice>
<info>
<copyright>
<year>2021</year>
<year/>
<holder>Joel Kalvesmaki</holder>
</copyright>
<author>
<personname><firstname>Joel</firstname>
<surname>Kalvesmaki</surname></personname>
</author>
</info>
<remark>This document is released under a Creative Commons Attribution 4.0 International
License: <link xlink:href="http://creativecommons.org/licenses/by/4.0/"
>http://creativecommons.org/licenses/by/4.0/</link>
</remark>
</legalnotice>
<revhistory>
<revision>
<date>2021-06-27</date>
<revdescription>
<para>Prepublication draft, version 0.02</para>
</revdescription>
</revision>
</revhistory>
</info>
<warning>
<para>This prepublication draft document is subject to major revisions. Requests for changes
should be directed to the editor, either by email (kalvesmaki@gmail.com) or by GitHub
ticket. </para>
<para>Things to do:<itemizedlist>
<listitem>
<para>Enlist specialists to critique this document.</para>
</listitem>
<listitem>
<para>Discuss questions raised throughout.</para>
</listitem>
<listitem>
<para>Develop a suite of about 100 examples drawn from a range of types of
reference systems.</para>
</listitem>
<listitem>
<para>Write XSpec test suite.</para>
</listitem>
</itemizedlist></para>
</warning>
<warning>
<para>Look throughout the document for warnings, which register significant doubts the
editor wants to leave for discussion and deliberation.</para>
</warning>
<warning>
<para>For about eight years I've been thinking about reference systems and pointers. That
thinking has shaped the design of <link xlink:href="http://textalign.net">Text Alignment
Network</link> (TAN) XML, but I have adopted a somewhat different perspective in
writing these specifications, because my primary design concern has been not a data
format but an unambiguous, clear URI fragid. That is, the problem I've set out to solve
is rather different, so I've taken a different approach. In fact, having finished the
first draft, I have realized that TAN is not completely WF-conformant; it is
somewhat easier to define a customization of TEI for WF conformance.</para>
</warning>
<section>
<title>Abstract</title>
<para>This document defines the syntax for a fragment identifier ("fragid") appended to any URI that
designates a piece of writing. Such <emphasis>Writing Fragids</emphasis> (WFs) enhance
the URI, by allowing anyone to cite a particular part of a work, and providing a method
for organizing, retrieving, and disseminating digital content in a variety of media and
formats. </para>
<para>These specifications for WFs define a semantic fragid structure, and not a method for
negotiating a single media type. They explain how to construct WF URIs, and establish
conformance requirements, both for media formats and for applications that extract
content. </para>
<para>Constructing a fragid syntax that restricts a reference unambiguously to a specific
part of a piece of writing poses numerous challenges. Writing fragids were designed to
model habits of citation, which have developed in complex ways over centuries. WF
specifications isolate a subset of citation practices and correlate it with a syntax
that enables the creation of a fragid that is just as persistent and unique as the URI
it modifies.</para>
</section>
<section>
<title>Introduction</title>
<para>Numerous Uniform Resource Identifiers (URIs) designate pieces of writing. Some point
to a writing in the abstract without reference to a particular version or edition (e.g.,
<link xlink:href="http://dbpedia.org/resource/Iliad"/> for the
<emphasis>Iliad</emphasis>), whereas others point to a particular written artefact,
without reference to what kind of writing it is (e.g.,
urn:uuid:61c31a6d-c45a-4093-b25f-5a9c94566172 for the receipt in my pocket). In either
case, someone may wish to use the URI not merely to name the writing, but to designate a
specific range, place, or portion of the writing. One practical way to do so would be to
add a fragment identifier ("fragid") to the URI.</para>
<para>This document defines the rules for <emphasis>Writing Fragids</emphasis> (WFs),
designed to be applied to any URI that designates either an abstract written work or an
object that has writing on it. Although WFs may be applied to any type of writing, they
have been designed primarily for those that are objects of citation, and survive in
non-digital media (e.g., books, articles, newspapers, manuscripts, inscriptions, papyri,
ostraca). <warning>
<para>Currently, the WF specifications are written claiming to support all types of
scripta and works, later making exceptions (opt-out). While writing the
examples, I thought about an alternative approach, namely, to predefine a few of
the most common reference patterns, and say that no scriptum or work could be a
candidate for WFs unless it conformed (opt-in). I can see arguments pro and con
for either approach.<orderedlist>
<listitem>
<para>Opt-out: a general set of rules applicable all the time for every
type of text, regardless, with some exceptions excluded. If there
are inconcinnities, then the hell with it. Anyone who encodes a WF
URI / WF-compliant text that relies upon an idiosyncratic
understanding of a scriptum or work, well, that's on them, and WFs
will be ignored because they are not approaching the material the
same way everyone else is. This approach, adopted in this
version of the specs, would keep the number of rules to a
minimum.</para>
</listitem>
<listitem>
<para>Opt-in: rules require that anyone who encodes a WF URI /
WF-compliant text declare such-and-such a predefined WF reference
system type, and don't worry about unsupported texts. As the WF
specifications develop, with new versions of the WF specs, important
reference systems can be added to increase the types of supported
writings. This approach would require a lot more rules declared at
the outset.</para>
</listitem>
</orderedlist>I do not have a clear sense on which way the needle should
point.</para>
</warning></para>
<para>URIs with WFs enable two kinds of activity that are otherwise difficult or
impossible:</para>
<para>
<orderedlist>
<listitem>
<para><emphasis role="bold">Shared URI-based citations to specific parts of a
work</emphasis>. Rather than pointing to an entire work, one can specify
a particular part. This in turn allows one to build computer-actionable
statements that are more precise. One can build such WF-URIs without
depending upon a particular digital surrogate. In fact, one can coin and use
WF URIs for writings that have not yet been digitized.</para>
</listitem>
<listitem>
<para><emphasis role="bold">A URI-based protocol for requesting and delivering
digital resources</emphasis>. A WF-URI is media independent, and may
correspond to text, image, audio, video, or other types of resources. The WF
specifications permit designers of data formats to allow users to expose
their data to WF URIs, and developers of applications to retrieve and
deliver requests that include WFs.</para>
</listitem>
</orderedlist>
</para>
<para>These guidelines adhere to the <link
xlink:href="https://www.w3.org/TR/fragid-best-practices/">Best Practices for
Fragment Identifiers and Media Type Definitions</link>, specifically the section
addressing <link xlink:href="https://www.w3.org/TR/fragid-best-practices/#structures"
>fragid structures</link>. A semantic fragid structure such as WF declares a set of
meaningful syntactic rules that can be followed by designers of individual media types,
perhaps to be registered with the <link xlink:href="https://www.iana.org/">Internet
Assigned Numbers Authority (IANA)</link>. </para>
<para>Fragids in general allow processors to extract specific content from digital
resources. Writing Fragids are also intended for those purposes, but on an abstract
level, without specifying exactly which digital resources, if any, will match the WF.
Indeed, some WFs will be added to URIs that have no specification on how or where to
retrieve a matching document. Further, many WFs will be attached to URIs that identify
nondigital textual entities that may not have any corresponding digital resource for
years to come, if ever. Such resource-agnostic URIs are valuable, because they allow one
to make assertions about nondigital texts without having to rely upon any particular
digital surrogate.</para>
<para>Designers of WF-conformant media types must write their own guidelines defining
exactly how the individual components of a WF URI must be interpreted and resolved
against instances of the format. Developers creating parsers or applications handling a
specific WF-conformant data format must take into consideration both these guidelines
and the ones stipulated by any WF-conformant data format.</para>
<para>These specifications engage with technical material from two different areas: writing
technology and computer science. Some readers knowledgeable in one but not the other may
find reading sequentially through this document to be frustrating and confusing. The
following order is recommended for new readers: <xref linkend="motivation"/>, <xref
linkend="challenges"/>, <xref linkend="examples"/>, then other sections as interest
leads.</para>
</section>
<section>
<title>Conformance</title>
<important>
<para>This section is normative.</para>
</important>
<para>The key words <emphasis role="italic">must</emphasis>, <emphasis role="italic">must
not</emphasis>, <emphasis role="italic">required</emphasis>, <emphasis role="italic"
>should</emphasis>, <emphasis role="italic">should not</emphasis>, <emphasis
role="italic">recommended</emphasis>, <emphasis role="italic">may</emphasis>, and
<emphasis role="italic">optional</emphasis> in this specification are to be
interpreted as described in <link
xlink:href="https://www.w3.org/TR/fragid-best-practices/#bib-RFC2119"
>RFC2119</link>.</para>
<warning>
<para>This section is a bit of a hodge-podge, a mixture of ideals and harsh realities
faced when trying to write my conformance suite. It is also probably excessive. As
I've been developing both a media format and a processor, I've noted what I've
wanted to achieve.</para>
</warning>
<section>
<title>Media Formats</title>
<para><termdef><firstterm>WF-conformant media formats</firstterm> are media formats that
have been designed to allow files in the format to be parsed against WF URIs for
content.</termdef> In defining conformance of a media type to the Writing Fragid
specifications, designers must provide a narrative describing how to match the
component parts of a WF URI against a file, and how the file should be parsed to
extract matching content. Although not required, it is recommended that designers
include one or more algorithms, along with a test suite, to corroborate the
narrative programmatically.</para>
<para>Every WF-conformant media format <emphasis>must</emphasis> stipulate in its WF
guidelines which version of WF is supported. </para>
<para>WF was designed to support textual scholarship, which regards provenance as being
highly important. In many cases, when content is extracted from a WF-conformant
file, the recipient of the data will expect metadata that stipulates responsibility.
Therefore, every WF-conformant media format <emphasis>must</emphasis> define a
mechanism that supports provenance information. Whether such information must always
be included, or how it is validated, is left to designers of the media
format.</para>
<para>The WF syntax is not universally comprehensive. Some of the content in a
WF-conformant file may not be accessible to a WF URI. It is not required that all
the content in a matching WF-conformant file be accessible by a WF URI. Guidelines
for WF-conformant media formats must explain how to distinguish WF-qualifying
content from non-WF-qualifying content.</para>
<para>Some generic media formats (e.g., TEI XML, HTML) may be conducive to
WF-conformance, but may also permit multiple ways of being customized. Anyone
customizing a generic media format to be WF-conformant <emphasis>should</emphasis>
provide a unique name for the customization, and <emphasis>should</emphasis> develop
mechanisms that avoid conflict with any other existing WF-conformant definitions for
the same generic format.</para>
<para>Some media formats might be designed not to expose content for extraction
via WF URIs, but to be a generative source of WF URIs (e.g., a file format designed
for RDF triples that include WF URIs, or for data structures that can be converted
to WF URIs). No conformance requirements are stipulated of such WF-engaged formats
beyond the WF syntactic requirements defined below. </para>
<para>Other conformance requirements for media formats appear throughout these
specifications.</para>
</section>
<section>
<title>Processors</title>
<para><termdef>A <firstterm>WF processor</firstterm> is defined as an algorithm that
takes as input one or more WF URIs and one or more files or file fragments from
WF-conformant media formats, and returns as output the media content that
matches the WF URIs, perhaps along with associated metadata.</termdef> A WF
processor may be written in a variety of programming languages. </para>
<para>WF processors may feature, depend upon, or actually be, an application programming
interface (API) or some other type of interface. The WF specifications put no
strictures on such interfaces, aside from its processing requirements.</para>
<para>If a WF is added to a URI that begins with the regular expression
<code>https?://</code>, the URI is to be interpreted not as a URL per se (a
place where a particular digital resource is to be found) but as the identifying
name of a piece of literature. A WF processor <emphasis>may</emphasis> treat such a
URI as a location and attempt to retrieve data from it, but these specifications
place no strictures on the mediating web protocol or interface, or what type of
resource might be returned by a referenced server, or how to interpret the content
that is returned.</para>
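<para>As a non-normative illustration of the preceding paragraph, the following Python sketch
separates a URI from its fragid and reports whether the URI begins with
<code>https?://</code>. The function name <code>split_wf_uri</code>, the shape of the
returned tuple, and the fragid value <code>PLACEHOLDER</code> are assumptions made for
this example only; they are not defined by these specifications, and an actual WF would
follow the syntax defined elsewhere in this document.</para>
<programlisting language="python">import re

# Pattern quoted in the text above: URIs that begin with http:// or https://
HTTP_PREFIX = re.compile(r"https?://")


def split_wf_uri(uri: str) -> tuple[str, str, bool]:
    """Split a URI at the first '#' into (name, fragid, looks_like_url).

    looks_like_url only flags that the name begins with https?://; the name
    still identifies a writing rather than a location.
    """
    name, _, fragid = uri.partition("#")
    looks_like_url = HTTP_PREFIX.match(name) is not None
    return name, fragid, looks_like_url


# Hypothetical usage; PLACEHOLDER stands in for a real Writing Fragid.
print(split_wf_uri("http://dbpedia.org/resource/Iliad#PLACEHOLDER"))</programlisting>
<para>In this sketch the flag signals only that a processor might choose to attempt
retrieval; whether it does so, and how it interprets any response, remains
implementation-dependent.</para>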
<para>WF URIs were created to support provenanced claims about texts. It is
<emphasis>recommended</emphasis> that WF processors provide in any output
metadata that identifies the processor and its version and any provenance metadata
associated with extracted content. </para>
<para>A WF processor <emphasis>must</emphasis> expect input that conforms to the content
format defined by the target WF-conformant media. That processor
<emphasis>may</emphasis> apply further adjustments before it renders output. The
format and serialization of finalized output WF data are implementation-dependent,
but the representation of content drawn from a WF-conformant media format
<emphasis>should</emphasis> be lossless. Those who design WF processors will
likely need to constrain the output to a particular format (XML, JSON, tiff, mp4,
svg, etc.). <warning>
<para>Does this paragraph get too invasive? Those who write processors will do
whatever they want, in the end.</para>
</warning></para>
<para>A WF processor <emphasis>must</emphasis> define the types of WF-conformant media
it processes. A WF processor need not support all media types, but it
<emphasis>must</emphasis> support all syntactically valid WF URIs.</para>
<para>If a WF URI constructs reference nodes in a specific sequence, the corresponding
output <emphasis>must</emphasis> respect the same sequence.</para>
<para>Error reporting is implementation-dependent, unless otherwise specified. </para>
<para>A non-normative companion XSLT test suite accompanies these specifications. It can
be used to parse and validate WF URIs, or to build analogous processes in other
programming languages.<warning>
<para>Currently under construction at <link xlink:href="./wf-functions.xsl"
>wf-functions.xsl</link>.</para>
</warning></para>
<para>Other conformance requirements for WF processors appear throughout these
guidelines. Designers of WF processors <emphasis>must</emphasis> document how
implementation-dependent decisions are handled.</para>
</section>
</section>
<section xml:id="motivation">
<title>Motivation and Goals</title>
<important>
<para>This section is descriptive.</para>
</important>
<para>The need for Writing Fragids has grown with the influence of the Semantic Web, an
ecosystem of sharing data interoperably, based on the model defined by the <link
xlink:href="https://www.w3.org/RDF/">Resource Description Framework</link> (RDF).
RDF defines a relatively simple graph model for its basic datum, called a triple, a
graph component that consists of two nodes and one edge, named the subject (node),
predicate (edge), and object (node). As the names suggest, an RDF triple models an
everyday claim. </para>
<para>In the case of written texts, one might wish to say, "Plato's
<emphasis>Republic</emphasis> quotes from Homer's <emphasis>Iliad</emphasis>."
Converting such an assertion into RDF is at present relatively trivial (adopting Turtle
syntax, the most readable RDF serialization method):</para>
<programlisting>@prefix db: &lt;http://dbpedia.org/resource/&gt; .
@prefix cito: &lt;http://purl.org/spar/cito/&gt; .
db:Republic_(Plato) cito:cites db:Iliad .</programlisting>
<para>Or, to take an example from modern literature, the statement "Walter Burkert, in
<emphasis>Lore and Science in Ancient Pythagoreanism</emphasis>, quotes from Plato's
<emphasis>Laws</emphasis>" can be reduced to an RDF triple as follows:</para>
<programlisting>@prefix db: &lt;http://dbpedia.org/resource/&gt; .
@prefix cito: &lt;http://purl.org/spar/cito/&gt; .
@prefix wc: &lt;http://www.worldcat.org/oclc/&gt; .
wc:860129739 cito:includesQuotationFrom db:Laws_(dialogue) .</programlisting>
<para>This could apply as well to journal articles. The following claims that Sandrine
Zufferey and Bruno Cartoni, "A Multifactorial Analysis of Explicitation in Translation,"
<emphasis>Target: International Journal of Translation Studies</emphasis> 26.3
(2014): 361-384 cites Kinga Klaudy and Krisztina Károly, "Implicitation in Translation:
Empirical Evidence for Operational Asymmetry in Translation," <emphasis role="italic"
>Across Languages and Cultures</emphasis> 6.1 (2005): 13-28:
<programlisting>@prefix doi: &lt;https://doi.org/&gt; .
@prefix cito: &lt;http://purl.org/spar/cito/&gt; .
doi:10.1075/target.26.3.02zuf cito:cites doi:10.1556/Acr.6.2005.1.2 .</programlisting></para>
<para>RDF triples that refer to complete works are rather straightforward. But such
assertions, even if true, are too general to be useful. The citations and quotations by
Plato, Burkert, and Zufferey + Cartoni are made <emphasis>at</emphasis> specific pages,
<emphasis>of</emphasis> specific passages. Those who desire greater precision would
find more helpful sets of RDF triples that specify exactly where one item quotes the
other, i.e., at page A line B book X quotes from section C subsection D of work Y. </para>
<para>In making a triple more precise, the subject and object would remain the same, but
narrowed in scope to specific parts. The endeavor is analogous to current conventions
that allow one to point to a specific section of an individual image, video, audio file,
web page, XML file, and so forth. Many of these file-specific conventions depend upon
URI fragment identifiers that reference content within the resource. The present
specification for Writing Fragids proposes an analogous construction, for URIs that
point, not to digital resources, but to documents, publications, literature, and other
types of writing that lend themselves to referencing in general. Just as a URI can point
to a non-digital entity, so too can a WF.</para>
<para>A URI with a Writing Fragid enables RDF triples that point with greater precision to
written literature in any language, from any period of time. WF syntax does not depend
upon any particular digital service or digital file, so WF URIs can be made and used,
regardless of how many corresponding files exist, if any. The WF syntax allows a person
or algorithm to make a claim that points unambiguously to a particular location in a
work or item. That assertion, whose meaning is independent of any digital resource, can
then be used to extract relevant sections from any matching WF-conformant files, and
perhaps to make other inferences not envisioned by the creator of the original WF URI or
RDF triple. </para>
<para>Although the preceding use cases have focused on quotation, a WF URI may be used in
RDF triples for a variety of reasons, for example, to make assertions about textual
variation, dates, authorship, or even grammatical properties. Consider the statement,
"In Shakespeare's <emphasis>Henry VI</emphasis>, part 2, 1.4.32, 'Henry' is the
grammatical object."<note>
<para>The statement is reductive and not obviously true; the context is, "The duke
yet lives that Henry shall depose," a cunning amphiboly with two very different
interpretations based upon whether one interprets "Henry" as the grammatical
subject or object of "depose".</para>
</note> A WF-based RDF triple encoding this assertion might be built as follows (with a
placeholder for the Writing Fragid
component):<programlisting>@prefix db: &lt;http://dbpedia.org/resource/&gt; .
@prefix la: &lt;http://example.org/linguistic-annotation#&gt; .
@prefix olia: &lt;http://purl.org/olia/olia.owl#&gt; .
db:Henry_VI,_Part_2#[WRITING_FRAGID] la:hasFeature olia:DirectObject .</programlisting></para>
<para>These guidelines stipulate syntactic rules for constructing a WF such that the
modified URI remains persistent and unique, and therefore unambiguous. </para>
<para>Persistence and uniqueness are foundational requirements for any URI. Something may be
assigned many URIs, perhaps by even the same person or organization, but no person
should wittingly assign one URI to two or more different resources (or classes of
resources), or create a URI that could be interpreted in mutually exclusive ways. The
same should hold for URI fragids. </para>
<para>The WF syntax has been designed to avoid ambiguity, at the expense of some brevity and
legibility. Anyone who learns the syntax can write and read a WF URI, but its full
meaning may not be apparent <emphasis>prima facie</emphasis>.</para>
<para>Writing Fragids have been designed to model citation practices, which have challenging
features (discussed below). Writers frequently refer to other writings in a terse syntax
that permits multiple interpretations. They take for granted that the reader comes with
a shared context to understand and resolve a citation. Such tacit knowledge has been
reliable in writing for human understanding. But URIs, which are required for computer
"understanding", must be based upon explicit and unambiguous conventions. Because WF
URIs are to be persistent and unique, traditional citation practices that are at present
not easily convertible are excluded from this version of the WF specifications. </para>
</section>
<section xml:id="challenges">
<title>Challenges</title>
<important>
<para>This section is descriptive.</para>
</important>
<para>A Writing Fragid is a sequence of characters (letters, numbers, punctuation) that models
the core features of traditional references. Reference systems are essential and
ubiquitous, but poorly studied and understood. An overarching theory or framework of
citation practices is elusive. As of this writing, there does not even exist a relevant
Wikipedia entry on the subject. Many reference systems feature a number of
irregularities that challenge any effort to create a WF URI that is persistent, unique,
and interoperable.</para>
<para>Many texts have more than one reference system. For example, if you are asked to look
up Aristotle, <emphasis>Categories</emphasis> 4, do you go to chapter four, or to Bekker
page number 4? It is impossible to tell, since that particular text has two major
reference systems, and the reference is valid in both systems. If you encounter a
reference to the Bible verse Joshua 9:2a, it is clear that one means to point to chapter
nine, verse two, but does the "a" point to the first part of verse two (and how many
parts are there?), or does it mean verse 2a, i.e., an extra verse between Joshua 9:2 and
9:3? All three options are possible, because Joshua 9:2 can be divided however one
likes, and in the best critical editions of the Septuagint version of the Old Testament,
extra verses appear between Joshua 9:2 and 9:3 and are given by their editors subletters
(Joshua 9:2a, 9:2b, etc.). </para>
<para>Most human readers familiar with the conventions for a given work can quickly and
easily interpret terse, vague, and ambiguous syntax, but computers cannot. Any algorithm
that currently negotiates such differences has been extensively programmed, usually on
the basis of a small set of controlled vocabulary and a limited set of assumptions. No
algorithm has been written to disambiguate any reference to any body of
literature.</para>
<para>One may argue that one should adopt the reference system most people agree upon for a
given work. Although this proposal, based on the notion of a so-called canonical
reference system, is well motivated, it is inadequate, even for very common examples. As
noted above with Aristotle's <emphasis>Categories</emphasis>, many written works have
two or more independent reference systems (e.g., in a book, chapter/section numbers
versus page/line numbers, as in Plato's corpus). Texts with an unambiguous canonical
reference system are rare. Even within the very text collection that gave us the notion
of "canon", the Bible, there are standard reference systems that conflict with each
other. For example, the reference Psalm 30.1 points to three different passages,
depending upon whether the reference is interpreted as being reliant on English, Hebrew,
or Greek Septuagint versification:</para>
<para>
<table frame="all">
<title>Psalm 30.1</title>
<tgroup cols="3">
<colspec colname="c1" colnum="1" colwidth="1*"/>
<colspec colname="c2" colnum="2" colwidth="1*"/>
<colspec colname="c3" colnum="3" colwidth="1*"/>
<thead>
<row>
<entry>English (KJV)</entry>
<entry>Hebrew (JPS)</entry>
<entry>Septuagint (NETS)</entry>
</row>
</thead>
<tbody>
<row>
<entry>I will extol thee, O LORD; for thou hast lifted me up, and hast
not made my foes to rejoice over me.</entry>
<entry>A psalm of David. A song for the dedication of the House.</entry>
<entry>Regarding completion. A Psalm. Pertaining to Dauid.</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
<para>In this case, Psalm 30.1 English = Psalm 30.2 Hebrew = Psalm 29.2 Septuagint. Roughly
10% of the verses in the Tanakh/Old Testament are subject to such reference variants in
the three major versification schemes used today. </para>
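<para>To make the equivalence concrete, the following Python sketch records the single
correspondence stated above and converts a reference from one versification scheme to
another. The table <code>EQUIVALENT_REFS</code>, the scheme labels, and the function
<code>convert_ref</code> are hypothetical names introduced for illustration; a real
concordance would need an entry for every verse whose numbering differs.</para>
<programlisting language="python">from typing import Optional

# Each set groups references that designate the same passage.
# Only the one equivalence stated in the text is recorded here.
EQUIVALENT_REFS = [
    {("English", 30, 1), ("Hebrew", 30, 2), ("Septuagint", 29, 2)},
]


def convert_ref(scheme: str, psalm: int, verse: int,
                target_scheme: str) -> Optional[tuple[int, int]]:
    """Return (psalm, verse) in target_scheme, or None if no mapping is known."""
    source = (scheme, psalm, verse)
    for group in EQUIVALENT_REFS:
        if source in group:
            for other_scheme, other_psalm, other_verse in group:
                if other_scheme == target_scheme:
                    return (other_psalm, other_verse)
    return None


# Hypothetical usage: look up the Hebrew numbering of English Psalm 30.1.
print(convert_ref("English", 30, 1, "Hebrew"))</programlisting>
<para>Returning <code>None</code> for an unknown reference, instead of assuming that
the numbering is identical, keeps the sketch from silently guessing.</para>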
<para>There are many, many other works with comparable discordant, ambiguous, or vague
reference systems. For example, many classical works are assigned specific references
corresponding to the page, column, and line numbers of the first critical edition (e.g.,
Bekker numbers for Aristotle, Stephanus numbers for Plato). But without exception, every
20th- and 21st-century edition and translation of texts that adopts these
edition-specific systems must adapt and adjust those systems. A particular word might
fall before, after, or across, a particular line number.</para>
<para>Another type of challenge pertains to the labels used in reference systems. Many
reference systems rely on numerals, or ordered letters that enumerate textual units. But
many other labels are not enumerated, and reference conventions differ widely. To draw
again from Biblical literature, reference terms for books are not standardized. The book
commonly called 1 Samuel has the name 1 Kingdoms in the Septuagint, and 1 Kings is 3
Kingdoms. When one refers to the book Ezra, any number of books might be meant, across a
range of modern national or religious traditions. Some references are given in full,
others in abbreviated form. The latter increases the chance of confusion. For example,
"Jo" could mean the Biblical books John, Job, Joel, or Jonah. When expanding to other
languages, the number of potential problems multiplies. </para>
<para>Many generic textual units normally do not admit number labels but are referred to by
their type. For example, one may reference a title through "title", "ti", "t" or the
like, presenting choices that could lead to confusion. Does "ep" refer to an epilogue or
an epistle; does "no" mean note or number? Some text components are difficult to name
and cite (e.g., parts of a liturgical text). As a result, labels for, and references to,
special types of textual divisions are created ad hoc, according to custom, but not any
controlled vocabulary.</para>
<para>Even enumerated sequences are not straightforward. As noted above, with Joshua 9:2a,
some references depend upon compound numbers. Different types of numeration systems
exist, e.g., Roman numerals and letter labels. Although these can invariably be
converted into Arabic numerals, individual cases may not be clear without an explicit
context. For example, a reference to "cc" could mean 200 if pointing to a text that uses
Roman numerals, and either 29 or a different number if using letter labels (some letter
label systems assume incomplete alphabets, or have different rules for enumerating
beyond the twenty-six items of the English form of the Latin alphabet).</para>
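<para>The ambiguity of "cc" can be made concrete with a short Python sketch. It is a minimal
illustration under stated assumptions: the function names are hypothetical, and the two
letter-label conventions shown (repeated letters, and a positional scheme like that of
spreadsheet columns) are merely two of the possible rules for counting past the
twenty-sixth item.</para>
<programlisting language="python"><![CDATA[
```python
ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_value(label):
    """Read the label as a Roman numeral."""
    values = [ROMAN[ch] for ch in label.lower()]
    total = 0
    for i, v in enumerate(values):
        # A smaller value placed before a larger one is subtracted.
        if i + 1 < len(values) and v < values[i + 1]:
            total -= v
        else:
            total += v
    return total


def repeated_letter_value(label):
    """Assumed convention: a..z = 1..26, aa..zz = 27..52, and so on."""
    label = label.lower()
    if len(set(label)) != 1:
        raise ValueError("label must repeat a single letter")
    return 26 * (len(label) - 1) + (ord(label[0]) - ord("a") + 1)


def positional_letter_value(label):
    """Assumed convention: a..z, aa, ab, ... as in spreadsheet columns."""
    total = 0
    for ch in label.lower():
        total = total * 26 + (ord(ch) - ord("a") + 1)
    return total


for reading in (roman_value, repeated_letter_value, positional_letter_value):
    print(reading.__name__, reading("cc"))
```
]]></programlisting>
<para>Under these assumptions the same label yields 200, 29, or 81, so the numeration
system must be stated explicitly before a label can be converted.</para>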
<para>In short, reference systems present numerous problems. A variety of conventions have
developed independently over centuries, across languages and disciplines. We divide,
label, and refer to each other's writings in a bewildering variety of ways. We often have
a difficult time negotiating such differences; computers do not fare better.</para>
</section>
<section>
<title>Concepts and Definitions</title>
<important>
<para>This section is normative.</para>
</important>
<para>The definitions below are normative only within the scope of WF URIs. One should not
infer from the terms or their definitions that they stipulate the best way to think
about written texts.<warning>
<para>This section is overly long, and I admit steps should be dropped. I've erred
on the side of excess for this first draft, because previous attempts to deal
with this task have been, in my opinion, poorly theorized. I think the downfall
of FRBR and OHCO is that their definitions assumed too much of the reader. They
were not approached incrementally, from the simplest ideas. Hence my
excessive approach.</para>
<para>Much of the discussion is technical and difficult to follow. Because this is
ultimately a technical document, and needs to be rigorous, I would personally
prefer to keep the rigor and provide a second "Gentle Guide to Making a WF URI"
in a casual tone suited to teaching.</para>
<para>As I re-read my draft, I blame myself for overthinking the matter. But I
haven't deleted a lot, so as to leave at least a record of some early thoughts,
and to catalyze a wide-ranging discussion among those who can do better.</para>
</warning></para>
<section>
<title>Writing, Works, Scripta, and Shared Limits</title>
<section>
<title>Writing</title>
<para><termdef><firstterm>Writing</firstterm> is defined as anything a human creates
through visible marks, i.e., characters and symbols, to communicate ideas
and meaning to later readers.</termdef></para>
<para><termdef>An <firstterm>item of writing</firstterm> is an instance of
writing.</termdef> Every item of writing must be referred to
<emphasis>qua</emphasis> conceptual/immaterial entity, or
<emphasis>qua</emphasis> instantiated/material entity. For example, one can
talk about the item of writing called the <emphasis>Iliad</emphasis> without
reference to any particular edition, or one can talk about a specific printed
edition of the Homeric poems. Items of writing discussed in the first mode are
termed <emphasis>works</emphasis>; in the second, <emphasis>scripta</emphasis>
(singular: <emphasis>scriptum</emphasis>). The distinction is to be made even
when it is not so evident (see below).</para>
</section>
<section>
<title>Works</title>
<para><termdef>A <firstterm>work</firstterm> is a conceptual, nonmaterial item of
writing. Every work is a notional entity with shared limits, whose content
admits division (perhaps into other works), and whose content cannot be
equated or conflated with any individual material entity.</termdef> Homer's
<emphasis>Iliad</emphasis> and Shakespeare's <emphasis>Henry VI</emphasis>
are examples of conceptual works: they are items of literature that can be
meaningfully discussed independent of any of the many individual books or
manuscripts that carry editions, translations, or other versions of these works. </para>
<para>Works frequently encompass other works (e.g., the Catalog of Ships or
Gloucester's Soliloquy), or they may be constituent parts of larger works or
collections, which are also treated here as works (e.g., the Homeric Cycle,
Shakespeare's minor tetralogy). Work-to-work relationships may exhibit multiple
levels of nesting, repetition, or interconnection. For example, the work
<emphasis>The New Testament</emphasis> encompasses the work <emphasis>The
Gospel of Matthew</emphasis>, which includes the work <emphasis>The Sermon
on the Mount</emphasis>, which includes the work <emphasis>The Lord's
Prayer</emphasis>, a work that is itself a constituent part of many other
works, sometimes repeatedly so (e.g., a Christian liturgical service). </para>
<para>Any single work may have derivative versions (e.g., translations, paraphrases,
parodies). Such work-versions are to be treated as works in their own right.
<termdef>To every work-version may be attached one or more
<firstterm>version paths</firstterm>, defined as the chronologically
ordered sequence of work-versions upon which a particular work-version
depends.</termdef> A version path may be unclear, or disputed, e.g., texts
preserved in medieval manuscripts hundreds of years removed from any putative
original, with perhaps questions of contamination. Version paths of more recent
works might be quite clear. For example, a modern critical edition (a
work-version in its own right) collectively extends the length of the version
paths of its manuscripts by one, and a translation of that critical edition's
work-version into a modern language has a version path yet one step
longer.</para>
<para>If a work-version is the subject of a version path, the name of that
work-version may be applied not merely to the original work-version, but to all
work-versions along each of its resultant version paths. The name "Sun Tzu's
<emphasis>Art of War</emphasis>" applies to all its versions. Hence, the
name of a work-version should be treated as labeling not a single item but a
class of items. That class may be restricted to subclasses. Any such restriction
may be made on the basis of a version path or not. "Lionel Giles' 1910
translation of Sun Tzu's <emphasis>Art of War</emphasis>" uses a version
path to restrict the original class. "Twentieth-century versions of Sun Tzu's
<emphasis>Art of War</emphasis> published in the United States" uses
criteria not reflecting any particular version path. </para>
<para>The name of a work-version is inheritable by its versions (a transitive,
asymmetric relation). West's critical edition of the Greek text of the
<emphasis>Iliad</emphasis> is a particular version, but it can be
legitimately given other more general names: (1) West's edition of the
<emphasis>Iliad</emphasis> in Greek; (2) the Greek version of the
<emphasis>Iliad</emphasis>; or simply (3) the <emphasis>Iliad</emphasis>.
The last of these names can also be applied to an English translation of West's
Greek edition, which could also be properly termed (4) West's version of the
<emphasis>Iliad</emphasis> (the language is not specified); (5) an English
translation of the <emphasis>Iliad</emphasis>, or (6) an English translation of
West's edition of the <emphasis>Iliad</emphasis> in Greek. Other names,
following certain permutations, could be used. Although the examples above
illustrate the point with human-readable names, it applies as well to
machine-readable ones, such as URIs. <warning>
<para>The previous two paragraphs exhibit some overthinking on my part, but
they also reflect a real problem I discuss elsewhere, about how to point
to a work-version in particular, a requisite for scripta with multiple
versions of the same work. We have some good URI vocabulary for famous
works, and we have some good URI vocabulary for items of literature
where the work-scriptum divide is not so strong (modern scholarly
articles). But how do we find vocabularies of work-versions that aren't
famous and are clearly down the version path? What about work-versions
that are based on arbitrary categories? The previous two paragraphs note
that work-versions can be built in various combinations, either through
the genetic relationships that make up stemmatology or seemingly
arbitrary properties.</para>
<para>The problem is somewhat philosophical. A work is ultimately a class of
abstract, conceptual objects. Any restriction of that class, into a
work-version, requires the stipulation of acceptable values for one or
more properties. Such properties can come in a bewildering variety of
types. The point seems abstract now, but it has real importance when it
comes to qualifying a particular work-based URI. When pointing to a
particular version of a work in a book (say, facing text and
translation), how does one restrict the scope of a work URI to that
particular version?</para>
</warning></para>
</section>
<section>
<title>Scripta</title>
<para><termdef>A <firstterm>scriptum</firstterm> is a physical item of writing. It
is a text-bearing material object, or a set of (largely) indistinguishable
copies of text-bearing objects. Every scriptum is a physical entity with
shared limits, whose content admits division (perhaps into other scripta),
and whose content cannot be equated or conflated with any single
work.</termdef> For the purposes of WF URIs, digital files are to be
regarded as material objects, and therefore scripta. Manuscript Venetus A,
Caroline Alexander's translation of the <emphasis>Iliad</emphasis> (ISBN
9780062046277), the 1594 first quarto edition of <emphasis>Henry VI Part
2</emphasis>, the Bate and Rasmussen edition of <emphasis>Henry
VI</emphasis> (ISBN 9780812969405), and a pdf scan of that edition, are all
examples of scripta. </para>
<para>A digital scan or digital photograph of a material object <emphasis>must
not</emphasis> be treated as the same scriptum as what it represents. It is
a digital surrogate, and constitutes a distinct scriptum.</para>
<para>A scriptum may be a constituent part of another scriptum (e.g., a volume in a
multivolume publication) or it may comprise multiple scripta (e.g., a volume
that consists of multiple fascicles). Sometimes a single scriptum may be
legitimately treated as overlapping or intersecting another (e.g., a manuscript
that was broken apart long ago, the parts joined to other manuscripts). It may
be possible for a single artefact to be treated wholly as two different scripta.
For example, a hand-annotated print copy of a book may be considered an
individual manuscript (a scriptum with set membership of one) or as part of a
print run (a scriptum with set membership of more than one).</para>
</section>
<section>
<title>Shared Limits</title>
<para><termdef>The term <firstterm>shared limits</firstterm>, used of works and
scripta, refers to a consensus held by a community of practice.</termdef>
<termdef>A <firstterm>shared work limit</firstterm> is a consensus by a
community of practice on what the limits are to a particular work.</termdef>
<termdef>A <firstterm>shared scriptum limit</firstterm> is a consensus on the
limits for a particular scriptum.</termdef> In many cases these distinctions
are not significant, because most such limits are shared not simply by a
community of practice, but by everyone. </para>
<para>In the case of the shared limits of scripta, WF does not define
"indistinguishable," the criterion used to determine whether any two material
objects are the same scriptum. If the limits of a scriptum are unclear, it
should be avoided as the basis of a WF URI.</para>
<para>In the case of some works, especially culturally influential works, one
community's shared limits will differ from those of another (e.g., the limits to
the meaning of "Bible"). The question of resolving ambiguous or contested shared
limits is discussed below.</para>
</section>
</section>
<section>
<title>The Relationship between Works and Scripta</title>
<para>The description of works and scripta above resembles the account of Type 1
Entities defined by the <link
xlink:href="https://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records"
>Functional Requirements for Bibliographic Records (FRBR)</link>. But whereas
FRBR Type 1 Entities distinguish between Works and Expressions, the WF model treats
all conceptual entities (i.e., both Works and Expressions) as a single category. Any
rendition or version of a conceptual, non-material work is itself a conceptual,
non-material work, even if it closely depends upon another. The FRBR term
<emphasis>Expression</emphasis>, said to refer to a nonmaterial, conceptual
entity, is avoided because no one can express a work without some material medium
through which to express it. Similarly, whereas FRBR distinguishes between
Manifestations and Items, the WF model combines both, applying
<emphasis>scripta</emphasis> to all material text-bearing objects. If scripta
are mechanically reproduced and are largely indistinguishable from one another, they
are treated as the same scriptum. Thus, WF <emphasis>scriptum</emphasis> most
resembles FRBR <emphasis>manifestation</emphasis>, which is applied to sets of
material items.</para>
<para>Any single work may be realized in one or more scripta. A work may be represented
in a scriptum completely, partially, or repeatedly. Likewise, any given scriptum
will contain at least one work. Each work is represented completely, partially, or
repeatedly. Many types of scripta have multiple works (e.g., an anthology; a
compilation; or a bilingual edition, which has two independent work-versions). The
work or works on a scriptum may not be identifiable (e.g., a papyrus fragment with
unattested text). Some works do not survive in any known scripta (e.g., Aristotle,
<emphasis>Poetics</emphasis> book 2, on comedy; the letters written by the
Corinthians to Paul the Apostle; George Orwell's <emphasis>Socialism and
War</emphasis>). </para>
<para>In many cases, particularly in ancient and medieval literature, the distinction
between works and scripta will be readily apparent. In other cases, the distinction
is not always evident. For example, many scholarly articles are assigned a Digital
Object Identifier (DOI) URI, which may be assigned by the publisher to the item
<emphasis>qua</emphasis> work, <emphasis>qua</emphasis> scriptum, or both.
Others may use that same DOI URI to refer to the article as a work, as a scriptum,
or as both. Anyone who creates a WF URI must therefore specify whether a given URI
refers to a writing <emphasis>qua</emphasis> work or <emphasis>qua</emphasis>
scriptum.</para>
</section>
<section>
<title>Scripta: Readers, Regions, and Text Structures</title>
<para>Given the framework adopted above, several more key concepts follow, centered
around those who use scripta. WF URIs have been designed under the assumption that
anyone who independently approaches a scriptum endeavors to understand the shared
features of that scriptum. <termdef>The <firstterm>shared features</firstterm> of a
scriptum include its work-version regions (defined below) and their
parts.</termdef> No WF URI may be applied to a scriptum that does not lend
itself to such shared features.</para>
<para><termdef>The term <firstterm>scriptum reader</firstterm> is defined as a person
who has the aptitude to understand the text in a given scriptum, either because
that person is part of the audience intended by the creator of the scriptum, or
because they have acquired the background needed to have that
facility.</termdef> A scriptum reader can read at least one of the written
languages in the scriptum, and has the requisite background to understand what the
scriptum is about, and to discern the scriptum's most significant shared features.<warning>
<para>From this point, the argument stresses the competence of what is termed
the <emphasis>scriptum reader</emphasis>. It pivots away from defining
textual objects <emphasis>an sich</emphasis> to defining the perceiver of
the textual object. Is this a legitimate approach? If so, what are the
implications (epistemological esp.)? If not, what are the alternatives? It
seems you want as many people as possible involved in making WF URIs, but you don't want
dilettantes and fools. But perhaps even the most competent of us might think
the other a dilettante and fool in some respect?</para>
</warning></para>
<para><termdef>A <firstterm>work-version region</firstterm> is defined as a set of one
or more places on a scriptum that collectively delimit all the text that is part
of a single version of a single work.</termdef> For example, a bilingual novel
lends itself to several different work-version regions: the original text, the
facing translation, the prefatory introduction, an excursus in the appendix, etc. A
work-version region need not be contiguous. For example, a scriptum may feature
scattered quotes from another text. Work-version regions, like work-versions
themselves, may nest, overlap, or tessellate.</para>
<para>Most work-version regions are designed to help scriptum readers distinguish
clearly between <termdef>the <firstterm>main text</firstterm>, the text that is
squarely part of the work itself and within its shared limits</termdef>, and
<termdef>the <firstterm>paratext</firstterm>, features intended to divide,
structure, and order the main text.</termdef>
</para>
<para>Most main texts can be distinguished by scriptum readers into a <termdef>single
<firstterm>trunk text</firstterm>, a text that runs from start to finish
(logical, perhaps also physical),</termdef> and zero or more
<termdef><firstterm>branch texts</firstterm>, texts that are anchored to a
point or range in the trunk text, and that relate to the trunk text at or near
the anchored region</termdef>. Examples of branch texts include footnotes,
endnotes, glosses, and marginalia.</para>
<para>Paratext includes not only visible components that can be isolated from the main
text, such as symbols and spaces (margins, indentations, leading, etc.), but also
recognizable attributes of the main text (color, font, size, weight, etc.) that a
scriptum reader would recognize as organizing the main text. It may be the case that
paratext is interspersed within the main text (e.g., vertical bars to signify line
breaks, inline numerals to mark the beginning of a textual unit). Any visible
features associated with the work that are unclear in function (e.g., marginal
doodles) are to be treated as paratext. </para>
<para><termdef>A <firstterm>viable work-version region</firstterm> is a work-version
region whose main text, paratext, trunk text, and branch texts are clear to
scriptum readers.</termdef></para>
<para>Viable work-version regions can be named, whether by traditional nomenclature or
URI. For example, in the bilingual edition example above, the viable work-version
region for the version in the original language can be named with URIs reserved for
the main work. It can also be given URIs specific to the particular work-version
(i.e., X's edition of the original text). The same can be said of the facing
translation: it can be named after the general work itself, and after the specific work-version.<warning>
<para>Note the discussion above about the difficulty of restricting a work to a
work-version. I see different ways to approach the task of building a URI
for a work-version:<orderedlist>
<listitem>
<para><emphasis role="bold">Absolute base URI</emphasis>. Each
work-version must take only absolute URIs. The relationship
between work-versions should not be built into the URI
architecture itself. If anyone wishes to state the relationship
held by any two work-versions, that would need to happen
separately, in RDF triples.</para>
<para>Advantages: everything is rooted in URIs, without a second
level of fragids to memorize. </para>
<para>Disadvantages: there hardly exist any such absolute, specific
work-version URIs. No one is entitled to change a URI in a
namespace they don't own (absent URI fragments), so they would
have to be coined by individual users in their own namespaces.
Wouldn't this result in numerous obscure vocabularies?
(Elsewhere I've admonished creators of work URIs to provide a
mechanism whereby they can mint sub URIs for specific
work-versions, with the hope that maybe Wikipedia could find a
way, perhaps with #-identifiable stubs. Would these be okay?)
</para>
</listitem>
<listitem>
<para><emphasis role="bold">Work fragids</emphasis>. The idea here
is to define, separately, another fragid structure exclusively
for work URIs. One can then insert parameters that restrict the
class along any number of criteria. For example,
<code>&lt;http://dbpedia.org/resource/Iliad#<emphasis
role="bold">$wf0:lang=grc$</emphasis>&gt;</code> would
declare restriction in scope to only works of the Iliad in the
Greek language.</para>
<para>Advantages: it's modular, and can allow for a variety of
properties to restrict classes. It allows for a common set of
work URIs that don't need to be guessed at.</para>
<para>Disadvantages: permutations would abound. ISO 639-3 language
codes are easy, but what about defining authorship or date or
place of creation? One would expect, given a new fragid
structure, that the parameters would need to refer to URIs and
controlled vocabulary, not nicknames. It could get c-r-a-z-y in
length and legibility. WFs would have to specify what classes
are permitted. But then how does one know whether a particular
media file that matches the base work URI also matches the
subclass?</para>
</listitem>
<listitem>
<para><emphasis role="bold">Adapted URIs</emphasis>. This would be
an attempt to thread the needle between the previous two
options. A user would embed the best known Ur-work URI within a
tag URI in the user's namespace, then add in specific WF-defined
steps to restrict membership, e.g.,
<code>&lt;tag:example.com,2014:http://dbpedia.org/resource/Iliad/<emphasis
role="bold">2002/grc/west</emphasis>&gt;</code> would
identify West's specific version, but
<code>&lt;tag:example.com,2014:http://dbpedia.org/resource/Iliad/<emphasis
role="bold">*/grc/*</emphasis>&gt;</code> would point to
any Greek edition.</para>
<para>Advantages: the tag URI attaches provenance to the new URI for
blame/credit. The extra paths lend themselves to shorthand
nomenclature that could attract matches based on shared
practices. Only a few of the most common examples would be
permitted, to avoid chaos.</para>
<para>Disadvantages: The URI is no longer opaque: processors would
have to drop the authority part of the tag URI before trying to
make matches. What do you do with fuzzy dates? Or with cases where
many creators would have to be inserted in the third step? Everyone
will complain: you don't support X as a criterion. Even supported
work-versions could result in wild and unpredictable
constructions. It is also built not upon the work-version
path but upon arbitrary criteria meant to imitate the work-version
path.</para>
</listitem>
</orderedlist>Perhaps there's another way forward? We need some sort of
convenient way to say something like "West's version of the Iliad"
(regardless of whether it's the original Greek or a modern translation) or "all
versions of the Iliad published in the 1950s." It doesn't mean WF needs to
support all possible permutations, but it should have some kind of
controlled template that would permit growth.</para>
</warning></para>
<para><termdef>A <firstterm>work-scriptum map</firstterm> is defined as a scriptum
allocated into only those viable work-version regions that admit locally unique
URIs.</termdef> The work-scriptum map may be described procedurally. Apply
work-version regions to a scriptum. Remove any that are not viable. Apply to each
remaining region all possible URIs that identify its work-version or any inherited
work-versions (see above on work-version inheritance). Find all regions with
duplicate URIs. If one region contains a work-version clearly older than the rest
(it has the shortest version path to a putative original), let it uniquely retain
that URI, and discard all other instances of the duplicate URI. Discard any regions
that do not have any URIs.</para>
<para>For example, a bilingual French-Greek edition of Aristotle's
<emphasis>Categories</emphasis> will admit at least two work-version regions.
Because the French is a translation of the facing Greek, and depends on it, only the
Greek version retains the URI for the work in general, as well as the URI for that
particular version. The French version retains only the URI for the French
version.</para>
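<para>The procedure for building a work-scriptum map can be sketched in Python. The sketch is
illustrative, not normative, and rests on stated assumptions: "clearly older" is modeled
as a strictly shortest version path length, the <code>Region</code> class and
<code>work_scriptum_map</code> function are hypothetical names, and the
<code>example.org</code> URIs are placeholders for the Aristotle example.</para>
<programlisting language="python"><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass
class Region:
    name: str
    viable: bool
    path_length: int  # assumed stand-in for the age of the work-version
    uris: set = field(default_factory=set)


def work_scriptum_map(regions):
    # Keep only viable regions, copying URI sets so inputs stay untouched.
    kept = [Region(r.name, True, r.path_length, set(r.uris))
            for r in regions if r.viable]
    every_uri = set().union(*(r.uris for r in kept))
    for uri in every_uri:
        holders = [r for r in kept if uri in r.uris]
        if len(holders) < 2:
            continue  # not a duplicate
        shortest = min(r.path_length for r in holders)
        oldest = [r for r in holders if r.path_length == shortest]
        # One holder keeps the URI only if it alone has the shortest path.
        winner = oldest[0] if len(oldest) == 1 else None
        for r in holders:
            if r is not winner:
                r.uris.discard(uri)
    # Regions left without any URI drop out of the map.
    return {r.name: r.uris for r in kept if r.uris}


WORK = "http://example.org/work/categories"
GREEK = "http://example.org/work/categories/greek-edition"
FRENCH = "http://example.org/work/categories/french-translation"

regions = [
    Region("greek text", True, 1, {WORK, GREEK}),
    Region("french translation", True, 2, {WORK, GREEK, FRENCH}),
    Region("illegible marginal notes", False, 1,
           {"http://example.org/work/unknown"}),
]
result = work_scriptum_map(regions)
print(result)
```
]]></programlisting>
<para>In this sketch the Greek region keeps both the general URI and its own, and the French
region keeps only the URI for the French version. When two regions tie for the shortest
version path, the shared URI is discarded from both.</para>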
<para>The work-scriptum map and its work-version regions are important, because they
allow one to unambiguously identify select parts of a scriptum based on works, and
assign priority to multiple versions of the same work. If a particular work-version
features two or more times in a scriptum, and none of them clearly has
priority (the shortest version path), the general work URI will
<emphasis>never</emphasis> be used, only the URIs that are distinct to unique
work-versions.</para>
</section>
<section>
<title>The Trunk Text: Divisions, Units, Sequences, and Trees</title>
<para>In this section "the text," when not specified, refers to the main, trunk text of
a work-version region in a work-scriptum map. It excludes the paratext (e.g., page
numbers) and branch texts (e.g., notes). </para>
<para>Definitions pertaining to divisions and units focus on scripta and exclude
work-versions per se, because two people can discuss a text's divisions and units
only on the basis of real-world examples. Works are abstract entities that may be
conceptualized differently (one person's mental work divisions cannot be consulted
by another), so cannot define text divisions and units. One needs scripta.</para>
<para><termdef>A <firstterm>textual division</firstterm> marks a break in the
text.</termdef> Examples of textual divisions in scripta include spaces
(indentations, margins, word spaces), labels, hairlines, and changes in text
rendering (color, font, weight, size). Textual divisions are to be treated as
anchors within a stream of text, and always intervening between text items
(characters).</para>
<para><termdef>A <firstterm>textual division group</firstterm> is a series of textual
divisions that divide a text, or a designated portion of a text, by the same
units with a thematically related set of labels.</termdef></para>
<para><termdef>Applying one textual division group to a text, or to a designated portion
of a text, results in a sequence of one or more <firstterm>textual
units</firstterm>.</termdef> Textual units are normally named in terms of
division typology. Examples of textual units include books, pages, columns, lines,
chapters, parts, stanzas, periods, indentations, and word spaces. </para>
<para>Any textual unit might itself be subject to division by another textual division
group and result in a subordinate textual unit sequence. A page, for example, might
be divided into lines. A textual unit sequence might be composed of textual units of
different types. For example, a chapter might include a mixture of paragraphs and
run-in headers. </para>
<para>Most textual unit sequences will be ordered according to a single directed stream
(a main text consisting of a trunk text and perhaps branch texts). This merely
reflects the nature of the divided text. Some textual unit sequences may take
multiple streams, or be undirected or semi-directed (e.g., concrete poetry, word
clouds). Regardless of how a text is ordered or directed, its textual unit sequence
will inherit that order and direction.</para>
<para>Although rooted in scripta, a given textual division or textual unit might be
defined by an ideal, not a material entity, and therefore be grounded notionally in
works, not scripta. All textual divisions are based either on the
<emphasis>physical, material features of a scriptum</emphasis>, or upon
<emphasis>logical, conceptual contours of a work</emphasis>. Examples of
material textual units include pages, lines, columns, subcolumns, codexes,
fascicles, and folios. Examples of logical textual units include abstracts, acts,
appendixes, chapters, suras, paragraphs, couplets, stanzas, sentences, notes,
quotations, verses, phrases, words, and letters. </para>
<para>The number of logical unit types is significantly greater than the number of material unit
types. Logical unit types are also generally not as well defined, because they label conceptual
entities, not material ones. </para>
<para>Some textual unit names can be interpreted as being either material or logical,
e.g., books, volumes, parts. Context determines whether such a term refers to a
material or a logical textual unit, oftentimes indicated by the next unambiguous
subdivision unit type. If a book is divided into chapters, the book unit is logical;
if it is divided into pages, it is material.</para>
<para><termdef>A <firstterm>textual unit tree</firstterm> is a text divided into a
single sequence of textual units (its primary sequence), any unit of which may
itself be divided into a subsequence of textual units, and so forth.</termdef>
The tree metaphor has further application. Any given textual unit either yields
sequences of smaller ones, and therefore can be called a branch, or it terminates,
and so is called a leaf. </para>
<para>Every viable work-version region in a scriptum can be organized into zero or
more textual unit trees. For example, a book (the entire scriptum being taken as the
viable work-version region) might have a textual unit tree based on pages, columns,
and lines. That same book may have another textual unit tree of parts, chapters,
subchapters, paragraphs, and sentences. That same book may include yet other textual
unit trees (e.g., a second level of pagination, corresponding to a previous
edition). In any given textual unit tree, no textual unit is to be divided into more
than one sequence. If a textual unit permits an alternative, second sequence, it
must be expressed in a separate tree.</para>
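<para>A textual unit tree can be modeled as a nested sequence. The Python sketch below is
illustrative, not normative. The class <code>TextualUnit</code>, the helper functions,
and the sample labels are hypothetical. Each unit holds at most one subsequence, which
reflects the rule that an alternative sequence must be expressed in a separate
tree.</para>
<programlisting language="python"><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextualUnit:
    unit_type: str
    label: str
    # A unit has a single subsequence; an alternative division of the
    # same unit would belong to a separate tree.
    children: List["TextualUnit"] = field(default_factory=list)

    def is_leaf(self):
        return not self.children


def depth(sequence):
    """Depth of a tree given as its primary sequence of units."""
    if not sequence:
        return 0
    return 1 + max(depth(unit.children) for unit in sequence)


def leaves(sequence):
    """Yield terminal units in the order inherited from the text."""
    for unit in sequence:
        if unit.is_leaf():
            yield unit
        else:
            yield from leaves(unit.children)


# Two separate trees over the same hypothetical book.
material_tree = [
    TextualUnit("page", "1", [
        TextualUnit("column", "a", [TextualUnit("line", "1"),
                                    TextualUnit("line", "2")]),
        TextualUnit("column", "b", [TextualUnit("line", "1")]),
    ]),
    TextualUnit("page", "2"),
]
logical_tree = [TextualUnit("chapter", "1"), TextualUnit("chapter", "2")]

print(depth(material_tree), depth(logical_tree))
print([(u.unit_type, u.label) for u in leaves(material_tree)])
```
]]></programlisting>
<para>The page-column-line tree and the chapter tree are kept as two independent values
rather than as rival subdivisions within one structure.</para>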
<para>A textual unit tree can be of any depth. No tree is required to have a depth
greater than one. Consequently, every textual unit sequence is also a textual unit tree.<note>
<para>The WF model differs from the common ideal of a text as an ordered
hierarchy of content objects (OHCO), coined by Renear et al. 1990. The WF
model does not insist that text is ordered (the O in OHCO), but this version
of the WF syntax supports only ordered, directed text. (Unordered,
undirected text may be supported in a future version of WF.) Hierarchy (the
H in OHCO) is reframed in terms of textual unit sequences, which can nest
(but are not required to do so) to produce hierarchies. The terms "content"
and "objects" (the CO in OHCO) are regarded as redundant. An ordered textual
hierarchy cannot have non-content or non-objects.</para>
</note></para>
<para>Every textual unit sequence is classified as being either
<emphasis>material</emphasis> or <emphasis>logical</emphasis> depending upon the
criteria for the division group. A textual unit tree may therefore also be termed
material or logical, based on the classification of its primary (initial) textual
unit sequence. For example, a page-column-line textual unit tree is material,
whereas a part-chapter-subchapter-paragraph-sentence tree is logical. </para>
<para>A textual unit tree may be either native or adopted. <termdef>A
<firstterm>native textual unit tree</firstterm> is one that has been created
for the particular scriptum.</termdef>
<termdef>An <firstterm>adopted textual unit tree</firstterm> is one that has been
applied to a scriptum from a previous one.</termdef>
</para>
<para>Some viable work-version regions may allow many concurrent textual unit trees,
both material and logical, native and adopted. For example, a modern translation of
Aristotle's <emphasis>Categories</emphasis> may have (1) a unique sequence of
paragraphs (a tree of native logical textual units); (2) page and line breaks
(native material units); (3) sectioning found in many other editions and
translations (adopted logical units); and (4) line numbers drawn from Bekker's
19th-century edition (adopted material units).</para>
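<para>The two classifications are independent of each other, which a small sketch
can make explicit. The Python names below (<literal>Basis</literal>,
<literal>Origin</literal>, <literal>TreeDescription</literal>) are hypothetical and
serve only to restate the <emphasis>Categories</emphasis> example as data. They are
not part of the WF syntax.</para>
<programlisting language="python">
```python
from dataclasses import dataclass
from enum import Enum


class Basis(Enum):
    MATERIAL = "material"
    LOGICAL = "logical"


class Origin(Enum):
    NATIVE = "native"
    ADOPTED = "adopted"


@dataclass(frozen=True)
class TreeDescription:
    """Hypothetical summary of one textual unit tree."""

    summary: str
    basis: Basis
    origin: Origin


# The four concurrent trees of the modern translation described above.
categories_translation = [
    TreeDescription("paragraphs unique to this translation", Basis.LOGICAL, Origin.NATIVE),
    TreeDescription("page and line breaks", Basis.MATERIAL, Origin.NATIVE),
    TreeDescription("sectioning shared with other editions", Basis.LOGICAL, Origin.ADOPTED),
    TreeDescription("Bekker line numbers", Basis.MATERIAL, Origin.ADOPTED),
]

# Every pairing of basis and origin occurs once in this example.
combinations = {(tree.basis, tree.origin) for tree in categories_translation}
```
</programlisting>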
</section>
<section>
<title>Labels; Reference Units, Sequences, and Systems</title>
<para>In a textual unit sequence each unit may be labeled or not. <termdef>A textual
unit is <firstterm>labeled</firstterm> if the unit is accompanied by numerals,
letters, abbreviations, or symbols in the paratext that name or identify the
unit relative to the others in the same sequence.</termdef> Such labels provide
the explicit basis for reference systems. </para>
<para>A label does not necessarily identify a clearly demarcated textual unit. Some
labels identify textual units that are not clearly divided, or not divided at all.
For example, a modern translation of a text by Plato may have occasional Stephanus
numbers (the page numbers of a 16th-century edition of the Greek) in the margin, but
the paratext might also lack any clear textual divisions to specify where one
textual unit ends and the next begins. Such vague labels tend to come with trees
based on adopted material units. From this point forward, the term
<emphasis>label</emphasis> excludes these types of vague labels, and means only
those labels that are attached to textual units with clear divisions.</para>
<para>Any textual unit sequence is, on the whole, a labeled sequence or an unlabeled
one. <termdef>A <firstterm>labeled textual unit sequence</firstterm> is one where
the majority of textual units are labeled.</termdef>
<termdef>All others are termed <firstterm>unlabeled textual unit
sequences</firstterm>.</termdef></para>
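<para>The majority rule can be expressed as a short test. The sketch below is an
illustration under stated assumptions: the function name
<literal>is_labeled_sequence</literal> is hypothetical, an unlabeled unit is
represented by <literal>None</literal>, and "majority" is read as strictly more than
half, so that an evenly split sequence counts as unlabeled.</para>
<programlisting language="python">
```python
from typing import Optional, Sequence


def is_labeled_sequence(labels: Sequence[Optional[str]]) -> bool:
    """Return True when more than half of the textual units carry a label.

    Each entry is the label of one textual unit, or None if the unit is
    unlabeled. Vague labels are assumed to have been excluded already.
    """
    labeled = sum(1 for label in labels if label is not None)
    # Strict majority: an even split falls under "all others" (unlabeled).
    return labeled * 2 > len(labels)
```
</programlisting>
<para>For instance, a sequence with labels on three of its four units is labeled,
whereas a sequence with a label on only one of its three units is unlabeled.</para>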
<para>Within a given labeled textual unit sequence every textual unit label is unique or
non-unique. <termdef>Any label attached to only one textual unit in the sequence is
a <firstterm>unique label</firstterm>. Any label attached to two or more textual
units (i.e., two or more units with the same label) must be treated as unique if
the label, according to shared convention, marks the textual units as component
parts of a single range of text.</termdef>
<termdef>All others are <firstterm>non-unique labels</firstterm>.</termdef> For
example, in the label sequence 1, 2a, 5, 2b, rubric, 3, [unlabeled], 4, rubric, 6,
7... the label "rubric" is non-unique, and all others are unique. Scriptum readers
can discern from the pattern that the labels "2a" and "2b" form a pair
designating the first and second halves of a single range of text, and so are treated as
jointly forming the unique label "2." <termdef>Any textual units joined by a single
label, whether the units are contiguous or not in the sequence, are called
<firstterm>reference units</firstterm>.</termdef> Thus, the two textual
units marked by 2a and 2b in the previous example are the two parts of a single