und.timestamp
$Rev: 190997 $
The final move in the old svn infra: change the am-shared reference to point to
giella-core parallel to the language dir. After this we can remove am-shared
from each language.
r190997:
Fix mobile speller filename bug.
r190907:
Fix speller generation bug.
r190900:
Fix speller analyser reference after the flattening of the
tools/spellcheckers/ dir.
r190891:
Final step in flattening the tools/spellcheckers/ dir tree: removing the whole
fstbased/ dir, with all subdirs. Finally!
r190878:
Fix automakefile error: no final backslash followed by an empty line.
r190866:
Step eight in flattening the tools/spellcheckers/ dir tree: flipping the switch.
All pieces are in place for building everything in tools/spellcheckers/ only,
and everything has been tested with one language, including make check (a few
tests are skipped because the fst is not found, but no tests break). The old
files are kept for the moment, in case unforeseen issues or missing data pop
up after the switch, but will be deleted after verification.
r190855:
Step seven in flattening the tools/spellcheckers/ dir tree: copying
fstbased/mobile/hfst/index.xml to the new location.
r190836:
Step six in flattening the tools/spellcheckers/ dir tree: moving TAGWEIGHTS out
of the language independent part to the language specific part, so that we can
specify different tagweight files for desktop and mobile spellers.
r190799:
Step five in flattening the tools/spellcheckers/ dir tree: modifying another set
of build files for the new dir structure, and the consequences of one dir for
all speller files.
r190782:
Step four in flattening the tools/spellcheckers/ dir tree: copying all non-make
files from spellcheckers/fstbased/desktop/hfst/ to spellcheckers/.
r190770:
Step three in flattening the tools/spellcheckers/ dir tree: changing the
relocated build files to adapt to their new home.
r190714:
Step two in flattening the tools/spellcheckers/ dir tree: copying the
desktop/weighting/ dir as the default one - for most languages the
mobile/weighting/ dir is just a copy of the desktop one.
r190663:
Step one in flattening the tools/spellcheckers/ dir tree: copying all subdir
Makefile.am files to *.mod-* files in the top spellcheckers dir, except for the
weighting dirs.
r190650:
Added .gitignore file, as a preparatory step.
r190627:
Forgot to remove the entries for configure.ac re listbased spellers.
r190618:
Removed all list-based spellcheckers. There has not been any serious work in
that area since the move to the new infrastructure 8 years ago. If there is a
future need, we have it all in the rev history, and removing it simplifies other
operations.
r190611:
Moved the files in tools/data/ to tools/tokenisers/, and removed the dir
tools/data/. Part of the tools dir cleanup.
r190575:
Commented out check for GTLANG_xxx variable, it is not used, and the check
output is confusing to users.
r190568:
Added checks for giella-core and giella-shared, symlinking to them if found,
checking out (svn) or cloning (git) if not. Also removed every single reference
to __UND__, it is not needed, and will cause merge conflicts.
r190317:
The last hyphenation build fix: now also works with other than the default fst
backend, e.g. with the foma backend.
r190310:
Removed a double target declaration, one from the old pattern-based build, and
one from the fst build. It was a simple copy from fst to pattern, and is not
needed anymore.
r190291:
Updated referenced filename. Old name was not found, and stopped all builds.
r190284:
Restored file that was accidentally deleted, also renamed it to the correct name
after the dir reorg.
r190275:
One reference to an old filename corrected. Stopped all nightlies.
r190258:
Removing the last remnants of the old hyphenation directory structure.
r190251:
Moving the last files from patterns one dir up.
r190239:
Removed most of the old hyph files not needed anymore.
r190231:
Fixed copy-paste error that was just ... doh. Friday. evening.
r190226:
Switched build to new, shallower build structure. The old files and dirs are
still there, but not used.
r190212:
Forgot one file to be copied up one dir level, now done.
r190198:
Step one in flattening the tools/hyphenators/ dir tree: copying and renaming
make files, copying the filter dir. The files are not yet connected. Also
preparing new build instruction file.
r190127:
Added missing quote mark „ that caused unwanted behaviour in tokenisation.
r190112:
Updated references to dir names in giella-shared: requires new version of
giella-common. Updated some test scripts to refer to the new dir names.
r190068:
The second big renaming: src/morphology/ -> src/fst/. All build, test and config
files are updated. `make` and `make check` work for sma.
r190043:
Added dynamic construction of a regex of flag diacritics found in tokeniser
fst's. The regex is used to ensure that flag diacritics are considered epsilons
at token boundaries. Fixes a number of tokenisation bugs.
r190025:
A glaring miss stopped all nightly builds. Thanks to Tino for pointing out.
r189997:
Renamed src/syntax/ to src/cg3/, and updated all references to it. Part of the
large restructuring, and a test case for more complex renaming.
r189964:
More cleanup after removing src/phonology/*: all references to it have been
replaced, and the file am-shared/src-phonology-dir-include.am has been removed.
r189952:
Forgot to remove src/phonology/Makefile from configure.ac. Duh.
r189913:
Changed documentation extraction & building to get the source doc in
src/morphology/.
r189902:
The big switch: phonology files are now built in src/morphology/ instead of
src/phonology/. Documentation is still built in the old location, but will be
moved separately due to higher conflict risk.
r189865:
Update phonology filename in src/morphology/Makefile.modifications-phon.am.
r189805:
Copy src/phonology/Makefile.am to src/morphology/Makefile.modifications-phon.am
and src/phonology/xxx-phon.twolc to src/morphology/phonology.twolc as step one
in moving the file. Then the build can switch, and finally, the old files can be
deleted.
r189750:
Corrected copy-paste bug in the build steps for areal grammar checker analysers.
The bug caused SMJ to fail.
r189743:
Fixed bug with multiple declarations of EXTRA_DIST and noinst_DATA in the
previous template merge.
r189736:
Preparations for moving the phonology files inside morphology/ (later to be
renamed fst/).
r189372:
Reorganised mt/apertium make files so that fixed content is in Makefile.am, and
user-editable content is in Makefile.modifications.am.
r189294:
Started splitting the local Makefile.am in two, by moving it to a new filename,
and then creating a new Makefile.am that just includes the moved one. In later
commits, some of the content can be moved from one file to the other.
r189249:
Fixed the remaining cases of improved upper-lower case configurable processing.
Removed a variable from configure.ac with comments, turned out it wasn't needed.
r189193:
First step in fixing default case handling: downcasing of derived proper nouns
can now be turned off for the standard fst's by changing a test in configure.ac.
r188999:
Fixed bug in phonology compilation when there are multiple phonology files:
temporary files were deleted before being used due to name overlap.
r188957:
Added Automake variables to handle demanding or non-default uppercasing, or
writing systems with no case distinction at all.
r186195:
Adding |{➤}|{•} to pmscript.
r185312:
Added ‹ and › to the list of possible punctuation marks in the tokenisers.
r184977:
Added Makefile setting for enabling swaps in error models (ie ab -> ba). Default
is no (as this used not to work, and the existing error models are based on this
fact).
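A swap edit in the sense of r184977 can be illustrated outside the fst world. The sketch below is plain Python, not the error model source used by the build, and the function name `adjacent_swaps` is made up for this note. It lists every string reachable from a word by exchanging one pair of neighbouring characters (ab -> ba).

```python
def adjacent_swaps(word: str) -> list[str]:
    """Return every string obtained by swapping one pair of adjacent characters."""
    results = []
    for i in range(len(word) - 1):
        # Swapping two identical characters gives the same word, so skip it.
        if word[i] != word[i + 1]:
            swapped = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            results.append(swapped)
    return results


if __name__ == "__main__":
    print(adjacent_swaps("abc"))
```

An error model with swaps enabled would treat each of these candidates as one edit away from the input, instead of counting two separate substitutions.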
r184493:
Replace UNDEFINED with __UNDEFINED__, so that text replacement can take place.
r184388:
tools/mt/Makefile.am needs am-shared/lookup-include.am as well.
r184380:
Forgot to add cgbased to the SUBDIRS variable in tools/mt/Makefile.am.
r184372:
Added basic support for CG-based machine translation. Ongoing work.
r184171:
Make sure some jspwiki header files for generated documentation are included in
the distro.
r184080:
Made it possible to disable Forrest validation when Forrest is installed. This
reduces build time and annoying warnings for people not working on the
documentation. Default is still to do Forrest validation.
r183967:
Wrapped command line tools in double quotes, to protect against spaces in
pathnames. Spaces will occur when building on Windows using Windows Subsystem
for Linux, as locations such as 'Program Files' are included in the default
search path.
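Why the quoting in r183967 matters can be shown with Python's `shlex.split`, which follows POSIX shell word splitting. This is only an illustration, the build files themselves are Automake and shell; the path and tool name below are hypothetical.

```python
import shlex

# Hypothetical tool location containing a space, as seen under WSL.
TOOL = "/mnt/c/Program Files/tools/bin/sometool"

unquoted = f"{TOOL} --version"
quoted = f'"{TOOL}" --version'

# Without quotes the path is split at the space into two separate words.
unquoted_words = shlex.split(unquoted)
# With double quotes the path survives as a single word.
quoted_words = shlex.split(quoted)

if __name__ == "__main__":
    print(unquoted_words)
    print(quoted_words)
```

In the unquoted case the shell would try to run `/mnt/c/Program` as the command, which is the failure the double quotes protect against.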
r183714:
Improved build process for pattern hyphenators - now patgen config is done
programmatically instead of interactively. The values are configured in the
Makefile.am.
r183082:
Added script for testing tag coverage, made by Kevin, and originally for sme.
r182643:
Added support for multiple whitespace analysers.
r182627:
Added support for comments in error model text files. Added support for zipped
but uncompressed files (required by divvunspell for now).
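The two features in r182627 can be sketched in Python with the standard `zipfile` module. This is not the build code. The `#` comment marker and the helper names are assumptions made for this note, and the changelog does not say which comment syntax the error model files use.

```python
import io
import zipfile


def strip_comments(lines, marker="#"):
    """Drop blank lines and lines starting with the (assumed) comment marker."""
    return [ln for ln in lines if ln.strip() and not ln.lstrip().startswith(marker)]


def write_stored_zip(members: dict[str, bytes]) -> bytes:
    """Build a zip archive in memory whose members are stored uncompressed."""
    buffer = io.BytesIO()
    # ZIP_STORED keeps each member byte-for-byte, with no deflate step.
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        for name, data in members.items():
            archive.writestr(name, data)
    return buffer.getvalue()
```

A stored archive is a container only: each member can be read straight out of the file without decompression, which is presumably why a consumer such as divvunspell would ask for it.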
r181915:
Added simple shell script to easily run the grammar checker test tool, and
considering build directories etc.
r180818:
Generate and compile the new filter for removing semantic tags in front of
derivations. Require new version of the giella-core because of dependencies.
r180788:
Make sure all generated files have a suffix that will make them be ignored.
Added comments to clarify.
r180778:
Børre updated the documentation url to point to giellalt.uit.no.
r180635:
Fixed stupid copy-paste error in the previous commit. Reorganised the code a bit
to make a variable definition clearer and more logical.
r180093:
Make sure that the input to all variants of the mobile speller is weighted.
r179995:
Fixed fsttype mismatch error for filters when building mobile spellers, by
building filters locally of the correct fst type, as we do for desktop spellers.
r177971:
Added UpCase function to the tokenisers, to handle all-upper variants of the
input side. It does almost double the size of the fst, but at least it is just
one additional line of code. Also, it does only work in Linux/using glib (for
other platforms it is restricted to Latin1 - still, that covers a major portion
of the Sámi fst's and running text, so much better than nothing).
r177633:
Ensure that the correct grammar checker pipeline is the default one, so that it
will be executed when no pipeline is specified.
r177299:
Added the new multichar +Symbol to the multichar definitions.
r177265:
Changed sub-post tag for symbols from +ABBR to +Symbol. Needs to be declared as
multichar in each language.
r177031:
Added support for shared Symbol file: build rules, affix file, modifications to
root.lexc. Also increased required version of giella-common, to make sure that
the shared stem file is actually there.
r177024:
Fixed dir name typo that broke compilation.
r177013:
Fixed copy-paste error.
r177012:
Added support for building an analyser tool. This is in practice an
xml-specified pipeline identical to what is used in the grammar checker, but
where the pipeline does text analysis instead of grammar checking. Also made
grammar checkers and mobile spellers part of the --enable-all-tools
configuration.
r176732:
Added filter to remove the +MWE tag from the grammar checker generator. It
blocked generation of some word forms (and should not be visible in any case).
r176032:
Fixed another case of transducer format mismatch for hyphenators, this time
regarding pattern-based hyph building.
r176022:
Corrected an instance of transducer format mismatch when building hyphenators.
r175743:
Make the mobile keyboard layout error model work properly (ie on input longer
than one char) by circumfixing it with any-stars.
r175625:
First round of improved handling of compilation errors in shell pipes: instruct
make to delete targets when some of the intermediate steps fail.
r175576:
Added configure.ac conditional to control whether spellers for alternative
orthographies are built. The default is 'true'. Set this to 'false' for
historical or other orthographies for which a speller is not relevant.
r175563:
Fix broken hfst builds of xfscript files when there is no final newline in the
source file (caused the save command to be shadowed by the final line of text,
usually a comment, so no file was saved, and thus there was nothing to work on
for the next build step).
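The failure mode in r175563 is generic: text appended to a file without a final newline lands on the file's last line. The Python sketch below shows one way to guard against it. It is not the fix used in the build files, which the changelog does not describe, and the function name and output filename are hypothetical.

```python
def append_command(script_text: str, command: str) -> str:
    """Append a command on its own line, even if the script lacks a final newline."""
    if script_text and not script_text.endswith("\n"):
        # Without this, the command would be glued onto the last line, and
        # swallowed if that line is a comment.
        script_text += "\n"
    return script_text + command + "\n"


if __name__ == "__main__":
    print(append_command("! final comment, no newline", "save stack out.hfst"))
```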
r175523:
Apply alternate orthography conversion after hyphenation marks have been
removed, but before the morphology marks are deleted. Especially word boundaries
are useful for certain types of conversion, but other borders will likely be
useful as well. The conversion scripts need to take the border marks into
consideration.
r175130:
Replicate the desktop error model for the mobile speller, and generalise the
corpus weighting compilation. Now the build code is ready for mobile speller
release.
r175033:
Improved Easter egg generation, using the improved script in giella-core.
Increased the required giella-core version correspondingly.
r174992:
Cleaned the HFST_MINIMIZE_SPELLER macro, and also its use. No need to include
push weights anymore, it is done always, for all speller fst's.
r174980:
Push weights for all final fst's, + optimise error model.
r174944:
Changed how the att file is produced. From now on it should be built once, and
then added to svn. The att file will usually not change, and storing it in svn
will avoid rebuilding it every time. Also changed the compression.
r174903:
Added support for adapting the error model to the mobile keyboard layout for the
language in question.
r173443:
Two more places to remove the Use/-GC and the MWE tags: mt and speller fst's.
Now done.
r173403:
Had forgotten to remove the Use/-GC tag in the core fst's, only from all the
others. Now fixed.
r173359:
Step 2 in blocking dynamic compounds of MWE tagged entries: moved all MWE tag
processing away from the *-raw-* targets to the specific *.tmp targets. This way
the MWE tags will survive long enough to be available for the blocking done in
the tokeniser fst's. Tested in SME, and seems to work as intended.
r173267:
Added step 1 in blocking dynamic compounds between an MWE and another noun: added
new filter that will turn the MWE tag into a flag diacritic. Increased required
giella-common version number due to the new filter.
r172759:
Fixed bug when building the punctuation file - the required subdir was not made.
r172172:
Moved the whitespace analyser almost to the beginning of the pipeline, directly
after the tokeniser+analyser. This is to be able to support sentence boundary
detection, as the whitespace analyser will give some valuable tags for that.
r172102:
Corrected typo in a configuration option - dekstop instead of desktop. Thanks to
our friends in Nuuk for noticing.
r172002:
Corrected a misplaced dependency that caused url.hfst to be rebuilt on every
make, and thus trigger other rebuilds. Not anymore.
r171992:
Moved whitespace tagging after the speller, to avoid that it creates trouble for
the speller. That happens when whitespace error tags are applied to the word
form that should be spell-checked.
r171963:
Made it possible to tag something as _only_ for the grammar checker, or _not_
for the grammar checker. Updated required giella-shared version, due to new
required filters.
r171951:
Moved whitespace chars to the blank regex, thereby reinstating the old
compilation speed. Thanks to Kevin and Tino for noticing and suggesting the
improvement. Also added comment to document what incondform is supposed to
contain, again thanks to Kevin.
r171935:
Removed hyphen from the regular unknown alphabet, thereby reverting analysis of
-foo as one (unknown) token, and instead back to two tokens. Added hyphen to
alphamiddle, so that foo-bar will still be analysed as one big unknown token.
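The intended behaviour in r171935 can be imitated with a regular expression. This is a rough Python analogy only: the real tokeniser is a pmatch script compiled to an fst, and the pattern below is an assumption that treats `\w` as the alphabet and allows a hyphen only between alphabetic runs.

```python
import re

# A token is an alphabetic run that may contain hyphens in the middle,
# or else any single non-space character (such as a leading hyphen).
TOKEN = re.compile(r"\w+(?:-\w+)*|\S")


def tokenise(text: str) -> list[str]:
    """Split text into tokens, keeping word-internal hyphens inside the token."""
    return TOKEN.findall(text)


if __name__ == "__main__":
    print(tokenise("-foo"))
    print(tokenise("foo-bar"))
```

With this pattern a leading hyphen becomes its own token, while a hyphen between two alphabetic runs stays inside a single token.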
r171912:
Added the tokenisation disambiguation file to the compiled and installed
targets.
r171888:
Better handling of unknowns: defined more whitespace characters, defined a lot
more vowels in the alphabet, added recent improvements to flag diacritic like
symbols at token boundaries.
r171746:
Fixed two build bugs: abbr.txt was only autogenerated when building with hfst,
and the url.?fst file was not properly generated from url.tmp.?fst.
r171722:
Fixed bug in MT compilation - pattern rules are not used, but new filenames
still had them due to copy-paste error.
r171714:
Added pmatch filtering also to MT and spellcheckers. Now all tools and fst's
should be covered.
r171704:
Forgot to add pmatch filtering to the default targets in src/ - duh. Now done.
r171619:
Added pmatch filtering to the rest of the build targets in src/. Also added
grammar checker filtering.
r171601:
Major reorganisation to properly handle pmatch preparations, by splitting the
disamb-analyser compilation in two: one going to the regular disamb analyser,
and the other going to the pmatch variant. We use the two tags +Use/PMatch and
+Use/-Pmatch in complementary distribution to specify paths for each, one path
containing pmatch backtracking points (used with the --giella format of
hfst-tokenise), and one without. The backtracking machinery is used to handle
ambiguous tokenisation. Increased required version of giella-shared due to new
required filters.
r171508:
More improvements to the analysis regression check: undo space->underscore from
lookup2cg (to avoid meaningless diffs when comparing to the new hfst-tokenise),
and removed weight info. Also changed the dir ref for abbr.txt to ref the build
dir, not the source dir, as that is where the file is generated.
r171459:
Improved regression check script: check that the abbr file is built, for
improved traditional tokenisation; and make the patch command silent, for less
noise during testing.
r171273:
Thanks to Børre, the analysis regression script will now remove diffs due to
different handling of dynamic compounds when comparing old and new tokenisation.
This makes it much easier to spot real differences between the two.
r171249:
Improved shell script for analysis regression testing, so that in cases of no
diffs it will only print a short message and continue. The test for no diff is
also much faster than a real diff. Improves processing time a lot for large test
corpora.
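The idea in r171249, a cheap equality test before an expensive diff, looks like this in Python. The actual regression check is a shell script; this sketch uses the standard `filecmp` and `difflib` modules, and the function name is hypothetical.

```python
import difflib
import filecmp


def report_diff(old_path: str, new_path: str) -> str:
    """Return a short message if the files are identical, else a unified diff."""
    # Byte-wise comparison stops at the first difference and builds no diff.
    if filecmp.cmp(old_path, new_path, shallow=False):
        return "No differences."
    with open(old_path, encoding="utf-8") as old, open(new_path, encoding="utf-8") as new:
        diff = difflib.unified_diff(
            old.readlines(), new.readlines(), fromfile=old_path, tofile=new_path
        )
        return "".join(diff)
```

For a large test corpus where most files are unchanged, nearly every comparison ends at the equality test and the diff is never computed.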
r170776:
Moved punctuation definitions from each language to giella-shared/all_langs/.
Makes much more sense, and will help in resolving random tokenisation bugs due
to « and ».
r170717:
Implemented the option to compile phonology rules directly against the lexicon,
for better rule compilation optimisations. Kevin: fixed a bug in xml generation
for the grammar checker.
r170657:
Fixed hyphenation build when there is no phonology file.
r170641:
Corrected an error after the Hunspell config section was commented out.
r170632:
Added --enable-all-tools option to configure.ac, to allow for easier
configuration and testing of all common tools. Unstable or experimental tools
must still be explicitly enabled. Commented out the Hunspell speller config
completely, it is not supported. Corrected a comment.
r170561:
Improved and completed the code to skip building phonology fst's. Clearer logic
and comments.
r170545:
Added a configure.ac setting to skip phonology compilation, typically used when
compiling external sources that provide a full analyser in src/morphology.
Also added a configuration option to compile xfscript files with lexicon
references in them, to allow for faster and more optimised rule composition.
This variable has no effect yet, the rest of the machinery is missing.
r170451:
Remove all tmp files when cleaning.
r170445:
Remove also url.tmp.lexc when cleaning.
r170435:
Fixed bug: the url analyser is located elsewhere, and should not be processed
here in any case.
r170418:
Made url analyser compilation open for local adaptations, by going via a tmp
file.
r170362:
Remove also url.lexc when cleaning, it is copied from giella-shared.
r170232:
Corrected double installation of url analyser bug. It should not be installed at
all.
r170218:
Add missing ‘|’ in analyser-gt-whitespace.hfst goal.
r170181:
Fixed a bug in the previous commit that surfaced when enabling tokenisers but
not grammar checkers.
r170172:
Massive rewrite of filter codes and automatically generated tag conversions, all
done to handle bug #2474 (URL tag not correctly formatted in the tokeniser
output). The bug should be fixed now.
r170051:
Added filter dir and filter compilation to the fst-based hyphenators. Moved
filter compilation from src/filters/ to the local filter dir (by copying the
regex files and then compile them), to make the build process mostly fst format
independent.
r170004:
Added support for local modifications of the hyphenator build via a tmp file.
Simplified tmp handling in the src/ dir.
r169989:
Added dir structure and Autotools data to prepare for adding hyphenation
testing.
r169972:
Downcasing of derived proper nouns was only applied on the input side, not the
hyphenated side. This caused such words to be case-shifted: arabialaččat ->
A^ra^bi^a^lač^čat. This is now fixed.
r169958:
Fixed hyphenation bug where the lexicon-based hyphenator missed hyphenation
points, mainly in proper nouns, due to flag diacritics. Fixed by telling the fst
compiler to treat flags as epsilons. Now the lexicon-based hyphenator is beating
the plain rule-based one in most (all?) cases where there are differences. Must
be tested better, though.
r169825:
Added comment to guide placement of local build targets (to avoid future merge
conflicts), and a comment reminder about other places to change filenames.
r169798:
Reorganised the source filenames to make it easy to override when needed. Should
make it possible to solve the bug where src/syntax/disambiguator.cg3 overrides
the same file in tools/grammarcheckers/.
r169780:
Refactored repeating patterns of code with variables, fixes upload link after
XServe crash last winter.
r167627:
Corrected and improved the compilation of the analysers including the URL
analysis. This should fix the problem with compiling SMA and other languages,
and should in general reduce both compilation time and analyser size. The basic
change was to union in the URL analysis as the last step in building the
analysers, instead of early - the early injection led to fst blowup during
minimisation. Now no blowup appears to take place.
r166912:
Added the special target .NOTPARALLEL to the hfst speller make file, to work
around a make bug that caused a prerequisite to not be built when invoking make
with the -j option. Also added some comments.
r166895:
Updated command in comments to use the correct option.
r166802:
Reverted the more robust semantic tag reordering, it was just too slow. Now we
are back to a less robust and more fragile system (including bugs), but with
faster compilation. Ultimately we will abandon _semantic_ tag reordering
altogether, and instead rewrite the lexc code to always place the semantic tags
where they should be.
r166754:
Corrected automake (and make?) syntax error that broke compilation.
r166722:
Simplified semantic tag filtering regex construction.
r166504:
Too eager in the previous commit to get rid of semantic tag processing: removed
the filter to zero out semantic tags completely, which broke compilation of a
number of fst's where semantic tags are not wanted.
r166462:
Corrected bugs in reordering semantic tags by doing the reordering in two steps:
1) insert the tag in the new and correct position, and 2) remove the tag in the
wrong position. There will probably be things to iron out, but initial tests are
fine. This should also make the whole semantic tag reordering a bit faster to
compile and apply, as the generated regexes are smaller and simpler.
r166172:
Now that the downcasing script works in all cases, remove all the special
processing, and get rid of spurious rebuilds of the dependent fst's. Another
time-saver:-)
r166165:
Changed the downcasing script to work also with hyperminimised hfst-fst's. Now
the downcasing script works both with Xerox, Hfst and Foma, and both with
standard and hyperminimised hfst-fst's. Finally!
r165881:
Added support for filters for grammatical and derivation tags, sorted the
generated filter list.
r165789:
Bugfix: OLang/xxx tags were removed, not made optional, in generators.
r165752:
Do not delete disambiguator.cg3 and grammarchecker.cg3 when cleaning.
r165726:
Whether to let the orig-lang tags be visible in the disambiguating analyser or
not is dependent on the language and the needs of each language community.
Moving the removal of those tags from the general processing to the language
specific processing. Step 2: removing it from the general processing.
r164386:
Added the -p option to the yaml testing command, to remove all passing tests.
This should make it easier to spot the actual FAILs.
r164372:
Corrected path to zhfst file. Also changed the return code when the zhfst file
is not found, so that it will be reported as a FAIL. Since this test is only run
when configured for building spellers, a missing zhfst file should be fatal.
Also changed variable name to avoid confusion with the shell variable.
r164364:
Added phony target forwarding 'make test' to 'make check'. Required to make
'make check' work on some build systems.
r164245:
Added a separate disambiguation file for the spell checker output, and a
spell-checker-only pipeline (well, still tokenisation and disambiguation, but
no proper grammar checking).
r164223:
Corrected Foma compilation for phonology rules.
r163430:
Made symbol alignment default - I can see no cases where we don't want it, but
it is still possible to disable it if such a need pops up. Also improved the
error message when trying to build a twolc language using Foma.
r163423:
Added INFO text about switching to Hfst as a fallback when Xerox tools are not
found. Also added test and error message when using Foma on a language with a
twolc file.
r163194:
Fixed URL analysis in MT. All URL's and email addresses are now tagged +URL.
Although the url analyser itself is small, the resulting analyser quadrupled in
size (in sme).
r163061:
Removed filters for removing morphological borders - they destroy the asymmetry
of the fst's, and make yaml testing more complicated.
r163050:
Added support for Area variants of the grammar checker generator. Should fix
nightly build error for SMJ.
r163043:
Added missing Foma support for dictionary fst's.
r163036:
Fixed the last bunch of path errors. Now all yaml tests are back to normal.
r163017:
Cleanup: re-enabled the commented-out test loop, removed exit statement used during
development, fixed path for two test scripts.
r163001:
The last set of test runners for yaml tests changed to the new system.
r162992:
Three more yaml test runners done, still a few more to go before yaml testing is
back in shape.
r162964:
Changed the last yaml testing scripts in the template to follow the new and
improved system. No need for autoconf processing anymore.
r162937:
Major rework of the yaml testing framework, to be able to properly support fst
type specific yaml testing (ie test only xfst or hfst transducers, or everything
but xfst transducers (=foma & hfst)). This change triggered a number of other
changes. The user-facing shell scripts are greatly simplified by this change.
r162892:
Corrected AM errors in the previous merge. Now the build is working again.
r162885:
Added support for grammar checker generators for alternative orthographies and
writing systems. Should fix nightly build issue in CRK.
r162691:
Added support for a grammar checker specific generator. Should fix various
issues re generation of suggestions.
r162556:
Added test for the presence of divvun-validate-suggest, which is now required to
build grammar checkers. Now configure will error out instead of make.
r162499:
Add note to the errors.xml file that it is generated, and from which file it is
generated, to avoid people editing the wrong file.
r162487:
Error messages are now copied from a source file to a build file, after being
validated. This allows support for VPATH builds and retains the integrity of the
zcheck file. At the same time also replaced hard coded language names with
automake variable expansion in the pipespec.xml.in file.
r162377:
Fixed bug in building dictionary analysers for alternative orthographies,
introduced in the changes yesterday.
r162370:
Added option to specify language variant, to allow testing spellers for
alternative writing systems, alternative orthographies, different countries etc.
r162316:
Added support for area / country specific fst's for the specialised dict and
oahpa build files. At the same time reorganised the build code so that targets
with two variables now consistently use the fst type / suffix as the pattern,
and the writing system/alt orth/area/etc as the function parameter. This should
make the build system more robust by reducing the risk of accidental pattern
similarity.
r162295:
Added support for building area/country specific spellers. The target language
for now is SMJ, but the feature is of course language independent and useful in
a number of other circumstances.
r162279:
Changed dialect fst filenames to follow existing patterns used for Oahpa fst's.
r162264:
Added support for building dialect fst's. It is disabled by default, but can be
enabled with a configure option. Also changed the disamb analyser to keep the
dialect tags. Only normative fst's are filtered against dialect tags.
r162242:
Added initial support for building Area-specific analysers and generators (norm
only). Also restored Area tags in the disamb and grammar checker analysers.
Fixed missing support for Foma transducers in the alternative writing system
support.
r162204:
Grammar checker .zcheck file should go into datadir, not libdir.
r162194:
Now using speller version info from configure.ac, not version.txt, which is
removed. New giella-core required.
r162184:
Fixed a bug in fst format handling for the grammar checker - conflicting formats
caused a segfault. Now using openfst-tropical for all fst's being processed in
the grammarcheckers/ dir (presently only the speller acceptor analyser).
r162142:
Fixed OLang tag extraction and filter generation.
r162130:
Added weights to compounds in the language-independent build steps (languages
without compounds will go through the same step, but will not be changed).
Applied only to analysers. Also added spellrelax to the language-independent
build of the analysers = it is always applied.
r162106:
Improved the previous fix: make sure it does not crash when the target file does
not exist, and use the same test on all autogenerated tag lists. This should
save a few more seconds of build time.
r162089:
Fixed bug #2355 so that the filters for semantic tags will only be rebuilt when
there are real changes to the semantic tags.
r162031:
Corrected a € vs cut incompatibility on Linux, cf bug report #2457.
r161989:
Updated the pipespec.xml file to comply with the newest version of the grammar
checker code, where each argument type is explicitly specified. Makes for a more
robust pipeline.
r161903:
Corrected fileref in m4, added correct autoconf path to errors.xml.
r161896:
Renamed pipespec.xml to *.in, to allow autoconf processing. This makes it
possible to use modes when building using VPATHS.
r161856:
Hard-coded filename in fallback target - that was the only way to work around a
loop in make on some systems.
r161813:
Renamed src/syntax/disambiguation.cg3 to src/syntax/disambiguator.cg3, to keep
the file naming consistent (actor noun if possible), and remove discrepancy
between the regular disambiguator and the grammar checker disambiguator that
caused makefile troubles.
r161129:
Heavy rewrite of the analysis regression check tool, to support testing the
grammar checker pipeline.
r161092:
Do not remove semantic tags, dialect tags and other tags useful for
disambiguation or suggestion generation. The grammar checker speller needs
these, and they will anyway disappear when we project the final fst.
r160750:
Proper verbosity specification in a few more instances, and added weight pushing
for the grammar checker speller now (how could I have missed that?).
r160742:
Fixed a bug in piped hfst-xfst commands: in three cases the -p option was
missing, causing strange misbehavior in hfst-xfst on some systems.
r160734:
Further configure.ac cleanup: moved some variable definitions to other m4 files,
moved the language definition on top, deprecated GTLANG* variables for GLANG*
variants (ie Giella instead of GiellaTechno). Updated copyright year.
r160723:
Moved all default AC_CONFIG_FILES into a separate function in a separate m4
file, to clean up configure.ac. Some other cleanup of configure.ac.
r160716:
Defined variable for separate speller release version string.
r160710:
Changed package name and version to more clearly be a real name and version
number.
r160701:
Updated comment in preparation for other changes.
r160670:
Added support for analysing whitespace, thus making it possible to tag
whitespace errors (double spaces, extra spaces, etc), and also to more reliably
detect sentence and paragraph borders by using whitespace as a delimiter.
r160649:
Using absolute dir refs to make it possible to call the shell scripts from
everywhere.
r160600:
Fixed a bug: forgot to remove a line.
r160589:
Rewrote the speller test scripts in devtools/ to be VPATH safe and rely on
autotools for paths etc, so that the scripts will work also when only checking
out single languages.
r160153:
Added support for specifying language-specific files to be included in the
grammar checker archive file.
r160039:
Updated grammar checker files and build rules.
r159755:
Added hfst-push-weights to move transducer weights to the beginning of the
strings, to enable proper optimisations of speller lookup in hfst-ospell.
Stripped out most lang-specific stuff from grammar checker cg file, and added
simple example rules + some explanations. Use gramcheck tokeniser in pre-pipe.
r159063:
Added default rule for speller suggestions, to make the suggestions survive cg
treatment.
r159036:
Added spell checking component to the grammar checker pipeline. Now every
planned component is working as it should. For the spell checking to work, one
must first build the latest hfst-ospell code, and then the newest grammar
checker code.
r158455:
Increased weights for fall-back rule-based hyphenation. Added .hfst suffix to
rule fst for consistency.
r158399:
Replaced the huge sme grammar checker with the more moderate smn grammar checker
cg file, as the template file for future grammar checkers.
r158388:
Added note (readme file) about NOT touching the local am-shared dir, to avoid
future unintended changes.
r158353:
Added the missing files for a working grammar checker. Fixed grammar checker
build rules to not be dependent upon enabling tokenisers.
r158289:
Added conversion of the analysis tags from the grammar checker speller into CG
format.
r158250:
One misplaced variable caused the grammar checker speller to be built
independently of the configuration. This caused a build failure for everyone. Solves
bug #2437. Also added $(srcdir) in front of root.lexc, to ensure that the file
reference resolves correctly in local build targets.
r158242:
Moved the target clean-local to the local Makefile, to make it possible to
enhance the clean target with locally generated files.
r157960:
Corrections to the grammar checker speller build: we now build a working zhfst
file that can be used as part of the development cycle. Also additions to silent
builds.
r157879:
Major update to the grammar checker template. It still does not work completely
as it should, so hold your horses. Update content: ensured that all files needed
are copied to the grammar checker build dir, removed option to name files
(=irrelevant bloat), now builds an almost proper zip file, and ensured that
tokenisers are built before grammarcheckers. Also made it so that when grammar
checkers are enabled, spellers are automatically enabled too, as they will be
included as part of the grammar checker pipeline.
r157261:
Changed the file exists test for the lemma generation testing so that it will
work even in cases where multiple source files are used as input.
r157204:
Made cg3 file compilation more general.
r157096:
Moved the code to build the apertium relabel script in the apertium directory,
so that we can use the actual giella-tagged fst for MT as the tag source. This
should fix all issues of missing tags in the relabel script.
r157021: