up to: 54d426adbe06b3187b55bc804767b459c023007a
Snowball 2.3.0 (2024-09-??)
===========================
Ada
---
* Bug fixes:
+ Fix code generated for Snowball `loop` which previously was partly Pascal
rather than Ada (it looks like the Ada generator was originally based on
the Pascal one). None of the stemmers shipped in previous releases
exercised this case, but the Turkish stemmer now does.
* Code quality:
+ Only declare variables A and C when each is needed.
+ Fix indentation of generated declarations.
C/C++
-----
* Code quality:
+ Fix formatting of generated C code which was missing a newline in the code
generated for `goto` or `gopast` followed by a grouping, unless `-comments`
was specified.
C#
--
* Bug fixes:
+ Add missing runtime support for testing for a string var at the current
position in forward mode. Previously this was only implemented for
`backwardmode`. The forwards case isn't exercised by any of the stemming
algorithms we currently ship.
Go
--
* Optimisations:
+ Drop some unneeded Go code generated for string `$`. None of the shipped
stemmers use string `$`, though the Schinke Latin stemmer algorithm on the
website does.
* Code quality:
+ Dispatch `among` result with `switch` instead of an `if` ... `else if` chain
(which it looks like we did because the Go generator evolved from the Python
generator and Python didn't use to have a switch-like construct). This
doesn't make a measurable speed difference so it seems the Go compiler is
optimising both to equivalent code, but using a `switch` here seems clearer,
a better match for the intent, and is a bit simpler to generate.
Java
----
* The Java code generated by Snowball now requires Java >= 7. Java 7
was released in 2011, and Java 6's EOL was 2013 so we don't expect this
to be a problematic requirement. See #195.
* Optimisations:
+ We now store the current string in a `char[]` rather than using a
`StringBuilder` to reduce overheads. The `getCurrent()` method continues
to return a Java `String`, but the `char[]` can be accessed using the new
`getCurrentBuffer()` and `getCurrentBufferLength()` methods. Patch from
Robert Muir (#195).
+ Use a more efficient mechanism for calling `among` functions. Patch from
Robert Muir (#195).
* Code quality:
+ Consistently put `[]` right after element type for array types, which seems
the most used style.
Javascript
----------
* Bug fixes:
+ Use base class specified by `-p` in string `$` rather than hard-coding
`BaseStemmer` (which is the default if you don't specify `-p`). None of
the shipped stemmers use string `$`, though the Schinke Latin stemmer
algorithm on the website does.
* Code quality:
+ Modernise the generated code a bit. Loosely based on changes proposed in
#123 by Emily Marigold Klassen.
* Other changes:
+ Add start of ESM support. See #183, reported by Lionel Rowe.
Pascal
------
* Code quality:
+ Eliminate commented out code generated for string `$`. None of the shipped
stemmers use string `$`, though the Schinke Latin stemmer algorithm on the
website does.
Python
------
* Bug fixes:
+ Correctly handle stemmer names with an underscore (not currently exercised
by any stemmers we ship).
* Other changes:
+ Set python_requires to indicate to install tools that the generated code
won't work with Python 3.0.x, 3.1.x and 3.2.x (due to use of `u"foo"`
string literals). Closes #192 and #191, opened by Andreas Maier.
+ Add classifiers to indicate support for Python 3.3 and for 3.8 to 3.13.
Fixes #158, reported by Dmitry Shachnev.
Rust
----
* Code quality:
+ Suppress unused_parens warning, for example triggered by the code generated
for `$x = x*x` (where `x` is an integer).
+ Dispatch `among` result with `match` instead of an `if` ... `else if` chain
(which it looks like we did because the Rust generator evolved from the Python
generator and Python didn't use to have a switch-like construct). This
results in a 3% speed-up for an unoptimised Rust compile but doesn't seem
to make a measurable difference when optimising so it seems the Rust
compiler is optimising both to equivalent code. However using a `match`
here seems clearer, a better match for the intent, and is a bit simpler to
generate.
New stemming algorithms
-----------------------
* Add Estonian algorithm from Linda Freienthal (#108).
Behavioural changes to existing algorithms
------------------------------------------
* English: Add extra condition to undoubling. We no longer undouble if the
double consonant is preceded by exactly "a", "e" or "o" to avoid conflating
"add"/"ad", "egg"/"eg", "off"/"of", etc. Fixes #182, reported by Ed Page.
* German: Replace with the "german2" variant. This handles the spelling
convention of writing umlauts with an "e" suffix and "ß" as "ss", which is
presumably much less common in newly created text than it once was as modern
computer systems generally don't have the limitations which motivated it, but
there will still be large amounts of legacy text which it seems helpful for
the stemmer to handle without having to know to select a variant.
On our sample German vocabulary which contains 35033 words, 77 words give
different stems. A significant proportion of these are foreign words, and
some are proper nouns. Some cases definitely seem improved, and quite a few
are just different but effectively just change the stem for a word or group
of words to a stem that isn't otherwise generated. There don't seem to be any
changes that are clearly worse, though there are some changes that have both
good and bad aspects to them.
Fixes #92.
* German: Don't remove -em if preceded by -syst. Previously we would overstem
words ending -system. This change means we now conflate e.g. "system" and
"systemen". See #161.
* Italian: Address overstemming of "divano" (sofa) which previously stemmed to
"div", which is the stem for 'diva' (diva). Now it is stemmed to 'divan',
which is what its plural form 'divani' already stemmed to. Fixes #49,
reported by francesco.
* Romanian: Fix to work with the Unicode alphabet in modern use.
Previously the stemmer did not work with s-comma and t-comma characters,
but only with their cedilla "approximations" from before Romanian had
full Unicode support.
The old cedilla "approximations" are now normalized to the proper Unicode
characters by the stemmer. Patch from Robert Muir.
* Swedish: Replace suffix "öst" with "ös" when preceded by any of 'iklnprtuv'
rather than just 'l'. The new rule only requires the "öst" to be in R1
whereas previously we required all of "löst" to be. This second tweak
doesn't seem to affect any words ending "löst" but it conflates a few extra
cases when combined with the expanded list of preceding letters, and seems
more logical linguistically (since "ös" is akin to "ous" in English). Fixes
#152, reported by znakeeye.
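The Romanian normalisation described above can be illustrated with a minimal
Python sketch. This is not the stemmer's own code: the table and the function
name `normalise_romanian` are assumptions made for illustration, and only the
lower case letters are mapped.

```python
# Map the legacy cedilla forms to the comma-below letters used in modern
# Romanian. Hypothetical helper for illustration, not part of Snowball.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
})


def normalise_romanian(word: str) -> str:
    """Return word with the cedilla approximations replaced."""
    return word.translate(CEDILLA_TO_COMMA)
```

Words which already use the comma-below letters, or neither letter, pass
through unchanged.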
Optimisations to existing algorithms
------------------------------------
* Lithuanian: Remove redundant R1 check.
* Tamil: Optimise by using `among` instead of long `or` chains. The generated
C version now takes 43% less time to process the test vocabulary.
* Tamil: Remove many cases which can't be triggered due to being handled by
another case.
* Tamil: `test` clean ups.
* Tamil: Make fix_va_start simpler and faster.
* Tamil: Eliminate pointless flag changes.
* Turkish: Minor optimisations.
Code clarity improvements to existing algorithms
------------------------------------------------
* Lithuanian: Use recommended latin stringdef codes.
Using common codes makes it easier to work across algorithms, and
they are more mnemonic so also seem clearer when just considering this
one algorithm.
* Serbian: Use recommended latin stringdef codes.
Using common codes makes it easier to work across algorithms, and
they are more mnemonic so also seem clearer when just considering this
one algorithm.
* Turkish: Adjust stringdefs to match other uses.
Use {sc} for s-cedilla and {i} for dotless-i.
Compiler
--------
* Generic code generation improvements:
+ Add generic dead code elimination machinery. This facilitates various new
optimisations; so far the following have been implemented:
- Tail-calling
- Simpler code for calling routines which always give the same signal
- Simpler code when a routine ends in an integer test (this also allows
eliminating an Ada-specific codegen optimisation which did something
similar but only for routines which consist *entirely* of a single
integer test)
- Dead code reporting and removal (only in simple cases currently)
Currently this overlaps in functionality with the existing reachability
tracking which is implemented on a per-language basis, and only for some
languages. This reachability tracking was originally added for Java
where some unreachable code is invalid and results in a compile time error,
but then seems to have been copied for some other newer languages which
may or may not actually need it. The approach it uses unfortunately
relies on correctly updating the reachability flag anywhere in the
generator code where reachability can change which has proved to be a
source of bugs, some unfixed. This new approach seems better and with some
more work should allow us to eliminate the older code. Fixes #83.
+ Omit check for `among` failing in generated code when we can tell at compile
time that it can't fail.
+ Eliminate `!`/`not` from integer test code by generating the inverse
comparison operator instead for all languages, e.g. for Python we now
generate
    if self.I_p1 >= self.I_x:
instead of
    if not self.I_p1 < self.I_x:
This isn't going to be faster in compiled languages with an optimiser but
for scripting languages it may be faster, and even if not, it makes for a
little less work when loading the script.
+ Avoid trailing whitespace in generated files.
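The integer test change above amounts to swapping each relational operator
for its inverse rather than wrapping the test in a negation. A minimal Python
sketch of the idea follows; the table `INVERSE` and the function
`negated_test` are names invented for this illustration and are not taken
from the compiler source.

```python
# Each relational operator paired with the operator which gives the
# opposite result for integers. Hypothetical names, for illustration only.
INVERSE = {
    "==": "!=", "!=": "==",
    "<": ">=", ">=": "<",
    ">": "<=", "<=": ">",
}


def negated_test(lhs: str, op: str, rhs: str) -> str:
    """Return source text equivalent to `not (lhs op rhs)` without `not`."""
    return f"{lhs} {INVERSE[op]} {rhs}"
```

For example `negated_test("self.I_p1", "<", "self.I_x")` gives the text of
the first Python test shown above.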
* Improve `--help` output.
* Check for division by zero during constant folding. This now gives an
error.
* Simplify more numeric expressions. We now simplify identity operations such
as `x + 0` to `x`, and operations which are equivalent to negation such as
`0 - x`, `-1 * x`, `x / -1` to `-x`.
It's unlikely such expressions would be written literally, but they may be
created by constant folding, e.g. `x * sizeof 'a'` -> `x * 1` -> `x`.
* Fix segfault for `hop` followed by an unexpected token (e.g. `hop hop`). We
were already emitting a suitable error but would then segfault because AE on
the node for the `hop` was NULL.
* Give an error for redefinition of a grouping. We already catch this for
routines.
* Make the compiler fail cleanly if `malloc` fails. We now report the failure
and exit. Previously the NULL return from `malloc` wasn't checked for so we'd
typically segfault.
* `lenof` and `sizeof` now mark their arguments as used. This avoids a bogus
warning followed by a confusing additional message if this is the only use of
the value of the string variable they're applied to:
    lenofsizeofbug.sbl:3: warning: string 's' is set but never used
    Unhandled type of dead assignment via sizeof
This was found while creating an artificial testcase. It seems an unlikely
situation to encounter in real code as you'd probably use `hop` or subtract
values of cursor or from `setmark` rather than copying the value into a string
variable just to find its length.
* Fix line number for "string not terminated" error. The reported line number
was one too high in the case where we were in a stringdef (but correct if we
weren't).
* Eliminate special handling for among starter. We now convert the starter to
be a command before the among, adding an explicit substring if there isn't
one. So:
    substring C among ((X) ...)
now generates the exact same parse tree as:
    substring C X among (...)
and:
    among ((X) ...)
as:
    substring X among (...)
* Warn about suspicious situations where a command always signals f or always
signals t.
* Fix a memory leak in the compiler. When an output file is used and the name
option is not, the compiler would dynamically allocate a value for the `name`
option and never free it. This is not a large problem in itself, because at
the end of the compile process the OS will free all allocated memory anyway.
However, when using snowball as part of a larger toolchain with compile
options such as `-fsanitize=leak`, a memleak in snowball can break the entire
build, which is exactly what happened to the patch author. It would be
possible to work around this, but it seems better to fix the leak in the
compiler properly. Patch from jsteemann. See #136 and #166.
LSAN_OPTIONS=leak_check_at_exit=0
* Store textual data more efficiently in the compiler. Previously the snowball
compiler stored almost all textual data using the `symbol` type, which is a
typedef for `unsigned short`. This was done even though most such data only
used 8 bit character values, so it just ended up spaced out to twice the size
(assuming 2 byte short) with a zero byte between every actual character value
byte.
The space and time overheads this incurred weren't really an issue as
snowball programs are small, but this also complicated code that handled
such data as it often needed to work character by character rather than
treating the data as a block.
Now we only use `symbol` for literal strings, as they may need to be
stored as wide character Unicode (ENC_WIDECHARS). If ENC_SINGLEBYTE or
ENC_UTF8 is in use then literal string data is still stored using
`symbol`.
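The size overhead of the old storage described above can be seen in a small
Python sketch. The compiler itself is written in C, so this only mimics the
memory layout; the sample text, the variable names and the little-endian
byte order are assumptions chosen for illustration.

```python
import struct

text = b"among"  # sample data using only 8 bit character values

# Old layout: one unsigned short per character (2 byte short assumed,
# little-endian chosen here so the padding byte follows each character).
as_symbols = struct.pack(f"<{len(text)}H", *text)

# New layout: one byte per character.
as_bytes = bytes(text)

print(len(as_symbols), len(as_bytes))  # prints "10 5"
print(as_symbols)  # prints b'a\x00m\x00o\x00n\x00g\x00'
```

The zero byte after every character is also why the old data could not
simply be handled as a block of ordinary character data.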
Build system
------------
* Turn on Java warnings and make them errors.
* Compile with `-g` by default. This makes debugging easier, and matches the
default for at least some other build systems (e.g. autotools).
* Ada: Fix `make clean` to remove all built files.
Author: Stefano Rivera <stefano@rivera.za.net>
* Makefile: Clean stemtest too.
* Add missing COMMON_FILES dep to dist targets.
* GNUmakefile: Tidy up and make more consistent.
* GNUmakefile: Make use of `$*`. This avoids needing to echo `$<` into sed to
extract the same string, which was slower and less readable, and allows other
simplifications.
* Use `$(patsubst ...)` instead of sed in .java.class rule. This gives cleaner
make output and is a bit more efficient.
libstemmer
----------
Testsuite
---------
* Give a clear error if snowball-data isn't found. Fixes #196, reported by
Andrea Maccis.
* Handle not thinning testdata better. If THIN_FACTOR is set to 1 we no
longer run gzipped test data through awk. We also now handle THIN_FACTOR
being set empty as equivalent to 1 for convenience.
* Fix Java TestApp to allow a single argument. The documented command line
syntax is that you only need to specify the language and there was already
code to read from stdin if no input file was specified, but at least two
command line options were required.
* Fix deprecation warning in TestApp.java.
* Optimise TestApp.java by creating fewer objects. Patch from Robert Muir.
* stemwords.py: Use `argv.pop(0)` to unshift elements. Removing and
retrieving in one function is cleaner, and avoids bugs (such as us not
actually removing the argument for `-c`, though we also currently ignore
options we don't understand so the processing of `-c`'s argument as an
option would typically not cause a problem).
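The `argv.pop(0)` pattern mentioned for stemwords.py can be sketched as
below. This is not the script's actual code: `parse_args` is a hypothetical
function, and treating `-c` as an option which takes one value is the only
detail borrowed from the note above.

```python
def parse_args(argv):
    """Split argv into an options dict and the positional arguments."""
    argv = list(argv)  # work on a copy so the caller's list is untouched
    options, positional = {}, []
    while argv:
        arg = argv.pop(0)  # remove and retrieve in one step
        if arg == "-c":
            if not argv:
                raise ValueError("-c requires an argument")
            # The value is consumed here, so it can't later be mistaken
            # for another option.
            options["c"] = argv.pop(0)
        else:
            positional.append(arg)
    return options, positional
```

Because each element is removed as it is read, there is no separate index to
keep in step with the list.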
Documentation
-------------
* Include "what is stemming" section in each README.
* Include section on threads in each README. Based on patch for Python from
dbcerigo.
* Document that input should be lowercase with composed accents. See #186.
* CONTRIBUTING.rst: Clarify which charsets to list.
* CONTRIBUTING.rst: Go into more detail.
* Fix some typos.
Author: Josh Soref <jsoref@users.noreply.github.com>
* Document that our CI now uses GitHub Actions.
* Update link to Greek stemmer PDF. Updated URL found by Michael Bissett, ref
https://github.com/snowballstem/snowball-website/pull/33
Snowball 2.2.0 (2021-11-10)
===========================
New Code Generators
-------------------
* Add Ada generator from Stephane Carrez (#135).
Javascript
----------
* Fix generated code to use integer division rather than floating point
division.
Noted by David Corbett.
Pascal
------
* Fix code generated for division. Previously real division was used and the
generated code would fail to compile with an "Incompatible types" error.
Noted by David Corbett.
* Fix code generated for Snowball's `minint` and `maxint` constants.
Python
------
* Python 2 is no longer actively supported, as proposed on the mailing list:
https://lists.tartarus.org/pipermail/snowball-discuss/2021-August/001721.html
* Fix code generated for division. Previously the Python code we generated
used integer division but rounded negative fractions towards negative
infinity rather than zero under Python 2, and under Python 3 used floating
point division.
Noted by David Corbett.
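The difference between the division behaviours described above can be shown
with a short Python sketch. It is an illustration of integer division which
rounds towards zero, not the code the generator emits, and `snowball_div` is
a hypothetical name.

```python
def snowball_div(a: int, b: int) -> int:
    """Integer division rounding towards zero (b must be non-zero)."""
    q = abs(a) // abs(b)
    # The quotient is negative only when the operands' signs differ.
    return q if (a < 0) == (b < 0) else -q


print(-7 // 2)              # prints -4 (floor division rounds down)
print(-7 / 2)               # prints -3.5 (Python 3 true division)
print(snowball_div(-7, 2))  # prints -3 (rounds towards zero)
```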
Code Quality Improvements
-------------------------
* C#: An `among` without functions is now generated as `static` and groupings
are now generated as constant. Patches from James Turner in #146 and #147.
Code generation improvements
----------------------------
* General:
+ Constant numeric subexpressions and constant numeric tests are now
evaluated at Snowball compile time.
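As a rough illustration of evaluating constant numeric subexpressions at
compile time, here is a minimal constant folder in Python. The tuple-based
tree and the names `OPS` and `fold` are assumptions for this sketch and do
not reflect the compiler's internal representation. Division is left out to
keep the sketch short.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold(node):
    """Fold constant subexpressions of a tiny expression tree.

    A node is an int literal, a str variable name, or a tuple
    (op, left, right) with op a key of OPS.
    """
    if not isinstance(node, tuple):
        return node
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)  # both sides known: evaluate now
    return (op, left, right)
```

For example `fold(("+", ("*", 2, 3), "x"))` returns `("+", 6, "x")`, so only
the addition involving the variable is left for run time.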
Behavioural changes to existing algorithms
------------------------------------------
* german2: Fix handling of `qu` to match algorithm description. Previously
the implementation erroneously did `skip 2` after `qu`. We suspect this was
intended to skip the `qu` but that's already been done by the substring/among
matching, so it actually skips an extra two characters.
The implementation has always differed in this way, but there's no good
reason to skip two extra characters here so overall it seems best to change
the code to match the description. This change only affects the stemming of
a single word in the sample vocabulary - `quae` which seems to actually be
Latin rather than German.
Optimisations to existing algorithms
------------------------------------
* arabic: Handle exception cases in the among they're exceptions to.
* greek: Remove unused slice setting, handle exception cases in the among
they're exceptions to, and turn `substring ... among ... or substring ...
among ...` into a single `substring ... among ...` in cases where it is
trivial to do so.
* hindi: Eliminate the need for variable `p`.
* irish: Minor optimisation in setting `pV` and `p1`.
* yiddish: Make use of `among` more.
Compiler
--------
* Fix handling of `len` and `lenof` being declared as names.
For compatibility with programs written for older Snowball versions
len and lenof stop being tokens if declared as names. However this
code didn't work correctly if the tokeniser's name buffer needed to
be enlarged to hold the token name (i.e. 3 or 5 elements respectively).
* Report a clearer error if `=` is used instead of `==` in an integer test.
* Replace a single entry command list with its contents in the internal syntax
tree. This puts things in a more canonical form, which helps subsequent
optimisations.
Build system
------------
* Support building on Microsoft Windows (using mingw+msys or a similar
Unix-like environment). Patch from Jannick in #129.
* Split out INCLUDES from CPPFLAGS so that CPPFLAGS can now be overridden by
the user if required. Fixes #148, reported by Dominique Leuenberger.
* Regenerate algorithms.mk only when needed rather than on every `make` run.
libstemmer
----------
* The libstemmer static library now has a `.a` extension, rather than `.o`.
Patch from Michal Vasilek in #150.
Testsuite
---------
* stemtest: Test that numbers and numeric codes aren't damaged by any of the
algorithms. Regression test for #66. Fixes #81.
* ada: Fix ada tests to fail if output differs. There was an extra `| head
-300` compared to other languages, which meant that the exit code of `diff`
was ignored. It seems more helpful (and is more consistent) not to limit how
many differences are shown so just drop this addition.
* go: Stop thinning testdata. It looks like we were only thinning because the
test harness code was based on that for rust, which was based on that for
javascript, which was only thinning because it was reading everything into
memory and the larger vocabulary lists were resulting in out of memory
issues.
* javascript: Speed up stemwords.js. Process input line-by-line rather than
reading the whole file into memory, splitting, iterating, and creating an
array with all the output, joining and writing out a single huge string.
This also means we can stop thinning the test data for javascript, which we
were only doing because the huge arabic test data file was causing out of
memory errors. Also drop the -p option, which isn't useful here and
complicates the code.
* rust: Turn on optimisation in the makefile rather than the CI config. This
makes the tests run in about 1/5 of the time and there's really no reason to
be thinning the testdata for rust.
Documentation
-------------
* CONTRIBUTING.rst: Improve documentation for adding a new stemming algorithm.
* Improve wording of Python docs.
Snowball 2.1.0 (2021-01-21)
===========================
C/C++
-----
* Fix decoding of 4-byte UTF-8 sequences in `grouping` checks. This bug
affected Unicode codepoints U+40000 to U+7FFFF and U+C0000 to U+FFFFF and
doesn't affect any of the stemming algorithms we currently ship (#138,
reported by Stephane Carrez).
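For reference, decoding a well-formed 4-byte UTF-8 sequence takes 3 bits from
the first byte and 6 bits from each of the other three. The Python sketch
below shows that arithmetic; it is not the C runtime's code, and
`decode_4byte` is a hypothetical name which does no validation of its input.

```python
def decode_4byte(b0: int, b1: int, b2: int, b3: int) -> int:
    """Decode a well-formed 4-byte UTF-8 sequence to a code point."""
    return (
        ((b0 & 0x07) << 18)    # 11110xxx: top 3 bits
        | ((b1 & 0x3F) << 12)  # 10xxxxxx: next 6 bits
        | ((b2 & 0x3F) << 6)   # 10xxxxxx: next 6 bits
        | (b3 & 0x3F)          # 10xxxxxx: low 6 bits
    )
```

A quick check against Python's own encoder is
`decode_4byte(*chr(0x40000).encode("utf-8")) == 0x40000`.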
Python
------
* Fix snowballstemmer.algorithms() method (#132, reported by kkaiser).
* Update code to generate trove language classifiers for PyPI. All the
natural languages we previously had stemmers for have now been added to
PyPI's list, but Armenian and Yiddish aren't on it. Patch from Dmitry
Shachnev.
Code Quality Improvements
-------------------------
* Suppress GCC warning in compiler code.
* Use `const` pointers more in C runtime.
* Only use spaces for indentation in javascript code. Change proposed by Emily
Marigold Klassen in #123, and seems to be the modern Javascript norm.
New Snowball Language Features
------------------------------
* `lenof` and `sizeof` can now be applied to a literal string, which can be
useful if you want to do calculations on cursor values.
This change actually simplifies the language a little, since you can now use
a literal string in any read-only context which accepts a string variable.
Code generation improvements
----------------------------
* General:
+ Fix bugs in the code generated to handle failure of `goto`, `gopast` or
`try` inside `setlimit` or string-`$`. This affected all languages (though
the issue with `try` wasn't present for C). These bugs don't affect any of
the stemming algorithms we currently ship. Reported by Stefan Petkovic on
snowball-discuss.
+ Change `hop` with a negative argument to work as documented. The manual
says a negative argument to hop will raise signal f, but the implementation
for all languages was actually to move the cursor in the opposite direction
to `hop` with a positive argument. The implemented behaviour is
problematic as it allows invalidating implicitly saved cursor values by
modifying the string outside the current region, so we've decided it's best
to fix the implementation to match the documentation.
The only Snowball code we're aware of which relies on this was the original
version of the new Yiddish stemming algorithm, which has been updated not
to rely on this.
The compiler now issues a warning for `hop` with a constant negative
argument (internally now converted to `false`), and for `hop` with a
constant zero argument (internally now converted to `true`).
+ Canonicalise `among` actions equivalent to `()` such as `(true)` which
previously resulted in an extra case in the among, and for Python
we'd generate invalid Python code (`if` or `elif` with an empty body).
Bug revealed by Assaf Urieli's Yiddish stemmer in #137.
+ Eliminate variables whose values are never used - they no longer have
corresponding member variables, etc, and no code is generated for any
assignments to them.
+ Don't generate anything for an unused `grouping`.
+ Stop warning "grouping X defined but not used" for a `grouping` which is
only used to define another `grouping`.
* C/C++:
+ Store booleans in same array as integers. This means each boolean is
stored as an int instead of an unsigned char which means 4 bytes instead of
1, but we save a pointer (4 or 8 bytes) in struct SN_env which is a win for
all the current stemmers. For an algorithm which uses both integers and
booleans, we also save the overhead of allocating a block on the heap, and
potentially improve data locality.
+ Eliminate duplicate generated C comment for sliceto.
* Pascal:
+ Avoid generating unused variables. The Pascal code generated for the
stemmers we ship is now warning free (tested with fpc 3.2.0).
* Python:
+ End `if`-chain with `else` where possible, avoiding a redundant test
of the variable being switched on. This optimisation kicks in for an
`among` where all cases have commands. This change seems to speed up `make
check_python_arabic` by a few percent.
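The documented `hop` behaviour described under "General" above can be
sketched in Python as follows. This models forward mode only, with the cursor
and limit as character positions; the function and its use of `None` to
stand for signal f are assumptions for illustration, not generated code.

```python
def hop(cursor: int, limit: int, n: int):
    """Return the new cursor position, or None to represent signal f."""
    if n < 0 or cursor + n > limit:
        return None  # negative argument or not enough characters left
    return cursor + n
```

With this model `hop(2, 5, -1)` gives `None` rather than moving the cursor
back to 1, and `hop(2, 5, 0)` always succeeds and leaves the cursor alone.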
New stemming algorithms
-----------------------
* Add Serbian stemmer from stef4np (#113).
* Add Yiddish stemmer from Assaf Urieli (#137).
* Add Armenian stemmer from Astghik Mkrtchyan. It's been on the website for
over a decade, and included in Xapian for over 9 years without any negative
feedback.
Optimisations to existing algorithms
------------------------------------
* kraaij_pohlmann: Use `$v = limit` instead of `do (tolimit setmark v)` since
this generates simpler code, and also matches the code other algorithm
implementations use.
Probably for languages like C with optimising compilers the compiler
will generate equivalent code anyway, but e.g. for Python this should be
an improvement.
Code clarity improvements to existing algorithms
------------------------------------------------
* hindi.sbl: Fix comment typo.
Compiler
--------
* Don't count `$x = x + 1` as initialising or using `x`, so it's now handled
like `$x += 1` already is.
* Comments are now only included in the generated code if command line option
-comments is specified.
The comments in the generated code are useful if you're trying to debug the
compiler, and perhaps also if you are trying to debug your Snowball code, but
for everyone else they just bloat the code which as the number of languages
we support grows becomes more of an issue.
* `-parentclassname` is not only for java and csharp so don't disable it if
those backends are disabled.
* `-syntax` now reports the value for each numeric literal.
* Report location for excessive get nesting error.
* Internally the compiler now represents negated literal numbers as a simple
`c_number` rather than `c_neg` applied to a `c_number` with a positive value.
This simplifies optimisations that want to check for a constant numeric
expression.
Build system
------------
* Link binaries with LDFLAGS if it's set, which is needed for some platforms
(e.g. OpenEmbedded). Patch from Andreas Müller (#120).
* Add missing dependencies of algorithms.go rule.
Testsuite
---------
* C: Add stemtest for low-level regression tests.
Documentation
-------------
* Document a C99 compiler as a requirement for building the snowball compiler
(but the C code it generates should still work with any ISO C compiler).
A few declarations mixed with code crept in some time ago (which nobody's
complained about), so this is really just formally documenting a requirement
which already existed.
* README: Explain what Snowball is and what Stemming is (#131, reported by Sean
Kelly).
* CONTRIBUTING.rst: Expand section on adding a new generator.
* For Python snowballstemmer module include global NEWS instead of
Python-specific CHANGES.rst and use README.rst as the long description.
Patch from Dmitry Shachnev (#119).
* COPYING: Update and incorporate Python backend licensing information which
was previously in a separate file.
Snowball 2.0.0 (2019-10-02)
===========================
C/C++
-----
* Fully handle 4-byte UTF-8 sequences. Previously `hop` and `next` handled
sequences of any length, but commands which look at the character value only
handled sequences up to length 3. Fixes #89.
* Fix handling of a 3-byte UTF-8 sequence in a grouping in `backwardmode`.
Java
----
* TestApp.java:
- Always use UTF-8 for I/O. Patch from David Corbett (#80).
- Allow reading input from stdin.
- Remove rather pointless "stem n times" feature.
- Only lower case ASCII to match stemwords.c.
- Stem empty lines too to match stemwords.c.
Code Quality Improvements
-------------------------
* Fix various warnings from newer compilers.
* Improve use of `const`.
* Share common functions between compiler backends rather than having multiple
copies of the same code.
* Assorted code clean-up.
* Initialise line_labelled member of struct generator to 0. Previously we were
invoking undefined behaviour, though in practice it'll be zero initialised on
most platforms.
New Code Generators
-------------------
* Add Python generator (#24). Originally written by Yoshiki Shibukawa, with
additional updates by Dmitry Shachnev.
* Add JavaScript generator. Based on JSX generator (#26) written by Yoshiki
Shibukawa.
* Add Rust generator from Jakob Demler (#51).
* Add Go generator from Marty Schoch (#57).
* Add C# generator. Based on patch from Cesar Souza (#16, #17).
* Add Pascal generator. Based on Delphi backend from stemming.zip file on old
website (#75).
New Snowball Language Features
------------------------------
* Add `len` and `lenof` to measure Unicode length. These are similar to `size`
and `sizeof` (respectively), but `size` and `sizeof` return the length in
bytes under `-utf8`, whereas these new commands give the same result whether
using `-utf8`, `-widechars` or neither (but under `-utf8` they are O(n) in
the length of the string). For compatibility with existing code which might
use these as variable or function names, they stop being treated as tokens if
declared to be a variable or function.
* New `{U+1234}` stringdef notation for Unicode codepoints.
* More versatile integer tests. Now you can compare any two arithmetic
expressions with a relational operator in parentheses after the `$`, so for
example `$(len > 3)` can now be used when previously a temporary variable was
required: `$tmp = len $tmp > 3`
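The difference between the byte length and the Unicode length under `-utf8`
can be sketched in C as follows. `utf8_length` is a hypothetical helper written
for this example, not the code the compiler generates; it assumes well-formed
UTF-8 and shows why the count is O(n) in the length of the string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the Unicode code points in n bytes of
 * well-formed UTF-8 by counting every byte which is not a continuation
 * byte (continuation bytes have the bit pattern 10xxxxxx). */
size_t utf8_length(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if ((s[i] & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For the six bytes `6E 61 C3 AF 76 65` (the word "naïve"), the byte length is 6
but this function returns 5, because the two bytes `C3 AF` encode a single
code point.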
Code generation improvements
----------------------------
* General:
+ Avoid unnecessarily saving and restoring the cursor for more commands -
`atlimit`, `do`, `set` and `unset` all leave the cursor alone or always
restore its value. The same now applies to `booltest` for C (which other
languages already handled).
+ Special case handling for `setlimit tomark AE`. All uses of setlimit in
the current stemmers we ship follow this pattern, and by special-casing we
can avoid having to save and restore the cursor (#74).
+ Merge duplicate actions in the same `among`. This reduces the size of the
switch/if-chain in the generated code which dispatches the among for many of
the stemmers.
+ Generate simpler code for `among`. We always check for a zero return value
when we call the among, so there's no point also checking for that in the
switch/if-chain. We can also avoid the switch/if-chain entirely when
there's only one possible outcome (besides the zero return).
+ Optimise code generated for `do <function call>`. This speeds up "make
check_python" by about 2%, and should speed up other interpreted languages
too (#110).
+ Generate more and better comments referencing snowball source.
+ Add homepage URL and compiler version as comments in generated files.
* C/C++:
+ Fix `size` and `sizeof` to not report one too high (reported by Assem
Chelli in #32).
+ If signal `f` from a function call would lead to a return from the current
function, then handle both this and bailing out on an error with a single
`if (ret <= 0) return ret;`
+ Inline testing for single character literals.
+ Avoid generating `|| 0` in a corner case - this can result in a compiler
warning when building the generated code.
+ Implement `insert_v()` in terms of `insert_s()`.
+ Add conditional `extern "C"` so `runtime/api.h` can be included from C++
code. Closes #90, reported by vvarma.
* Java:
+ Fix functions in `among` to work in Java. We seem to need to make the
methods called from among `public` instead of `private`, and to call them
on `this` instead of the `methodObject` (which is cleaner anyway). No
revision in version control seems to generate working code for this case,
but Richard says it definitely used to work - possibly older JVMs failed to
correctly enforce the access controls when methods were invoked by
reflection.
+ Code after handling `f` by returning from the current function is
unreachable too.
+ Previously we incorrectly decided that code after an `or` was
unreachable in certain cases. None of the current stemmers in the
distribution triggered this, but Martin Porter's snowball version
of the Schinke Latin stemmer does. Fixes #58, reported by Alexander
Myltsev.
+ The reachability logic was failing to consider reachability from
the final command in an `or`. Fixes #82, reported by David Corbett.
+ Fix `maxint` and `minint`. Patch from David Corbett in #31.
+ Fix `$` on strings. The previous generated code was just wrong. This
doesn't affect any of the included algorithms, but for example breaks
Martin Porter's snowball implementation of Schinke's Latin Stemmer.
Issue noted by Jakob Demler while working on the Rust backend in #51,
and reported for Schinke's Latin Stemmer by Alexander Myltsev
in #58.
+ Make SnowballProgram objects serializable. Patch from Oleg Smirnov in #43.
+ Eliminate range-check implementation for groupings. This was removed from
the C generator 10 years earlier, isn't used for any of the existing
algorithms, and it doesn't seem likely it would be - the grouping would
have to consist entirely of a contiguous block of Unicode code-points.
+ Simplify code generated for `repeat` and `atleast`.
+ Eliminate unused return values and variables from runtime functions.
+ Only import the `among` and `SnowballProgram` classes if they're actually
used.
+ Only generate `copy_from()` method if it's used.
+ Merge runtime functions `eq_s` and `eq_v`.
+ Java arrays know their own length so stop storing it separately.
+ Escape char 127 (DEL) in generated Java code. It's unlikely that this
character would actually be used in a real stemmer, so this was more of a
theoretical bug.
+ Drop unused import of InvocationTargetException from SnowballStemmer.
Reported by GerritDeMeulder in #72.
+ Fix lint check issues in generated Java code. The stemmer classes are only
referenced in the example app via reflection, so add
@SuppressWarnings("unused") for them. The stemmer classes override
equals() and hashCode() methods from the standard Java Object class, so
mark these with @Override. Both suggested by GerritDeMeulder in #72.
+ Declare Java variables at point of use in generated code. Putting all
declarations at the top of the function was adding unnecessary complexity
to the Java generator code for no benefit.
+ Improve formatting of generated code.
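The C/C++ entry above about handling signal `f` and errors together with
`if (ret <= 0) return ret;` can be sketched as follows. This is a simplified
illustration, not actual generated output: `r_step` and both caller functions
are hypothetical, and the sketch assumes the convention implied by that entry,
namely that a called routine returns a positive value for signal `t`, zero for
signal `f` and a negative value for an error.

```c
#include <assert.h>

/* Hypothetical stand-in for a generated routine: negative on error,
 * 0 for signal f, 1 for signal t. */
int r_step(int input)
{
    if (input < 0) return -1;
    return input > 0;
}

/* Two separate checks: one for an error, one for signal f. */
int r_caller_separate(int input)
{
    int ret = r_step(input);
    if (ret < 0) return ret;   /* bail out on error */
    if (ret == 0) return 0;    /* signal f ends this function too */
    return 1;
}

/* One combined check covering both cases. */
int r_caller_combined(int input)
{
    int ret = r_step(input);
    if (ret <= 0) return ret;
    assert(ret == 1);          /* only signal t reaches this point */
    return 1;
}
```

Both callers return the same value for every input; the combined form just
needs one comparison and one branch where the other needs two.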
New stemming algorithms
-----------------------
* Add Tamil stemmer from Damodharan Rajalingam (#2, #3).
* Add Arabic stemmer from Assem Chelli (#32, #50).
* Add Irish stemmer from Jim O'Regan (#48).
* Add Nepali stemmer from Arthur Zakirov (#70).
* Add Indonesian stemmer from Olly Betts (#71).
* Add Hindi stemmer from Olly Betts (#73). Thanks to David Corbett for review.
* Add Lithuanian stemmer from Dainius Jocas (#22, #76).
* Add Greek stemmer from Oleg Smirnov (#44).