French language support #2

rrouviere · 2022-09-17T16:37:06Z

Hi!
I'm trying my hand at adding support for the French language.
I think that I have added everything indicated in the README.

Copied Mycroft's resources
Adapted tokenizer.json
- word_matches
  - Still unsure whether I have all of the needed values.

Could you tell me what the next steps would be?
I am familiar with Java and I'm willing to help, but I'd appreciate it if you could point me in the right direction (for example, would I need to create a FrenchFormatter and hardcode values into it?).

Thanks :)

Stypox

Thank you! The JSON files shouldn't be created in the bin folder, though. Also, you have committed a whole lot of .class files, I can clean them myself if you wish ;-)
When the PR files are cleaned up I will review it more in depth

Copied files from Mycroft + Some work on tokenizer

rrouviere · 2022-11-06T21:13:38Z

Hi!
My bad for the bin/, I thought that the .gitignore would take care of it, didn't think to double check x).
Thanks for having a look at it.

The part that I think will need the most work is the FrenchFormatter, as I didn't really modify it except to replace some (not all) strings, as I didn't want it to diverge too much from the English version.
Is there a "cleaner" way that avoids logic duplication? Or is it really language-specific to the point where it's OK to just copy and edit the EnglishFormatter?

Thanks :)

BrightDV · 2022-12-19T10:46:27Z

Any update on this? Because the translations are good.

Stypox · 2022-12-20T15:25:56Z

Sorry, I will soon take care of this, I've been really busy lately.

Stypox

Thank you! And sorry for getting to this so late.

Does French have both long-scale and short-scale ways of pronouncing big numbers? English has both, but Italian, for example, does not. So if French is more similar to Italian, you may want to copy some structure from ItalianFormatter.

The code at the moment does not compile because the code uses subThousand and appendSplitGroups, but those functions have not been copied over from EnglishFormatter. Was this done by accident or do you wish to implement them in a different way?

Also, I noticed you have already translated tokenizer.json, the file containing word bindings to make parsing easier. I think it's better to implement formatting first and, once that works, focus on parsing. So there is no need for you to work on that atm.

If anything I suggested is too Java-code-y to do, feel free to tell me :-)

Stypox · 2022-12-30T10:03:25Z

numbers/src/main/resources/config/fr-fr/tokenizer.json

+        "thousand_separator"
+      ],
+      "values": [
+        " ",


A space can't be a word, since spaces are word separators. If you want to say that thousands can be separated by spaces (i.e. nothing), you need to do so in Java code.
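To illustrate what handling this "in Java code" could look like, here is a minimal sketch that accepts digit groups separated by a space and joins them into one number. The class and method names (`SpaceGroupedNumbers`, `parseGrouped`) are hypothetical and are not part of dicio-numbers; a real implementation would work on the library's token stream rather than on a raw string.

```java
import java.util.OptionalLong;
import java.util.regex.Pattern;

// Hypothetical helper, not dicio-numbers API: joins "1 000 000" into 1000000.
final class SpaceGroupedNumbers {
    // One to three leading digits, then one or more groups of exactly three
    // digits, each preceded by a space, a no-break space (U+00A0) or a narrow
    // no-break space (U+202F).
    private static final Pattern GROUPED =
            Pattern.compile("\\d{1,3}(?:[ \\u00A0\\u202F]\\d{3})+");

    static OptionalLong parseGrouped(String text) {
        if (!GROUPED.matcher(text).matches()) {
            // Not a space-grouped number (this includes plain "1000").
            return OptionalLong.empty();
        }
        String digits = text.replaceAll("[ \\u00A0\\u202F]", "");
        // Up to 18 digits always fit in a long; longer inputs are rejected.
        if (digits.length() > 18) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(Long.parseLong(digits));
    }
}
```

With this sketch, `parseGrouped("1 000 000")` yields 1000000, while `parseGrouped("1 00")` yields an empty result because the trailing group does not have exactly three digits.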

Stypox · 2022-12-30T10:05:49Z

numbers/src/test/java/org/dicio/numbers/lang/fr/DateTimeTest.java

+    /*  Please note that there is two way of saying years and centuries before 2000. For exemple:
+        1. mille (thousand) neuf (nine) cent (hundred) quatre-vingt (90) quatre (4)
+        2. dix-neuf (nineteen) cent (hundred) quatre-vingt (90) quatre (4). (Slightly old-fashioned but common for years before 1900)
+    */


It's perfectly ok for date time formatters to just return one possibility :-)

Ok, we'll go with the first one then, since as @MXC48 mentioned it is valid both in speech and in writing. :)
(Except if @MXC48 thinks it would be a better idea to pronounce years before 1900 the second way?)
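For reference, the two styles discussed above can be sketched as follows. This is only an illustration under stated assumptions: the class and method names (`FrenchYearSketch`, `thousandStyle`, `hundredStyle`) are hypothetical, the code is not taken from this PR or from dicio-numbers, it only covers the years 1100 to 1999, and it never appends a plural -s to "cent" or "quatre-vingt".

```java
// Hypothetical sketch, not the PR's FrenchFormatter: spells out a year
// between 1100 and 1999 in the two styles mentioned in the test comment.
final class FrenchYearSketch {
    // Index 0 is unused: a zero remainder is handled in join().
    private static final String[] UNDER_TWENTY = {
        "", "un", "deux", "trois", "quatre", "cinq", "six", "sept", "huit",
        "neuf", "dix", "onze", "douze", "treize", "quatorze", "quinze",
        "seize", "dix-sept", "dix-huit", "dix-neuf"
    };
    private static final String[] TENS = {
        "", "", "vingt", "trente", "quarante", "cinquante", "soixante"
    };

    // Spells out 1..99.
    static String subHundred(int n) {
        if (n < 20) return UNDER_TWENTY[n];
        // 80..99 are built on "quatre-vingt" plus 1..19.
        if (n >= 80) return n == 80 ? "quatre-vingt" : "quatre-vingt-" + UNDER_TWENTY[n - 80];
        // 70..79 are built on "soixante" plus 10..19.
        if (n >= 70) return n == 71 ? "soixante et onze" : "soixante-" + UNDER_TWENTY[n - 60];
        int unit = n % 10;
        String tens = TENS[n / 10];
        if (unit == 0) return tens;
        return unit == 1 ? tens + " et un" : tens + "-" + UNDER_TWENTY[unit];
    }

    private static String join(String head, int rest) {
        return rest == 0 ? head : head + " " + subHundred(rest);
    }

    private static void checkRange(int year) {
        if (year < 1100 || year > 1999) {
            throw new IllegalArgumentException("sketch only covers 1100..1999");
        }
    }

    // Style 1: "mille neuf cent ..."
    static String thousandStyle(int year) {
        checkRange(year);
        int hundreds = (year / 100) % 10; // 1..9 within the supported range
        String head = hundreds == 1 ? "mille cent" : "mille " + UNDER_TWENTY[hundreds] + " cent";
        return join(head, year % 100);
    }

    // Style 2: "dix-neuf cent ..."
    static String hundredStyle(int year) {
        checkRange(year);
        return join(UNDER_TWENTY[year / 100] + " cent", year % 100);
    }
}
```

For 1994, `thousandStyle` gives "mille neuf cent quatre-vingt-quatorze" and `hundredStyle` gives "dix-neuf cent quatre-vingt-quatorze". A formatter that returns a single possibility would only need the first method.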

Stypox · 2022-12-30T10:13:41Z