Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

French language support #2

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

rrouviere
Copy link

Hi!
I'm trying my hand at adding support for the french language.
I think that I have added everything indicated in the README.

  • Copied mycroft's resources
  • Adapted tokenizer.json
    • word_matches
      • Still unsure whether I have all of the needed values.

Could you tell me what would be the next steps ?
I am familiar with java and I'll be willing to help, but I'd appreciate if you could point me in the right direction (for exemple, would I need to create a FrenchFormater and hardcode values into it?).

Thanks :)

Copy link
Owner

@Stypox Stypox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! The JSON files shouldn't be created in the bin folder, though. Also, you have committed a whole lot of .class files, I can clean them myself if you wish ;-)
When the PR files are cleaned up I will review it more in depth

Copied files from Mycroft + Some work on tokenizer
@rrouviere
Copy link
Author

rrouviere commented Nov 6, 2022

Hi!
My bad for the bin/, I thought that the .gitignore would take care of it, didn't think to double check x).
Thanks for having a look at it.

The part that I think will need the most work is the FrenchFormatter, as I didn't really modify it except to replace some (not all) strings, as I didn't want it to diverge too much from the English version.
Is there a "cleaner" way that avoid logic duplication? Or it is really language specific to the point where it's OK to just copy and edit the EnglishFormatter?

Thanks :)

@rrouviere rrouviere requested a review from Stypox November 6, 2022 21:14
@BrightDV
Copy link

Any update on this? Because the translations are good.

@Stypox
Copy link
Owner

Stypox commented Dec 20, 2022

Sorry, I will soon take care of this, I've been really busy lately.

Copy link
Owner

@Stypox Stypox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! And sorry for getting to this so late.

Does French have both long-scale and short-scale ways of pronouncing big numbers? English has both, but for example Italian does not. So if French is more similar to Italian maybe you may want to copy some structure from ItalianFormatter.

The code at the moment does not compile because the code uses subThousand and appendSplitGroups, but those functions have not been copied over from EnglishFormatter. Was this done by accident or do you wish to implement them in a different way?

Also, I noticed you have already translated tokenizer.json, the file containing word binding to make parsing easier. I think it's better to first implement formatting and after that works focus on parsing, though. So there is no need for you to work on that atm.

If anything I suggested is too Java-code-y to do, feel free to tell me :-)

"thousand_separator"
],
"values": [
" ",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A space can't be a word, since spaces are word separators. If you want to say that thousands can be separated by spaces (i.e. nothing), you need to do so in Java code.

Comment on lines 33 to 36
/* Please note that there is two way of saying years and centuries before 2000. For exemple:
1. mille (thousand) neuf (nine) cent (hundred) quatre-vingt (90) quatre (4)
2. dix-neuf (nineteen) cent (hundred) quatre-vingt (90) quatre (4). (Slightly old-fashioned but common for years before 1900)
*/
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's perfectly ok for date time formatters to just return one possibility :-)

Copy link
Author

@rrouviere rrouviere Dec 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we'll go with the first one then since as @MXC48 mentioned it is valid both in speech as well as in writing. :)
(Except if @MXC48 thinks it would be a better idea to pronounce years before 1900 the second way?)

if (z < 1000) {
groupName = subThousand(z, i == 0 && ordi);
} else {
groupName = subThousand(z / 1000, false) + " thousand";
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are untranslated: "thousand", "thousandth", "th", "zero", "point" and maybe others

import org.dicio.numbers.util.MixedFraction;

public class FrenchFormatter extends NumberFormatter {

Copy link
Owner

@Stypox Stypox Dec 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should also translate the following arrays from EnglishFormatter into French, otherwise the code below won't work: NUMBER_NAMES, NUMBER_NAMES_SHORT_SCALE, NUMBER_NAMES_LONG_SCALE, ORDINAL_NAMES, ORDINAL_NAMES_SHORT_SCALE, ORDINAL_NAMES_LONG_SCALE.

Copy link
Author

@rrouviere rrouviere Dec 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will work on that. Just one question though: what is the difference between NUMBER_NAMES in the Formatter and their copy in tokenizer.json? Should we try and pull their value from there?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tokenizer contains the code that allows parsing numbers. And each number may have different words. But when we do formatting, it works the other way around, and so there should be one unique mapping from number to word. The two configurations can definitely be merged in some way, but for the moment I decided to keep it simple.

Comment on lines 231 to 234
public String niceTime(LocalTime time, boolean speech, boolean use24Hour, boolean showAmPm) {
// TODO Auto-generated method stub
return null;
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be taken care of separately?

@MXC48-zz
Copy link

Thank you! And sorry for getting to this so late.

Does French have both long-scale and short-scale ways of pronouncing big numbers? English has both, but for example Italian does not. So if French is more similar to Italian maybe you may want to copy some structure from ItalianFormatter.

So French has two ways in speaking but only one in writing
We are going to write 1850 "mille huit cents cinquante". Orally we are going to say "mille huit cents cinquante" or "dix-huit-cents-cinquante"

@Stypox
Copy link
Owner

Stypox commented Dec 30, 2022

I was not clear enough, what I meant is this: https://en.wikipedia.org/wiki/Long_and_short_scales

@MXC48-zz
Copy link

I was not clear enough, what I meant is this: https://en.wikipedia.org/wiki/Long_and_short_scales

So in this case we have the same system as in Italian

@rrouviere
Copy link
Author

Hi!
Thank your for taking a look at it.
As you've see this is mainly a way to avoid duplicating effort more than a ready to merge PR as of now.
From what I recall I've deleted code & fields that seemed too incompatible with french to adapt instead of rewrite.
I'll work on the code for a bit, try to fix a few things and will update you soon :-)

@rrouviere
Copy link
Author

rrouviere commented Dec 30, 2022

@MXC48 If you want to talk/split work you can contact me on discord :)

@Stypox
Copy link
Owner

Stypox commented Dec 30, 2022

Ok, thank you! Since French works like Italian and does not have both long scale and short scale numbers, you will not need many parts related to the shortScale parameter. Also, you will not need NUMBER_NAMES_SHORT_SCALE, NUMBER_NAMES_LONG_SCALE, ORDINAL_NAMES_SHORT_SCALE, ORDINAL_NAMES_LONG_SCALE: you can just put everything in, respectively, NUMBER_NAMES and ORDINAL_NAMES.

@rrouviere
Copy link
Author

Great! Just to be sure: should we duplicate numbers's name both in tokenizer.json as well as hardcoded into FrenchFormatter?
Thanks again for your time.

@Stypox
Copy link
Owner

Stypox commented Dec 30, 2022

Yeah, just answered now. For now numbers should be duplicated. Thanks to you! :-)

@rrouviere
Copy link
Author

I've fixed some of the problems and added a few fixmes here and there. I still have to add NUMBER_NAMES and ORDINAL_NAMES, maybe if you want to take care of it @MXC48 ?
That it for me today though :)

assertEquals("quatre-vingt-dix", pf.pronounceNumber(89.9).places(0).get());
assertEquals("moins un", pf.pronounceNumber(-0.5).places(0).get());
assertEquals("zéro", pf.pronounceNumber(-0.4).places(0).get());
assertEquals("six virgue trois", pf.pronounceNumber(6.28).places(1).get());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just the omission of the l in "virgule"

assertEquals("deux-cent-un-millionième", pf.pronounceNumber(201000000000.0).ordinal(T).shortScale(F).get());
//TODO: Check this for french correctness as well as short/long scale issues (billion/billard)
assertEquals("neuf-cent-treize-milliard-quatre-vingt-million-six-cent-mille-soixante-mille-cent-soixante-quatrième", pf.pronounceNumber(913080600064.0).ordinal(T).shortScale(T).get());
assertEquals("neuf-cent-treize-mille-quatre-vingt-million-six-cent-mille-soixante-mille-soixante-quatrième", pf.pronounceNumber(913080600064.0).ordinal(T).shortScale(F).get());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure you can repeat several times in a row "mille"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be "neuf-cent-treize-milliard-quatre-vingt-million-six-cent-mille-soixante-quatrième"? "mille million" isn't very common is it? (Though saying 913080600064th itself isn't very common -_-')

assertEquals("neuf-cent-treize-mille-quatre-vingt-million-six-cent-mille-soixante-mille-soixante-quatrième", pf.pronounceNumber(913080600064.0).ordinal(T).shortScale(F).get());
assertEquals("trilliard-deux-millionième", pf.pronounceNumber(1000002000000.0).ordinal(T).shortScale(T).get());
assertEquals("millard-deux-millionième", pf.pronounceNumber(1000002000000.0).ordinal(T).shortScale(F).get());
assertEquals("quatre-triliard-un-millionième", pf.pronounceNumber(4000001000000.0).ordinal(T).shortScale(T).get());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is the omission of an l in trilliard

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's nowhere near a trilliad? that's just "quatre billion un millionième"?

assertEquals("infini", pf.pronounceNumber(Double.POSITIVE_INFINITY).get());
assertEquals("moins l'infini", pf.pronounceNumber(Double.NEGATIVE_INFINITY).scientific(F).get());
assertEquals("moins l''infini", pf.pronounceNumber(Double.NEGATIVE_INFINITY).scientific(T).get());
assertEquals("Non défini", pf.pronounceNumber(Double.NaN).get());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"non défini" or the litteral translation "pas un nombre" ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Non défini", as in 10 divided by 0.

@MXC48
Copy link

MXC48 commented Feb 9, 2023

What should be done to complete the addition in the main code?
(I lost my access codes to the old account so I come back with this new one)

@mdouchin
Copy link

mdouchin commented Jul 24, 2023

also interested in testing: gentle "up"

@Stypox Stypox force-pushed the master branch 7 times, most recently from 49bc02d to b6a68df Compare May 27, 2024 13:33
@Stypox Stypox force-pushed the master branch 2 times, most recently from bda9e31 to d54de33 Compare May 27, 2024 13:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants