Make the Humanise function better #2294

Open
parduz opened this issue Dec 19, 2021 · 13 comments
Labels
localization, needs thought (for when you need to slow down and consider things)

Comments

@parduz

parduz commented Dec 19, 2021

I think that a couple of changes to the Humanise function would help non-English users a lot:

  1. "Return rounded numbers instead of strings": I mean that instead of "More than 2 millions and a half" the function should return "More than 2000000 and a half" and let the voice manage the pronunciation. This would solve a lot of "speaking errors", at least with my Italian voice.
  2. "Return an array": like [0] = "More than", [1] = "2000000", [2] = "and a half". This would allow easier localization, without the need to split the function output again if the order has to be altered.

The first one seems an easy task (unless I'm missing how other voices work); I don't know about the second.

@Tkael
Member

Tkael commented Dec 20, 2021

  1. I'm afraid that this doesn't work well in some languages. In English we would say "More than 2 and a half million" rather than "More than 2 millions and a half". This would be rendered improperly in English if we were to change the result to "More than 2000000 and a half".

  2. While an interesting idea, this would be a very disruptive change and has the same problem as your first suggestion.

We've designed the translation to allow the translator some flexibility in how the phrase is constructed, e.g. "circa {0} milione e mezzo", where {0} is the number 2 in your example. I recognize that this is still imperfect for the Italian case (where you might need to use either "milione" or "milioni"). We'll have to think about whether we can improve this area further.

One possibility is to offer translators an opportunity to assign a different translation where the leading number is '1' (e.g. "circa {0} milione e mezzo" where {0} equals 1 and "circa {0} milioni e mezzo" where {0} is not equal to 1)? While redundant for many languages, this might provide the extra degree of control that Italian and other similar languages might need. @richardbuckle Your thoughts?
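
A minimal sketch of that idea, written in Python purely for illustration (EDDI itself is not written in Python). The `TEMPLATES` table and the `pick_template` helper are hypothetical names, and nothing here reflects how EDDI or Crowdin actually stores translations.

```python
# Hypothetical per-language templates: one used when the leading number
# is 1, one used for every other leading number.
TEMPLATES = {
    "it": {
        "one": "circa {0} milione e mezzo",
        "other": "circa {0} milioni e mezzo",
    },
    # For English both entries are identical, which is the redundancy
    # mentioned above.
    "en": {
        "one": "about {0} and a half million",
        "other": "about {0} and a half million",
    },
}


def pick_template(language: str, leading_number: int) -> str:
    """Return the phrase for the given language and leading number."""
    forms = TEMPLATES[language]
    key = "one" if leading_number == 1 else "other"
    return forms[key].format(leading_number)


print(pick_template("it", 1))
print(pick_template("it", 2))
```

A two-way split like this would cover the "milione"/"milioni" case, but not languages with more than two plural forms.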

@Tkael Tkael added the localization and needs thought labels Dec 20, 2021
@parduz
Author

parduz commented Dec 20, 2021

  1. Hmm... blame me. I think I had the same idea about a year ago, talked about it in the forum, got the same answer, and forgot about it.

While redundant for many languages, this might provide the extra degree of control that Italian and other similar languages might need.

Well, it could be a step... with a way to "declare" the plurals in crowdin (something like milione|milioni or milion{e|i} ?) it could work.
But Italian still has other issues: the voice says "1 milione" as "uno milione", which should be "un milione"; "1 mila" (1000) is wrong, as we just say "mille"; and there are other difficulties too.
Also, sometimes the humanization should take into account whether there are units involved (Credits, Persons, Tons, Joules, whatever), because "milioni" wants "di" (millions of) but "mila" doesn't :-
And that's just Italian. I don't know anything about other languages, but I guess that Humanise will never be smart enough.

I've built an "Italianise" script and replaced each Humanise call with it (with the drawback of having to add a {set xxxx to yyyy } for the function parameter): that script takes the output of Humanise and corrects it.
This gave me another idea (which I may already have mentioned somewhere, but I can't recall):

What if Humanise (and perhaps some other functions like "P"?) fired a "callback" script before returning its result?
Humanise could first prepare some EDDI_translation_variable_ values (such as the passed parameters, the array with the various phrase pieces, the integers resulting from the "humanization", and the function output), then fire the script, which could do whatever the user wants to alter the "proposed" output string (or do nothing, being empty by default), and finally return whatever is in that output string.

So, to recap this a bit: my new idea is

  • Humanise should have an optional string parameter which is what the passed number is about.
  • Humanise should fire a "callback" script before returning, to allow the user to alter the function result.

@Tkael
Member

Tkael commented Dec 21, 2021

Humanise should have an optional string parameter which is what the passed number is about.

I'm not sure that I understand yet how we would need to do this. Please elaborate on what you'd enter and how we would need to handle it?

Humanise should fire a "callback" script before returning, to allow the user to alter the function result.

Once again, I'm a little fuzzy on the details of what you are proposing. Are you saying that Humanise() would work a little like an event and trigger another script from the Speech Responder?

@richardbuckle
Member

with a way to "declare" the plurals in crowdin (something like milione|milioni or milion{e|i} ?) it could work.

Oh you sweet summer child of a language where there is only one plural 😀

As one who speaks both Italian and Russian, let me introduce you to Slavic plurals, where the inflection depends upon the last word (not the last digit) of the number, e.g. in Russian (in the Latin alphabet):

  • one => nominative singular:
    • 1 kg => odin kilogramm,
    • 101 kg => sto odin kilogramm
  • two, three, four => genitive singular:
    • 2 kg => dva kilogramma,
    • 34 kg => tridtsat' chetyre kilogramma
  • anything else => genitive plural:
    • 5 kg => pyat' kilogrammov,
    • 12 kg => dvenadtsat' kilogrammov

Oh, and the cardinal numbers are themselves nouns and must be declined. The word for 'about' is 'okolo' and takes genitive case, so for example 'dva' becomes 'dvukh': 'about two kilograms' is 'okolo dvukh kilogrammov'.

Amazingly, Microsoft's default Russian TTS voice gets all the above right given just the left-hand side, so in the Russian translation the approach is to push as much work as possible to the TTS voice.

I bring this up not to dismiss the idea but to illustrate how incredibly hard it is to generalise.

I would certainly agree that Humanise() already has a lot of anglo-centric assumptions embedded in the very idea that just the number is sufficient as a parameter, but I am wary of going down the rabbit hole of trying to make it suit everyone's needs and failing anyhow.
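
To make the three-way rule above concrete, here is a minimal Python sketch. The function and table names are hypothetical, and it only picks the noun form after a cardinal number; it does not decline the number itself, so it would not handle cases such as 'okolo dvukh kilogrammov'.

```python
def russian_noun_form(n: int) -> str:
    """Return which noun form follows the cardinal number n."""
    last_two = n % 100
    last_one = n % 10
    # 11 to 14 are single words (e.g. "dvenadtsat'"), so the number does
    # not end in the word for one, two, three or four.
    if 11 <= last_two <= 14:
        return "genitive plural"
    if last_one == 1:
        return "nominative singular"
    if 2 <= last_one <= 4:
        return "genitive singular"
    return "genitive plural"


# Transliterated forms of "kilogram", taken from the examples above.
KILOGRAM_FORMS = {
    "nominative singular": "kilogramm",
    "genitive singular": "kilogramma",
    "genitive plural": "kilogrammov",
}


def kilograms(n: int) -> str:
    """Format n with the matching form of 'kilogram'."""
    return f"{n} {KILOGRAM_FORMS[russian_noun_form(n)]}"


for n in (1, 101, 2, 34, 5, 12):
    print(kilograms(n))
```

The check on the last two digits is what separates 12 from 2: looking at the last digit alone would give the wrong form.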

@parduz
Author

parduz commented Dec 22, 2021

Humanise should fire a "callback" script before returning, to allow the user to alter the function result. Once again, I'm a little fuzzy on the details of what you are proposing. Are you saying that Humanise() would work a little like an event and trigger another script from the Speech Responder?

EXACTLY!

Humanise should have an optional string parameter which is what the passed number is about. I'm not sure that I understand yet how we would need to do this. Please elaborate on what you'd enter and how we would need to handle it?

Let me try with an example of what I envision:

You sold it for {Humanise(1534752,"Credits")}.

Humanise does its math and calls the "ReviewHumaniseOutput" script, which could access variables like:
EDDI_Humanise_Parts[0] = about
EDDI_Humanise_Parts[1] = 1 million
EDDI_Humanise_Parts[2] = and a half
EDDI_Humanise_Parts[3] = Credits
EDDI_Humanise_Param[0] = 1534752
EDDI_Humanise_Param[1] = Credits
EDDI_Humanise_RoundValue = 1
EDDI_Humanise_Magnitude = 1000000
EDDI_Humanise_Output = about 1 million and a half

The user does what they want and changes the EDDI_Humanise_Output variable; the other variables give information about "what should be said".
When the script ends, Humanise returns whatever is in the output string.

This seems to me the least "invasive", a pretty useful, and the most compatible solution.
I may not see what other languages need, but this would certainly allow me to have a nice "Italianise" with minimal effort.
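
A rough Python sketch of the hook described in this comment, for illustration only. `HumaniseResult`, `humanise_with_hook` and `append_unit` are hypothetical names, not EDDI APIs, and the sample values are copied from the variable list above. In this sketch the hook returns a new string instead of overwriting a shared output variable.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class HumaniseResult:
    parts: List[str]       # phrase pieces, in the default order
    raw_value: int         # the number originally passed in
    unit: Optional[str]    # what the number counts, if given
    round_value: int       # leading number after rounding
    magnitude: int         # e.g. 1000000 for millions
    output: str            # the proposed default phrase


ReviewHook = Callable[[HumaniseResult], str]


def humanise_with_hook(result: HumaniseResult,
                       review: Optional[ReviewHook] = None) -> str:
    """Return the default phrase, or whatever the review hook returns."""
    if review is None:
        return result.output
    return review(result)


def append_unit(result: HumaniseResult) -> str:
    """Example hook: keep the default phrase and append the unit."""
    if result.unit:
        return f"{result.output} {result.unit}"
    return result.output


sample = HumaniseResult(
    parts=["about", "1 million", "and a half", "Credits"],
    raw_value=1534752,
    unit="Credits",
    round_value=1,
    magnitude=1000000,
    output="about 1 million and a half",
)
print(humanise_with_hook(sample))
print(humanise_with_hook(sample, append_unit))
```

With no hook supplied the default phrase passes through unchanged, which matches the "empty by default" behaviour suggested above.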

@parduz
Author

parduz commented Dec 22, 2021

This is my current "Italianise" script.
It's still young, so consider it a "beta", but it may explain what I need better than my poor English does:

{_ 1000 _}

{set RegexStr to "(.+ )*([0-9]+\,[0-9]+|[0-9]+)( *(mila|.+?lione|.+?liardo)) *(e mezzo)*"}
{set theNumber to PassedNumber }
{set theUnit   to PassedUnit   }

{set Humanized to Humanise(theNumber)}

{set Italianized to match( Humanized, RegexStr )}

{if len(Italianized)=0 :
	{if find(Humanized,"000.000") > -1:
		{set Beginning     to Humanized }
		{set Quantity      to ""        }
		{set Magnitude     to ""        }
		{set AndAHalf      to ""        }
		{set BeforetheUnit to " di"     }
	|else:
		{_ dump match(Humanized, RegexStr)}
		{_ what else to do? return Humanized}
		{set Beginning     to Humanized }
		{set Quantity      to ""        }
		{set Magnitude     to ""        }
		{set AndAHalf      to ""        }
		{set BeforetheUnit to ""        }
	}
|else:
	{_ dump match(Humanized, RegexStr)}
	{_ Found }
	{set Beginning     to Italianized[1] }
	{set Quantity      to Italianized[2] }
	{set Magnitude     to Italianized[4] }
	{set AndAHalf      to Italianized[5] }
	{set BeforetheUnit to ""             }

	{if Quantity = "1" :
		{_ manage singular pronunciation _}
		{if Magnitude = "mila" :
			{if AndAHalf = "e mezzo" :
				{set Quantity to cat(Quantity,"500") }
			|else:
				{set Quantity to "mille"}
			}
			{set Magnitude to ""}
			{set AndAHalf to ""}
			{set BeforetheUnit to ""}
		|else:
			{set Quantity to " un"}
			{set BeforetheUnit to " di"}
		}
	|else:
		{_ manage plural _}
		{if Magnitude = "mila" :
			{if AndAHalf = "e mezzo" :
				{set Quantity to cat(Quantity,"500") }
				{set Magnitude to ""}
				{set AndAHalf to ""}
				{set BeforetheUnit to ""}
			}
		|else:
			{set Magnitude to slice(Magnitude,0,len(Magnitude)-1) }
			{set Magnitude to cat(Magnitude,"i") }
			{set BeforetheUnit to " di"}
		}
	}
}
{Beginning}{Quantity} {Magnitude} {AndAHalf}{if theUnit: {BeforetheUnit} {theUnit}}.

The whole regex part extracts what I would like the new Humanise to have already set before firing the "callback" script.
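
For readers who do not know Cottle, the branching in the script above can be sketched in Python as follows. This is an illustrative paraphrase, not a replacement for the script: the `italianise` function and its parameters are hypothetical, it assumes the quantity, magnitude word and "e mezzo" flag are already known (so no regex is needed), and it leaves out the leading phrase such as "circa".

```python
def italianise(quantity: int, magnitude: str, and_a_half: bool,
               unit: str = "") -> str:
    """Build the Italian phrase for a humanised number.

    magnitude is the singular word: "mila", "milione" or "miliardo".
    """
    before_unit = ""
    if magnitude == "mila":
        if and_a_half:
            # "1 mila e mezzo" is passed to the voice as "1500"
            number = f"{quantity}500"
        elif quantity == 1:
            number = "mille"          # never "1 mila"
        else:
            number = f"{quantity} mila"
    else:
        if quantity == 1:
            number = f"un {magnitude}"             # "un milione"
        else:
            number = f"{quantity} {magnitude[:-1]}i"  # "milioni", "miliardi"
        if and_a_half:
            number += " e mezzo"
        before_unit = "di"            # "milioni di ...", but not "mila di ..."
    pieces = [number, before_unit, unit] if unit else [number]
    return " ".join(piece for piece in pieces if piece)


print(italianise(1, "milione", False, "crediti"))
print(italianise(2, "milione", True, "crediti"))
print(italianise(1, "mila", False))
```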

HTH :)

@Tkael
Member

Tkael commented Dec 26, 2021

Hmm. Variables in Cottle are immutable, meaning that it would not be possible for the user to set {event.EDDI_Humanise_Output }. We'd have to use SetState() to set a variable and EDDI would need to know to read a specific value from the SetState dictionary.

In terms of complexity, you may be better off sticking with your Italianise script and calculating your values from the original number.

Here's an example of how you could calculate some of the critical values for Italianise from the raw value:

{set originalNumber to 54741887}

{set value to originalNumber}
{set magnitude to 0}
{while value >= 10:
    {set magnitude to magnitude + 1}
    {set value to value / 10}
}
Magnitude: {magnitude},

{set orderMultiplier to round(pow(10, floor(magnitude / 3) * 3))}
Order Multiplier: {orderMultiplier},

{set firstNumber to floor(value)}
First Number: {firstNumber},

{set secondNumber to floor((value - firstNumber) * 10)}
Second Number: {secondNumber},

{set thirdNumber to floor((value - firstNumber - (secondNumber / 10)) * 100)}
Third Number: {thirdNumber}.

Humanized: {Humanise(54741887)}

From these calculated numbers, we know:

  • The magnitude is 7 (so in the tens of millions range, we might want to use 2 significant figures)
  • The order multiplier is 1000000 (so our unit will be millions)
  • The first number in the value is 5
  • The second number in the value is 4
  • The third number in the value is 7 (more than halfway to the next significant digit)

With 2 significant figures in the millions order and our third digit more than halfway to the next significant figure, we get a humanized value of "Over 54 and a half million".
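
The same quantities can be computed in Python, shown here as an illustrative sketch. The `decompose` name is hypothetical, and it reads the digits from the integer's string form instead of dividing by 10 repeatedly, which avoids floating-point rounding surprises. It does not reproduce Humanise's own rounding rules.

```python
def decompose(original_number: int) -> dict:
    """Split a non-negative integer into the values used for humanising."""
    if original_number < 0:
        raise ValueError("expected a non-negative integer")
    digits = str(original_number)
    # Number of times the value can be divided by 10 before it drops below 10
    magnitude = len(digits) - 1
    # Nearest lower power of 1000: 1, 1000, 1000000, ...
    order_multiplier = 10 ** (magnitude // 3 * 3)
    # Pad short numbers so that three digits are always available
    padded = digits.ljust(3, "0")
    return {
        "magnitude": magnitude,
        "order_multiplier": order_multiplier,
        "first_number": int(padded[0]),
        "second_number": int(padded[1]),
        "third_number": int(padded[2]),
    }


print(decompose(54741887))
```

For 54741887 this gives magnitude 7, order multiplier 1000000 and leading digits 5, 4 and 7, matching the values listed above.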

Hope that helps.

@Tkael
Member

Tkael commented Dec 26, 2021

Hmm... after going through the exercise above I think we might also be able to treat Humanise() as a special case of {F("Humanise")}, where we automatically set helpful values calculated from the original number and the translator does the rest using a Humanise script.

It would be another major re-write / disruption for translators but should be possible. Much of the work that has gone into humanizing values via CrowdIn strings would become obsolete.

@richardbuckle your thoughts?

@richardbuckle
Member

I think it would be important to get feedback from the other translation teams before embarking on such a radical overhaul. There are bound to be further language-specific issues that we are unaware of.

@Tkael
Member

Tkael commented Dec 30, 2021

I've sent a message to our proofreaders on CrowdIn to request additional feedback before we implement any changes.

@yucatan

yucatan commented Dec 30, 2021

I have to say that it's not too hard to make the adjustments in the scripts to get the proper pronunciation in Portuguese. But I am not against such changes.

@Transcan
Contributor

Transcan commented Jan 2, 2022

I'm Spanish and in my case I had to write my own "Humaniza" function.
Spanish has plurals and gender, and so does the spelling of numbers.

For example:
21 can be spelled as:
veintiún - male, singular
veintiuno - also male and singular but used in some cases
veintiunos - male, plural
veintiuna - female, singular
veintiunas - female, plural.

And there is no regular rule. I mean, it is difficult to code without accounting for all the exceptions, and even harder if you have to adapt the code to other languages.

Also, the main issue I encountered is that some voices don't read the numbers as they should (the gender doesn't match, for example). So I coded my own way to spell the numbers inside the new Humaniza function. It converts the numbers to words and uses a flag for gender. This way, the voice reads them as I want.

The tricky part is the invocation, because you can't pass parameters to a script directly.
Things like {F('Humaniza', 12345, female)} don't work...

I placed this at the beginning of each script that needs it:

{_ humaniza() function _}
{set humaniza(n, g) to:
	{SetState("humaniza", n)}
	{SetState("humaniza_femenino", g=true)}
	{F("Humaniza")}
	{return state.humaniza_resultado}
}

And invoke it just like a normal function:
has comprado {humaniza(item.amount, true)} toneladas de {item.name}.

Three state variables are used:
humaniza is the number
humaniza_femenino is a boolean for gender
humaniza_resultado is the result as a text string that the script humaniza sets.

"1500000" going through my script return "un millón y medio" while humanise() returns "1000000 y medio".

So my final words are that making the internal Humanise() function some kind of "function editable by the user via script" is a nice idea.
This way each language can make or adapt its own script, or use the default if that is enough.
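
The wrapper pattern described in this comment (passing arguments to a parameterless script through shared state, then reading the result back) can be sketched in Python as follows. Everything here is hypothetical and for illustration only: the `state` dictionary stands in for EDDI's state variables, `humaniza_script` stands in for the "Humaniza" script, and the tiny spelling table covers just two numbers.

```python
# Stand-in for the shared state that scripts can read and write.
state = {}

# Toy spelling table: number -> (masculine, feminine). A real script
# would have to cover every number and its exceptions.
SPELLINGS = {
    1: ("un", "una"),
    21: ("veintiún", "veintiuna"),
}


def humaniza_script() -> None:
    """Parameterless 'script': reads its inputs from state, writes the result."""
    number = state["humaniza"]
    feminine = state["humaniza_femenino"]
    if number in SPELLINGS:
        masculine_form, feminine_form = SPELLINGS[number]
        result = feminine_form if feminine else masculine_form
    else:
        result = str(number)  # fall back to digits
    state["humaniza_resultado"] = result


def humaniza(number: int, feminine: bool = False) -> str:
    """Wrapper that makes the script callable like a normal function."""
    state["humaniza"] = number
    state["humaniza_femenino"] = feminine
    humaniza_script()
    return state["humaniza_resultado"]


print(humaniza(21, True), "toneladas")
```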

@Tkael
Member

Tkael commented Jan 10, 2022

Thank you @yucatan and @Transcan for your feedback. I'll keep thinking about this. Also happy to hear from any other translators who haven't weighed in yet!

5 participants