Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract Numerator and Denominator Constants to Byte Arrays #3922

Merged
merged 110 commits into from
Oct 16, 2023

Conversation

younies
Copy link
Member

@younies younies commented Aug 23, 2023

closes: #4100

Changes:

  • Extracted the numerator and denominator constants.
  • Represented each constant as an array of bytes, starting with the least significant byte (LSB).
  • Implement and algorithm to detect any loop in the CLDR constants.

@younies younies requested review from sffc, robertbastian, Manishearth and a team as code owners August 23, 2023 15:48
@younies younies marked this pull request as draft August 23, 2023 15:48
@younies
Copy link
Member Author

younies commented Aug 23, 2023

@robertbastian , @Manishearth and @sffc , am I in the right direction?

@sffc
Copy link
Member

sffc commented Aug 23, 2023

What is the end goal? Do we want to convert the expressions to rational numbers? We should express them in the form most useful for runtime calculations.

@younies
Copy link
Member Author

younies commented Aug 23, 2023

What is the end goal? Do we want to convert the expressions to rational numbers? We should express them in the form most useful for runtime calculations.

As a first step, I want to convert all the constants value to u32/u32 or u64/u64.

After that, I will use the same helpers in order to convert the conversion rates too to the form of u32/u32 + index.

I want to make all the calculations in the fractional forms (numerator / denomerator)

@sffc
Copy link
Member

sffc commented Aug 24, 2023

For rational number calculations, should we consider https://github.com/rust-num/num-rational ?

provider/datagen/src/transform/cldr/units/helpers.rs Outdated Show resolved Hide resolved
provider/datagen/src/transform/cldr/units/helpers.rs Outdated Show resolved Hide resolved
provider/datagen/src/transform/cldr/units/helpers.rs Outdated Show resolved Hide resolved
provider/datagen/src/transform/cldr/units/helpers.rs Outdated Show resolved Hide resolved
provider/datagen/src/transform/cldr/units/helpers.rs Outdated Show resolved Hide resolved
provider/datagen/src/transform/cldr/units/helpers.rs Outdated Show resolved Hide resolved
provider/datagen/src/transform/cldr/units/helpers.rs Outdated Show resolved Hide resolved
provider/datagen/src/transform/cldr/units/helpers.rs Outdated Show resolved Hide resolved
provider/datagen/src/transform/cldr/units/helpers.rs Outdated Show resolved Hide resolved
provider/datagen/src/transform/cldr/units/helpers.rs Outdated Show resolved Hide resolved
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you've noted, unfortunately not everything fits into u64/u64. I think what we should do here is only store the basic constants and not the square/cube version of them, and then in units that need the square or cube version, we just include the basic constant multiple times. For example:

        <convertUnit source='bushel' baseUnit='cubic-meter' factor='2150.42*in3_to_m3' systems="ussystem"/>

When we get to the point of encoding that factor, we should encode it as 2150.42 * in_to_m * in_to_m * in_to_m. Then we don't ever need to know that in3_to_m3 ever existed.

@younies younies marked this pull request as ready for review September 15, 2023 22:14
@younies younies changed the title Add helper functions in extracting the conversion values. Extracting the values from the constants as Numerator and Denominator. Oct 10, 2023
@younies younies changed the title Extracting the values from the constants as Numerator and Denominator. Extract Numerator and Denominator Constants to Byte Arrays Oct 11, 2023
@younies younies requested a review from robertbastian October 11, 2023 14:35
@@ -142,13 +142,18 @@ pub const EXTRA_DATAGEN_DEPS: &[&str] = &[
"databake-derive",
"elsa",
"erased-serde",
"fraction",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this conversation has been going back and forth for a little while, but I do share some concerns with @robertbastian about the fraction crate. We already pull in the num-rational crate which appears to support the same sort of abstraction: pub type BigRational = Ratio<BigInt>;. I don't understand why we need the extra crate. Besides this, fraction has less than 6% of the lifetime downloads of num-rational.

This isn't blocking because this is only datagen but it will definitely be a concern when we put this into the library crate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've already explained to Robert the reasons for using the GenericFraction change:

  • Sign Separation: Unlike BigRational, which includes the sign in the numerator, GenericFraction allows for a clear separation between the numerator/denominator and the sign.
  • Byte Sequence Extraction: We need to convert BigInt to BigUint to extract the byte sequences for both the numerator and denominator.
  • Decimal String Creation: GenericFraction supports initialization from decimal strings.

Given these reasons, GenericFraction is more appropriate for use in the datagen. However, for calculations during the conversion process, BigRational might be the better choice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. All of these seem to me to be relatively easily solved with some code. I don't think it is the right choice to depend on a large crate with a bunch of unsafe for this.

I'm comfortable adding the num deps, since those are an ecosystem standard.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the current code; it doesn't look like we rely on Sign Separation, i'm not sure where the byte sequence extraction is relevant, and decimal string creation we only use to parse a BigUint. We can do that directly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmmm .. in this case, I can use BigRational

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I am still not 100% convinced

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm already making the switch. It's turning out to be quite straightforward.

provider/datagen/src/transform/cldr/units/mod.rs Outdated Show resolved Hide resolved
This is likely due to a circular dependency in the constants. \
Note: If the depth was increased, you may need to increase the maximum depth in the code.",
));
count = if !updated { count + 1 } else { 0 };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: what is this counting?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This counts how many times there were no updates. If the number of no updates is greater than the length of the queue, this means there is a loop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the else branch be count then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, if there was 1 update, we start counting from zero. I have changed the name to be clearer: "no_update_count."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying

provider/datagen/src/transform/cldr/units/mod.rs Outdated Show resolved Hide resolved
@younies younies requested a review from robertbastian October 16, 2023 12:09
younies and others added 4 commits October 16, 2023 14:27
Co-authored-by: Robert Bastian <4706271+robertbastian@users.noreply.github.com>
@younies younies merged commit f689152 into unicode-org:main Oct 16, 2023
27 checks passed
@younies younies deleted the units-conversion branch June 27, 2024 20:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve Efficiency of Constants Replacement Algorithm in provider/datagen/src/transform/cldr/units/mod.rs
4 participants