
Set CPT ValueWidth::DATA_GET_ERROR_VALUE to 0 #1183

Closed
echeran opened this issue Oct 18, 2021 · 31 comments · Fixed by #1804
Assignees
Labels
C-unicode Component: Props, sets, tries S-tiny Size: Less than an hour (trivial fixes) T-bug Type: Bad behavior, security, privacy

Comments

@echeran
Contributor

echeran commented Oct 18, 2021

Currently, we set the associated value ValueWidth::DATA_GET_ERROR_VALUE to be the max value of the impl types (u8::MAX, u16::MAX, u32::MAX).

A better value for this error value would be 0, which matches the Unknown enumerated value for most enumerated properties.

This is a minor improvement since, after all, if the CodePointTrie code and data (from the data generator) are all correct, this error value will never be returned for valid code points. (If the input is an invalid code point, then this error value will be returned, but garbage-in-garbage-out.)
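A rough sketch of the proposed change (the trait and constant names follow the issue title, but the shape here is illustrative, not ICU4X's actual code):

```rust
// Hypothetical sketch: each impl type carries a last-resort error value.
// The proposal is to use 0 (matching the "Unknown" enumerated property
// value) instead of the type's MAX.
trait ValueWidth: Copy {
    const DATA_GET_ERROR_VALUE: Self;
}

impl ValueWidth for u8 {
    const DATA_GET_ERROR_VALUE: u8 = 0; // was u8::MAX
}
impl ValueWidth for u16 {
    const DATA_GET_ERROR_VALUE: u16 = 0; // was u16::MAX
}
impl ValueWidth for u32 {
    const DATA_GET_ERROR_VALUE: u32 = 0; // was u32::MAX
}

fn main() {
    assert_eq!(<u8 as ValueWidth>::DATA_GET_ERROR_VALUE, 0);
    assert_eq!(<u16 as ValueWidth>::DATA_GET_ERROR_VALUE, 0);
    assert_eq!(<u32 as ValueWidth>::DATA_GET_ERROR_VALUE, 0);
}
```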

@echeran echeran added T-bug Type: Bad behavior, security, privacy C-unicode Component: Props, sets, tries S-tiny Size: Less than an hour (trivial fixes) labels Oct 18, 2021
@sffc sffc mentioned this issue Oct 20, 2021
@sffc
Member

sffc commented Oct 20, 2021

Also resolve this comment:

TrieValue requires ULE, so we already have to be able to convert from &[u8] to T. Is there some way to reuse that code, instead of requiring a second parsing function?

Originally posted by @iainireland in #1167 (comment)

@markusicu
Member

If the input is an invalid code point, then this error value will be returned, but garbage-in-garbage-out.

If the input is an invalid code point, then the CodePointTrie index-computation code should return the index of where the error value is stored in the data array, so the hardcoded value is still not used.

Only if the data is totally messed up and you get index-out-of-bounds errors is there a way to get to this sort of hardcoded value, and then it might as well be 0 just because that's simplest.

@markusicu
Member

Also, there is no point in hardcoding an override of this value in each value-type-specific version of a trie (as seen here: #1233 (comment)). That would be misleading if it differs from the error value that is stored inside the trie. If anything, the value-type-specific version could be fetched from the trie data during initialization. And when that fails because the data is totally broken, then you might as well throw or pass through an error. Or else convert numeric zero into the value type.

@sffc
Member

sffc commented Nov 2, 2021

This is truly a "last resort" fallback which really should be an error, because it's only returned to recover from an array out-of-bounds access. However, it's so unusual (only as a result of malformed data) that it's not really correct to make the whole CPT fallible (returning a Result) because of it. I see this as a garbage-in-garbage-out scenario.

We need it as an associated type because Rust allows us to store "upgraded" or "exotic" types in the CPT, like Script or GeneralSubcategory, which are integers under the hood but behave as their higher-level types when you access them from the CPT. We can't simply return 0 because we need to return an instance of Script or GeneralSubcategory.
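To illustrate the point about "exotic" value types, here is a minimal sketch (the `Script` stand-in and the `TrieValue` shape are assumptions for illustration, not ICU4X's real definitions): the trie must hand back an instance of the wrapper type, so the error value has to exist as a typed constant, not a bare 0.

```rust
// Hypothetical stand-in for a higher-level trie value type like Script,
// which wraps an integer under the hood.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Script(u16);

trait TrieValue: Copy {
    // The trie needs a typed error value, not a bare integer 0.
    const DATA_GET_ERROR_VALUE: Self;
    fn try_from_u32(i: u32) -> Option<Self>;
}

impl TrieValue for Script {
    // 0 corresponds to Script::Unknown in the property data.
    const DATA_GET_ERROR_VALUE: Script = Script(0);
    fn try_from_u32(i: u32) -> Option<Self> {
        u16::try_from(i).ok().map(Script)
    }
}

fn main() {
    assert_eq!(Script::DATA_GET_ERROR_VALUE, Script(0));
    assert_eq!(Script::try_from_u32(17), Some(Script(17)));
    // Values outside u16 range can't be converted: bogus data is caught.
    assert_eq!(Script::try_from_u32(1 << 20), None);
}
```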

@sffc
Member

sffc commented Nov 2, 2021

Zibi gave me an idea on Slack, which is that if we really don't want to define the default error value for these types, a workaround would be to add invariants to the CodePointTrie itself, such that it's impossible to have a CodePointTrie where the error value results in an out-of-bounds access.

Note that right now CodePointTrie is zero-validation; we just point directly to the byte buffer. Adding validation may complicate or slow down certain operations.

@sffc
Member

sffc commented Nov 2, 2021

Here's another idea. Could we change the CPT data model such that the error value is stored as a field in the header rather than an index in the array? Then we can use that as our fallback.
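A minimal sketch of that idea (field and type names are illustrative; the real header has more fields):

```rust
// Sketch of the header-field idea: store the error value directly in the
// header so an out-of-bounds lookup can fall back to it without needing
// a hardcoded per-type constant or a reserved slot in the data array.
struct CodePointTrieHeader {
    error_value: u32, // proposed new field; real header has more fields
}

struct CodePointTrie<'a> {
    header: CodePointTrieHeader,
    data: &'a [u32],
}

impl<'a> CodePointTrie<'a> {
    fn get(&self, index: usize) -> u32 {
        // Fall back to the header's error value on out-of-bounds.
        self.data.get(index).copied().unwrap_or(self.header.error_value)
    }
}

fn main() {
    let trie = CodePointTrie {
        header: CodePointTrieHeader { error_value: 0 },
        data: &[10, 20, 30],
    };
    assert_eq!(trie.get(1), 20);
    assert_eq!(trie.get(99), 0); // out of bounds: header error value
}
```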

@iainireland
Contributor

Isn't it sufficient to validate data when we read it in from toml? If we don't serialize bad data, then we shouldn't deserialize bad data.

@hsivonen
Member

hsivonen commented Nov 2, 2021

Isn't it sufficient to validate data when we read it in from toml? If we don't serialize bad data, then we shouldn't deserialize bad data.

I agree. I think cryptography—not complicating data structures or run-time lookup code—is the right solution for users of ICU4X who are concerned about deserializing data that didn't come from ICU4X's own data blob building process. If the data blob building process had a bug, a panic is reasonable at the time of use of the data.

That is, I think we should treat data built by the ICU4X build process as an internal part of ICU4X for error handling purposes. In cases where the data actually travels externally instead of being baked into the data segment of the executable file holding ICU4X's code, I think cryptography is the right way to ensure integrity.

As a practical matter, I think it's not worthwhile to write a collator with the assumption that the collation data (where collation elements carry indices into data) might be bogus, and validating the collation data upon loading would be complicated and expensive compared to checking a signature when the data has traveled externally.

@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Nov 2, 2021
@markusicu
Member

Isn't it sufficient to validate data when we read it in from toml? If we don't serialize bad data, then we shouldn't deserialize bad data.

I agree. I think cryptography—not complicating data structures or run-time lookup code—is the right solution for users of ICU4X who are concerned about deserializing data that didn't come from ICU4X's own data blob building process. If the data blob building process had a bug, a panic is reasonable at the time of use of the data.

+1

Between UTrie2 and the newer CodePointTrie, I moved the error value from the header struct into the data array in order to make the code simpler and more uniform. The error handling is inside the compute-data-index functions, and thus independent of the value width. The caller of the index computation just takes the index and fetches a value from the data array.

If you want a little bit of validation, there are two or three special values at the end of the data array, and actually ASCII is stored linearly. So you could bail out if the data array is shorter than 128, and that also guarantees that you can fetch values from near the end of the array.

If the data array is long enough but its contents are bogus, then those integer values might be out of bounds for converting to Script etc., but again you could bail out if you can't convert the error value fetched from there.
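A minimal sketch of the check described above (the exact slot of the error value is illustrative; the real trie stores a few special values near the end of the data array):

```rust
// Sketch: ASCII is stored linearly, so a plausible data array has at
// least 128 entries; that also guarantees the special values near the
// end of the array (including the error value) can be fetched safely.
fn validate_data_array(data: &[u32]) -> Result<u32, &'static str> {
    if data.len() < 128 {
        return Err("data array too short for linear ASCII");
    }
    // Fetch the stored error value (offset chosen here for illustration).
    Ok(data[data.len() - 1])
}

fn main() {
    let good = vec![0u32; 200];
    assert_eq!(validate_data_array(&good), Ok(0));
    let bad = vec![0u32; 10];
    assert!(validate_data_array(&bad).is_err());
}
```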

I also wonder about the ROI of wrapping the code point trie such that the caller automagically gets a Script or General_Category enum object. It's nice, but it seems to require extra complication and boilerplate. Consider just returning integer values?

@sffc
Member

sffc commented Nov 2, 2021

Viewing data as untrusted is a principle we've had since the early days of ICU4X. It is largely based on the principle that data can be overridden and customized by the user. That's a feature that is commonly requested (e.g. tc39/ecma402#210), but which ICU does not support very well, and is therefore something that is central to the design of the ICU4X data provider.

I'm happy to discuss alternative mechanisms to ensure validity of data, such as cryptographic signing of data packs, which could be in the context of a "fast path" that bypasses certain validation safeguards. This should be the subject of a different issue.

I also wonder about the ROI of wrapping the code point trie such that the caller automagically gets a Script or General_Category enum object. It's nice, but it seems to require extra complication and boilerplate. Consider just returning integer values?

The advantage is that we don't need to re-validate that the return value is valid. If we return an integer, it is unsafe in Rust, for good reason, to cast it to the corresponding enum.

@iainireland
Contributor

We can let users bring their own data while still pushing validation into the build process. Users provide toml, which is untrusted. The slow toml-backed data provider validates the data and produces a valid codepoint trie. The resulting codepoint trie can now be trusted, even if we serialize and then deserialize it.

I guess technically it's possible to create an invalid serialized CPT by hand, but I don't see much value in defending ourselves against that. We can provide optional validation to be used in the highly unusual case that somebody wants to use untrusted data at runtime, without imposing a startup penalty on normal use cases where the data is inside the trust boundary.

@sffc
Member

sffc commented Nov 2, 2021

From discussion on Slack: it would be valuable to have a mode in which we skip the validation step for ICU4X data if we know that the data is from a trusted source. @zbraniecki may file a follow-up issue to discuss that. The status quo, though, is that Serde and DataProvider assume data is untrusted, and therefore validation is required. In order to make constructive progress on the original topic of this thread (the fallback error value for ValueWidth in CodePointTrie), we should discuss it within the current framework of how ICU4X data works.

@sffc
Member

sffc commented Nov 2, 2021

Here are what I see as the potentially viable options for the original problem in this thread:

  1. Status quo: associated type on ValueWidth that provides our last-resort error value.
  2. Make CodePointTrie::get fallible: make it return a Result
  3. Make the CodePointTrie constructor fallible and add validation to ensure that the error value is present in the data array
  4. Move the error value to CodePointTrieHeader and keep the system validation-free

I think option 4 satisfies everyone's requirements. It makes the validation question moot, because CPT would remain validation-free, as it is today.

@markusicu
Member

If the validation includes checking that the data array length is at least 128, then you can access the error value integer. I would fetch that at load/init time, and convert it to the API type like Script. If the validation fails, or if the conversion to the API type fails, then throw an error/exception.

You can guarantee that accesses into the data array never go out of bounds by either explicitly checking for bounds in index computation, or by validating during initialization that none of the indexes go out of bounds for next-stage index or data blocks. There is a complication with the unusual-but-possible 18-bit data block offsets. Worst case, you would have to inspect roughly 17k index entries, but you could inspect shared index blocks only once and then skip.

@markusicu
Member

Paths crossed. Option 3 sounds good. You want the option of rejecting bad data, even if you don't have an elaborate test for that yet.

@zbraniecki
Member

Can you elaborate on the benefit of option 3 vs option 2?

@markusicu
Member

Option 3 slows down initialization a bit but then you have no further overhead. You will want some validation; it is quick to check that it looks like a CodePointTrie and that the data array is large enough for linear ASCII. Then you can fetch the error value. So I would make the constructor return a Result, and make it at least fetch & convert the error value -- or else fail.

Full validation of all indexes takes a bit more time, of course.

Option 2 lets you do less validation up front but adds checks into the runtime code.

Option 1 = status quo handles the out-of-bounds case by returning a hardcoded error value, right? That's not orthogonal to option 3. I propose that the constructor fetch and convert the error value, rather than hardcoding it; and because that initialization step can fail, the constructor should return a Result. You then don't need get() to return a Result, too.
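The option-3 shape described above can be sketched roughly like this (the `ScriptTrie` type, the error-value slot, and the validation threshold are illustrative assumptions, not the real ICU4X implementation): `try_new` validates and converts the stored error value once, so `get()` never needs to return a Result.

```rust
// Sketch: a fallible constructor that fetches and converts the stored
// error value up front, making lookups infallible afterward.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Script(u16);

struct ScriptTrie<'a> {
    data: &'a [u32],
    error_value: Script, // fetched and converted at construction
}

impl<'a> ScriptTrie<'a> {
    fn try_new(data: &'a [u32]) -> Result<Self, &'static str> {
        // Quick plausibility check: linear ASCII needs >= 128 entries.
        if data.len() < 128 {
            return Err("not a plausible CodePointTrie data array");
        }
        let raw = data[data.len() - 1]; // illustrative error-value slot
        let error_value = u16::try_from(raw)
            .map(Script)
            .map_err(|_| "stored error value out of range")?;
        Ok(ScriptTrie { data, error_value })
    }

    fn get(&self, index: usize) -> Script {
        // Infallible: fall back to the pre-converted error value.
        self.data
            .get(index)
            .and_then(|&v| u16::try_from(v).ok())
            .map(Script)
            .unwrap_or(self.error_value)
    }
}

fn main() {
    let data = vec![0u32; 128];
    let trie = ScriptTrie::try_new(&data).unwrap();
    assert_eq!(trie.get(5), Script(0));
    assert_eq!(trie.get(10_000), Script(0)); // out of bounds: error value
    assert!(ScriptTrie::try_new(&[1, 2, 3]).is_err());
}
```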

@sffc
Member

sffc commented Nov 2, 2021

Question: is the CPT algorithm guaranteed to terminate given untrusted data to operate on? I've been assuming the answer is yes, but I've never asked it explicitly.

@markusicu
Member

Question: is the CPT algorithm guaranteed to terminate given untrusted data to operate on?

It will terminate. There are no loops and no search. It's a constant-time lookup within each code point range; higher code point ranges use more indirection, but the number of steps is still fixed.

What can go wrong with bogus data is that you do array-index-out-of-bounds.

@hsivonen
Member

hsivonen commented Nov 3, 2021

  1. Status quo: associated type on ValueWidth that provides our last-resort error value.

Is this the option that can panic in this case: "What can go wrong with bogus data is that you do array-index-out-of-bounds."? If so, I think we should go with this option.

Specifically: If someone wants to customize the data, they should customize the input to the data pipeline and run the data pipeline on their customization. E.g. for collation, it's completely unreasonable to manually tweak the output of the data pipeline, since it's so likely that the result would be malformed. For collation, someone wishing to customize should create a CLDR-like input, run in through the ICU4C builder, dump to TOML, and convert the TOML to a binary.

  2. Make CodePointTrie::get fallible: make it return a Result

I think this is bad: It contaminates the implementation code with Result propagation for an error that isn't run-time actionable and whose nature is that it signals internal brokenness.

Consider the collator: The way a collator is used is that you plug it into some larger sorting operation. For the use case, you want the collator to return an Ordering and nothing else, which it can do for any pair of input string slices. There is no way to err due to the input data (since Rust &str guarantees the absence of UTF errors and UTF errors in other input flavors are treated as if they were U+FFFD).

Propagating a Result about the internal data being bogus makes everything less ergonomic to use when there already exists a way to signal internal bugs that are unactionable at run time: panics.

  3. Make the CodePointTrie constructor fallible and add validation to ensure that the error value is present in the data array

Even if this approach were feasible for CodePointTrie itself, it wouldn't generalize. I expect validating collation data like this to be expensive enough that, given the choice between validating the data like this, checking a signature (something small like ed25519 without any ASN.1 layers), or taking the risk of panics on data corruption, real-world users will either check a signature or take the risk of panics (where the latter might actually mean trusting https to work). There's very little upside to deploying structural validation instead of a signature check for ensuring the integrity of externally-traveled data: the only upside is not having to deal with cryptographic keys.

  4. Move the error value to CodePointTrieHeader and keep the system validation-free

This would make ICU4X compute bogus results if externally-traveled data actually has been corrupted. In terms of surprise and what it takes to understand what's going wrong, this seems worse than any of the other options. However, in terms of usage ergonomics, I'd rather settle on this option than option 2 if we can't agree on panicking on malformed data at the time of making lookups from the data.

@sffc
Member

sffc commented Nov 11, 2021

Follow-up for handling cases where we know data is trusted: #1290

@iainireland
Contributor

iainireland commented Dec 2, 2021

Picking this conversation back up, because it's relevant in #1349:

I agree with Henri's summary. In particular, I think this point (wrt option 4) is important:

This would make ICU4X compute bogus results if externally-traveled data actually has been corrupted. In terms of surprise and what it takes to understand what's going wrong, this seems worse than any of the other options.

In cases where corrupted data is a live issue, there's a real cost to pushing off its detection until later.

I'd like to propose a fifth option:

  5. Panic on bad data where appropriate. Validate data in the data transformer. Make the same validation function available at runtime for anybody who is worried about deserializing corrupted data dynamically, without imposing any costs for users who can trust their data.

Shane has made a good case that the slow path needs to exist. I am not convinced, though, that it should be the mandatory default. If somebody is dynamically deserializing untrusted data, I think it's reasonable to ask that user to jump through one additional hoop (when the alternative is adding hoops for other, less esoteric use cases). Note that this also allows other forms of validation, like Henri's suggestion of checking a signature.

@sffc
Member

sffc commented Dec 2, 2021

esoteric

I see untrusted, user-provided data as a core part of the value proposition of this project. This might be a factor contributing to our difference in perspectives on this issue.

Make the same validation function available at runtime for anybody who is worried about deserializing corrupted data dynamically

I don't see how this would work. The transformer transforms from some arbitrary source data to the ICU4X data struct. That's a different type of validation than what the Deserialize impl needs to do.

Here's another way to look at this. The Serde Deserialize impl is a constructor. The job of a constructor is to make sure that a well-formed object is returned, satisfying all invariants of private fields.

I'm open to exploring options for allowing different code paths in Serde for trusted and untrusted data. See #1290. I think that is the correct way to solve this problem.

@iainireland
Contributor

I see untrusted, user-provided data as a core part of the value proposition of this project.

I buy the case for decoupling data from code. In that sense, I agree that user-provided data (that is, data that users have built for themselves at some point in the future, using the tools that we provide) is valuable. I don't understand the case for data from any non-ICU source. Who's bringing their own case mappings?

(I can imagine an argument along the lines of "we want to have one bundle of data on a system for many programs to share, and we don't want control of that bundle to be a vector for exploitation". That's a plausible reason to ensure that we don't do anything unsafe depending on deserialized data, but it's not an argument for avoiding panics caused by bad data.)

I don't see how this would work. The transformer transforms from some arbitrary source data to the ICU4X data struct. That's a different type of validation than what the Deserialize impl needs to do.

I already implemented it for CaseMappingExceptions in #1349. The ICU4X data struct has internal invariants: for example, for any CaseMappingData where data.has_exception() is true, data.exception_index() must be a valid index into the exception data that points to a header word. (If it isn't, we might panic: for example, we might read out of bounds.) In CaseMappingExceptions::validate(), we walk through the exception data and collect the set of valid indices. Then we look at the trie data and verify that there aren't any entries that point to invalid indices. If there aren't, then (after repeating a similar exercise for the other internal invariants) we can be confident that we won't panic.

In a similar way, it would be relatively easy to write a function for CodePointTrie that walks through the index/data arrays and verifies that we'll never have an out-of-bounds read. Post-validation, we can use expect() and be confident that there won't be any panics. (We even have a handy place to put a call to CodePointTrie::validate() in try_new().)
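The kind of one-shot invariant check described here can be sketched as follows (function name and array shapes are illustrative, not ICU4X's real code): walk the index array once at validation time, confirming every entry stays in bounds for the data array, so later lookups can assume in-bounds access.

```rust
// Sketch: a single validation pass over the index array. If this passes,
// every later lookup that goes through these indices cannot read out of
// bounds, so hot-path code can use expect() with confidence.
fn verify_indices_in_bounds(index: &[u16], data_len: usize) -> bool {
    index.iter().all(|&i| (i as usize) < data_len)
}

fn main() {
    assert!(verify_indices_in_bounds(&[0, 5, 9], 10));
    assert!(!verify_indices_in_bounds(&[0, 5, 10], 10)); // 10 is out of bounds
}
```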

In neither case is this validation particularly fast, but it's certainly fast enough to run in the transformer, where we mostly don't care about performance. Because the internals of these data structures are private, and we write all the constructors, we can guarantee that every CodePointTrie / CaseMapping ever created preserves the necessary internal invariants. This property is preserved through serialization and deserialization. Malicious serialization aside, there can't be panics.

Your stance (correct me if I'm wrong) appears to be: well, somebody could hand-craft an invalid serialized CodePointTrie, so we had better make sure that we also call CodePointTrie::validate() after deserializing data. My response: if there are users for whom that's part of their threat model, then let them call CodePointTrie::validate() by hand. Malicious serialization will not be part of the threat model for the majority of ICU4X users. It doesn't make sense to slow down initialization and/or data access for people who can trust their data, just to let people who don't trust their data avoid having to write one function call.

@zbraniecki
Member

I'm siding with Iain and Henri here on what should end up being the "common" case.

But I think Shane's point is that in order to be able to have both models working, we need to first establish the more challenging one (with validation), and then disabling it for secure users will be easy.

Starting with insecure and baking additional validation later will be much more challenging.

@iainireland
Contributor

In my proposed world, we need the validation code for the transformer anyway, so there's no temptation to leave it for later.

Another way of framing my proposal: data structures with internal invariants that can panic if the invariants aren't maintained should have a public "verify_internal_invariants" function. That function should be called in every public constructor (which we expect to be used by transformers / "slow" data providers), but not by default on deserialization.

Advantages of doing it this way:

  • verify_internal_invariants can be relatively heavyweight, because it's not going to be called on the hot path.
  • We can use expect where it makes sense, instead of surprising workarounds like this one where we have a "default response" to be used only in the case of bogus data.

Disadvantages of doing it this way:

  • If somebody deserializes untrusted data and forgets to call verify_internal_invariants, then malicious data can cause panics. Note that this isn't "insecure" in the sense of unsafe; all of Rust's security guarantees still hold. It just means that we aren't by default 100% stable in the presence of actively malicious data (but the user can still achieve that stability by calling verify_internal_invariants).

For a concrete example: in CaseMapping, we often store a delta between a character and its upper/lowercase counterpart. After applying the delta, we have to use char::from_u32() to convert back to a Rust char. If the delta is untrusted, then it's possible for c + delta to correspond to an unpaired surrogate, which is not a valid char. Under the status quo, we would have to pick a default return value in this case (either a fixed char, or c itself). In my proposal, because verify_internal_invariants is not on the hot path, we can simply iterate through the range of chars and verify that the delta (if it exists) never produces an invalid char. Then, we can use char::from_u32(...).expect(), instead of having to invent a default return value. It's faster, easier to write, and less surprising for the user.
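The surrogate hazard described above can be demonstrated concretely (the helper name is illustrative):

```rust
// Sketch: applying a case-mapping delta to a char can land in the
// surrogate range U+D800..=U+DFFF, which is not a valid char. A one-time
// validation pass over the deltas rules that out up front.
fn delta_is_valid(c: char, delta: i32) -> bool {
    let mapped = c as i32 + delta;
    u32::try_from(mapped).ok().and_then(char::from_u32).is_some()
}

fn main() {
    // 'a' + 1 is 'b': fine.
    assert!(delta_is_valid('a', 1));
    // U+D7FF + 1 lands on the unpaired surrogate U+D800: invalid.
    assert!(!delta_is_valid('\u{D7FF}', 1));
}
```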

@sffc
Member

sffc commented Dec 2, 2021

if there are users for whom that's part of their threat model, then let them call CodePointTrie::validate() by hand.

This sounds like something one might do in C++. But in Rust, things are safe by default, unless they are annotated as unsafe. An option here could be to have an "unsafe serde" which you can call in an unsafe block, which is what #1290 is about.

It doesn't make sense to slow down initialization and/or data access for people who can trust their data, just to let people who don't trust their data avoid having to write one function call.

My ideal world is one in which we use an equally efficient code path to handle both trusted and untrusted data.

The easiest way to achieve this is to simply ensure that any data-driven algorithm always runs to completion and never panics or hits undefined behavior. Garbage-in-garbage-out is how we currently do this in CPT, and that is also my first choice for how to handle similar situations such as CaseMap and Segmenter.

The original impetus for this thread was that people didn't seem to like garbage-in-garbage-out in the case of CPT. So I proposed several alternatives in this thread, as well as #1290.

@iainireland
Contributor

This sounds like something one might do in C++. But in Rust, things are safe by default, unless they are annotated as unsafe. An option here could be to have an "unsafe serde" which you can call in an unsafe block, which is what #1290 is about.

As I already pointed out, nothing I'm proposing is unsafe. There's no reason to wrap "unsafe serde" in an unsafe block, because it's not doing anything that requires unsafe, and there's no way to violate memory safety. All the bounds checks are still there. (Note that there was one case in my first draft of CaseMapping where I got lazy and used unsafe; I've already fixed that.)

There are plenty of examples in good Rust code, including standard library code, where misused APIs can panic. An arbitrary example (that comes to mind because I just used it in Advent of Code) is the windows() method on slices, which panics if given a window size of 0. It would be possible for the standard library authors to have this function return an Option, or an empty iterator, but they chose to make it panic instead, with the expectation that anybody passing in untrusted input should validate it first. I agree that it's valuable to minimize the number of possible panics in our code, but I don't agree that we should force that number to zero at all costs. When we design the external API for this code, we can document the conditions under which panics are possible, and the steps that can be taken to prevent them. This gives users of the library the ability to determine their own trust boundaries, which can be anywhere from "I trust this data implicitly" to "I validated this data when it was installed, and now I will just verify a signature" to "I will validate this data every time I load it".
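The standard-library precedent mentioned above is easy to demonstrate: `slice::windows` panics on a window size of 0 rather than returning an Option or an empty iterator.

```rust
use std::panic;

// slice::windows panics on a zero window size by design; callers are
// expected to validate untrusted sizes before passing them in.
fn main() {
    let v = [1, 2, 3, 4];
    // Normal use: three overlapping windows of size 2.
    let pairs: Vec<&[i32]> = v.windows(2).collect();
    assert_eq!(pairs.len(), 3);
    // Misuse panics; catch it here just to demonstrate.
    let result = panic::catch_unwind(|| v.windows(0).count());
    assert!(result.is_err());
}
```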

Garbage-in-garbage-out can have arbitrarily weird side effects. For example, another invariant I validate in my implementation of CaseMapping is that every index into the exception table points to a valid header, and every header is followed by the expected amount of variable-length data. Because of that, when I read a value out of the backing array, I can safely assume that the index was not out of bounds. If I could not, then under garbage-in-garbage-out, I would have to pick an arbitrary return value for CaseMappingExceptions::get. There are a dozen different places that call get inside CaseMappingExceptions, each of which interprets the return value in a potentially different way. Picking an arbitrary return value would still be safe (because everything in this code is safe), but after mashing it through several layers, the user-visible result would be unpredictable. In effect, the garbage-in-garbage-out world is one in which no data structure is allowed to have any internal invariants, which makes reasoning about code miserable.

@sffc
Member

sffc commented Dec 3, 2021

As I already pointed out, nothing I'm proposing is unsafe. There's no reason to wrap "unsafe serde" in an unsafe block, because it's not doing anything that requires unsafe, and there's no way to violate memory safety.

Sure; what you're proposing is panicky, not unsafe. I'm trying to say that panicky behavior eliminates only part of validation overhead. We can also eliminate other types of validation that Serde already performs if we had an unsafe block; for example, UTF-8 validation and ULE validation.

Garbage-in-garbage-out can have arbitrarily weird side effects. ... In effect, the garbage-in-garbage-out world is one in which no data structure is allowed to have any internal invariants, which makes reasoning about code miserable.

I agree that garbage-in-garbage-out means no internal invariants (only on affected data structures). I don't agree that it makes reasoning about code "miserable". We achieved this quite elegantly in CPT, and having looked at some of your case mapping code, I think the same outcome can be achieved there.

For example, ... under garbage-in-garbage-out, I would have to pick an arbitrary return value for CaseMappingExceptions::get. There are a dozen different places that call get inside CaseMappingExceptions, each of which interprets the return value in a potentially different way. Picking an arbitrary return value would still be safe (because everything in this code is safe), but after mashing it through several layers, the user-visible result would be unpredictable.

If the function is fallible, you can return an error and everyone is happy. If the function is infallible, like in CPT, then you can return some arbitrary default value. The behavior is deterministic and doesn't panic. The return value is "garbage", but not "unpredictable".
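The two shapes contrasted here can be sketched side by side (function names are illustrative):

```rust
// Sketch: a fallible lookup surfaces the problem as an Err; an
// infallible lookup returns a deterministic default. Neither panics,
// and the infallible result is "garbage" but predictable.
fn get_fallible(data: &[u32], i: usize) -> Result<u32, &'static str> {
    data.get(i).copied().ok_or("index out of bounds")
}

fn get_infallible(data: &[u32], i: usize) -> u32 {
    // Deterministic default value on out-of-bounds, never a panic.
    data.get(i).copied().unwrap_or(0)
}

fn main() {
    let data = [7, 8, 9];
    assert_eq!(get_fallible(&data, 1), Ok(8));
    assert!(get_fallible(&data, 5).is_err());
    assert_eq!(get_infallible(&data, 5), 0);
}
```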

@sffc
Member

sffc commented Dec 10, 2021

I summarized our discussion today in #1290 (comment).

@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Jan 20, 2022
@sffc
Member

sffc commented Jan 20, 2022

Conclusion on the original issue in this thread: OK to set the const generic to 0.
