Set CPT ValueWidth::DATA_GET_ERROR_VALUE to 0 #1183
Also resolve this comment:
Originally posted by @iainireland in #1167 (comment)
If the input is an invalid code point, then the CodePointTrie index-computation code should return the index of where the error value is stored in the data array, so the hardcoded value is still not used. Only if the data is totally messed up and you get index-out-of-bounds errors is there a way to get to this sort of hardcoded value, and then it might as well be 0 just because that's simplest.
Also, there is no point in hardcoding an override of this value in each value-type-specific version of a trie (as seen here: #1233 (comment)). That would be misleading if it differs from the error value that is stored inside the trie. If anything, the value-type-specific version could be fetched from the trie data during initialization. And when that fails because the data is totally broken, then you might as well throw or pass through an error. Or else convert numeric zero into the value type.
This is truly a "last resort" fallback which really should be an error, because it's only returned to recover from an array out of bounds. However, it's so unusual (only as a result of malformed data) that it's not really correct to make the whole CPT fallible (returning a Result) because of it. I see this as a garbage-in-garbage-out scenario. We need it as an associated type because Rust allows us to store "upgraded" or "exotic" types in the CPT.
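The shape being discussed can be sketched roughly as follows. This is illustrative only, not the actual icu_codepointtrie API: the fallback lives as an associated constant on the value-width trait and is used only when a computed index is out of bounds.

```rust
// Hypothetical sketch of the ValueWidth pattern under discussion; names
// are illustrative, not the real icu_codepointtrie definitions.
trait ValueWidth: Copy {
    /// Last-resort fallback, returned only for out-of-bounds reads on
    /// malformed data (the garbage-in-garbage-out case).
    const DATA_GET_ERROR_VALUE: Self;
}

impl ValueWidth for u8 {
    const DATA_GET_ERROR_VALUE: u8 = 0;
}

impl ValueWidth for u16 {
    const DATA_GET_ERROR_VALUE: u16 = 0;
}

impl ValueWidth for u32 {
    const DATA_GET_ERROR_VALUE: u32 = 0;
}

/// Data-array read that falls back to the hardcoded constant instead of
/// panicking when the computed index is out of bounds.
fn data_get<W: ValueWidth>(data: &[W], index: usize) -> W {
    data.get(index).copied().unwrap_or(W::DATA_GET_ERROR_VALUE)
}
```

With correct code and data, the fallback branch is unreachable for any input code point, which is why 0 is as good a value as any.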
Zibi gave me an idea on Slack, which is that if we really don't want to define the default error value for these types, a workaround would be to add invariants to the CodePointTrie itself, such that it's impossible to have a CodePointTrie where the error value results in an out-of-bounds access. Note that right now CodePointTrie is zero-validation; we just point directly to the byte buffer. Adding validation may complicate or slow down certain operations.
Here's another idea. Could we change the CPT data model such that the error value is stored as a field in the header rather than an index in the array? Then we can use that as our fallback. |
Isn't it sufficient to validate data when we read it in from toml? If we don't serialize bad data, then we shouldn't deserialize bad data. |
I agree. I think cryptography—not complicating data structures or run-time lookup code—is the right solution for users of ICU4X who are concerned about deserializing data that didn't come from ICU4X's own data blob building process. If the data blob building process had a bug, a panic is reasonable at the time of use of the data. That is, I think we should treat data built by the ICU4X build process as an internal part of ICU4X for error handling purposes. In cases where the data actually travels externally instead of being baked into the data segment of the executable file holding ICU4X's code, I think cryptography is the right way to ensure integrity. As a practical matter, I think it's not worthwhile to write a collator with the assumption that the collation data (where collation elements carry indices into data) might be bogus, and validating the collation data upon loading would be complicated and expensive compared to checking a signature when the data has traveled externally. |
+1 Between UTrie2 and the newer CodePointTrie, I moved the error value from the header struct into the data array in order to make the code simpler and more uniform. The error handling is inside the compute-data-index functions, and thus independent of the value width. The caller of the index computation just takes the index and fetches a value from the data array. If you want a little bit of validation, there are two or three special values at the end of the data array, and actually ASCII is stored linearly. So you could bail out if the data array is shorter than 128, and that also guarantees that you can fetch values from near the end of the array. If the data array is long enough but its contents are bogus, then those integer values might be out of bounds for converting to Script etc., but again you could bail out if you can't convert the error value fetched from there.

I also wonder about the ROI of wrapping the code point trie such that the caller automagically gets a Script or General_Category enum object. It's nice, but it seems to require extra complication and boilerplate. Consider just returning integer values?
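A minimal sketch of the light validation described above, under one simplifying assumption: here the error value is taken from the last slot of the data array, whereas the real CodePointTrie format keeps two or three special values near the end.

```rust
// Sketch of the "bail out if too short for linear ASCII" check.
// The slot assignment is an assumption for illustration; the real
// layout is defined by the ICU4X/ICU4C CodePointTrie format.
const ASCII_LIMIT: usize = 128;

/// Returns the trie's stored error value, or None if the data array is
/// clearly malformed (too short to even hold the linear ASCII range).
fn fetch_error_value(data: &[u32]) -> Option<u32> {
    if data.len() < ASCII_LIMIT {
        return None;
    }
    // Length >= 128 also guarantees the special values near the end of
    // the array are safely reachable; assume the last slot here.
    data.last().copied()
}
```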
Viewing data as untrusted is a principle we've had since the early days of ICU4X. It is largely based on the principle that data can be overridden and customized by the user. That's a feature that is commonly requested (e.g. tc39/ecma402#210), but which ICU does not support very well, and is therefore something that is central to the design of the ICU4X data provider. I'm happy to discuss alternative mechanisms to ensure validity of data, such as cryptographic signing of data packs, which could be in the context of a "fast path" that bypasses certain validation safeguards. This should be the subject of a different issue.
The advantage is that we don't need to re-validate that the return value is valid. If we return an integer, it is unsafe in Rust, for good reason, to cast it to the corresponding enum.
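To illustrate why returning raw integers pushes validation onto the caller: transmuting an arbitrary integer to a Rust enum is unsound, so a checked conversion is required instead. Script here is a toy stand-in enum, not the real ICU4X property type.

```rust
use std::convert::TryFrom;

// Toy stand-in for a property enum; only three variants for brevity.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Script {
    Unknown = 0,
    Latin = 1,
    Greek = 2,
}

impl TryFrom<u32> for Script {
    type Error = u32;

    /// Checked conversion: bogus trie data yields Err rather than an
    /// invalid enum value (which would be undefined behavior if cast
    /// unchecked).
    fn try_from(value: u32) -> Result<Self, u32> {
        match value {
            0 => Ok(Script::Unknown),
            1 => Ok(Script::Latin),
            2 => Ok(Script::Greek),
            other => Err(other),
        }
    }
}
```

Wrapping the trie so that lookups return Script directly means this conversion happens once, inside the wrapper, instead of at every call site.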
We can let users bring their own data while still pushing validation into the build process. Users provide toml, which is untrusted. The slow toml-backed data provider validates the data and produces a valid codepoint trie. The resulting codepoint trie can now be trusted, even if we serialize and then deserialize it. I guess technically it's possible to create an invalid serialized CPT by hand, but I don't see much value in defending ourselves against that. We can provide optional validation to be used in the highly unusual case that somebody wants to use untrusted data at runtime, without imposing a startup penalty on normal use cases where the data is inside the trust boundary. |
From discussion on Slack: it would be valuable to have a mode in which we skip the validation step for ICU4X data if we know that the data is from a trusted source. @zbraniecki may file a follow-up issue to discuss that. The status quo, though, is that Serde and DataProvider assume data is untrusted, and therefore validation is required. In order to make constructive progress on the original topic of this thread (the fallback error value for ValueWidth in CodePointTrie), we should discuss it within the current framework of how ICU4X data works. |
Here are what I see as the potentially viable options for the original problem in this thread:
I think option 4 satisfies everyone's requirements. It makes the validation question moot, because CPT won't need validation as it currently does not. |
If the validation includes checking that the data array length is at least 128, then you can access the error value integer. I would fetch that at load/init time, and convert it to the API type like Script. If the validation fails, or if the conversion to the API type fails, then throw an error/exception. You can guarantee that accesses into the data array never go out of bounds by either explicitly checking for bounds in index computation, or by validating during initialization that none of the indexes go out of bounds for next-stage index or data blocks. There is a complication with the unusual-but-possible 18-bit data block offsets. Worst case, you would have to inspect roughly 17k index entries, but you could inspect shared index blocks only once and then skip. |
Paths crossed. Option 3 sounds good. You want the option of rejecting bad data, even if you don't have an elaborate test for that yet. |
Can you elaborate on the benefit of option 3 vs option 2? |
Option 3 slows down initialization a bit but then you have no further overhead. You will want some validation; it is quick to check that it looks like a CodePointTrie and that the data array is large enough for linear ASCII. Then you can fetch the error value. So I would make the constructor return a Result, and make it at least fetch & convert the error value -- or else fail. Full validation of all indexes takes a bit more time, of course.

Option 2 lets you do less validation up front but adds checks into the runtime code.

Option 1 = status quo handles the out-of-bounds case by returning a hardcoded error value, right? That's not orthogonal to option 3. I propose that the constructor fetch and convert the error value, rather than hardcoding it; and because that initialization step can fail, the constructor should return a Result. You then don't need get() to return a Result, too.
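This "fallible constructor, infallible get()" shape can be sketched as follows, with illustrative types and an assumed error-value slot (not the real trie layout):

```rust
// Sketch of option 3: validate once at construction, then lookups never
// need to return a Result. Field names and the error-value slot are
// assumptions for illustration only.
struct Trie {
    data: Vec<u32>,
    error_value: u32,
}

#[derive(Debug, PartialEq)]
enum TrieError {
    TooShort,
}

impl Trie {
    /// Fallible constructor: checks the minimum length needed for
    /// linear ASCII and caches the stored error value.
    fn try_new(data: Vec<u32>) -> Result<Trie, TrieError> {
        if data.len() < 128 {
            return Err(TrieError::TooShort);
        }
        let error_value = *data.last().unwrap(); // assumed slot
        Ok(Trie { data, error_value })
    }

    /// Infallible lookup: an out-of-bounds index falls back to the
    /// error value fetched at construction time, not a hardcoded one.
    fn get(&self, index: usize) -> u32 {
        self.data.get(index).copied().unwrap_or(self.error_value)
    }
}
```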
Question: is the CPT algorithm guaranteed to terminate given untrusted data to operate on? I've been assuming the answer is yes, but I've never asked it explicitly. |
It will terminate. There are no loops, no search. It's a constant time lookup within each of certain code point ranges, but higher-code point ranges use more indirections or complications; still fixed. What can go wrong with bogus data is that you do array-index-out-of-bounds. |
Is this the option that can panic in this case: "What can go wrong with bogus data is that you do array-index-out-of-bounds."? If so, I think we should go with this option. Specifically: If someone wants to customize the data, they should customize the input to the data pipeline and run the data pipeline on their customization. E.g. for collation, it's completely unreasonable to manually tweak the output of the data pipeline, since it's so likely that the result would be malformed. For collation, someone wishing to customize should create a CLDR-like input, run it through the ICU4C builder, dump to TOML, and convert the TOML to a binary.
I think this is bad: it contaminates the implementation code with Result handling. Consider the collator: the way a collator is used is that you plug it into some larger sorting operation. For that use case, you want the collator to return an Ordering, not a Result. Propagating a Result out of a comparison function would be awkward.
Even if this approach was feasible for
This would make ICU4X compute bogus results if externally-traveled data actually has been corrupted. In terms of surprise and what it takes to understand what's going wrong, this seems worse than any of the other options. However, in terms of usage ergonomics, I'd rather settle on this option than option 2 if we can't agree on panicking on malformed data at the time of making lookups from the data.
Follow-up for handling cases where we know data is trusted: #1290 |
Picking this conversation back up, because it's relevant in #1349: I agree with Henri's summary. In particular, I think this point (wrt option 4) is important:
In cases where corrupted data is a live issue, there's a real cost to pushing off its detection until later. I'd like to propose a fifth option:
Shane has made a good case that the slow path needs to exist. I am not convinced, though, that it should be the mandatory default. If somebody is dynamically deserializing untrusted data, I think it's reasonable to ask that user to jump through one additional hoop (when the alternative is adding hoops for other, less esoteric use cases). Note that this also allows other forms of validation, like Henri's suggestion of checking a signature. |
I see untrusted, user-provided data as a core part of the value proposition of this project. This might be a factor contributing to our difference in perspectives on this issue.
I don't see how this would work. The transformer transforms from some arbitrary source data to the ICU4X data struct. That's a different type of validation than what the Deserialize impl needs to do. Here's another way to look at this. The Serde Deserialize impl is a constructor. The job of a constructor is to make sure that a well-formed object is returned, satisfying all invariants of private fields. I'm open to exploring options for allowing different code paths in Serde for trusted and untrusted data. See #1290. I think that is the correct way to solve this problem. |
I buy the case for decoupling data from code. In that sense, I agree that user-provided data (that is, data that users have built for themselves at some point in the future, using the tools that we provide) is valuable. I don't understand the case for data from any non-ICU source. Who's bringing their own case mappings? (I can imagine an argument along the lines of "we want to have one bundle of data on a system for many programs to share, and we don't want control of that bundle to be a vector for exploitation". That's a plausible reason to ensure that we don't do anything unsafe depending on deserialized data, but it's not an argument for avoiding panics caused by bad data.)
I already implemented it for CaseMapping. In a similar way, it would be relatively easy to write a function for CodePointTrie that walks through the index/data arrays and verifies that we'll never have an out-of-bounds read. Post-validation, we can rely on that invariant.

In neither case is this validation particularly fast, but it's certainly fast enough to run in the transformer, where we mostly don't care about performance. Because the internals of these data structures are private, and we write all the constructors, we can guarantee that every CodePointTrie / CaseMapping ever created preserves the necessary internal invariants. This property is preserved through serialization and deserialization. Malicious serialization aside, there can't be panics.

Your stance (correct me if I'm wrong) appears to be: well, somebody could hand-craft an invalid serialized CodePointTrie, so we had better make sure that we also call CodePointTrie::validate() after deserializing data. My response: if there are users for whom that's part of their threat model, then let them call CodePointTrie::validate() by hand. Malicious serialization will not be part of the threat model for the majority of ICU4X users. It doesn't make sense to slow down initialization and/or data access for people who can trust their data, just to let people who don't trust their data avoid having to write one function call.
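A walk-and-verify function of the kind described above might look like this, using a deliberately simplified two-level layout (the real trie has more stages and block types, including 18-bit data block offsets):

```rust
// Sketch of transformer-time validation: visit every index entry once
// and confirm the data block it points at fits inside the data array.
// After this passes, lookups can assume in-bounds reads.
fn validate(index: &[u16], data_len: usize, block_size: usize) -> bool {
    index
        .iter()
        .all(|&start| (start as usize) + block_size <= data_len)
}
```

Because this runs in the transformer (or any constructor), the cost is paid once per build, not per lookup.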
I'm siding with Iain and Henri here on what should end up being the "common" case. But I think Shane's point is that in order to be able to have both models working, we need to first establish the more challenging one (with validation), and then disabling it for secure users will be easy. Starting with insecure and baking additional validation later will be much more challenging. |
In my proposed world, we need the validation code for the transformer anyway, so there's no temptation to leave it for later. Another way of framing my proposal: data structures with internal invariants that can panic if the invariants aren't maintained should have a public "verify_internal_invariants" function. That function should be called in every public constructor (which we expect to be used by transformers / "slow" data providers), but not by default on deserialization. Advantages of doing it this way:
Disadvantages of doing it this way:
For a concrete example: in CaseMapping, we often store a delta between a character and its upper/lowercase counterpart. After applying the delta, we have to check that the result is still a valid character.
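The delta scheme can be sketched as follows; using char::from_u32 as the validity check is my assumption here, since it rejects surrogates and values above U+10FFFF, which is exactly what a bogus delta could produce.

```rust
use std::convert::TryFrom;

// Sketch of the delta scheme: the mapped character is the original
// scalar value plus a stored delta. With validated data the result is
// always a valid char; with bogus data the checks below catch it.
fn apply_delta(c: char, delta: i32) -> Option<char> {
    let mapped = (c as i32).checked_add(delta)?;
    // from_u32 rejects surrogates and out-of-range values, so a
    // corrupted delta yields None rather than an invalid char.
    char::from_u32(u32::try_from(mapped).ok()?)
}
```

Under the validate-in-the-transformer model, the `None` branches are provably unreachable after validation, so internal code could skip the re-check.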
This sounds like something one might do in C++. But in Rust, things are safe by default, unless they are annotated as unsafe. An option here could be to have an "unsafe serde" which you can call in an unsafe block, which is what #1290 is about.
My ideal world is one in which we use an equally efficient code path to handle both trusted and untrusted data. The easiest way to achieve this is to simply ensure that any data-driven algorithm always runs to completion and never panics or hits undefined behavior. Garbage-in-garbage-out is how we currently do this in CPT, and that is also my first choice for how to handle similar situations such as CaseMap and Segmenter. The original impetus for this thread was that people seemed like they didn't like garbage-in-garbage-out in the case of CPT. So, I proposed several alternatives in this thread, as well as #1290.
As I already pointed out, nothing I'm proposing is unsafe. There's no reason to wrap "unsafe serde" in an unsafe block, because it's not doing anything that requires unsafe, and there's no way to violate memory safety. All the bounds checks are still there. (Note that there was one case in my first draft of CaseMapping where I got lazy and used unsafe; I've already fixed that.) There are plenty of examples in good Rust code, including standard library code, where misused APIs can panic. An arbitrary example (that comes to mind because I just used it in Advent of Code) is the windows() method on slices, which panics if given a window size of 0. It would be possible for the standard library authors to have this function return an Option, or an empty iterator, but they chose to make it panic instead, with the expectation that anybody passing in untrusted input should validate it first. I agree that it's valuable to minimize the number of possible panics in our code, but I don't agree that we should force that number to zero at all costs. When we design the external API for this code, we can document the conditions under which panics are possible, and the steps that can be taken to prevent them. This gives users of the library the ability to determine their own trust boundaries, which can be anywhere from "I trust this data implicitly" to "I validated this data when it was installed, and now I will just verify a signature" to "I will validate this data every time I load it". Garbage-in-garbage-out can have arbitrarily weird side effects. For example, another invariant I validate in my implementation of CaseMapping is that every index into the exception table points to a valid header, and every header is followed by the expected amount of variable-length data. Because of that, when I read a value out of the backing array, I can safely assume that the index was not out of bounds. 
If I could not, then under garbage-in-garbage-out, I would have to pick an arbitrary return value.
Sure; what you're proposing is panicky, not unsafe. I'm trying to say that panicky behavior eliminates only part of validation overhead. We can also eliminate other types of validation that Serde already performs if we had an unsafe block; for example, UTF-8 validation and ULE validation.
I agree that garbage-in-garbage-out means no internal invariants (only on affected data structures). I don't agree that it makes reasoning about code "miserable". We achieved this quite elegantly in CPT, and having looked at some of your case mapping code, I think the same outcome can be achieved there.
If the function is fallible, you can return an error and everyone is happy. If the function is infallible, like in CPT, then you can return some arbitrary default value. The behavior is deterministic and doesn't panic. The return value is "garbage", but not "unpredictable".
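The two shapes contrasted above can be sketched side by side (illustrative functions, not ICU4X API):

```rust
// Fallible lookup: bogus input surfaces as an explicit error.
fn get_fallible(data: &[u32], i: usize) -> Result<u32, &'static str> {
    data.get(i).copied().ok_or("index out of bounds")
}

// Infallible lookup: bogus input yields a deterministic default.
// "Garbage", but predictable -- no panic, no unpredictability.
fn get_infallible(data: &[u32], i: usize) -> u32 {
    data.get(i).copied().unwrap_or(0)
}
```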
I summarized our discussion today in #1290 (comment). |
Conclusion on the original issue in this thread: OK to set the const generic to 0. |
Currently, we set the associated value ValueWidth::DATA_GET_ERROR_VALUE to be the max value of the impl types (u8::MAX, u16::MAX, u32::MAX). A better value for this error value would be 0, which matches the Unknown enumerated value for most enumerated properties. This is a minor improvement since, after all, if the CodePointTrie code and data (from the data generator) are all correct, this error value will never be returned for valid code points. (If the input is an invalid code point, then this error value will be returned, but garbage-in-garbage-out.)