Refactor of the code, 0.8 compatibility, zero-copy AST/parsing, Rust 2018 goodies #76


Merged

merged 39 commits on Jan 24, 2019

Conversation

zbraniecki
Collaborator

This is a major rewrite, to the point where I'm not even sure how exactly to plan reviewing and landing it, but I think it is getting close to being done.

There's one open question about FluentBundle::add_messages which I posted in projectfluent/fluent.js#328

This, unfortunately, breaks a lot of examples/tests in fluent-bundle, since they previously passed &str freely and constructed FluentResource on the fly.
We'll need to answer that question, then clean up the code, make sure that this zero-copy abstraction works well for Amethyst, and begin reviewing it to land.

I believe that the proposed structure and memory allocation model should bring us close to the model we'll use for fluent-rs 1.0.

Add a simple binary for testing

Support menubar.ftl except for expressions

Support multiline patterns

Add EOF tests support

Support basic JSON fixtures

Support first set of comparison tests

Support most of the comments

Use serializer instead of deserializer

Uncomment some fixture tests with slight modifications

Support Junk

Use ps.take_if in more places

Support all reference fixtures

Fix benches

Support TermReference

Support variants and select expression

Update to rust 2018 preview 2

Use 2018 use extern macros

Clean up whitespace collecting

Remove try

Update bench to Rust 2018

Support parsing of 0.7 reference tests

Support CallExpressions
@zbraniecki
Collaborator Author

My initial thinking is to split this PR into 4 PRs, each updating a separate crate, and get them reviewed separately, releasing them in order fluent-syntax -> fluent-cli -> fluent-bundle -> fluent.

@zbraniecki
Collaborator Author

For fluent's ResourceManager I'll want to migrate it to use Manish's elsa - https://github.com/Manishearth/elsa

@zbraniecki
Collaborator Author

This also resolves #16

This is a massive rewrite of the whole crate.

It separates out three crates:
fluent-syntax becomes a zero-copy AST + parser (+ serializer in the future), compatible with Fluent 0.8
fluent-bundle becomes aligned with fluent-bundle in JS
fluent becomes a top-level crate with an added simple ResourceManager.

All of the code is compatible with current stable Rust 2018.
@stasm
Contributor

stasm commented Dec 27, 2018

My initial thinking is to split this PR into 4 PRs, each updating a separate crate, and get them reviewed separately, releasing them in order fluent-syntax -> fluent-cli -> fluent-bundle -> fluent.

What is the benefit of going for 4 separate packages? In my experience, too much granularity makes publishing new versions tedious and it also means that there are more permutations of packages installed in the environment at the same time, which makes it easy to run into errors due to mismatched dependencies.

I can see the benefit of publishing fluent-cli as a separate package, since that's a bin-type crate, rather than lib. I would recommend putting the other three in a single fluent crate.

@zbraniecki
Collaborator Author

What is the benefit of going for 4 separate packages?

Each package has a different set of requirements and "adds up". The fluent-syntax crate, much like in the case of JS and Python, will be useful for tooling environments which don't need anything else beyond the parse/ast/serialize cycle.
fluent-bundle adds the resolver but remains independent of any I/O, making it the right level for environments such as gaming systems or Gecko which provide their own resource management systems.
fluent is the complete package which provides its own, minimal ResourceManager, minimizing the amount of custom code one has to write to make their project localizable.
fluent-cli (or maybe fluent-cli-tools?) is a set of binaries to work with ftl/fluent which may eventually become an umbrella for a collection of separately released tools (fluent-cli-parse, fluent-cli-validator etc.).

Does it make sense?

In my experience, too much granularity makes publishing new versions tedious and it also means that there are more permutations of packages installed in the environment at the same time, which makes it easy to run into errors due to mismatched dependencies.

I think Cargo is really good with that, and the ecosystem actively encourages dividing code into many small crates, which allows others to better pick which parts they need (see the family of serde crates) and allows for parallel compilation of the crates.

I would recommend putting the other three in a single fluent crate.

It would mean that any tool would have to compile the Bundle/Resolver/ResourceManager parts of Fluent even if all they need is the FTL parser. I think that this use case is distinct enough to warrant splitting it out (much like librustc uses syntax, syntax_pos and even serialize as separate crates - https://github.com/rust-lang/rust/blob/master/src/librustc/Cargo.toml#L28).

As to whether fluent-bundle should be a separate crate, I'm less strongly opinionated.

@zbraniecki
Collaborator Author

Hooray! This branch now compiles and all tests pass! The performance is good, but can be better with time.

Initially, with no optimization work put into the parser, I'm seeing:

master:

test bench_menubar_parse ... bench:     407,840 ns/iter (+/- 32,920)
test bench_simple_parse  ... bench:      45,287 ns/iter (+/- 4,698)

zero-copy-resolver:

test bench_menubar_parse ... bench:     155,191 ns/iter (+/- 31,100)
test bench_simple_parse  ... bench:      22,147 ns/iter (+/- 15,138)

The resolver benchmarks are preserved.

I'd like to land this update soon now, to bring the rust implementation to Fluent 0.8 syntax and move to work on documentation, errors, ergonomics and higher level bits (like ResourceManager).

@zbraniecki
Collaborator Author

I asked @Manishearth to review the examples and crate split and he said that it seems like a good natural split of meta-crates.

I'd like to preserve the split between fluent and fluent-bundle and eventually consider moving the ResourceManager to its own crate like fluent-resource-manager and referencing it from the catch-all fluent crate.

@zbraniecki
Collaborator Author

@Xaeroxe - can you verify that the zero-copy-resolver branch works with Amethyst model? In particular I'd like to make sure that:

  • The zero-copy parser into fluent-bundle::resource::FluentResource which uses rental to allocate its String works well.
  • The API changes I'm making in fluent-bundle which allow libs like Amethyst to plug their own resource manager will work for amethyst-locale.
  • The POC of the fluent::resource_manager::ResourceManager is something that you'll either want to use, or replace with your own resource manager.

Thank you!

pub struct FluentResource(rentals::FluentResource);

impl FluentResource {
    pub fn try_new(source: String) -> Result<Self, (Self, Vec<ParserError>)> {
Contributor

@JohnHeitmann JohnHeitmann Jan 7, 2019

Can/should this take Into<String> for better caller ergonomics?

Collaborator Author

We considered that with @Manishearth and he recommended not doing it, in order to avoid hiding the allocation.
While ergonomically that would be nice, it would also make it harder for the reader to understand that we will allocate the String and store it inside the FluentResource.

In the future, when a static &str becomes implicitly convertible to String [0], this should be easier.

[0] rust-lang/rust#46993
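A minimal sketch of the trade-off discussed above. The FluentResource here is a simplified stand-in (not the real rental-backed struct), and try_new_into is a hypothetical name for the rejected Into<String> variant:

```rust
// Simplified stand-in for FluentResource; the real one wraps a rental struct.
struct FluentResource {
    source: String,
}

impl FluentResource {
    // Chosen design: the caller sees the allocation explicitly.
    fn try_new(source: String) -> Result<Self, ()> {
        Ok(FluentResource { source })
    }

    // Rejected alternative (hypothetical): more ergonomic, but the
    // to-String allocation is hidden inside the constructor.
    fn try_new_into(source: impl Into<String>) -> Result<Self, ()> {
        Ok(FluentResource { source: source.into() })
    }
}

fn main() {
    // Explicit: the reader sees that a String is allocated and moved in.
    let r1 = FluentResource::try_new(String::from("hello = Hello")).unwrap();
    // Hidden: the conversion/allocation happens behind the call.
    let r2 = FluentResource::try_new_into("hello = Hello").unwrap();
    assert_eq!(r1.source, r2.source);
}
```

Both produce the same resource; the disagreement is only about where the allocation is visible to the reader.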

Collaborator

Note: a thing that could be done here is to make the return value tuple into a proper public struct so it's easier to document

// XXX: In the future we'll want to get the partial
// value out of resolver and return it here.
// We also expect to get a Vec of errors out of resolver.
return Some((path.to_string(), errors));
Contributor

Can format_message be removed as part of this change? It looks like format does everything a caller might want. If not I'll add it to my doc pr.

Collaborator Author

It doesn't do everything. format allows you to get the message's value or a single attribute, but it doesn't allow you to retrieve value+attributes together.

format_message will be useful for higher level crates that will apply a localization message onto a widget/element.

Contributor

format_message will be called compound in fluent.js. Would you like to make this change in fluent-rs in this PR or in a follow-up?

@stasm
Contributor

stasm commented Jan 15, 2019

Each package has a different set of requirements and "adds up". The fluent-syntax crate, much like in the case of JS and Python, will be useful for tooling environments which don't need anything else beyond the parse/ast/serialize cycle.
fluent-bundle adds the resolver but remains independent of any I/O, making it the right level for environments such as gaming systems or Gecko which provide their own resource management systems.
fluent is the complete package which provides its own, minimal ResourceManager, minimizing the amount of custom code one has to write to make their project localizable.
fluent-cli (or maybe fluent-cli-tools?) is a set of binaries to work with ftl/fluent which may eventually become an umbrella for a collection of separately released tools (fluent-cli-parse, fluent-cli-validator etc.).

Does it make sense?

Yes, it makes sense, thanks. My question was driven by pragmatism and the following question: do we have use-cases for so much granularity? Sometimes it's better to design hierarchies based on the observed use-cases rather than on how things derive from each other logically.

However, since you already did all the work to separate them, it's probably best to keep them that way.

I think Cargo is really good with that, and the ecosystem actively encourages dividing code into many small crates, which allows others to better pick which parts they need (see the family of serde crates) and allows for parallel compilation of the crates.

This isn't as much about cargo as it is about the rigor to always update the readmes, the changelogs, the dependencies of higher-level packages etc. Having more packages to juggle also means that it's easier to create permutations of versions which we haven't thought about.

Contributor

@stasm stasm left a comment

I'll need a few more days to finish the review. Thanks for the patience. I have a few comments and questions which perhaps you could answer in the meantime.

It would indeed be easier to review this as a series of logically separate patches, as you suggested in an earlier comment. At the same time I realize that doing so would break some (or a lot of) tests, and fixing them in between the patches would be a lot of unnecessary work. In the future, I think I'd be OK with disabling tests aggressively until all parts of the refactor come together. For this PR, let's leave it as-is. (You might still want to clean up the commit history a bit, unless you're planning to squash all of it into a single commit).

I would also (strongly) prefer to land the ResourceManager in a separate PR altogether. Let's make this PR about fluent-syntax, fluent-bundle and fluent-cli.

I'll try to finish the review by the end of this week.


pub mod errors;
pub mod resolve;
pub mod resource;
pub mod types;
Contributor

Could we use pub use here to re-export the most common symbols, similar to how we do it in fluent.js?

For consumers, this would simplify their use imports:

// BEFORE
use fluent_bundle::bundle::FluentBundle;
use fluent_bundle::types::FluentValue;
use fluent_bundle::resource::FluentResource;
// AFTER
use fluent_bundle::{FluentBundle, FluentResource, FluentValue};
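A sketch of what the suggested re-exports could look like; inline modules stand in for the real bundle.rs / resource.rs / types.rs files, and the item bodies are placeholders:

```rust
mod fluent_bundle {
    pub mod bundle {
        pub struct FluentBundle; // placeholder body
    }
    pub mod types {
        #[derive(Debug, PartialEq)]
        pub enum FluentValue {
            None, // placeholder variant
        }
    }

    // Re-export the common symbols at the crate root, as suggested:
    pub use self::bundle::FluentBundle;
    pub use self::types::FluentValue;
}

fn main() {
    // Consumers can now use the short paths:
    let _b = fluent_bundle::FluentBundle;
    // The re-export and the original path name the same type:
    assert_eq!(fluent_bundle::FluentValue::None, fluent_bundle::types::FluentValue::None);
}
```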

Collaborator Author

I wanted to hold off on such convenience parts until the code stabilizes a little bit. In my mind it's easier to add it than to remove it.
Do you feel like those three structs are certain to be the ones we want to expose on the main crate?

Contributor

These are the three that we've seen used by the consumers of fluent in JS. I think it's OK to do it in a follow-up, too. Up to you.

_ => false,
};
pub fn get_slice(&self, start: usize, end: usize) -> &'p str {
    unsafe { str::from_utf8_unchecked(&self.source[start..end]) }
Contributor

What are the consequences of using unsafe and str::from_utf8_unchecked here? Does the code check somewhere that the input is valid UTF8? Perhaps we should allow the users of the library to opt into the use of the fast & unsafe code (when they are sure all input is valid UTF8), but the code should default to using str::from_utf8?

Collaborator Author

The consequence is that I'm using our collective brain power, instead of compiler's brain power, to ensure that I'm not slicing outside of the boundary.
The boundary checks are relatively expensive, and since we're a full byte parser I believe we always attempt to slice within the range.

So, it's not about whether the input is valid UTF8; it's only about whether the parser always selects a start/end combo that's within the source slice.

I'm going to use a fuzzer to further verify that claim, but I'd prefer not to bloat the lib with options that don't really mean much to the user. Currently, if you pick the wrong start/end the parser crashes, but if we instead use from_utf8 we'll likely use it with expect, which will panic as well :)

If while working on 1.0 we discover that there are a lot of places in the parser that pick start/end wrong, we can try to mitigate it here, but I'd honestly prefer to fix them in the parser.

Please note that unsafe code is not "bad" by design. It makes total sense to use it in many cases, and people like Henri are actively asking for improvements in how boundary-check requirements are calculated, to minimize when we need them: https://hsivonen.fi/rust2019/

Our problem is that we have code like:

let source = "id = Foo".as_bytes();
let mut ptr = 0;
while let Some(b) = source.get(ptr) {
    if *b == b'i' {
        if let Some(b2) = source.get(ptr + 1) {
            if *b2 == b'd' {
                // We just checked both bytes, so the slice is in bounds
                // and sits on ASCII boundaries.
                let id = unsafe { str::from_utf8_unchecked(&source[ptr..=ptr + 1]) };
            }
        }
    }
    ptr += 1;
}

(sorry for how contrived the code is). It would be insanely hard for the compiler to deduce that it can use bounds-check elision, but we can deduce it quite easily - we just checked what is at ptr, and what is at ptr + 1.

Collaborator

why not store the string as &str and use as_bytes() in your processing code everywhere, so that you can safely subslice it with byte indices?
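A minimal sketch of this suggestion: keep the source as &str, do the byte-level scanning via as_bytes(), then subslice the original &str. &str indexing is bounds- and char-boundary-checked, so a bad index panics instead of causing undefined behavior:

```rust
fn main() {
    let source: &str = "id = Foo";
    let bytes = source.as_bytes();

    // Find the end of the leading identifier by byte inspection.
    let mut end = 0;
    while let Some(b) = bytes.get(end) {
        if !b.is_ascii_alphabetic() {
            break;
        }
        end += 1;
    }

    // Subslicing the &str is checked; no unsafe needed.
    assert_eq!(&source[0..end], "id");
}
```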

Contributor

The consequence is that I'm using our collective brain power, instead of compiler's brain power, to ensure that I'm not slicing outside of the boundary.

I understand that unsafe is sometimes necessary and a good design may benefit from it. I question whether it's the right choice for the first iteration of this parser. Relying on collective brain power seems to be specifically what Rust was designed to work around, because the history shows us that brains are fallible.

The boundary checks are relatively expensive and since we're a full byte parser I believe we always attempt to slice within the range.

Just looking at how get_slice is implemented in skip_unicode_escape_sequence I was able to come up with the following test case which makes the parser panic:

x = {"\\u

I tested it with the following command:

./target/debug/fluent-cli <(echo -n 'x = {"\\u')

We should at least take special care when slicing using self.ptr + 1. With the checked version of from_utf8 we'd be forced to support the out of bounds index in the code, which would prevent the panic.

If while working on 1.0 we'll discover that we have a lot of places in the parser that pick start/end wrong we can try to mitigate it here, but I'd honestly prefer to fix them in the parser.

I'd prefer to start with the safer (even if slower) approach and only consider unsafe when we need to squeeze even more performance from the parser.

Contributor

Digging a bit deeper into this, the panic seems to originate from &self.source[start..end] rather than the unsafe call to from_utf8_unchecked. Is there a checked way to index into a slice?

Contributor

OK, so I think there are two things going on here. Can you please confirm that my understanding is correct?

  1. When creating the slice of bytes with &self.source[start..end], end might go out of bounds. We should check that it isn't or maybe use an iterator and the checked slice::get method. The x = {"\\u testcase exposes this bug.

  2. This new slice is then passed to from_utf8_unchecked which is an unsafe method. If the slice of bytes contains a sequence of bytes which is invalid in UTF8, this method will panic. Realistically this is unlikely to happen with files which are read from disk and validated as UTF8, but might still be an issue when the data is not guaranteed to be UTF8.
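A checked sketch of point 1, written as a free function over the byte source (get_slice_checked is a hypothetical name; the real method lives on ParserStream):

```rust
// Checked variant: slice::get returns None instead of panicking when the
// range is out of bounds, and str::from_utf8 validates instead of trusting.
fn get_slice_checked(source: &[u8], start: usize, end: usize) -> Option<&str> {
    let bytes = source.get(start..end)?; // None on out-of-bounds range
    std::str::from_utf8(bytes).ok()      // None on invalid UTF-8
}

fn main() {
    let src = b"x = {\"\\u"; // the crashing testcase from above
    assert_eq!(get_slice_checked(src, 0, 1), Some("x"));
    assert_eq!(get_slice_checked(src, 0, 100), None); // out of bounds, handled
}
```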

Contributor

(…) might still be an issue when the data is not guaranteed to be UTF8.

Is this even possible given that FluentResource::try_new takes a String?

Collaborator Author

The x = {"\u testcase exposes this bug.

I fixed it and reviewed other places where this may happen. Found no other so I think we're good.

This new slice is then passed to from_utf8_unchecked which is an unsafe method. If the slice of bytes contains a sequence of bytes which is invalid in UTF8, this method will panic

No, we don't care if the file is UTF8 or if it contains invalid UTF8 sequences, because the boundaries we set are validated. In other words, since Fluent tokens end on valid UTF8 chars like "\n", "{" etc., we can establish a range and know that we're slicing at the right place. What's inside doesn't really matter to us. So, if the entry looks like this:

key = This is a XXX value { $foo }

then we slice out three times:

  1. key which has a start and end known to be a valid alpha character and every character inside as well.
  2. This is a XXX value which has start as the first non-space char, and end as one-before { char, so also no risk of cutting in the middle of a multi-byte char.
  3. foo - similar to (1)

Even if in (2) the XXX is invalid UTF8, we don't care about it. It still slices properly because of what is a boundary in Fluent.

@@ -1,14 +1,842 @@
//! AST, parser and serializer operations
Contributor

Dir layout: Is there a tangible benefit to keeping the parser code in its own directory, and then the errors in parser/errors/mod.rs? For my own code, I try to avoid nested directory hierarchies. As a rule of thumb, a directory with one or two files in it is probably not worth it :) I'd recommend placing all the files in fluent-syntax/src/.

Collaborator Author

I think there's a value in keeping the parser in its own directory. This way I can easily tell which parts (ftlstream/errors) are parser specific, while ast stays crate specific (and in the future I expect serializer to get its own dir). I removed the errors dir.

Contributor

I still think that this is unnecessary proliferation of directories, but it's also a detail, so not blocking at all.

Collaborator Author

I prefer to keep it in its own directory. I'm experimenting with adding lexer.rs there as well and I like to keep parser specific code in it away from ast/serializer.

@zbraniecki
Collaborator Author

I would also (strongly) prefer to land the ResourceManager in a separate PR altogether. Let's make this PR about fluent-syntax, fluent-bundle and fluent-cli.

Done!

(You might still want to clean up the commit history a bit, unless you're planning to squash all of it into a single commit).

I do plan to squash.

I'm keeping @Manishearth's suggestion open, and I believe the rest of the feedback has been applied. I'll try to investigate his suggestion, but I believe this shouldn't block the rest of the review and can be done in a separate PR prior to release.

Contributor

@Pike Pike left a comment

I went through the parser/mod.rs this time, and I have to admit, I didn't understand all of it.

Some questions below are somewhat rust-related questions, some are just code questions.

I noticed that escapes in literals don't work yet. Also, I noticed two panics in the examples.

Ok(pattern.map(ast::Value::Pattern))
}

fn get_attributes<'p>(ps: &mut ParserStream<'p>) -> Result<Vec<ast::Attribute<'p>>> {
Contributor

Should we make explicit that this method is effectively infallible?

Collaborator Author

For now, yes. I'd like to consider improving the recovery here eventually: by recognizing that the error originated in the attribute, jumping to the next line that either is a start of a new entry or starts with ".", and if the latter, resuming parsing attributes for that entry.

This would allow us to recover from:

key = Value
    .attr = Value { $4 }
    .attr2 = Value 2

by salvaging the attr2 into key message. But that's beyond the scope of this patch, so for now, I fixed it :)

while let Some(b) = ps.source.get(ptr) {
    if ps.is_byte_alphabetic(*b) {
        ptr += 1;
    } else if ptr == ps.ptr {
Contributor

I find the start_of_id logic in the middle of the loop a tad hard to grasp. Is this actually simpler than doing an explicit first char?

Collaborator Author

Yes. Since get_identifier is such a hot loop I tested tons of variants and this is the fastest one. I'm going to switch to a more readable one for the sake of landing it sooner, and we can reevaluate perf optimizations later.

TextElementTermination::LineFeed,
));
}
b'\r' if ps.is_byte_at(b'\n', ps.ptr + 1) => {
Contributor

Can this go out-of-bounds for a stream ending in \r? No idea what happens in rust if that happens.

Contributor

@stasm stasm Jan 22, 2019

I think this is OK. ParserStream::is_byte_at uses the checked slice::get method which returns None whenever the unchecked one would panic.

};

if !is_valid {
//XXX: Generalize error type
Contributor

Or the opposite, make the error type more specific?

named,
}
} else {
return error!(ErrorKind::ForbiddenCallee, ps.ptr);
Contributor

Does rust do early returns? I.e, could we just return an error on invalid, and not indent the valid part?

.version("1.0")
.about("Parses FTL file into an AST")
.args_from_usage(
"-s, --silence 'disable output'
Contributor

Nit: call it --silent and use sentence case for the help text, Disables the output, to make it consistent with the rest of the usage description.

Contributor

Also, what is the goal of this option? It looks like it disables all output. I would expect it would silence the error reporting but still produce the AST.

Collaborator Author

I think I copied it from the JS impl and the goal was to be able to time the parser without timing the formatting and printing of the output. I changed it to only disable errors for now.

.about("Parses FTL file into an AST")
.args_from_usage(
"-s, --silence 'disable output'
<INPUT> 'Sets the input file to use'",
Contributor

Nit: Call it <FILE> and change the help text to Fluent file to parse.

if self.ch == Some(ch) {
    self.next();
    return true;
pub fn take_if(&mut self, b: u8) -> bool {
Contributor

Nit: Perhaps rename this to take_byte or consume_byte for symmetry with expect_byte?

Collaborator Author

expect_* is a fallible method that returns an error. I'll rename it to take_byte_if.


Contributor

@stasm stasm left a comment

Another round of comments. I'll finish the review today, I'd like to take another look at the ResolveValue trait.

I believe this shouldn't block the rest of review and can be done in a separate PR prior to release.

I don't mean to block this PR, but I really think there's something not right about the use of unsafe in ParserStream. Would you agree to use the checked variant of from_utf8 in this PR to land it without controversies, and then discuss switching to unsafe in a follow-up?

}
}
pub fn is_byte_digit(&self, b: u8) -> bool {
    b >= b'0' && b <= b'9'
Contributor

Use u8::is_ascii_digit here?

return Ok('-');
}
pub fn is_byte_alphabetic(&self, b: u8) -> bool {
    (b >= b'a' && b <= b'z') || (b >= b'A' && b <= b'Z')
Contributor

Use u8::is_ascii_alphabetic here?
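The std helpers suggested in these two comments are available on u8 in stable Rust, so both predicates could delegate to them. A quick sketch of the behavior:

```rust
// is_byte_digit / is_byte_alphabetic could delegate to u8's built-in
// ASCII predicates instead of hand-rolled range checks.
fn main() {
    assert!(b'7'.is_ascii_digit());
    assert!(!b'a'.is_ascii_digit());
    assert!(b'z'.is_ascii_alphabetic());
    assert!(b'Z'.is_ascii_alphabetic());
    assert!(!b'{'.is_ascii_alphabetic());
}
```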

_ => false,
};
pub fn get_slice(&self, start: usize, end: usize) -> &'p str {
unsafe { str::from_utf8_unchecked(&self.source[start..end]) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The consequence is that I'm using our collective brain power, instead of compiler's brain power, to ensure that I'm not slicing outside of the boundary.

I understand that unsafe is sometimes necessary and a good design may benefit from it. I question whether it's the right choice for the first iteration of this parser. Relying on collective brain power seems to be specifically what Rust was designed to work around, because the history shows us that brains are fallible.

The boundary checks are relatively expensive and since we're a full byte parser I believe we always attempt to slice within the range.

Just looking at how get_slice is implemented in skip_unicode_escape_sequence I was able to come up with the following test case which makes the parser panic:

x = {"\\u

I tested it with the following command:

./target/debug/fluent-cli <(echo -n 'x = {"\\u')

We should at least take special care when slicing using self.ptr + 1. With the checked version of from_utf8 we'd be forced to support the out of bounds index in the code, which would prevent the panic.

If while working on 1.0 we'll discover that we have a lot of places in the parser that pick start/end wrong we can try to mitigate it here, but I'd honestly prefer to fix them in the parser.

I'd prefer to start with the safer (even if slower) approach and only consider unsafe when we need to squeeze even more performance from the parser.
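A minimal sketch of the safer variant under discussion (a hypothetical free-function version, not the PR's method): checked slicing plus checked UTF-8 decoding turn both failure modes into a `None` that the caller must handle.

```rust
// Sketch: slice::get returns None for an out-of-bounds range instead
// of panicking, and str::from_utf8 rejects invalid UTF-8 instead of
// risking undefined behavior via from_utf8_unchecked.
fn get_slice(source: &[u8], start: usize, end: usize) -> Option<&str> {
    source
        .get(start..end)
        .and_then(|bytes| std::str::from_utf8(bytes).ok())
}

fn main() {
    let source = b"x = {\"\\u";
    assert_eq!(get_slice(source, 0, 1), Some("x"));
    // An end index past the buffer no longer panics:
    assert_eq!(get_slice(source, source.len(), source.len() + 1), None);
}
```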

Contributor

@stasm stasm left a comment

OK, I think I'm mostly done with the review. It's a lot of code; thanks for the patience. I'd like to see a version of this PR with my comments addressed. The most important ones are:

  1. Comment the enums and their role in get_pattern.
  2. Add protection against out-of-bounds indexing into ps.source. (And fix the bug I found yesterday.)
  3. Sometimes the code uses unwrap in places where I imagine panics are not possible if the AST was produced by the parser. Please consider starting a conversation about what we expect from this code when that assumption is not satisfied.
  4. File issues for the things we know are follow-up material.

Thanks!

match self {
ast::InlineExpression::StringLiteral { raw } => {
// XXX: We need to decode the raw into unicode here.
Ok(FluentValue::from(*raw))
Contributor

Please file an issue about this and reference it here.

Ok(FluentValue::from(*raw))
}
ast::InlineExpression::NumberLiteral { value } => {
Ok(FluentValue::as_number(*value).unwrap())
Contributor

I assume that you're using unwrap here (and in a few more places) because the as_number conversion cannot fail if the AST node comes from the parser. Any invalid input would be caught by the parser as a syntax error. This poses an interesting question about the goals of the resolver: what should it do with AST nodes that have been tinkered with in code? I think it's totally OK to expect them not to resolve properly. However, I'm not sure the program should panic in such cases. Do you have thoughts on this?

Collaborator Author

Not really. I see this patchset as bringing the parser to its current state, and I expect us to start cleaning up the resolver only after that, so I was kind of OK with introducing half-baked unwraps like this. I'll file a follow-up for them.
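One possible shape for that follow-up, sketched with a hypothetical stand-in for FluentValue (not the crate's actual type): a node that was tinkered with in code degrades to its source text instead of panicking.

```rust
// Sketch: instead of unwrapping the number conversion, fall back to
// the raw source text so hand-crafted AST nodes resolve to something
// rather than panic.
#[derive(Debug, PartialEq)]
enum FluentValue<'s> {
    String(&'s str),
    Number(f64),
}

fn resolve_number_literal(value: &str) -> FluentValue<'_> {
    match value.parse::<f64>() {
        Ok(n) => FluentValue::Number(n),
        // A node produced by the parser can't reach this arm; a node
        // tinkered with in code resolves to its source text instead.
        Err(_) => FluentValue::String(value),
    }
}

fn main() {
    assert_eq!(resolve_number_literal("3.14"), FluentValue::Number(3.14));
    assert_eq!(
        resolve_number_literal("not-a-number"),
        FluentValue::String("not-a-number")
    );
}
```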

ref positional,
ref named,
} => {
let mut resolved_unnamed_args = Vec::new();
Contributor

Perhaps rename this to resolved_positional_args?

},
ptr
);
} else if ps.is_byte_digit(*b) || [b'_', b'-'].contains(&b) {
Contributor

Nit: you don't need &b in contains() here.
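The nit follows from `slice::contains` taking `&T`; a minimal standalone sketch:

```rust
// Sketch: iterating bytes by reference already yields &u8, which is
// exactly the &T that contains() expects, so no extra & is needed.
fn main() {
    for b in b"x_-y".iter() {
        let is_sep = [b'_', b'-'].contains(b);
        assert_eq!(is_sep, *b == b'_' || *b == b'-');
    }
}
```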

#[derive(Debug)]
enum PatternElementPointers<'a> {
Placeable(ast::Expression<'a>),
TextElement(usize, usize, usize, TextElementPosition),
Contributor

Nit: maybe call it PatternElementPlaceholders?

}

#[derive(Debug, PartialEq)]
enum TextElementTermination {
Contributor

All the enums here need docs, which explain the why's of this code. I suggested how to improve the docs of CarriageReturn in a comment below.

Contributor

Can TextElementTermination be combined with TextElementPosition?

Collaborator Author

I'm not sure, and I'd prefer not to dig into rethinking of get_pattern now. Is it ok not to block on this?

Contributor

Sure.

#[derive(Debug, PartialEq)]
enum TextElementType {
Blank,
NonBlank,
Contributor

I'm not entirely sure why Blank vs. NonBlank are needed. Please add a comment.

if let Some(last_non_blank) = last_non_blank {
let elements = elements
.into_iter()
.take(last_non_blank + 1)
Contributor

Why +1?

Collaborator Author

Because of how take works. If last_non_blank is 0, we need to take 1.

Contributor

Yeah, but why? :) What does last_non_blank keep track of?

Collaborator Author

It keeps track of which element is the last non-blank one. Let's imagine this scenario:

key = This is
  multiline value

  and it still works


key2 = Value 2

For key it produces the following TextElements:

("This is\n", "   multiline value\n", "\n", "  and it still works\n", "\n", "\n")

So, we want to take only elements 0..=3. Since we already filter out blank lines, it may be better to remove that variable and rely on that filtering alone. I'll look into this after landing this :)

Collaborator Author

#81
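The trimming logic described in this thread can be sketched as a standalone example (not the parser's actual code): last_non_blank is an *index*, while `Iterator::take` wants a *count*, hence the `+ 1`.

```rust
// Sketch: drop trailing blank text elements from a pattern by taking
// everything up to and including the last non-blank element.
fn main() {
    let elements = vec![
        "This is\n",
        "   multiline value\n",
        "\n",
        "  and it still works\n",
        "\n",
        "\n",
    ];
    // rposition finds the index of the last non-blank element.
    let last_non_blank = elements
        .iter()
        .rposition(|e| !e.trim().is_empty())
        .unwrap();
    assert_eq!(last_non_blank, 3);
    // take() wants a count, so index 3 means take(4): elements 0..=3.
    let trimmed: Vec<&str> = elements.into_iter().take(last_non_blank + 1).collect();
    assert_eq!(trimmed.len(), 4);
    assert_eq!(trimmed.last(), Some(&"  and it still works\n"));
}
```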

}
}

fn get_literal<'p>(ps: &mut ParserStream<'p>) -> Result<ast::InlineExpression<'p>> {
Contributor

There are only two literals in Fluent spec: StringLiteral and NumberLiteral. In fluent.js we called this function getSimpleExpression.

@@ -0,0 +1,11 @@
FTL_FIXTURES := $(wildcard *.ftl)
Contributor

Remove this file please.

Zibi Braniecki added 2 commits January 23, 2019 13:37
@zbraniecki
Copy link
Collaborator Author

Applied all feedback.

OK, I think I'm mostly done with the review. It's a lot of code; thanks for the patience. I'd like to see a version of this PR with my comments addressed. The most important ones are:

1. Comment the enums and their role in `get_pattern`.

Added.

2. Add protection against out-of-bounds indexing into `ps.source`. (And fix the bug I found yesterday.)

Removed the unsafe code for now, fixed the bug and verified that no other place uses get_slice with pointers that are not already checked to be in range.

3. Sometimes the code uses `unwrap` in places where I imagine panics are not possible if the AST was produced by the parser. Please consider starting a conversation about what we expect from this code when that assumption is not satisfied.

#80

4. File issues for the things we know are follow-up material.

#79

@zbraniecki zbraniecki requested a review from stasm January 23, 2019 22:39
Contributor

@stasm stasm left a comment

Thanks for applying the feedback, @zbraniecki! Nice work on this refactor, let's ship it.

match res {
Ok(res) => print_entries_resource(&res),
Err((res, errors)) => {
print_entries_resource(&res);

if matches.is_present("silence") {
Contributor

Does this need to be called silent now?

@@ -186,7 +189,7 @@ impl<'p> ParserStream<'p> {
}

pub fn get_slice(&self, start: usize, end: usize) -> &'p str {
unsafe { str::from_utf8_unchecked(&self.source[start..end]) }
str::from_utf8(&self.source[start..end]).expect("Slicing the source failed")
Contributor

Thanks for removing the unsafe block here, at least for now. Would you mind filing a follow-up for discussing re-adding it, please? I'd like to look into it a bit more, given that we're likely protected from invalid UTF-8 input by the sheer fact of accepting a String into FluentResource::try_new.

Collaborator Author

#82
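For reference, a sketch of the invariant being discussed: once the input arrives as a String/&str, its bytes are valid UTF-8 by construction, so a checked from_utf8 on a sub-slice can only fail by cutting through a multi-byte character.

```rust
// Sketch: boundary-aligned slices of valid UTF-8 always round-trip;
// the failure mode the checked version catches is a slice that splits
// a multi-byte character.
fn main() {
    let source = String::from("key = värde");
    let bytes = source.as_bytes();
    // A boundary-aligned slice of valid UTF-8 decodes fine:
    assert_eq!(std::str::from_utf8(&bytes[0..3]).unwrap(), "key");
    // Cutting through the middle of the two-byte 'ä' is rejected
    // instead of producing an invalid &str:
    assert!(std::str::from_utf8(&bytes[7..8]).is_err());
}
```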

//
// CRLF variant is specific because we want
// to skip it in text elements production.
// For example `a\r\nb` will produce
Contributor

There's a missing space before b in the example here. Also (uber-nit), what's up with the narrow line width?

// proper state for the next line.
//
// CRLF variant is specific because we want
// to skip it in text elements production.
Contributor

…we want to skip CR but keep the LF….
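The CRLF behavior described in this comment can be illustrated with a minimal standalone sketch (not the parser's actual implementation):

```rust
// Sketch: in text-element production the CR of each CRLF pair is
// dropped while the LF is kept.
fn normalize_crlf(input: &str) -> String {
    input.replace("\r\n", "\n")
}

fn main() {
    assert_eq!(normalize_crlf("a\r\nb"), "a\nb");
    // A lone CR is not part of a CRLF pair and is left untouched:
    assert_eq!(normalize_crlf("a\rb"), "a\rb");
}
```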

// is blank or not.
// This is important to identify text elements
// which should not be taken into account
// when calculating common indent.
Contributor

Ah, nice, I understand the purpose of those enums better now, thanks!

@zbraniecki zbraniecki merged commit 12cc561 into projectfluent:master Jan 24, 2019
@zbraniecki zbraniecki deleted the copy-free-resolver branch January 24, 2019 20:16