Refactor of the code, 0.8 compatibility, zero-copy AST/parsing, Rust 2018 goodies #76
Conversation
- Add a simple binary for testing
- Support menubar.ftl except of expressions
- Support multiline patterns
- Add EOF tests support
- Support basic JSON fixtures
- Support first set of comparison tests
- Support most of the comments
- Use serializer instead of deserializer
- Uncomment some fixture tests with slight modifications
- Support Junk
- Use ps.take_if in more places
- Support all reference fixtures
- Fix benches
- Support TermReference
- Support variants and select expression
- Update to rust 2018 preview 2
- Use 2018 use extern macros
- Clean up whitespace collecting
- Remove try
- Update bench to Rust 2018
- Support parsing of 0.7 reference tests
- Support CallExpressions
Force-pushed from 9d66bfa to 7996f12
My initial thinking is to split this PR into four PRs, each updating a separate crate, and get them reviewed separately, releasing them in order.
This also resolves #16
This is a massive rewrite of the whole crate. It separates out three crates: fluent-syntax becomes a zero-copy AST + parser (+ serializer in the future) compatible with Fluent 0.8; fluent-bundle becomes aligned with fluent-bundle in JS; fluent becomes a top-level crate with an added simple ResourceManager. All of the code is compatible with current stable Rust 2018.
Force-pushed from 7996f12 to 77c2b38
What is the benefit of going for 4 separate packages? In my experience, too much granularity makes publishing new versions tedious, and it also means that there are more permutations of packages installed in the environment at the same time, which makes it easy to run into errors due to mismatched dependencies. I can see the benefit of publishing
Each package has a different set of requirements and "adds up". Does it make sense?
I think Rust/Cargo is really good with that, and the ecosystem actively encourages dividing into many small crates, which allows others to better pick which parts they need (see the family of
It would mean that any tool would have to compile the Bundle/Resolver/ResourceManager parts of Fluent even if all they need is the FTL parser. I think that this use case is distinct enough to warrant splitting it out (much like librustc uses
As to if
…ager via Cow" This reverts commit 6e02272.
Hooray! This branch now compiles and all tests pass! The performance is good, but can be better with time. Initially, with no optimization work put into the parser, I'm seeing: master:
zero-copy-resolver:
The resolver benchmarks are preserved. I'd like to land this update soon now, to bring the Rust implementation to Fluent 0.8 syntax and move on to work on documentation, errors, ergonomics and higher-level bits (like ResourceManager).
I asked @Manishearth to review the examples and crate split, and he said that it seems like a good natural split of meta-crates. I'd like to preserve the split between
@Xaeroxe - can you verify that the
Thank you!
Added docs for FluentBundle. Also, corrected some errors in the pre-existing docs.
pub struct FluentResource(rentals::FluentResource);

impl FluentResource {
    pub fn try_new(source: String) -> Result<Self, (Self, Vec<ParserError>)> {
Can/should this take `Into<String>` for better caller ergonomics?
We considered that with @Manishearth and he recommended not doing that in order to avoid hiding the allocation.
While ergonomically that would be nice, it would also make it harder for the reader to understand that we will allocate the String and store it inside of the FluentResource.
In the future, when a static `&str` becomes implicitly convertible to `String` [0], it should be easier.
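The trade-off discussed above can be sketched like this (a simplified illustration, not the crate's actual code: `try_new_generic` is a hypothetical name and the error type is reduced to `String`):

```rust
// Simplified sketch of the ergonomics trade-off; `try_new_generic` is
// hypothetical and the real error type is richer than `String`.
pub struct FluentResource(String);

impl FluentResource {
    // Current design: callers pass an owned String, so the allocation
    // (and the fact that FluentResource stores it) is visible at the call site.
    pub fn try_new(source: String) -> Result<Self, String> {
        Ok(FluentResource(source))
    }

    // Alternative: `Into<String>` accepts `&str` too, but hides the
    // allocation performed by `.into()` from the caller.
    pub fn try_new_generic<S: Into<String>>(source: S) -> Result<Self, String> {
        Ok(FluentResource(source.into()))
    }
}
```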
Note: a thing that could be done here is to make the return value tuple into a proper public struct so it's easier to document
// XXX: In the future we'll want to get the partial
// value out of resolver and return it here.
// We also expect to get a Vec of errors out of resolver.
return Some((path.to_string(), errors));
Can `format_message` be removed as part of this change? It looks like `format` does everything a caller might want. If not, I'll add it to my doc PR.
It doesn't do everything. `format` allows you to get the message's value or a single attribute, but it doesn't allow you to retrieve value + attributes together. `format_message` will be useful for higher-level crates that will apply a localization message onto a widget/element.
`format_message` will be called `compound` in fluent.js. Would you like to make this change in fluent-rs in this PR or in a follow-up?
Yes, it makes sense, thanks. My question was driven by pragmatism and the following question: do we have use-cases for so much granularity? Sometimes it's better to design hierarchies based on the observed use-cases rather than on how things derive from each other logically. However, since you already did all the work to separate them, it's probably best to keep them that way.
This isn't as much about cargo as it is about the rigor to always update the readmes, the changelogs, the dependencies of higher-level packages etc. Having more packages to juggle also means that it's easier to create permutations of versions which we haven't thought about.
I'll need a few more days to finish the review. Thanks for the patience. I have a few comments and questions which perhaps you could answer in the meantime.
It would indeed be easier to review this as a series of logically separate patches, as you suggested in an earlier comment. At the same time I realize that doing so would break some (or a lot of) tests, and fixing them in between the patches would be a lot of unnecessary work. In the future, I think I'd be OK with disabling tests aggressively until all parts of the refactor come together. For this PR, let's leave it as-is. (You might still want to clean up the commit history a bit, unless you're planning to squash all of it into a single commit).
I would also (strongly) prefer to land the `ResourceManager` in a separate PR altogether. Let's make this PR about `fluent-syntax`, `fluent-bundle` and `fluent-cli`.
I'll try to finish the review by the end of this week.
fluent-bundle/src/lib.rs
Outdated
pub mod errors;
pub mod resolve;
pub mod resource;
pub mod types;
Could we use `pub use` here to re-export the most common symbols, similar to how we do it in fluent.js? For consumers, this would simplify their `use` imports:
// BEFORE
use fluent_bundle::bundle::FluentBundle;
use fluent_bundle::types::FluentValue;
use fluent_bundle::resource::FluentResource;
// AFTER
use fluent_bundle::{FluentBundle, FluentResource, FluentValue};
I wanted to wait with such convenience parts until the code stabilizes a little bit. In my mind it's easier to add it than to remove it.
Do you feel like those three structs are certain to be the ones we want to expose on the main crate?
These are the three that we've seen used by the consumers of `fluent` in JS. I think it's OK to do it in a follow-up, too. Up to you.
    _ => false,
};
pub fn get_slice(&self, start: usize, end: usize) -> &'p str {
    unsafe { str::from_utf8_unchecked(&self.source[start..end]) }
What are the consequences of using `unsafe` and `str::from_utf8_unchecked` here? Does the code check somewhere that the input is valid UTF-8? Perhaps we should allow the users of the library to opt into the use of the fast & unsafe code (when they are sure all input is valid UTF-8), but the code should default to using `str::from_utf8`?
The consequence is that I'm using our collective brain power, instead of compiler's brain power, to ensure that I'm not slicing outside of the boundary.
The boundary checks are relatively expensive and since we're a full byte parser I believe we always attempt to slice within the range.
So, it's not about whether the input is valid UTF8, it's only about whether the parser always selects the right start/end combo that's within the source slice.
I'm going to use a fuzzer to further verify that claim, but I'd prefer not to bloat the lib with options that don't really mean much to the user. Currently if you pick the wrong start/end, the parser crashes, but if we instead use `from_utf8` we'll likely use it with `expect`, which will panic as well :)
If while working on 1.0 we'll discover that we have a lot of places in the parser that pick start/end wrong we can try to mitigate it here, but I'd honestly prefer to fix them in the parser.
Please note that unsafe code is not "bad" by design. It makes total sense to use it in many cases, and people like Henri are actively asking for improvements in how we calculate boundary-check requirements to minimize when we need them: https://hsivonen.fi/rust2019/ .
Our problem is that we have code like:
let source = "id = Foo".as_bytes();
let mut ptr = 0;
while let Some(b) = source.get(ptr) {
    if *b == b'i' {
        if let Some(b2) = source.get(ptr + 1) {
            if *b2 == b'd' {
                // Both bytes were just checked above, so this slice is in bounds.
                let id = unsafe { str::from_utf8_unchecked(&source[ptr..=ptr + 1]) };
            }
        }
    }
    ptr += 1;
}
(sorry for how stupid the code is). It would be insanely hard for the compiler to deduce that it can use bounds-check elision, but we can deduce it quite easily: we just checked what is at `ptr`, and what is at `ptr + 1`.
Why not store the string as `&str` and use `as_bytes()` in your processing code everywhere, so that you can safely subslice it with byte indices?
The consequence is that I'm using our collective brain power, instead of compiler's brain power, to ensure that I'm not slicing outside of the boundary.
I understand that `unsafe` is sometimes necessary and a good design may benefit from it. I question whether it's the right choice for the first iteration of this parser. Relying on collective brain power seems to be specifically what Rust was designed to work around, because history shows us that brains are fallible.
The boundary checks are relatively expensive and since we're a full byte parser I believe we always attempt to slice within the range.
Just looking at how `get_slice` is implemented in `skip_unicode_escape_sequence`, I was able to come up with the following test case which makes the parser panic:
x = {"\\u
I tested it with the following command:
./target/debug/fluent-cli <(echo -n 'x = {"\\u')
We should at least take special care when slicing using `self.ptr + 1`. With the checked version of `from_utf8` we'd be forced to handle the out-of-bounds index in the code, which would prevent the panic.
If while working on 1.0 we'll discover that we have a lot of places in the parser that pick start/end wrong we can try to mitigate it here, but I'd honestly prefer to fix them in the parser.
I'd prefer to start with the safer (even if slower) approach and only consider `unsafe` when we need to squeeze even more performance from the parser.
Digging a bit deeper into this, the panic seems to originate from `&self.source[start..end]` rather than the unsafe call to `from_utf8_unchecked`. Is there a checked way to index into a slice?
OK, so I think there are two things going on here. Can you please confirm that my understanding is correct?
1. When creating the slice of bytes with `&self.source[start..end]`, `end` might go out of bounds. We should check that it isn't, or maybe use an iterator and the checked `slice::get` method. The `x = {"\\u` testcase exposes this bug.
2. This new slice is then passed to `from_utf8_unchecked`, which is an unsafe method. If the slice of bytes contains a sequence of bytes which is invalid in UTF-8, this method will panic. Realistically this is unlikely to happen with files which are read from disk and validated as UTF-8, but might still be an issue when the data is not guaranteed to be UTF-8.
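A minimal sketch of the checked alternative being proposed (not the crate's actual API): `slice::get` turns an out-of-range `end` into `None`, and the safe `from_utf8` guards the UTF-8 side, so neither failure mode can panic.

```rust
use std::str;

// Hypothetical bounds-checked counterpart to `get_slice`: returns None
// instead of panicking when `start..end` is out of range or the bytes
// are not valid UTF-8.
fn get_slice_checked(source: &[u8], start: usize, end: usize) -> Option<&str> {
    let bytes = source.get(start..end)?;
    str::from_utf8(bytes).ok()
}
```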
(…) might still be an issue when the data is not guaranteed to be UTF8.
Is this even possible given that `FluentResource::try_new` takes a `String`?
The x = {"\u testcase exposes this bug.
I fixed it and reviewed other places where this may happen. Found no other so I think we're good.
This new slice is then passed to from_utf8_unchecked which is an unsafe method. If the slice of bytes contains a sequence of bytes which is invalid in UTF8, this method will panic
No, we don't care if the file is UTF-8 or if it contains invalid UTF-8 tokens, because the boundaries we set are validated. In other words, since Fluent tokens end on valid UTF-8 chars like "\n", "{" etc., we can establish a range and know that we're slicing at the right place. What's inside doesn't really matter to us. So, if the entry looks like this:
key = This is a XXX value { $foo }
then we slice out three times:
1. `key`, which has a `start` and `end` known to be a valid alpha character, and every character inside as well.
2. `This is a XXX value`, which has `start` as the first non-space char, and `end` as one-before the `{` char, so also no risk of cutting in the middle of a multi-byte char.
3. `foo` - similar to (1).
Even if in (2) the XXX is invalid UTF8, we don't care about it. It still slices properly because of what is a boundary in Fluent.
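That boundary argument can be demonstrated with a small sketch (a hypothetical helper, not parser code): both ends of the slice sit on ASCII delimiter bytes, and continuation bytes of a multi-byte UTF-8 character always have the high bit set, so they can never match an ASCII delimiter and the cut always lands on a char boundary.

```rust
use std::str;

// Hypothetical demo: slice out the value part of `key = … { $foo }` up to
// the ASCII b'{' delimiter. Even with multi-byte chars (like é) inside
// the value, the slice boundary is safe because b'{' can never occur
// inside a multi-byte UTF-8 sequence.
fn value_before_placeable(source: &str) -> Option<&str> {
    let bytes = source.as_bytes();
    let brace = bytes.iter().position(|&b| b == b'{')?;
    str::from_utf8(&bytes[..brace]).ok()
}
```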
@@ -1,14 +1,842 @@
//! AST, parser and serializer operations |
Dir layout: is there a tangible benefit to keeping the parser code in its own directory, and then the errors in `parser/errors/mod.rs`? For my own code, I try to avoid nested directory hierarchies. As a rule of thumb, a directory with one or two files in it is probably not worth it :) I'd recommend placing all the files in `fluent-syntax/src/`.
I think there's a value in keeping the parser in its own directory. This way I can easily tell which parts (ftlstream/errors) are parser specific, while ast stays crate specific (and in the future I expect serializer to get its own dir). I removed the errors dir.
I still think that this is unnecessary proliferation of directories, but it's also a detail, so not blocking at all.
I prefer to keep it in its own directory. I'm experimenting with adding `lexer.rs` there as well, and I like to keep parser-specific code in it away from ast/serializer.
Done!
I do plan to squash. I'm keeping @Manishearth's suggestion open, and I believe the rest of the feedback has been applied. I'll try to investigate his suggestion, but I believe this shouldn't block the rest of the review and can be done in a separate PR prior to release.
I went through the `parser/mod.rs` this time, and I have to admit, I didn't understand all of it.
Some questions below are somewhat rust-related questions, some are just code questions.
I noticed that escapes in literals don't work yet. Also, I noticed two panics in the examples.
fluent-syntax/src/parser/mod.rs
Outdated
    Ok(pattern.map(ast::Value::Pattern))
}

fn get_attributes<'p>(ps: &mut ParserStream<'p>) -> Result<Vec<ast::Attribute<'p>>> {
Should we make explicit that this method is effectively infallible?
For now, yes. I'd like to consider improving the recovery here eventually by recognizing that the error originated in the attribute, jumping to the next line that either is a start of a new entry or starts with `"."`, and if the latter, resume parsing attributes for that entry.
This would allow us to recover from:
key = Value
.attr = Value { $4 }
.attr2 = Value 2
by salvaging the `attr2` into the `key` message. But that's beyond the scope of this patch, so for now, I fixed it :)
fluent-syntax/src/parser/mod.rs
Outdated
while let Some(b) = ps.source.get(ptr) {
    if ps.is_byte_alphabetic(*b) {
        ptr += 1;
    } else if ptr == ps.ptr {
I find the `start_of_id` logic in the middle of the loop a tad hard to grasp. Is this actually simpler than doing an explicit first char?
Yes. Since `get_identifier` is such a hot loop, I tested tons of variants and this is the fastest one. I'm going to switch to a more readable one for the sake of landing it sooner, and we can reevaluate perf optimizations later.
        TextElementTermination::LineFeed,
    ));
}
b'\r' if ps.is_byte_at(b'\n', ps.ptr + 1) => {
Can this go out-of-bounds for a stream ending in `\r`? No idea what happens in Rust if that happens.
I think this is OK. `ParserStream::is_byte_at` uses the checked `slice::get` method, which returns `None` whenever the unchecked one would panic.
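For illustration, `is_byte_at` can stay panic-free along these lines (a sketch, not necessarily the crate's exact implementation):

```rust
// Sketch: `slice::get` returns None past the end of the buffer, so a
// stream ending in b'\r' simply fails the b'\n' lookahead instead of
// panicking on an out-of-bounds index.
fn is_byte_at(source: &[u8], b: u8, pos: usize) -> bool {
    source.get(pos) == Some(&b)
}
```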
fluent-syntax/src/parser/mod.rs
Outdated
};

if !is_valid {
    // XXX: Generalize error type
Or the opposite, make the error type more specific?
        named,
    }
} else {
    return error!(ErrorKind::ForbiddenCallee, ps.ptr);
Does Rust do early returns? I.e., could we just return an error on invalid input, and not indent the valid part?
fluent-cli/src/main.rs
Outdated
.version("1.0")
.about("Parses FTL file into an AST")
.args_from_usage(
    "-s, --silence 'disable output'
Nit: call it `--silent` and use sentence case for the help text, `Disables the output`, to make it consistent with the rest of the usage description.
Also, what is the goal of this option? It looks like it disables all output. I would expect it would silence the error reporting but still produce the AST.
I think I copied it from the JS impl, and the goal was to be able to `time` the parser without timing the formatting and printing of the output. I changed it to only disable errors for now.
fluent-cli/src/main.rs
Outdated
.about("Parses FTL file into an AST")
.args_from_usage(
    "-s, --silence 'disable output'
    <INPUT> 'Sets the input file to use'",
Nit: call it `<FILE>` and change the help text to `Fluent file to parse`.
if self.ch == Some(ch) {
    self.next();
    return true;
pub fn take_if(&mut self, b: u8) -> bool {
Nit: perhaps rename this to `take_byte` or `consume_byte` for symmetry with `expect_byte`?
`expect_*` is a fallible method that returns an error. I'll rename it to `take_byte_if`.
Another round of comments. I'll finish the review today; I'd like to take another look at the `ResolveValue` trait.
I believe this shouldn't block the rest of review and can be done in a separate PR prior to release.
I don't mean to block this PR, but I really think there's something not right about the use of `unsafe` in `ParserStream`. Would you agree to use the checked variant of `from_utf8` in this PR to land it without controversies, and then discuss switching to `unsafe` in a follow-up?
    }
}
pub fn is_byte_digit(&self, b: u8) -> bool {
    b >= b'0' && b <= b'9'
Use `u8::is_ascii_digit` here?
    return Ok('-');
}
pub fn is_byte_alphabetic(&self, b: u8) -> bool {
    (b >= b'a' && b <= b'z') || (b >= b'A' && b <= b'Z')
Use `u8::is_ascii_alphabetic` here?
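The suggested std methods are behaviorally equivalent to the manual range checks, which can be verified exhaustively over all byte values:

```rust
// Manual range checks from the diff above, reproduced as free functions so
// they can be compared against the std ASCII helpers.
fn is_byte_digit_manual(b: u8) -> bool {
    b >= b'0' && b <= b'9'
}

fn is_byte_alphabetic_manual(b: u8) -> bool {
    (b >= b'a' && b <= b'z') || (b >= b'A' && b <= b'Z')
}
```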
OK, I think I'm mostly done with the review. It's a lot of code; thanks for the patience. I'd like to see a version of this PR with my comments addressed. The most important ones are:
- Comment the enums and their role in `get_pattern`.
- Add protection against out-of-bounds indexing into `ps.source`. (And fix the bug I found yesterday.)
- Sometimes the code uses `unwrap` in places where I imagine panics are not possible if the AST was parsed by the parser. Please consider starting a conversation about the expectations towards this code when this assumption is not satisfied.
- File issues for the things we know are follow-up material.
Thanks!
match self {
    ast::InlineExpression::StringLiteral { raw } => {
        // XXX: We need to decode the raw into unicode here.
        Ok(FluentValue::from(*raw))
Please file an issue about this and reference it here.
        Ok(FluentValue::from(*raw))
    }
    ast::InlineExpression::NumberLiteral { value } => {
        Ok(FluentValue::as_number(*value).unwrap())
I assume that you're using `unwrap` here (and in a few more places) because the `as_number` conversion cannot fail if the AST node comes from the parser. Any invalid input would be caught by the parser as a syntax error. This poses an interesting question about the goals of the resolver: what should it do about AST nodes which have been tinkered with in code? I think it's totally OK to expect them to not resolve properly. However, I'm not sure if the program should panic in such cases. Do you have thoughts on this?
Not really. I must say that I see this patchset as bringing the parser to current state, and I expect us to only after that start cleaning up the resolver, so I was kind of ok with introducing half-baked unwraps like this. I'll file a follow-up for them.
fluent-bundle/src/resolve.rs
Outdated
    ref positional,
    ref named,
} => {
    let mut resolved_unnamed_args = Vec::new();
Perhaps rename this to `resolved_positional_args`?
fluent-syntax/src/parser/mod.rs
Outdated
    },
    ptr
);
} else if ps.is_byte_digit(*b) || [b'_', b'-'].contains(&b) {
Nit: you don't need `&b` in `contains()` here.
fluent-syntax/src/parser/mod.rs
Outdated
#[derive(Debug)]
enum PatternElementPointers<'a> {
    Placeable(ast::Expression<'a>),
    TextElement(usize, usize, usize, TextElementPosition),
Nit: maybe call them `PatternElementPlaceholders`?
}

#[derive(Debug, PartialEq)]
enum TextElementTermination {
All the enums here need docs which explain the why's of this code. I suggested how to improve the docs of `CarriageReturn` in a comment below.
Can `TextElementTermination` be combined with `TextElementPosition`?
I'm not sure, and I'd prefer not to dig into rethinking `get_pattern` now. Is it OK not to block on this?
Sure.
#[derive(Debug, PartialEq)]
enum TextElementType {
    Blank,
    NonBlank,
I'm not entirely sure why Blank vs. NonBlank are needed. Please add a comment.
if let Some(last_non_blank) = last_non_blank {
    let elements = elements
        .into_iter()
        .take(last_non_blank + 1)
Why +1?
Because of how take works. If last_non_blank is 0, we need to take 1.
Yeah, but why? :) What does last_non_blank keep track of?
It keeps track of which of the elements is the last non-blank one. Let's imagine this scenario:

key = This is
 multiline value

 and it still works


key2 = Value 2

For key it produces the following TextElements:

("This is\n", " multiline value\n", "\n", " and it still works\n", "\n", "\n")

So, we want to take only the lines 0..3. It may be that, since we filter out blank lines anyway, it would be better to remove that variable and just filter them out. I'll look into this after landing this :)
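To make the take(last_non_blank + 1) arithmetic concrete, here is a small, hypothetical sketch of the trimming step (the real get_pattern tracks byte offsets rather than owned slices, so the function name and shape here are illustrative only):

```rust
// Hypothetical simplification of the trailing-blank trimming discussed
// above: remember the index of the last non-blank element, then keep
// elements 0..=last_non_blank. take(n) yields the first n items, so an
// index of 0 requires take(1) — hence the `+ 1`.
fn trim_trailing_blanks<'a>(elements: Vec<&'a str>) -> Vec<&'a str> {
    let mut last_non_blank = None;
    for (i, element) in elements.iter().enumerate() {
        if !element.trim().is_empty() {
            last_non_blank = Some(i);
        }
    }
    match last_non_blank {
        Some(idx) => elements.into_iter().take(idx + 1).collect(),
        None => Vec::new(),
    }
}

fn main() {
    let elements = vec![
        "This is\n",
        " multiline value\n",
        "\n",
        " and it still works\n",
        "\n",
        "\n",
    ];
    // Only elements 0..=3 survive; the trailing blank lines are dropped.
    assert_eq!(trim_trailing_blanks(elements).len(), 4);
}
```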
fluent-syntax/src/parser/mod.rs
Outdated
}
}

fn get_literal<'p>(ps: &mut ParserStream<'p>) -> Result<ast::InlineExpression<'p>> {
There are only two literals in the Fluent spec: StringLiteral and NumberLiteral. In fluent.js we called this function getSimpleExpression.
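A stripped-down illustration of that two-way dispatch (the types and signature here are hypothetical; the real get_literal operates on a ParserStream and returns ast::InlineExpression):

```rust
// Per the Fluent 0.8 spec, a literal is either a StringLiteral ("...")
// or a NumberLiteral. This hypothetical helper dispatches on the first
// byte, mirroring the shape of get_literal.
#[derive(Debug, PartialEq)]
enum Literal<'a> {
    String(&'a str),
    Number(&'a str),
}

fn parse_literal(input: &str) -> Option<Literal<'_>> {
    match input.as_bytes().first()? {
        b'"' => {
            // StringLiteral: the text between the quotes.
            let end = input[1..].find('"')?;
            Some(Literal::String(&input[1..=end]))
        }
        b'0'..=b'9' | b'-' => Some(Literal::Number(input)),
        _ => None,
    }
}

fn main() {
    assert_eq!(parse_literal("\"abc\""), Some(Literal::String("abc")));
    assert_eq!(parse_literal("-3.14"), Some(Literal::Number("-3.14")));
    assert_eq!(parse_literal("msg"), None);
}
```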
@@ -0,0 +1,11 @@
FTL_FIXTURES := $(wildcard *.ftl)
Remove this file please.
…emented potentially beyond the slice range.
Applied all feedback.
Added.
Removed the unsafe code for now, fixed the bug and verified that no other place uses
Thanks for applying the feedback, @zbraniecki! Nice work on this refactor, let's ship it.
fluent-cli/src/main.rs
Outdated
match res {
    Ok(res) => print_entries_resource(&res),
    Err((res, errors)) => {
        print_entries_resource(&res);

        if matches.is_present("silence") {
Does this need to be called silent now?
@@ -186,7 +189,7 @@ impl<'p> ParserStream<'p> {
}

pub fn get_slice(&self, start: usize, end: usize) -> &'p str {
-    unsafe { str::from_utf8_unchecked(&self.source[start..end]) }
+    str::from_utf8(&self.source[start..end]).expect("Slicing the source failed")
Thanks for removing the unsafe block here, at least for now. Would you mind filing a follow-up for discussing re-adding it, please? I'd like to look into it a bit more, given that we're likely protected from invalid UTF-8 input by the sheer fact of accepting a String into FluentResource::try_new.
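One subtlety worth noting in that follow-up: even over a source that is valid UTF-8 as a whole, from_utf8 can still fail if start..end lands in the middle of a multi-byte codepoint — which is exactly the offset invariant an audit would need to establish before reintroducing from_utf8_unchecked. A quick demonstration:

```rust
use std::str;

fn main() {
    // 'ï' encodes as two bytes in UTF-8, so this source is 6 bytes long.
    let source = "naïve".as_bytes();

    // Slicing on codepoint boundaries validates fine:
    assert!(str::from_utf8(&source[0..2]).is_ok());

    // But a byte range that ends mid-codepoint fails validation, even
    // though the full source is valid UTF-8:
    assert!(str::from_utf8(&source[0..3]).is_err());
}
```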
fluent-syntax/src/parser/mod.rs
Outdated
//
// CRLF variant is specific because we want
// to skip it in text elements production.
// For example `a\r\nb` will produce
There's a missing space before b in the example here. Also (uber-nit), what's up with the narrow line width?
fluent-syntax/src/parser/mod.rs
Outdated
// proper state for the next line.
//
// CRLF variant is specific because we want
// to skip it in text elements production.
…we want to skip CR but keep the LF….
// is blank or not.
// This is important to identify text elements
// which should not be taken into account
// when calculating common indent.
Ah, nice, I understand the purpose of those enums better now, thanks!
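The role of that blank-vs-non-blank marking in the dedent step can be illustrated with a small sketch (the function name and shape are hypothetical, not the actual parser code):

```rust
// Hypothetical sketch of the common-indent computation: blank lines
// carry no meaningful indentation, so including them would wrongly
// shrink the dedent applied to the real content. This is why the
// parser marks text elements as Blank vs. NonBlank.
fn common_indent(lines: &[&str]) -> usize {
    lines
        .iter()
        .filter(|line| !line.trim().is_empty()) // skip blank elements
        .map(|line| line.len() - line.trim_start().len())
        .min()
        .unwrap_or(0)
}

fn main() {
    let lines = ["    multiline value", "", "  and it still works"];
    // The empty line is ignored; the common indent is 2, not 0.
    assert_eq!(common_indent(&lines), 2);
}
```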
This is a major rewrite, to the point where I'm not even sure how exactly to plan reviewing and landing it, but I think it is getting close to being done.

There's one open question about FluentBundle::add_messages which I posted in projectfluent/fluent.js#328. This, unfortunately, breaks a lot of examples/tests in fluent-bundle, as before they freely passed &str and constructed FluentResource on the fly.

We'll need to answer that question, then clean up the code, make sure that such a zero-copy abstraction works well for Amethyst, and begin reviewing it to land.

I believe that the proposed structure and memory allocation model should bring us close to the model we'll use for fluent-rs 1.0.
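The zero-copy idea at the heart of this can be sketched in a few lines (names here are illustrative, not the actual fluent-rs API): the AST stores &str slices borrowed from the source buffer instead of allocating owned Strings, which is why ownership of the source matters.

```rust
// Illustrative sketch of the zero-copy model: the AST borrows slices
// from the source buffer rather than copying text into owned Strings.
#[derive(Debug, PartialEq)]
struct Message<'a> {
    id: &'a str,
    value: &'a str,
}

// Parse a single `key = value` line without copying any text.
fn parse_message(source: &str) -> Option<Message<'_>> {
    let eq = source.find('=')?;
    Some(Message {
        id: source[..eq].trim(),
        value: source[eq + 1..].trim(),
    })
}

fn main() {
    let source = String::from("hello = Hello, world!");
    let msg = parse_message(&source).unwrap();
    assert_eq!(msg.id, "hello");
    assert_eq!(msg.value, "Hello, world!");
    // The borrowed AST cannot outlive `source` — which is why a resource
    // type would take ownership of the String and hand out borrowed views.
}
```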