-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use arena for contexts #182
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some inline comments for the things that are unclear :).
Another thing is that the can_parse_issue120
test fails after these changes. If you look at what the test does, you can see that it clones a SyntaxDefinition
and then puts it into a new SyntaxSet
. This works with the current syntect because of ContextPtr
, but is not something that should be allowed.
Note that even though the code calls link_syntaxes
, it can't fix the reference because at that point it's just a ContextId
.
To prevent that, I think we have two options:
- Make the
SyntaxDefinition
API type not clonable. The content still needs to be clonable somehow because we need that to make a wholeSyntaxSet
clonable. - Allow cloning and make it possible to relink
I think option 2 is nicer, and we could do it like this: Instead of having a ContextReference::Direct
variant, change the structure to this:
enum ContextReferenceThing {
Unlinked(ContextReference),
Linked(ContextReference, ContextId),
})
That way we could relink because we still have the original references.
To ensure that the callers never forget to link, we could remove the add_syntax
from SyntaxSet
and instead make it go via a builder, and make build
do the linking. And then, the serialized syntax set could already be linked too :). What do you think?
src/parsing/parser.rs
Outdated
/// Create a state from a syntax, keeps its own reference counted | ||
/// pointer to the main context of the syntax. | ||
pub fn new(syntax: &SyntaxDefinition) -> ParseState { | ||
pub fn new(syntax_set: &'a SyntaxSet, syntax: &'a SyntaxDefinition) -> ParseState<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that there's nothing here preventing the caller to pass in a SyntaxDefinition
that doesn't actually come from the passed in SyntaxSet
. I'm not sure if that's a big problem, but yeah.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way would be to return some opaque SyntaxRef
type that contains a ref to the SyntaxSet
and the starting context from SyntaxSet
. This can be done while satisfying my desire to have ParseState
not store the SyntaxSet
ref by using a PhantomData
type. Although it can't be reconciled with my not having a lifetime parameter in ParseState, although again I'm not sure if that's necessary or not.
I'm also not sure if this would be very clean API-wise. Right now the methods that let you look up syntax definitions in a set do double duty of letting you figure out info about the syntax like its name (for display in a text editor or whatever) as well as letting you highlight with it.
It may not be worth trying to make this strict, but if we can come up with a nice way that doesn't cause borrow checker hell we should do it.
src/parsing/syntax_definition.rs
Outdated
pub regex: Option<Regex>, | ||
// #[serde(skip_serializing, skip_deserializing)] | ||
// TODO: cache these somewhere? | ||
// pub regex: Option<Regex>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to look at how we want to cache regexes. I was going to look at what the regex crate does for its internal cache and then do something similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lazy regex compilation is definitely a necessary feature before merging, without it either parsing or startup is really slow.
Something like the regex
crate is a decent idea. Ideally this refactor should make multithreaded use easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at this and noticed that onig's Regex is actually Send
and Sync
now: rust-onig/rust-onig#47
So it looks like all we need is lazy initialization, not a thread-local cache. I've raised a PR here: #186
src/parsing/syntax_definition.rs
Outdated
Inline(ContextPtr), | ||
Direct(LinkerLink), | ||
// Inline(Context), | ||
Inline(String), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially left the distinction between Named
and Inline
because recursively_mark_no_prototype
only looked at Inline
. But after #180, this would not be necessary anymore.
Having said that, I don't know if we want SyntaxDefinition
to stay as close as possible to the YAML structure or not. In other words, are all the structures within SyntaxDefinition
meant to be part of the API surface for users of this library to be able to inspect?
src/parsing/syntax_set.rs
Outdated
@@ -25,6 +25,7 @@ use onig::Regex; | |||
/// | |||
/// Re-linking— linking, adding more unlinked syntaxes with `load_syntaxes`, | |||
/// and then linking again—is allowed. | |||
// TODO: clone? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason this is not clone yet is because of the FirstLineCache
. I imagine we'd use a similar solution to caching match regexes for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yah good idea. Either one pool of compiled regexes that has both sets of regexes in it, or just one RegexCache
type with two instances.
src/parsing/yaml_load.rs
Outdated
.map(|x| SyntaxDefinition::parse_reference(x, state)) | ||
.enumerate() | ||
.map(|(i, x)| SyntaxDefinition::parse_reference( | ||
&format!("{}.{}", name, i), x, state, contexts)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@keith-hall you mentioned that Sublime already gives names to anonymous contexts. Do you have a reference or a way to check how Sublime assigns the names? Then we could do it the same way here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it used to be possible to find how ST names these contexts by checking the ST console when an anonymous context included a meta_scope
and was subject to a with_prototype
- see sublimehq/sublime_text#1683. This is the closest I have to a reference or a way to check the names (with ST build 3153 or older).
It seems like the naming convention is #anon_
followed by the named context, followed by _
and the index of the anonymous context within that named context (regardless of nesting level, in an XML-like "document order").
This was my test syntax definition:
%YAML 1.2
---
scope: source.example-anon
contexts:
main:
- include: c
c:
- meta_scope: test_c
- match: c_test
scope: c_test
d:
- meta_scope: test_d
- match: d_example
scope: d_example
push: #anonymous context 0 inside `d` context
- meta_scope: test_e
- match: e
scope: e
push: # anonymous context 1 inside `d` context
- meta_scope: testing123
- match: f
push: # anonymous context 2 inside `d` context
- match: g
with_prototype:
- match: stop_already...
with_prototype:
- match: nope
with_prototype:
- match: this
- match: another
push: # anonymous context 3 inside `d` context
- meta_scope: blah
- match: something
push: # anonymous context 4 inside `d` context
- meta_scope: foobar
- match: hello
with_prototype:
- match: me
and the ST console output:
generating syntax summary
rule 719396458c7046797910f3c1990b81d6::#anon_d_1 has a scope name, but is unreachable, so the name will never be used
rule c has a scope name, but is unreachable, so the name will never be used
rule d has a scope name, but is unreachable, so the name will never be used
rule #anon_d_0 has a scope name, but is unreachable, so the name will never be used
rule #anon_d_1 has a scope name, but is unreachable, so the name will never be used
rule #anon_d_4 has a scope name, but is unreachable, so the name will never be used
I believe that 719396458c7046797910f3c1990b81d6
can be ignored for our purposes, I think it's some kind of unique identifier ST gave when creating new anonymous contexts with the with_prototype
applied, which is not something we do in syntect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good in general. I think there might be some clever ways to fiddle with things so that linking is cleaner though.
Right now this PR adds noticeable more code than it deletes and I feel like there may be a better design that doesn't. But if adding code is necessary that's okay, this refactor would still buy us context names for debugging as well as likely better multithreading support.
src/parsing/parser.rs
Outdated
// TODO: removed PartialEq, Eq from this because SyntaxSet is not | ||
#[derive(Debug, Clone)] | ||
pub struct ParseState<'a> { | ||
syntax_set: &'a SyntaxSet, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is where the SyntaxSet
should go, I think it should maybe be passed into parse_line
.
The reason is that ParseState
is designed for caching in text editors, and it would be kinda sad to redundantly cache a ton of pointers to the same SyntaxSet.
I also don't have a good idea of how much borrow checker hell having ParseState store a reference will cause. Might be really easy, might be hard. It could be worth looking at how this might fit into Xi as a case study. If it is hard then the StateLevel
s could store a context index instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try out the changes in xi-editor.
Passing it in parse_line
seems like you could accidentally pass in a different one and get all kinds of messed up results.
The way it's currently written with the lifetime specifier reflects the reality nicely: The ParseState
is valid for as long as the corresponding SyntaxSet
exists. If that goes away or is changed, a ParseState
instance should no longer be used.
src/parsing/parser.rs
Outdated
fn eq(&self, other: &StateLevel) -> bool { | ||
context_ptr_eq(&self.context, &other.context) && | ||
self.context == other.context && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might have performance implications since I think this does a deep compare. The PartialEq
implementation for Context
is really only intended for unit tests. Probably want to cast to pointers and compare those.
I think this implementation is for the purpose of implementing fancy caching patterns, and I'm not sure if this compare is valuable there, I don't think anyone uses this so it won't show up until someone tries to implement a fancy caching pattern and its super slow.
src/parsing/parser.rs
Outdated
/// Create a state from a syntax, keeps its own reference counted | ||
/// pointer to the main context of the syntax. | ||
pub fn new(syntax: &SyntaxDefinition) -> ParseState { | ||
pub fn new(syntax_set: &'a SyntaxSet, syntax: &'a SyntaxDefinition) -> ParseState<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way would be to return some opaque SyntaxRef
type that contains a ref to the SyntaxSet
and the starting context from SyntaxSet
. This can be done while satisfying my desire to have ParseState
not store the SyntaxSet
ref by using a PhantomData
type. Although it can't be reconciled with my not having a lifetime parameter in ParseState, although again I'm not sure if that's necessary or not.
I'm also not sure if this would be very clean API-wise. Right now the methods that let you look up syntax definitions in a set do double duty of letting you figure out info about the syntax like its name (for display in a text editor or whatever) as well as letting you highlight with it.
It may not be worth trying to make this strict, but if we can come up with a nice way that doesn't cause borrow checker hell we should do it.
src/parsing/syntax_definition.rs
Outdated
pub regex: Option<Regex>, | ||
// #[serde(skip_serializing, skip_deserializing)] | ||
// TODO: cache these somewhere? | ||
// pub regex: Option<Regex>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lazy regex compilation is definitely a necessary feature before merging, without it either parsing or startup is really slow.
Something like the regex
crate is a decent idea. Ideally this refactor should make multithreaded use easier.
src/parsing/syntax_definition.rs
Outdated
ContextId { syntax_index, context_index } | ||
} | ||
|
||
// TODO: maybe this should be on SyntaxSet instead? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unsure about this too. On the one hand it seems like it might fit better on the SyntaxSet in terms of who's doing the resolving of what.
On the other hand the SyntaxSet
already has a lot of methods and it's something people read docs for, and so we'd probably need to give it a more descriptive name. Probably better to keep it tucked away here where it more specifically applies.
src/parsing/syntax_set.rs
Outdated
@@ -25,6 +25,7 @@ use onig::Regex; | |||
/// | |||
/// Re-linking— linking, adding more unlinked syntaxes with `load_syntaxes`, | |||
/// and then linking again—is allowed. | |||
// TODO: clone? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yah good idea. Either one pool of compiled regexes that has both sets of regexes in it, or just one RegexCache
type with two instances.
src/parsing/syntax_definition.rs
Outdated
|
||
|
||
#[derive(Clone, Debug, Eq, PartialEq, Serialize, Deserialize)] | ||
pub struct ContextId { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if we can figure out a way to cleanly put all the Context
s in one big Vec
in the SyntaxSet
. That would make ContextId
half the size, as well as saving a bounds check + pointer lookup on every ID resolve.
I think such a design would also make whole-syntax-set passes like linking cleaner to implement, possibly through nice visitor methods that take a closure that's called on every link or something. Also future cache data structures with a slot for every context would be easier to index.
src/parsing/syntax_set.rs
Outdated
if let Some(context_index) = syntax.find_context_index_by_name("prototype") { | ||
// let prototype = &mut syntax.contexts[context_index]; | ||
// TODO: this works but is it nice? Would it be nicer to make | ||
// recursively_mark_no_prototype a method on SyntaxDefinition? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes me wonder if this is different than how Sublime does it. Now I'm wondering if in Sublime I include a named context in the prototype, and then push that named context from the main section, will it use the prototype or not. This also makes linking not idempotent since if a prototype includes something from another syntax (don't think this ever happens in practice) the first time it'll work fine, but on second link it can interfere with the prototype of another syntax.
I think Sublime may use a model of duplicating contexts when they have a prototype. Although I'm not sure if this ever comes up in practice.
Anyhow yah maybe this would be better on SyntaxDefinition
, although I think ideally all contexts get moved into a single Vec
in the SyntaxSet
and then there's some kind of recursive context walking helper method that this can use or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I'm wondering if in Sublime I include a named context in the prototype, and then push that named context from the main section, will it use the prototype or not.
I tested that as part of #180 and the answer is: No, it also won't use the prototype in that case. So the way we currently do it seems to be correct.
src/parsing/yaml_load.rs
Outdated
@@ -128,6 +126,21 @@ impl SyntaxDefinition { | |||
SyntaxDefinition::parse_top_level(doc, scope_repo.deref_mut(), lines_include_newline, fallback_name) | |||
} | |||
|
|||
// TODO: visibility. also, is this the right file to have these? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, note that this whole module isn't included if the yaml loading feature isn't included, and I think you use this in a test outside this module.
src/parsing/syntax_set.rs
Outdated
@@ -57,6 +58,34 @@ impl Default for SyntaxSet { | |||
} | |||
} | |||
|
|||
struct ContextMap { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like there must be a better way of doing linking, although it may require restructuring some data.
Actually I just realize I don't even understand why this is a thing, why not just iterate over the list of syntaxes directly to find the right index instead of constructing all of these extra vectors with the same indexes and information?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because at the point we want to find a syntax/context, we're within a method that has a mutable borrow of a Context
or some component of it.
I tried what you suggested but I couldn't get it past the borrow checker. If you have any ideas that would be good. We could probably do something with unsafe, because all we want to do is mutate ContextReference
, not changing any of the structure apart from that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay makes sense.
One way around this might be to switch to having a flat Vec<Context>
with SyntaxDefinition
s just having references into it. The YAML loader would take a mutable ref to a SyntaxSet
(or the underlying Vec
) that it would shove new contexts into. Then SyntaxDefinition
would have something like a HashMap<String, ContextId>
to represent the names.
That would accomplish moving to a flat system (which I suspect will work better/faster in other ways, see other comments) and also separate the lookup data (and move back to an efficient map lookup for names).
It would sacrifice being able to pick up syntaxes and drop them in different sets, but I don't think that's that useful. The one test that does that can just ask the SyntaxSet to load the file it wants directly instead of loading the entire directory and picking one out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting idea. I think it's a bigger change than what I have here, but I'd like to try it out.
It would sacrifice being able to pick up syntaxes and drop them in different sets
Depending on how we implement it, I think we could still make that possible. We could retrieve the original contexts based on the HashMap<String, ContextId>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so I've:
- Made the change to introduce a
SyntaxSetBuilder
on this branch - Created Use a single Vec<Context> in SyntaxSet after linking #187 for the flat
Vec<Context>
approach on top of this branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New builder and benchmark stuff looks good
src/parsing/syntax_set.rs
Outdated
@@ -91,58 +95,24 @@ impl SyntaxSet { | |||
SyntaxSet::default() | |||
} | |||
|
|||
/// Create a builder from this set to load more syntax definitions. | |||
pub fn builder(self) -> SyntaxSetBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Thanks for thinking of this. It's good to have this so fancy text editors can load new syntax definitions easily.
Contexts (e.g. inline contexts) are no longer nested within parent contexts, but instead are now in the flat list like top-level contexts. This makes it possible to work with indexes in recursively_mark_no_prototype. Context linking does not yet borrow check because of how references are resolved.
This pleases the borrow checker.
The previous code didn't have this problem because it used try_borrow_mut and so a recursive attempt to borrow the same context would enter the block.
This turned out to be cleaner.
Previously, linking had to be done manually using `link_syntaxes` in order for the syntaxes to work. This change introduces a SyntaxSetBuilder that contains all the methods to add/load syntaxes. When calling `build`, the syntaxes are linked and you get a `SyntaxSet`. This has the nice side effect of separating all the methods that mutate to a separate type and making SyntaxSet immutable.
This means resolving a ContextId is now a single lookup. Having the SyntaxSetBuilder (introduced in a previous commit) helped make this change.
85ce859
to
822c472
Compare
Now that we have a separate SyntaxReference type which just contains the context IDs, there's not point in having the Vec in SyntaxDefinition anymore.
The order determines the order of the flat Vec<Context> in the SyntaxSet, so sort them to make dumps deterministic. Another option would be to switch from HashMap to BTreeMap or to IndexMap from the indexmap crate.
@trishume I've ported the syntect-plugin, see here: xi-editor/xi-editor@master...robinst:syntect-plugin-arena The lifetimes are not that nice, but it works. I had to re-add the implementation of The plugin using I'll look into the other remaining things for now :). |
Sorry for leaving this to sit longer than usual, I've been really busy lately with a school project. I'll try and take a look at what the plugin rewrite looks like this weekend, but if you have time to test switching to ContextID and the benchmarks don't regress much, as I suspect they may not due to things staying in cache, then that may be a good idea. |
This way, you can load the default set, add some more syntaxes and link them to the existing syntaxes.
This means ParseState doesn't contain a reference to SyntaxSet anymore and doesn't need a lifetime. This makes it simpler for code that wants to cache it, but it also means parse_line needs to be called with the SyntaxSet as an argument so that it can look up contexts.
No worries! I've now finished all the big outstanding TODOs. Now the only thing that's left is documentation and some organizational things such as method visibility/naming and where to put them. I've also added some tests for The change for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, the latest into_builder
stuff looks good!
So I've merged the other PR into this one, and updated the docs and the README. The only thing left that I'd like to look at is:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice new docs 👍
Yup merging soon sounds good to me. And re your questions: Leaving the context stuff hidden is fine, I can't think of any even pro text editor use cases that need them and if anyone does they can ask. And yah your suggested renames sound good. I like |
Actually let's do the API renames in a new PR, makes it possible to review. Should I click the Merge button? :) |
Ah btw,
|
Nice, a separate PR for renames sounds good. I normally merge but you should indeed have the satisfaction of hitting merge on this enormous amount of awesome work, go ahead! After the renames I'll look at drafting a 3.0 release with notes on the breaking and other changes. |
Merged! 🎉 I'll prepare the other PR in the next few days. |
Changes context references from
Rc<RefCell<Context>>
to a Vec index, which makes it possible to makeSyntaxSet
bothSend
andSync
, see #83.This still needs more work and there's tons of TODOs to clean up, but I have some questions which will affect the direction.
The high-level overview of changes so far:
contexts
onSyntaxDefinition
changed from aHashMap<String, ContextPtr>
to aVec<Context>
Context
has a newname
field, and thecontexts
are sorted by that. The idea is that during linking, we can do a binary search to find contexts. After linking, we have ausize
index to look up the context quickly.High-level TODOs:
SyntaxDefinition
safe or prevent itFirstLineCache
ParseState
can be cached (try out in xi-editor here)Vec<Context>
inSyntaxSet
: see Use a single Vec<Context> in SyntaxSet after linking #187ParseState
should referenceSyntaxSet
andContext
s directly or just keepContextId
sSyntaxSetBuilder
API as recommended by guidelines: https://rust-lang-nursery.github.io/api-guidelines/type-safety.html#consuming-builders (also go over method names) -> some methods returnResult
, so can't make chaining work for all methods. Probably best to leave as is.