-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[idl_parser] Track included files by hash #6434
Conversation
I don't know what this is for, but it's the only piece of code external to idl_parser.cpp that expects the key of Parser::included_files_ to be a path. And it appears to be unused.
Parser::included_files_ is a map whose main purpose is to keep track of which files have already been parsed in order to protect against multiple inclusion. Its key is the path that the file was found at during parsing (or, if it's an in-memory file, just its name). This commit changes the key to be the 64 bit FNV-1a hash of the file's name (just the name, not the complete path) xor'd with the hash of the file's contents (unless it's an in-memory file, then we only hash the name.) This allows multiple include protection to function even in the face of unique per-file include paths (fixes google#6425).
6634953
to
838a89d
Compare
CI told me to do it.
Assuming doing proper cross-platform path canonicalization is indeed too difficult to get exactly right, this seems a reasonably elegant way to solve the multiple inclusion problem to me. What do you think, @vglavnyy @krojew ? Not too worried about performance implications since this only affects schema loading, which I hope is never performance sensitive. |
Sounds good enough, although Murphy's law implies we'll get collisions now instead 😉 |
source_hash = HashFile(source_filename, source); | ||
else | ||
source_hash = HashFile(source_filename, nullptr); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how it works.
Why parse should continue processing if a schema has include filename
directive but the file doesn't exist?
Can you give an example that covers both paths of this if (FileExists(source_filename))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see other reply
if (!LoadFile(filepath.c_str(), true, &contents)) | ||
return Error("unable to load include file: " + name); | ||
// Parse it. | ||
if (!file_loaded) return Error("unable to load include file: " + name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same, why file_loaded
checked if and only its hash not found?
Why empty contents
is used for calculation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we encounter an include directive, we need to answer two questions:
- Has the included file already been parsed?
- If not, can we load it for parsing?
It's necessary to defer query 2 until we know the answer to query 1 is "no", becasue Parser::Parse()
can be called schemas that do not exist on-disk (example). If one of those in-memory schemas is subsequently included in another schema, we won't be able to load it from disk but we can know that it was previously parsed based on its name alone. At least, that is the assumtion that was made even prior to #6371.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mmmspatz thank you for the explanation.
Your solution looks correct.
Any update on this? |
@aardappel @mmmspatz can we move forward with this somehow? |
I'd like to see @vglavnyy's concern's addressed first. |
I just learned source_filename might also be null. In that case, we should exclude it from the hash instead of hashing a zero length string, just like we exclude source when it is null. Presumably nameless files will never be included (they can't, can they?) so this doesn't really matter, but I think it's prettier anyways.
I'm adding this to #6353 since it fixes a quite important regression. |
Awesome, thanks all! |
Parser::included_files_
is a map whose main purpose is to keep track ofwhich files have already been parsed in order to protect against
multiple inclusion. Its key is the path that the file was found at
during parsing (or, if it's an in-memory file, just its name).
The second commit changes the key to be the 64 bit FNV-1a hash of the file's
name (just the name, not the complete path) xor'd with the hash of the
file's contents (unless it's an in-memory file, then we only hash the
name.)
This allows multiple include protection to function even in the face of
unique per-file include paths (fixes #6425).
The first commit deletes a (hopefully) unused code fragment that interfered with this change.
The last commit solves an issue reported by CI, probably related to the first commit. Need some insight here.