Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[idl_parser] Track included files by hash #6434

Merged
merged 4 commits into from
Mar 1, 2021

Conversation

mmmspatz
Copy link
Contributor

@mmmspatz mmmspatz commented Feb 1, 2021

Parser::included_files_ is a map whose main purpose is to keep track of
which files have already been parsed in order to protect against
multiple inclusion. Its key is the path that the file was found at
during parsing (or, if it's an in-memory file, just its name).

The second commit changes the key to be the 64 bit FNV-1a hash of the file's
name (just the name, not the complete path) xor'd with the hash of the
file's contents (unless it's an in-memory file, then we only hash the
name.)

This allows multiple include protection to function even in the face of
unique per-file include paths (fixes #6425).

The first commit deletes a (hopefully) unused code fragment that interfered with this change.
The last commit solves an issue reported by CI, probably related to the first commit. Need some insight here.

I don't know what this is for, but it's the only piece of code external
to idl_parser.cpp that expects the key of Parser::included_files_ to be
a path. And it appears to be unused.
@github-actions github-actions bot added c++ codegen Involving generating code from schema javascript typescript labels Feb 1, 2021
Parser::included_files_ is a map whose main purpose is to keep track of
which files have already been parsed in order to protect against
multiple inclusion. Its key is the path that the file was found at
during parsing (or, if it's an in-memory file, just its name).

This commit changes the key to be the 64 bit FNV-1a hash of the file's
name (just the name, not the complete path) xor'd with the hash of the
file's contents (unless it's an in-memory file, then we only hash the
name.)

This allows multiple include protection to function even in the face of
unique per-file include paths (fixes google#6425).
CI told me to do it.
@github-actions github-actions bot added the rust label Feb 1, 2021
@mmmspatz mmmspatz changed the title Mspatz/hash includes [idl_parser] Track included files by hash Feb 1, 2021
@aardappel
Copy link
Collaborator

Assuming doing proper cross-platform path canonicalization is indeed too difficult to get exactly right, this seems a reasonably elegant way to solve the multiple inclusion problem to me. What do you think, @vglavnyy @krojew ?

Not too worried about performance implications since this only affects schema loading, which I hope is never performance sensitive.

@krojew
Copy link
Contributor

krojew commented Feb 1, 2021

Sounds good enough, although Murphy's law implies we'll get collisions now instead 😉

source_hash = HashFile(source_filename, source);
else
source_hash = HashFile(source_filename, nullptr);

Copy link
Contributor

@vglavnyy vglavnyy Feb 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how it works.
Why parse should continue processing if a schema has include filename directive but the file doesn't exist?
Can you give an example that covers both paths of this if (FileExists(source_filename))?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (!LoadFile(filepath.c_str(), true, &contents))
return Error("unable to load include file: " + name);
// Parse it.
if (!file_loaded) return Error("unable to load include file: " + name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same, why file_loaded checked if and only its hash not found?
Why empty contents is used for calculation?

Copy link
Contributor Author

@mmmspatz mmmspatz Feb 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we encounter an include directive, we need to answer two questions:

  1. Has the included file already been parsed?
  2. If not, can we load it for parsing?

It's necessary to defer query 2 until we know the answer to query 1 is "no", becasue Parser::Parse() can be called schemas that do not exist on-disk (example). If one of those in-memory schemas is subsequently included in another schema, we won't be able to load it from disk but we can know that it was previously parsed based on its name alone. At least, that is the assumtion that was made even prior to #6371.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mmmspatz thank you for the explanation.
Your solution looks correct.

@krojew
Copy link
Contributor

krojew commented Feb 18, 2021

Any update on this?

@krojew
Copy link
Contributor

krojew commented Feb 22, 2021

@aardappel @mmmspatz can we move forward with this somehow?

@aardappel
Copy link
Collaborator

I'd like to see @vglavnyy's concern's addressed first.

I just learned source_filename might also be null. In that case, we
should exclude it from the hash instead of hashing a zero length
string, just like we exclude source when it is null.

Presumably nameless files will never be included (they can't, can they?)
so this doesn't really matter, but I think it's prettier anyways.
@krojew
Copy link
Contributor

krojew commented Feb 25, 2021

I'm adding this to #6353 since it fixes a quite important regression.

@aardappel
Copy link
Collaborator

Awesome, thanks all!

@aardappel aardappel merged commit bd4e0b3 into google:master Mar 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ codegen Involving generating code from schema javascript rust typescript
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multiple includes of the same symbol broken
4 participants