[idl_parser] Track included files by hash #6434

mmmspatz · 2021-02-01T02:43:20Z

Parser::included_files_ is a map whose main purpose is to keep track of
which files have already been parsed in order to protect against
multiple inclusion. Its key is the path that the file was found at
during parsing (or, if it's an in-memory file, just its name).

The second commit changes the key to be the 64-bit FNV-1a hash of the file's
name (just the name, not the complete path) XOR'd with the hash of the
file's contents. For an in-memory file, only the name is hashed.
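
As a rough illustration of the keying scheme described above, here is a minimal sketch. It is not the code from this PR: `Fnv1a64`, `BaseName`, and `IncludeKey` are hypothetical names, and stripping the directory with `find_last_of` is an assumption about what "just the name" means.

```cpp
#include <cstdint>
#include <string>

// Standard 64-bit FNV-1a over the bytes of `data`.
inline uint64_t Fnv1a64(const std::string &data) {
  uint64_t hash = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
  for (unsigned char byte : data) {
    hash ^= byte;
    hash *= 0x100000001b3ULL;  // FNV 64-bit prime
  }
  return hash;
}

// Hypothetical helper: drops everything up to the last '/' or '\\'.
inline std::string BaseName(const std::string &path) {
  const std::string::size_type pos = path.find_last_of("/\\");
  return pos == std::string::npos ? path : path.substr(pos + 1);
}

// Hypothetical key function: hash of the bare file name, XOR'd with the
// hash of the contents when they are available. Passing nullptr models an
// in-memory file, for which only the name is hashed.
inline uint64_t IncludeKey(const std::string &path,
                           const std::string *contents) {
  uint64_t key = Fnv1a64(BaseName(path));
  if (contents) key ^= Fnv1a64(*contents);
  return key;
}
```

With a key like this, `a/b/x.fbs` and `../x.fbs` map to the same entry as long as the contents match, which is the property the include guard needs.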

This allows multiple include protection to function even in the face of
unique per-file include paths (fixes #6425).

The first commit deletes a (hopefully) unused code fragment that interfered with this change.
The last commit solves an issue reported by CI, probably related to the first commit. Need some insight here.

aardappel · 2021-02-01T20:17:57Z

Assuming doing proper cross-platform path canonicalization is indeed too difficult to get exactly right, this seems a reasonably elegant way to solve the multiple inclusion problem to me. What do you think, @vglavnyy @krojew ?

Not too worried about performance implications since this only affects schema loading, which I hope is never performance sensitive.

krojew · 2021-02-01T20:31:27Z

Sounds good enough, although Murphy's law implies we'll get collisions now instead 😉

vglavnyy · 2021-02-07T09:13:23Z

src/idl_parser.cpp

+      source_hash = HashFile(source_filename, source);
+    else
+      source_hash = HashFile(source_filename, nullptr);
+


I don't understand how this works.
Why should the parser continue processing if a schema has an include directive but the included file doesn't exist?
Can you give an example that covers both paths of this if (FileExists(source_filename))?

mmmspatz · 2021-02-23

See the other reply.

vglavnyy · 2021-02-07T09:13:43Z

src/idl_parser.cpp

-        if (!LoadFile(filepath.c_str(), true, &contents))
-          return Error("unable to load include file: " + name);
+        // Parse it.
+        if (!file_loaded) return Error("unable to load include file: " + name);


Same question here: why is file_loaded checked only when the hash is not found?
Why are empty contents used for the hash calculation?

mmmspatz · 2021-02-23

When we encounter an include directive, we need to answer two questions:

1. Has the included file already been parsed?
2. If not, can we load it for parsing?

It's necessary to defer query 2 until we know the answer to query 1 is "no", because Parser::Parse() can be called on schemas that do not exist on disk. If one of those in-memory schemas is subsequently included in another schema, we won't be able to load it from disk, but we can know that it was previously parsed based on its name alone. At least, that is the assumption that was made even prior to #6371.
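
The ordering of the two queries can be sketched roughly as follows. This is a simplified model, not the parser's actual code: `IncludeTracker`, `IncludeOutcome`, and the `load` callback are hypothetical, and `std::hash` stands in for the real hash function.

```cpp
#include <cstdint>
#include <functional>
#include <set>
#include <string>

enum class IncludeOutcome { kAlreadyParsed, kParsedNow, kLoadError };

// Hypothetical stand-in for the parser's include bookkeeping.
struct IncludeTracker {
  std::set<uint64_t> seen_keys;

  // Name-only key when contents are unavailable (in-memory schema),
  // otherwise name hash XOR contents hash.
  static uint64_t Key(const std::string &name, const std::string *contents) {
    std::hash<std::string> h;  // stand-in for the real hash function
    uint64_t key = h(name);
    if (contents) key ^= h(*contents);
    return key;
  }

  // Records a schema that was handed to the parser without a file on disk.
  void MarkInMemory(const std::string &name) {
    seen_keys.insert(Key(name, nullptr));
  }

  // `load` returns true and fills its output argument if the file is readable.
  IncludeOutcome Include(
      const std::string &name,
      const std::function<bool(const std::string &, std::string *)> &load) {
    std::string contents;
    const bool file_loaded = load(name, &contents);
    const uint64_t key = Key(name, file_loaded ? &contents : nullptr);
    // Query 1: has this include already been parsed?
    if (seen_keys.count(key)) return IncludeOutcome::kAlreadyParsed;
    // Query 2: a failed load only becomes an error once query 1 says "no".
    if (!file_loaded) return IncludeOutcome::kLoadError;
    seen_keys.insert(key);
    return IncludeOutcome::kParsedNow;
  }
};
```

In this model an in-memory schema registered by name is recognized on a later include even though loading it from disk fails, while a file that is both unseen and unloadable still produces an error.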

vglavnyy · 2021-02-27

@mmmspatz thank you for the explanation.
Your solution looks correct.

krojew · 2021-02-18T17:09:55Z

Any update on this?

krojew · 2021-02-22T18:35:57Z

@aardappel @mmmspatz can we move forward with this somehow?

aardappel · 2021-02-22T19:01:25Z

I'd like to see @vglavnyy's concerns addressed first.

I just learned source_filename might also be null. In that case, we should exclude it from the hash instead of hashing a zero length string, just like we exclude source when it is null. Presumably nameless files will never be included (they can't, can they?) so this doesn't really matter, but I think it's prettier anyways.
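
A small sketch of the null handling described here, under the assumption that both arguments are plain C strings. `HashFileSketch` and `Fnv1a64` are hypothetical names chosen for illustration and are not claimed to match the merged `HashFile`.

```cpp
#include <cstdint>

// Standard 64-bit FNV-1a over a NUL-terminated string.
inline uint64_t Fnv1a64(const char *str) {
  uint64_t hash = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
  for (; *str; ++str) {
    hash ^= static_cast<unsigned char>(*str);
    hash *= 0x100000001b3ULL;  // FNV 64-bit prime
  }
  return hash;
}

// Hypothetical sketch: a null argument is left out of the hash entirely
// instead of being hashed as a zero-length string.
inline uint64_t HashFileSketch(const char *source_filename,
                               const char *source) {
  uint64_t hash = 0;
  if (source_filename) hash = Fnv1a64(source_filename);
  if (source) hash ^= Fnv1a64(source);
  return hash;
}
```

The difference is observable: a null name contributes 0, whereas an empty name contributes the FNV offset basis.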

krojew · 2021-02-25T06:22:01Z

I'm adding this to #6353 since it fixes a quite important regression.

aardappel · 2021-03-01T20:33:51Z

Awesome, thanks all!

[idl_gen] Delete ts::GenPrefixedImport()

63f2adf

I don't know what this is for, but it's the only piece of code external to idl_parser.cpp that expects the key of Parser::included_files_ to be a path. And it appears to be unused.

github-actions bot added the c++, codegen, javascript, and typescript labels Feb 1, 2021

mmmspatz force-pushed the mspatz/hash-includes branch from 6634953 to 838a89d February 1, 2021 02:45

mmmspatz mentioned this pull request Feb 1, 2021

Multiple includes of the same symbol broken #6425

Closed

Ran tests/generate_code.sh

5113914

CI told me to do it.

github-actions bot added the rust label Feb 1, 2021

mmmspatz changed the title ~~Mspatz/hash includes~~ [idl_parser] Track included files by hash Feb 1, 2021

vglavnyy reviewed Feb 7, 2021

vglavnyy approved these changes Feb 27, 2021

aardappel merged commit bd4e0b3 into google:master Mar 1, 2021
