Improve WebM detection #486

Merged
merged 1 commit into from
Sep 1, 2021
3 changes: 2 additions & 1 deletion core.js
@@ -736,7 +736,8 @@ async function _fromTokenizer(tokenizer) {
while (children > 0) {
const element = await readElement();
if (element.id === 0x42_82) {
- return tokenizer.readToken(new Token.StringType(element.len, 'utf-8')); // Return DocType
+ const rawValue = await tokenizer.readToken(new Token.StringType(element.len, 'utf-8'));
+ return rawValue.replace(/\00.*$/g, ''); // Return DocType
Owner

Is the maximum number of null characters documented anywhere? It would be nice to have a limit in place so that it doesn't hang on faulty files that have too many null characters.

Collaborator Author

There is no maximum; the null characters are used as a kind of padding.
The length of the string read is already bounded by element.len.
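
The padding behaviour described in this reply can be sketched in isolation. The helper below is hypothetical (`stripNullPadding` is not a name used in this PR). It assumes the DocType has already been read as a string of at most element.len characters, and it cuts the value at the first null character with `indexOf` instead of the regular expression from the diff. For a value that is simply padded with trailing nulls, both approaches should give the same result.

```javascript
// Hypothetical helper, not part of the PR: drop everything from the first
// null character onward. The input is assumed to be bounded by element.len,
// so the scan is linear in a length the caller has already accepted.
function stripNullPadding(rawValue) {
  const end = rawValue.indexOf('\0');
  return end === -1 ? rawValue : rawValue.slice(0, end);
}

console.log(stripNullPadding('webm\0\0\0')); // prints "webm"
console.log(stripNullPadding('matroska')); // prints "matroska"
```

Because the scan stops at the first null character, the amount of padding does not change how much of the string is inspected beyond that point.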

Collaborator Author

@Borewit, Aug 31, 2021

Ah, you mean checking element.len? The length can exceed the range of a JavaScript number and is encoded in a specific way (see the VINT examples).
At that point the code already assumes the file is EBML, so it starts consuming the tokenizer
and iterating through the EBML elements.
You could argue that the DocType must be a relatively short value, but by then we have already matched 0x1A, 0x45, 0xDF, 0xA3 and 0x42_82. It is extremely unlikely that we reach that point without the format being EBML.
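
The remark about how element.len is encoded can be made concrete with a small sketch of EBML variable-length integer (VINT) decoding. This is not the code from core.js; `readVintSize` is a hypothetical name. The sketch assumes the EBML rule that the number of leading zero bits in the first byte gives the width in bytes (up to 8), with the marker bit removed from the value. BigInt is used because an 8-byte VINT carries 56 value bits, more than a JavaScript number can represent exactly.

```javascript
// Hypothetical sketch, not the library's implementation: decode an EBML
// element size stored as a VINT starting at `offset` in `bytes`.
function readVintSize(bytes, offset = 0) {
  const first = bytes[offset];
  if (first === undefined || first === 0) {
    // 0x00 as first byte would mean a width above 8 bytes, not handled here.
    throw new RangeError('Missing or unsupported VINT');
  }
  // clz32 counts leading zeros in 32 bits; a byte occupies the low 8 bits.
  const width = Math.clz32(first) - 23;
  if (offset + width > bytes.length) {
    throw new RangeError('Truncated VINT');
  }
  // Clear the marker bit, then append the remaining bytes big-endian.
  let value = BigInt(first & (0xFF >> width));
  for (let i = 1; i < width; i++) {
    value = (value << 8n) | BigInt(bytes[offset + i]);
  }
  return {width, value};
}

console.log(readVintSize(Uint8Array.of(0x81))); // 1-byte VINT, value 1n
console.log(readVintSize(Uint8Array.of(0x40, 0x02))); // 2-byte VINT, value 2n
// An 8-byte VINT can hold a value beyond Number.MAX_SAFE_INTEGER:
const big = readVintSize(Uint8Array.of(0x01, 0x20, 0, 0, 0, 0, 0, 0));
console.log(big.value > BigInt(Number.MAX_SAFE_INTEGER)); // prints true
```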

}

await tokenizer.ignore(element.len); // ignore payload
Binary file added fixture/fixture-null.webm
Binary file not shown.
3 changes: 3 additions & 0 deletions test.js
@@ -203,6 +203,9 @@ const names = {
'fixture-fast-web', // PDF saved from Adobe Illustrator, using the default "[Illustrator Default"] preset, but enabling "Optimize for Fast Web View"
'fixture-printed', // PDF printed from Adobe Illustrator, but with a PDF printer.
],
+ webm: [
+ 'fixture-null', // EBML DocType with trailing null character
+ ],
};

// Define an entry here only if the file type has potential