Improve WebM detection #486

Merged: 1 commit, Sep 1, 2021

Conversation

Borewit
Collaborator

@Borewit Borewit commented Aug 30, 2021

Fixes recognition of WebM format.

Fixes: #485

@@ -736,7 +736,8 @@ async function _fromTokenizer(tokenizer) {
 		while (children > 0) {
 			const element = await readElement();
 			if (element.id === 0x42_82) {
-				return tokenizer.readToken(new Token.StringType(element.len, 'utf-8')); // Return DocType
+				const rawValue = await tokenizer.readToken(new Token.StringType(element.len, 'utf-8'));
+				return rawValue.replace(/\00.*$/g, ''); // Return DocType
Owner

Does the spec document the maximum number of null characters there could be? It would be nice to have a limit in place so it doesn't hang on faulty files that have too many null characters.

Collaborator Author

There is no maximum; the null characters are used as a kind of padding.
The length of the string that is read is already bounded by element.len.
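To illustrate the padding point, here is a minimal sketch of stripping null padding from a value that has already been read. The helper name `stripNullPadding` is hypothetical and not part of file-type. It uses `indexOf` rather than the regular expression from the diff, so it also cuts at a NUL that is followed by line breaks.

```javascript
// Hypothetical helper for illustration only; not file-type's API.
// Assumes rawValue was already read with a length bounded by element.len,
// so the scan below can never run past the declared element size.
function stripNullPadding(rawValue) {
	// Find the first NUL character; everything from there on is padding.
	const end = rawValue.indexOf('\u0000');
	return end === -1 ? rawValue : rawValue.slice(0, end);
}

// A DocType value padded with four NUL characters.
const padded = 'webm\u0000\u0000\u0000\u0000';
console.log(stripNullPadding(padded)); // 'webm'
console.log(stripNullPadding('matroska')); // 'matroska' (no padding, returned unchanged)
```

Because the input string is bounded before the scan starts, the cost of stripping is linear in the declared element length, however many padding characters the file contains.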

Collaborator Author

@Borewit Borewit Aug 31, 2021

Ah, you mean checking element.len? The length can exceed the range of a JavaScript number, and it is encoded in a specific way (see the VINT examples).
By that point we already assume the input is EBML, so the code has started consuming the tokenizer and iterating through the EBML elements.
You could argue that the DocType must be a relatively short value, but by then we have already matched 0x1A, 0x45, 0xDF, 0xA3 and 0x42_82. It is extremely unlikely that we reach that point without the format being EBML.
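For readers unfamiliar with the VINT encoding mentioned above, the following is a rough sketch of decoding an EBML variable-size integer used as an element data size. It is not the code from this PR. The function name `decodeVint` is hypothetical, and the sketch assumes the default maximum width of 8 octets. It returns a `BigInt` because, as noted, the value can exceed what a JavaScript number represents exactly.

```javascript
// Hypothetical sketch of EBML VINT decoding; not file-type's implementation.
// bytes: an array or Uint8Array; offset: index of the first VINT octet.
function decodeVint(bytes, offset = 0) {
	const first = bytes[offset];
	if (first === undefined || first === 0) {
		// A zero first octet would mean a width above 8, which this sketch rejects.
		throw new RangeError('Invalid or unsupported VINT');
	}

	// The width in octets is the number of leading zero bits plus one.
	let width = 1;
	let mask = 0x80;
	while ((first & mask) === 0) {
		width++;
		mask >>= 1;
	}

	if (offset + width > bytes.length) {
		throw new RangeError('Truncated VINT');
	}

	// Clear the marker bit, then append the remaining octets big-endian.
	let value = BigInt(first & (mask - 1));
	for (let i = 1; i < width; i++) {
		value = (value << 8n) | BigInt(bytes[offset + i]);
	}

	return {width, value};
}

const short = decodeVint([0x84]); // width 1, value 4n
const wide = decodeVint([0x01, 0x80, 0, 0, 0, 0, 0, 0]); // width 8, value 2n ** 55n
console.log(short.value, wide.value > BigInt(Number.MAX_SAFE_INTEGER)); // 4n true
```

A caller that wants to cap the DocType length could compare the decoded `value` against a small limit before reading the string. Whether such a cap is worthwhile is the judgment call discussed in this thread.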

@Borewit Borewit changed the title from "Ignore leading null values in EBML UTF-8 value" to "Ignore trailing null values in EBML UTF-8 value" Aug 31, 2021
@sindresorhus sindresorhus changed the title from "Ignore trailing null values in EBML UTF-8 value" to "Improve WebM detection" Sep 1, 2021
@sindresorhus sindresorhus merged commit b23be62 into main Sep 1, 2021
@sindresorhus sindresorhus deleted the fix-issue-485 branch September 1, 2021 00:32
Successfully merging this pull request may close these issues.

Type detection failing for webm format file
2 participants