Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] keep track of minver and maxver per element #207

Merged
merged 4 commits into from
Jan 14, 2024

Conversation

robUx4
Copy link
Contributor

@robUx4 robUx4 commented Jan 6, 2024

They correspond to the attributes in the EBML specs.

For they are just informational. We may use this information when reading data to find which element are allowed given the read/version in the EBML header, that can help with error recovery, skipping elements that are not legit given the context.

@robUx4 robUx4 added api-break breaks the API (e.g. programs using it will have to adjust their source code) abi-break breaks the ABI (e.g. programs linked against the library have to be recompiled) labels Jan 6, 2024
Copy link
Contributor

@mbunkus mbunkus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I wrote elsewhere, I'm not a big fan of the libraries constraining the elements that can be read or written depending on the version. The application should deal with it. Or there should be a switch to turn such a feature off & let the application deal with it. My legitimate use case is writing the EBML header with rather low version numbers & updating it later on when elements requiring newer versions to be written.

I would definitely profit from having that information stored & accessible in libEBML/libMatroska, though. At the moment I store the same information directly in MKVToolNix. It makes total sense to let the spectools embed the information in the libraries as it is already available in the specs.

const num_version minver;
// the maximum DocType version this element is allowed in
// ANY_VERSION if the element is supported in all (known) version
const num_version maxver;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be a great place to use std::optional instead of magic values. Just make minver & maxver an std::optional<unsigned int>. Special values are so C++99 😁

Unfortunately std::optional is a feature of C++17, and again, so far we only require C++14. Repeating myself here, but I'd be more than fine with bumping the requirements up to C++17, but if I remember correctly this would have the probably undesired consequence that VLC would now also require C++17 instead of C++14. Or do I misremember?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, for now VLC only requires a C++14 compiler to build. That means people using just that will not be able to compile libmatroska within VLC, and have Matroska support disabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change you propose is probably better. But until we officially make the switch to C++17 we can't use that. So this should not be blocking.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. We can always refactor to use std::optional later if we decided to require C++17.

@robUx4
Copy link
Contributor Author

robUx4 commented Jan 6, 2024

Like I wrote elsewhere, I'm not a big fan of the libraries constraining the elements that can be read or written depending on the version. The application should deal with it. Or there should be a switch to turn such a feature off & let the application deal with it. My legitimate use case is writing the EBML header with rather low version numbers & updating it later on when elements requiring newer versions to be written.

Good because that's also my opinion. In VLC we will read invalid elements no matter what.

The API allow to check which elements are valid and maybe skip them if they are invalid. It will probably something like ShouldWrite but for... reading. The host app will be in total control of what to read (mkvalidator could report when a illegal element is read, for example) but the default will be the same as now.

@mbunkus
Copy link
Contributor

mbunkus commented Jan 7, 2024

As I'm confused with the logic in the other PRs wrt. deprecated elements, I thought it would be good to have a function bool IsDeprecated() const and/or bool IsDeprecatedInVersion(num_version version) const directly in EbmlDocVersion. In that function you could add a comment describing how deprecation is expressed with the existing values. Other places would not need to know about the internals & can just call the function.

Additionally functions such as const IsValidInVersion(num_version version) const, num_version GetMinVersion() const, num_version GetMaxVersion() would be nice to have.

The goal is to hide all the internals of EbmlDocVersion from users of the class. Leave how values are stored/used up to the class, don't bother others with it.

And a last nitpick: I'm not the biggest fan of using unsigned char here. Member variables will be aligned to sensible boundaries (32bit?) for faster access on modern CPUs anyway. Just make that unsigned int.

Maybe also rename num_version to version_type to make it optically much clearer that this is a type. Or even value_type; would also match how things are named in the standard library (e.g. std::vector<T>::value_type).

@robUx4
Copy link
Contributor Author

robUx4 commented Jan 7, 2024

Are you sure the member variables are usually aligned ? When I compile with AppleClang 15 this test passes in debug and release build:

  static_assert(sizeof(EbmlDocVersion) == 2, "nope");

I think since there's no bigger type, it keeps the smallest alignment which is 1 octet here.

It's all about the balance between the size in memory and the speed of the code. For 200 elements, that would add 1200 bytes in memory with code a little bit faster. Also libmatroska will subclass this class with 2 boolean which are also 1 octet each. That means the size would go to 32 octets instead of 4 octets if we force int in EbmlDocVersion, which means an extra 5600 octets used in memory.

IMO unless the code is much faster it's not worth the extra memory.

@robUx4
Copy link
Contributor Author

robUx4 commented Jan 7, 2024

One reason to use a bigger storage size would be to avoid the 256 doctype version limit. It's clearly not on the horizon for Matroska. But maybe other EBML formats might need it.

@mbunkus
Copy link
Contributor

mbunkus commented Jan 7, 2024

Are you sure the member variables are usually aligned ?

Absolutely. That's why you need things such as __attribute__((__packed__)) or the MSVC equivalent of #pragma pack(push, 1) if you want a struct whose memory layout exactly represents something, e.g. structures in a file format.

It's quite possible that sizeof(EbmlDocVersion) is 2, but when you use an EbmlDocVersion inside another class then the following member variables in that embedding class will be aligned.

Here's an example that shows both effects:

[0 mosu@sweet-chili ~/test] cat cpp1.cpp
#include <cstdint>
#include <iostream>

struct s1 {
  uint32_t a;
  unsigned char b;
  uint32_t c;
};

struct __attribute__((__packed__)) s2 {
  uint32_t a;
  unsigned char b;
  uint32_t c;
};

struct embedding {
  s1 a;
  s2 b;
  s1 c;
};

int
main() {
  std::cout << "s1: " << sizeof(s1) << " s2: " << sizeof(s2) << " embedding: " << sizeof(embedding) << "\n";
  return 0;
}
[0 mosu@sweet-chili ~/test] ./cpp1
s1: 12 s2: 9 embedding: 36

One interesting fact about why these values are aligned is that a lot of CPUs cannot even read 32bit values from non-aligned addresses (they might end the process with a SIGBUS), or only with huge performance penalties.

It's a thing, this alignment.

@robUx4
Copy link
Contributor Author

robUx4 commented Jan 13, 2024

I'm familiar with the alignment, but that's not exactly what is happening here. As said above the current size of EbmlDocVersion is 2. I did the following class:

class VersionExt : public EbmlDocVersion
{
    public:
        unsigned int val;
};

If the padding of the parent class was overriden, we would have a sizeof of 12. But the sizeof is 8. So the original class does keep its size of 2, the 4 of the unsigned int are added, and there's a padding of 2 to round to a power of 2.

Also this gives a sizeof 12:

class VersionExt : public EbmlDocVersion
{
    public:
        unsigned int val;
        unsigned char a;
};

When the same fields in a different order gives 8:

class VersionExt : public EbmlDocVersion
{
    public:
        unsigned char a;
        unsigned int val;
};

As we're changing the ABI that's something we should keep in mind as we may reorder elements to save on memory at no cost.

I'll switch to unsigned int, not for memory alignment/performance but simply because it allows more values.

This corresponds to the element attributes of the same name in the EBML spec.
It's possible to provide a sub-class with more features, like WebM/DivX support in matroska.
We could have a version with/without but that's way too much.
Maybe some macros can be turned into templates and then we can have default values.
@mbunkus mbunkus merged commit ab2e85e into Matroska-Org:master Jan 14, 2024
15 checks passed
@mbunkus
Copy link
Contributor

mbunkus commented Jan 14, 2024

LOL, Github thought I had merged this one when I pushed the libEBML branch… anyway, both are merged & pushed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
abi-break breaks the ABI (e.g. programs linked against the library have to be recompiled) api-break breaks the API (e.g. programs using it will have to adjust their source code)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants