RPM: not all strings are UTF-8 #672

Open
armijnhemel opened this issue May 15, 2023 · 4 comments
@armijnhemel
Collaborator

In the current rpm.ksy, the encoding for strings is set to UTF-8. Some RPM files fail to parse because, as it turns out, not everyone has been playing nice with encodings.

An example is this file from Fedora Core 3:

https://archives.fedoraproject.org/pub/archive/fedora/linux/core/3/x86_64/os/Fedora/RPMS/bash-3.0-17.x86_64.rpm

One of the tags is a record_type_string_array related to ChangeLogs, and some entries in it seem to use Latin-1 characters instead of UTF-8, for example:

Trond Eivind Glomsr\xf8d <teg@redhat.com> 2.0.5a-10

Currently record_type_string_array is defined as follows:

  record_type_string_array:
    params:
      - id: num_values
        type: u4
    seq:
      - id: values
        type: strz
        repeat: expr
        repeat-expr: num_values

and the default encoding is UTF-8, so this will obviously not work. I don't know how I could fix this.
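To illustrate the failure, here is a minimal Python sketch using only the ChangeLog bytes quoted above. It does not use the Kaitai Struct runtime; it only shows that the 0xF8 byte cannot be decoded as UTF-8 while Latin-1 accepts it.

```python
# The ChangeLog entry from the example above, as raw bytes.
raw = b"Trond Eivind Glomsr\xf8d <teg@redhat.com> 2.0.5a-10"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # The byte 0xF8 never appears in well-formed UTF-8.
    utf8_ok = False

# Latin-1 (ISO 8859-1) maps every byte to a code point; 0xF8 is "ø".
as_latin1 = raw.decode("latin-1")
print(utf8_ok, as_latin1)
```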

@armijnhemel armijnhemel changed the title RPM: not all strings UTF-8 RPM: not all strings are UTF-8 May 15, 2023
@generalmimon
Member

generalmimon commented May 15, 2023

@armijnhemel:

and the default encoding is UTF-8, so this will obviously not work. I don't know how I could fix this.

Well, if there isn't a single character encoding we could specify in the .ksy, the "next best thing" is to downgrade to a byte array:

   record_type_string_array:
     params:
       - id: num_values
         type: u4
     seq:
       - id: values
-        type: strz
+        terminator: 0
         repeat: expr
         repeat-expr: num_values

A byte array is the implicit type in .ksy specs when no type is given but the field size is delimited by size, size-eos: true or terminator.
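As a rough picture of what the byte-array variant hands back, here is a Python sketch that reads num_values zero-terminated byte strings from a buffer. The function name read_terminated_values is hypothetical and this is not the Kaitai Struct runtime; it assumes the terminator is consumed and not included in each value.

```python
from typing import List


def read_terminated_values(buf: bytes, num_values: int, terminator: int = 0) -> List[bytes]:
    """Read num_values byte strings from buf, each ended by the terminator byte."""
    values = []
    pos = 0
    for _ in range(num_values):
        # bytes.index raises ValueError if no terminator is left in the buffer.
        end = buf.index(bytes([terminator]), pos)
        values.append(buf[pos:end])  # the terminator itself is not included
        pos = end + 1                # skip past the terminator
    return values


# No decoding happens here, so non-UTF-8 bytes pass through untouched.
print(read_terminated_values(b"foo\x00b\xf8r\x00", 2))
```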

@armijnhemel
Collaborator Author

I had actually been thinking about that and looked at the docs, but they seem to indicate that terminator is only for strings. Using a byte array and then processing the strings in an external script would work for me.

@armijnhemel
Collaborator Author

Thinking a bit more about this: this probably isn't a good idea, as \x00 can be part of a valid UTF-8 string.

@armijnhemel
Collaborator Author

I found it easier to just work around it like this:

  1. parse regularly (which works for the vast majority of RPM files out there)
  2. if that fails, reparse with a copy of the RPM specification that has the above change (byte array instead of strz)
  3. decode all the strings by trying some common encodings, so they end up as valid UTF-8

This is cleaner than trying to fix it here.
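The three steps above can be sketched in Python roughly as follows. All names here are hypothetical: strict_parse and bytes_parse stand in for parsers generated from the regular rpm.ksy and from the byte-array copy, and the sketch assumes the strict parser raises UnicodeDecodeError on a badly encoded string. The list of encodings to try is also just an assumption.

```python
from typing import Callable, Iterable, List


def decode_with_fallback(raw: bytes, encodings: Iterable[str] = ("utf-8", "latin-1")) -> str:
    """Step 3: try each encoding in order and return the first successful decode."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Only reached if every listed encoding failed (latin-1 itself never fails).
    return raw.decode("utf-8", errors="replace")


def parse_strings(data: bytes,
                  strict_parse: Callable[[bytes], List[str]],
                  bytes_parse: Callable[[bytes], List[bytes]]) -> List[str]:
    try:
        return strict_parse(data)            # step 1: regular UTF-8 parse
    except UnicodeDecodeError:
        raw_values = bytes_parse(data)       # step 2: reparse as byte arrays
        return [decode_with_fallback(v) for v in raw_values]


# Hypothetical stand-ins for the two generated parsers, for demonstration only.
def strict_parse(data: bytes) -> List[str]:
    return [v.decode("utf-8") for v in data.split(b"\x00")[:-1]]


def bytes_parse(data: bytes) -> List[bytes]:
    return data.split(b"\x00")[:-1]


result = parse_strings(b"ok\x00Glomsr\xf8d\x00", strict_parse, bytes_parse)
print(result)
```

Putting utf-8 first matters: Latin-1 accepts any byte sequence, so it only makes sense as a last resort.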
