[MSHARED-1453] Canonicalize properties files #77

elharo · 2024-11-26T16:23:16Z

No description provided.

gnodet · 2024-11-26T17:11:51Z

src/main/java/org/apache/maven/archiver/PomPropertiesUtil.java

-                pw.println(l);
+            for (String line : lines) {
+                writer.write(line);
+                writer.write( '\n' );


System.lineSeparator() ?

Moot. This code can be deleted now that we're using properties.store

gnodet · 2024-11-26T17:13:41Z

src/main/java/org/apache/maven/archiver/PomPropertiesUtil.java

@@ -71,8 +71,8 @@ private void createPropertiesFile(Properties properties, Path outputFile, boolea
            return;
        }

-        try (PrintWriter pw = new PrintWriter(outputFile.toFile(), StandardCharsets.ISO_8859_1.name());
-                StringWriter sw = new StringWriter()) {
+        try ( Writer writer = Files.newBufferedWriter(outputFile, StandardCharsets.ISO_8859_1);


The loadPropertiesFile method let the Properties class decide the charset used (so it's using ISO_8859_1, as it's the default). I think we should use an OutputStream here and pass it to properties.store() without the charset.

good idea. That makes this much simpler.

elharo

The old code sorted the properties and removed comments. I've dropped that for now. Was it necessary for reproducible builds or something?

gnodet · 2024-11-26T18:13:06Z

The old code sorted the properties and removed comments. I've dropped that for now. Was it necessary for reproducible builds or something?

Ah, could be. Having a stable output is really important imho.

elharo · 2024-11-26T18:14:31Z

OK. I'm going to add some tests to this too. The class is sorely lacking in them.

gnodet · 2024-11-26T19:59:39Z

src/main/java/org/apache/maven/shared/archiver/PomPropertiesUtil.java

-                pw.println(l);
-            }
+        try (OutputStream out = Files.newOutputStream(outputFile)) {
+            properties.store(out, null);


I don't think we can use properties.store here.
The removal of comments and the ordering is definitely important for reproducible builds.
We could refactor the code to use BufferedReader.lines(), filter out comments, sort, and print using the streams api.
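A minimal sketch of that refactoring, assuming hypothetical names (`CanonicalProperties`, `canonicalLines`) that are not part of maven-archiver. It lets `Properties.store(OutputStream, String)` do the escaping, reads the result back with `BufferedReader.lines()`, drops the comment lines, and sorts what is left:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
class CanonicalProperties {

    // Returns the escaped key=value lines, without comments, in sorted order.
    static List<String> canonicalLines(Properties properties) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        // store(OutputStream) writes ISO 8859-1 bytes and handles all escaping
        properties.store(baos, null);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(baos.toByteArray()), StandardCharsets.ISO_8859_1))) {
            return reader.lines()
                    .filter(line -> !line.startsWith("#")) // drops the timestamp comment
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Because a leading `#` in a key is itself written with a backslash, filtering on `startsWith("#")` should only ever remove comment lines. Two `Properties` objects filled in a different order yield the same list.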

gnodet · 2024-11-26T22:14:03Z

src/main/java/org/apache/maven/shared/archiver/PomPropertiesUtil.java

+            for (String key : sortedPropertyNames) {
+                out.write(key);
+                out.write(": ");
+                out.write(unsortedProperties.getProperty(key));


I think that's wrong. We need escaping for both key and value here. See https://github.com/openjdk/jdk/blob/8c2b4f62714f26ab3bc4808c734502af632a1eef/src/java.base/share/classes/java/util/Properties.java#L686-L738

I'll add this.

Now I see why the original code wrote out a properties file and read it back in. That avoided the need to reimplement the escaping.
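To illustrate why the escaping matters, here is a small sketch (the class `NaiveWriteDemo` and its methods are hypothetical, for illustration only). It writes a key that contains a colon in the unescaped `key: value` form from the diff above and loads it back, then does the same through `Properties.store`:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Properties;

// Hypothetical demo class, for illustration only.
class NaiveWriteDemo {

    // Writes the entry without any escaping, as in the reviewed loop.
    static String naive(String key, String value) {
        return key + ": " + value + "\n";
    }

    // Writes the entry through Properties.store, which escapes key and value.
    static String stored(String key, String value) throws IOException {
        Properties p = new Properties();
        p.setProperty(key, value);
        StringWriter sw = new StringWriter();
        p.store(sw, null);
        return sw.toString();
    }

    static Properties load(String text) throws IOException {
        Properties p = new Properties();
        p.load(new StringReader(text));
        return p;
    }

    public static void main(String[] args) throws IOException {
        Properties fromNaive = load(naive("a:b", "c"));
        System.out.println(fromNaive.getProperty("a:b")); // null: the key was split at the first colon
        System.out.println(fromNaive.getProperty("a"));   // b: c

        Properties fromStore = load(stored("a:b", "c"));
        System.out.println(fromStore.getProperty("a:b")); // c
    }
}
```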

However, that might not always work. It relies on the Properties format being consistent across VMs and Java versions and it doesn't have to be. We're accounting for comments, separator character, and ordering here, but there are other possible differences that can occur. It's risky to assume that Eclipse Temurin 21 is going to produce the same output as OpenJDK 8.

I don't think it's risky. That part is very clearly specified. Changing the way properties files are stored would be a big breakage.
See https://docs.oracle.com/javase/8/docs/api/java/util/Properties.html#store-java.io.Writer-java.lang.String- for an exact explanation of what is written.
The way key / separator / value are written is clearly specified and has not changed from JDK 8 to JDK 24.

Then every entry in this Properties table is written out, one per line. For each entry the key string is written, then an ASCII =, then the associated element string. For the key, all space characters are written with a preceding \ character. For the element, leading space characters, but not embedded or trailing space characters, are written with a preceding \ character. The key and element characters #, !, =, and : are written with a preceding backslash to ensure that they are properly loaded.

Feel free to re-implement this mechanism, but I'm not really sure it's worth it.

To guarantee the output, I think we are going to need to reimplement the escaping here. This could be tricky. What are the rules about copy-pasting OpenJDK code into an Apache project? We probably can't do that since it's GPL, not Apache-licensed.

And also:

[...] the input/output stream is encoded in ISO 8859-1 character encoding. Characters that cannot be directly represented in this encoding can be written using Unicode escapes as defined in section 3.3 (https://docs.oracle.com/javase/specs/jls/se24/html/jls-3.html#jls-3.3) of The Java Language Specification; only a single 'u' character is allowed in an escape sequence.

To guarantee the output, I think we are going to need to reimplement the escaping here. This could be tricky. What are the rules about copy-pasting OpenJDK code into an Apache project? We probably can't do that since it's GPL, not Apache-licensed.

The output is already guaranteed by the javadoc of the Properties.store() method. I really don't see what additional guarantees you're looking for.

gnodet · 2024-11-26T22:14:39Z

src/main/java/org/apache/maven/shared/archiver/PomPropertiesUtil.java

+        try (Writer out = Files.newBufferedWriter(outputFile, StandardCharsets.ISO_8859_1)) {
+            for (String key : sortedPropertyNames) {
+                out.write(key);
+                out.write(": ");


The separator used in properties.store() is = without spaces.

Colon is allowed. Per Wikipedia: "There are 3 delimiting characters: equal ('='), colon (':') and whitespace (' ', '\t' and '\f')." But I'll change it. Note that the test case passes.
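Both points hold at once: `Properties.load` accepts all three delimiters, while `Properties.store` always writes `=`. A small sketch of the loading side (the class name `SeparatorDemo` is hypothetical, for illustration only):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical demo class, for illustration only.
class SeparatorDemo {

    static Properties parse(String text) throws IOException {
        Properties p = new Properties();
        p.load(new StringReader(text));
        return p;
    }

    public static void main(String[] args) throws IOException {
        // One entry per delimiter style: '=', ':' followed by a space, and plain whitespace
        Properties p = parse("a=1\nb: 2\nc 3\n");
        System.out.println(p.getProperty("a")); // 1
        System.out.println(p.getProperty("b")); // 2
        System.out.println(p.getProperty("c")); // 3
    }
}
```

So a test that only loads the file back passes with either separator. It takes a byte-for-byte comparison with the output of `store` to tell them apart.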

gnodet · 2024-11-26T23:03:43Z

I think the assumption driving this issue is wrong. All JDKs output the exact same file for a given input, apart from the comment date and the order of properties.
The output is very clearly specified in the Properties javadoc, and much more detailed than just a way to produce an output which can later be loaded.

gnodet · 2024-11-26T23:10:11Z

The javadoc from JDK 1.4 is actually more concise, because it only supported ISO 8859-1 at that time:

Then every entry in this Properties table is written out, one per line. For each entry the key string is written, then an ASCII =, then the associated element string. Each character of the key and element strings is examined to see whether it should be rendered as an escape sequence. The ASCII characters \, tab, form feed, newline, and carriage return are written as \\, \t, \f, \n, and \r, respectively. Characters less than \u0020 and characters greater than \u007E are written as \uxxxx for the appropriate hexadecimal value xxxx. For the key, all space characters are written with a preceding \ character. For the element, leading space characters, but not embedded or trailing space characters, are written with a preceding \ character. The key and element characters #, !, =, and : are written with a preceding backslash to ensure that they are properly loaded.

gnodet · 2024-11-26T23:28:48Z

I'd go for something like:

    private void createPropertiesFile(Properties properties, Path outputFile)
            throws IOException {
        Path outputDir = outputFile.getParent();
        if (outputDir != null && !Files.isDirectory(outputDir)) {
            Files.createDirectories(outputDir);
        }
        StringWriter sw = new StringWriter();
        properties.store(sw, null);
        String nl = System.lineSeparator();
        String output = Stream.of(sw.toString().split("\\R"))
                .filter(line -> !line.startsWith("#"))
                .sorted()
                .collect(Collectors.joining(nl, "", nl));
        try (Writer pw = new CachingWriter(outputFile, StandardCharsets.ISO_8859_1)) {
            pw.write(output);
        }
    }

gnodet · 2024-11-26T23:38:30Z

src/main/java/org/apache/maven/shared/archiver/PomPropertiesUtil.java

            }
        }
    }

+    private static String escape(String s) {
+        String escaped = StringEscapeUtils.escapeJava(s);


That's wrong. It's not the general Java escaping mechanism. The escaping is specific to the properties file format. There are rules for spaces and for the separators = and :, and the rules are slightly different for keys and for values.

Yes, I know. I'm working on the additional pieces. That's just the quickest way to handle the Unicode part.

Well, the quickest way is to not reimplement the whole thing.
Please look at https://github.com/apache/maven-archiver/pull/79/files

Unicode encoding is clearly not well supported in the current code, but the PR I pointed above fixes it. The reason is that calling Properties.store(Writer) bypasses the encoding, while calling Properties.store(OutputStream) correctly supports unicode.
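The difference between the two overloads can be seen in a short sketch (the class `StoreEncodingDemo` is hypothetical, for illustration only). With a `Writer`, the characters are passed through as-is and their fate depends on whatever charset that `Writer` applies later. With an `OutputStream`, anything outside printable ASCII is written as a Unicode escape:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class, for illustration only.
class StoreEncodingDemo {

    static String viaWriter(Properties p) throws IOException {
        StringWriter sw = new StringWriter();
        p.store(sw, null);
        return sw.toString();
    }

    static String viaOutputStream(Properties p) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        p.store(baos, null);
        return new String(baos.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        Properties p = new Properties();
        p.setProperty("name", "caf\u00e9"); // value ends with e-acute

        // The Writer overload keeps the raw character
        System.out.println(viaWriter(p).contains("name=caf\u00e9"));        // true
        // The OutputStream overload writes a six-character escape instead
        System.out.println(viaOutputStream(p).contains("name=caf\\u00E9")); // true
    }
}
```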

michael-o · 2024-11-27T08:22:17Z

I do remember @hboutemy doing this already somewhere...

gnodet · 2024-11-27T09:51:32Z

src/test/java/org/apache/maven/shared/archiver/PomPropertiesUtilTest.java

+        // Now read the file directly to check for alphabetical order and encoding
+        List<String> contents = Files.readAllLines(pomPropertiesFile, StandardCharsets.ISO_8859_1);
+        assertEquals(4, contents.size());
+        assertEquals("a\\ key\\ with\\\twhitespace=value\\ with\\\twhitespace", contents.get(0));


the test is wrong afaik. The mechanism is different for keys and values

Then every entry in this Properties table is written out, one per line. For each entry the key string is written, then an ASCII =, then the associated element string. For the key, all space characters are written with a preceding \ character. For the element, leading space characters, but not embedded or trailing space characters, are written with a preceding \ character. The key and element characters #, !, =, and : are written with a preceding backslash to ensure that they are properly loaded.
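As a sketch of what the quoted rules give for the key and value used in that test, the following stores one entry and prints the line that `Properties.store` wrote (the class `WhitespaceEscapingDemo` is hypothetical, for illustration only):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class, for illustration only.
class WhitespaceEscapingDemo {

    // Stores one entry via store(OutputStream) and returns its key=value line.
    static String storedLine(String key, String value) throws IOException {
        Properties p = new Properties();
        p.setProperty(key, value);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        p.store(baos, null);
        String text = new String(baos.toByteArray(), StandardCharsets.ISO_8859_1);
        for (String line : text.split("\\R")) {
            if (!line.startsWith("#")) { // skip the timestamp comment
                return line;
            }
        }
        throw new IllegalStateException("no entry line found");
    }

    public static void main(String[] args) throws IOException {
        // Prints: a\ key\ with\twhitespace=value with\twhitespace
        System.out.println(storedLine("a key with\twhitespace", "value with\twhitespace"));
    }
}
```

Spaces are escaped in the key but not inside the value, and a tab is written as the two characters backslash and `t` rather than as a backslash followed by a literal tab.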

gnodet · 2024-11-27T09:55:57Z

src/main/java/org/apache/maven/shared/archiver/PomPropertiesUtil.java

+            if (Character.isWhitespace(c) || c == '#' || c == '!' || c == '=' || c == ':') { // backslash escape
+                sb.append('\\');
+                sb.append(c);
+            } else if (c < 256) { // 8859-1


The check is wrong.

OK, this one I don't see. What's wrong here?

The following test succeeds:

    Properties p = new Properties();
    p.put("foo", "aéüb");
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    p.store(baos, null);
    String s = baos.toString();
    assertTrue(s.contains("foo=a\\u00E9\\u00FCb"));

So non plain ascii chars should be unicode encoded.
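In other words, the `c < 256` branch keeps Latin-1 characters literal, whereas `Properties.store(OutputStream, String)` escapes everything outside the printable ASCII range. A minimal sketch of just that range check, assuming a hypothetical helper `escapeNonAscii` (this is not the code under review, and it deliberately ignores the backslash, whitespace, and separator rules):

```java
// Hypothetical helper, for illustration only; not the code under review.
class UnicodeEscapeSketch {

    // Escapes only by character range. Backslash, whitespace and the
    // separator characters need their own handling before this step.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 || c > 0x7E) {
                // Backslash, 'u', then four uppercase hex digits
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same value as in the test above: a, e-acute, u-umlaut, b
        System.out.println(escapeNonAscii("a\u00e9\u00fcb"));
    }
}
```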

gnodet

I don't want to spend more time on re-implementing something from the JDK for no benefit really...

michael-o · 2024-11-27T11:09:29Z

I don't want to spend more time on re-implementing something from the JDK for no benefit really...

Me as well.

elharo added 2 commits November 26, 2024 11:15

inspections

e0b146f

linefeed

c14ba55

elharo requested a review from gnodet November 26, 2024 16:23

gnodet reviewed Nov 26, 2024

elharo added 2 commits November 26, 2024 12:51

Merge branch 'master' into clean

e669ca0

properties

5b93bbb

elharo commented Nov 26, 2024

elharo marked this pull request as draft November 26, 2024 18:14

wip

03a9f2c

elharo changed the title ~~A few nits picke dup by IntelliJ~~ A few nits picked up by IntelliJ Nov 26, 2024

gnodet reviewed Nov 26, 2024

elharo added 4 commits November 26, 2024 16:12

Add test

f66503b

check size

f925ac4

comment

953ac66

spotless

24111c5

gnodet reviewed Nov 26, 2024

elharo changed the title ~~A few nits picked up by IntelliJ~~ [MSHARED-1453] Canonicalize properties files Nov 26, 2024

revert ManifestConfiguration.java

6504b37

Unicode escape

2076c71

gnodet reviewed Nov 26, 2024

gnodet mentioned this pull request Nov 26, 2024

Remove unused parameters, use CachingWriter, simplify a bit #79

Merged

michael-o requested a review from hboutemy November 27, 2024 08:21

elharo added 2 commits November 27, 2024 04:42

escape whitespace

b4ed34c

remove dependency on commons-text

f04cb56

gnodet reviewed Nov 27, 2024

gnodet requested changes Nov 27, 2024

elharo added 2 commits November 27, 2024 06:36

escapeValue

6bed28b

hex encoding

8024b6f

elharo closed this Nov 27, 2024
