HTML tag abbr seems to confuse the parser #97

HidenoriKobayashi · 2023-10-06T02:43:59Z

I'm redirected to here from google/comprehensive-rust#1284.

The source:
https://github.com/google/comprehensive-rust/blob/a38a33c8fba58678d0a9127d9644242bce41ed94/src/bare-metal/microcontrollers/probe-rs.md

The po file (bit outdated but updating it does not resolve the issue):
https://github.com/google/comprehensive-rust/blob/a38a33c8fba58678d0a9127d9644242bce41ed94/po/ja.po#L13420

Now, if I make this change (just for testing purpose)

diff --git a/po/ja.po b/po/ja.po
index b2baaf215743..2a9eb0c872d4 100644
--- a/po/ja.po
+++ b/po/ja.po
@@ -13418,7 +13418,7 @@ msgstr ""
 
 #: src/bare-metal/microcontrollers/probe-rs.md:10
 msgid "`cargo-embed` is a cargo subcommand to build and flash binaries, log "
-msgstr ""
+msgstr "`cargo-embed`"
 
 #: src/bare-metal/microcontrollers/probe-rs.md:11
 msgid "RTT"

I get a result like this:

The strange thing about this is that string cargo-embed got merged into the last item in the list.

The text was updated successfully, but these errors were encountered:

mgeisler · 2023-10-06T09:22:49Z

Thanks @HidenoriKobayashi!

I think there might be two problems here. First, we have this test, which demonstrates that we chop up the text when we encounter a HTML tag (including a HTML comment):

mdbook-i18n-helpers/i18n-helpers/src/lib.rs

Lines 745 to 754 in 941a188

    
           fn extract_messages_inline_html() { 
        
               // HTML tags are skipped, but text inside is extracted: 
        
               assert_extract_messages( 
        
                   "Hi <script>alert('there');</script>", 
        
                   vec![ 
        
                       (1, "Hi "), // 
        
                       (1, "alert('there');"), 
        
                   ], 
        
               ); 
        
           }

There is also this test, which demonstrates that we lose text immediately following a (block-level?) HTML tag:

mdbook-i18n-helpers/i18n-helpers/src/lib.rs

Lines 936 to 950 in 941a188

    
               fn extract_messages_details() { 
        
                   // This isn't great: we lose text following a HTML tag: 
        
                   assert_extract_messages( 
        
                       "Preamble\n\ 
        
                        <details>\n\ 
        
                        Some Details\n\ 
        
                        </details>\n\ 
        
                        \n\ 
        
                        Postamble", 
        
                       vec![ 
        
                           (1, "Preamble"), // 
        
                           // Missing "Some Details" 
        
                           (6, "Postamble"), 
        
                       ], 
        
                   );

I think it's an open question if we could or should extract the HTML into the POT file. One concern would be if we should just extract foo for something like <abbr title="foo">bar</abbr>? That would probably require us to know that the title attribute has translatable text (which is doable). We would still be chopping up the text — which I guess is nicer than asking everybody to deal with HTML in the translation files?

This is only about what gets extracted and put into the POT file. The other half of this is how we weave things together when we translate the Markdown later.

I just tried adding a little unit test for mdbook-gettext:

    #[test]
    fn test_translate_html() {
        let catalog = create_catalog(&[("foo ", "FOO "), ("bar", "BAR"), (" baz", " BAZ")]);
        assert_eq!(
            translate("foo <b>bar</b> baz", &catalog),
            "FOO <b>BAR</b> BAZ"
        );
    }

it fails with

---- tests::test_translate_html stdout ----
thread 'tests::test_translate_html' panicked at i18n-helpers/src/bin/mdbook-gettext.rs:239:9:
assertion failed: `(left == right)`

Diff < left / right > :
<FOO <b>BAR</b>BAZ
>FOO <b>BAR</b> BAZ

which shows that we lose a ' ' immediately following the </b>!

mgeisler · 2023-10-06T09:44:44Z

Further investigation shows that extract_events(" BAZ", state) give a Text(Borrowed("BAZ")) back! It should really give " BAZ" back so this seems to be where the space is lost.

mgeisler · 2023-10-06T09:52:28Z

This is actually fair: Markdown does not distinguish between a line of text starting with " foo" vs "foo". So when we extract the events, we lose the leading whitespace.

I'm not entirely sure where where we should account for this, but perhaps extract_events could add it back in? Or perhaps the caller should account for being in an inline context and then add it? Note that we have a pulldown_cmark_to_cmark::State object available and it knows that last_was_html... so that might be helpful here?

mgeisler · 2023-10-06T09:57:53Z

So I think these two tests should work:

diff --git a/i18n-helpers/src/bin/mdbook-gettext.rs b/i18n-helpers/src/bin/mdbook-gettext.rs
index b686be4..291bb2c 100644
--- a/i18n-helpers/src/bin/mdbook-gettext.rs
+++ b/i18n-helpers/src/bin/mdbook-gettext.rs
@@ -233,6 +233,15 @@ mod tests {
         );
     }
 
+    #[test]
+    fn test_translate_html() {
+        let catalog = create_catalog(&[("foo ", "FOO "), ("bar", "BAR"), (" baz", " BAZ")]);
+        assert_eq!(
+            translate("foo <b>bar</b> baz", &catalog),
+            "FOO <b>BAR</b> BAZ"
+        );
+    }
+
     #[test]
     fn test_translate_table() {
         let catalog = create_catalog(&[
diff --git a/i18n-helpers/src/lib.rs b/i18n-helpers/src/lib.rs
index 3bf2a5f..484d348 100644
--- a/i18n-helpers/src/lib.rs
+++ b/i18n-helpers/src/lib.rs
@@ -590,6 +590,30 @@ mod tests {
         assert_eq!(extract_events("", None), vec![]);
     }
 
+    #[test]
+    fn extract_events_leading_whitespace() {
+        assert_eq!(
+            extract_events("  foo", None),
+            vec![
+                (1, Start(Paragraph)),
+                (1, Text("  foo".into())),
+                (1, End(Paragraph)),
+            ]
+        );
+    }
+
+    #[test]
+    fn extract_events_trailing_whitespace() {
+        assert_eq!(
+            extract_events("foo  ", None),
+            vec![
+                (1, Start(Paragraph)),
+                (1, Text("foo  ".into())),
+                (1, End(Paragraph)),
+            ]
+        );
+    }
+
     #[test]
     fn extract_events_paragraph() {
         assert_eq!(

The mixing of HTML and Markdown breaks our translation pipeline: we see the HTML and fail to parse things correctly. This might be google/mdbook-i18n-helpers#97, but I'm not 100% sure. The fix is to make put the HTML on its own line: then the Markdown is parsed again inside.

The mixing of HTML and Markdown breaks our translation pipeline: we see the HTML and fail to parse things correctly. This might be google/mdbook-i18n-helpers#97, but I'm not 100% sure. The fix is to make put the HTML on its own line: then the Markdown is parsed again inside. Fixes #1527.

mgeisler · 2024-01-09T09:41:31Z

@kdarkhan, the problem here is about how we lose a ' ' when the Markdown AST is a mix of Text and Html nodes. I believe #125 is related: it's about how breaking apart the text on HTML elements causes problems. I think we could solve both problems if we would let Text and Html nodes be grouped together.

kdarkhan · 2024-05-22T17:41:40Z

I believe this should be resolved now with #195 merged.

mgeisler · 2024-05-27T13:35:20Z

I think so too, let's close it!

mgeisler added bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed labels Oct 6, 2023

mgeisler mentioned this issue Nov 30, 2023

Fix Markdown in build-rules.md google/comprehensive-rust#1530

Merged

mgeisler mentioned this issue Jan 9, 2024

Wrong normalization of message with HTML #125

Closed

kdarkhan self-assigned this Jan 29, 2024

kdarkhan removed their assignment May 22, 2024

mgeisler closed this as completed May 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HTML tag abbr seems to confuse the parser #97

HTML tag abbr seems to confuse the parser #97

HidenoriKobayashi commented Oct 6, 2023

mgeisler commented Oct 6, 2023 •

edited

Loading

mgeisler commented Oct 6, 2023

mgeisler commented Oct 6, 2023

mgeisler commented Oct 6, 2023

mgeisler commented Jan 9, 2024

kdarkhan commented May 22, 2024 •

edited

Loading

mgeisler commented May 27, 2024

HTML tag abbr seems to confuse the parser #97

HTML tag abbr seems to confuse the parser #97

Comments

HidenoriKobayashi commented Oct 6, 2023

mgeisler commented Oct 6, 2023 • edited Loading

mgeisler commented Oct 6, 2023

mgeisler commented Oct 6, 2023

mgeisler commented Oct 6, 2023

mgeisler commented Jan 9, 2024

kdarkhan commented May 22, 2024 • edited Loading

mgeisler commented May 27, 2024

mgeisler commented Oct 6, 2023 •

edited

Loading

kdarkhan commented May 22, 2024 •

edited

Loading