Creating an empty .po file causes non-ASCII characters to be silently discarded from msgids #442

stevecotton · 2023-09-28T12:12:58Z

The quickstart guide in the po4a(1) manpage says "Simply create an empty file with the .pot extension in the specified po_directory (e.g. man/po4a/foo.pot), and po4a will fill it with the expected content."

I assumed that .po files could be created in the same way, by creating an empty one and running po4a po4a.cfg. Doing that fills the .po file, but silently strips non-ASCII characters out of the msgids as it does so. This seems to be a deliberate feature of gettext's msgmerge - if it's given an empty .po file and a UTF-8 .pot file, it assumes that the .po file should be ASCII, and strips letters with umlauts out of the msgids. Running it directly gives warnings about that, but they aren't shown when running it via po4a.

I have a German source file, and enabled UTF-8 in the .cfg file:

[po_directory] po
[type: text] 02_Beispiel.md en:en/02_Example.md
[options] --master-charset UTF-8 --localized-charset UTF-8

I'm not submitting a patch, as I'm not sure which way you'd prefer to handle it, but suggest either checking for empty files or adding "Don't create empty .po files, as these may cause the wrong charset to be used. Instead use the translators' tools to create a .po from the .pot." to the quickstart.
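
A minimal sketch of the first option (checking for empty files), written as a pre-flight step that could run before po4a. The function name `find_headerless_po` is hypothetical and not part of po4a; the sketch assumes that "empty, or no `charset=` declaration" is a good enough proxy for the files that trigger the problem.

```shell
#!/bin/sh
# Hypothetical helper, not part of po4a: list the .po files in a directory
# that are empty or carry no "charset=" declaration.
# Prints one offending path per line and returns 1 if any were found.
find_headerless_po() {
    po_dir=$1
    found=0
    for po in "$po_dir"/*.po; do
        [ -e "$po" ] || continue    # the glob matched nothing
        if [ ! -s "$po" ] || ! grep -q 'charset=' "$po"; then
            printf '%s\n' "$po"
            found=1
        fi
    done
    return "$found"
}
```

It could be used as `find_headerless_po po || echo "fix these files before running po4a" >&2`.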

Debian bug #1022216 seems related, but is using po4a-updatepo.

I'm using Debian Bookworm with gettext version 0.21-12, and have checked that the bug is still reproducible with po4a c9f5cf9.


ciampix · 2023-12-31T13:31:19Z

Hi Steve,
why not simply update the po4a documentation to recommend adding those two options, as you did?
--master-charset UTF-8
--localized-charset UTF-8

stevecotton · 2023-12-31T16:30:39Z

I was going to refer to msgmerge's docs about how they want to preserve whichever charset the translator chooses, and that your suggestion would force UTF-8 instead. However, I've just found an interesting behavior of msgmerge 0.21 (as shipped in Debian stable), which looks like it'll need some additional bugs filed; I'll get to that, but not tonight.

When running msgmerge -U temp.po somefile.pot:

non-existent .po file: msgmerge refuses to create the file
empty .po file, or no header: msgmerge does not add a header, assumes ASCII, and mangles UTF-8
.po file header says ASCII: msgmerge changes the header to say UTF-8, and writes UTF-8
.po file header says GB2312: msgmerge changes the header to say UTF-8, and writes UTF-8
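
Given the cases above, one workaround is to seed a new .po file with a header instead of leaving it empty. The sketch below is an assumption-laden illustration: `write_po_stub` is a hypothetical name, the header is deliberately minimal, and I have not checked that every gettext version treats such a stub the same way. Where gettext is installed, msginit is the usual tool for creating a .po from a .pot.

```shell
#!/bin/sh
# Hypothetical helper: create a .po file containing only a header entry
# that declares UTF-8, so the file is neither empty nor header-less.
# Refuses to touch a file that already has content.
write_po_stub() {
    po=$1
    po_lang=$2
    if [ -s "$po" ]; then
        echo "refusing to overwrite non-empty $po" >&2
        return 1
    fi
    # Unquoted here-document: $po_lang is expanded, while \n stays a
    # literal backslash-n, which is what the PO format expects.
    cat > "$po" <<EOF
msgid ""
msgstr ""
"Language: $po_lang\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
EOF
}
```

For example, `write_po_stub po/en.po en` followed by the usual po4a run.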

mquinson · 2024-01-02T00:49:49Z

I have yet another situation with msgmerge. I'm trying to update a UTF-8 PO file against an ISO-8859 POT file, and it mangles the non-ASCII characters:

$ ls
iso8859.pot  iso8859.up.po
$ file *
iso8859.pot:   GNU gettext message catalogue, ISO-8859 text
iso8859.up.po: GNU gettext message catalogue, Unicode text, UTF-8 text
$ grep charset= *
iso8859.pot:"Content-Type: text/plain; charset=ISO-8859-1\n"
iso8859.up.po:"Content-Type: text/plain; charset=UTF-8\n"
$ iconv -f UTF-8 -t Latin1 iso8859.up.po -o /dev/null
$ iconv -f iso-8859-1 -t UTF-8 iso8859.pot -o /dev/null

Neither iconv command reports an error, so each file's bytes are consistent with the charset declared in its header (and with what file reports). Let's now try to msgmerge the PO file.

$ msgmerge iso8859.up.po iso8859.pot
# Language up translations for po package
# Copyright (C) 2020 Free Software Foundation, Inc.
# This file is distributed under the same license as the po package.
# Automatically generated, 2020.
#
msgid ""
msgstr ""
"Project-Id-Version: po 4a\n"
"POT-Creation-Date: 2024-01-02 02:22+0100\n"
"PO-Revision-Date: 2020-04-09 17:33+0200\n"
"Last-Translator: Automatically generated\n"
"Language-Team: none\n"
"Language: up\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

#. type: =head1
#: iso8859.pod:1
#, fuzzy
iso8859.up.po:21: invalid multibyte sequence
iso8859.up.po:21: invalid multibyte sequence
msgid "Ttulo de prueba"
msgstr "TÍTULO DE PRUEBA"

#. type: textblock
#: iso8859.pod:3
#, fuzzy
iso8859.up.po:26: invalid multibyte sequence
iso8859.up.po:26: invalid multibyte sequence
iso8859.up.po:26: invalid multibyte sequence
iso8859.up.po:26: invalid multibyte sequence
msgid "blbleble llalala"
msgstr "BLÈBLEBLE LÁLALALA"

All non-ASCII characters in the msgids get mangled for some reason (the 'invalid multibyte sequence' lines come from msgmerge's stderr and are not part of the actual file content). I'm puzzled. I'm using msgmerge 0.21 from Debian testing.

Any help would be really welcome here.
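
For anyone wanting to repeat the grep and iconv consistency check from the start of this comment on their own files, here is a rough sketch. The function names are hypothetical. Note the limitation: a single-byte charset such as ISO-8859-1 accepts any byte sequence, so the check is mostly meaningful for files that declare UTF-8.

```shell
#!/bin/sh
# Hypothetical helper: print the charset declared in a PO/POT header,
# e.g. "UTF-8" for a line like "Content-Type: text/plain; charset=UTF-8\n".
po_declared_charset() {
    LC_ALL=C sed -n 's/^"Content-Type:.*charset=\([^\\"]*\).*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: succeed if the file decodes cleanly in its declared
# charset. Returns 2 when no charset is declared, and iconv's failure
# status when the bytes are invalid for the declared charset.
po_encoding_ok() {
    cs=$(po_declared_charset "$1")
    [ -n "$cs" ] || return 2
    iconv -f "$cs" -t "$cs" "$1" > /dev/null 2>&1
}
```

Usage would be along the lines of `po_encoding_ok iso8859.up.po || echo "header and bytes disagree"`.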

mquinson · 2024-01-02T00:54:11Z

The files I used in this test:
iso8859.up.po.txt
iso8859.pot.txt

mquinson · 2024-01-02T01:08:13Z

Submitted as https://savannah.gnu.org/bugs/index.php?65104

mquinson · 2024-01-02T01:09:40Z

I guess that we should force UTF-8 on PO and POT files to stay safe. Do you have a better idea?
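
To make "force UTF-8" concrete, here is a dependency-free sketch of re-encoding a catalog: convert the bytes with iconv, then rewrite the charset declaration in the header. This is only an illustration of the idea and not necessarily what po4a does; `catalog_to_utf8` is a hypothetical name. Where gettext is installed, `msgconv --to-code=UTF-8` does the same job and is probably the better choice.

```shell
#!/bin/sh
# Hypothetical helper: write a UTF-8 copy of a PO/POT catalog.
# Assumes the source header declares its charset correctly.
catalog_to_utf8() {
    src=$1
    dst=$2
    # Declared charset from the header line, e.g. ISO-8859-1
    from=$(LC_ALL=C sed -n 's/^"Content-Type:.*charset=\([^\\"]*\).*/\1/p' "$src" | head -n 1)
    if [ -z "$from" ]; then
        echo "$src: no charset declared" >&2
        return 2
    fi
    # Convert the bytes first, and stop if iconv rejects the input
    if ! iconv -f "$from" -t UTF-8 "$src" > "$dst.tmp"; then
        rm -f "$dst.tmp"
        return 1
    fi
    # Then rewrite only the charset declaration on the Content-Type line
    LC_ALL=C sed 's/^\("Content-Type:.*charset=\)[^\\"]*/\1UTF-8/' "$dst.tmp" > "$dst"
    rm -f "$dst.tmp"
}
```

For example, `catalog_to_utf8 iso8859.pot utf8.pot` before handing the POT file to msgmerge.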

mquinson closed this as completed in 5078a53 Jan 5, 2024

mquinson mentioned this issue Jan 5, 2024

Temp pot file not created #438

Closed
