Creating an empty .po file causes non-ASCII characters to be silently discarded from msgids #442

Closed
stevecotton opened this issue Sep 28, 2023 · 6 comments

Comments

@stevecotton

The quickstart guide in the po4a(1) manpage says "Simply create an empty file with the .pot extension in the specified po_directory (e.g. man/po4a/foo.pot), and po4a will fill it with the expected content."

I assumed that .po files could be created in the same way, by creating an empty one and running po4a po4a.cfg. Doing that fills the .po file, but silently strips non-ASCII characters out of the msgids as it does so. This seems to be a deliberate feature of gettext's msgmerge - if it's given an empty .po file and a UTF-8 .pot file, it assumes that the .po file should be ASCII, and strips letters with umlauts out of the msgids. Running it directly gives warnings about that, but they aren't shown when running it via po4a.

I have a German source file, and enabled UTF-8 in the .cfg file:

[po_directory] po
[type: text] 02_Beispiel.md en:en/02_Example.md
[options] --master-charset UTF-8 --localized-charset UTF-8

I'm not submitting a patch, as I'm not sure which way you'd prefer to handle it, but suggest either checking for empty files or adding "Don't create empty .po files, as these may cause the wrong charset to be used. Instead use the translators' tools to create a .po from the .pot." to the quickstart.
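For illustration, the "checking for empty files" option could look roughly like the sketch below. The function name `po_has_charset` is hypothetical (it is not part of po4a or gettext), and the check only looks for a `charset=` value on a `Content-Type` header line; it does not validate that value.

```shell
#!/bin/sh
# Hypothetical helper, not part of po4a or gettext: succeeds only if the
# given PO file is non-empty and declares a charset in its header.
po_has_charset() {
    # -s: the file exists and is not empty.
    # grep -q: look for a Content-Type header line carrying a charset value.
    [ -s "$1" ] && grep -q '^"Content-Type:.*charset=[^\\"]' "$1"
}

# Example use before running po4a (the path is only an example):
# po_has_charset po/en.po || echo "po/en.po: no charset header" >&2
```

An empty file fails the `-s` test, and a file with entries but no header fails the `grep`, so both cases described above would be flagged.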

Debian bug #1022216 seems related, but is using po4a-updatepo.

I'm using Debian Bookworm with gettext version 0.21-12, and have checked that the bug is still reproducible with po4a c9f5cf9.

@ciampix
Collaborator

ciampix commented Dec 31, 2023

Hi Steve,
Why not simply update the po4a documentation to recommend adding those two options, as you did?
--master-charset UTF-8
--localized-charset UTF-8

@stevecotton
Author

I was going to refer to msgmerge's docs about how they want to preserve whichever charset the translator chooses, and that your suggestion would force UTF-8 instead. However, I've just found an interesting behavior of msgmerge 0.21 (as shipped in Debian stable), which looks like it'll need some additional bugs filed; I'll get to that, but not tonight.

When running msgmerge -U temp.po somefile.pot:

  • non-existent .po file: msgmerge refuses to create the file
  • empty .po file, or no header: msgmerge does not add a header, assumes ASCII, and mangles UTF-8
  • .po file header says ASCII: msgmerge changes the header to say UTF-8, and writes UTF-8
  • .po file header says GB2312: msgmerge changes the header to say UTF-8, and writes UTF-8
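Given the cases above, one possible workaround is to make sure the .po file has a header that declares a charset before msgmerge sees it. The sketch below writes a minimal UTF-8 header into a missing or empty file. The name `seed_po_header` is hypothetical, the header is deliberately minimal, and whether msgmerge then behaves as wanted has not been verified here. gettext's own `msginit` is the usual tool for creating a .po from a .pot and is probably preferable where it is available.

```shell
#!/bin/sh
# Hypothetical helper: give a missing or empty PO file a minimal header that
# declares UTF-8, so a later msgmerge run does not have to assume ASCII.
seed_po_header() {
    po=$1
    # Leave any file that already has content untouched.
    [ -s "$po" ] && return 0
    # %s prints each argument literally, so the \n stays a two-character
    # escape inside the PO string, as the PO format expects.
    printf '%s\n' \
        'msgid ""' \
        'msgstr ""' \
        '"MIME-Version: 1.0\n"' \
        '"Content-Type: text/plain; charset=UTF-8\n"' \
        '"Content-Transfer-Encoding: 8bit\n"' > "$po"
}

# Example use, with the file names from the command above:
# seed_po_header temp.po && msgmerge -U temp.po somefile.pot
```

Forcing UTF-8 this way overrides whatever charset a translator might have chosen, so it only makes sense for files that are known to be new.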

@mquinson
Owner

mquinson commented Jan 2, 2024

I have yet another situation with msgmerge. I'm trying to update a UTF-8 PO file against an ISO-8859 POT file, and it mangles the non-ASCII chars:

$ ls
iso8859.pot  iso8859.up.po
$ file *
iso8859.pot:   GNU gettext message catalogue, ISO-8859 text
iso8859.up.po: GNU gettext message catalogue, Unicode text, UTF-8 text
$ grep charset= *
iso8859.pot:"Content-Type: text/plain; charset=ISO-8859-1\n"
iso8859.up.po:"Content-Type: text/plain; charset=UTF-8\n"
$ iconv -f UTF-8 -t Latin1 iso8859.up.po -o /dev/null
$ iconv -f iso-8859-1 -t UTF-8 iso8859.pot -o /dev/null

The iconv commands do not output any error, proving that the file encoding matches the declared charset in the header (and the file guess). Let's now try to msgmerge the PO file.
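The same check can be driven by the header itself instead of typing the charset by hand. The sketch below reads the `charset=` value from the file and asks iconv to decode the file with it. The name `check_declared_charset` is hypothetical, and it assumes the header uses the usual `"Content-Type: ...; charset=NAME\n"` form with a name that iconv recognizes.

```shell
#!/bin/sh
# Hypothetical helper: decode a PO/POT file with the charset its own header
# declares, and fail if iconv rejects the byte stream.
check_declared_charset() {
    # Extract the value after "charset=" from the first Content-Type line.
    cs=$(sed -n 's/^"Content-Type:.*charset=\([^\\"]*\).*/\1/p' "$1" | head -n 1)
    [ -n "$cs" ] || { echo "$1: no charset declared" >&2; return 2; }
    # iconv exits non-zero when the input is not valid in the source charset.
    iconv -f "$cs" -t UTF-8 "$1" > /dev/null
}

# Example use, with the file names from the listing above:
# check_declared_charset iso8859.up.po && check_declared_charset iso8859.pot
```

Note that decoding from a single-byte charset such as ISO-8859-1 typically succeeds for any input, so this check is mostly informative for files that declare a multibyte encoding such as UTF-8.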

$ msgmerge iso8859.up.po iso8859.pot
# Language up translations for po package
# Copyright (C) 2020 Free Software Foundation, Inc.
# This file is distributed under the same license as the po package.
# Automatically generated, 2020.
#
msgid ""
msgstr ""
"Project-Id-Version: po 4a\n"
"POT-Creation-Date: 2024-01-02 02:22+0100\n"
"PO-Revision-Date: 2020-04-09 17:33+0200\n"
"Last-Translator: Automatically generated\n"
"Language-Team: none\n"
"Language: up\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

#. type: =head1
#: iso8859.pod:1
#, fuzzy
iso8859.up.po:21: invalid multibyte sequence
iso8859.up.po:21: invalid multibyte sequence
msgid "Ttulo de prueba"
msgstr "TÍTULO DE PRUEBA"

#. type: textblock
#: iso8859.pod:3
#, fuzzy
iso8859.up.po:26: invalid multibyte sequence
iso8859.up.po:26: invalid multibyte sequence
iso8859.up.po:26: invalid multibyte sequence
iso8859.up.po:26: invalid multibyte sequence
msgid "blbleble llalala"
msgstr "BLÈBLEBLE LÁLALALA"

All non-ASCII chars of the msgids get mangled for some reason (the 'invalid multibyte sequence' lines come from msgmerge's stderr and are not part of the actual file content). I'm puzzled. I'm using msgmerge 0.21 from Debian testing.

Any help would be really welcome here.

@mquinson
Owner

mquinson commented Jan 2, 2024

The files I used in this test:
iso8859.up.po.txt
iso8859.pot.txt

@mquinson
Owner

mquinson commented Jan 2, 2024

I guess that we should force UTF-8 on PO and POT files to stay safe. Do you have a better idea?
