Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

msgfmt segfaults building git 2.19.1 #39

Closed
rofl0r opened this issue Nov 25, 2018 · 19 comments
Closed

msgfmt segfaults building git 2.19.1 #39

rofl0r opened this issue Nov 25, 2018 · 19 comments

Comments

@rofl0r
Copy link
Member

rofl0r commented Nov 25, 2018

to reproduce:

msgfmt --check --statistics -o ru.mo po/ru.po

resulting in:

Program received signal SIGSEGV, Segmentation fault.
0x000000000040273c in process (in=0x605340, out=0x7ffff7ffb020)
    at src/msgfmt.c:471
471                     d.strlist[i].trans->off += d.off + d.len[pe_msgid] + d.len[pe_plural] + d.len[pe_ctxt];
(gdb) p i
$1 = 1
(gdb) p d.strlist[i]
$2 = {str = {len = 0, off = 53980}, trans = 0x0}
(gdb) p d.strlist[i-1]
$3 = {str = {len = 0, off = 53980}, trans = 0x619480}

the trans pointer of the second item is NULL. interestingly the 2nd item seems identical to the first apart from the trans pointer. i guess it's from the empty msgid "" string right at the start of the file.

@xhebox do you have time to look into this ?

@xhebox
Copy link
Collaborator

xhebox commented Nov 25, 2018

@rofl0r I can take a try, but i will let you know if i can not complete it under this broken system. I just escape from the busy and hard time as a freshman. And i'm currently busy on fixing my broken toolchain and broken distribution.

@xhebox
Copy link
Collaborator

xhebox commented Nov 27, 2018

@rofl0r here's why:

msgid "ordered %u objects, expected %<PRIu32>"
msgstr ""

Here, a sysdep msgid, but an empty msgstr, resulting in a mismatched num of msgid/msgstr(msgid - msgstr == 1). And my dirty hack to drop empty msgstr translation makes the num equal improperly, so abort() is avoided, and segfault came naturally.

We really need a good logic in process_line_callback, so that we can drop an invalid translation anytime we want. One has came up to my mind: we could use a struct to record the whole ctxt/id/plural/str process when collect_size/copy_string, rather than playing with just a priv_type that could only tell me what previous statement is...

This actually reminds me that the poparser is also a bit tricky... These issues occurred over and over again on gettext-tiny. Not a surprise, we can not use hacks to meet the needs anymore. So i propose a new poparser, which:

  1. parses po files into blocks. Each block corresponds to a msgid, a possible msgctxt/msgid_plural, and possibly multiple msgstr, and some recognizable comments(start with #:, #,) generated by xgettext. Blank lines and normal comments are ignored.

  2. is not using action_tbl in feed_line() and not calling the cb_func everytime a newline is fed. action_tblis just a 2d array, it can just care about two states. But we have to care about at least four states(ctxt/id/plural/str). We need to care about the whole block, or we need a bigger state machine in another word.

We should just parse the new lines, check if the block is terminated, and leave the all work to the callback function once a new block is parsed.

@rofl0r
Copy link
Member Author

rofl0r commented Nov 27, 2018

msgid "ordered %u objects, expected %"
msgstr ""

what does gnu gettext do with this one ?

@xhebox
Copy link
Collaborator

xhebox commented Nov 27, 2018

Drop it. Empty msgstr is ignored by default. -C will trigger an error, since X/Open format strictly not allow the existence of emty msgstr. But to be honest, empty msgstr makes translation more convenient for translators.

@rofl0r
Copy link
Member Author

rofl0r commented Nov 27, 2018

ok. do you have time to work on your suggested poparser modifications ? i suspect it should not be too much work...

@xhebox
Copy link
Collaborator

xhebox commented Nov 27, 2018

I have. No more than 1k loc, i guess. One week should be OK. Since i have other things in the meantime.

@rofl0r
Copy link
Member Author

rofl0r commented Nov 28, 2018

1000 lines ? i hope not :)
joking aside, that's really good news, thanks!

xhebox added a commit that referenced this issue Dec 2, 2018
as stated in #39,
the old parser is not good enough to handle all the po files. Similiar
issues occurred over and over again because our dirty hacks in the
project.

so, i propose and implement this new parser, which:

1. needs to parse a po file two times. the first time will acquire
the maximum width of every entry. the second time will copy the
well-prepared contents into struct po_message, and pass it to the
callback function.

2. every struct po_message contains all the information of one
translation: msgid, msgctxt, msgid_plural, and msgstrs. comments may be
added later.

the logic of code is quite simple, nothing special need to explain. the
special points are:

1. the first time, new parser gives no infomation about what the string
is like. neither will the new parser give the exact size(sysdeped), nor
you can calculate the exact size on your own. only xxx_len, strlen, sy-
sdep in po_message_t is available. xxx_len is the length of the
corressponding entry, strlen is almost the same.

2. sysdep present how many cases the string could be expanded to. since
you know the length of the original string and the original string is
always longer than the converted one, you can get a safe buffer size to
work at the second stage.

3. poparser_sysdep(), a function like unescape(), with a bit flag
as the third argument. that is, three bits correspond to st_priu32,
st_priu64, st_priumax. since there're only up to two cases for every
kind of sysdep, you could count from 0 to msg->sysdep-1, and
poparser_sysdep will iterate every possible case eventually.
@xhebox
Copy link
Collaborator

xhebox commented Dec 2, 2018

@rofl0r i've opened a new branch. actually the newpoparser managed to reduce some codes... msgfmt greatly reduced by 243 while poparser.c increased by 152.

let's go back to the main topic. i only completed the work of poparser and msgfmt this weekend. msgmerge, though, is quite simple to rewrite, i have to leave it for the next weekend(or earlier). you could also try to rewrite, since msgmerge is just doing copy. you could test the new msgfmt freely. bug is welcome.

besides, the new parser is completely capable of parsing po header and flags in comments. i only parse charset, nplurals, fuzzy as the old one, but we could now extend those with ease.

@rofl0r
Copy link
Member Author

rofl0r commented Dec 2, 2018

great job. i left a number of comments on your commits. i think you oversimplified the sysdep replacement code, and the version you provide is unfit for anything but the most simple cases. imo it should be possible to fit in the previous code, which i tested rigorously back when i wrote it. anyway, i'm waiting for your comments...

@rofl0r
Copy link
Member Author

rofl0r commented Dec 2, 2018

for the record, this was the commit adding the sysdep expansion: 438a47c

@xhebox
Copy link
Collaborator

xhebox commented Dec 3, 2018

  1. cmake version string. i forgot to pull the newest commits before coding. and i picked the wrong code when rebasing.

  2. indent problems. i've switched to spacevim, but somehow my vim defaults to softtabs(spaces). also, it causes some useless indents while i use '='.

  3. sysdep. imo, it's good enough, though i just found a bug when i typed this comment. i dont know why your codes will be so complicated, maybe you have a 'nice' misunderstanding on the problem. let me give a full description on my solution.

first, our problem: replace %<xxx> with %x in short. it's straight forward to just copy when there's only one %<xxx>. what we need to consider is the case where there're multiple %<xxx>.

let case(x) be a function that accepts %<xxx> as the input and outputs a set of specifiers like {u, lu}. we will need to do a mulplication to collect every case: case(%<xxx1>) x case(%<xxx2>) x ... x case(%<xxxn>) = {cartesian product}.

but attention, we dont need to do mulplication for duplicated %<xxx>, e.g. no case(%<PRIu32>) x case(%<PRIu32>). because, how could there be a platform where PRIu32 could be u and lu at the same time?

second, my solution: i want my sysdep() be a function like unescape() and does not contain malloc().

according to the above analysis, i dont need to replace identical sysdep strings into different specifiers. so for every case, sysdep() just need to pick up the right specifier for the correspoding sysdep string. and just search for the sysdep string, copy the specifier and skip the sysdep string and so on.

and the trick is that i use the num of case to pick up the right specifier. there're only up to four cases: case(%<PRIu32>) x case(%<PRIu64>) x case(%<PRIuMAX>) = { {u, lu, ju}, {u,llu,ju}, {lu,lu,ju}, {lu,llu,ju} }. and %<PRIuMAX> wont change all the time. so 4 cases corresponds to a two-bit flag control %<PRIu32> and %<PRIu64>. but this fails, since i wont know what 2-case stand for 32 or 64...

third. how to fix bug: we could replace msg->sysdep with an array sysdep[3](which adds to sysdep). in this way, we could iterate like:

callback(in, out, n) {

n[3] = {0};
for (k=true;k;) {
  k = false;
  sysdep(in, out,  n);

  for (i=0; i<3; i++) {
     if (n[i] < (sysdep[i]-1)) {
        n[i]++;
        k= true;
        break;
     }
  }
}

}
  1. i currently have no time to code. i could get enough time to solve problems after at least two days. i'd like to leave the work to you, or just wait for me. never mind, i found i need to correct tiny to update my broken distribution, so i have pushed new commits.

xhebox added a commit that referenced this issue Dec 3, 2018
detail has been descriped at
#39 (comment).

in short, we can not tell whether a `msg->sysdep = 2` stands for
`case(%<PRIu32>)` or `case(%<PRIu64>)`. we need to record the
appearances of every sysdep kind separately.
@rofl0r
Copy link
Member Author

rofl0r commented Dec 3, 2018

but attention, we dont need to do mulplication for duplicated %, e.g. no case(%) x case(%). because, how could there be a platform where PRIu32 could be u and lu at the same time?

our produced .mo files need to be portable across systems.
that means it's possible to create a .mo file on x86_64, but use it on i386.
therefore an input string containing e.g. PRIu64, will depending on the libc of the user, generate either "lu" or "llu", so our generated .mo file need to create one msgid string with "lu" and another one with "llu" to have all bases covered. if there's more than one sysdep string, this leads to the combinatorial explosion which my existing code covers correctly.

@xhebox
Copy link
Collaborator

xhebox commented Dec 5, 2018

@rofl0r the new branch, whatever, passed my test. i'd like to hear a reply from you.

@rofl0r
Copy link
Member Author

rofl0r commented Dec 5, 2018

great, thank you! i've asked @selkfoster to test your branch, maybe @awilfox could have a look too ? as you know sabotage itself does not use translations, so it's not really suited for testing (almost all packages are using --disable-nls).

@selkfoster
Copy link

I will test the new parser branch, this take too much time, because it implies to rebuild all the packages. I will be able to test probably this weekend, I will let you know guys ...

@rofl0r
Copy link
Member Author

rofl0r commented Dec 6, 2018

@xhebox

i started testing the sysdep expansion with the following input:

msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n > 1);\n"

msgid "foo %<PRIu64> bar %<PRIu32> %<PRIu64> %<PRIuMAX> quux"
msgstr "nyet %<PRIu64> bar %<PRIu32> %<PRIu64> %<PRIuMAX> quux"

which results in buggy output: the first %<PRIu64> is ignored, and contained in both the msgid and msgstr

@xhebox
Copy link
Collaborator

xhebox commented Dec 7, 2018

@rofl0r this seems obvious. i use strstr to search, which will match PRIu32 first if it's there... i will give a solution after class. use strchr first, and try to match one of three sysdep strings. thx for your efforts :).

EDIT: solved @rofl0r , and solved another bug at the same time

xhebox added a commit that referenced this issue Dec 7, 2018
follow #39 (comment).

it's obvious that, strstr will search for `%<PRIu32>` first, if there's
one, then we get there and skip all other sysdep strings before the first
`%<PRIu32>`. but what we want is, to search the first sysdep string.

so, our new stragegy is to search for `%` instead. such that, we will
always match the first sysdep string.
@selkfoster
Copy link

True. This build (git) now works fine against the newpoparser. :-)

rofl0r pushed a commit that referenced this issue Jan 16, 2019
as stated in #39,
the old parser is not good enough to handle all the po files. Similiar
issues occurred over and over again because our dirty hacks in the
project.

so, i propose and implement this new parser, which:

1. needs to parse a po file two times. the first time will acquire
the maximum width of every entry. the second time will copy the
well-prepared contents into struct po_message, and pass it to the
callback function.

2. every struct po_message contains all the information of one
translation: msgid, msgctxt, msgid_plural, and msgstrs. comments may be
added later.

the logic of code is quite simple, nothing special need to explain. the
special points are:

1. the first time, new parser gives no infomation about what the string
is like. neither will the new parser give the exact size(sysdeped), nor
you can calculate the exact size on your own. only xxx_len, strlen, sy-
sdep in po_message_t is available. xxx_len is the length of the
corressponding entry, strlen is almost the same.

2. sysdep present how many cases the string could be expanded to. since
you know the length of the original string and the original string is
always longer than the converted one, you can get a safe buffer size to
work at the second stage.

3. poparser_sysdep(), a function like unescape(), with a bit flag
as the third argument. that is, three bits correspond to st_priu32,
st_priu64, st_priumax. since there're only up to two cases for every
kind of sysdep, you could count from 0 to msg->sysdep-1, and
poparser_sysdep will iterate every possible case eventually.
rofl0r pushed a commit that referenced this issue Jan 16, 2019
follow #39 (comment).

it's obvious that, strstr will search for `%<PRIu32>` first, if there's
one, then we get there and skip all other sysdep strings before the first
`%<PRIu32>`. but what we want is, to search the first sysdep string.

so, our new stragegy is to search for `%` instead. such that, we will
always match the first sysdep string.
rofl0r pushed a commit that referenced this issue Jan 16, 2019
as stated in #39,
the old parser is not good enough to handle all the po files. Similiar
issues occurred over and over again because our dirty hacks in the
project.

so, i propose and implement this new parser, which:

1. needs to parse a po file two times. the first time will acquire
the maximum width of every entry. the second time will copy the
well-prepared contents into struct po_message, and pass it to the
callback function.

2. every struct po_message contains all the information of one
translation: msgid, msgctxt, msgid_plural, and msgstrs. comments may be
added later.

the logic of code is quite simple, nothing special need to explain. the
special points are:

1. the first time, new parser gives no infomation about what the string
is like. neither will the new parser give the exact size(sysdeped), nor
you can calculate the exact size on your own. only xxx_len, strlen, sy-
sdep in po_message_t is available. xxx_len is the length of the
corressponding entry, strlen is almost the same.

2. sysdep present how many cases the string could be expanded to. since
you know the length of the original string and the original string is
always longer than the converted one, you can get a safe buffer size to
work at the second stage.

3. poparser_sysdep(), a function like unescape(), with a bit flag
as the third argument. that is, three bits correspond to st_priu32,
st_priu64, st_priumax. since there're only up to two cases for every
kind of sysdep, you could count from 0 to msg->sysdep-1, and
poparser_sysdep will iterate every possible case eventually.
rofl0r pushed a commit that referenced this issue Jan 16, 2019
follow #39 (comment).

it's obvious that, strstr will search for `%<PRIu32>` first, if there's
one, then we get there and skip all other sysdep strings before the first
`%<PRIu32>`. but what we want is, to search the first sysdep string.

so, our new stragegy is to search for `%` instead. such that, we will
always match the first sysdep string.
@xhebox
Copy link
Collaborator

xhebox commented Feb 24, 2019

#40 (comment), closed.

@xhebox xhebox closed this as completed Feb 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants