msgfmt segfaults building git 2.19.1 #39

rofl0r · 2018-11-25T22:28:32Z

to reproduce:

msgfmt --check --statistics -o ru.mo po/ru.po

resulting in:

Program received signal SIGSEGV, Segmentation fault.
0x000000000040273c in process (in=0x605340, out=0x7ffff7ffb020)
    at src/msgfmt.c:471
471                     d.strlist[i].trans->off += d.off + d.len[pe_msgid] + d.len[pe_plural] + d.len[pe_ctxt];
(gdb) p i
$1 = 1
(gdb) p d.strlist[i]
$2 = {str = {len = 0, off = 53980}, trans = 0x0}
(gdb) p d.strlist[i-1]
$3 = {str = {len = 0, off = 53980}, trans = 0x619480}

the trans pointer of the second item is NULL. interestingly the 2nd item seems identical to the first apart from the trans pointer. i guess it's from the empty msgid "" string right at the start of the file.

@xhebox do you have time to look into this ?

The text was updated successfully, but these errors were encountered:

xhebox · 2018-11-25T23:48:50Z

@rofl0r I can take a try, but i will let you know if i can not complete it under this broken system. I just escape from the busy and hard time as a freshman. And i'm currently busy on fixing my broken toolchain and broken distribution.

xhebox · 2018-11-27T07:42:42Z

@rofl0r here's why:

msgid "ordered %u objects, expected %<PRIu32>"
msgstr ""

Here, a sysdep msgid, but an empty msgstr, resulting in a mismatched num of msgid/msgstr(msgid - msgstr == 1). And my dirty hack to drop empty msgstr translation makes the num equal improperly, so abort() is avoided, and segfault came naturally.

We really need a good logic in process_line_callback, so that we can drop an invalid translation anytime we want. One has came up to my mind: we could use a struct to record the whole ctxt/id/plural/str process when collect_size/copy_string, rather than playing with just a priv_type that could only tell me what previous statement is...

This actually reminds me that the poparser is also a bit tricky... These issues occurred over and over again on gettext-tiny. Not a surprise, we can not use hacks to meet the needs anymore. So i propose a new poparser, which:

parses po files into blocks. Each block corresponds to a msgid, a possible msgctxt/msgid_plural, and possibly multiple msgstr, and some recognizable comments(start with #:, #,) generated by xgettext. Blank lines and normal comments are ignored.
is not using action_tbl in feed_line() and not calling the cb_func everytime a newline is fed. action_tblis just a 2d array, it can just care about two states. But we have to care about at least four states(ctxt/id/plural/str). We need to care about the whole block, or we need a bigger state machine in another word.

We should just parse the new lines, check if the block is terminated, and leave the all work to the callback function once a new block is parsed.

rofl0r · 2018-11-27T14:06:30Z

msgid "ordered %u objects, expected %"
msgstr ""

what does gnu gettext do with this one ?

xhebox · 2018-11-27T15:50:22Z

Drop it. Empty msgstr is ignored by default. -C will trigger an error, since X/Open format strictly not allow the existence of emty msgstr. But to be honest, empty msgstr makes translation more convenient for translators.

rofl0r · 2018-11-27T21:35:24Z

ok. do you have time to work on your suggested poparser modifications ? i suspect it should not be too much work...

xhebox · 2018-11-27T22:56:55Z

I have. No more than 1k loc, i guess. One week should be OK. Since i have other things in the meantime.

rofl0r · 2018-11-28T14:49:31Z

1000 lines ? i hope not :)
joking aside, that's really good news, thanks!

as stated in #39, the old parser is not good enough to handle all the po files. Similiar issues occurred over and over again because our dirty hacks in the project. so, i propose and implement this new parser, which: 1. needs to parse a po file two times. the first time will acquire the maximum width of every entry. the second time will copy the well-prepared contents into struct po_message, and pass it to the callback function. 2. every struct po_message contains all the information of one translation: msgid, msgctxt, msgid_plural, and msgstrs. comments may be added later. the logic of code is quite simple, nothing special need to explain. the special points are: 1. the first time, new parser gives no infomation about what the string is like. neither will the new parser give the exact size(sysdeped), nor you can calculate the exact size on your own. only xxx_len, strlen, sy- sdep in po_message_t is available. xxx_len is the length of the corressponding entry, strlen is almost the same. 2. sysdep present how many cases the string could be expanded to. since you know the length of the original string and the original string is always longer than the converted one, you can get a safe buffer size to work at the second stage. 3. poparser_sysdep(), a function like unescape(), with a bit flag as the third argument. that is, three bits correspond to st_priu32, st_priu64, st_priumax. since there're only up to two cases for every kind of sysdep, you could count from 0 to msg->sysdep-1, and poparser_sysdep will iterate every possible case eventually.

xhebox · 2018-12-02T06:48:27Z

@rofl0r i've opened a new branch. actually the newpoparser managed to reduce some codes... msgfmt greatly reduced by 243 while poparser.c increased by 152.

let's go back to the main topic. i only completed the work of poparser and msgfmt this weekend. msgmerge, though, is quite simple to rewrite, i have to leave it for the next weekend(or earlier). you could also try to rewrite, since msgmerge is just doing copy. you could test the new msgfmt freely. bug is welcome.

besides, the new parser is completely capable of parsing po header and flags in comments. i only parse charset, nplurals, fuzzy as the old one, but we could now extend those with ease.

rofl0r · 2018-12-02T14:51:24Z

great job. i left a number of comments on your commits. i think you oversimplified the sysdep replacement code, and the version you provide is unfit for anything but the most simple cases. imo it should be possible to fit in the previous code, which i tested rigorously back when i wrote it. anyway, i'm waiting for your comments...

rofl0r · 2018-12-02T14:55:31Z

for the record, this was the commit adding the sysdep expansion: 438a47c

xhebox · 2018-12-03T03:22:36Z

cmake version string. i forgot to pull the newest commits before coding. and i picked the wrong code when rebasing.
indent problems. i've switched to spacevim, but somehow my vim defaults to softtabs(spaces). also, it causes some useless indents while i use '='.
sysdep. imo, it's good enough, though i just found a bug when i typed this comment. i dont know why your codes will be so complicated, maybe you have a 'nice' misunderstanding on the problem. let me give a full description on my solution.

first, our problem: replace %<xxx> with %x in short. it's straight forward to just copy when there's only one %<xxx>. what we need to consider is the case where there're multiple %<xxx>.

let case(x) be a function that accepts %<xxx> as the input and outputs a set of specifiers like {u, lu}. we will need to do a mulplication to collect every case: case(%<xxx1>) x case(%<xxx2>) x ... x case(%<xxxn>) = {cartesian product}.

but attention, we dont need to do mulplication for duplicated %<xxx>, e.g. no case(%<PRIu32>) x case(%<PRIu32>). because, how could there be a platform where PRIu32 could be u and lu at the same time?

second, my solution: i want my sysdep() be a function like unescape() and does not contain malloc().

according to the above analysis, i dont need to replace identical sysdep strings into different specifiers. so for every case, sysdep() just need to pick up the right specifier for the correspoding sysdep string. and just search for the sysdep string, copy the specifier and skip the sysdep string and so on.

and the trick is that i use the num of case to pick up the right specifier. there're only up to four cases: case(%<PRIu32>) x case(%<PRIu64>) x case(%<PRIuMAX>) = { {u, lu, ju}, {u,llu,ju}, {lu,lu,ju}, {lu,llu,ju} }. and %<PRIuMAX> wont change all the time. so 4 cases corresponds to a two-bit flag control %<PRIu32> and %<PRIu64>. but this fails, since i wont know what 2-case stand for 32 or 64...

third. how to fix bug: we could replace msg->sysdep with an array sysdep[3](which adds to sysdep). in this way, we could iterate like:

callback(in, out, n) {

n[3] = {0};
for (k=true;k;) {
  k = false;
  sysdep(in, out,  n);

  for (i=0; i<3; i++) {
     if (n[i] < (sysdep[i]-1)) {
        n[i]++;
        k= true;
        break;
     }
  }
}

}

~~i currently have no time to code. i could get enough time to solve problems after at least two days. i'd like to leave the work to you, or just wait for me.~~ never mind, i found i need to correct tiny to update my broken distribution, so i have pushed new commits.

detail has been descriped at #39 (comment). in short, we can not tell whether a `msg->sysdep = 2` stands for `case(%<PRIu32>)` or `case(%<PRIu64>)`. we need to record the appearances of every sysdep kind separately.

rofl0r · 2018-12-03T13:09:01Z

but attention, we dont need to do mulplication for duplicated %, e.g. no case(%) x case(%). because, how could there be a platform where PRIu32 could be u and lu at the same time?

our produced .mo files need to be portable across systems.
that means it's possible to create a .mo file on x86_64, but use it on i386.
therefore an input string containing e.g. PRIu64, will depending on the libc of the user, generate either "lu" or "llu", so our generated .mo file need to create one msgid string with "lu" and another one with "llu" to have all bases covered. if there's more than one sysdep string, this leads to the combinatorial explosion which my existing code covers correctly.

xhebox · 2018-12-05T04:51:31Z

@rofl0r the new branch, whatever, passed my test. i'd like to hear a reply from you.

rofl0r · 2018-12-05T16:22:48Z

great, thank you! i've asked @selkfoster to test your branch, maybe @awilfox could have a look too ? as you know sabotage itself does not use translations, so it's not really suited for testing (almost all packages are using --disable-nls).

selkfoster · 2018-12-05T17:33:20Z

I will test the new parser branch, this take too much time, because it implies to rebuild all the packages. I will be able to test probably this weekend, I will let you know guys ...

rofl0r · 2018-12-06T21:29:18Z

@xhebox

i started testing the sysdep expansion with the following input:

msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n > 1);\n"

msgid "foo %<PRIu64> bar %<PRIu32> %<PRIu64> %<PRIuMAX> quux"
msgstr "nyet %<PRIu64> bar %<PRIu32> %<PRIu64> %<PRIuMAX> quux"

which results in buggy output: the first %<PRIu64> is ignored, and contained in both the msgid and msgstr

xhebox · 2018-12-07T01:03:35Z

@rofl0r this seems obvious. i use strstr to search, which will match PRIu32 first if it's there... i will give a solution after class. use strchr first, and try to match one of three sysdep strings. thx for your efforts :).

EDIT: solved @rofl0r , and solved another bug at the same time

follow #39 (comment). it's obvious that, strstr will search for `%<PRIu32>` first, if there's one, then we get there and skip all other sysdep strings before the first `%<PRIu32>`. but what we want is, to search the first sysdep string. so, our new stragegy is to search for `%` instead. such that, we will always match the first sysdep string.

selkfoster · 2018-12-15T18:35:46Z

True. This build (git) now works fine against the newpoparser. :-)

as stated in #39, the old parser is not good enough to handle all the po files. Similiar issues occurred over and over again because our dirty hacks in the project. so, i propose and implement this new parser, which: 1. needs to parse a po file two times. the first time will acquire the maximum width of every entry. the second time will copy the well-prepared contents into struct po_message, and pass it to the callback function. 2. every struct po_message contains all the information of one translation: msgid, msgctxt, msgid_plural, and msgstrs. comments may be added later. the logic of code is quite simple, nothing special need to explain. the special points are: 1. the first time, new parser gives no infomation about what the string is like. neither will the new parser give the exact size(sysdeped), nor you can calculate the exact size on your own. only xxx_len, strlen, sy- sdep in po_message_t is available. xxx_len is the length of the corressponding entry, strlen is almost the same. 2. sysdep present how many cases the string could be expanded to. since you know the length of the original string and the original string is always longer than the converted one, you can get a safe buffer size to work at the second stage. 3. poparser_sysdep(), a function like unescape(), with a bit flag as the third argument. that is, three bits correspond to st_priu32, st_priu64, st_priumax. since there're only up to two cases for every kind of sysdep, you could count from 0 to msg->sysdep-1, and poparser_sysdep will iterate every possible case eventually.

follow #39 (comment). it's obvious that, strstr will search for `%<PRIu32>` first, if there's one, then we get there and skip all other sysdep strings before the first `%<PRIu32>`. but what we want is, to search the first sysdep string. so, our new stragegy is to search for `%` instead. such that, we will always match the first sysdep string.

as stated in #39, the old parser is not good enough to handle all the po files. Similiar issues occurred over and over again because our dirty hacks in the project. so, i propose and implement this new parser, which: 1. needs to parse a po file two times. the first time will acquire the maximum width of every entry. the second time will copy the well-prepared contents into struct po_message, and pass it to the callback function. 2. every struct po_message contains all the information of one translation: msgid, msgctxt, msgid_plural, and msgstrs. comments may be added later. the logic of code is quite simple, nothing special need to explain. the special points are: 1. the first time, new parser gives no infomation about what the string is like. neither will the new parser give the exact size(sysdeped), nor you can calculate the exact size on your own. only xxx_len, strlen, sy- sdep in po_message_t is available. xxx_len is the length of the corressponding entry, strlen is almost the same. 2. sysdep present how many cases the string could be expanded to. since you know the length of the original string and the original string is always longer than the converted one, you can get a safe buffer size to work at the second stage. 3. poparser_sysdep(), a function like unescape(), with a bit flag as the third argument. that is, three bits correspond to st_priu32, st_priu64, st_priumax. since there're only up to two cases for every kind of sysdep, you could count from 0 to msg->sysdep-1, and poparser_sysdep will iterate every possible case eventually.

follow #39 (comment). it's obvious that, strstr will search for `%<PRIu32>` first, if there's one, then we get there and skip all other sysdep strings before the first `%<PRIu32>`. but what we want is, to search the first sysdep string. so, our new stragegy is to search for `%` instead. such that, we will always match the first sysdep string.

xhebox · 2019-02-24T06:02:13Z

#40 (comment), closed.

xhebox mentioned this issue Feb 24, 2019

newpoparser: issue with vim-8.1.0327 #40

Closed

xhebox closed this as completed Feb 24, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

msgfmt segfaults building git 2.19.1 #39

msgfmt segfaults building git 2.19.1 #39

rofl0r commented Nov 25, 2018

xhebox commented Nov 25, 2018

xhebox commented Nov 27, 2018

rofl0r commented Nov 27, 2018

xhebox commented Nov 27, 2018

rofl0r commented Nov 27, 2018

xhebox commented Nov 27, 2018

rofl0r commented Nov 28, 2018

xhebox commented Dec 2, 2018

rofl0r commented Dec 2, 2018

rofl0r commented Dec 2, 2018

xhebox commented Dec 3, 2018 •

edited

Loading

rofl0r commented Dec 3, 2018 •

edited

Loading

xhebox commented Dec 5, 2018

rofl0r commented Dec 5, 2018

selkfoster commented Dec 5, 2018

rofl0r commented Dec 6, 2018 •

edited

Loading

xhebox commented Dec 7, 2018 •

edited

Loading

selkfoster commented Dec 15, 2018

xhebox commented Feb 24, 2019

msgfmt segfaults building git 2.19.1 #39

msgfmt segfaults building git 2.19.1 #39

Comments

rofl0r commented Nov 25, 2018

xhebox commented Nov 25, 2018

xhebox commented Nov 27, 2018

rofl0r commented Nov 27, 2018

xhebox commented Nov 27, 2018

rofl0r commented Nov 27, 2018

xhebox commented Nov 27, 2018

rofl0r commented Nov 28, 2018

xhebox commented Dec 2, 2018

rofl0r commented Dec 2, 2018

rofl0r commented Dec 2, 2018

xhebox commented Dec 3, 2018 • edited Loading

rofl0r commented Dec 3, 2018 • edited Loading

xhebox commented Dec 5, 2018

rofl0r commented Dec 5, 2018

selkfoster commented Dec 5, 2018

rofl0r commented Dec 6, 2018 • edited Loading

xhebox commented Dec 7, 2018 • edited Loading

selkfoster commented Dec 15, 2018

xhebox commented Feb 24, 2019

xhebox commented Dec 3, 2018 •

edited

Loading

rofl0r commented Dec 3, 2018 •

edited

Loading

rofl0r commented Dec 6, 2018 •

edited

Loading

xhebox commented Dec 7, 2018 •

edited

Loading