Make phrase replacements accept a wider range of whitespace characters #203

doeblerh · 2021-03-16T11:46:53Z

Currently, utils.replace_phrases accepts only ' ' (normal space) and '\t' as whitespace characters.
However, when the LaTeX input contains a space of a different width, e.g., \,, which represents a narrow non-breaking space,
it gets translated to '\u202f', which is not matched by r'[ \t]*'.

I would suggest replacing

            s = r'(?:[ \t]*\n[ \t]*|[ \t]+)'

by

            s = r'(?:[^\S\n]*\n[^\S\n]*|[^\S\n]+)'

where [^\S\n] matches any whitespace codepoint except \n (admittedly a somewhat awkward double negation).
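To make the difference concrete, here is a minimal sketch comparing the two patterns. The helper phrase_regex is hypothetical and only approximates how the words of a phrase might be joined; it is not the actual YaLafi code.

```python
import re

# The two candidate patterns for "whitespace between words of a phrase".
OLD_SPACE = r'(?:[ \t]*\n[ \t]*|[ \t]+)'
NEW_SPACE = r'(?:[^\S\n]*\n[^\S\n]*|[^\S\n]+)'


def phrase_regex(words, space):
    # Hypothetical helper: join the escaped words with the whitespace pattern.
    return re.compile(space.join(re.escape(w) for w in words))


old = phrase_regex(['i.', 'e.'], OLD_SPACE)
new = phrase_regex(['i.', 'e.'], NEW_SPACE)

narrow = 'i.\u202fe.'      # what i.\,e. is translated to
paragraph = 'i.\n\ne.'     # two newlines, i.e. a paragraph break

print(bool(old.search(narrow)), bool(new.search(narrow)))
print(bool(old.search(paragraph)), bool(new.search(paragraph)))
```

Both patterns still allow at most one line break between two words, so a phrase is not matched across a paragraph break.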

My use-case:

I write "i.e." as i.\,e., which gets translated to 'i.\u202fe.' and is tokenized by LT as

<token regexp="yes">i</token>
<token>.</token>
<token spacebefore="yes">e</token>
<token>.</token>

and in turn fires the rule EG_SPACE. Filing this as an upstream bug in LT seems hopeless to me, since a fix would require their tokenizer to distinguish spaces of different widths. So I simply wanted to get rid of the problem by adding a replacement phrase

i. e. & i.e.

which would fix my problem if the space in the LHS matched arbitrary non-line-breaking spaces.

doeblerh · 2021-03-16T12:00:22Z

I should probably add that I already tried to use a phrase replacement rule line

'i.\u202fe. & i.e.\n'

(where the replacements file contains no \ character but the U+202F codepoint itself),
but this does not work either.

torik42 · 2021-03-18T21:20:02Z

I don’t think this is a YaLafi problem. In German, the abbreviation d.\,h. is recognized as correct, while d.h. and d. h. are wrong. According to this document (page 6, link in German), there should be no thin space in English abbreviations.

If you still want to write i.\,e., you may disable the EG_SPACE rule or, even better, define your own rules using a local server.

torik42 · 2021-03-19T14:37:58Z

I had another quick look at this. There is still the point that the replacement rule using the Unicode character does not work. Here is why: utils.replace_phrases uses string.split() to split the replacement rule into parts, and without an argument this splits on any Unicode whitespace. Thus, the input 'i.\u202fe. & i.e.' is divided into ['i.', 'e.', '&', 'i.e.']. Hence, it results in the same replacements as 'i. e. & i.e.' would. Assuming that nobody uses tabs in the replacement file, one could instead use string.split(' '). In the above example this would result in ['i.\u202fe.', '&', 'i.e.'], which should solve the problem.
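The difference between the two split calls can be checked directly. The following is only a sketch: split_rule is a hypothetical helper, not a function in YaLafi, and it additionally assumes that the trailing newline of a rule line and empty parts from repeated spaces should be dropped.

```python
rule = 'i.\u202fe. & i.e.'

# Without an argument, str.split() splits on any Unicode whitespace,
# so the narrow no-break space U+202F is lost.
parts_any_ws = rule.split()

# With an explicit ' ' separator, only the normal space separates parts.
parts_space = rule.split(' ')


def split_rule(line):
    # Hypothetical helper: split on normal spaces only, drop the trailing
    # newline and the empty strings that repeated spaces would produce.
    return [part for part in line.rstrip('\n').split(' ') if part]


print(parts_any_ws)
print(parts_space)
print(split_rule('i.\u202fe.  &  i.e.\n'))
```

Note that a bare split(' ') yields empty strings for consecutive spaces and keeps a trailing '\n' in the last part, which is why the sketch filters and strips.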

@doeblerh: I think your suggestion has an unwanted side effect: the rule i. e. & i.e. would also replace the LaTeX code i.\, e. (there is an extra space between \, and e) with i.e., although that code produces too large a spacing.
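A small sketch of this side effect, assuming that i.\, e. is translated to 'i.\u202f e.' (narrow space followed by a normal space) and that the rule is compiled roughly as below; both are assumptions for illustration, not the actual YaLafi code.

```python
import re

# Proposed whitespace pattern from the first comment.
NEW_SPACE = r'(?:[^\S\n]*\n[^\S\n]*|[^\S\n]+)'

# Assumed regex for the rule "i. e. & i.e." (hypothetical construction).
rule = re.compile(re.escape('i.') + NEW_SPACE + re.escape('e.'))


def apply_rule(text):
    # Replace every match of the left-hand side with the right-hand side.
    return rule.sub('i.e.', text)


intended = 'i.\u202fe.'    # from i.\,e.
too_wide = 'i.\u202f e.'   # assumed translation of i.\, e.

print(apply_rule(intended))
print(apply_rule(too_wide))
```

Both inputs are rewritten to the same text, so the overly wide spacing in the second one would go unnoticed.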

torik42 added the labels enhancement (New feature or request) and type: filter (Issues directly related to the filter) · Aug 21, 2022
