Currently util.replace_phrases accepts only ' ' (normal space) and '\t' as whitespace characters.
However, when the LaTeX input contains a space of a different width, e.g., \,, which represents a narrow no-break space,
it gets translated to '\u202f', which is not matched by r'[ \t]*'.
I would suggest replacing
s = r'(?:[ \t]*\n[ \t]*|[ \t]+)'
by
s = r'(?:[^\S\n]*\n[^\S\n]*|[^\S\n]+)'
where [^\S\n] matches any whitespace code point except \n; admittedly, this is a somewhat awkward double negation.
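To illustrate the difference, here is a minimal sketch that compares the two patterns on a few whitespace strings. The names OLD_SPACE, NEW_SPACE and matches_fully are made up for this example and are not part of YaLafi.

```python
import re

# The two candidate definitions of "s" from above (names are hypothetical).
OLD_SPACE = r'(?:[ \t]*\n[ \t]*|[ \t]+)'
NEW_SPACE = r'(?:[^\S\n]*\n[^\S\n]*|[^\S\n]+)'


def matches_fully(pattern, text):
    """Return True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None


samples = {
    'normal space': ' ',
    'tab': '\t',
    'narrow no-break space': '\u202f',
    'no-break space': '\u00a0',
    'space around one newline': ' \n\t',
    'two newlines': '\n\n',
}

for name, text in samples.items():
    print(f'{name}: old={matches_fully(OLD_SPACE, text)}, '
          f'new={matches_fully(NEW_SPACE, text)}')
```

Both patterns still allow at most one line break, so a blank line (paragraph break) is not swallowed by either of them.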
My use-case:
I write "i.e." as i.\,e., which gets translated to 'i.\u202fe.', tokenized by LT as
and in turn fires rule EG_SPACE. Since filing this as an upstream bug in LT, whose fix would require their tokenizer to distinguish different width spaces, seems hopeless to me, I simply wanted to get rid of the problem by adding a replacement phrase
i. e. & i.e.
which would fix my problem if the space on the LHS matched arbitrary non-line-breaking whitespace.
I don't think this is a YaLafi problem. In German, the abbreviation d.\,h. is recognized as correct, while d.h. and d. h. are flagged as wrong. According to this document (page 6, link in German), there should be no thin space in English abbreviations.
If you still want to write i.\,e., you may disable the EG_SPACE rule or, even better, define your own rules using a local server.
I had a quick look into this again. There is still the point that a replacement rule using the Unicode character does not work. Here is why: utils.replace_phrases uses string.split() to split the replacement rule into parts. Without an argument, split() treats every Unicode whitespace character as a separator, so the input 'i.\u202fe. & i.e.' is divided into ['i.', 'e.', '&', 'i.e.']. Hence, it results in the same replacements as 'i. e. & i.e.' would. Assuming that nobody uses tabs in the replacement file, one could instead use string.split(' '). In the above example, this would result in ['i.\u202fe.', '&', 'i.e.'], which should solve the problem.
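The splitting behaviour described above can be checked directly in plain Python. This is only a sketch of the two str.split variants, not the actual code of replace_phrases; the variable names are made up.

```python
# A replacement rule whose left-hand side contains U+202F
# (narrow no-break space) between "i." and "e.".
rule = 'i.\u202fe. & i.e.'

# Without an argument, str.split() splits on runs of any Unicode
# whitespace, so U+202F acts as a separator as well.
parts_default = rule.split()

# With an explicit ' ' separator, only the normal space splits the rule.
parts_space = rule.split(' ')

print(parts_default)
print(parts_space)
```

Note that split(' ') does not merge consecutive spaces: a rule written with two spaces in a row would yield an empty string among the parts, so those would have to be filtered out.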
@doeblerh: I think your suggestion has an unwanted side effect: the rule i. e. & i.e. would also replace the LaTeX code i.\, e. (there is an extra space between \, and e) with i.e., although that input produces too wide a gap.
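A small sketch of this side effect, under the assumption that the parts of a phrase are simply joined with the proposed whitespace pattern. The helper phrase_pattern is hypothetical and only imitates the idea; it is not YaLafi's implementation.

```python
import re

# Proposed whitespace pattern from the issue description.
NEW_SPACE = r'(?:[^\S\n]*\n[^\S\n]*|[^\S\n]+)'


def phrase_pattern(phrase, space=NEW_SPACE):
    """Hypothetical helper: join the escaped words of a phrase with
    the given whitespace pattern and compile the result."""
    parts = phrase.split()
    return re.compile(space.join(re.escape(p) for p in parts))


pat = phrase_pattern('i. e.')

intended = 'i.\u202fe.'      # from LaTeX  i.\,e.
extra_space = 'i.\u202f e.'  # from LaTeX  i.\, e.  (extra space)

# The rule matches the intended input, but also the input with the
# additional normal space, which would be silently "corrected".
print(bool(pat.fullmatch(intended)))
print(bool(pat.fullmatch(extra_space)))
```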