Make phrase replacements accept a wider range of whitespace characters #203

Open
doeblerh opened this issue Mar 16, 2021 · 3 comments
Labels
enhancement (New feature or request), type: filter (Issues directly related to the filter)

Comments

@doeblerh

Currently, util.replace_phrases accepts only ' ' (normal space) and '\t' as whitespace characters.
However, when the LaTeX input contains a space of a different width, e.g., \,, which represents a narrow no-break space,
it gets translated to '\u202f', which is not matched by r'[ \t]*'.

I would suggest replacing

            s = r'(?:[ \t]*\n[ \t]*|[ \t]+)'

by

            s = r'(?:[^\S\n]*\n[^\S\n]*|[^\S\n]+)'

where [^\S\n] matches any whitespace code point except \n (admittedly a somewhat awkward double negation).
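To illustrate the difference between the two patterns, here is a minimal sketch using only Python's `re` module. The names `OLD_GAP`, `NEW_GAP`, and `gap_matches` are made up for this example and are not part of YaLafi; the check only looks at the whitespace sub-pattern in isolation, not at the surrounding replacement logic.

```python
import re

# Whitespace sub-pattern currently used (from the issue text).
OLD_GAP = r'(?:[ \t]*\n[ \t]*|[ \t]+)'
# Proposed sub-pattern: any whitespace except '\n', with at most one '\n'.
NEW_GAP = r'(?:[^\S\n]*\n[^\S\n]*|[^\S\n]+)'


def gap_matches(pattern: str, text: str) -> bool:
    """Return True if `text` as a whole is accepted as one inter-word gap."""
    return re.fullmatch(pattern, text) is not None


# Hypothetical sample gaps: plain space, narrow no-break space (U+202F),
# no-break space (U+00A0), a single line break with indentation, and an
# empty line (two line breaks).
samples = [' ', '\u202f', '\u00a0', ' \n\t', '\n\n']
for gap in samples:
    print(repr(gap), gap_matches(OLD_GAP, gap), gap_matches(NEW_GAP, gap))
```

In this sketch, both patterns still reject a gap containing two line breaks, so the proposed change should only widen the set of accepted horizontal spaces.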

My use-case:

I write "i.e." as i.\,e., which gets translated to 'i.\u202fe.' and is tokenized by LT as

<token regexp="yes">i</token>
<token>.</token>
<token spacebefore="yes">e</token>
<token>.</token>

and in turn fires rule EG_SPACE. Since a fix in LT itself would require their tokenizer to distinguish spaces of different widths, filing this as an upstream bug seems hopeless to me, so I simply wanted to work around the problem by adding a replacement phrase

i. e. & i.e.

which would fix my problem if the space on the left-hand side matched arbitrary non-line-breaking spaces.

@doeblerh
Author

I should probably add that I already tried to use a phrase replacement rule line

'i.\u202fe. & i.e.\n'

(where the replacements file contains no \ character, only the U+202F code point itself),
but this does not work either.

@torik42
Owner

torik42 commented Mar 18, 2021

I don’t think this is a YaLafi problem. In German, the abbreviation d.\,h. is recognized as correct, while d.h. and d. h. are wrong. According to this document (page 6, link in German), there should be no thin space in English abbreviations.

If you still want to write i.\,e., you may disable the EG_SPACE rule or, even better, define your own rules using a local server.

@torik42
Owner

torik42 commented Mar 19, 2021

I had another quick look at this. One point remains: the replacement rule using the Unicode character does not work. Here is why: utils.replace_phrases uses string.split() to split the replacement rule into parts. Thus, the input 'i.\u202fe. & i.e.' is divided into ['i.', 'e.', '&', 'i.e.'], which results in the same replacements as 'i. e. & i.e.' would. Assuming that nobody uses tabs in the replacement file, one could instead use string.split(' '). In the above example, this would result in ['i.\u202fe.', '&', 'i.e.'], which should solve the problem.
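The splitting behaviour described above can be reproduced with plain Python strings. This is only a sketch of the two `split` variants on the example rule, not the actual YaLafi code; the variable names are made up for illustration.

```python
# The rule line from the example, with the real U+202F code point inside.
rule = 'i.\u202fe. & i.e.'

# Without an argument, split() treats every Unicode whitespace character
# as a separator, so the narrow no-break space is lost.
parts_default = rule.split()
print(parts_default)

# With an explicit ' ' separator, only the normal space splits the rule.
parts_space = rule.split(' ')
print(parts_space)

# Caveat: split(' ') does not merge runs of spaces, so a rule written with
# two consecutive spaces yields an empty string between the parts.
parts_double = 'i.e.  & x'.split(' ')
print(parts_double)
```

The last case suggests that a switch to `split(' ')` would probably need to filter out empty parts, and, as noted above, it would no longer treat tabs as separators.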

@doeblerh: I think your suggestion has an unwanted side effect: the rule i. e. & i.e. would also replace the LaTeX code i.\, e. (with an extra space between \, and e) by i.e., although that code produces too large a space.
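The side effect can be sketched as follows. The helper `phrase_regex` is hypothetical and only mimics the idea of joining the words of a rule with the whitespace sub-pattern; it is not YaLafi's implementation. The sketch also assumes that i.\, e. ends up as 'i.\u202f e.' (narrow no-break space followed by a normal space) in the plain text.

```python
import re

OLD_GAP = r'(?:[ \t]*\n[ \t]*|[ \t]+)'
NEW_GAP = r'(?:[^\S\n]*\n[^\S\n]*|[^\S\n]+)'


def phrase_regex(words, gap):
    """Hypothetical helper: join the escaped words of a rule with `gap`."""
    return re.compile(gap.join(re.escape(w) for w in words))


# Assumed plain-text result of the LaTeX code "i.\, e."
text = 'i.\u202f e.'
words = ['i.', 'e.']  # left-hand side of the rule "i. e. & i.e."

# Old pattern: U+202F is not in [ \t], so the text stays unchanged.
result_old = phrase_regex(words, OLD_GAP).sub('i.e.', text)
# Proposed pattern: the two-character gap matches and is replaced.
result_new = phrase_regex(words, NEW_GAP).sub('i.e.', text)

print(repr(result_old))
print(repr(result_new))
```

Under these assumptions, the proposed pattern would silently hide the doubled space from the checker.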

torik42 added the labels enhancement (New feature or request) and type: filter (Issues directly related to the filter) on Aug 21, 2022