Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remark-grid-table: not breakable spaces break the parsing #369

Closed
artragis opened this issue Jan 8, 2020 · 16 comments
Closed

remark-grid-table: not breakable spaces break the parsing #369

artragis opened this issue Jan 8, 2020 · 16 comments

Comments

@artragis
Copy link
Member

artragis commented Jan 8, 2020

given

+-----------+--------------------------------------------+
| a         | b                                          |
+-----------+--------------------------------------------+
|       c   |  ‌                              e           |
+-----------+--------------------------------------------+

I expect

root[1] (1:1-5:59, 0-295)
└─ element[2] (1:1-5:59, 0-295) [tagName="WrapperBlock"]
   ├─ gridTable[1] [data={"hName":"table"}]
   │  └─ tableHeader[1] [data={"hName":"tbody"}]
   │     └─ tableRow[2] [data={"hName":"tr"}]
   │        ├─ tableCell[1] [data={"hName":"td","hProperties":{"colspan":1,"rowspan":1}}]
   │        │  └─ paragraph[1] (1:1-1:2, 0-1)
   │        │     └─ text: "a" (1:1-1:2, 0-1)
   │        └─ tableCell[1] [data={"hName":"td","hProperties":{"colspan":1,"rowspan":1}}]
   │           └─ paragraph[1] (1:1-1:2, 0-1)
   │              └─ text: "b" (1:1-1:2, 0-1)
   └─ paragraph[1] (1:1-2:59, 0-117)
      └─ text: "|       c   |  ‌                              e           |\n+-----------+--------------------------------------------+" (1:1-2:59, 0-117)

But I get

root[1] (1:1-5:59, 0-294)
└─ element[1] (1:1-5:59, 0-294) [tagName="WrapperBlock"]
   └─ gridTable[1] [data={"hName":"table"}]
      └─ tableHeader[2] [data={"hName":"tbody"}]
         ├─ tableRow[2] [data={"hName":"tr"}]
         │  ├─ tableCell[1] [data={"hName":"td","hProperties":{"colspan":1,"rowspan":1}}]
         │  │  └─ paragraph[1] (1:1-1:2, 0-1)
         │  │     └─ text: "a" (1:1-1:2, 0-1)
         │  └─ tableCell[1] [data={"hName":"td","hProperties":{"colspan":1,"rowspan":1}}]
         │     └─ paragraph[1] (1:1-1:2, 0-1)
         │        └─ text: "b" (1:1-1:2, 0-1)
         └─ tableRow[2] [data={"hName":"tr"}]
            ├─ tableCell[1] [data={"hName":"td","hProperties":{"colspan":1,"rowspan":1}}]
            │  └─ paragraph[1] (1:1-1:2, 0-1)
            │     └─ text: "c" (1:1-1:2, 0-1)
            └─ tableCell[1] [data={"hName":"td","hProperties":{"colspan":1,"rowspan":1}}]
               └─ paragraph[1] (1:1-1:2, 0-1)
                  └─ text: "e" (1:1-1:2, 0-1)

extra

there are some special chars on the second row :

>>> a="""|       c   |  ‌                              e           |"""
>>> a
'|       c   |  \xe2\x80\x8c                              e           |'
@artragis
Copy link
Member Author

artragis commented Jan 8, 2020

It appears this char is zero-width non joiner.

On zmarkdown server 6.1 it generates a "Incorrectly eaten value: please report this warning on http://git.io/vg5Ft" error.

@StaloneLab
Copy link
Member

Will be working on it next week if no one has solved it by then.

@StaloneLab
Copy link
Member

Okay, I took a look this morning, and the problem is complicated in fact... Here's what happens:

  • for a table to be correct, the grid-table tokenizer expect all it's lines to be visually equal in width;
  • the first line is 58 visual width, but for some unknown reasons, the fourth line is considered 59 because the zero width character is counted has being 1 in width by the string-width package;
  • therefore, the table is considered to end at third line: it is tokenized, and the remaining of the string is tokenized by the remark's paragraph tokenizer.

The be convinced of this fact, notice how the following works while not having visually equal lines (I removed a space from the fourth line:

+-----------+--------------------------------------------+
| a         | b                                          |
+-----------+--------------------------------------------+
|       c   |  ‌                              e          |
+-----------+--------------------------------------------+

To me, the issue is upfront on string-width wrongly detecting line width; someone already opened an issue there, but it seems the maintainer doesn't want to solve it because the length depends on the context and he's lacking test cases.

Maybe we could update the issue mentionned above to see if the maintainer of the package could do something about it; or we will have to write our own package...

@vhf
Copy link
Contributor

vhf commented Jan 18, 2020

@TitiAlone what about fixing it on our side by counting how many of these (except the only joiner one) there is in each cell (or line, I didn't check how grid-table is tokenized) and subtracting this number from the character count?

@StaloneLab
Copy link
Member

StaloneLab commented Jan 18, 2020

We could, I thought about it, but found it very hacky; IMO it is a bad solution because if string-width corrects the bug one day, we will have the opposite problem (counting less characters than there really are).

@vhf
Copy link
Contributor

vhf commented Jan 18, 2020

Makes sense, and I think there will never be a definitive fix to this issue.

The 👨‍👨‍👧‍👦 emoji is 7 unicode characters. With the current version of remark-grid-table, we consider it to have width 2. This is arbitrary, it could also be seen as having width 1, 4, 7 or 11. It depends on the editor (or browser) in which you are writing your markdown grid-table, the font you are using, and other factors potentially.

Same goes for ZWNJ, it could either be 0 or 1 width (some editors display "weird" unicode characters as symbols (here it could be a tiny u+200c.))

IMO the priority should be fixing the crash, the string containing the ZWNJ should be eaten properly.

@StaloneLab
Copy link
Member

Do we really need to check for lines width, by the way? We could simply disable the check and then have no issues anymore with the length of unicode characters, relying only on the count of |.

@vhf
Copy link
Contributor

vhf commented Jan 19, 2020

I'm not an expert on grid tables (I always avoided using them because it's too much work for me to write) but I'm a bit worried about cells spanning multiple columns.

Examples (paste it here: https://zestedesavoir.github.io/zmarkdown/ ):

+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
|   | a     |
+---+---+---+
|     b |   |
+-------+---+


+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
|   | a     |
+---+-------+
| b |       |
+---+-------+

What you are suggesting is worth trying IMO (if you have time), all I'm saying is that I cannot tell off the top of my head if it could still handle the above examples.

@StaloneLab
Copy link
Member

I think you're right, we can't distinguish between the two examples without visual alignment... I'll try disabling the global check to see what it gives, and fall back to compensating the offset if it doesn't work.
I also noticed Pandoc was supporting grid tables, so I might have a look at how they're handling this case.

@StaloneLab
Copy link
Member

I've got no failing tests when removing global length test, and it resolves the bug; question is about whether we want to allow cases like this one:

+---+---+---+
|      1 |       2 |         3 |
+---+---+---+

It renders ok, but we might consider it a bad practice. What do you think @artragis ?

@vhf
Copy link
Contributor

vhf commented Jan 20, 2020

I've got no failing tests when removing global length test, and it resolves the bug

Great! Could you please add a test for this case?

+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
|   | a     |
+---+---+---+
|     b |   |
+-------+---+


+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
|   | a     |
+---+-------+
| b |       |
+---+-------+

Because of the current cell-width constraints there was no reason to check this but we should make sure it works properly with your fix. (Many ways to test it, not sure what's the best one, checking that they don't return the same HTML might be enough.)

question is about whether we want to allow cases like this one:

+---+---+---+
|      1 |       2 |         3 |
+---+---+---+

It renders ok, but we might consider it a bad practice. What do you think @artragis ?

My vote goes to allowing it but I don't feel strongly about this. For some reason I doubt people would go out of their way to write something this ugly. 🙃

@artragis
Copy link
Member Author

I think no crash is the only thing that zds needs. This width varying chars are a pain to manage so if you have something stable and predictible I take.
As @vhf is zmd maintainer let's go ahead as he told.

@vhf vhf closed this as completed Jan 21, 2020
@vhf vhf added this to the zmarkdown@7.0.0 milestone Jan 21, 2020
@Drulac
Copy link

Drulac commented Jul 1, 2020

Maybe we could introduce an other char like # (or another) to make a difference between these cases :

+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
|   | a     |
+---+---+---+
|     b |   |
+-------+---+


+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
|   | a     |
+---+-------+
| b |       |
+---+-------+

the new char could tell wich case take more than one column space :

+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
|   | a #   |
+---+---+---+
|   # b |   |
+-------+---+


+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
|   | a #   |
+---+-------+
| b |   #   |
+---+-------+

the user could put it everywhere in the cell (start, in part of the content, end of cell) ?

+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
|   |# a    |
+---+---+---+
| b # b |   |
+---+-------+
| c |      #|
+---+-------+

If we want a 3 columns large cell, juste add an #

+---+---+---+---+
| 1 | 2 | 3 | 4 |
+---+---+---+---+
|   | a #   #   |
+---+---+---+---+
| b # b |   |   |
+---+-------+---+
| c |   #   |   |
+---+-------+---+

# is maybe not the best char for this job 🤷‍♀️

What do you think of this idea ?

@StaloneLab
Copy link
Member

@Drulac : the problem you're refering to is not clear to me; the two test cases you submitted do render differently by now, so I don't know why we would bother adding an extra character for something that already works (you can test it on the live demo).

@Drulac
Copy link

Drulac commented Jul 1, 2020

for example in this case the render is not what I've expected :S

+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| **gh**  | **a**     |
+---+-------+
| b |       |
+---+-------+

of course if I format the table "strictly" it will render as expected :

+---------+---+---+
| 1       | 2 | 3 |
+---------+---+---+
| **gh**  | **a** |
+---------+---+---+
| b       |       |
+---------+---+---+

but it's so longer to write and align perfectly at each modification of the cells

The use of an other char could really improve the speed of writing complex tables

It also could allow to put inline math formules inside complex tables ?

@StaloneLab
Copy link
Member

To me, your first usage is incorrect; we simply allowed it because it was easier for Unicode manipulation of varying-width characters; this is not a bug, and could hardly argue this is a great feature request...

Right now, I'm against having this kind of character that would complexify the parsing (already way too complex). Tables are expected to be written correctly, and your first example is behaviour undefined, maybe one day it'll render the other way around, no one knows 'cause no one is expected to write this kind of stuff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants