column description bug #386

reza1615 · 2021-01-04T17:49:26Z

In the column description, these metrics have bugs:

Rows w/ Hidden char: counts Arabic characters as hidden chars. For example, it counts تست as hidden, which it is not.
Rows w/ accent: count as
- Num
- Accent
- Hidden
Rows w/ number: doesn't count numbers. For example, if a row is foo123, it isn't counted as num.

Test: on http://alphatechadmin.pythonanywhere.com/dtale/main/4

The correct column description should be

aschonfeld · 2021-01-05T04:24:30Z

So the issue with numeric chars was just a front-end mapping issue. Accent chars seem to be fine using the data you gave me:

This is the regex I'm using to compute hidden characters:

from string import printable
df['var1'].str.count(r'[^{}]+'.format(printable)).astype(bool).sum()

Looks like it is saying that تست, fooÀ & ZWNJ contain hidden characters. Is there something wrong with string.printable?
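To see why those three values are flagged, here is a minimal standard-library sketch of the same check. The helper names `flagged_by_printable` and `offending_chars` are hypothetical, and `re.escape` is added only to keep the character class safe; it is not part of the original expression.

```python
import re
from string import printable

# Same idea as the pattern above: one or more characters that are not in
# string.printable. re.escape is an addition here, used so that regex
# metacharacters in printable stay literal inside the character class.
NON_PRINTABLE = re.compile('[^{}]+'.format(re.escape(printable)))


def flagged_by_printable(value):
    """Return True if value has at least one character outside string.printable."""
    return bool(NON_PRINTABLE.search(value))


def offending_chars(value):
    """List the characters that cause the value to be flagged."""
    return [ch for ch in value if ch not in printable]


if __name__ == '__main__':
    # 'foo\u200cbar' contains a zero-width non-joiner (ZWNJ).
    for sample in ['foo123', 'fooÀ', 'تست', 'foo\u200cbar']:
        print(repr(sample), flagged_by_printable(sample), offending_chars(sample))
```

Only foo123 passes, because `string.printable` contains ASCII characters only. Accented letters and Arabic letters are treated the same way as the genuinely invisible ZWNJ.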

reza1615 · 2021-01-05T13:44:34Z

printable only covers the ASCII table, so it is fine for English but doesn't support accented characters or other Unicode characters.

For numeric chars, please use [\d]+. It supports digits from all languages, like ۱۲۳123.
For hidden chars, please use this regex:

printable =r'\w \!\"#\$%&\'\(\)\*\+,\-\./:;<»«؛،ـ\=>\?@\[\\\]\^_\`\{\|\}~'
df['var1'].str.count(r'[^{}]+'.format(printable)).astype(bool).sum()
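As a rough check of both suggestions, the sketch below applies them with the standard-library `re` module instead of pandas. The `printable` class is copied verbatim from the line above; the helper names `has_hidden` and `has_numeric` are hypothetical.

```python
import re

# Character class proposed above, copied verbatim. In Python 3 str patterns,
# \w is Unicode-aware, so letters such as À or ت count as visible.
printable = r'\w \!\"#\$%&\'\(\)\*\+,\-\./:;<»«؛،ـ\=>\?@\[\\\]\^_\`\{\|\}~'
HIDDEN = re.compile(r'[^{}]+'.format(printable))

# \d is also Unicode-aware and matches decimal digits from any script.
NUMERIC = re.compile(r'[\d]+')


def has_hidden(value):
    """True if value contains a character outside the proposed class."""
    return bool(HIDDEN.search(value))


def has_numeric(value):
    """True if value contains at least one decimal digit."""
    return bool(NUMERIC.search(value))


if __name__ == '__main__':
    for sample in ['foo123', 'fooÀ', 'تست', '۱۲۳', 'foo\u200cbar']:
        print(repr(sample), has_hidden(sample), has_numeric(sample))
```

With this class, تست and fooÀ are no longer reported as hidden, while the ZWNJ in the last sample still is. Note that tabs and newlines are not in the class, so they would also count as hidden.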

reza1615 · 2021-01-05T21:26:40Z

Also please update the Cleaning function (remove hidden characters)

reza1615 · 2021-01-05T21:27:19Z

And the "remove numbers" cleaning function
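For illustration only, cleaners built on the same patterns might look like the sketch below. It assumes each cleaner is a plain regex substitution; the function names `remove_hidden` and `remove_numbers` are hypothetical and this is not D-Tale's actual implementation.

```python
import re

# Same character class as suggested earlier in this thread.
printable = r'\w \!\"#\$%&\'\(\)\*\+,\-\./:;<»«؛،ـ\=>\?@\[\\\]\^_\`\{\|\}~'
HIDDEN = re.compile(r'[^{}]+'.format(printable))
NUMERIC = re.compile(r'[\d]+')


def remove_hidden(value):
    """Drop every run of characters outside the visible class."""
    return HIDDEN.sub('', value)


def remove_numbers(value):
    """Drop every run of decimal digits, in any script."""
    return NUMERIC.sub('', value)


if __name__ == '__main__':
    print(repr(remove_hidden('foo\u200cbar')))
    print(repr(remove_numbers('foo123۱۲۳')))
```

On a pandas column the same patterns could be passed to `Series.str.replace` with `regex=True`.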

aschonfeld · 2021-01-16T15:23:23Z

fixed in v1.31.0

aschonfeld added a commit that referenced this issue Jan 6, 2021

#386: bugfixes with "Rows w/ numeric" & "Rows w/ hidden"

4fa90c0

aschonfeld added a commit that referenced this issue Jan 16, 2021

#386: bugfixes with "Rows w/ numeric" & "Rows w/ hidden"

68f7648

aschonfeld closed this as completed Jan 16, 2021
