Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-ascii tag creates empty slug, exception django.urls.exceptions.NoReverseMatch #3721

Closed
candlerb opened this issue Dec 1, 2019 · 18 comments
Labels
status: accepted This issue has been accepted for implementation type: bug A confirmed report of unexpected behavior in the application

Comments

@candlerb
Copy link
Contributor

candlerb commented Dec 1, 2019

Environment

  • Python version: 3.5.2
  • NetBox version: 2.6.7

Steps to Reproduce

  1. edit a device
  2. add a tag consisting soley of non-ASCII characters, e.g. 台灣
  3. save
  4. go to Organization > Tags (/extras/tags/) in the GUI

Expected Behavior

Tag to be functional

Observed Behavior

I get an exception:

<class 'django.urls.exceptions.NoReverseMatch'>

Reverse for 'tag' with arguments '('',)' not found. 1 pattern(s) tried: ['extras\\/tags\\/(?P<slug>[-a-zA-Z0-9_]+)\\/$']

If I query the database, I find that a tag has been created with an empty slug (hence cannot match the regexp pattern which requires at least one character)

netbox=# select * from extras_tag;
 id | name | slug | color  | comments |  created   |         last_updated
----+------+------+--------+----------+------------+-------------------------------
  1 | temp | temp | 9e9e9e |          | 2019-10-12 | 2019-10-12 16:56:36.745793+00
  2 | 台灣 |      | 9e9e9e |          | 2019-11-18 | 2019-11-18 08:07:13.025115+00
(2 rows)

I also find I get this exception at the home page (/), but not at /dcim/devices/.

Cleanup

netbox=# delete from extras_taggeditem where tag_id in (select id from extras_tag where slug='');
DELETE 1
netbox=# delete from extras_tag where slug='';
DELETE 1

The exception persists for 15 minutes or until you invalidate the cacheops cache:

cd /opt/netbox/netbox
python3 manage.py nbshell
>>> import cacheops
>>> cacheops.invalidate_all()

Links

Reported previously as #2853, #3079, #3717 (insufficient detail to reproduce).

Reported on the google group here, here, here, here.

@DanSheps DanSheps added status: accepted This issue has been accepted for implementation type: bug A confirmed report of unexpected behavior in the application labels Dec 2, 2019
@DanSheps
Copy link
Member

DanSheps commented Dec 2, 2019

This looks to be a simple missing dependancy of TagBase for "unidecode", as otherwise slugify() in TagBase just returns the "tag" without any cleanup.

Will ponder how to best deal with this.

@kobayashi
Copy link
Contributor

+1 just adding unidecode as a required package. That one is simplest.
An another way is override slugify described in taggit docs though, that will lead lots of work or make it complicated just for setting slug from tag name like some comments here

@candlerb
Copy link
Contributor Author

candlerb commented Dec 5, 2019

Firstly, I think there should be a validation constraint on the tags model and/or table to forbid empty slugs. That would at least prevent getting the database into a broken state.

Next, I note that this problem goes away if tags had to be created before using them (separate feature request #3703). In that case there would be an "Add Tag" page, with a field for the slug, and empty slugs could be rejected before saving them - just like "Add Device Role" currently.

image

Users would then be forced to create a non-empty slug which is meaningful to them.

(I've also checked what happens when you do a CSV-import of device roles: the "slug" column is mandatory. So there's no problem here either: Netbox does not auto-assign slugs from names)

Now, I have to say unidecode is quite impressive, although it's nearly 2MB of dependency:

>>> unidecode.unidecode("台灣")
'Tai Wan '

It could be exposed as an AJAX completion feature. However, once you've decided to create tags in a form, you'd probably be better off using a Javascript unidecode library in the browser instead.

@jeremystretch
Copy link
Member

After digging into this, I see that there are two separate issues:

  1. Tags can be created on-demand with a blank slug. This is not permitted by the model's field definition, but the manner in which tags are being created escapes this validation.
  2. Unicode characters are effectively being stripped out by Tag's slugify() function assuming the unidecode package is not installed.

As an immediate workaround, I can confirm that installing unidecode will enable valid slugification:

$ pip install unidecode
... 
Installing collected packages: unidecode
Successfully installed unidecode-1.1.1
$ ./manage.py nbshell
i### NetBox interactive shell (jstretch-workstation)
### Python 3.6.8 | Django 2.2.6 | NetBox 2.6.8-dev
### lsmodels() will show available models. Use help(<model>) for more info.
>>> t=Tag()
>>> t.slugify('台灣')
'tai-wan'

(This works because django-taggit attempts to import unidecode and utilizes a no-op wrapper function if that fails.)

I'll work on preventing Tags from being created with blank slugs. But as far as the slugification piece goes, I can't speak to whether the behavior provided by unidecode is valuable.

@candlerb
Copy link
Contributor Author

candlerb commented Dec 6, 2019

Clearly some slug is required, so a default of _1 would work (I believe there is existing logic which cycles _2, _3 etc to ensure uniqueness of slugs) Users could edit them afterwards to make them meaningful, if desired.

@DanSheps
Copy link
Member

DanSheps commented Dec 6, 2019

I think using the unidecode as a slugifier would be the better way to go, IMO, rather then going with a standard incremental index (which we already have on the pk and if we can semi-properly decode unicode, why not?)

I will do some more testing with it, but I think it will be a "good enough" effort on our part to close it out.

It will not handle Japanese, however that is just the nature of the language (uses the same characters as Chinese)

### Python 3.6.8 | Django 2.2.4 | NetBox 2.6.7
### lsmodels() will show available models. Use help(<model>) for more info.
>>>
>>> t=Tag()
>>> t.slugify("元気")
'yuan-qi'
>>> t.slugify("人")
'ren'
>>> t.slugify("一")
'yi'
>>> t.slugify("四")
'si'
>>> t.slugify("市")
'shi'
>>> t.slugify("左")
'zuo'
>>> t.slugify("右")
'you'
>>> t.slugify("川")
'chuan'

Again though, I think this is a good enough approach for something like this.

@candlerb
Copy link
Contributor Author

candlerb commented Dec 6, 2019

And to be clear, is #3073 likely to be rejected? (Since if #3073 is accepted, this requirement goes away)

@DanSheps
Copy link
Member

DanSheps commented Dec 7, 2019

Did you mean a different issue? That one is already closed.

@DanSheps
Copy link
Member

DanSheps commented Dec 7, 2019

I think you meant #3703

I can't speculate myself, however #3703 would not necessarily negate the need for proper Unicode handling in the Tags so this requirement would likely still be present.

@candlerb
Copy link
Contributor Author

candlerb commented Dec 7, 2019

Sorry, yes #3703. It would mean that tags would be handled in exactly the same way as device roles, platforms, device types, manufacturers etc: the user would be required to enter a non-empty slug (of their choice) at time of creation of the tag.

@jeremystretch
Copy link
Member

jeremystretch commented Dec 9, 2019

To be clear, #3703 is a feature request separate from the issue being addressed here. This needs be resolved regardless of the response to #3703.

It seems we have two options:

  1. Require the installation of unidecode as discussed above, which will return a phonetic (?) representation of the Unicode string.
  2. Allow Unicode characters in the tag slug. Django's slugify function has an option to permit this:
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')

An example:

>>> slugify('What does 台灣 mean?', allow_unicode=True)
'what-does-台灣-mean'

I don't have a strong opinion either way.

@candlerb
Copy link
Contributor Author

candlerb commented Dec 9, 2019

To be clear, #3703 is a feature request separate from the issue being addressed here

Yes...

This needs be resolved regardless of the response to #3703.

...I disagree, because if #3703 is implemented, then it solves this issue at the same time.

Consider for example Device Roles. You can already create a Device Role whose full name consists entirely of high unicode code points. But there is no need to apply unidecode on them to synthesise a slug; you simply cannot save a Device Role with an empty slug. This forces the user to assign a non-empty, ASCII-compatible slug themselves.

#3703 would make tags handled the same. Tags would be created explicitly before use; tags could not spring into existence any other way.

@jeremystretch
Copy link
Member

This forces the user to assign a non-empty, ASCII-compatible slug themselves.

Right, this is a problem. Slugs should be generated automatically regardless. (This is not limited to tags.)

@candlerb
Copy link
Contributor Author

candlerb commented Dec 9, 2019

Sure, but that's a separate enhancement which ought to have a new ticket. Here you go: #3741

Regarding unicode slugs: they need special encoding for use in URLs. It can be done - but since the main point of slugs is to be embedded in URLs, that's something users will need to be aware of.

@DanSheps
Copy link
Member

Another option would be to use urllib.quote() however the slug this would generate would not be user friendly.

@DanSheps
Copy link
Member

DanSheps commented Jan 7, 2020

>>> t=Tag()
>>> t.slugify("
>>> t.slugify("つす")
'tsusu'
>>> t.slugify("一つす")
'yi-tsusu'

It looks like hiragana works, but intermixing with kanji you get partial chinese and partial japanese. Any idea's @kobayashi, have you dealt with unicode in your work at all?

@DanSheps
Copy link
Member

DanSheps commented Jan 7, 2020

From PyPI:

On the other hand transliteration (i.e., conveying, in Roman letters, the pronunciation expressed by the text in some other writing system) of languages like Chinese, Japanese or Korean is a very complex issue and this library does not even attempt to address it. It draws the line at context-free character-by-character mapping. So a good rule of thumb is that the further the script you are transliterating is from Latin alphabet, the worse the transliteration will be.

So it looks like unidecode will only transliterate the standard Chinese meaning. I would like to hear from people with experience in other languages, and see if there is a reasonable approximation for most other languages.

@kobayashi
Copy link
Contributor

Slugify with allow_unicode=True is the better way If we want to keep the meaning of tags. unidecode handle only Chinese / partial Japanese? meaning.

Here is the sample of Japanese. Modern browsers can be resolve the escapaed URL characters.

>>> Tag.objects.get(pk=1)
<Tag: こんにちは>
>>> Tag.objects.get(pk=1).get_absolute_url()
'/extras/tags/%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF/'
>>>

If there is no objections, I would like to implement it.

@lock lock bot locked as resolved and limited conversation to collaborators Apr 17, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
status: accepted This issue has been accepted for implementation type: bug A confirmed report of unexpected behavior in the application
Projects
None yet
Development

No branches or pull requests

4 participants