Add docs on common unicode issues
gunthercox committed Jan 10, 2017
1 parent 3e0ac5c commit 81dcf90
Showing 3 changed files with 96 additions and 1 deletion.
94 changes: 94 additions & 0 deletions docs/encoding.rst
@@ -0,0 +1,94 @@
======================
Python String Encoding
======================

The Python developer community has published a great article that covers the
details of unicode character processing.

- Python 3: https://docs.python.org/3/howto/unicode.html
- Python 2: https://docs.python.org/2/howto/unicode.html

The following notes are intended to help answer some common questions and issues
that developers frequently encounter while learning to properly work with different
character encodings in Python.

Does ChatterBot handle non-ascii characters?
============================================

ChatterBot is able to handle unicode values correctly. You can pass it
non-encoded data and it should process it properly, although you will
need to make sure that you decode the output that is returned.

Below is one of ChatterBot's tests from `tests/test_chatbot.py`_.
It is a simple check that a unicode response can be processed.

.. code-block:: python

   def test_get_response_unicode(self):
       """
       Test the case that a unicode string is passed in.
       """
       response = self.chatbot.get_response(u'سلام')
       self.assertGreater(len(response.text), 0)

This test passes in both Python 2.7 and 3.x. It also verifies that
ChatterBot *can* take unicode input without issue.
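
One way to follow the advice above is to convert between bytes and text only at the edges of your program. The sketch below is written for Python 3. The helper names :code:`to_text` and :code:`to_bytes` are hypothetical and are not part of ChatterBot, and UTF-8 is assumed as the encoding.

```python
def to_text(value, encoding="utf-8"):
    """Return text for either bytes or text input (hypothetical helper)."""
    if isinstance(value, bytes):
        # Raw bytes (from a file, socket, etc.) are decoded exactly once.
        return value.decode(encoding)
    # Already text: return it unchanged.
    return value


def to_bytes(text, encoding="utf-8"):
    """Encode text back to bytes when it leaves the program (hypothetical helper)."""
    return text.encode(encoding)


# The UTF-8 bytes for the greeting used in the test above.
raw_greeting = b"\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85"
greeting = to_text(raw_greeting)
```

The resulting :code:`greeting` is an ordinary text string, which is the kind of value the test above passes to :code:`get_response`.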

Fixing encoding errors
======================

When working with string type data in Python, it is possible to encounter errors
such as the following.

.. code-block:: text

   UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 48: invalid start byte

Depending on what your code looks like, there are a few things that you can do
to prevent errors like this.
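
An error like the one above usually means the bytes were written in a different encoding than the one used to decode them. The following sketch for Python 3 reproduces the failure with made-up sample bytes and assumes the data actually came from a Windows-1252 (:code:`cp1252`) source, where byte 0x92 is a curly apostrophe. Your data may use a different encoding.

```python
# Sample bytes (an assumption for illustration): text saved as Windows-1252.
raw = b"It\x92s working"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as error:
    # 0x92 can only be a continuation byte in UTF-8, so it cannot start a character.
    failure = error.reason

# Decoding with the encoding the data was written in succeeds.
text = raw.decode("cp1252")

# If the real encoding is unknown, errors="replace" avoids the exception
# but substitutes U+FFFD for each undecodable byte, so it is lossy.
fallback = raw.decode("utf-8", errors="replace")
```

Decoding with the correct codec is preferable whenever the source encoding is known, because the :code:`errors="replace"` fallback discards the original character.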

Unicode header
--------------

.. code-block:: python

   # -*- coding: utf-8 -*-

When to use
+++++++++++

If your strings use escaped unicode characters (they look like :code:`u'\u00b0C'`) then
you do not need to add the header. If your strings contain literal non-ASCII characters,
like :code:`'ØÆÅ'`, then you are required to use the header in Python 2. Python 3 assumes
that source files are UTF-8 by default.

If you are using this header it must be on the first or second line of your Python file.
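
The difference between the two cases can be seen in a small file like the sketch below. The variable names are illustrative only, and the file is assumed to be saved as UTF-8.

```python
# -*- coding: utf-8 -*-
# The declaration above tells Python 2 that this file is saved as UTF-8.

# Escaped form: the source text is pure ASCII, so no declaration is needed.
escaped = u'\u00b0C'

# Literal form: the source contains a non-ASCII character, so Python 2
# needs the declaration to read it.
literal = u'°C'

# Both spellings produce the same two-character string.
same_text = (escaped == literal)
```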

Unicode escape characters
-------------------------

.. code-block:: text

   >>> print u'\u0420\u043e\u0441\u0441\u0438\u044f'
   Россия

When to use
+++++++++++

Prefix your strings with :code:`u'...'` when you are using escaped unicode
characters in Python 2. In Python 3 every string literal is already unicode,
so the prefix is optional there.
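
As a rough illustration of what the escapes mean, the sketch below (Python 3 syntax, with illustrative variable names) shows that each :code:`\uXXXX` escape stands for one code point, and that the number of characters differs from the number of bytes once the text is encoded.

```python
# Each \uXXXX escape names one Unicode code point.
country = u'\u0420\u043e\u0441\u0441\u0438\u044f'

# Inspect the code points behind the string.
code_points = [hex(ord(char)) for char in country]

# Encoding turns the text into bytes; Cyrillic letters take two bytes each in UTF-8.
utf8_bytes = country.encode('utf-8')

print(country)
print(len(country), len(utf8_bytes))
```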

Import unicode literals from future
-----------------------------------

.. code-block:: python

   from __future__ import unicode_literals

When to use
+++++++++++

Use this when you need to make sure that Python 3 code also works in Python 2.

A good article on this can be found here: http://python-future.org/unicode_literals.html

.. _`tests/test_chatbot.py`: https://github.com/gunthercox/ChatterBot/blob/master/tests/test_chatbot.py
1 change: 1 addition & 0 deletions docs/index.rst
@@ -70,6 +70,7 @@ Contents:
utils
django/index
testing
encoding

Report an Issue
===============
2 changes: 1 addition & 1 deletion docs/logic/create-a-logic-adapter.rst
@@ -22,7 +22,7 @@ Example logic adapter
 class MyLogicAdapter(LogicAdapter):
     def __init__(self, **kwargs):
-        super(MyLogicAdapter, self).__init__(kwargs)
+        super(MyLogicAdapter, self).__init__(**kwargs)

     def can_process(self, statement):
         return True