Add docs on common unicode issues

gunthercox · Jan 10, 2017 · 81dcf90
1 parent 3e0ac5c
Showing 3 changed files with 96 additions and 1 deletion.
diff --git a/docs/encoding.rst b/docs/encoding.rst
@@ -0,0 +1,94 @@
+======================
+Python String Encoding
+======================
+
+The official Python documentation includes a Unicode HOWTO that covers the
+details of unicode character processing.
+
+- Python 3: https://docs.python.org/3/howto/unicode.html
+- Python 2: https://docs.python.org/2/howto/unicode.html
+
+The following notes are intended to help answer some common questions and issues
+that developers frequently encounter while learning to properly work with different 
+character encodings in Python.
+
+Does ChatterBot handle non-ASCII characters?
+============================================
+
+ChatterBot is able to handle unicode values correctly. You can pass it
+unicode (non-encoded) text and it should be able to process it properly.
+You will need to make sure that you decode the output that is returned.
+
+Below is one of ChatterBot's tests from `tests/test_chatbot.py`_.
+It is a simple check that a response can be generated for unicode input.
+
+.. code-block:: python
+
+   def test_get_response_unicode(self):
+       """
+       Test the case that a unicode string is passed in.
+       """
+       response = self.chatbot.get_response(u'سلام')
+       self.assertGreater(len(response.text), 0)
+
+This test passes in both Python 2.7 and 3.x. It also verifies that
+ChatterBot *can* take unicode input without issue.
+
+Fixing encoding errors
+======================
+
+When working with string type data in Python, it is possible to encounter errors
+such as the following.
+
+.. code-block:: text
+
+   UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 48: invalid start byte
+
+Depending on what your code looks like, there are a few things that you can do
+to prevent errors like this.
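+
+As a minimal sketch of how such an error can arise, the snippet below assumes
+the offending byte came from text saved as Windows-1252, where ``0x92`` is a
+curly apostrophe. The sample bytes and variable names are hypothetical. Decoding
+with the wrong codec fails, while the codec that matches the data succeeds.
+
+.. code-block:: python
+
+   # Hypothetical bytes, as if written by an editor using Windows-1252.
+   raw = b'It\x92s a test'
+
+   try:
+       raw.decode('utf-8')
+       utf8_failed = False
+   except UnicodeDecodeError:
+       # 0x92 is not a valid start byte in UTF-8.
+       utf8_failed = True
+
+   # Decoding with the codec the data was actually written in works.
+   text = raw.decode('cp1252')
+
+If you do not know the original encoding of your data, this approach is only a
+guess, so check the decoded text before relying on it.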
+
+Unicode header
+--------------
+
+.. code-block:: python
+
+   # -*- coding: utf-8 -*-
+
+When to use
++++++++++++
+
+If your strings use escaped unicode characters (they look like :code:`u'\u00b0C'`) then
+you do not need to add the header. If your strings contain literal non-ASCII characters
+such as :code:`'ØÆÅ'`, then Python 2 requires the header. Python 3 assumes UTF-8 source
+files by default, so the header is optional there.
+
+If you use this header, it must be on the first or second line of your Python file.
+
+Unicode escape characters
+-------------------------
+
+.. code-block:: text
+
+   >>> print u'\u0420\u043e\u0441\u0441\u0438\u044f'
+   Россия
+
+When to use
++++++++++++
+
+Prefix your strings with :code:`u` (as in :code:`u'...'`) when they contain
+escaped unicode characters. In Python 3 every string literal is already unicode,
+so the prefix is optional there.
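+
+If you need to find the escaped form of some text, one possible approach is the
+standard ``unicode_escape`` codec. This is only a sketch, and the variable names
+are illustrative.
+
+.. code-block:: python
+
+   # -*- coding: utf-8 -*-
+   text = u'Россия'
+
+   # unicode_escape produces the ASCII-only escaped form of the text.
+   escaped = text.encode('unicode_escape').decode('ascii')
+
+   # Decoding the escaped form recovers the original text.
+   restored = escaped.encode('ascii').decode('unicode_escape')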
+
+Import unicode literals from future
+-----------------------------------
+
+.. code-block:: python
+
+   from __future__ import unicode_literals
+
+When to use
++++++++++++
+
+Use this when you need to make sure that Python 3 code also works in Python 2.
+
+A good article on this can be found here: http://python-future.org/unicode_literals.html
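+
+A short sketch of the effect, assuming the import sits at the top of the module
+(the variable names are illustrative): with it, a plain string literal is
+unicode text on both Python 2 and Python 3.
+
+.. code-block:: python
+
+   # -*- coding: utf-8 -*-
+   from __future__ import unicode_literals
+
+   # No u prefix is needed: the literal is unicode text, not bytes.
+   text = 'Россия'
+   is_unicode = isinstance(text, type(u''))
+
+   # Encode explicitly when bytes are required.
+   encoded = text.encode('utf-8')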
+
+.. _`tests/test_chatbot.py`: https://github.com/gunthercox/ChatterBot/blob/master/tests/test_chatbot.py
diff --git a/docs/index.rst b/docs/index.rst
@@ -70,6 +70,7 @@ Contents:
    utils
    django/index
    testing
+   encoding
 
 Report an Issue
 ===============

diff --git a/docs/logic/create-a-logic-adapter.rst b/docs/logic/create-a-logic-adapter.rst
@@ -22,7 +22,7 @@ Example logic adapter
 
    class MyLogicAdapter(LogicAdapter):
        def __init__(self, **kwargs):
-           super(MyLogicAdapter, self).__init__(kwargs)
+           super(MyLogicAdapter, self).__init__(**kwargs)
 
        def can_process(self, statement):
            return True