Lexicons
Avocado defines an abstract class named Lexicon. It is a common practice when normalizing a data model to break out repeated finite sets of terms within a column into their own table. This is quite obvious for entities such as books and authors, but less so for commonly used or enumerable terms.
    id | name | birth_month
    ---+------+------------
    1  | Sue  | May
    2  | Joe  | Jun
    3  | Bo   | Jan
    4  | Jane | Apr
    ...
The above shows a table with three columns: id, name, and birth_month.
There are some inherent issues with birth_month:
- Months have a natural order that is not alphabetical, which makes it difficult to order the rows by birth_month, since strings are ordered lexicographically by default
- As the table grows (think millions of rows), the few bytes of disk space each repeated string takes up start to have a significant impact
- The cost of querying for the distinct months within the population becomes increasingly expensive as the table grows
- As the table grows, the cost of table scans increases, since queries act on strings rather than integers (e.g. a foreign key)
Although the above example is somewhat contrived, the reasons behind this type of normalization are apparent.
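The ordering problem in the first point can be seen with a few lines of plain Python. This is a minimal sketch, not Avocado code; the MONTH_ORDER mapping is a hypothetical stand-in for a lookup table that stores an explicit sort index for each term.

```python
# Birth months as they appear in the example table.
birth_months = ["May", "Jun", "Jan", "Apr"]

# Sorting the raw strings is lexicographic, not chronological.
lexicographic = sorted(birth_months)

# Hypothetical lookup table: each term gets an explicit sort index.
MONTH_ORDER = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}

# Sorting by the stored index recovers calendar order.
chronological = sorted(birth_months, key=MONTH_ORDER.__getitem__)

print(lexicographic)   # ['Apr', 'Jan', 'Jun', 'May']
print(chronological)   # ['Jan', 'Apr', 'May', 'Jun']
```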
To implement, subclass Lexicon and define the value and label fields.
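    from django.db import models
    from avocado.lexicon.models import Lexicon

    class Month(Lexicon):
        # Human-readable form of the term, e.g. "January"
        label = models.CharField(max_length=20)
        # Raw term as it appears in the data, e.g. "Jan"
        value = models.CharField(max_length=20)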
A few of the advantages include:
- Define an arbitrary order of the items in the lexicon
- Define an integer code, which is useful for downstream clients that prefer working with an enumerable set of values, such as SAS or R
- Define a verbose/more readable label for each item
  - For example, map Jan to January
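As a rough illustration of what each lexicon item carries, the sketch below models items as plain Python objects rather than Django models. The LexiconItem class and the sample rows are hypothetical; only the field names value, label, code, and order come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconItem:
    value: str  # raw term stored in the data, e.g. "Jan"
    label: str  # verbose, human-readable form
    code: int   # integer code for clients such as SAS or R
    order: int  # explicit sort index


# Sample rows, deliberately out of calendar order.
items = [
    LexiconItem(value="May", label="May", code=5, order=5),
    LexiconItem(value="Jan", label="January", code=1, order=1),
    LexiconItem(value="Apr", label="April", code=4, order=4),
]

# Order by the explicit index, then derive labels and a value-to-code map.
ordered = sorted(items, key=lambda item: item.order)
labels = [item.label for item in ordered]
codes = {item.value: item.code for item in ordered}

print(labels)  # ['January', 'April', 'May']
print(codes)   # {'Jan': 1, 'Apr': 4, 'May': 5}
```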
In addition, Avocado treats Lexicon subclasses specially since it is such a common practice to use them. They are used in the following ways:
- Performing an init will create a DataField instance for the primary key of the Lexicon
- The order field will be used whenever appropriate for ordering the lexicon items
- The label field will be used when accessing f.labels() and for free-text searches using f.search()
- The code field will be used when accessing f.codes()
The Lexicon class also comes with an extra method on its manager called reorder, which reorders the items in the lexicon and updates the order value of each item with the new sort index. This is generally only necessary if items are added to the set and the ordering needs to be updated. The method takes the same arguments as list.sort(), but key can also be a string corresponding to a built-in key function.
    >>> SomeLexicon.objects.reorder(key='coerce_float')
Performance Note: The entire lexicon is loaded into memory, sorted, and each item is saved. This should rarely ever be an issue, assuming your lexicon is not millions of items in size.
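The steps in the performance note (load everything, sort, save each item) can be sketched in plain Python. This is an assumption about the general shape of the operation, not Avocado's actual implementation; the Item class and the reorder function below are hypothetical, and the save step is represented by simply assigning the new order.

```python
class Item:
    """Hypothetical stand-in for a lexicon row."""

    def __init__(self, value, order=0):
        self.value = value
        self.order = order


def reorder(items, key=None, reverse=False):
    """Sort items like list.sort() and write each new sort index to `order`."""
    loaded = list(items)  # the whole lexicon is held in memory
    loaded.sort(key=key, reverse=reverse)
    for index, item in enumerate(loaded):
        item.order = index  # a real model would be saved here
    return loaded


items = [Item("b"), Item("c"), Item("a")]
result = reorder(items, key=lambda item: item.value)
pairs = [(item.value, item.order) for item in result]
print(pairs)  # [('a', 0), ('b', 1), ('c', 2)]
```

Because every item is touched once, the cost grows linearly with the size of the lexicon on top of the sort itself, which is why the note only matters for very large sets.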
Built-in key functions:
- coerce_float
  - This relies on the value field for each object and attempts to coerce it to a float (in case numbers are represented as strings). It falls back to the value itself if a TypeError or ValueError is raised.
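A key function matching this description could look like the following. This is a sketch under the assumption that the key receives an object with a value attribute; the function body and the Item class are illustrative and not copied from Avocado.

```python
class Item:
    """Hypothetical stand-in for a lexicon row."""

    def __init__(self, value):
        self.value = value


def coerce_float(obj):
    """Return obj.value as a float when possible, else the value unchanged."""
    try:
        return float(obj.value)
    except (TypeError, ValueError):
        return obj.value


# Numbers stored as strings sort numerically instead of lexicographically.
items = [Item("10"), Item("9"), Item("2.5")]
by_number = [item.value for item in sorted(items, key=coerce_float)]
print(by_number)  # ['2.5', '9', '10']

# Values that cannot be coerced are returned as-is.
fallback_text = coerce_float(Item("n/a"))
fallback_none = coerce_float(Item(None))
```

Note that in Python 3 a lexicon mixing coercible and non-coercible values would produce keys of different types, and comparing a float with a string raises a TypeError during sorting, so this key is best suited to columns that are numeric throughout.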