Commit 94bd4e7

changes to HDFStore:

1. added __str__ (to do __repr__)
2. added __delitem__ to support store deletion syntactic sugar
3. row removal in tables is much faster if rows are consecutive
4. added Term class, refactored Selection (this is backwards compatible).
   Term is a concise way of specifying conditions for queries, e.g.

       Term(dict(field = 'index', op = '>', value = '20121114'))
       Term('index', '20121114')
       Term('index', '>', '20121114')
       Term('index', ['20121114','20121114'])
       Term('index', datetime(2012,11,14))
       Term('index>20121114')

   added alias to the Term class; you can specify the nominal indexers (e.g. index in DataFrame, major_axis/minor_axis or alias in Panel).
   this should close GH pandas-dev#1996
5. added Col class to manage the column conversions
6. added min_itemsize parameter and checks in pytables to allow setting of indexer columns minimum size
7. added indexing support via method create_table_index (requires 2.3 in PyTables); btw now works quite well as Int64 indices are used (as opposed to the Time64Col which has a bug); includes a check on the pytables version requirement.
   this should close GH pandas-dev#698
8. significantly updated docs for pytables to reflect all changes; added docs for Table sections
9. BUG: a store would fail if appending but a put had not been done before (see test_append); this was the result of incompatibility testing on the index_kind
10. BUG: minor change to select and remove: require a table ONLY if where is also provided (and not None)

all tests pass; tests added for new features
1 parent 81169f9 commit 94bd4e7

3 files changed

+625
-171
lines changed


doc/source/io.rst

Lines changed: 123 additions & 4 deletions
@@ -1,3 +1,4 @@
1+
12
.. _io:
23

34
.. currentmodule:: pandas
@@ -793,17 +794,75 @@ Objects can be written to the file just like adding key-value pairs to a dict:
793794
major_axis=date_range('1/1/2000', periods=5),
794795
minor_axis=['A', 'B', 'C', 'D'])
795796
797+
# store.put('s', s) is an equivalent method
796798
store['s'] = s
799+
797800
store['df'] = df
801+
798802
store['wp'] = wp
803+
804+
# the type of stored data
805+
store.handle.root.wp._v_attrs.pandas_type
806+
799807
store
800808
801809
In a current or later Python session, you can retrieve stored objects:
802810

803811
.. ipython:: python
804812
813+
# store.get('df') is an equivalent method
805814
store['df']
806815
816+
Delete the object specified by the key:
817+
818+
.. ipython:: python
819+
820+
# store.remove('wp') is an equivalent method
821+
del store['wp']
822+
823+
store
824+
825+
.. ipython:: python
826+
:suppress:
827+
828+
store.close()
829+
import os
830+
os.remove('store.h5')
831+
832+
833+
These stores are **not** appendable once written (though you can simply remove them and rewrite). Nor are they **queryable**; they must be retrieved in their entirety.
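
The dict-style access shown above (``store[key] = obj``, ``store[key]``, ``del store[key]``) can be pictured as thin wrappers around ``put``, ``get`` and ``remove``. The sketch below is a hypothetical in-memory stand-in (``MiniStore`` is an invented name, not part of pandas) that only illustrates the delegation pattern; it does not touch HDF5 at all.

```python
class MiniStore:
    """Hypothetical in-memory stand-in for a key/value object store."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        # Store (or overwrite) the object under the given key.
        self._objects[key] = value

    def get(self, key):
        # Return the whole object; partial retrieval is not modelled.
        return self._objects[key]

    def remove(self, key):
        # Drop the object; raises KeyError if the key is unknown.
        del self._objects[key]

    # Syntactic sugar: each special method delegates to the named method.
    def __setitem__(self, key, value):
        self.put(key, value)

    def __getitem__(self, key):
        return self.get(key)

    def __delitem__(self, key):
        self.remove(key)

    def __repr__(self):
        return "MiniStore(keys=%r)" % sorted(self._objects)


store_sketch = MiniStore()
store_sketch["s"] = [1, 2, 3]      # same as store_sketch.put("s", [1, 2, 3])
store_sketch["df"] = {"A": [0.5]}  # any Python object works in this sketch
del store_sketch["s"]              # same as store_sketch.remove("s")
```

After these calls only ``"df"`` remains in ``store_sketch``.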
834+
835+
836+
Storing in Table format
837+
~~~~~~~~~~~~~~~~~~~~~~~
838+
839+
``HDFStore`` supports another ``PyTables`` format on disk, the ``table`` format. Conceptually a ``table`` is shaped
840+
very much like a DataFrame, with rows and columns. A ``table`` may be appended to in the same or other sessions.
841+
In addition, delete & query type operations are supported. You can create an index with ``create_table_index``
842+
after data is already in the table (this may become automatic in the future or an option on appending/putting a ``table``).
843+
844+
.. ipython:: python
845+
:suppress:
846+
:okexcept:
847+
848+
os.remove('store.h5')
849+
850+
.. ipython:: python
851+
852+
store = HDFStore('store.h5')
853+
df1 = df[0:4]
854+
df2 = df[4:]
855+
store.append('df', df1)
856+
store.append('df', df2)
857+
858+
store.select('df')
859+
860+
# the type of stored data
861+
store.handle.root.df._v_attrs.pandas_type
862+
863+
store.create_table_index('df')
864+
store.handle.root.df.table
865+
807866
.. ipython:: python
808867
:suppress:
809868
@@ -812,8 +871,68 @@ In a current or later Python session, you can retrieve stored objects:
812871
os.remove('store.h5')
813872
814873
815-
.. Storing in Table format
816-
.. ~~~~~~~~~~~~~~~~~~~~~~~
874+
Querying objects stored in Table format
875+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
876+
877+
``select`` and ``delete`` operations have an optional criterion that can be specified to select/delete only
878+
a subset of the data. This allows one to have a very large on-disk table and retrieve only a portion of the data.
879+
880+
A query is specified using the ``Term`` class under the hood.
881+
882+
- 'index' refers to the index of a DataFrame
883+
- 'major_axis' and 'minor_axis' are supported indexers of the Panel
884+
885+
Valid terms can be created from ``dict, list, tuple, or string``. Objects can be embedded as values. Allowed operations are: ``<, <=, >, >=, =``. ``=`` will be inferred as an implicit set operation (e.g. if 2 or more values are provided). The following are all valid terms.
886+
887+
- ``dict(field = 'index', op = '>', value = '20121114')``
888+
- ``('index', '>', '20121114')``
889+
- ``'index>20121114'``
890+
- ``('index', '>', datetime(2012,11,14))``
891+
- ``('index', ['20121114','20121115'])``
892+
- ``('major', '=', Timestamp('2012/11/14'))``
893+
- ``('minor_axis', ['A','B'])``
894+
895+
Queries are built up using a list of ``Terms`` (currently only **anding** of terms is supported). An example query for a panel might be specified as follows.
896+
``['major_axis>20000102', ('minor_axis', '=', ['A','B']) ]``. This is roughly translated to: `major_axis must be greater than the date 20000102 and the minor_axis must be A or B`
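
To make the term forms and the **anding** of terms concrete, here is a minimal pure-Python sketch. It is not the pandas ``Term`` implementation: ``normalize_term`` and ``row_matches`` are hypothetical helpers, values are compared exactly as given (no date parsing or type coercion), and a row is modelled as a plain dict.

```python
import operator
import re

# Comparison operators; "=" is handled separately as set membership.
COMPARE = {"<": operator.lt, "<=": operator.le,
           ">": operator.gt, ">=": operator.ge}

def normalize_term(spec):
    """Turn a dict, tuple/list, or string spec into (field, op, value)."""
    if isinstance(spec, dict):
        field, op, value = spec["field"], spec.get("op", "="), spec["value"]
    elif isinstance(spec, str):
        # Two-character operators are listed first so ">=" is not read as ">".
        match = re.match(r"\s*(\w+)\s*(<=|>=|<|>|=)\s*(.+?)\s*$", spec)
        if match is None:
            raise ValueError("cannot parse term: %r" % (spec,))
        field, op, value = match.groups()
    elif isinstance(spec, (tuple, list)) and len(spec) == 2:
        field, value = spec
        op = "="  # operator omitted: assume equality / membership
    elif isinstance(spec, (tuple, list)) and len(spec) == 3:
        field, op, value = spec
    else:
        raise TypeError("unsupported term: %r" % (spec,))
    if op != "=" and op not in COMPARE:
        raise ValueError("unsupported operation: %r" % (op,))
    return field, op, value

def row_matches(row, terms):
    """Return True only if the row satisfies every term (logical AND)."""
    for field, op, value in map(normalize_term, terms):
        if op == "=":
            allowed = value if isinstance(value, (list, tuple)) else [value]
            if row[field] not in allowed:
                return False
        elif not COMPARE[op](row[field], value):
            return False
    return True

query = ["major_axis>20000102", ("minor_axis", "=", ["A", "B"])]
kept = row_matches({"major_axis": "20000103", "minor_axis": "A"}, query)
dropped = row_matches({"major_axis": "20000103", "minor_axis": "C"}, query)
```

In this sketch the dates are fixed-width strings, so plain string comparison orders them correctly; a real implementation would convert them to timestamps first.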
897+
898+
.. ipython:: python
899+
900+
store = HDFStore('store.h5')
901+
store.append('wp',wp)
902+
store.select('wp',[ 'major_axis>20000102', ('minor_axis', '=', ['A','B']) ])
903+
904+
Delete from objects stored in Table format
905+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
906+
907+
.. ipython:: python
908+
909+
store.remove('wp', 'index>20000102' )
910+
store.select('wp')
911+
912+
.. ipython:: python
913+
:suppress:
914+
915+
store.close()
916+
import os
917+
os.remove('store.h5')
817918
818-
.. Querying objects stored in Table format
819-
.. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
919+
Notes & Caveats
920+
~~~~~~~~~~~~~~~
921+
922+
- Selection by items (the top level panel dimension) is not possible; you always get all of the items in the returned Panel
923+
- ``PyTables`` only supports fixed-width string columns in ``tables``. The sizes of a string based indexing column (e.g. *index* or *minor_axis*) are determined as the maximum size of the elements in that axis or by passing the ``min_itemsize`` on the first table creation. If subsequent appends introduce elements in the indexing axis that are larger than the supported indexer, an Exception will be raised (otherwise you could have a silent truncation of these indexers, leading to loss of information).
924+
- Mixed-Type Panels/DataFrames are not currently supported (but coming soon)!
925+
- Once a ``table`` is created its items (Panel) / columns (DataFrame) are fixed; only exactly the same columns can be appended
926+
- You cannot append to, select from, or delete from a non-table (table creation is determined on the first append, or by passing ``table=True`` in a put operation)
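
The fixed-width string caveat above can be sketched in plain Python. The helpers below (``required_itemsize`` and ``check_append``) are hypothetical names used only for illustration and are not the pandas or ``PyTables`` API; the sketch also assumes that widths are measured in UTF-8 bytes.

```python
def _width(value):
    # Assumption: the stored width is the UTF-8 byte length of the string form.
    return len(str(value).encode("utf-8"))

def required_itemsize(values, min_itemsize=None):
    """Width chosen at table creation: longest element or min_itemsize."""
    longest = max((_width(v) for v in values), default=0)
    return max(longest, min_itemsize or 0)

def check_append(values, itemsize):
    """Raise instead of silently truncating over-long indexer values."""
    too_long = [v for v in values if _width(v) > itemsize]
    if too_long:
        raise ValueError(
            "values %r exceed the column width of %d" % (too_long, itemsize)
        )

# Without min_itemsize the width is fixed by the first batch of labels.
itemsize = required_itemsize(["A", "BB"])
check_append(["C", "DD"], itemsize)  # fits, no error

# Reserving room up front lets longer labels be appended later.
roomy = required_itemsize(["A", "BB"], min_itemsize=10)
check_append(["LONGLABEL"], roomy)   # fits within 10 bytes
```

Calling ``check_append(["CCC"], itemsize)`` in this sketch raises ``ValueError``, mirroring the exception described in the note rather than losing information through truncation.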
927+
928+
Performance
929+
~~~~~~~~~~~
930+
931+
- ``Tables`` come with a performance penalty as compared to regular stores. The benefit is the ability to append/delete and query (potentially very large amounts of data).
932+
Write times are generally longer as compared with regular stores. Query times can be quite fast, especially on an indexed axis.
933+
- To delete a lot of data, it is sometimes better to erase the table and rewrite it. ``PyTables`` tends to increase the file size with deletions
934+
- In general it is best to store Panels with the most frequently selected dimension in the minor axis and a time/date like dimension in the major axis, but this is not required. Panels can have any major_axis and minor_axis type that is a valid Panel indexer.
935+
- No dimensions are currently indexed automagically (in the ``PyTables`` sense); these require an explicit call to ``create_table_index``
936+
- ``Tables`` offer better performance when compressed after writing them (as opposed to turning on compression at the very beginning)
937+
use the ``PyTables`` utility ``ptrepack`` to rewrite the file (it can also change the compression method)
938+
- Duplicate rows can be written, but are filtered out in selection (with the last items being selected; thus a table is unique on major, minor pairs)
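
The last point above (duplicates are filtered on selection, with the last written row winning) can be illustrated with a small sketch. ``select_unique`` is a hypothetical helper, and rows are modelled as ``(major, minor, value)`` tuples in write order; this is not how pandas implements the filtering.

```python
def select_unique(rows):
    """Keep one row per (major, minor) pair, preferring the last one written."""
    latest = {}
    for major, minor, value in rows:
        # A later row with the same key overwrites the earlier value.
        latest[(major, minor)] = value
    return [(major, minor, value) for (major, minor), value in latest.items()]

written = [
    ("2000-01-01", "A", 1.0),
    ("2000-01-01", "B", 2.0),
    ("2000-01-01", "A", 3.0),  # duplicate key, written last
]
selected = select_unique(written)
```

Here ``selected`` contains two rows, and the ``("2000-01-01", "A")`` pair carries the value ``3.0`` from the last write.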
