
Comparison with HDF5(pytables) #346

Open
graykode opened this issue Jun 19, 2020 · 4 comments


graykode commented Jun 19, 2020

I compared writing a numpy array of shape (200000, 784) using a dense array in TileDB and using create_array in PyTables.
However, TileDB's write I/O is significantly slower than PyTables'.

TileDB

import tiledb
import tables as tb
import numpy as np
import datetime as dt

n = 200000

dom = tiledb.Domain(tiledb.Dim(name="a", domain=(0, n-1), tile=n, dtype=np.int32),
                    tiledb.Dim(name="b", domain=(0, 784-1), tile=784, dtype=np.int32))
print(dom)

# Add a single attribute "a2", so each (i,j) cell stores one float32 value.
schema = tiledb.ArraySchema(domain=dom, 
                            sparse=False, 
                            attrs=[
                                tiledb.Attr(name="a2", dtype=np.float32)
                            ])

# Create the (empty) array on disk.
array_name = "dense_array"
tiledb.DenseArray.create(array_name, schema)

%%time
ran = np.random.rand(n, 784)
with tiledb.DenseArray(array_name, mode='w') as A:
    A[:,:] = ran

CPU times: user 3.4 s, sys: 3.09 s, total: 6.49 s
Wall time: 11.7 s

PyTables

!rm -f data/array.h5

import tables as tb
import numpy as np
from os.path import expanduser

h5 = tb.open_file("data/array.h5", 'w') 

n = 200000

ear = h5.create_array(h5.root, 'sub',
                       atom=tb.Float64Atom(),
                       shape=(n, 784))

%%time
ran = np.random.rand(n, 784)

ear[:, :] = ran

CPU times: user 1.88 s, sys: 1.15 s, total: 3.03 s
Wall time: 5.66 s

I'd like to know the cause.
I look forward to hearing some solutions. @ihnorton

stavrospapadopoulos (Member) commented Jun 20, 2020

Hi @graykode, thanks for posting this issue!

Quick note: In your example above, you may want to make the following modifications:

  • In the TileDB schema definition, change tiledb.Attr(name="a2", dtype=np.float32) to tiledb.Attr(name="a2", dtype=np.float64) (i.e., use float64), for fairness with what you do in HDF5.
  • Move ran = np.random.rand(n, 784) outside the timed cell for both TileDB and HDF5 (a combined sketch follows after this list).
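For reference, a minimal sketch applying both changes might look like the following (the array name dense_array_f64 is just a placeholder for this sketch):

import numpy as np
import tiledb

n = 200000

# Generate the data before timing, so only the write itself is measured.
ran = np.random.rand(n, 784)  # float64 by default

dom = tiledb.Domain(tiledb.Dim(name="a", domain=(0, n-1), tile=n, dtype=np.int32),
                    tiledb.Dim(name="b", domain=(0, 784-1), tile=784, dtype=np.int32))

# float64 attribute to match PyTables' Float64Atom
schema = tiledb.ArraySchema(domain=dom,
                            sparse=False,
                            attrs=[
                                tiledb.Attr(name="a2", dtype=np.float64)
                            ])

array_name = "dense_array_f64"
tiledb.DenseArray.create(array_name, schema)

%%time
with tiledb.DenseArray(array_name, mode='w') as A:
    A[:, :] = ran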

How TileDB performs writes

TileDB follows a general algorithm that applies to both dense and sparse arrays, with any arbitrary filter (e.g., compression, encryption, etc) and all layouts. It works as follows:

  1. Receive the data to be written directly from the numpy internal buffer
  2. Prepare the tiles, by potentially reshuffling the data in case the layout to be written on disk is different from the layout inside the numpy buffers. This involves 1 full copy.
  3. Filter each tile, by applying further chunking and running the filter pipeline. This involves another full copy.
  4. Write the filtered tiles to disk.

Everything internally is heavily parallelized with TBB, but the algorithm still needs to perform the 2 full copies I mention above.
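As a side note, filters in TileDB-Py are attached per attribute, so step 3 only does real work when a filter pipeline is requested. A small sketch (Zstd level 5 is an arbitrary choice here):

import numpy as np
import tiledb

# Attach a compression filter to the attribute; with this, step 3 chunks each
# tile and runs the Zstd filter over it before the write.
filters = tiledb.FilterList([tiledb.ZstdFilter(level=5)])
attr = tiledb.Attr(name="a2", dtype=np.float64, filters=filters)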

What happens in your example

Your example is a special scenario where you have a single tile and, therefore, the cells in the numpy buffer already have the same layout as the one that will be written to disk. Moreover, you are not specifying any filter. Consequently, it is possible to avoid the 2 extra copies, which is what I assume HDF5 does. Hence the difference in performance.
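For contrast, a hypothetical multi-tile variant of your schema (with arbitrary, smaller tile extents) would not be a special scenario, because the numpy buffer would have to be reorganized into per-tile order before writing:

import numpy as np
import tiledb

n = 200000

# Hypothetical multi-tile schema (arbitrary extents): the rows are split
# across 20 tiles, so the write path must reshuffle the numpy buffer into
# per-tile layouts, i.e., the copies described above.
dom_multi = tiledb.Domain(tiledb.Dim(name="a", domain=(0, n-1), tile=10000, dtype=np.int32),
                          tiledb.Dim(name="b", domain=(0, 783), tile=784, dtype=np.int32))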

The solution

We need to optimize for special scenarios like this. It is fairly easy, before executing the query, to determine whether it is a special scenario where no cell shuffling or filtering is involved. In those cases, we should write directly from the user buffers (e.g., numpy arrays) to disk, bypassing the tile preparation and filtering steps.
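As a rough, hypothetical lower bound for comparison (not the planned optimization itself), one can time a raw buffer-to-disk write with no tiling or filtering:

import numpy as np

n = 200000
ran = np.random.rand(n, 784)

%%time
# Hypothetical baseline (not TileDB): dump the raw numpy buffer straight to
# disk, with no tile preparation or filtering.
ran.tofile("raw_baseline.bin")  # placeholder file name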

Thanks for putting this on our radar. We'll implement the optimizations and provide updates soon.

@ihnorton @joe-maley @Shelnutt2

graykode (Author) commented Jun 20, 2020

  • In the TileDB schema definition, change tiledb.Attr(name="a2", dtype=np.float32) to tiledb.Attr(name="a2", dtype=np.float64) (i.e., use float64)

Thanks for your reply.
First, I cannot create a schema dimension of float type: TileDBError: [TileDB::ArraySchema] Error: Cannot set domain; Dense arrays do not support dimension datatype 'FLOAT32 or 64'
Second, moving ran = np.random.rand(n, 784) out of the timed cell does not make a significant difference compared with the existing results.

I read your solution and understand it to mean simply saving the data as a raw buffer, like np.save(). Do you have example code for this scenario using TileDB?

Thanks @stavrospapadopoulos

stavrospapadopoulos (Member) commented

The float32 -> float64 change is for the attribute a2, not the dimensions. You currently set the attribute type to np.float32 (32-bit float), whereas in PyTables you use Float64Atom (64-bit float); in other words, you write half as much data in TileDB as in HDF5. Here is all you need to do:

schema = tiledb.ArraySchema(domain=dom, 
                            sparse=False, 
                            attrs=[
                                tiledb.Attr(name="a2", dtype=np.float64) # <- this is the change here
                            ])

Regarding the solution, you do not need to do anything :-). We need to implement the solution I suggested inside TileDB core and once we merge a patch, you will just get the performance boost without changing your code.

I hope this helps.

graykode (Author) commented Jun 20, 2020

Regarding the solution, you do not need to do anything :-). We need to implement the solution I suggested inside TileDB core and once we merge a patch, you will just get the performance boost without changing your code.

I understand what you're saying. Then I will wait for your patch. Thanks
