Blaze support #10

s-celles · 2015-07-24T07:18:16Z

Hello,

Blaze http://blaze.pydata.org/ is very efficient when you want to connect to a database and
want to retrieve data from a very long table.

It will be nice if gtabview (and maybe tabview also) could display blaze.interactive.InteractiveSymbol (named dat here or ds sometimes)

http://blaze.pydata.org/en/latest/quickstart.html
http://blaze.pydata.org/en/latest/rosetta-pandas.html

With Blaze, you can connect to a database table using

import blaze as bz
table_uri = 'dialect+driver://user:password@host:port/databasename::tablename'
dat = bz.Data(table_uri)

it's much more efficient than

import pandas as pd
import sqlalchemy
db_uri = 'dialect+driver://user:password@host:port/databasename'
engine = sqlalchemy.create_engine(db_uri)
query = 'SELECT * FROM tablename'
df = pd.read_sql(query, con=engine)

which will retrieve the whole table into memory.

Passing a table_uri to gtabview will display a part of table content (without retrieving the whole table into memory)

Blaze comes with a very convenient tool named odo http://odo.readthedocs.org/

DataFrame(s) can be contruct by chunk using `odo``

from blaze import odo, chunks
chunksize = 500
for chunk in odo(ds, chunks(pd.DataFrame), chunksize=chunksize):
    print(chunk)

with odo(..., chunks(pd.DataFrame)) you only have one chunk in memory at a time.

I can provide you a quite big MySQL table with Poitiers weather conditions (from 2011-03-07 to 2015-06-02 every 10 minutes - more than 200'000 rows) to try if you don't have a quite long table.

Kind regards

The text was updated successfully, but these errors were encountered:

wavexx · 2015-07-24T07:46:28Z

On 24/07/15 09:18, scls19fr wrote:

Hello,

Blaze http://blaze.pydata.org/ is very efficient when you want to
connect to a database and
want to retrieve data from a very long table.

Blaze would indeed be nice to support.

It would also be convenient to read xls files as well.

s-celles · 2015-07-24T07:47:44Z

Yes or also very big CSV file but maybe the way you are reading CSV is also efficient.

wavexx · 2015-07-24T07:51:48Z

On 24/07/15 09:47, scls19fr wrote:

Yes or also very big CSV file

Does blaze do anything in that regard?
AFAIK, most projects I've ever seen, load the csv into memory.

The notable exception is tabix, where an index needs to be constructed
first.

s-celles · 2015-07-24T08:07:32Z

I thinks Blaze don't load the whole CSV into memory

In [25]: %time df = pd.read_csv("meteo.csv")
CPU times: user 2.4 s, sys: 264 ms, total: 2.66 s
Wall time: 2.78 s

In [26]: %time dat = bz.Data("meteo.csv")
CPU times: user 71.1 ms, sys: 20.5 ms, total: 91.6 ms
Wall time: 106 ms

s-celles · 2015-07-24T08:21:46Z

I'm not sure Blaze supports Excel files. Pandas does (.xls and .xlsx) but the whole content need to be in memory.

In [32]: %time dat = bz.Data("meteo.xlsx")

raises

NotImplementedError: Unable to parse uri to data resource: meteo.xlsx

Reading Excel files with Pandas can be long

In [33]: %time df = pd.read_excel("meteo.xlsx")
CPU times: user 2min 22s, sys: 4.97 s, total: 2min 27s
Wall time: 2min 48s

wavexx · 2015-07-24T09:04:52Z

On 24/07/15 10:21, scls19fr wrote:

I'm not sure Blaze support Excel files. Pandas does (.xls and .xlsx) but
the whole content need to be in memory.

Yes, pandas use xlrd, which doesn't do anything special.

There is also pyExcelerator, but I never used it to see if it supports
partial reading. xlsx files internally are xml-based, so I doubt.

s-celles · 2015-07-24T10:34:37Z

Anyway this could (should) be done inside Blaze not gtabview

wavexx · 2015-07-24T11:33:34Z

On 24/07/15 12:34, scls19fr wrote:

Anyway this could (should) be done inside Blaze not gtabview

Well I added some simple support using "xlrd" now.
If xlrd is installed (likely, if you already have matplotlib), you can
now also read excel files.

I added a --sheet/-S flag to select the sheet.

s-celles · 2015-07-24T12:07:54Z

That's always a good thing but

$ gtabview random.xlsx

raises:

Traceback (most recent call last):
  File "//anaconda/bin/gtabview", line 4, in <module>
    __import__('pkg_resources').run_script('gtabview==0.1', 'gtabview')
  File "//anaconda/lib/python3.4/site-packages/setuptools-18.0.1-py3.4.egg/pkg_resources/__init__.py", line 735, in run_script
  File "//anaconda/lib/python3.4/site-packages/setuptools-18.0.1-py3.4.egg/pkg_resources/__init__.py", line 1659, in run_script
  File "//anaconda/lib/python3.4/site-packages/gtabview-0.1-py3.4.egg/EGG-INFO/scripts/gtabview", line 88, in <module>
  File "//anaconda/lib/python3.4/site-packages/gtabview-0.1-py3.4.egg/gtabview/__init__.py", line 111, in view
  File "//anaconda/lib/python3.4/site-packages/gtabview-0.1-py3.4.egg/gtabview/dataio.py", line 80, in read_table
  File "//anaconda/lib/python3.4/site-packages/gtabview-0.1-py3.4.egg/gtabview/dataio.py", line 68, in read_xlrd
  File "//anaconda/lib/python3.4/site-packages/xlrd/book.py", line 432, in sheet_by_index
    return self._sheet_list[sheetx] or self.get_sheet(sheetx)
TypeError: list indices must be integers, not NoneType

wavexx · 2015-07-24T12:10:15Z

On 24/07/15 14:07, scls19fr wrote:

That's always a good thing but

|$ gtabview random.xlsx |

Send me the file if you can.

s-celles · 2015-07-24T12:10:47Z

You might also fix

--start_pos START_POS, -s START_POS
                    Initial cursor display position. Single number for
                    just y (row) position, or two comma-separated numbers
                    (--start_pos 2,3) for both. Alternatively, you can
                    pass the numbers in the more classic +y:[x] format
                    without the --start_pos label. Like 'tabview <fn>
                    +5:10'

replace tabview by gtabview

s-celles · 2015-07-24T12:13:01Z

File(s) was created using:

df = pd.DataFrame(np.random.random((5,3)), columns=['A', 'B', 'C'])
df.to_excel("random.xls")
df.to_excel("random.xlsx")

wavexx · 2015-07-24T12:16:25Z

On 24/07/15 14:10, scls19fr wrote:

You might also fix

|--start_pos START_POS, -s START_POS Initial cursor display position.
Single number for just y (row) position, or two comma-separated numbers
(--start_pos 2,3) for both. Alternatively, you can pass the numbers in
the more classic +y:[x] format without the --start_pos label. Like
'tabview +5:10' |

Both should be fixed.

s-celles · 2015-07-24T12:20:20Z

$ gtabview random.xlsx
Traceback (most recent call last):
  File "//anaconda/bin/gtabview", line 4, in <module>
    __import__('pkg_resources').run_script('gtabview==0.1', 'gtabview')
  File "//anaconda/lib/python3.4/site-packages/setuptools-18.0.1-py3.4.egg/pkg_resources/__init__.py", line 735, in run_script
  File "//anaconda/lib/python3.4/site-packages/setuptools-18.0.1-py3.4.egg/pkg_resources/__init__.py", line 1659, in run_script
  File "//anaconda/lib/python3.4/site-packages/gtabview-0.1-py3.4.egg/EGG-INFO/scripts/gtabview", line 88, in <module>
  File "//anaconda/lib/python3.4/site-packages/gtabview-0.1-py3.4.egg/gtabview/__init__.py", line 111, in view
  File "//anaconda/lib/python3.4/site-packages/gtabview-0.1-py3.4.egg/gtabview/dataio.py", line 99, in read_table
TypeError: object of type 'float' has no len()

wavexx · 2015-07-24T12:39:36Z

On 24/07/15 14:20, scls19fr wrote:

|$ gtabview random.xlsx Traceback (most recent call last): File

Hopefully fixed.

Thanks for sending me the file, at the moment the xls writer in pandas
doesn't work for me (some exception in openpyxl).

s-celles · 2015-07-24T12:44:46Z

It works with random.xls and random.xlsx

meteo.xlsx that I send you by email doesn't open in an acceptable time.
(Excel is able to open this file in less than 1 minute)

wavexx · 2015-07-24T12:47:32Z

On 24/07/15 14:44, scls19fr wrote:

It works with |random.xls| and |random.xlsx|

|meteo.xlsx| that I send you by email doesn't open in an acceptable time.
(Excel is able to open this file in less than 1 minute)

I cannot test that file yet, since I cannot download large stuff from here.

How many rows/columns is that? Does it fit in a xls file? I could try to
generate one here.

I doubt it's a problem in gtabview itself, xlrd itself seems to be
really slow. Just opening a <1mb file takes seconds on my system.

Maybe openpyxl is faster in that regard.

s-celles · 2015-07-24T12:57:38Z

It's a 19.6 Mb file with 23 columns and 208655 rows

wavexx · 2015-07-24T12:58:52Z

On 24/07/15 14:57, scls19fr wrote:

It's a 19.6 Mb file with 23 columns and 208655 rows

if you can squeeze that into <1mb, I can try to get it.

s-celles · 2015-07-24T13:01:38Z

zip reduces size to 17.3 Mb

openpyxl seems to support "big" files
https://openpyxl.readthedocs.org/en/latest/optimized.html

wavexx · 2015-07-24T13:16:47Z

On 24/07/15 14:47, Yuri D'Elia wrote:

I doubt it's a problem in gtabview itself, xlrd itself seems to be
really slow. Just opening a <1mb file takes seconds on my system.

Maybe openpyxl is faster in that regard.

I just tried with the openpyxl "optimized" reader, and on a 1mb file
it's 1 second slower than xlrd.

pd.read_excel/gtabview seem to be identical in timing as well.. so I
don't think there's much we can improve on this.

wavexx · 2015-07-28T11:40:50Z

I'll need some help to get up-to-speed with blaze.
What's the difference between blaze.Data.fields and columns?
How do I access a column by index efficiently?

s-celles · 2015-07-28T16:59:03Z

Hello,

sorry I wasn't here

 dat[dat.columns[3]]

can access to the third column

but I can't say if that's efficient.

Kind regards

PS:

dat[dat.fields[3]]

returns same

In [26]: type(dat[dat.fields[3]])
Out[26]: blaze.expr.expressions.Field

In [27]: type(dat[dat.columns[3]])
Out[27]: blaze.expr.expressions.Field

wavexx · 2015-07-28T17:28:27Z

On 28/07/15 18:59, scls19fr wrote:

Hello,

sorry I was here

|dat[dat.fields[3]] |

can access to the third columns

but I can't say if that's efficient.

Do you also know if there's a way to know the number of the results in
advance?

Even for a csv, dshape doesn't contain the row count.

s-celles · 2015-07-28T18:32:31Z

I don't know Blaze a lot but with IPython

dat. <tab> helps me to find:

dat.nrows

or

dat.nelements()

s-celles · 2015-07-28T18:33:51Z

and

len(dat.columns)

to know number of columns

s-celles · 2015-07-28T18:39:27Z

caution about this

In [72]: type(dat.nrows)
Out[72]: blaze.expr.reductions.nelements

So it may be better to cast to int

int(dat.nrows)

wavexx · 2015-07-28T19:07:04Z

On 28/07/15 20:39, scls19fr wrote:

caution about this

|In [72]: type(dat.nrows) Out[72]: blaze.expr.reductions.nelements |

So it may be better to cast to int

I've got some basics going, but it seems that blaze itself doesn't
perform any caching about the result.

Querying list(data[col][0:1]) to get one element from a column, for
example, would issue two sql queries if done twice.

Are you aware if there's some built-in caching mechanism in blaze, if it
can be turned on, or if I need to build one?

s-celles · 2015-07-28T19:17:55Z

My idea is you might get a DataFrame from Blaze.
But not the whole data... just what can be view in gtabview.

import pandas as pd
from odo import odo
df = odo(dat[dat.columns[0:5]][10:15], pd.DataFrame)

it should avoid several queries.

wavexx · 2015-07-28T19:26:39Z

On 28/07/15 21:17, scls19fr wrote:

My idea is you might get a DataFrame from Blaze.
But not the whole data... just what can be view in gtabview.

|import pandas as pd from odo import odo df =
odo(dat[dat.columns[0:5]][10:15], pd.DataFrame) |

it should avoid several queries.

Yes, however when scrolling is involved and a new row is in view, you
don't want to re-fetch the previous ones.

Making the query itself in blaze, even for one row is quite expensive,
so some sort of caching is required to make viewing "decent".

s-celles · 2015-07-29T06:38:51Z

Why not trying first without this cache mechanism ? and if that's really too long, implement this cache mechanism.

I always try to recall Donald Knuth sentence: "premature optimization is the root of all evil"

On my computer screen I can display less than 23 columns and 25 rows

23x25=575

being pessimistic (very big screen) and considering 1000 elements.

import pandas as pd
from sqlalchemy import create_engine
import numpy as np
import blaze as bz
from odo import odo

N = 1000

def pandas_read_sql(N):
    db_uri = "mysql+mysqlconnector://root:root@127.0.0.1:3306/meteo"
    engine = create_engine(db_uri)
    df = pd.read_sql("SELECT * FROM data LIMIT %d" % N, engine)
    return df

def blaze_odo(N):
    table_uri = "mysql+mysqlconnector://root:root@127.0.0.1:3306/meteo::data"
    dat = bz.Data(table_uri)
    df = odo(dat[0:N], pd.DataFrame)
    return df

In [26]: %time df = pandas_read_sql(N)
CPU times: user 132 ms, sys: 14 ms, total: 146 ms
Wall time: 151 ms

In [33]: %time df = blaze_odo(N)
//anaconda/lib/python3.4/site-packages/blaze/compute/sql.py:856: UserWarning: The order of the result set from a Slice expression computed against the SQL backend is not deterministic.
warnings.warn('The order of the result set from a Slice expression '
CPU times: user 158 ms, sys: 8.26 ms, total: 167 ms

Maybe we can accept waiting less than 0.2 s before fetching a new row (and also fetching again rows we have ever fetched)

wavexx · 2015-07-30T10:10:51Z

On 29/07/15 08:38, scls19fr wrote:

Why not trying first without this cache mechanism ? and if that's really
too long, implement this cache mechanism.

It's really too slow.

Since I'm fetching each cell independently, the blaze overhead for a
screenful (30x20) is ~15 seconds.

Also, this breaks completely column auto-sizing.

Blaze really requires some batch manipulation, and it's definitely
slower for datasets that fit into memory.

wavexx · 2015-07-30T16:19:21Z

It's in.

import blaze as bz
from gtabview import view
data = bz.Data('data_ohlcv.csv')
view(data)

I tested it against some large postesql tables and also seems ok (although blaze reports a warning about non-deterministic slicing).

s-celles · 2015-07-30T17:19:16Z

Nice!

I've just test it with a MySQL table with 200 000 rows and that's very convenient.

Thanks a lot.

I think next feature should be to detect that view parameter is a table URI
and to automatically create a blaze.interactive.InteractiveSymbol from this table URI.

So it will be possible to do:

view(table_uri)

and with console

$ gtabview dialect+driver://user:password@host:port/databasename::tablename

CAUTION: I think this last command could lead some security problem because of Bash history
so maybe you should provide a command like:

$ gtabview secure

which open a window to paste URI (with password) or anything that view can accept. So no one could see password in history.

s-celles · 2015-07-30T17:38:13Z

I think you will have to detect that parameter is a table URI using a regex.

regex101 is great for this https://regex101.com/

an other approach could be to split filename and extension

filename, file_extension = os.path.splitext('/path/to/somefile.ext')

and if file_extension is an empty string dispatch to DB table viewer.

wavexx · 2015-07-30T17:41:02Z

What about a blaze+[blaze uri] ?
This would make loading always un-ambiguous.

gtabview blaze+file.csv would load the file through blaze as opposed to anything else, which is a good thing since for small csv files blaze is actually slower.

As a convenience, if the argument looks like an url, would could also try blaze:

If url contains blaze+: try blaze
elif url looks like url: try blaze
load normally

Loading an hdf file directly wouldn't work out of the box (you'd still need blaze+file.hdf), but using blaze by default sounds like a bad idea.

s-celles · 2015-07-30T17:52:41Z

I really want to be able to directly open table URI without specifying blaze+ (it seems to be your idea with "elif url looks like url: try blaze")

You are right, you could load CSV using "classic" method or using blaze.

For this I think an other parameter should be given

it's much more explicit
and it could lead to some mistakes
- 'blaze+dialect+driver://user:password@host:port/databasename::tablename'
- 'blaze+dialect://user:password@host:port/databasename::tablename'

Load csv using Blaze

$ gtabview file.csv --engine blaze

Load csv using classic method

$ gtabview file.csv

$ gtabview file.csv --engine default

s-celles · 2015-07-30T17:57:09Z

Some database URI examples

http://docs.sqlalchemy.org/en/rel_1_0/core/engines.html

sqlite:///:memory: is also a database URI

table URI are database URI with ::tablename

wavexx · 2015-07-30T18:00:11Z

I'd like to keep gtabview.view() (the function) and gtabview (the
command) identical at least in behavior.

However, this would imply an extra parameter to view, which I would like
to avoid.

In gtabview.view() there's also less need to use blaze under the hood.
If you want blaze, you can supply a blaze object directly to view().

Somehow, blaze refuses to load file://[path], which would have make the
behavior perfect.

I tried with a couple other syntaxes, but no success.

s-celles · 2015-07-30T19:54:21Z

Personally I would prefer gtabview.view() to (also) accept table URI and build bz.Data(...) inside.
That's the spirit of my PR pandas-dev/pandas#10666 which allow pandas.read_sql and DataFrame.to_sql to accept a db URI as con and build a SQLAlchemy engine inside.

wavexx · 2015-07-31T12:42:46Z

I committed some changes which should be reasonable enough:

If the path looks like an URI, use Blaze, otherwise read/handle it normally.

So gtabview file would read it directly, while gtabview file://file would pass through blaze. This is always un-ambiguous. Other uri-like strings will be handled by blaze, so you can read DB uris directly.

s-celles · 2015-07-31T14:16:03Z

Yes, that's a reasonable approach of the problem.

Maybe that's time to publish a new package version of gtabview.

A nice feature will be to add all your work to tabview (curses) because displaying
throught SSH content of a table can be very useful.

s-celles · 2015-07-31T14:23:54Z

When you write:

$ gtabview file://filename

Is filename a relative path to filename or an absolute ?

if that's a relative path to filename, how an absolute path should be given ?

wavexx · 2015-07-31T14:30:28Z

On 31/07/15 16:23, scls19fr wrote:

When you write:

|$ gtabview file://filename |

Is filename a relative path to filename or an absolute ?

if that's a relative path to filename, how an absolute path should be
given ?

It's relative. The path starts after ://.

If you want absolute, simply use file:///

s-celles · 2015-07-31T14:31:38Z

ok thanks

wavexx · 2015-07-31T18:47:30Z

Closing

wavexx · 2015-07-31T18:51:52Z

On 31/07/15 16:16, scls19fr wrote:

Maybe that's time to publish a new package version of gtabview.

Done right now.

I wanted to complete at least some docstrings in view() itself this time.

s-celles · 2015-07-31T19:13:34Z

Thanks. This is a really great feature.

s-celles mentioned this issue Jul 31, 2015

Blaze support TabViewer/tabview#127

Open

wavexx closed this as completed Jul 31, 2015

Blaze support #10

Blaze support #10

Comments

s-celles commented Jul 24, 2015

wavexx commented Jul 24, 2015

Uh oh!

s-celles commented Jul 24, 2015

Uh oh!

wavexx commented Jul 24, 2015

Uh oh!

s-celles commented Jul 24, 2015

Uh oh!

s-celles commented Jul 24, 2015

Uh oh!

wavexx commented Jul 24, 2015

Uh oh!

s-celles commented Jul 24, 2015

Uh oh!

wavexx commented Jul 24, 2015

Uh oh!

s-celles commented Jul 24, 2015

Uh oh!

wavexx commented Jul 24, 2015

Uh oh!

s-celles commented Jul 24, 2015

Uh oh!

s-celles commented Jul 24, 2015

Uh oh!

wavexx commented Jul 24, 2015

Uh oh!

s-celles commented Jul 24, 2015

Uh oh!

wavexx commented Jul 24, 2015

Uh oh!

s-celles commented Jul 24, 2015

Uh oh!

wavexx commented Jul 24, 2015

Uh oh!

s-celles commented Jul 24, 2015

Uh oh!

wavexx commented Jul 24, 2015

Uh oh!

s-celles commented Jul 24, 2015

Uh oh!

wavexx commented Jul 24, 2015

Uh oh!

wavexx commented Jul 28, 2015

Uh oh!

s-celles commented Jul 28, 2015

Uh oh!

wavexx commented Jul 28, 2015

Uh oh!

s-celles commented Jul 28, 2015

Uh oh!

s-celles commented Jul 28, 2015

Uh oh!

s-celles commented Jul 28, 2015

Uh oh!

wavexx commented Jul 28, 2015

Uh oh!

s-celles commented Jul 28, 2015

Uh oh!

wavexx commented Jul 28, 2015

Uh oh!

s-celles commented Jul 29, 2015

Uh oh!

wavexx commented Jul 30, 2015

Uh oh!

wavexx commented Jul 30, 2015

Uh oh!

s-celles commented Jul 30, 2015

Uh oh!

s-celles commented Jul 30, 2015

Uh oh!

wavexx commented Jul 30, 2015

Uh oh!

s-celles commented Jul 30, 2015

Uh oh!

s-celles commented Jul 30, 2015

Uh oh!