Skip to content

Blaze support #10

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
s-celles opened this issue Jul 24, 2015 · 48 comments
Closed

Blaze support #10

s-celles opened this issue Jul 24, 2015 · 48 comments

Comments

@s-celles
Copy link

Hello,

Blaze http://blaze.pydata.org/ is very efficient when you want to connect to a database and
want to retrieve data from a very long table.

It will be nice if gtabview (and maybe tabview also) could display blaze.interactive.InteractiveSymbol (named dat here or ds sometimes)

http://blaze.pydata.org/en/latest/quickstart.html
http://blaze.pydata.org/en/latest/rosetta-pandas.html

With Blaze, you can connect to a database table using

import blaze as bz
table_uri = 'dialect+driver://user:password@host:port/databasename::tablename'
dat = bz.Data(table_uri)

it's much more efficient than

import pandas as pd
import sqlalchemy
db_uri = 'dialect+driver://user:password@host:port/databasename'
engine = sqlalchemy.create_engine(db_uri)
query = 'SELECT * FROM tablename'
df = pd.read_sql(query, con=engine)

which will retrieve the whole table into memory.

Passing a table_uri to gtabview will display a part of table content (without retrieving the whole table into memory)

Blaze comes with a very convenient tool named odo http://odo.readthedocs.org/

DataFrame(s) can be contruct by chunk using `odo``

from blaze import odo, chunks
chunksize = 500
for chunk in odo(ds, chunks(pd.DataFrame), chunksize=chunksize):
    print(chunk)

with odo(..., chunks(pd.DataFrame)) you only have one chunk in memory at a time.

I can provide you a quite big MySQL table with Poitiers weather conditions (from 2011-03-07 to 2015-06-02 every 10 minutes - more than 200'000 rows) to try if you don't have a quite long table.

Kind regards

@wavexx
Copy link
Member

wavexx commented Jul 24, 2015

On 24/07/15 09:18, scls19fr wrote:

Hello,

Blaze http://blaze.pydata.org/ is very efficient when you want to
connect to a database and
want to retrieve data from a very long table.

Blaze would indeed be nice to support.

It would also be convenient to read xls files as well.

@s-celles
Copy link
Author

Yes or also very big CSV file but maybe the way you are reading CSV is also efficient.

@wavexx
Copy link
Member

wavexx commented Jul 24, 2015

On 24/07/15 09:47, scls19fr wrote:

Yes or also very big CSV file

Does blaze do anything in that regard?
AFAIK, most projects I've ever seen, load the csv into memory.

The notable exception is tabix, where an index needs to be constructed
first.

@s-celles
Copy link
Author

I thinks Blaze don't load the whole CSV into memory

In [25]: %time df = pd.read_csv("meteo.csv")
CPU times: user 2.4 s, sys: 264 ms, total: 2.66 s
Wall time: 2.78 s

In [26]: %time dat = bz.Data("meteo.csv")
CPU times: user 71.1 ms, sys: 20.5 ms, total: 91.6 ms
Wall time: 106 ms

@s-celles
Copy link
Author

I'm not sure Blaze supports Excel files. Pandas does (.xls and .xlsx) but the whole content need to be in memory.

In [32]: %time dat = bz.Data("meteo.xlsx")

raises

NotImplementedError: Unable to parse uri to data resource: meteo.xlsx

Reading Excel files with Pandas can be long

In [33]: %time df = pd.read_excel("meteo.xlsx")
CPU times: user 2min 22s, sys: 4.97 s, total: 2min 27s
Wall time: 2min 48s

@wavexx
Copy link
Member

wavexx commented Jul 24, 2015

On 24/07/15 10:21, scls19fr wrote:

I'm not sure Blaze support Excel files. Pandas does (.xls and .xlsx) but
the whole content need to be in memory.

Yes, pandas use xlrd, which doesn't do anything special.

There is also pyExcelerator, but I never used it to see if it supports
partial reading. xlsx files internally are xml-based, so I doubt.

@s-celles
Copy link
Author

Anyway this could (should) be done inside Blaze not gtabview

@wavexx
Copy link
Member

wavexx commented Jul 24, 2015

On 24/07/15 12:34, scls19fr wrote:

Anyway this could (should) be done inside Blaze not gtabview

Well I added some simple support using "xlrd" now.
If xlrd is installed (likely, if you already have matplotlib), you can
now also read excel files.

I added a --sheet/-S flag to select the sheet.

@s-celles
Copy link
Author

That's always a good thing but

$ gtabview random.xlsx

raises:

Traceback (most recent call last):
  File "//anaconda/bin/gtabview", line 4, in <module>
    __import__('pkg_resources').run_script('gtabview==0.1', 'gtabview')
  File "//anaconda/lib/python3.4/site-packages/setuptools-18.0.1-py3.4.egg/pkg_resources/__init__.py", line 735, in run_script
  File "//anaconda/lib/python3.4/site-packages/setuptools-18.0.1-py3.4.egg/pkg_resources/__init__.py", line 1659, in run_script
  File "//anaconda/lib/python3.4/site-packages/gtabview-0.1-py3.4.egg/EGG-INFO/scripts/gtabview", line 88, in <module>
  File "//anaconda/lib/python3.4/site-packages/gtabview-0.1-py3.4.egg/gtabview/__init__.py", line 111, in view
  File "//anaconda/lib/python3.4/site-packages/gtabview-0.1-py3.4.egg/gtabview/dataio.py", line 80, in read_table
  File "//anaconda/lib/python3.4/site-packages/gtabview-0.1-py3.4.egg/gtabview/dataio.py", line 68, in read_xlrd
  File "//anaconda/lib/python3.4/site-packages/xlrd/book.py", line 432, in sheet_by_index
    return self._sheet_list[sheetx] or self.get_sheet(sheetx)
TypeError: list indices must be integers, not NoneType

@wavexx
Copy link
Member

wavexx commented Jul 24, 2015

On 24/07/15 14:07, scls19fr wrote:

That's always a good thing but

|$ gtabview random.xlsx |

Send me the file if you can.

@s-celles
Copy link
Author

You might also fix

--start_pos START_POS, -s START_POS
                    Initial cursor display position. Single number for
                    just y (row) position, or two comma-separated numbers
                    (--start_pos 2,3) for both. Alternatively, you can
                    pass the numbers in the more classic +y:[x] format
                    without the --start_pos label. Like 'tabview <fn>
                    +5:10'

replace tabview by gtabview

@s-celles
Copy link
Author

File(s) was created using:

df = pd.DataFrame(np.random.random((5,3)), columns=['A', 'B', 'C'])
df.to_excel("random.xls")
df.to_excel("random.xlsx")

@wavexx
Copy link
Member

wavexx commented Jul 24, 2015

On 24/07/15 14:10, scls19fr wrote:

You might also fix

|--start_pos START_POS, -s START_POS Initial cursor display position.
Single number for just y (row) position, or two comma-separated numbers
(--start_pos 2,3) for both. Alternatively, you can pass the numbers in
the more classic +y:[x] format without the --start_pos label. Like
'tabview +5:10' |

Both should be fixed.

@s-celles
Copy link
Author

$ gtabview random.xlsx
Traceback (most recent call last):
  File "//anaconda/bin/gtabview", line 4, in <module>
    __import__('pkg_resources').run_script('gtabview==0.1', 'gtabview')
  File "//anaconda/lib/python3.4/site-packages/setuptools-18.0.1-py3.4.egg/pkg_resources/__init__.py", line 735, in run_script
  File "//anaconda/lib/python3.4/site-packages/setuptools-18.0.1-py3.4.egg/pkg_resources/__init__.py", line 1659, in run_script
  File "//anaconda/lib/python3.4/site-packages/gtabview-0.1-py3.4.egg/EGG-INFO/scripts/gtabview", line 88, in <module>
  File "//anaconda/lib/python3.4/site-packages/gtabview-0.1-py3.4.egg/gtabview/__init__.py", line 111, in view
  File "//anaconda/lib/python3.4/site-packages/gtabview-0.1-py3.4.egg/gtabview/dataio.py", line 99, in read_table
TypeError: object of type 'float' has no len()

@wavexx
Copy link
Member

wavexx commented Jul 24, 2015

On 24/07/15 14:20, scls19fr wrote:

|$ gtabview random.xlsx Traceback (most recent call last): File

Hopefully fixed.

Thanks for sending me the file, at the moment the xls writer in pandas
doesn't work for me (some exception in openpyxl).

@s-celles
Copy link
Author

It works with random.xls and random.xlsx

meteo.xlsx that I send you by email doesn't open in an acceptable time.
(Excel is able to open this file in less than 1 minute)

@wavexx
Copy link
Member

wavexx commented Jul 24, 2015

On 24/07/15 14:44, scls19fr wrote:

It works with |random.xls| and |random.xlsx|

|meteo.xlsx| that I send you by email doesn't open in an acceptable time.
(Excel is able to open this file in less than 1 minute)

I cannot test that file yet, since I cannot download large stuff from here.

How many rows/columns is that? Does it fit in a xls file? I could try to
generate one here.

I doubt it's a problem in gtabview itself, xlrd itself seems to be
really slow. Just opening a <1mb file takes seconds on my system.

Maybe openpyxl is faster in that regard.

@s-celles
Copy link
Author

It's a 19.6 Mb file with 23 columns and 208655 rows

@wavexx
Copy link
Member

wavexx commented Jul 24, 2015

On 24/07/15 14:57, scls19fr wrote:

It's a 19.6 Mb file with 23 columns and 208655 rows

if you can squeeze that into <1mb, I can try to get it.

@s-celles
Copy link
Author

zip reduces size to 17.3 Mb

openpyxl seems to support "big" files
https://openpyxl.readthedocs.org/en/latest/optimized.html

@wavexx
Copy link
Member

wavexx commented Jul 24, 2015

On 24/07/15 14:47, Yuri D'Elia wrote:

I doubt it's a problem in gtabview itself, xlrd itself seems to be
really slow. Just opening a <1mb file takes seconds on my system.

Maybe openpyxl is faster in that regard.

I just tried with the openpyxl "optimized" reader, and on a 1mb file
it's 1 second slower than xlrd.

pd.read_excel/gtabview seem to be identical in timing as well.. so I
don't think there's much we can improve on this.

@wavexx
Copy link
Member

wavexx commented Jul 28, 2015

I'll need some help to get up-to-speed with blaze.
What's the difference between blaze.Data.fields and columns?
How do I access a column by index efficiently?

@s-celles
Copy link
Author

Hello,

sorry I wasn't here

 dat[dat.columns[3]]

can access to the third column

but I can't say if that's efficient.

Kind regards

PS:

dat[dat.fields[3]]

returns same

In [26]: type(dat[dat.fields[3]])
Out[26]: blaze.expr.expressions.Field

In [27]: type(dat[dat.columns[3]])
Out[27]: blaze.expr.expressions.Field

@wavexx
Copy link
Member

wavexx commented Jul 28, 2015

On 28/07/15 18:59, scls19fr wrote:

Hello,

sorry I was here

|dat[dat.fields[3]] |

can access to the third columns

but I can't say if that's efficient.

Do you also know if there's a way to know the number of the results in
advance?

Even for a csv, dshape doesn't contain the row count.

@s-celles
Copy link
Author

I don't know Blaze a lot but with IPython

dat. <tab> helps me to find:

dat.nrows

or

dat.nelements()

@s-celles
Copy link
Author

and

len(dat.columns)

to know number of columns

@s-celles
Copy link
Author

caution about this

In [72]: type(dat.nrows)
Out[72]: blaze.expr.reductions.nelements

So it may be better to cast to int

int(dat.nrows)

@wavexx
Copy link
Member

wavexx commented Jul 28, 2015

On 28/07/15 20:39, scls19fr wrote:

caution about this

|In [72]: type(dat.nrows) Out[72]: blaze.expr.reductions.nelements |

So it may be better to cast to int

I've got some basics going, but it seems that blaze itself doesn't
perform any caching about the result.

Querying list(data[col][0:1]) to get one element from a column, for
example, would issue two sql queries if done twice.

Are you aware if there's some built-in caching mechanism in blaze, if it
can be turned on, or if I need to build one?

@s-celles
Copy link
Author

My idea is you might get a DataFrame from Blaze.
But not the whole data... just what can be view in gtabview.

import pandas as pd
from odo import odo
df = odo(dat[dat.columns[0:5]][10:15], pd.DataFrame)

it should avoid several queries.

@wavexx
Copy link
Member

wavexx commented Jul 28, 2015

On 28/07/15 21:17, scls19fr wrote:

My idea is you might get a DataFrame from Blaze.
But not the whole data... just what can be view in gtabview.

|import pandas as pd from odo import odo df =
odo(dat[dat.columns[0:5]][10:15], pd.DataFrame) |

it should avoid several queries.

Yes, however when scrolling is involved and a new row is in view, you
don't want to re-fetch the previous ones.

Making the query itself in blaze, even for one row is quite expensive,
so some sort of caching is required to make viewing "decent".

@s-celles
Copy link
Author

Why not trying first without this cache mechanism ? and if that's really too long, implement this cache mechanism.

I always try to recall Donald Knuth sentence: "premature optimization is the root of all evil"

On my computer screen I can display less than 23 columns and 25 rows

23x25=575

being pessimistic (very big screen) and considering 1000 elements.

import pandas as pd
from sqlalchemy import create_engine
import numpy as np
import blaze as bz
from odo import odo

N = 1000

def pandas_read_sql(N):
    db_uri = "mysql+mysqlconnector://root:root@127.0.0.1:3306/meteo"
    engine = create_engine(db_uri)
    df = pd.read_sql("SELECT * FROM data LIMIT %d" % N, engine)
    return df

def blaze_odo(N):
    table_uri = "mysql+mysqlconnector://root:root@127.0.0.1:3306/meteo::data"
    dat = bz.Data(table_uri)
    df = odo(dat[0:N], pd.DataFrame)
    return df

In [26]: %time df = pandas_read_sql(N)
CPU times: user 132 ms, sys: 14 ms, total: 146 ms
Wall time: 151 ms

In [33]: %time df = blaze_odo(N)
//anaconda/lib/python3.4/site-packages/blaze/compute/sql.py:856: UserWarning: The order of the result set from a Slice expression computed against the SQL backend is not deterministic.
warnings.warn('The order of the result set from a Slice expression '
CPU times: user 158 ms, sys: 8.26 ms, total: 167 ms

Maybe we can accept waiting less than 0.2 s before fetching a new row (and also fetching again rows we have ever fetched)

@wavexx
Copy link
Member

wavexx commented Jul 30, 2015

On 29/07/15 08:38, scls19fr wrote:

Why not trying first without this cache mechanism ? and if that's really
too long, implement this cache mechanism.

It's really too slow.

Since I'm fetching each cell independently, the blaze overhead for a
screenful (30x20) is ~15 seconds.

Also, this breaks completely column auto-sizing.

Blaze really requires some batch manipulation, and it's definitely
slower for datasets that fit into memory.

@wavexx
Copy link
Member

wavexx commented Jul 30, 2015

It's in.

import blaze as bz
from gtabview import view
data = bz.Data('data_ohlcv.csv')
view(data)

I tested it against some large postesql tables and also seems ok (although blaze reports a warning about non-deterministic slicing).

@s-celles
Copy link
Author

Nice!

I've just test it with a MySQL table with 200 000 rows and that's very convenient.

Thanks a lot.

I think next feature should be to detect that view parameter is a table URI
and to automatically create a blaze.interactive.InteractiveSymbol from this table URI.

So it will be possible to do:

view(table_uri)

and with console

$ gtabview dialect+driver://user:password@host:port/databasename::tablename

CAUTION: I think this last command could lead some security problem because of Bash history
so maybe you should provide a command like:

$ gtabview secure

which open a window to paste URI (with password) or anything that view can accept. So no one could see password in history.

@s-celles
Copy link
Author

I think you will have to detect that parameter is a table URI using a regex.

regex101 is great for this https://regex101.com/

an other approach could be to split filename and extension

filename, file_extension = os.path.splitext('/path/to/somefile.ext')

and if file_extension is an empty string dispatch to DB table viewer.

@wavexx
Copy link
Member

wavexx commented Jul 30, 2015

What about a blaze+[blaze uri] ?
This would make loading always un-ambiguous.

gtabview blaze+file.csv would load the file through blaze as opposed to anything else, which is a good thing since for small csv files blaze is actually slower.

As a convenience, if the argument looks like an url, would could also try blaze:

  • If url contains blaze+: try blaze
  • elif url looks like url: try blaze
  • load normally

Loading an hdf file directly wouldn't work out of the box (you'd still need blaze+file.hdf), but using blaze by default sounds like a bad idea.

@s-celles
Copy link
Author

I really want to be able to directly open table URI without specifying blaze+ (it seems to be your idea with "elif url looks like url: try blaze")

You are right, you could load CSV using "classic" method or using blaze.

For this I think an other parameter should be given

  • it's much more explicit
  • and it could lead to some mistakes
    • 'blaze+dialect+driver://user:password@host:port/databasename::tablename'
    • 'blaze+dialect://user:password@host:port/databasename::tablename'

Load csv using Blaze

$ gtabview file.csv --engine blaze

Load csv using classic method

$ gtabview file.csv

$ gtabview file.csv --engine default

@s-celles
Copy link
Author

Some database URI examples

http://docs.sqlalchemy.org/en/rel_1_0/core/engines.html

sqlite:///:memory: is also a database URI

table URI are database URI with ::tablename

@wavexx
Copy link
Member

wavexx commented Jul 30, 2015

I'd like to keep gtabview.view() (the function) and gtabview (the
command) identical at least in behavior.

However, this would imply an extra parameter to view, which I would like
to avoid.

In gtabview.view() there's also less need to use blaze under the hood.
If you want blaze, you can supply a blaze object directly to view().

Somehow, blaze refuses to load file://[path], which would have make the
behavior perfect.

I tried with a couple other syntaxes, but no success.

@s-celles
Copy link
Author

Personally I would prefer gtabview.view() to (also) accept table URI and build bz.Data(...) inside.
That's the spirit of my PR pandas-dev/pandas#10666 which allow pandas.read_sql and DataFrame.to_sql to accept a db URI as con and build a SQLAlchemy engine inside.

@wavexx
Copy link
Member

wavexx commented Jul 31, 2015

I committed some changes which should be reasonable enough:

If the path looks like an URI, use Blaze, otherwise read/handle it normally.

So gtabview file would read it directly, while gtabview file://file would pass through blaze. This is always un-ambiguous. Other uri-like strings will be handled by blaze, so you can read DB uris directly.

@s-celles
Copy link
Author

Yes, that's a reasonable approach of the problem.

Maybe that's time to publish a new package version of gtabview.

A nice feature will be to add all your work to tabview (curses) because displaying
throught SSH content of a table can be very useful.

@s-celles
Copy link
Author

When you write:

$ gtabview file://filename

Is filename a relative path to filename or an absolute ?

if that's a relative path to filename, how an absolute path should be given ?

@wavexx
Copy link
Member

wavexx commented Jul 31, 2015

On 31/07/15 16:23, scls19fr wrote:

When you write:

|$ gtabview file://filename |

Is filename a relative path to filename or an absolute ?

if that's a relative path to filename, how an absolute path should be
given ?

It's relative. The path starts after ://.

If you want absolute, simply use file:///

@s-celles
Copy link
Author

ok thanks

@wavexx
Copy link
Member

wavexx commented Jul 31, 2015

Closing

@wavexx wavexx closed this as completed Jul 31, 2015
@wavexx
Copy link
Member

wavexx commented Jul 31, 2015

On 31/07/15 16:16, scls19fr wrote:

Maybe that's time to publish a new package version of gtabview.

Done right now.

I wanted to complete at least some docstrings in view() itself this time.

@s-celles
Copy link
Author

Thanks. This is a really great feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants