Blaze support #10
Comments
On 24/07/15 09:18, scls19fr wrote:
Blaze would indeed be nice to support. It would also be convenient to read xls files as well. |
Yes, or also very big CSV files, but maybe the way you are reading CSV is also efficient. |
On 24/07/15 09:47, scls19fr wrote:
Does blaze do anything in that regard? The notable exception is tabix, where an index needs to be constructed |
I think Blaze doesn't load the whole CSV into memory
|
I'm not sure Blaze supports Excel files. Pandas does (.xls and .xlsx), but the whole content needs to be in memory.
raises
Reading Excel files with Pandas can be slow
|
On 24/07/15 10:21, scls19fr wrote:
Yes, pandas uses xlrd, which doesn't do anything special. There is also pyExcelerator, but I never used it to see if it supports |
Anyway this could (should) be done inside Blaze not gtabview |
On 24/07/15 12:34, scls19fr wrote:
Well I added some simple support using "xlrd" now. I added a --sheet/-S flag to select the sheet. |
That's always a good thing but
raises:
|
On 24/07/15 14:07, scls19fr wrote:
Send me the file if you can. |
You might also fix
replace |
File(s) were created using:
|
On 24/07/15 14:10, scls19fr wrote:
Both should be fixed. |
|
On 24/07/15 14:20, scls19fr wrote:
Hopefully fixed. Thanks for sending me the file, at the moment the xls writer in pandas |
It works with
|
On 24/07/15 14:44, scls19fr wrote:
I cannot test that file yet, since I cannot download large stuff from here. How many rows/columns is that? Does it fit in a xls file? I could try to I doubt it's a problem in gtabview itself, xlrd itself seems to be Maybe openpyxl is faster in that regard. |
It's a 19.6 MB file with 23 columns and 208655 rows |
On 24/07/15 14:57, scls19fr wrote:
if you can squeeze that into <1mb, I can try to get it. |
Zipping reduces the size to 17.3 MB. openpyxl seems to support "big" files |
On 24/07/15 14:47, Yuri D'Elia wrote:
I just tried with the openpyxl "optimized" reader, and on a 1mb file pd.read_excel/gtabview seem to be identical in timing as well.. so I |
I'll need some help to get up-to-speed with blaze. |
Hello, sorry I wasn't here
can access the third column, but I can't say if that's efficient. Kind regards PS:
returns same
|
On 28/07/15 18:59, scls19fr wrote:
Do you also know if there's a way to know the number of the results in Even for a csv, dshape doesn't contain the row count. |
I don't know Blaze well, but with IPython
or
|
and
to know the number of columns |
Be cautious about this
So it may be better to cast to int
|
On 28/07/15 20:39, scls19fr wrote:
I've got some basics going, but it seems that blaze itself doesn't Querying list(data[col][0:1]) to get one element from a column, for Are you aware if there's some built-in caching mechanism in blaze, if it |
My idea is you might get a DataFrame from Blaze.
it should avoid several queries. |
On 28/07/15 21:17, scls19fr wrote:
Yes, however when scrolling is involved and a new row is in view, you Making the query itself in blaze, even for one row is quite expensive, |
Why not try first without this cache mechanism? If that's really too slow, then implement it. I always try to recall Donald Knuth's sentence: "premature optimization is the root of all evil". On my computer screen I can display fewer than 23 columns and 25 rows: 23x25=575, or, being pessimistic (very big screen), about 1000 elements.
Maybe we can accept waiting less than 0.2 s before fetching a new row (and also re-fetching rows we have already fetched) |
On 29/07/15 08:38, scls19fr wrote:
It's really too slow. Since I'm fetching each cell independently, the blaze overhead for a Also, this breaks completely column auto-sizing. Blaze really requires some batch manipulation, and it's definitely |
It's in.
I tested it against some large PostgreSQL tables and it also seems OK (although blaze reports a warning about non-deterministic slicing). |
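The batching idea discussed above (fetch a block of rows per query instead of one cell per query, and remember blocks already fetched) can be sketched roughly as follows. This is only an illustration, not the code that went into gtabview: `ChunkedRowCache`, `fake_backend` and the chunk size of 1000 are hypothetical names and values, and the backend is a plain Python function standing in for an expensive Blaze or database slice.

```python
from typing import Callable, Dict, List, Sequence


class ChunkedRowCache:
    """Fetch rows in fixed-size batches and keep the batches already fetched."""

    def __init__(self, fetch_rows: Callable[[int, int], List[Sequence]],
                 chunk_size: int = 1000):
        # fetch_rows(start, stop) is a hypothetical backend call returning rows [start, stop)
        self._fetch_rows = fetch_rows
        self._chunk_size = chunk_size
        self._chunks: Dict[int, List[Sequence]] = {}
        self.queries = 0  # number of backend queries actually issued

    def cell(self, row: int, col: int):
        chunk_index, offset = divmod(row, self._chunk_size)
        if chunk_index not in self._chunks:
            start = chunk_index * self._chunk_size
            self._chunks[chunk_index] = self._fetch_rows(start, start + self._chunk_size)
            self.queries += 1
        return self._chunks[chunk_index][offset][col]


def fake_backend(start, stop):
    # Stand-in for one slice query against a remote table.
    return [(i, i * 2, "row%d" % i) for i in range(start, stop)]


cache = ChunkedRowCache(fake_backend, chunk_size=1000)
# Render a 25-row by 3-column "screen": 75 cell lookups, one backend query.
visible = [[cache.cell(r, c) for c in range(3)] for r in range(25)]
```

With this shape, scrolling within a block costs no extra queries, and column auto-sizing can read many cells of a column without paying the per-query overhead each time. An unbounded dictionary is used here for brevity; a real viewer would probably evict old blocks.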
Nice! I've just tested it with a MySQL table with 200 000 rows and it's very convenient. Thanks a lot. I think the next feature should be to detect that the view parameter is a table URI, so that it will be possible to do:
and with console
CAUTION: I think this last command could lead to a security problem because of the Bash history
which opens a window to paste a URI (with password) or anything that view can accept, so no one could see the password in the history. |
I think you will have to detect that the parameter is a table URI using a regex. regex101 is great for this: https://regex101.com/ Another approach could be to split the filename and extension
and if |
What about a blaze+[blaze uri] ?
As a convenience, if the argument looks like a URL, we could also try blaze:
Loading an hdf file directly wouldn't work out of the box (you'd still need blaze+file.hdf), but using blaze by default sounds like a bad idea. |
I really want to be able to directly open a table URI without specifying You are right, you could load CSV using the "classic" method or using blaze. For this I think another parameter should be given
Load csv using Blaze
Load csv using classic method
|
Some database URI examples http://docs.sqlalchemy.org/en/rel_1_0/core/engines.html
Table URIs are database URIs with |
I'd like to keep gtabview.view() (the function) and gtabview (the However, this would imply an extra parameter to view, which I would like In gtabview.view() there's also less need to use blaze under the hood. Somehow, blaze refuses to load file://[path], which would have made the I tried with a couple of other syntaxes, but no success. |
Personally I would prefer |
I committed some changes which should be reasonable enough: If the path looks like a URI, use Blaze; otherwise read/handle it normally. So |
Yes, that's a reasonable approach to the problem. Maybe it's time to publish a new package version of gtabview. A nice feature would be to add all your work to |
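A minimal sketch of the "looks like a URI" rule, for illustration only. It assumes that "looks like a URI" means a scheme followed by `://`; the actual check committed to gtabview may differ. `looks_like_uri` and `choose_loader` are hypothetical names, and the example URIs are made up.

```python
import re

# Assumption: a URI is recognised by "<scheme>://" at the start of the string.
# The scheme may contain "+" so that forms such as "mysql+pymysql://" match.
_URI_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def looks_like_uri(path: str) -> bool:
    """Return True if the argument starts with something like 'scheme://'."""
    return bool(_URI_RE.match(path))


def choose_loader(path: str) -> str:
    """Pick 'blaze' for URI-like arguments and 'classic' for plain paths."""
    return "blaze" if looks_like_uri(path) else "classic"
```

Requiring `://` rather than a bare `scheme:` keeps Windows paths such as `C:\data\file.csv` on the classic loader.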
When you write:
Is filename a relative path or an absolute one? If it's a relative path, how should an absolute path be given? |
On 31/07/15 16:23, scls19fr wrote:
It's relative. The path starts after ://. If you want absolute, simply use file:/// |
ok thanks |
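The relative/absolute rule from the reply above ("the path starts after ://") can be illustrated with a small sketch. `path_after_scheme` is a hypothetical helper, not a gtabview or Blaze function, and the two file paths are invented examples.

```python
import posixpath


def path_after_scheme(uri: str) -> str:
    """Return everything after '://', which is treated as the path."""
    scheme, sep, rest = uri.partition("://")
    if not sep:
        raise ValueError("not a URI: %r" % uri)
    return rest


# Two slashes: the path is relative to the current directory.
relative = path_after_scheme("file://data/table.csv")
# Three slashes: the third slash is the start of an absolute path.
absolute = path_after_scheme("file:///tmp/table.csv")
```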
Closing |
On 31/07/15 16:16, scls19fr wrote:
Done right now. I wanted to complete at least some docstrings in view() itself this time. |
Thanks. This is a really great feature. |
Hello,
Blaze http://blaze.pydata.org/ is very efficient when you want to connect to a database and
want to retrieve data from a very long table.
It would be nice if `gtabview` (and maybe `tabview` also) could display a `blaze.interactive.InteractiveSymbol` (named `dat` here, or sometimes `ds`).
http://blaze.pydata.org/en/latest/quickstart.html
http://blaze.pydata.org/en/latest/rosetta-pandas.html
With Blaze, you can connect to a database table using
it's much more efficient than
which will retrieve the whole table into memory.
Passing a table_uri to gtabview will display a part of table content (without retrieving the whole table into memory)
Blaze comes with a very convenient tool named `odo` http://odo.readthedocs.org/
DataFrame(s) can be constructed in chunks using `odo`: with `odo(..., chunks(pd.DataFrame))` you only have one chunk in memory at a time.
I can provide you with a fairly big MySQL table of Poitiers weather conditions (from 2011-03-07 to 2015-06-02, every 10 minutes, more than 200'000 rows) to try, if you don't have a long enough table.
Kind regards
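The "one chunk in memory at a time" idea mentioned in the issue text can be sketched without Blaze or odo, using only the standard library. This is not how `odo(..., chunks(pd.DataFrame))` is implemented; it only shows the same access pattern with `sqlite3` and `fetchmany`. The `weather` table, its columns and the chunk size are made up for the example.

```python
import sqlite3


def iter_chunks(cursor, chunk_size):
    """Yield lists of at most chunk_size rows until the cursor is exhausted."""
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows


# Build a small in-memory table standing in for a long database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (ts INTEGER, temperature REAL)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [(i, 10.0 + i % 5) for i in range(2500)],
)

cursor = conn.execute("SELECT ts, temperature FROM weather ORDER BY ts")
# Only one chunk is held at a time; here we just record each chunk's size.
chunk_sizes = [len(chunk) for chunk in iter_chunks(cursor, 1000)]
conn.close()
```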