Fix a bug in the passing around of the csv delimiter #83
Conversation
Apparently, the Pandas DataFrame type is special in that it only supports the keyword argument 'sep' for specifying the delimiter used in a CSV file, and not the keyword argument 'delimiter'. (Note that the builtin 'csv' package only supports the keyword argument 'delimiter' and not 'sep', so we cannot normalize on a single keyword.)

Furthermore, somewhere deep in the DataFrame code it checks whether the type of that argument is 'str', which will succeed in Python 3 but not in Python 2, because we are using Python 3-compatible strings, which have the type 'unicode', and were overriding the builtin 'str' with a Python 3-compatible one that returns values of type 'newstr'.

All of this means that in order to pass a single-character argument to the method, we had to distinguish all of the calls where a Python 3-compatible str was expected from this one call that requires an argument of type 'str'. That distinction is made by changing 'from builtins import str' to 'from builtins import str as newstr'.
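The keyword mismatch described above can be seen directly: `DataFrame.to_csv` spells the keyword `sep`, while the builtin `csv` module spells it `delimiter`. A minimal demonstration:

```python
import csv
import io

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# pandas DataFrame.to_csv takes 'sep' for the delimiter;
# passing 'delimiter' instead is rejected as an unexpected keyword.
buf = io.StringIO()
df.to_csv(buf, sep='|', header=False, index=False)
print(buf.getvalue())  # 1|3 then 2|4, pipe-separated

# The builtin csv module takes 'delimiter' and has no 'sep' keyword,
# so no single keyword works for both APIs.
buf2 = io.StringIO()
writer = csv.writer(buf2, delimiter='|')
writer.writerows(df.values.tolist())
print(buf2.getvalue())
```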
@@ -176,4 +176,4 @@ def sample_to(self, count, skip_header_rows, strategy, target):
       datalab.utils.gcs_copy_file(f.name, target)
     else:
       with open(target, 'w') as f:
-        df.to_csv(f, header=False, index=False, delimiter=self._delimiter)
+        df.to_csv(f, header=False, index=False, sep=str(','))
Should this be sep=self._delimiter?
Yes, it should. I changed things up when I was debugging why simply casting to str wasn't working (because an import was overriding str), and I forgot to change it back. Thanks for catching that.
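Putting the two fixes together, the corrected line would use `sep` with the instance's delimiter cast to the native str. A runnable sketch (the `Sampler` class and its constructor are stand-ins invented here; only the `to_csv` call reflects the diff):

```python
import io

import pandas as pd


class Sampler:
    """Hypothetical stand-in for the class containing sample_to above."""

    def __init__(self, delimiter=','):
        self._delimiter = delimiter

    def sample_to(self, df, target):
        # Use 'sep' (not 'delimiter'), and cast to the native str in case
        # self._delimiter is a future/newstr value on Python 2.
        df.to_csv(target, header=False, index=False, sep=str(self._delimiter))


df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
buf = io.StringIO()
Sampler(';').sample_to(df, buf)
print(buf.getvalue())  # 1;3 then 2;4, semicolon-separated
```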
Thanks for fixing this!
I'm running into this issue now. Do you mind describing (preferably in a more public forum like the README or the official Datalab docs) how changes like this make their way into the Docker container? And assuming I don't want to wait for another Datalab version release, what's the best way to get the code from
Take a look at the Dockerfile for datalab's base image here. The pydatalab repo is cloned (unless you point to a local copy of it; read below) every time the datalab image is built, which is how these changes get there. I agree this needs to be cleared up in the documentation, perhaps in the install script.

If what you need is to get your own copy of this repo (pydatalab) into your locally-built datalab image, then it's easier to do it while building the datalab image. To do this you can pass the path to your local pydatalab to the build script at containers/datalab/build.sh, for example:

The install instructions on this repo are meant for building Jupyter nbextensions, which can then be used from any Jupyter instance.
Thank you. What version of Python should I be running Datalab and this extension with, or does it not matter? Also, what I'm going to try to do is install this locally into site-packages and load it in the kernel; is that a bad idea? Or do you recommend rebuilding the entire image in that case?
I often use this approach because I find it faster than rebuilding the image, although it may not be officially supported. Keep in mind that it may not always work (things may break due to version conflicts). I use Python 2.7. Here are the steps:
I hope this helps.