
Fix a bug in the passing around of the csv delimiter #83

Merged
merged 2 commits on Oct 7, 2016

Conversation

ojarjur
Contributor

@ojarjur ojarjur commented Oct 7, 2016

Apparently, the Pandas DataFrame type is special in that it only
supports the keyword argument 'sep' for specifying the delimiter used in
a CSV file, and not the keyword argument 'delimiter'. (Note that the
builtin 'csv' package only supports the keyword argument 'delimiter' and
not 'sep', so we cannot normalize on a single keyword).
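
A minimal sketch of the mismatch (illustrative only; the DataFrame and file name here are placeholders, not code from this PR):

import csv
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# DataFrame.to_csv accepts 'sep' for the delimiter, not 'delimiter':
df.to_csv('out.csv', sep=',', header=False, index=False)

# The builtin csv module accepts 'delimiter', not 'sep':
with open('out.csv') as f:
    rows = list(csv.reader(f, delimiter=','))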

Furthermore, somewhere deep in the DataFrame code it checks if the type
of that argument is 'str', which will succeed in Python 3, but not in
Python 2, as we are using Python 3-compatible strings which have a type
of 'unicode', and were overriding the 'str' method with a Python
3-compatible one which returns values with type 'newstr'.

All of this means that in order to pass a single character argument to
the method, we had to distinguish between all of the calls where a
Python 3 compatible str was expected with this one call that requires an
argument with type 'str'. That distinction is being made by changing
the 'from builtins import str' to 'from builtins import str as newstr'.
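
A minimal sketch of that aliasing (assuming the 'future' package, which provides the 'builtins' module on Python 2; the DataFrame and file are placeholders):

# Alias the Python 3-compatible string type so the builtin 'str' is
# not shadowed; on Python 2, values of this type are 'newstr', which
# pandas' internal type check rejects.
from builtins import str as newstr

import pandas as pd

delimiter = newstr(',')  # Python 3-compatible string for general use
df = pd.DataFrame({'a': [1], 'b': [2]})
with open('out.csv', 'w') as f:
    # Cast back to a native 'str' for the one call that requires it:
    df.to_csv(f, header=False, index=False, sep=str(delimiter))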

@@ -176,4 +176,4 @@ def sample_to(self, count, skip_header_rows, strategy, target):
         datalab.utils.gcs_copy_file(f.name, target)
       else:
         with open(target, 'w') as f:
-          df.to_csv(f, header=False, index=False, delimiter=self._delimiter)
+          df.to_csv(f, header=False, index=False, sep=str(','))
Contributor

Should this be sep=self._delimiter?

Contributor Author

Yes, it should. I changed things up when I was debugging why simply casting to str wasn't working (because an import was overriding str), and I forgot to change it back.

Thanks for catching that.
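
For reference, the corrected call presumably combines the two fixes described above (illustrative, not the exact commit):

df.to_csv(f, header=False, index=False, sep=str(self._delimiter))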

Contributor

@qimingj qimingj left a comment

Thanks for fixing this!

@wafisher

wafisher commented Oct 10, 2016

I'm running into this issue now. Do you mind describing (preferably in a more public forum like the README or the official Datalab docs) how changes like this make their way into the Docker container?

And assuming I don't want to wait for another Datalab version release, what's the best way to get the code from master into my locally-run container? The README says to run the install-* scripts, but the Wiki says to run python setup.py install. Also, is there a particular version of Python required to run these?

@yebrahim
Contributor

Take a look at the Dockerfile for Datalab's base image here: the pydatalab repo is cloned every time the datalab image is built (unless you point it to a local copy; see below), which is how these changes get in. I agree this needs to be cleared up in the documentation, perhaps in the install script.

If what you need is to get your own copy of this repo (pydatalab) into your locally-built datalab image, then it's easiest to do it while building the datalab image. To do this, pass the path to your local pydatalab to the build script at containers/datalab/build.sh, for example:
./build.sh ../../pydatalab/
(note the trailing slash)

The install instructions on this repo are meant for building Jupyter nbextensions, which can then be used from any Jupyter instance.

@wafisher
Copy link

wafisher commented Oct 10, 2016

Thank you. What version of Python should I be running Datalab and this extension with, or does it not matter?

Also, what I'm going to try to do is install this locally into site-packages and load it in the kernel. Is that a bad idea, or do you recommend rebuilding the entire image in that case?

@parthea
Copy link
Contributor

parthea commented Oct 10, 2016

Also, what I'm going to try to do is install this locally into site-packages and load it in the kernel. Is that a bad idea?

I often use this approach because I find it faster than rebuilding the image, although it may not be officially supported. Keep in mind that this approach may not always work (things may break due to version conflicts). I use Python 2.7.

Here are the steps:

  • Clone pydatalab using git clone https://github.com/googledatalab/pydatalab.git
  • cd into the folder and run either ./install-no-virtualenv.sh or ./install-virtualenv.sh, depending on your environment. Make sure the install completes successfully without errors.
  • Replace the copy of pydatalab in the container with your local copy by mounting it in when you run the container:
    docker run -it -p "127.0.0.1:8081:8080" -v "${HOME}:/content" -e "PROJECT_ID=<your-project>" -v "<path-to-pydatalab-clone>/datalab/:/usr/local/lib/python2.7/dist-packages/datalab/" gcr.io/cloud-datalab/datalab:local

I hope this helps.
