TST: Create compressed salary testing data #14587

dhimmel · 2016-11-04T15:43:04Z

Create compressed versions of the salary dataset for testing #14576.

Rename salary.table.csv to salaries.tsv because the dataset is tab rather than comma delimited. Remove the word table because it's implied by the extension. Rename salary.table.gz to salaries.tsv.gz, since compressed files should append to not strip the original extension.

Created new files by running the following commands:

cd pandas/io/tests/parser/data
bzip2 --keep salaries.tsv
xz --keep salaries.tsv
zip salaries.tsv.zip salaries.tsv

There will be CI testing failures since the modified files are not yet in master or on amazon S3. @jorisvandenbossche how do we deal with this?

Create compressed versions of the salary dataset for testing pandas-dev#14576. Rename `salary.table.csv` to `salaries.tsv` because the dataset is tab rather than comma delimited. Remove the word table because it's implied by the extension. Rename `salary.table.gz` to `salaries.tsv.gz`, since compressed files should append to not strip the original extension. Created new files by running the following commands: ```sh cd pandas/io/tests/parser/data bzip2 --keep salaries.tsv xz --keep salaries.tsv zip salaries.tsv.zip salaries.tsv ```

jreback · 2016-11-04T15:46:38Z

don't use tsv for extension, use csv

dhimmel · 2016-11-04T15:54:24Z

don't use tsv for extension, use csv

Why is that? The table isn't comma separated. This leads to issues like GitHub failing to render the table:

jreback · 2016-11-04T16:08:10Z

because you are changing something that is standard

if you wish to do that make a new PR but to be honest it's pretty non standard

dhimmel · 2016-11-04T17:30:12Z

because you are changing something that is standard

@jreback hmm. Presently, the CSV standard is being violated as RFC 4180 specifies:

Within the header and each record, there may be one or more fields, separated by commas.

However, salary.table.csv does meet the TSV standard:

Fields in a record are separated from each other by a tab character.

So I'm happy to save this for another pull request, but I thought it would be most efficient to address it here... especially since:

we're already standardizing the file names in other ways -- like removing .table and adding the extension to proceed .gz.
changing the name of testing files creates temporary testing failures... so might as well do two birds with one stone?

Just let me know what's the best way forward.

dhimmel · 2016-11-08T17:23:45Z

Tagging @jreback @sinhrks @jorisvandenbossche -- let me know what to do next here, so I can continue with #14576.

jorisvandenbossche · 2016-11-08T20:30:26Z

@dhimmel

let's keep the files as .csv for now. Although those files meet the tsv standard, this standard is certainly not followed in all fields, and also in the pandas codebase we use the .csv file extension for all text files at the moment, no matter the delimiter that is used.
standardizing the file names in the other ways is certainly fine! (the compression suffixes and leaving out the 'table')

For the CI failures, those are indeed not avoidable. But since it doesn't change code, only the file names, it's OK if you can confirm those specific tests pass locally.

See pandas-dev#14587 (comment)

codecov-io · 2016-11-09T15:42:06Z

Current coverage is 85.27% (diff: 100%)

Merging #14587 into master will decrease coverage by <.01%

@@             master     #14587   diff @@
==========================================
  Files           140        140          
  Lines         50693      50693          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43230      43229     -1   
- Misses         7463       7464     +1   
  Partials          0          0

Powered by Codecov. Last update 7a2bcb6...759d48d

dhimmel · 2016-11-09T15:49:24Z

let's keep the files as .csv for now.

Done in 24341b5.

For the CI failures, those are indeed not avoidable. But since it doesn't change code, only the file names, it's OK if you can confirm those specific tests pass locally.

@jorisvandenbossche, here're the errors I get when I run nosetests pandas locally in the development environment (full log):

ERROR: test_url_gz (pandas.io.tests.parser.test_network.TestUrlGz)
ERROR: test_url_gz_infer (pandas.io.tests.parser.test_network.TestUrlGz)
ERROR: test_url (pandas.io.tests.parser.test_parsers.TestCParserHighMemory)
ERROR: test_url (pandas.io.tests.parser.test_parsers.TestCParserLowMemory)
ERROR: test_url (pandas.io.tests.parser.test_parsers.TestPythonParser)

It appears to me that all errors are from remote URLs not resolving... but how do I "confirm these specific tests pass locally". Should I locally change the URLs?

jorisvandenbossche · 2016-11-09T15:59:44Z

It appears to me that all errors are from remote URLs not resolving... but how do I "confirm these specific tests pass locally". Should I locally change the URLs?

Ah, yes, of course you have locally the same problem .. No problem, it's ok like this.

jorisvandenbossche · 2016-11-11T08:53:58Z

@dhimmel Thanks!

jorisvandenbossche · 2016-11-11T14:00:28Z

I made a small correction: ed21736

dhimmel · 2016-11-11T15:07:16Z

I made a small correction: ed21736

Yeah I should have more explicit that https://s3.amazonaws.com/pandas-test/salary.table.gz would have to be renamed on S3.

So in the longer term, here's what I'm thinking. First, is the S3 test actually valuable in addition to the GitHub URL test?

If yes, then we should add the four compressed variants of salaries.csv to S3 using the new filename.
If no, then we should only test the GitHub URLs, thereby removing the S3 (non-automated) dependency. As of 272669b, Refactor compression code to expand URL support #14576 removes the S3 test.

So if the S3 tests are valuable, then I'd suggest uploading salaries.csv* to S3 and I'll add those tests to #14576 accordingly.

jorisvandenbossche · 2016-11-17T23:57:36Z

@dhimmel that sounds right (the S3 here is just an url (not an S3 test), so github url is just as fine)

dhimmel · 2016-12-04T00:04:51Z

@dhimmel that sounds right (the S3 here is just an url (not an S3 test), so github url is just as fine)

@jorisvandenbossche, based on how the reading from URL works, we actually will need to also test reading compressed files from S3 URLs. See bff0f3e

So would it be possible to upload salaries.csv, salaries.csv.bz2, salaries.csv.gz, salaries.csv.xz, and salaries.csv.zip to the S3?

jorisvandenbossche · 2016-12-04T18:10:19Z

@jreback Do you know who has access to the amazon data?

@dhimmel Are you that it is not just an url in this case? (the link you gave does not lead to visible code ("We went looking everywhere, but couldn’t find those commits"), probably because you had to force push after a rebase or so)

dhimmel · 2016-12-04T18:23:29Z

probably because you had to force push after a rebase or so

Yep I rebased. See c042391. The point is that a different implementation is used to read URLs that are S3.

jorisvandenbossche · 2016-12-04T18:32:17Z

Yes, but I think this url is not seen as a 'S3 url', but just as a plain url (if you enter this url in a browser, you just get to download the file). A S3 url would more look like "s3://pandas-test/somefile.csv". What you link to is in the s3.py file, not in the main parsing file.

jorisvandenbossche · 2016-12-04T18:33:13Z

For example:

In [14]: pd.io.common._is_s3_url('https://s3.amazonaws.com/pandas-test/salary.table.gz')
Out[14]: False

In [15]: pd.io.common._is_s3_url('s3://pandas-test/salary.table.gz')
Out[15]: True

(cherry picked from commit 85a6464)

(cherry picked from commit ed21736)

Version 0.19.2 * tag 'v0.19.2': (78 commits) RLS: v0.19.2 DOC: update release notes for 0.19.2 TST: skip gbq upload test as flakey DOC: clean-up v0.19.2 whatsnew DOC: update Pandas Cheat Sheet (GH13202) DOC: Pandas Cheat Sheet TST: matplotlib 2.0 fix in log limits for barplot (GH14808) (pandas-dev#14957) flake8 fix import Remove test - from 0.20.0 PR slipped in PERF: fix getitem unique_check / initialization issue cache and remove boxing (pandas-dev#14931) CLN: Resubmit of GH14700. Fixes GH14554. Errors other than Indexing… Clean up construction of Series with dictionary and datetime index BUG: .fillna() for datetime64 with tz is passing thru floats BUG: Patch read_csv NA values behaviour ENH: merge_asof() has type specializations and can take multiple 'by' parameters (pandas-dev#13936) [Backport pandas-dev#14886] BUG: regression in DataFrame.combine_first with integer columns (GH14687) (pandas-dev#14886) Fixed KDE Plot to drop the missing values (pandas-dev#14820) ENH: merge_asof() has left_index/right_index and left_by/right_by (pandas-dev#14253) (pandas-dev#14531) TST: correct url for test file on s3 (xref pandas-dev#14587) ...

* releases: (78 commits) RLS: v0.19.2 DOC: update release notes for 0.19.2 TST: skip gbq upload test as flakey DOC: clean-up v0.19.2 whatsnew DOC: update Pandas Cheat Sheet (GH13202) DOC: Pandas Cheat Sheet TST: matplotlib 2.0 fix in log limits for barplot (GH14808) (pandas-dev#14957) flake8 fix import Remove test - from 0.20.0 PR slipped in PERF: fix getitem unique_check / initialization issue cache and remove boxing (pandas-dev#14931) CLN: Resubmit of GH14700. Fixes GH14554. Errors other than Indexing… Clean up construction of Series with dictionary and datetime index BUG: .fillna() for datetime64 with tz is passing thru floats BUG: Patch read_csv NA values behaviour ENH: merge_asof() has type specializations and can take multiple 'by' parameters (pandas-dev#13936) [Backport pandas-dev#14886] BUG: regression in DataFrame.combine_first with integer columns (GH14687) (pandas-dev#14886) Fixed KDE Plot to drop the missing values (pandas-dev#14820) ENH: merge_asof() has left_index/right_index and left_by/right_by (pandas-dev#14253) (pandas-dev#14531) TST: correct url for test file on s3 (xref pandas-dev#14587) ...

sinhrks added Testing pandas testing functions or related to the test suite IO CSV read_csv, to_csv labels Nov 7, 2016

jorisvandenbossche added this to the 0.20.0 milestone Nov 8, 2016

Revert to CSV from TSV extension

24341b5

See pandas-dev#14587 (comment)

jorisvandenbossche changed the title ~~Create compressed salary testing data~~ TST: Create compressed salary testing data Nov 11, 2016

jorisvandenbossche merged commit 85a6464 into pandas-dev:master Nov 11, 2016

jorisvandenbossche added a commit that referenced this pull request Nov 11, 2016

TST: correct url for test file on s3 (xref #14587)

ed21736

dhimmel added a commit to dhimmel/pandas that referenced this pull request Nov 11, 2016

Update salaries.csv URL post pandas-dev#14587

d78cce1

dhimmel added a commit to dhimmel/pandas that referenced this pull request Dec 4, 2016

Update salaries.csv URL post pandas-dev#14587

cd012e5

dhimmel added a commit to dhimmel/pandas that referenced this pull request Dec 12, 2016

Update salaries.csv URL post pandas-dev#14587

6230dfb

dhimmel mentioned this pull request Dec 13, 2016

Create tests for reading compressed tables from S3 URLs #14875

Closed

dhimmel mentioned this pull request Dec 13, 2016

Refactor compression code to expand URL support #14576

Closed

jorisvandenbossche pushed a commit that referenced this pull request Dec 15, 2016

TST: Create compressed salary testing data (#14587)

42bc79c

(cherry picked from commit 85a6464)

jorisvandenbossche added a commit that referenced this pull request Dec 15, 2016

TST: correct url for test file on s3 (xref #14587)

f1d43a4

(cherry picked from commit ed21736)

dhimmel mentioned this pull request Jun 5, 2017

Request for DataFrame.to_tsv() for reading tab delimited text #10327

Closed

Uh oh!

TST: Create compressed salary testing data #14587

TST: Create compressed salary testing data #14587

Uh oh!

Conversation

dhimmel commented Nov 4, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jreback commented Nov 4, 2016

Uh oh!

dhimmel commented Nov 4, 2016

Uh oh!

jreback commented Nov 4, 2016

Uh oh!

dhimmel commented Nov 4, 2016

Uh oh!

dhimmel commented Nov 8, 2016

Uh oh!

jorisvandenbossche commented Nov 8, 2016

Uh oh!

codecov-io commented Nov 9, 2016

Current coverage is 85.27% (diff: 100%)

Uh oh!

dhimmel commented Nov 9, 2016

Uh oh!

jorisvandenbossche commented Nov 9, 2016

Uh oh!

jorisvandenbossche commented Nov 11, 2016

Uh oh!

jorisvandenbossche commented Nov 11, 2016

Uh oh!

dhimmel commented Nov 11, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jorisvandenbossche commented Nov 17, 2016

Uh oh!

dhimmel commented Dec 4, 2016

Uh oh!

jorisvandenbossche commented Dec 4, 2016

Uh oh!

dhimmel commented Dec 4, 2016

Uh oh!

jorisvandenbossche commented Dec 4, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jorisvandenbossche commented Dec 4, 2016

Uh oh!

Uh oh!

dhimmel commented Nov 4, 2016 •

edited

Loading

dhimmel commented Nov 11, 2016 •

edited

Loading

jorisvandenbossche commented Dec 4, 2016 •

edited

Loading