TST: Create compressed salary testing data #14587
Create compressed versions of the salary dataset for testing pandas-dev#14576.

Rename `salary.table.csv` to `salaries.tsv` because the dataset is tab rather than comma delimited. Remove the word "table" because it's implied by the extension. Rename `salary.table.gz` to `salaries.tsv.gz`, since compressed files should append to, not strip, the original extension.

Created new files by running the following commands:

```sh
cd pandas/io/tests/parser/data
bzip2 --keep salaries.tsv
xz --keep salaries.tsv
zip salaries.tsv.zip salaries.tsv
```
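For readers without the command-line tools, the same set of files can be produced with Python's standard library. This is only a rough sketch, not how the PR's files were made: `SAMPLE` is made-up stand-in data, and `write_variants` and `read_text` are hypothetical helper names. It also illustrates the naming argument above: because each compressed file appends its extension to `salaries.tsv`, the final suffix alone is enough to choose a decompressor.

```python
import bz2
import gzip
import lzma
import tempfile
import zipfile
from pathlib import Path

# Made-up stand-in for the real dataset: a tiny tab-delimited table.
SAMPLE = "name\tsalary\nalice\t13876\nbob\t11608\n"


def write_variants(source):
    """Write .gz, .bz2, .xz and .zip copies next to `source`.

    Each target name appends the compression extension to the full
    original name (salaries.tsv -> salaries.tsv.gz), keeping `source`.
    """
    data = source.read_bytes()
    targets = []
    for ext, opener in ((".gz", gzip.open), (".bz2", bz2.open), (".xz", lzma.open)):
        target = source.with_name(source.name + ext)
        with opener(target, "wb") as handle:
            handle.write(data)
        targets.append(target)
    zip_target = source.with_name(source.name + ".zip")
    with zipfile.ZipFile(zip_target, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(source, arcname=source.name)
    targets.append(zip_target)
    return targets


def read_text(path):
    """Return the decompressed text, choosing a reader from the final suffix."""
    openers = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}
    if path.suffix in openers:
        with openers[path.suffix](path, "rt", encoding="utf-8") as handle:
            return handle.read()
    if path.suffix == ".zip":
        with zipfile.ZipFile(path) as archive:
            (member,) = archive.namelist()  # expect exactly one file inside
            return archive.read(member).decode("utf-8")
    return path.read_text(encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "salaries.tsv"
        source.write_bytes(SAMPLE.encode("utf-8"))
        for target in write_variants(source):
            # Every variant should decompress back to the original text.
            print(target.name, read_text(target) == SAMPLE)
```

Stripping the compression suffix from any of these names gives back `salaries.tsv`, so the original format stays recoverable from the file name.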
Don't use tsv for the extension; use csv.
Why is that? The table isn't comma separated. This leads to issues like GitHub failing to render the table:
Because you are changing something that is standard. If you wish to do that, make a new PR, but to be honest it's pretty non-standard.
@jreback hmm. Presently, the CSV standard is being violated as RFC 4180 specifies:
However,
So I'm happy to save this for another pull request, but I thought it would be most efficient to address it here... especially since:
Just let me know what's the best way forward.
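The practical side of the delimiter argument can be shown with a small sketch using Python's `csv` module. The rows in `SAMPLE` are invented for illustration and `parse` is a hypothetical helper; nothing here comes from the pandas test suite.

```python
import csv
import io

# Invented tab-delimited rows, standing in for the salary table.
SAMPLE = "name\tsalary\nalice\t13876\nbob\t11608\n"


def parse(text, delimiter):
    """Parse `text` into a list of rows using the given field delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Treated as comma separated, every line collapses into a single field.
as_csv = parse(SAMPLE, ",")
# Treated as tab separated, the columns come apart as intended.
as_tsv = parse(SAMPLE, "\t")

print(as_csv[0])
print(as_tsv[0])
```

A tool that picks its parser from the `.csv` extension and assumes commas would see one column per row here, which is the kind of failure mentioned above for GitHub's table rendering.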
Tagging @jreback @sinhrks @jorisvandenbossche -- let me know what to do next here, so I can continue with #14576.
For the CI failures, those are indeed not avoidable. But since it doesn't change code, only the file names, it's OK if you can confirm those specific tests pass locally.
Current coverage is 85.27% (diff: 100%)
@@ master #14587 diff @@
==========================================
Files 140 140
Lines 50693 50693
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
- Hits 43230 43229 -1
- Misses 7463 7464 +1
Partials 0 0
Done in 24341b5.
@jorisvandenbossche, here're the errors I get when I run
It appears to me that all errors are from remote URLs not resolving, but how do I "confirm these specific tests pass locally"? Should I locally change the URLs?
Ah, yes, of course you have the same problem locally. No problem, it's OK like this.
@dhimmel Thanks!
I made a small correction: ed21736
Yeah, I should have been more explicit that https://s3.amazonaws.com/pandas-test/salary.table.gz would have to be renamed on S3. So in the longer term, here's what I'm thinking. First, is the S3 test actually valuable in addition to the GitHub URL test?
So if the S3 tests are valuable, then I'd suggest uploading
@dhimmel that sounds right (the S3 here is just a URL, not an S3 test, so a GitHub URL is just as fine).
@jorisvandenbossche, based on how the reading from URL works, we actually will need to also test reading compressed files from S3 URLs. See bff0f3e. So would it be possible to upload
Yep I rebased. See c042391. The point is that a different implementation is used to read URLs that are S3. |
Yes, but I think this URL is not seen as an 'S3 URL', but just as a plain URL (if you enter this URL in a browser, you just get to download the file). An S3 URL would look more like
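To make the distinction concrete, here is a minimal sketch of dispatching on the URL scheme. It is not pandas's actual code. It assumes that "S3 URL" means a URL whose scheme is `s3`, which is an assumption about what the comment above intends. The `classify` helper and the `example-bucket` name are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: an "S3 URL" is one whose scheme is literally "s3".
S3_SCHEMES = {"s3"}
PLAIN_URL_SCHEMES = {"http", "https", "ftp"}


def classify(location):
    """Label a location as 's3', 'plain-url', or 'local-path' by its scheme."""
    scheme = urlparse(location).scheme
    if scheme in S3_SCHEMES:
        return "s3"
    if scheme in PLAIN_URL_SCHEMES:
        return "plain-url"
    return "local-path"


# The HTTPS address of a public S3 object is still just an HTTPS URL.
print(classify("https://s3.amazonaws.com/pandas-test/salary.table.gz"))
# Hypothetical bucket name, used only to show the other branch.
print(classify("s3://example-bucket/salaries.tsv.gz"))
print(classify("pandas/io/tests/parser/data/salaries.tsv"))
```

Under this assumption, a test that fetches the `https://s3.amazonaws.com/...` address exercises the generic URL reader rather than any S3-specific path, which is the point made in the comment above.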
For example:
(cherry picked from commit 85a6464)
(cherry picked from commit ed21736)
There will be CI testing failures since the modified files are not yet in master or on Amazon S3. @jorisvandenbossche, how do we deal with this?