Request for DataFrame.to_tsv() for reading tab delimited text #10327
Comments
In addition to being just |
The pandas API is already cluttered with an excess of rarely used convenience methods. I really don't think adding another one is a good idea. |
I agree with @shoyer here. All functionality is there to do this within I am closing this (we have too many open issues ..), but discussion can certainly continue if needed. |
+1. As practitioner, I would highly appreciate |
IMO convenience is worthy justification (people tend to write many text files, so However, my main motivation is a disdain for the CSV format. It pains me to see people still using CSV over TSV. Obviously excel/database support has a role to play. But a project like pandas should strive to make the best practices the easiest to implement. |
though this not a major issue for me currently, |
While I can understand the sentiment put forward by @shoyer, I agree with @dhimmel. It is my experience that TSV is much more of a standard format for data analysis than CSV. There are many use cases where the TSV format is a requirement, whereas I am not familiar with any for the CSV format (there are a couple of examples of common usages here). TSV also has an advantage in that the raw text is easily readable, and it avoids the issues with quoting mentioned by @dhimmel. |
I am only slightly opposed to adding |
CSV and TSV are both well supported and widely used in data science. CSV is more of a legacy format, thus many backwards-focused projects default to CSV. However, I think forwards-focused projects should default to TSV, as it's better for data science. Since there is no default |
Please excuse my ignorance on the matter but apart from being easier to read as a human, if and only if the column headers have roughly the same characters as their corresponding data which is not always the case, what advantages does TSV provide over CSV? Honestly curious if there is a performance difference between the two, I use TSV right now but honestly only because the data files I am working with came in that format so I left them in the same format. |
@Starkiller4011 tabs are a more natural separator for columnar data. They require less quoting, since values rarely contain tabs but often contain commas.
I'd expect the performance difference is trivial. However, like most things in data science, the real type of performance that matters is programmer efficiency. And I think TSVs are nicer to work with than CSVs. |
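To illustrate the quoting point above, here is a minimal sketch using Python's standard csv module with minimal quoting, which is also the documented default for pandas' to_csv(). The helper name render and the sample rows are made up for illustration.

```python
import csv
import io


def render(rows, delimiter):
    """Serialize rows with the given delimiter, quoting only when required."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter=delimiter,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample data: one value contains a comma but no tab.
rows = [["name", "city"], ["Ada Lovelace", "London, UK"]]

csv_text = render(rows, ",")   # the comma-containing value must be quoted
tsv_text = render(rows, "\t")  # no value contains a tab, so nothing is quoted

print(csv_text)
print(tsv_text)
```

With a comma delimiter the value "London, UK" has to be wrapped in quotes, while the tab-delimited output needs no quoting at all.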
Not everyone agrees that tab separation is superior to csv -- I don't, for example. As Python programmers, we know that whitespace isn't always preserved across different operations, like copying and pasting. Those of us who answer a lot of questions on SO, for example, regularly have to use If we want to add a |
I think it's worth taking a step back and recognizing that a function like |
For the record, I think both CSV and TSV are acceptable and good formats. They should both be supported. @dsm054 brings up some compelling advantages to non-whitespace delimiters. A bigger issue in my opinion is using the |
Just starting to use pandas dataframes coming from R + tidyverse/readr and first thing I was negatively impressed by is the lack of consistent read/write methods like: read_csv()/write_csv(): comma separated (CSV) files In 20 years doing data science in genomics I never encountered a csv file, most data exists in tsv (or white-space delimited) format. Having to specify sep and quoting argument using df.to_csv() to write a tsv (or white-space delimited) file is inconvenient to say the least. Having df.read_tsv() df.to_tsv() for tab-delimited files and df.read_table() df.to_table() for white-space delimited files would be very helpful for people coming to pandas from R. |
As of pandas 0.24,
On the plus side, removing |
I also strongly agree with @marcora's comment. In genomics, since most data exist in TSV format, there is an inconvenience of having to use a And here is a tip for those of you who can't wait for pandas to support that feature, you can use the following method if needed. Alternatively, you can simply make your own methods by doing this (monkey patching):
from functools import partial, partialmethod
import pandas as pd
pd.read_tsv = partial(pd.read_csv, sep='\t')
pd.DataFrame.to_tsv = partialmethod(pd.DataFrame.to_csv, sep='\t', index=False)
And you can call
# Load a dataframe from TSV file
df = pd.read_tsv(...)
# Write a dataframe to TSV file
df.to_tsv(...) |
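For anyone who would rather not patch the pandas namespace, a rough alternative is a pair of plain helper functions. This is only a sketch: read_tsv and to_tsv here are hypothetical module-level helpers, not pandas APIs, and the sample frame is invented.

```python
import io

import pandas as pd


def read_tsv(path_or_buf, **kwargs):
    """Hypothetical helper: read_csv with a tab separator."""
    return pd.read_csv(path_or_buf, sep="\t", **kwargs)


def to_tsv(df, path_or_buf=None, **kwargs):
    """Hypothetical helper: to_csv with a tab separator and no index by default."""
    kwargs.setdefault("index", False)
    return df.to_csv(path_or_buf, sep="\t", **kwargs)


# Invented sample data for a quick round trip through an in-memory buffer.
df = pd.DataFrame({"gene": ["BRCA1", "TP53"], "score": [0.5, 1.5]})
text = to_tsv(df)  # returns a string because no path was given
round_tripped = read_tsv(io.StringIO(text))
```

The trade-off is that these are called as to_tsv(df, ...) rather than df.to_tsv(...), but nothing in pandas itself is modified.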
I propose a function, which can be called on a DataFrame, named to_tsv or to_table. The function is the equivalent of to_csv() with the argument sep='\t'. While to_csv() contains the functionality to write tsv files, I find it annoying to always have to specify an additional argument. I prefer tsv files to csv files because tabs more rarely occur and therefore decrease the need for escaping. I also find the plain-text rendering more readable. I worry that the lack of a dedicated to_tsv() function encourages the use of csv over tsv. Currently read_table() defaults to tab separators, but there is no equivalent function for writing.
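For reference, the asymmetry described here looks roughly like this in practice. The sample frame is invented, and the sketch assumes a pandas version in which read_table() is available.

```python
import io

import pandas as pd

# Invented sample data.
df = pd.DataFrame({"sample": ["a", "b"], "count": [1, 2]})

# Writing: the separator has to be spelled out on to_csv().
tsv_text = df.to_csv(sep="\t", index=False)

# Reading: read_table() uses a tab separator by default.
loaded = pd.read_table(io.StringIO(tsv_text))
```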