Remove mandatory dependency on pandas #3133
Conversation
The repository requires Python>=3.9, so this is obsolete.
Pandas is significantly faster for reading large CSV files. I'd prefer to keep it.
Is it? For a ~2 GB CSV file generated with:

```python
import pandas as pd

N = 10000000
df = pd.DataFrame()
df["filename"] = pd._testing.rands_array(100, N)
df["transcript"] = pd._testing.rands_array(100, N)
df["speaker"] = pd._testing.rands_array(10, N)
df["emotion"] = pd._testing.rands_array(10, N)
df.to_csv("random.csv", sep=",")
```

Loading with:

```python
import pandas as pd
import time
import csv

input_file = "random.csv"

start_time = time.time()
data = csv.reader(open(input_file))
for row in data:
    pass
print("csv.reader took %s seconds" % (time.time() - start_time))

start_time = time.time()
data = csv.DictReader(open(input_file))
for row in data:
    pass
print("csv.DictReader took %s seconds" % (time.time() - start_time))

start_time = time.time()
data = pd.read_csv(input_file)
for row in data:
    pass
print("pd.read_csv took %s seconds" % (time.time() - start_time))
```

I get:
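One caveat when reading these timings: iterating over a pandas DataFrame yields its column labels, not its rows, so the `pd.read_csv` loop above does essentially no per-row work after the initial parse. A fairer row-wise comparison would iterate explicitly, e.g.:

```python
import pandas as pd

# Small example frame in the same shape as the benchmark data.
df = pd.DataFrame({"filename": ["a.wav", "b.wav"], "speaker": ["s1", "s2"]})

# Plain iteration walks column labels, not rows:
assert list(df) == ["filename", "speaker"]

# Row-wise iteration needs itertuples() (or iterrows()):
rows = list(df.itertuples(index=False))
assert rows[0].filename == "a.wav"
```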
In any case, this is done only once before training, which takes much longer, so I don't think it's significant either way.
Yeah, Pandas is definitely not faster if you're just using it to read or write CSV. Not pulling in Pandas as a dependency is a win for CI times and saving-the-planet too.
This is on my side:

```
csv.reader took 14.11574387550354 seconds
```

I am against changing anything that already works fine without any practical reason. Any unnoticed problem at loading affects the entire training result, and I am wary of that. I'd rather struggle with deps than take that risk.
I guess one solution to the issue is separating deps for inference and training. Given that more people are interested in using the models, they would not go to the trouble of dealing with extra deps.
on my machine (repeatedly).
How do we know that code works fine? If there's a test for loading those datasets, it should catch any problems, right?
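Such a loading test could look roughly like the sketch below. The `load_metadata` helper and the pipe-separated metadata format are assumptions for illustration; the repo's actual formatter functions would be tested instead:

```python
import csv
import tempfile
from pathlib import Path


def load_metadata(path):
    # Hypothetical stand-in for a dataset formatter: read a
    # pipe-separated metadata file with only the standard library.
    with open(path, newline="", encoding="utf-8") as f:
        return [row for row in csv.reader(f, delimiter="|")]


def test_load_metadata_roundtrip():
    # Write a tiny metadata file and check it loads back unchanged.
    rows = [
        ["clip_0001", "Hello world.", "speaker_a"],
        ["clip_0002", "Second line.", "speaker_b"],
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "metadata.csv"
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f, delimiter="|").writerows(rows)
        assert load_metadata(path) == rows
```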
I can add some tests as well if that helps (and might factor it out into
That would be very useful and I can work on this. How about also making the other language-specific deps optional, as is already done for Japanese?
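One way to structure that is per-feature extras in the package metadata. The sketch below is illustrative only; the group and package names are assumptions, not the repo's actual dependency list:

```python
# Illustrative extras_require mapping for setup.py: users opt in with
# e.g. `pip install TTS[ja]`, and the core install stays lean.
extras_require = {
    "ja": ["mecab-python3", "unidic-lite"],  # Japanese-only deps (assumed names)
    "notebooks": ["pandas"],                 # only needed for the example notebooks
}

# An "all" extra that unions the optional groups:
extras_require["all"] = sorted(
    {dep for deps in extras_require.values() for dep in deps}
)
```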
@eginhard if someone is willing to do that, I am totally in, but things are hellish right now and I can't budge :) About testing, I'd rather keep pandas. So the problem with pandas is dep conflicts, right? If so, how about fixing that? Would that work for you?
I could also take a shot at that (been trying to do that before...), but would such a PR get accepted? I mean, I'm still kinda waiting for @erogol to take a look at #3004 too, which is less intrusive than that 😉
Generally speaking, the problem is downloading and installing a whole host of dependencies when they're not needed, contributing to e.g. slower CI times (which, I understand, is a problem in this repo) and more disk space consumed. (Anecdotally, in a Python 3.11 Docker container, installing

For any extra dependency, there's of course added surface for supply chain attacks. That's likely not an issue for
Why's that? I'm not sure I'm following.
In addition to what @akx said, for us the motivation is to use TTS as a library without having to pull in so many unnecessary deps. A default version of the package that only includes the inference deps would work for that as well, if you really want to keep pandas. I'll probably look into better logging (#1691) first though.
@eginhard FWIW, if you're going to work on fixing the logging,
(I did just that in Stability-AI/generative-models#77)
`pandas` was only used in a few places to read or write CSV files, which can just as well be done with the standard library. I've changed that accordingly and moved `pandas` to the optional notebooks requirements.

I also removed the separate `numba` pin for Python < 3.9 because only Python >= 3.9 is supported now.
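The replacement boils down to the standard library's `csv` module. A minimal sketch of the pattern (the function names and the pipe delimiter here are illustrative, not the repo's actual formatters):

```python
import csv


def read_metadata(path, delimiter="|"):
    # stdlib replacement for pd.read_csv: returns a list of dicts
    # keyed by the header row, like DataFrame.to_dict("records").
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter=delimiter))


def write_metadata(path, rows, delimiter="|"):
    # stdlib replacement for DataFrame.to_csv: writes a header row
    # followed by one line per record.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys(), delimiter=delimiter)
        writer.writeheader()
        writer.writerows(rows)
```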