Read_parquet is slower than expected with S3 #9619
Note that you should already be able to do this by passing open_file_options:

import dask.dataframe as dd
import pyarrow as pa
import pyarrow.fs as pa_fs
path = "s3://ursa-labs-taxi-data/2009/01/data.parquet"
fs = pa_fs.S3FileSystem(anonymous=True)
ddf = dd.read_parquet(
path,
engine="pyarrow",
storage_options={"anon": True},
open_file_options={
"open_file_func": fs.open_input_file,
},
)
ddf.partitions[0].compute()

Using …
If we are talking about IO latency problems, then obviously the chunksize and caching strategy in the fsspec file-like will be very important, as pyarrow treats it like a regular file with many small reads. This is why we created fsspec.parquet, to preemptively fetch all the bytes you will need before handing them to arrow.
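For illustration, a minimal sketch of the fsspec.parquet helper mentioned above, reusing the taxi file from the earlier snippet; the column selection is an arbitrary assumption, not something from this thread:

import fsspec.parquet
import pyarrow.parquet as pq

# open_parquet_file pre-fetches the byte ranges needed for the requested
# columns before handing a file-like object to pyarrow, avoiding many small reads.
path = "s3://ursa-labs-taxi-data/2009/01/data.parquet"
with fsspec.parquet.open_parquet_file(
    path,
    columns=["passenger_count"],      # illustrative column selection
    storage_options={"anon": True},
    engine="pyarrow",
) as f:
    table = pq.read_table(f, columns=["passenger_count"])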
Pyarrow actually doesn't do this anymore. They now expose a …
@rjzamora - worth testing! I wonder if they read the buffers concurrently.
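If the option being referred to is pyarrow's pre_buffer flag (an assumption on my part), a quick standalone test might look like this sketch:

import pyarrow.fs as pa_fs
import pyarrow.parquet as pq

fs = pa_fs.S3FileSystem(anonymous=True)
# pre_buffer asks arrow's parquet reader to coalesce and prefetch the column
# chunk byte ranges it needs, instead of issuing many small reads.
table = pq.read_table(
    "ursa-labs-taxi-data/2009/01/data.parquet",
    columns=["passenger_count"],      # illustrative column selection
    filesystem=fs,
    pre_buffer=True,
)
print(table.num_rows)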
I'll be dumb for a moment. If there aren't any backend options specified, and if arrow is present, then should we switch to using Arrow by default for things like this?
The issue is that Dask has adopted fsspec as its standard filesystem interface, and the fsspec API is not always aligned with the pyarrow.fs API. Therefore, the user would still need to pass in fsspec-based storage_options to read_parquet, and those options may be slightly different from the options needed to initialize an s3fs instance. This is why the code I shared above requires the user to create the pyarrow filesystem themselves. Although, I guess you are explicitly asking "If there aren't any backend options specified". So, I suppose that case may work.
If storage_options was empty could we safely do this?
On Thu, Nov 3, 2022 at 4:28 PM Richard (Rick) Zamora wrote:
If there aren't any backend options specified, and if arrow is present,
then should we switch to using Arrow by default for things like this?
The issue is that Dask has adopted fsspec as its standard filesystem
interface, and the fsspec API is not always aligned with the pyarrow.fs
API. Therefore, the user would still need to pass in fsspec-based
storage_options to read_parquet, and those options may be slightly
different than the options needed to initialize an s3fs instance. This is
why the code I shared above requires the user to create the pyarrow
filesystem themselves.
Yes. Sorry - I hit enter too early. We would still need to use fsspec for graph construction, but would be able to use pyarrow to open the file at data-reading time if storage_options is empty. Note that we could certainly add the option to use a pyarrow filesystem through the …
A 50% speedup seems worth a non-trivial change to me? @rjzamora thanks for sharing the snippet to do this. We should benchmark this on Coiled as well to get some more data points. I wonder if there's some subset of commonly-used fsspec options that we could easily translate into arrow?
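To make that last idea concrete, a hedged sketch of what such a translation could look like; the mapping below is illustrative and intentionally tiny, not an existing Dask API:

import pyarrow.fs as pa_fs

def s3_options_to_pyarrow(storage_options):
    # Map a few common s3fs/fsspec options onto pyarrow.fs.S3FileSystem arguments.
    mapping = {
        "anon": "anonymous",
        "key": "access_key",
        "secret": "secret_key",
        "token": "session_token",
    }
    kwargs = {}
    for fsspec_name, arrow_name in mapping.items():
        if fsspec_name in storage_options:
            kwargs[arrow_name] = storage_options[fsspec_name]
    return pa_fs.S3FileSystem(**kwargs)

fs = s3_options_to_pyarrow({"anon": True})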
It's worth turning on logging: fsspec.utils.setup_logging(logger_name="s3fs").

Creating the dataframe gives … (4 head calls, 4 get calls, all the same range) and reading gives … (fetched 266MB in 16 serial but unordered calls in 30s, including parse time, which is near my available bandwidth).

When loading with pyarrow's FS using the snippet above, I get …

The time to download the whole file with s3fs in a single continuous call is 27.7s.

… takes 29s with s3fs, with two concurrent reads (35s if actually parsing the data). Please run these on your machines closer to the data!

Note that Rick's blog ( https://developer.nvidia.com/blog/optimizing-access-to-parquet-data-with-fsspec/ ) specifically measured s3fs with various caching versus arrow's FS versus fsspec.parquet.
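For anyone wanting to reproduce that kind of standalone measurement, a rough sketch (logger on, one continuous download of the taxi file used earlier):

import time

import fsspec
import s3fs

fsspec.utils.setup_logging(logger_name="s3fs")   # log every S3 call s3fs makes
fs = s3fs.S3FileSystem(anon=True)

start = time.perf_counter()
data = fs.cat_file("ursa-labs-taxi-data/2009/01/data.parquet")  # one continuous GET
print(f"{len(data) / 1e6:.0f} MB in {time.perf_counter() - start:.1f}s")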
Totally agreed that folks shouldn't jump to conclusions. @martindurant have you had a chance to look at this video? I would be curious about your opinion on what is going on there. Also, it would be great if you have a chance to look at @gjoseph92's py-spy profiles (I click "Left Heavy" and find that helps with interpretability).
If I had to guess (and I only have like 20% confidence here), it would be that while fsspec is good on its own and dask is good on its own, there is some negative interaction when both event loops are doing their thing at once, and for some reason things get sticky. Just a total guess though. If that guess is anywhere near correct, then I think we would want to evaluate things in the wild, using fsspec/arrow + dask + s3, ideally on cloud machines. If you'd like your very own cloud machines I'm happy to set you up with a Coiled account. It's a pretty smooth experience today. I'm also happy to pay if folks don't already have AWS credentials through work.
NB: CLI curl took 22s to fetch the file from the HTTP endpoint.
I will also do my own measurements on AWS, but certainly not today. I can't say there's too much rush. I would prefer if we can improve s3fs rather than splitting the FS backends.
Feel free to see if this makes any difference:

--- a/s3fs/core.py
+++ b/s3fs/core.py
@@ -2231,7 +2231,11 @@ def _fetch_range(fs, bucket, key, version_id, start, end, req_kw=None):
)
return b""
logger.debug("Fetch: %s/%s, %s-%s", bucket, key, start, end)
- resp = fs.call_s3(
+ return sync(fs.loop, _inner_fetch, fs, bucket, key, version_id, start, end, req_kw)
+
+
+async def _inner_fetch(fs, bucket, key, version_id, start, end, req_kw=None):
+ resp = await fs._call_s3(
"get_object",
Bucket=bucket,
Key=key,
@@ -2239,4 +2243,4 @@ def _fetch_range(fs, bucket, key, version_id, start, end, req_kw=None):
**version_id_kw(version_id),
**req_kw,
)
- return sync(fs.loop, resp["Body"].read)
+ return await resp["Body"].read()
Can you motivate this a bit more? What are some of the downsides of using Arrow filesystems in specialized cases like above?
To dask: some additional complexity and development, as well as possible confusion for users when multiple backends are used in conjunction.
Sure, and I'm not currently planning to invest any development resources in improving arrow. However, if today Arrow's reading is better than s3fs's reading in a particular case, and if we know that we are in a situation where we can swap the two safely, then it seems to me like we should do so. To be clear, I'm not saying "let's rip out S3" I'm saying "In this restricted but common case where we know it's safe, maybe we can automatically swap things and get a speed boost"
You might also be underestimating the importance of this speed boost. We've just discovered a way for Dask to go 2x faster on very common workloads. This is HUGE for a lot of users.
(If indeed this speedup exists; our current benchmarking shows that it does in the wild, but we should do more homework.)
We built everything on …
On ec2: …
So s3fs matches wget and curl for throughput. With dask in the loop, exactly the same: …
So all saturate at about 100MB/s, as expected for a .medium EC2 instance. Do I need something bigger? Why does pyarrow consistently fail? (pyarrow 8.0.0)
I haven't seen any failures on pyarrow-9.0.0 (using …).
My network is too noisy right now to run any useful experiments. However, it may be the case that the data-transfer time is indeed optimal (or close to it) for …
The env was created like … under AWS linux from fresh miniconda, nothing special.
I'd expect plain …

I think the problem here is about the interaction (particularly regarding concurrency) between s3fs and dask (and pyarrow?), as Matt is saying. s3fs seems perfectly fast on its own, but when you have multiple Arrow threads calling into single-threaded, GIL-locked Python to do their reads, which then call …

As an example, please take a look at the py-spy profile I recorded of multi-threaded dask workers on a real cluster loading Parquet data: tls-10_0_0_177-42425.json. If you look at the …

This is the convoy effect (https://bugs.python.org/issue7946); you can read more at dask/distributed#6325. It means that async networking performance will be significantly degraded if you're doing stuff in other threads that hold the GIL. In Dask, we're using async networking to load parquet, and we have other worker threads doing stuff that's likely to hold the GIL (plenty of pandas operations hold the GIL at least some of the time). This is just an inherent limitation of Python right now; there's not much fsspec can do about it.

I'm not saying this is the only problem. I also would imagine (unfounded) that arrow could be more memory-efficient and reduce the number of copies if it got to manage its own IO without leaving the C++ world. I'd hope it could read from the network directly into an arrow-managed buffer, ideally zero-copy.

Broadly, I just find it weird to see 4 Arrow threads all calling into Python to do their IO. I would expect, intuitively, that a performance- and memory-sensitive C++ library like Arrow can perform better if it gets to manage its own IO, also in C++.
Yeah, just to be clear, what I'm interested in here isn't "the relative performance of s3fs vs pyarrow s3", it's the "relative performance of these libraries in the context of Dask". What I'm seeing here (thanks @gjoseph92 for the profiling) is that s3fs is 2x slower in the context of Dask in a specific but very common use case. Unless we see an easy development path towards making this much faster (and @gjoseph92's comment leads me to believe that such a path will be hard to find), I think that we should probably move in the direction of swapping things out when we know that it is safe to do so.
Quick addendum: "*at least in the short term".
Note that I don't actually know what code was run, the basis of this thread. I have been using Rick's snippet, which does not show the problem. The report shows:

ddf_q1 = ddf[["id1", "v1"]].astype({"id1": "string[pyarrow]"})
ddf_q1.groupby("id1", dropna=False, observed=True).agg({"v1": "sum"}).compute()

(is it not possible to get arrow's parquet reader to produce arrow-string columns?!)
@ncclementi can you give @martindurant a quick code snippet that he can run that reads the right parquet file (or any parquet file, I suspect)? @martindurant I think that this is being pulled from https://github.com/coiled/coiled-runtime/blob/main/tests/benchmarks/h2o/test_h2o_benchmarks.py and looks like this:

dd.read_parquet(
    request.param, engine="pyarrow", storage_options={"anon": True}
)

I suspect with this parameter, …
Yup. Current source of frustration. See #9631
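For reference, one way to get string[pyarrow] columns straight out of a pyarrow Table rather than casting afterwards; a hedged sketch, not necessarily what #9631 ends up doing:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table("data.parquet", columns=["id1", "v1"])  # hypothetical local file
# Map Arrow string columns to pandas' pyarrow-backed string dtype during the
# conversion, instead of producing object columns and astype-ing later.
df = table.to_pandas(types_mapper={pa.string(): pd.StringDtype("pyarrow")}.get)
print(df.dtypes)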
Never mind, here is an entirely different computation with the same result:

import coiled
from dask.distributed import Client
import dask.dataframe as dd
df = dd.read_parquet("s3://coiled-datasets/mrocklin/nyc-taxi-fhv",
storage_options={"anon": True})
cluster = coiled.Cluster(package_sync=True)
client = Client(cluster)
with coiled.performance_report():
df[["trip_miles", "trip_time", "base_passenger_fare"]].sum().compute() |
In contrast:

import pyarrow.fs as pa_fs
fs = pa_fs.S3FileSystem(anonymous=True)
df = dd.read_parquet(
"s3://coiled-datasets/mrocklin/nyc-taxi-fhv",
storage_options={"anon": True},
open_file_options={
"open_file_func": fs.open_input_file,
},
)
with coiled.performance_report():
df[["trip_miles", "trip_time", "base_passenger_fare"]].sum().compute() |
Saw this a bit late; for reference, things in the …
So an example for the 5GB dataset would be:

import coiled
from dask.distributed import Client, performance_report
import dask.dataframe as dd
cluster = coiled.Cluster(n_workers=4, package_sync=True)
client = Client(cluster)
ddf = dd.read_parquet(
"s3://coiled-datasets/h2o-benchmark/N_1e8_K_1e2_parquet/*.parquet",
engine="pyarrow",
storage_options={"anon": True},
)
with performance_report():
    ddf_q1 = ddf[["id1", "v1"]]
    ddf_q1.groupby("id1", dropna=False, observed=True).agg({"v1": "sum"}).compute()

Here you can find multiple performance reports for all the different queries, but q1 is the one for this example, and I believe it is the same one that Gabe ran in his py-spy report.
Did anyone try uvloop with s3fs?
I haven't, no.
I also haven't. But uvloop suffers from the convoy effect just like asyncio does, so I'd be surprised to see much difference. A thing that could help is if operations on nonblocking sockets kept the GIL. Since they should be nonblocking, there should be no reason to release, and then have to immediately re-acquire, the GIL. Getting this change into CPython would be slow and maybe not possible: https://bugs.python.org/issue45819. The main appeal of uvloop is that getting such a change (MagicStack/uvloop@124c2c3) merged is probably a lot more feasible. A bit of the motivation for the asyncio TCP comm (dask/distributed#5450) was for this purpose, so we could then go fix the convoy effect in uvloop and use the new comm to take advantage of it, but that effort fizzled out. (It probably is worth trying the asyncio comm though. Re-reading the description of that PR, it seems like there might be a few advantages over Tornado even with the convoy effect in play. But that's a tangent to this conversation; comms aren't being used by fsspec to read parquet from S3, that's only for worker<->worker data transfer.)
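In case someone wants to try it, a hedged sketch of what "uvloop with s3fs" could look like; whether s3fs's background IO loop actually picks up the uvloop policy this way is an assumption to verify:

import uvloop
import s3fs

# Install the uvloop event-loop policy before s3fs creates its background IO
# loop, so any newly created asyncio loop comes from uvloop.
uvloop.install()

fs = s3fs.S3FileSystem(anon=True)
data = fs.cat_file("ursa-labs-taxi-data/2009/01/data.parquet")
print(len(data))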
Even if uvloop solved the problem, I would still push for this change. Many people don't use uvloop, and if we can give those people a 2x speedup for presumably no cost then we should.
If we can offer a speedup for people using s3fs without arrow, or on backends not supported by arrow, we should do that too.
Yes. That would be great.
For fun, I grabbed the 5GB (2.8GB on disk) data set and …
Note particularly the CPU time: pyarrow was using threads and still came in way behind. This was on an Intel Core i5-9400F, 6-core, 16GB RAM and SSD storage. My GPU is not big or new enough to bother trying with that.
This is a great find! Just wanted to mention something I'm running into when using this, not as a problem for anyone else to solve, but to highlight a potential situation to consider if this is implemented as the default behavior. It may well be that I've hit an edge case, that it's related to one of the other libraries I'm using, or that I'm doing something wrong. I use a prefect-dask workflow that calls read_parquet inside each task, roughly like this:
import os
import dask.dataframe as dd
import numpy as np
import pandas as pd
import pyarrow.fs as pa_fs
from dask.distributed import LocalCluster
from prefect import flow, task
from prefect_dask import DaskTaskRunner
fs = pa_fs.S3FileSystem(
access_key=os.environ["AWS_ACCESS_KEY_ID"],
secret_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
@task
def my_task(part: int) -> pd.DataFrame:
    df = dd.read_parquet(
        f"s3://<MY_BUCKET>/<MY_FILE_PREFIX>/part_{part:04d}.parquet",
        engine="pyarrow",
        filters=<SOME_FILTERS>,
        open_file_options={
            "open_file_func": fs.open_input_file,
        },
    )
    return df.compute()

@flow(
    task_runner=DaskTaskRunner(
        cluster_class=LocalCluster,
        cluster_kwargs={"processes": True, "n_workers": 8},
    ),
)
def my_flow(partitions: list[int]):
    my_task.map(np.array_split(partitions, 8))

if __name__ == "__main__":
    my_flow(1000)

When there's 1000 partitions, each process gets through maybe half of them before inevitably encountering this error: …
Fortunately, I can simply remove the open_file_options argument for now.
Thanks for the feedback @the-matt-morris. I'm curious if you run into the same problem if you use the new filesystem= argument:

df = dd.read_parquet(
    f"s3://<MY_BUCKET>/<MY_FILE_PREFIX>/part_{part:04d}.parquet",
    engine="pyarrow",
    filters=<SOME_FILTERS>,
    filesystem=fs,  # pass your pyarrow filesystem object here
)
That solved my problem, no longer seeing the error in my use case. Thanks @jrbourbeau!
Awesome, thanks for trying that out @the-matt-morris! Glad to hear things are working. Is using the filesystem= argument also faster for your use case?
Hey @jrbourbeau, sorry for the delay. You know what, I'm seeing nearly identical performance between both methods. I suppose it might make sense to benchmark this against publicly available data. Looks like there was a coiled-datasets parquet file that was being used previously in this thread. I can do some trials on that with this change.
It is my first experience with Dask. After hours of investigation, I found this issue, and it solves the problem for me. In contrast to the problem reported by @mrocklin, Dask became nearly impossible to use when trying to load files from AWS S3. Also, I have concerns that ineffective S3 usage increases the request rate, which should lead to increased service costs. Therefore, I would say that it is a serious performance issue. Based on the difference in latency, a simple min/max calculation might take from 20 seconds up to 2 minutes for a pretty small parquet file, 500MB (snappy compressed).

Dask version: …
@dbalabka, thanks for your report. It would be very useful to know more details about your workflow and setup. How many threads versus processes are you using, for example? How big are your typical chunks in bytes? If you turn on the "s3fs" logger, you can also see what calls it is making. You suggest there are too many, so we need to know which could in theory be eliminated. Could excessive calls by themselves lead to your slowdown due to AWS' throttling?
@martindurant, I apologise for the late response. I had an issue that was resolved, but it was not related to the problem we previously discussed. I believe the update for dask up until … The loading of files from S3 is speedy enough. However, following your advice, I utilized s3fs debug logging to study the effect on performance. I hope these logs will be useful in your investigation. Logs of loading a 500MB file from S3: …
We do see that there are some extraneous head_object calls here (costing up to 100ms each) that would be worth tracking down. I suspect that if there were an initial … Also, it is interesting that the range reads are in reverse order - maybe this is pyarrow reading the parquet footer and finding 64KB isn't big enough (but it happens more than once). This is probably not something s3fs can do anything about.
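One cheap experiment along those lines (whether it actually avoids the extra head_object calls is an assumption to verify; the log will tell): list the directory first so s3fs caches object details, then watch which calls the later reads still make:

import fsspec
import s3fs

fsspec.utils.setup_logging(logger_name="s3fs")
fs = s3fs.S3FileSystem(anon=True)

# Listing populates s3fs's directory cache with object sizes up front.
files = fs.ls("coiled-datasets/h2o-benchmark/N_1e8_K_1e2_parquet")  # bucket path from this thread
info = fs.info(files[0])   # ideally served from the cache rather than a fresh HEAD
print(info["size"])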
I was looking at a read_parquet profile with @th3ed @ncclementi and @gjoseph92
Looking at this performance report: https://raw.githubusercontent.com/coiled/h2o-benchmarks/main/performance-reports-pyarr_str-50GB/q1_50GB_pyarr.html
I see the following analysis (two minute video): https://www.loom.com/share/4c8ad1c5251a4e658c1c47ee2113f34a
We're spending only about 20-25% of our time reading from S3, and about 5% of our time converting data to Pandas. We're spending a lot of our time doing something else.
@gjoseph92 took a look at this with pyspy and generated reports like the following: tls-10_0_0_177-42425.json
I'm copying a note from him below: …
It looks like we could make things a lot faster. I'm curious about the right steps to isolate the problem further.
cc'ing @martindurant @rjzamora @ritchie46 @fjetter