-
Notifications
You must be signed in to change notification settings - Fork 9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MAPREDUCE-7401. Optimize liststatus for better performance by using recursive listing #4677
Conversation
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1, sorry.
Do not go near this unless you can show that the current `listFiles(path, recursive)' is inadequate. Which I do not believe it is.
If you can make the case that it doesn't change it then you have to look very closely at the Javadocs at the top of FileSystem and any recent changes to the API to see how they are managed. Vectored IO for example. also look at HADOOP-16898 and HADOOP-16898 to see their listing changes including my unhappiness about something going in without more publicity across the different teams.
Any change in that API is public facing and has to be maintained forever. It needs to be supported effectively in HDFS and in cloud storage. That means you're going to have to do a full api specification, write contract tests, implement those contact tests on in hadoop-aws and azure, and ideally anywhere else (google gcs). then make sure that you don't break the external libs named in the javadocs.
Assume that I will automatically veto any new list method returning an array. It hits scale problems on HDFS -lock duration, size of responses to marshall- and prevents us doing things in the object stores including prefetching, IOStatistics collection and supporting close(). Also using builder APIs and returning a CompletableFuture.
Look at the s3a and abfs listing code to see how implement listFiles, and the s3a and manifest I committed to see how they are effectively used. we kick off operations (treewalk, file loading) while waiting for next page of responses to come in, ideally swallowing the entire latency of each list call.
Note also that because listFiles only returns files, not directories, we can do O(files/page size) deep list calls against s3.
If the justification is that we need path filtering, see HADOOP-16673 Add filter parameter to FileSystem>>listFiles to see why that doesn't work in cloud and hence closed as WONTFIX.
I think a more manageable focus of this work would be to see how FileInputFormat could be speeded up by using the existing APIs, I am at with all work done knowing that many external libraries subclass that. For example, Parquet, Avro and ORC. Any incompatible change will stop them upgrading and we cannot do that.
Am I being very negative here? Yes I am. If you do want to change the Apis then you need to start talking about it on the HDFS and common lists, show that it delivers tangible benefit on-prem and in cloud, and undertake the extensive piece of work needed to implement in the primary cloud stores to show it is performant.
Finally, when you consider that the future of tables is one of manifest files (iceberg, hudi, delta lake), IMO it is better to focus on making workign with those formats faster. treewalk listing may be slow with hive partitioned data, but they are so pathologically bad in cloud for commit as well as query planning, that new code is moving beyond them
* The input filter that can be used to filter files/dirs. | ||
* @throws IOException | ||
*/ | ||
protected void addInputPathRecursively(List<FileStatus> result, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can't remove this as it breaks methods external classes may use
🎊 +1 overall
This message was automatically generated. |
Description of PR
Optimize liststatus for better performance by using recursive listing.
JIRA - MAPREDUCE-7401
How was this patch tested?
Unit tests
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?