MAPREDUCE-7401. Optimize liststatus for better performance by using recursive listing #4677

ashutoshcipher · 2022-08-02T21:26:43Z

Description of PR

Optimize liststatus for better performance by using recursive listing.

JIRA - MAPREDUCE-7401

How was this patch tested?

Unit tests

For code changes:

Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

hadoop-yetus · 2022-08-03T02:09:13Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	23m 37s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 1s		codespell was not available.
+0 🆗	detsecrets	0m 1s		detect-secrets was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
+1 💚	test4tests	0m 0s		The patch appears to include 5 new or modified test files.
			_ trunk Compile Tests _
+0 🆗	mvndep	14m 57s		Maven dependency ordering for branch
+1 💚	mvninstall	29m 5s		trunk passed
+1 💚	compile	25m 34s		trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚	compile	22m 7s		trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	checkstyle	4m 28s		trunk passed
+1 💚	mvnsite	3m 18s		trunk passed
+1 💚	javadoc	2m 30s		trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚	javadoc	2m 0s		trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	spotbugs	4m 58s		trunk passed
+1 💚	shadedclient	25m 30s		branch has no errors when building and testing our client artifacts.
			_ Patch Compile Tests _
+0 🆗	mvndep	0m 55s		Maven dependency ordering for patch
+1 💚	mvninstall	1m 47s		the patch passed
+1 💚	compile	24m 35s		the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚	javac	24m 35s		the patch passed
+1 💚	compile	21m 57s		the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	javac	21m 57s		the patch passed
+1 💚	blanks	0m 0s		The patch has no blanks issues.
-0 ⚠️	checkstyle	4m 28s	/results-checkstyle-root.txt	root: The patch generated 12 new + 438 unchanged - 5 fixed = 450 total (was 443)
+1 💚	mvnsite	3m 17s		the patch passed
+1 💚	javadoc	2m 21s		the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚	javadoc	2m 1s		the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	spotbugs	5m 9s		the patch passed
+1 💚	shadedclient	25m 28s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
-1 ❌	unit	18m 22s	/patch-unit-hadoop-common-project_hadoop-common.txt	hadoop-common in the patch passed.
+1 💚	unit	7m 31s		hadoop-mapreduce-client-core in the patch passed.
+1 💚	asflicense	1m 17s		The patch does not generate ASF License warnings.
		281m 8s

Reason	Tests
Failed junit tests	hadoop.fs.TestFilterFileSystem

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4677/1/artifact/out/Dockerfile
GITHUB PR	#4677
Optional Tests	dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname	Linux bdb77f7fe201 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	trunk / `e8cc761`
Default Java	Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions	/usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4677/1/testReport/
Max. process+thread count	3144 (vs. ulimit of 5500)
modules	C: hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: .
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4677/1/console
versions	git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by	Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

steveloughran

-1, sorry.

Do not go near this unless you can show that the current `listFiles(path, recursive)` is inadequate, which I do not believe it is.

If you can make that case, then you have to look very closely at the Javadocs at the top of FileSystem and at how recent changes to the API, vectored IO for example, were managed. Also look at HADOOP-16898 to see its listing changes, including my unhappiness about something going in without more publicity across the different teams.

Any change in that API is public facing and has to be maintained forever. It needs to be supported effectively in HDFS and in cloud storage. That means you're going to have to do a full API specification, write contract tests, implement those contract tests in hadoop-aws and azure, and ideally anywhere else (google gcs). Then make sure that you don't break the external libs named in the javadocs.

Assume that I will automatically veto any new list method returning an array. It hits scale problems on HDFS (lock duration, size of responses to marshall) and prevents us doing things in the object stores including prefetching, IOStatistics collection and supporting close(). Also consider using builder APIs and returning a CompletableFuture.

Look at the s3a and abfs listing code to see how they implement listFiles, and at the s3a and manifest committers to see how those listings are used effectively. We kick off operations (treewalk, file loading) while waiting for the next page of responses to come in, ideally swallowing the entire latency of each list call.
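To make the overlap described above concrete, here is a minimal, self-contained sketch of an iterator that requests the next page in the background while the caller is still consuming the current one. It is not the s3a or abfs code: `PrefetchingListing` and its `fetchPage` function are hypothetical names invented for this illustration, and a real store client would return file status objects rather than strings.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

/**
 * Hypothetical sketch, not Hadoop code: an iterator over a paged listing
 * that requests page N+1 in the background while the caller consumes page N.
 */
final class PrefetchingListing implements Iterator<String> {
  // Hypothetical page source: returns the entries of page i, or an empty list past the end.
  private final IntFunction<List<String>> fetchPage;
  private Iterator<String> current;
  private CompletableFuture<List<String>> pending;
  private int nextIndex;

  PrefetchingListing(IntFunction<List<String>> fetchPage) {
    this.fetchPage = fetchPage;
    this.current = fetchPage.apply(0).iterator(); // first page is fetched synchronously
    this.nextIndex = 1;
    prefetch();
  }

  // Start fetching the next page without blocking the caller.
  private void prefetch() {
    final int index = nextIndex++;
    pending = CompletableFuture.supplyAsync(() -> fetchPage.apply(index));
  }

  @Override
  public boolean hasNext() {
    while (!current.hasNext()) {
      if (pending == null) {
        return false; // listing already exhausted
      }
      List<String> page = pending.join(); // blocks only if the page has not arrived yet
      if (page.isEmpty()) {
        pending = null; // an empty page marks the end of the listing
        return false;
      }
      current = page.iterator();
      prefetch();
    }
    return true;
  }

  @Override
  public String next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return current.next();
  }

  public static void main(String[] args) {
    List<List<String>> pages = Arrays.asList(
        Arrays.asList("part-0000", "part-0001"),
        Arrays.asList("part-0002"));
    PrefetchingListing listing = new PrefetchingListing(
        i -> i < pages.size() ? pages.get(i) : Collections.<String>emptyList());
    while (listing.hasNext()) {
      System.out.println(listing.next()); // the caller's per-file work overlaps the next fetch
    }
  }
}
```

An array-returning method cannot do this, because the caller sees nothing until every page has arrived. The sketch issues one extra fetch past the end of the listing to detect termination, which a real implementation would likely avoid by using a continuation marker from the store.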

Note also that because listFiles only returns files, not directories, we can do O(files/page size) deep list calls against s3.
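As a rough worked example of the O(files/page size) point, the sketch below compares the number of list calls for a flat, files-only deep listing with a lower bound for a directory-by-directory treewalk. The class name `ListCallEstimate`, the page size of 1000 and the file and directory counts are assumptions chosen for illustration, not measurements from any store.

```java
/** Hypothetical back-of-envelope helper; the figures in main are illustrative, not measured. */
final class ListCallEstimate {
  /** Calls for a flat, files-only paged listing: ceil(files / pageSize), and at least one. */
  static long deepListCalls(long files, int pageSize) {
    return Math.max(1L, (files + pageSize - 1) / pageSize);
  }

  /** Lower bound for a treewalk: at least one list call per directory visited. */
  static long treewalkCallsLowerBound(long directories) {
    return directories;
  }

  public static void main(String[] args) {
    long files = 1_000_000L;     // assumed number of files under the base path
    long directories = 10_000L;  // assumed number of partition directories
    int pageSize = 1000;         // assumed entries returned per list call
    System.out.println("deep listing calls: " + deepListCalls(files, pageSize));
    System.out.println("treewalk calls (at least): " + treewalkCallsLowerBound(directories));
  }
}
```

Under these assumed numbers the deep listing needs 1,000 calls while the treewalk needs at least 10,000, and the gap widens as the directory tree gets deeper or sparser.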

If the justification is that we need path filtering, see HADOOP-16673 Add filter parameter to FileSystem>>listFiles to see why that doesn't work in cloud and hence closed as WONTFIX.

I think a more manageable focus of this work would be to see how FileInputFormat could be sped up by using the existing APIs, albeit with all work done knowing that many external libraries subclass it, for example Parquet, Avro and ORC. Any incompatible change will stop them upgrading and we cannot do that.

Am I being very negative here? Yes I am. If you do want to change the APIs then you need to start talking about it on the HDFS and common lists, show that it delivers tangible benefit on-prem and in cloud, and undertake the extensive piece of work needed to implement it in the primary cloud stores to show it is performant.

Finally, when you consider that the future of tables is one of manifest files (iceberg, hudi, delta lake), IMO it is better to focus on making working with those formats faster. Treewalk listing may be slow with hive partitioned data, but such layouts are so pathologically bad in cloud for commit as well as query planning that new code is moving beyond them.

steveloughran · 2022-08-03T10:06:09Z

...ent/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java

-   *          The input filter that can be used to filter files/dirs. 
-   * @throws IOException
-   */
-  protected void addInputPathRecursively(List<FileStatus> result,


You can't remove this, as it breaks methods that external classes may use.

hadoop-yetus · 2022-08-03T14:17:47Z

🎊 +1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 40s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 0s		codespell was not available.
+0 🆗	detsecrets	0m 0s		detect-secrets was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
+1 💚	test4tests	0m 0s		The patch appears to include 5 new or modified test files.
			_ trunk Compile Tests _
+0 🆗	mvndep	15m 22s		Maven dependency ordering for branch
+1 💚	mvninstall	28m 23s		trunk passed
+1 💚	compile	23m 29s		trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚	compile	20m 44s		trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	checkstyle	4m 28s		trunk passed
+1 💚	mvnsite	3m 44s		trunk passed
+1 💚	javadoc	2m 53s		trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚	javadoc	2m 32s		trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	spotbugs	5m 17s		trunk passed
+1 💚	shadedclient	23m 45s		branch has no errors when building and testing our client artifacts.
			_ Patch Compile Tests _
+0 🆗	mvndep	0m 31s		Maven dependency ordering for patch
+1 💚	mvninstall	1m 44s		the patch passed
+1 💚	compile	24m 36s		the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚	javac	24m 36s		the patch passed
+1 💚	compile	22m 51s		the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	javac	22m 51s		the patch passed
+1 💚	blanks	0m 0s		The patch has no blanks issues.
+1 💚	checkstyle	4m 13s		root: The patch generated 0 new + 437 unchanged - 6 fixed = 437 total (was 443)
+1 💚	mvnsite	3m 33s		the patch passed
+1 💚	javadoc	2m 18s		the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚	javadoc	2m 13s		the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚	spotbugs	5m 19s		the patch passed
+1 💚	shadedclient	23m 13s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
+1 💚	unit	19m 11s		hadoop-common in the patch passed.
+1 💚	unit	7m 36s		hadoop-mapreduce-client-core in the patch passed.
+1 💚	asflicense	1m 21s		The patch does not generate ASF License warnings.
		253m 52s

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4677/2/artifact/out/Dockerfile
GITHUB PR	#4677
Optional Tests	dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname	Linux a2681ddc1b47 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	trunk / `4843c21`
Default Java	Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions	/usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4677/2/testReport/
Max. process+thread count	3103 (vs. ulimit of 5500)
modules	C: hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: .
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4677/2/console
versions	git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by	Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

MAPREDUCE-7401. Optimize liststatus for better performance by using r…

e8cc761

…ecursive listing

ashutoshcipher changed the title ~~MAPREDUCE-7401. Optimize liststatus for better performance by using recursive listing~~ [Draft] MAPREDUCE-7401. Optimize liststatus for better performance by using recursive listing Aug 2, 2022

Fixing style check

4843c21

steveloughran requested changes Aug 3, 2022

ashutoshcipher marked this pull request as ready for review August 3, 2022 21:01

ashutoshcipher changed the title ~~[Draft] MAPREDUCE-7401. Optimize liststatus for better performance by using recursive listing~~ MAPREDUCE-7401. Optimize liststatus for better performance by using recursive listing Aug 3, 2022

steveloughran closed this Nov 29, 2022
