Skip to content

Conversation

@mridulm
Copy link
Contributor

@mridulm mridulm commented May 6, 2022

What changes were proposed in this pull request?

Fix test failure in build.
Depending on the umask of the process running tests (which is typically inherited from the user's default umask), the group writable bit for the files/directories could be set or unset. The test was assuming that by default the umask will be restrictive (and so files/directories wont be group writable). Since this is not a valid assumption, we use jnr to change the umask of the process to be more restrictive - so that the test can validate the behavior change - and reset it back once the test is done.

Why are the changes needed?

Fix test failure in build

Does this PR introduce any user-facing change?

No

Adds jnr as a test scoped dependency, which does not bring in any other new dependency (asm is already a dep in spark).

[INFO] +- com.github.jnr:jnr-posix:jar:3.0.9:test
[INFO] |  +- com.github.jnr:jnr-ffi:jar:2.0.1:test
[INFO] |  |  +- com.github.jnr:jffi:jar:1.2.7:test
[INFO] |  |  +- com.github.jnr:jffi:jar:native:1.2.7:test
[INFO] |  |  +- org.ow2.asm:asm:jar:5.0.3:test
[INFO] |  |  +- org.ow2.asm:asm-commons:jar:5.0.3:test
[INFO] |  |  +- org.ow2.asm:asm-analysis:jar:5.0.3:test
[INFO] |  |  +- org.ow2.asm:asm-tree:jar:5.0.3:test
[INFO] |  |  +- org.ow2.asm:asm-util:jar:5.0.3:test
[INFO] |  |  \- com.github.jnr:jnr-x86asm:jar:1.0.2:test
[INFO] |  \- com.github.jnr:jnr-constants:jar:0.8.6:test

How was this patch tested?

Modification to existing test.
Tested on Linux, skips test when native posix env is not found.

@mridulm
Copy link
Contributor Author

mridulm commented May 6, 2022

Tested locally, will wait for GA to also complete.

@mridulm
Copy link
Contributor Author

mridulm commented May 6, 2022

+CC @srowen, @Kimahriman

@mridulm mridulm changed the title [SPARK-37618][CORE][Followup] Fix test failure [SPARK-37618][CORE][Followup] Support cleaning up shuffle blocks from external shuffle service May 6, 2022
@Kimahriman
Copy link
Contributor

Looks good to me, thanks! Figured posix permissions would do something weird on someone's machine eventually

// Use jnr to get and override the current process umask.
// Expects the input mask to be an octal number
private def getAndSetUmask(posix: POSIX, mask: String): String = {
val prev = posix.umask(BigInt(mask, 8).toInt)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any existing utility for setting umask? I thought Hadoop APIs had this somewhere and that we use it. No big deal if not. But if we have other places we use umask, could be good to standardize

Copy link
Contributor Author

@mridulm mridulm May 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We dont currently get/set process umask. I did see use of jna in hadoop, but that is shaded now.
Any library which allows us to set process umask would do actually, jnr simply seemed to be more easy to use across platforms (this is for tests anyway).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only place I know of is in the native code for the container executor

@mridulm
Copy link
Contributor Author

mridulm commented May 6, 2022

+CC @MaxGekk

@mridulm
Copy link
Contributor Author

mridulm commented May 6, 2022

Any other comments @srowen ?

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mridulm . Could you elaborate a little more about the condition where this happens in the PR description? It doesn't happen in Apache Spark GitHub Action and Apple Silicon Farm.

@mridulm
Copy link
Contributor Author

mridulm commented May 6, 2022

Sure, will do. Thx for the review @dongjoon-hyun, PTAL at the changed description.

@mridulm
Copy link
Contributor Author

mridulm commented May 6, 2022

My local machine seems to have gotten hosed, can you please merge it to master/branch-3.3 @dongjoon-hyun or @srowen ? Thx !

@dongjoon-hyun
Copy link
Member

While testing this PR for merging, I made a PR to you. Could you review that, @mridulm ?

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented May 7, 2022

IIUC, that PR will minimize the diff for backporting.

   test("SPARK-37618: Sub dirs are group writable when removing from shuffle service enabled") {
     val conf = testConf.clone
     conf.set("spark.local.dir", rootDirs)
     conf.set("spark.shuffle.service.enabled", "true")
     conf.set("spark.shuffle.service.removeShuffle", "false")
+    val posix = POSIXFactory.getPOSIX
+    assume(posix.isNative, "Skipping test for SPARK-37618, native posix support not found")
+
+    val oldUmask = getAndSetUmask(posix, "077")
     val diskBlockManager = new DiskBlockManager(conf, deleteFilesOnStop = true, isDriver = false)
     val blockId = new TestBlockId("test")
     val newFile = diskBlockManager.getFile(blockId)
@@ -163,6 +175,7 @@ class DiskBlockManagerSuite extends SparkFunSuite with BeforeAndAfterEach with B
     assert(parentDir2.exists && parentDir2.isDirectory)
     val permission2 = Files.getPosixFilePermissions(parentDir2.toPath)
     assert(permission2.contains(PosixFilePermission.GROUP_WRITE))
+    getAndSetUmask(posix, oldUmask)
   }

@Kimahriman
Copy link
Contributor

Kimahriman commented May 7, 2022

Maybe dumb question... But could this have been used to make the actual logic a little simpler? If we could change the umask to be more permissive (allow group write) when creating blockmgr sub dirs, we wouldn't need to do quite as much permission changing after the fact. I didn't know there was a way to change it in Java land without forking a subprocess

@mridulm
Copy link
Contributor Author

mridulm commented May 7, 2022

@dongjoon-hyun I wanted to make sure that the process umask is restored in a try/finally.
With this change, if anything else throws an exception/assertion failure, we end up with a modified umask, right ?
(The assume part is a nifty trick, I like that - had not used it before, thx !)

@mridulm
Copy link
Contributor Author

mridulm commented May 7, 2022

@Kimahriman That is a possibility, but I did not want to make functional changes as part of RC voting :)
We will need further testing and validation to see how we can leverage jnr for solving the initial issue - something for 3.3.1/master ?

@Kimahriman
Copy link
Contributor

@Kimahriman That is a possibility, but I did not want to make functional changes as part of RC voting :) We will need further testing and validation to see how we can leverage jnr for solving the initial issue - something for 3.3.1/master ?

Yeah wasn't suggesting doing it now! Hah just wanted to bring it up. I can try to play around with it at some point

@mridulm
Copy link
Contributor Author

mridulm commented May 7, 2022

Any other comments @srowen, @dongjoon-hyun ? Thx

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM again. No other comments from my side, @mridulm .

@srowen srowen closed this in 3174071 May 8, 2022
srowen pushed a commit that referenced this pull request May 8, 2022
… external shuffle service

### What changes were proposed in this pull request?

Fix test failure in build.
Depending on the umask of the process running tests (which is typically inherited from the user's default umask), the group writable bit for the files/directories could be set or unset. The test was assuming that by default the umask will be restrictive (and so files/directories wont be group writable). Since this is not a valid assumption, we use jnr to change the umask of the process to be more restrictive - so that the test can validate the behavior change - and reset it back once the test is done.

### Why are the changes needed?

Fix test failure in build

### Does this PR introduce _any_ user-facing change?
No

Adds jnr as a test scoped dependency, which does not bring in any other new dependency (asm is already a dep in spark).
```
[INFO] +- com.github.jnr:jnr-posix:jar:3.0.9:test
[INFO] |  +- com.github.jnr:jnr-ffi:jar:2.0.1:test
[INFO] |  |  +- com.github.jnr:jffi:jar:1.2.7:test
[INFO] |  |  +- com.github.jnr:jffi:jar:native:1.2.7:test
[INFO] |  |  +- org.ow2.asm:asm:jar:5.0.3:test
[INFO] |  |  +- org.ow2.asm:asm-commons:jar:5.0.3:test
[INFO] |  |  +- org.ow2.asm:asm-analysis:jar:5.0.3:test
[INFO] |  |  +- org.ow2.asm:asm-tree:jar:5.0.3:test
[INFO] |  |  +- org.ow2.asm:asm-util:jar:5.0.3:test
[INFO] |  |  \- com.github.jnr:jnr-x86asm:jar:1.0.2:test
[INFO] |  \- com.github.jnr:jnr-constants:jar:0.8.6:test
```

### How was this patch tested?
Modification to existing test.
Tested on Linux, skips test when native posix env is not found.

Closes #36473 from mridulm/fix-SPARK-37618-test.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 3174071)
Signed-off-by: Sean Owen <srowen@gmail.com>
@srowen
Copy link
Member

srowen commented May 8, 2022

Merged to master/3.3

wangyum pushed a commit that referenced this pull request May 26, 2023
…g the shuffle service for released executors (#1041)

* [SPARK-37618][CORE] Remove shuffle blocks using the shuffle service for released executors

Add support for removing shuffle files on released executors via the external shuffle service. The shuffle service already supports removing shuffle service cached RDD blocks, so I reused this mechanism to remove shuffle blocks as well, so as not to require updating the shuffle service itself.

To support this change functioning in a secure Yarn environment, I updated permissions on some of the block manager folders and files. Specifically:
- Block manager sub directories have the group write posix permission added to them. This gives the shuffle service permission to delete files from within these folders.
- Shuffle files have the world readable posix permission added to them. This is because when the sub directories are marked group writable, they lose the setgid bit that gets set in a secure Yarn environment. Without this, the permissions on the files would be `rw-r-----`, and since the group running Yarn (and therefore the shuffle service), is no longer the group owner of the file, it does not have access to read the file. The sub directories still do not have world execute permissions, so there's no security issue opening up these files.

Both of these changes are done after creating a file so that umasks don't affect the resulting permissions.

External shuffle services are very useful for long running jobs and dynamic allocation. However, currently if an executor is removed (either through dynamic deallocation or through some error), the shuffle files created by that executor will live until the application finishes. This results in local disks slowly filling up over time, eventually causing problems for long running applications.

No.

New unit test. Not sure if there's a better way I could have tested for the files being deleted or any other tests I should add.

Closes #35085 from Kimahriman/shuffle-service-remove-shuffle-blocks.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>

* [SPARK-37618][CORE][FOLLOWUP] Support cleaning up shuffle blocks from external shuffle service

Fix test failure in build.
Depending on the umask of the process running tests (which is typically inherited from the user's default umask), the group writable bit for the files/directories could be set or unset. The test was assuming that by default the umask will be restrictive (and so files/directories wont be group writable). Since this is not a valid assumption, we use jnr to change the umask of the process to be more restrictive - so that the test can validate the behavior change - and reset it back once the test is done.

Fix test failure in build

No

Adds jnr as a test scoped dependency, which does not bring in any other new dependency (asm is already a dep in spark).
```
[INFO] +- com.github.jnr:jnr-posix:jar:3.0.9:test
[INFO] |  +- com.github.jnr:jnr-ffi:jar:2.0.1:test
[INFO] |  |  +- com.github.jnr:jffi:jar:1.2.7:test
[INFO] |  |  +- com.github.jnr:jffi:jar:native:1.2.7:test
[INFO] |  |  +- org.ow2.asm:asm:jar:5.0.3:test
[INFO] |  |  +- org.ow2.asm:asm-commons:jar:5.0.3:test
[INFO] |  |  +- org.ow2.asm:asm-analysis:jar:5.0.3:test
[INFO] |  |  +- org.ow2.asm:asm-tree:jar:5.0.3:test
[INFO] |  |  +- org.ow2.asm:asm-util:jar:5.0.3:test
[INFO] |  |  \- com.github.jnr:jnr-x86asm:jar:1.0.2:test
[INFO] |  \- com.github.jnr:jnr-constants:jar:0.8.6:test
```

Modification to existing test.
Tested on Linux, skips test when native posix env is not found.

Closes #36473 from mridulm/fix-SPARK-37618-test.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

* Fix ut failure

Co-authored-by: Adam Binford <adamq43@gmail.com>
Co-authored-by: Mridul Muralidharan <mridulatgmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants