@clems4ever clems4ever commented Feb 11, 2019

This PR consists of two patches:

  1. Support for the disk resource. This is required to run Spark on Mesos at Criteo since we enabled the XFS isolator, which requires every container to have disk reserved.
  2. Support for network bandwidth. Without it, Mesos allocates a default amount based on the number of CPUs. Users should be able to customize this amount so that the executor is not killed by Mesos when the limit is exceeded and packets are dropped too abruptly.

Fix 1 is a backport to Spark 2.3 of a PR targeting 3.0 (apache#23758).
Fix 2 is entirely Criteo-specific, since it relies on a resource that we introduced ourselves.

Backport in Spark 2.3

Before this change, there was no way to allocate a given amount of
disk when using the Mesos scheduler. That is fine with the default isolation
options, but not when the XFS isolator is enabled with a hard limit in order to
properly isolate all containers. In that case, the executor is killed by Mesos
while downloading the Spark executor archive.

This change therefore introduces a Mesos-specific configuration flag to
declare the amount of disk required by the executors, which prevents
Mesos from killing the container for exceeding the XFS hard limit.
@clems4ever clems4ever changed the title from "[SPARK-17454][MESOS] Use Mesos disk resources for executors." to "Add support for disk and network_bandwidth resources to run in Mesos at Criteo" on Feb 11, 2019

At Criteo, containers are network-isolated, and if no network bandwidth
resource is allocated, Mesos reserves a default amount. This
patch provides a way for users to specify the amount of network bandwidth
required by the executors and therefore used by the tasks.

This fix is Criteo-specific, as network_bandwidth is a custom resource
we introduced in our clusters and the code of the isolator is not
public (yet).
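
To illustrate the idea behind both patches, here is a minimal sketch (not the code in this PR) of how a scheduler might add optional disk and network_bandwidth amounts to the resources it claims from a Mesos offer. The names `ExecutorRequirements` and `resources_to_claim` are hypothetical, the offer is simplified to a plain mapping of scalar resource names to amounts, and the actual Spark configuration keys are not shown because they are not given above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ExecutorRequirements:
    """Hypothetical container for per-executor resource settings."""
    cpus: float
    mem_mb: float
    disk_mb: float = 0.0            # assumption: 0 means "do not request disk"
    network_bandwidth: float = 0.0  # assumption: 0 means "let Mesos pick its default"


def resources_to_claim(offer: Dict[str, float],
                       req: ExecutorRequirements) -> Optional[Dict[str, float]]:
    """Return the scalar resources to claim from an offer, or None if the offer is too small."""
    wanted = {"cpus": req.cpus, "mem": req.mem_mb}
    # Only request the optional resources when the user asked for them.
    if req.disk_mb > 0:
        wanted["disk"] = req.disk_mb
    if req.network_bandwidth > 0:
        wanted["network_bandwidth"] = req.network_bandwidth
    # Decline the offer if any requested resource is missing or insufficient.
    for name, amount in wanted.items():
        if offer.get(name, 0.0) < amount:
            return None
    return wanted


if __name__ == "__main__":
    offer = {"cpus": 4.0, "mem": 8192.0, "disk": 2048.0, "network_bandwidth": 100.0}
    req = ExecutorRequirements(cpus=2.0, mem_mb=4096.0, disk_mb=1024.0, network_bandwidth=50.0)
    print(resources_to_claim(offer, req))
```

In this sketch, an executor that declares its disk up front gets a container sized for it, rather than one that is killed once a hard limit is hit; leaving both optional values at zero keeps the previous behavior.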
@clems4ever clems4ever force-pushed the mesos-disk-criteo-2.3 branch from 9bf9816 to 44981a7 Compare February 11, 2019 21:35
@clems4ever
Author

I ran the Mesos-related tests on my machine to confirm that my fix is OK, since the CI build is failing for other reasons.

@Willymontaz Willymontaz closed this Mar 9, 2023