@pwendell
Contributor

@pwendell pwendell commented Sep 3, 2014

During regression tests of Spark 1.1 we discovered perf issues with
PVM instances when running PySpark. This reverts a change added in #1156
which changed the default type for m3 instances to PVM.
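As a rough illustration of what "default type" means here, the launch script has to pick a virtualization type (and therefore an AMI) for each instance type. The sketch below is not the code from spark_ec2.py; the table name, the function name, the non-m3 entry, and the `pvm` fallback are assumptions made for the example.

```python
# Hypothetical lookup table, not copied from spark_ec2.py.
# After this revert, m3 instance types resolve to HVM again.
VIRT_TYPE_BY_INSTANCE = {
    "m1.large": "pvm",
    "m3.medium": "hvm",
    "m3.large": "hvm",
    "m3.xlarge": "hvm",
    "m3.2xlarge": "hvm",
}


def virtualization_type(instance_type, default="pvm"):
    """Return the virtualization type to use when choosing an AMI.

    Unknown instance types fall back to `default` (an assumption here).
    """
    return VIRT_TYPE_BY_INSTANCE.get(instance_type, default)


if __name__ == "__main__":
    for name in ("m1.large", "m3.xlarge"):
        print(name, "->", virtualization_type(name))
```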

@JoshRosen
Contributor

This looks good to me, especially since the m3.* instances used HVM AMIs in 1.0.2.

@shivaram
Contributor

shivaram commented Sep 3, 2014

Ah interesting. One more thing is that m3 doesn't mount the SSDs by default (there was a recent spark_ec2.py change to fix this). Could the regression have been due to using EBS instead of SSDs for shuffle?

@JoshRosen
Contributor

I observed a large performance difference on a microbenchmark that only called os.fork() in Python. The script in SPARK-3333 also didn't move much data during the shuffle (the RDD only contained 3 items total), so I think it's more likely that the performance difference is due to the virtualization technique than the disks. Also, the cross-version comparisons were run on the same m3 nodes, so both should have been using the same disk setup.
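For readers who want to try a comparison like this on PVM and HVM instances, a fork-only microbenchmark might look roughly like the sketch below. It is not the script used in this thread; the function name and iteration count are assumptions, and it uses `time.perf_counter`, so it needs Python 3 on a POSIX system.

```python
import os
import time


def time_forks(n=200):
    """Return the mean wall-clock seconds per os.fork() + wait cycle."""
    start = time.perf_counter()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            # Child: exit immediately without running cleanup handlers.
            os._exit(0)
        # Parent: reap the child so zombies do not accumulate.
        os.waitpid(pid, 0)
    return (time.perf_counter() - start) / n


if __name__ == "__main__":
    print("mean seconds per fork: %.6f" % time_forks())
```

Running the same script on otherwise comparable PVM and HVM nodes isolates process-creation cost from disk and shuffle effects, since no data is read or written.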

@pwendell
Contributor Author

pwendell commented Sep 3, 2014

@shivaram yeah, we tested this including the SSD fix. We were able to narrow it down fairly closely to os.fork(), which others have documented as problematic on certain instance types.

@pwendell
Contributor Author

pwendell commented Sep 3, 2014

Okay guys, I'm pulling this in for a new RC; hopefully everyone is okay with it.

@asfgit asfgit closed this in c64cc43 Sep 3, 2014
asfgit pushed a commit that referenced this pull request Sep 3, 2014
During regression tests of Spark 1.1 we discovered perf issues with
PVM instances when running PySpark. This reverts a change added in #1156
which changed the default type for m3 instances to PVM.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #2244 from pwendell/ec2-hvm and squashes the following commits:

1342d7e [Patrick Wendell] SPARK-3358: [EC2] Switch back to HVM instances for m3.X.
@shivaram
Contributor

shivaram commented Sep 3, 2014

Sounds good. Nice find on os.fork()!
