[SPARK-23292][TEST] always run python tests #20465

cloud-fan · 2018-02-01T04:52:39Z

What changes were proposed in this pull request?

We should not skip any tests, otherwise we can't trust the jenkins report.

This PR make the pandas related tests always be run, so that we can be confident about pandas related features in Spark.

How was this patch tested?

N/A

cloud-fan · 2018-02-01T04:53:03Z

cc @yhuai @icexelloss @BryanCutler @ueshin @shaneknapp

cloud-fan · 2018-02-01T04:54:08Z

python/pyspark/sql/tests.py

        self.assertEquals(types[5], 'datetime64[ns]')

-    @unittest.skipIf(not _have_old_pandas, "Old Pandas not installed")
-    def test_to_pandas_old(self):


I have no idea how to test it in jenkins, as jenkins should always have pandas and pyarrow installed with required versions. I think we can only test it manually.

jenkins does indeed have pandas and pyarrow installed, btw.

cloud-fan · 2018-02-01T04:54:19Z

python/pyspark/sql/tests.py

        self.assertTrue(isinstance(df.schema['d'].dataType, DateType))

-    @unittest.skipIf(not _have_old_pandas, "Old Pandas not installed")
-    def test_create_dataframe_from_old_pandas(self):


SparkQA · 2018-02-01T05:11:20Z

Test build #86910 has finished for PR 20465 at commit 8aba4f5.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2018-02-01T05:50:30Z

hmm, I think there are some values in having a way to run python tests without Arrow? I mean the test.py is not just for Jenkins but for everyone consuming the Spark release... unless we are saying Arrow is required now?

And in Jenkins we shouldn't be skipping any of these tests anyway? Is there a reason we need to change that if Jenkins isn't affected (and if I recall there is a way to check if running under Jenkins - we could always make Arrow tests not skipped in Jenkins)

yhuai · 2018-02-01T05:54:00Z

@felixcheung jenkins is actually skipping those tests (see the failure of this pr). It makes sense to provide a way to allow developers to not run those tests. But, I'd prefer that we run those tests by default. So, we can make sure that jenkins is doing the right thing.

cloud-fan · 2018-02-01T06:03:26Z

I think there are some values in having a way to run python tests without Arrow?

I agree, but the more important thing is to make sure jenkins runs everything, so that we can be confident about our release.

felixcheung · 2018-02-01T06:05:05Z

sure, that's ok, I think we can revisit later (ie. next release) if we want to add an env switch or something to make them optional

felixcheung · 2018-02-01T06:06:12Z

ImportError: Pandas >= 0.19.2 must be installed on calling Python process; however, your version was 0.16.0.

I guess the RISELab boxes will need some updates...

BryanCutler · 2018-02-01T07:03:23Z

Looking back at when pyarrow was last upgraded in #19884, pandas and pyarrow were upgraded on all workers for python 3, but there were maybe some concerns or difficulties with upgrading for python 2 and pypy environments at that time. That is why the above failure is from python 2 with an older version of pandas.

HyukjinKwon · 2018-02-01T07:09:57Z

Yup, there was a related discussion already. See this #19884 (comment) and #19884 (comment). We shouldn't run this for now. Also these are technically not hard dependencies.

yhuai · 2018-02-01T07:17:14Z

So, jenkins jobs run those tests with python3? If so, I feel better because those tests are not completely skipped in Jenkins. If it is hard to make them run with python 2. Let’s have a log to explicitly show if we are going to run tests using pandas/pyarrow, which will help us confirm if they get exercised with python 3 in Jenkins or not.

BryanCutler · 2018-02-01T07:32:38Z

Yes, the tests are being run with python3. I do prefer to have these conditional skips removed because sometimes it is hard to tell if everything passed or was just skipped. But since pandas and pyarrow are optional dependencies, there should be some way for the user to skip with an environment variable or something. At the very least, being able to verify they were run in a log would be good.

HyukjinKwon · 2018-02-01T10:33:31Z

Yup, explicitly logging sounds fine for now so that we can easily check.

I do prefer to have these conditional skips removed because sometimes it is hard to tell if everything passed or was just skipped

To be clear, I think it's more because our own testing script doesn't show the skipped tests output from unittests in the console.

Also, I think it's more because we couldn't make sure Pandas and Arrow were installed properly in testing env, Jenkins but not because we skip tests related with extra dependencies when they are not installed. Making them as required dependencies is a big deal IMHO.

FYI, I tried to install PyArrow with PyPy last time and I failed. I wonder if we can easily install it.

HyukjinKwon · 2018-02-01T10:47:59Z

Also, if we should go in this way, I think we should enable some tests with PyPy too if I understood correctly and there isn't another problem I maybe missed:

spark/dev/sparktestsupport/modules.py

Line 457 in 9623a98

    
           "PyPy"  # Skip these tests under PyPy since they require numpy and it isn't available there

At least, PyPy in my local has numpy.

cloud-fan · 2018-02-01T10:56:11Z

I agree that pandas and pyarrow should not be a hard requirement for users, and this is what it is today: PySpark only throws exception when users try to use pandas related functions without pandas/pyarrow installed.

My proposal is, pandas and pyarrow should be a hard requirement for our jenkins, to make sure the features are well tested.

If there is a way to prove that py3 tests run well, and the environment issue is hard to fix, then maybe we can deal with it later, after 2.3.

HyukjinKwon · 2018-02-01T11:01:13Z

Thank you for bearing with me @cloud-fan. I agree with it.

BTW, are you working on the logging thing BTW? I was thinking the simplest way to check is just print out once if PyArrow / Pandas are installed or not.

HyukjinKwon · 2018-02-01T11:05:46Z

My proposal is, pandas and pyarrow should be a hard requirement for our jenkins, to make sure the features are well tested.

If this is a goal, I think another simple way is just to use an env set in Jenkins and throw an exception if both PyArrow or Pandas are not installed in the future.

To be more specific, like ..

spark/dev/run-tests-jenkins

Lines 28 to 32 in 9623a98

    
           PYTHON_VERSION_CHECK=$(python -c 'import sys; print(sys.version_info < (2, 7, 0))') 
        
           if [[ "$PYTHON_VERSION_CHECK" == "True" ]]; then 
        
             echo "Python versions prior to 2.7 are not supported." 
        
             exit -1 
        
           fi

cloud-fan · 2018-02-01T11:12:01Z

I've not worked on the logging stuff yet, feel free to take it, thanks!

HyukjinKwon · 2018-02-01T11:16:59Z

@cloud-fan, will try it. Thank you sincerely.

cloud-fan · 2018-02-08T11:18:38Z

The logging approach has been merged and I'm closing this one.

always run python tests

8aba4f5

cloud-fan commented Feb 1, 2018

View reviewed changes

cloud-fan closed this Feb 8, 2018

HyukjinKwon mentioned this pull request Apr 3, 2018

[SPARK-23776][python][test] Check for needed components/files before running pyspark-sql tests #20909

Closed

[SPARK-23292][TEST] always run python tests #20465

[SPARK-23292][TEST] always run python tests #20465

Uh oh!

Conversation

cloud-fan commented Feb 1, 2018

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

cloud-fan commented Feb 1, 2018

Uh oh!

cloud-fan Feb 1, 2018

Choose a reason for hiding this comment

Uh oh!

shaneknapp Feb 1, 2018

Choose a reason for hiding this comment

Uh oh!

cloud-fan Feb 1, 2018

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Feb 1, 2018

Uh oh!

felixcheung commented Feb 1, 2018

Uh oh!

yhuai commented Feb 1, 2018

Uh oh!

cloud-fan commented Feb 1, 2018

Uh oh!

felixcheung commented Feb 1, 2018

Uh oh!

felixcheung commented Feb 1, 2018

Uh oh!

BryanCutler commented Feb 1, 2018

Uh oh!

HyukjinKwon commented Feb 1, 2018

Uh oh!

yhuai commented Feb 1, 2018 via email

Uh oh!

BryanCutler commented Feb 1, 2018

Uh oh!

HyukjinKwon commented Feb 1, 2018

Uh oh!

HyukjinKwon commented Feb 1, 2018

Uh oh!

cloud-fan commented Feb 1, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HyukjinKwon commented Feb 1, 2018

Uh oh!

HyukjinKwon commented Feb 1, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cloud-fan commented Feb 1, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HyukjinKwon commented Feb 1, 2018

Uh oh!

cloud-fan commented Feb 8, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

cloud-fan commented Feb 1, 2018 •

edited

Loading

HyukjinKwon commented Feb 1, 2018 •

edited

Loading

cloud-fan commented Feb 1, 2018 •

edited

Loading