Skip to content

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

We should not skip any tests, otherwise we can't trust the jenkins report.

This PR make the pandas related tests always be run, so that we can be confident about pandas related features in Spark.

How was this patch tested?

N/A

@cloud-fan
Copy link
Contributor Author

self.assertEquals(types[5], 'datetime64[ns]')

@unittest.skipIf(not _have_old_pandas, "Old Pandas not installed")
def test_to_pandas_old(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea how to test it in jenkins, as jenkins should always have pandas and pyarrow installed with required versions. I think we can only test it manually.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jenkins does indeed have pandas and pyarrow installed, btw.

self.assertTrue(isinstance(df.schema['d'].dataType, DateType))

@unittest.skipIf(not _have_old_pandas, "Old Pandas not installed")
def test_create_dataframe_from_old_pandas(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@SparkQA
Copy link

SparkQA commented Feb 1, 2018

Test build #86910 has finished for PR 20465 at commit 8aba4f5.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

hmm, I think there are some values in having a way to run python tests without Arrow? I mean the test.py is not just for Jenkins but for everyone consuming the Spark release... unless we are saying Arrow is required now?

And in Jenkins we shouldn't be skipping any of these tests anyway? Is there a reason we need to change that if Jenkins isn't affected (and if I recall there is a way to check if running under Jenkins - we could always make Arrow tests not skipped in Jenkins)

@yhuai
Copy link
Contributor

yhuai commented Feb 1, 2018

@felixcheung jenkins is actually skipping those tests (see the failure of this pr). It makes sense to provide a way to allow developers to not run those tests. But, I'd prefer that we run those tests by default. So, we can make sure that jenkins is doing the right thing.

@cloud-fan
Copy link
Contributor Author

I think there are some values in having a way to run python tests without Arrow?

I agree, but the more important thing is to make sure jenkins runs everything, so that we can be confident about our release.

@felixcheung
Copy link
Member

sure, that's ok, I think we can revisit later (ie. next release) if we want to add an env switch or something to make them optional

@felixcheung
Copy link
Member

ImportError: Pandas >= 0.19.2 must be installed on calling Python process; however, your version was 0.16.0.

I guess the RISELab boxes will need some updates...

@BryanCutler
Copy link
Member

Looking back at when pyarrow was last upgraded in #19884, pandas and pyarrow were upgraded on all workers for python 3, but there were maybe some concerns or difficulties with upgrading for python 2 and pypy environments at that time. That is why the above failure is from python 2 with an older version of pandas.

@HyukjinKwon
Copy link
Member

Yup, there was a related discussion already. See this #19884 (comment) and #19884 (comment). We shouldn't run this for now. Also these are technically not hard dependencies.

@yhuai
Copy link
Contributor

yhuai commented Feb 1, 2018 via email

@BryanCutler
Copy link
Member

Yes, the tests are being run with python3. I do prefer to have these conditional skips removed because sometimes it is hard to tell if everything passed or was just skipped. But since pandas and pyarrow are optional dependencies, there should be some way for the user to skip with an environment variable or something. At the very least, being able to verify they were run in a log would be good.

@HyukjinKwon
Copy link
Member

Yup, explicitly logging sounds fine for now so that we can easily check.

I do prefer to have these conditional skips removed because sometimes it is hard to tell if everything passed or was just skipped

To be clear, I think it's more because our own testing script doesn't show the skipped tests output from unittests in the console.

Also, I think it's more because we couldn't make sure Pandas and Arrow were installed properly in testing env, Jenkins but not because we skip tests related with extra dependencies when they are not installed. Making them as required dependencies is a big deal IMHO.

FYI, I tried to install PyArrow with PyPy last time and I failed. I wonder if we can easily install it.

@HyukjinKwon
Copy link
Member

Also, if we should go in this way, I think we should enable some tests with PyPy too if I understood correctly and there isn't another problem I maybe missed:

"PyPy" # Skip these tests under PyPy since they require numpy and it isn't available there

At least, PyPy in my local has numpy.

@cloud-fan
Copy link
Contributor Author

cloud-fan commented Feb 1, 2018

I agree that pandas and pyarrow should not be a hard requirement for users, and this is what it is today: PySpark only throws exception when users try to use pandas related functions without pandas/pyarrow installed.

My proposal is, pandas and pyarrow should be a hard requirement for our jenkins, to make sure the features are well tested.

If there is a way to prove that py3 tests run well, and the environment issue is hard to fix, then maybe we can deal with it later, after 2.3.

@HyukjinKwon
Copy link
Member

Thank you for bearing with me @cloud-fan. I agree with it.

BTW, are you working on the logging thing BTW? I was thinking the simplest way to check is just print out once if PyArrow / Pandas are installed or not.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Feb 1, 2018

My proposal is, pandas and pyarrow should be a hard requirement for our jenkins, to make sure the features are well tested.

If this is a goal, I think another simple way is just to use an env set in Jenkins and throw an exception if both PyArrow or Pandas are not installed in the future.

To be more specific, like ..

PYTHON_VERSION_CHECK=$(python -c 'import sys; print(sys.version_info < (2, 7, 0))')
if [[ "$PYTHON_VERSION_CHECK" == "True" ]]; then
echo "Python versions prior to 2.7 are not supported."
exit -1
fi

@cloud-fan
Copy link
Contributor Author

cloud-fan commented Feb 1, 2018

I've not worked on the logging stuff yet, feel free to take it, thanks!

@HyukjinKwon
Copy link
Member

@cloud-fan, will try it. Thank you sincerely.

@cloud-fan
Copy link
Contributor Author

The logging approach has been merged and I'm closing this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants