Skip to content

Conversation

@WweiL
Copy link
Contributor

@WweiL WweiL commented Apr 6, 2023

What changes were proposed in this pull request?

Enable unit tests and doctests for streaming queries. A lot are skipped and needs to be un-skipped as the development goes on.

Note that I also separated the {foreach, foreachBatch} tests from the original test suite. Because currently they are not implemented in connect and it seems unnecessary to manually add a skip for all of them in StreamingParityTests. Also it doesn't hurt to separate them I think as these tests are large enough already.

Why are the changes needed?

More tests is always better than less.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

It's test itself.

return self.load(path=path, format="json")
else:
raise TypeError("path can be only a single string")

Copy link
Contributor Author

@WweiL WweiL Apr 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please ignore the change in this file. They are added and to be reviewed in #40689.

@WweiL WweiL requested a review from HyukjinKwon April 7, 2023 19:36
@WweiL
Copy link
Contributor Author

WweiL commented Apr 7, 2023

CC @rangadi @pengzhon-db

Copy link

@rangadi rangadi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Made a few comments.

from pyspark.testing.sqlutils import ReusedSQLTestCase


class StreamingTestsForeachFamilyMixin:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering why 'foreach family'. What are the other foreach() methods tested here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right I put both foreach and foreachBatch to this class. My naming sense isn't that great so I'm very welcome to any suggestion on it's name here...



class StreamingTestsForeachFamilyMixin:
class ForeachWriterTester:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I tried these tests, I got a serialization error after moving this to 'Mixin'. We could handle it when run into it with connect tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get all tests passed by running python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming_foreach_family. I renamed the mixin part though in the new commit

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move foreach back to its original place. Foreach() and foreachbath() are very different. The later takes a dataframe, the former is more like a UDF.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add foreach() api to one of the jiras or a new jira.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or you could keep here. Either is ok.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see! I could make another class for foreach then. The jira for foreach support is already added by me IIRC, in the epic

)
self.assertTrue(df.isStreaming)
self.assertEqual(df.schema.simpleString(), "struct<data:string>")
# TODO: Moving this outside of with block will trigger the following error,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not need to be a TODO. Testing it inside with is the right thing.
Calls like schema are executed on server with the updated state of the conf.

Copy link
Contributor Author

@WweiL WweiL Apr 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I'll remove the commetns. So this is a behavior change for migrating to connect?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Something like schema is evaluated with the conf at the time of schema call.

q.stop()
shutil.rmtree(tmpPath)

class ForeachWriterTester:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving foreach tests to a different files SGTM.

Return whether the query has terminated or not within 5 seconds
>>> sq.awaitTermination(5)
>>> sq.awaitTermination(5) # doctest: +SKIP
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these needed? This won't be tested in connect yet, right?

Copy link
Contributor Author

@WweiL WweiL Apr 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.. This actually silents this line in the doctest for both connect and non-connect I believe (@HyukjinKwon please correct me if I'm wrong) , just some temporary pain we have to bear with as connect doesn't support this yet

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the pain? What fails if you remove this diff?

Copy link
Contributor Author

@WweiL WweiL Apr 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing fails but we skip some tests for also non-connect scenario

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets not skip any tests. Any skip should have a TODO comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! I've moved the TODO to the top of these methods to avoid them showing up in the PySpark doc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're absolutely sure that we will enable this back very soon, I am fine. As @rangadi said, I tried my best to avoid skipping the tests, but I did few times when I am about to enable it back very soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I think I'll just remove all of the SKIP flags and disable the doctest for now. Some of the functionalities might not be supported that soon I think.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better solution is to comment out setting __doc__ for awaitTermination() in connect's query.py.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Done

>>> sq.stop()
>>> sq.isActive
>>> sq.isActive # doctest: +SKIP
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this required? Why is it skipped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling isActive after the query is stopped will throw error right now before the better session management is implemented

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is see, please add a TODO here with a brief comment.

@WweiL WweiL requested a review from rangadi April 10, 2023 21:11
HyukjinKwon pushed a commit that referenced this pull request Apr 10, 2023
### What changes were proposed in this pull request?

This PR adds the `orc`, `parquet`, and `text` APIs in connect's DataStreamReader

### Why are the changes needed?

Part of Streaming Connect project.

### Does this PR introduce _any_ user-facing change?

Yes, now the three APIs are enabled. But everything is pretty much still under developed so far.

### How was this patch tested?

Manually tested, unit tests will be added in SPARK-43031 as a follow-up PR #40691.

Closes #40689 from WweiL/SPARK-42951-reader-apis.

Authored-by: Wei Liu <wei.liu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@WweiL
Copy link
Contributor Author

WweiL commented Apr 11, 2023

Hi @HyukjinKwon could you please take another look? Thanks!

>>> with tempfile.TemporaryDirectory() as d:
... # Create a table with Rate source.
... df.writeStream.toTable(
... "my_table", checkpointLocation=d) # doctest: +ELLIPSIS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious what does # doctest: +ELLIPSIS mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that in doctest the result will be kind of regex checked if this flag is set. Like the line below right now is <...streaming.query.StreamingQuery object at 0x...>, but before it was <pyspark.sql.streaming.query.StreamingQuery object at 0x...>, which would conflict with connect's test, which returns <pyspark.sql.connect.streaming.query.StreamingQuery object at 0x...>

So to make this test works for both connect and non-connect, we enable this regex-like check and replace below with ...
But it doesn't matter anyway, as we enabled the flag in test options in the __main__ method below

from pyspark.testing.connectutils import ReusedConnectTestCase


class StreamingParityTests(StreamingTestsMixin, ReusedConnectTestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I understand the intent of this file. Looks all tests in this file are skipped. Where do we test already supported spark streaming connect?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uhoh, the tests here shouldn't be skipped.

Copy link
Contributor Author

@WweiL WweiL Apr 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file marks all tests that should be skipped currently because these functionalities are not supported yet in connect.

It executes all tests from StreamingTestsMixin, except these below marked skipped.

When anyone adds support to corresponding functions, they should delete the @unittest.skip and function, unless they want to do some change specific for connect.

So ideally in the end, this class should only be like

class StreamingParityTests(StreamingTestsMixin, ReusedConnectTestCase):
    pass

And then it runs all tests from StreamingTestsMixin

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each of the skipped tests here has a SPARK ticket associated with it. We will enable these soon. LGTM.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks fine from a cursory look

Copy link

@rangadi rangadi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @WweiL ping Hyukjin when this is ready to be merged.

from pyspark.testing.connectutils import ReusedConnectTestCase


class StreamingParityTests(StreamingTestsMixin, ReusedConnectTestCase):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each of the skipped tests here has a SPARK ticket associated with it. We will enable these soon. LGTM.

@WweiL
Copy link
Contributor Author

WweiL commented Apr 12, 2023

@HyukjinKwon can you merge this? Thank you!

@HyukjinKwon
Copy link
Member

Merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants