-
Notifications
You must be signed in to change notification settings - Fork 910
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds partition support to SparkHiveDataSet #745
Adds partition support to SparkHiveDataSet #745
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for your work here and for detailing where you've got to so well! This looks like a great start. I've left some minor comments, but I'm afraid I don't know enough about spark to consider the problems you've raised. Let me try and get some people who are more knowledgeable about spark to take a look 🙂
self._partition = partition | ||
|
||
# get the name of each partition | ||
self._partitions: List[str] = [] | ||
if self._partition is not None: | ||
for partition_set in self._partition.split(","): | ||
self._partitions.append(partition_set.split("=")[0].strip()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to just directly make the function argument partition: List[str] = None
, as is done for table_pk
? This would do away with the need for string manipulation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, but with just the list of column names, we can't identify a partition. Maybe we can make partition
get a list of tuples, where each position of the list identifies the column name and value. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. That could indeed work, but now I'm wondering exactly what the partition
string in the insert could look like. e.g. we can have partition_1 = x, partition_2 = y
but we could also just have partition_1, partition_2
right? Do you know what is the full specification of the syntax of this string?
Now I'm actually also wondering why SparkHiveDataSet
is using dynamically built SQL queries in the first place rather than the Python API, which would presumably make this sort of thing much easier. Is that impossible when doing the hive thing? @jiriklein
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @AntonyMilneQB, sorry for the late reply.
That's right, the partition
string will be something like partition_1 = x, partition_2 = y
if the partition type is static or partition_1, partition_2
if the partition is dynamic.
Hi @brendalf - I'm looking at this and I think
|
Hi @brendalf
And removing the |
Hi @jiriklein, how are you? I agree with you that |
Hi @brendalf, hope you're well! |
Perfect! |
|
Description
SparkHiveDataSet
doesn't not support partitions likeSparkDataSet
. As discussed in #725, @AntonyMilneQB proposed to implement this by adding a new argumentpartition_by
insideSparkHiveDataSet
constructor.The
partition_by
argument was meant to be the partition where the data should be inserted, eg:partition_by="ref = 1"
, to be used later, inside_save
, like thisinsert into {table} partition ({partition_by}) select * from tmp
.I'm opening this PR as a draft for discussion because I ran over some problems while I was implementing this approach.
Resolves #725
Development notes
Here are the problems:
1. If the table that doesn't exist yet
In this case, the
SparkHiveDataSet
is going to create with_create_empty_hive_table(self, data)
but thepartition_by
can be used here straight away because_create_empty_hive_table
needs a list of names and types to make the partition, something likePARTITIONED BY (ref integer)
.I tried to replace the
partition_by
argument with thepartitioned_by
argument, to be used inside the create table, and add a new argumentpartition
, that handles the partition where the data should be inserted, to_save
. But to do that, we would need to change thesave
method ofAbstractDataSet
, so I discarded it.I also tried to increase the @AntonyMilneQB approach by implementing two arguments to the
SparkHiveDataSet
constructor:partition
(handles where the data should be inserted) andpartitioned_by
(handles the columns where the table should be partitioned). The problem here is that the_create_empty_hive_table(self, data)
use CREATE TABLE AS SELECT (CTAS) and we can't use PARTITIONED BY with it. The can solve this by replacing the CTAS with the regular CREATE TABLE, but I don't know how to get the list of name and valid Spark SQL data types from thedata
variable that is passed to_create_empty_hive_table(self, data)
.What everyone thinks? Do you know how to implement the create table with partition? Can you see a different approach to solve problem 1?
For now, the implementation reflects the @AntonyMilneQB approach and it only works if the partitioned table is already created.
But that leaves us if the second problem.
2. How can we validate partitioned tables?
The
SparkHiveDataSet
uses the.load()
to generate theself._table_columns
when the table already exists, but theload
method loads the table with all columns, including the partitioned by columns, and that makes theself._validate_save
fails because the user table to insert doesn't have the partition columns.I tried to get around this by getting the partition columns names and remove these columns from
self._table_columns
, but I don't know if this is the best approach.What everyone thinks?
Checklist
RELEASE.md
fileNotice
I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":
I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.
I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorised to submit this contribution on behalf of the original creator(s) or their licensees.
I certify that the use of this contribution as authorised by the Apache 2.0 license does not violate the intellectual property rights of anyone else.