-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: Create initial release process scripts for official ASF source release #429
Conversation
@@ -16,7 +16,7 @@ | |||
# under the License. | |||
|
|||
[package] | |||
name = "comet" | |||
name = "datafusion-comet" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already a comet
crate, so I renamed this. I have reserved the crate: https://crates.io/crates/datafusion-comet
@snmvaughan @parthchandra please review when you have the time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think this is for publishing cargo package. But I think Comet should be released as a maven package at maven repository? So users can simply add Comet as a dependency in their pom file?
It depends whether we want to allow other Rust projects to be able to use the Spark-compatible expressions that we are implementing, outside of the Comet/Spark integration. The maven release will be needed for sure. |
Co-authored-by: Oleks V <comphead@users.noreply.github.com>
Also, regardless of whether we choose to publish the crate or not, I think it is better to not use the name of an existing crate, and good to have the author information be consistent with other DataFusion crates. |
@viirya @comphead @parthchandra @kazuyukitanimura @huaxingao @advancedxy Could I get more feedback on this PR so that we can prepare for the upcoming 0.1.0 source release, which we are planning to do once we have switched back to depending on an official DataFusion release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some minor comments. I think the scripts are in good shape.
core/Cargo.toml
Outdated
homepage = "https://datafusion.apache.org/comet" | ||
repository = "https://github.com/apache/datafusion-comet" | ||
authors = ["Apache DataFusion <dev@datafusion.apache.org>"] | ||
description = "Apache DataFusion Comet: Apache Spark native query execution plugin" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be consistent with the README.md in the repo, this could be changed as Apache DataFusion Comet: a high-performance accelerator for Apache Spark
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to standardize our description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would vote for the description in the README.
nit: high performance is two words (i.e. not hyphenated)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated this
dev/release/create-tarball.sh
Outdated
|
||
set -e | ||
|
||
SOURCE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather call it DEV_RELEASE_DIR
instead.
dev/release/create-tarball.sh
Outdated
(cd "${SOURCE_TOP_DIR}" && git archive ${release_hash} --prefix ${release}/ | gzip > ${tarball}) | ||
|
||
echo "Running rat license checker on ${tarball}" | ||
${SOURCE_DIR}/run-rat.sh ${tarball} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you missed this run-rat.sh
in this commit.
cargo build | ||
cargo test --all --features=avro |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to build and test rust code, we should do it for java/scala code as well?
It just occurred to me maybe we should make this script as a github action and provided in the vote email so that the basic verification is already performed and checked. The community is also welcome to test more combinations.
The previous statement is not a blocker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point about testing the scala code. I will add that.
# "License"); you may not use this file except in compliance | ||
# with the License. You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do have a RAT check from maven, is this check a replacement or covers other usecase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we somehow missed adding the RAT check in maven. I was also surprised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@comphead I misread your comment. Where is the rat check in maven?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nm, I see it now.
We need to run the rat check against the source tarball before we release it. The maven project is checking the contents of the repo not the release tarball, so maybe we need both. I will check how we are doing this in DataFusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, ideally to remove duplications, perhaps we can remove RAT check in maven then and live only with the release one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFusion uses the same rat script from CI and from verify-release.sh.
If we do the same with Comet and remove the maven task it could cause more work for developers because they would not catch issues as part of a standard maven build and would have to run a separate script. I am not sure if it is worth the effort to change this right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good. I feel we can merge and start using this.
pushd core | ||
RUSTFLAGS="-Ctarget-cpu=native" cargo build --release | ||
popd | ||
./mvnw install -Prelease -DskipTests -P"spark-3.4" -Dmaven.gitcommitid.skip=true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark-3.4 is hardcoded now. What about making it a parameter.
Also does it make sense to use one of our make commands
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't currently have a make command that specifies -Dmaven.gitcommitid.skip=true
although one is being added in #512
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, one hack could be something like make release PROFILES="-Dmaven.gitcommitid.skip=true"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we run tests as well? All comet tests pass for me when I run them though they do take a while to run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, one hack could be something like
make release PROFILES="-Dmaven.gitcommitid.skip=true"
That would be a hack. I would leave it as it is. I assume that we will update/deprecate some of this once we move to publishing binary releases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we run tests as well? All comet tests pass for me when I run them though they do take a while to run.
I think that the tests take too long. We create the release candidate tarball from a tag in GitHub where we have already run all of the tests, so running them all again is potentially redundant.
Perhaps there is some subset of tests that we could run? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should at least build the scala/java code here, the test could be skipped as all the tests should be ran on CI actions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark-3.4 is hardcoded now. What about making it a parameter.
+1 for this. It could be default to 3.4 though.
./mvnw install -Prelease -DskipTests
IIRC, we should use ./mvnw verify
instead? Because the install command will also deploy the comet jar into the local repository, which might mess up other tests/projects that will depend on comet-jar in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark-3.4 is hardcoded now. What about making it a parameter.
I added a comment explaining that we are testing the latest supported Spark version.
I think we can make more improvements in the future but I would like to get the basic scripts in place first so that we have the ability to make a release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Some minor comments.
core/Cargo.toml
Outdated
homepage = "https://datafusion.apache.org/comet" | ||
repository = "https://github.com/apache/datafusion-comet" | ||
authors = ["Apache DataFusion <dev@datafusion.apache.org>"] | ||
description = "Apache DataFusion Comet: Apache Spark native query execution plugin" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would vote for the description in the README.
nit: high performance is two words (i.e. not hyphenated)
(cd ${distdir} && shasum -a 512 ${tarname}) > ${tarball}.sha512 | ||
|
||
|
||
echo "Uploading to datafusion dist/dev to ${url}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can committers do this or only PMC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably add to the README that the entire release process requires PMC (or committer if that is sufficient).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some steps can be completed by committers (such as creating the PR to update the project version number and generating the changelog) but some steps do require PMC members.
I can add this to the README.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think committer permission should be sufficient? The bandwidth of PMC is limited, it would be great that we can make this committer sufficient. When releasing on apache uniffle, I don't think we need PMC to be the release manager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have expanded the README to explain the different steps in the release process and who can do each part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also added some better info on verifying release candidates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think committer permission should be sufficient? The bandwidth of PMC is limited, it would be great that we can make this committer sufficient. When releasing on apache uniffle, I don't think we need PMC to be the release manager.
Only PMC members have write access to the subversion repositories as far as I know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, IIRC, when we do binary releases, for maven artifacts, promoting release candidates to releases also requires PMC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the information, good to know.
pushd core | ||
RUSTFLAGS="-Ctarget-cpu=native" cargo build --release | ||
popd | ||
./mvnw install -Prelease -DskipTests -P"spark-3.4" -Dmaven.gitcommitid.skip=true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we run tests as well? All comet tests pass for me when I run them though they do take a while to run.
pushd core | ||
RUSTFLAGS="-Ctarget-cpu=native" cargo build --release | ||
popd | ||
./mvnw install -Prelease -DskipTests -P"spark-3.4" -Dmaven.gitcommitid.skip=true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, one hack could be something like
make release PROFILES="-Dmaven.gitcommitid.skip=true"
That would be a hack. I would leave it as it is. I assume that we will update/deprecate some of this once we move to publishing binary releases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some minor comments, otherwise LGTM.
(cd ${distdir} && shasum -a 512 ${tarname}) > ${tarball}.sha512 | ||
|
||
|
||
echo "Uploading to datafusion dist/dev to ${url}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think committer permission should be sufficient? The bandwidth of PMC is limited, it would be great that we can make this committer sufficient. When releasing on apache uniffle, I don't think we need PMC to be the release manager.
dev/release/run-rat.sh
Outdated
# generate the rat report | ||
$RAT $1 > rat.txt | ||
python3 $RELEASE_DIR/check-rat-report.py $RELEASE_DIR/rat_exclude_files.txt rat.txt > filtered_rat.txt | ||
cat filtered_rat.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this line is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shows the list of files that failed the rat test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Nit:
How about moving this line to the else block in L41 and L42? So that it doesn't pollute the stdout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. I have made that change.
dev/release/rat_exclude_files.txt
Outdated
core/Cargo.lock | ||
core/testdata/backtrace.txt | ||
core/testdata/stacktrace.txt | ||
dev/copyright/scala-header.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this line? I think scala-header.txt
is just a soft link to java-header.txt
, which are identical
dev/release/run-rat.sh
Outdated
# generate the rat report | ||
$RAT $1 > rat.txt | ||
python3 $RELEASE_DIR/check-rat-report.py $RELEASE_DIR/rat_exclude_files.txt rat.txt > filtered_rat.txt | ||
cat filtered_rat.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Nit:
How about moving this line to the else block in L41 and L42? So that it doesn't pollute the stdout.
pushd core | ||
RUSTFLAGS="-Ctarget-cpu=native" cargo build --release | ||
popd | ||
./mvnw install -Prelease -DskipTests -P"spark-3.4" -Dmaven.gitcommitid.skip=true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark-3.4 is hardcoded now. What about making it a parameter.
+1 for this. It could be default to 3.4 though.
./mvnw install -Prelease -DskipTests
IIRC, we should use ./mvnw verify
instead? Because the install command will also deploy the comet jar into the local repository, which might mess up other tests/projects that will depend on comet-jar in the future.
@parthchandra @advancedxy I think I have addressed all of the feedback now |
LGTM |
@viirya @kazuyukitanimura @huaxingao could I get a committer review/approval |
RUSTFLAGS="-Ctarget-cpu=native" cargo build --release | ||
popd | ||
# test with the latest supported version of Spark | ||
./mvnw verify -Prelease -DskipTests -P"spark-3.4" -Dmaven.gitcommitid.skip=true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it mean if we support Spark 3.5, Spark4.0, we need to change this line?
Wondering should we run it against all supported versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably makes sense to verify against the latest Spark version here. I'm hoping that people verify release by building for their Spark version and running against their workflows. This script is just a basic verification that the release builds ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm thanks @andygrove
left a small comment
Which issue does this PR close?
Part of #394
Rationale for this change
We need to establish the basic ASF source release process
What changes are included in this PR?
comet
todatafusion-comet
(comet
was taken, and I have reserveddatafusion-comet
in crates.io)How are these changes tested?
They are not tested yet. I will test them prior to our first release.