Conversation

@razvan razvan commented Oct 6, 2025

Description

See #1273 (comment)

Definition of Done Checklist

Note

Not all of these items apply to every PR; the author should update this template to leave in only the boxes that are relevant.

Please make sure all these things are done and tick the boxes

  • Changes are OpenShift compatible
  • All added packages (via microdnf or otherwise) have a comment on why they are added
  • Things not downloaded from Red Hat repositories should be mirrored in the Stackable repository and downloaded from there
  • All packages should have their signatures/hashes verified (where available)
  • Add an entry to the CHANGELOG.md file
  • Integration tests ran successfully
TIP: Running integration tests with a new product image

The image can be built and uploaded to the kind cluster with the following commands:

boil build <IMAGE> --image-version <RELEASE_VERSION> --strip-architecture --load
kind load docker-image <MANIFEST_URI> --name=<name-of-your-test-cluster>

See the output of boil to retrieve the image manifest URI for <MANIFEST_URI>.
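For example (the image name, version, and cluster name below are hypothetical; substitute your own, and take the manifest URI from the boil output):

boil build spark-k8s --image-version 4.0.1-stackable0.0.0-dev --strip-architecture --load
kind load docker-image <MANIFEST_URI> --name=integration-tests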

@razvan razvan self-assigned this Oct 6, 2025
@razvan razvan changed the title chore(spark): update hdfs dependencies chore(spark): update hadoop dependencies Oct 6, 2025
@razvan razvan requested a review from a team October 6, 2025 09:41

@sbernauer sbernauer left a comment


Thanks for the bump, LGTM!

Any particular reason you removed the links such as https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.4.1?
I find it complicated enough to find the correct versions by browsing Maven. Yes, the links are a maintenance burden, but I think they are very much worth it, as I don't know the dependency structure by heart.

@sbernauer sbernauer moved this to Development: In Review in Stackable Engineering Oct 6, 2025

razvan commented Oct 6, 2025

What do you need the links for?

Small update: the exact dependency versions must be obtained from the Hadoop image.

@sbernauer

Well... To determine the versions 😅
It's a chain: hadoop 3.4.2 -> hadoop-aws 3.4.2 -> aws-java-sdk-bundle-version 2.29.52
and
hadoop 3.4.2 -> hadoop-azure 3.4.2 -> azure-storage 7.0.1 -> azure-keyvault-core 2.15.2
etc.
We need to know the correct versions by walking these paths.
The Maven links let you easily traverse this tree. If you have a nice script or "mvn dependency:tree" or whatnot for it, that would be awesome :)
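For illustration, one scripted way to walk such a chain (a sketch, not part of this PR: it assumes Maven is installed and uses a throwaway pom so that dependency:tree prints the transitive versions of hadoop-aws):

cat > pom.xml <<'EOF'
<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>tmp</groupId>
  <artifactId>dep-check</artifactId>
  <version>0</version>
  <dependencies>
    <!-- the artifact whose transitive versions we want to walk -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>3.4.2</version>
    </dependency>
  </dependencies>
</project>
EOF
# The tree shows e.g. which software.amazon.awssdk:bundle version hadoop-aws pulls in.
mvn dependency:tree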

@sbernauer

sbernauer commented Oct 6, 2025

Ahh, I see the two different approaches now. I look at the declared dependency versions (as e.g. Maven does). You look in the file system of the Docker image.

I slightly prefer explicit versions as a double safety measure, as we messed up the AWS bundle version in the past. But back then we didn't copy it from the Hadoop image.
If we don't want to check the dependency versions and just use the ones from the Hadoop image, we can also get rid of all the version variables here and just copy azure-storage-version-*.jar from the Hadoop image.

@razvan

razvan commented Oct 6, 2025

You look in the file system of the Docker image.

Yes, because that is the source location for the Spark image, and it takes any Stackable patches into account.

@sbernauer

WDYT of getting rid of all these variables then?

@razvan

razvan commented Oct 6, 2025

If we don't want to check the dependency versions and just use the ones from the Hadoop image, we can also get rid of all the version variables here and just copy azure-storage-version-*.jar from the Hadoop image.

I would prefer that too, but it is not safe: the Docker COPY directive will not fail if the source file doesn't exist.

@sbernauer

sbernauer commented Oct 6, 2025

The Docker COPY directive will not fail if the source file doesn't exist.

At least for Hive we don't use COPY but RUN cp, so we have the flexibility. The command obviously needs to fail if anything other than exactly one file was copied.

BUT I noticed the Spark Dockerfile curls jackson-dataformat-xml, stax2-api and woodstox-core!
We really need the Maven links in there! E.g. you picked the wrong versions of jackson-dataformat-xml-version, stax2-api-version and woodstox-core-version for Spark 4.0.1; they need fixing.

I have seen too many mistakes of this kind; IMHO we should just stick to the tried-and-true approach and put the links in there. Too easy to get it wrong otherwise.
A scripted solution would obviously be nicer, but it's much more effort.
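On the hash-verification point from the checklist: a minimal sketch of pinning and verifying one of those curl-ed jars against the .sha1 file Maven Central publishes next to each artifact (the version below is an example only, not necessarily the one Spark 4.0.1 needs):

# Example version only; take the real one from the Maven links above.
STAX2_API_VERSION=4.2.2
URL=https://repo1.maven.org/maven2/org/codehaus/woodstox/stax2-api/${STAX2_API_VERSION}/stax2-api-${STAX2_API_VERSION}.jar
curl -fsSL -o stax2-api.jar "$URL"
curl -fsSL -o stax2-api.jar.sha1 "$URL.sha1"
# sha1sum -c expects "<hash>  <filename>".
echo "$(cat stax2-api.jar.sha1)  stax2-api.jar" | sha1sum -c -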

@razvan

razvan commented Oct 6, 2025

The Docker COPY directive will not fail if the source file doesn't exist.

At least for Hive we don't use COPY but RUN cp, so we have the flexibility. The command obviously needs to fail if anything other than exactly one file was copied.

RUN cp cannot copy files from a different stage.

BUT I noticed the Spark Dockerfile curls jackson-dataformat-xml, stax2-api and woodstox-core! We really need the Maven links in there! E.g. you picked the wrong versions of jackson-dataformat-xml-version, stax2-api-version and woodstox-core-version for Spark 4.0.1; they need fixing.

Fixed in 505f9d5.

@sbernauer

RUN cp cannot copy files from a different stage.

Good point! In Hive we copy from the hadoop builder to the hive builder. But that doesn't matter right now. Future optimization 😅
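For reference, a minimal sketch of that future optimization using a BuildKit bind mount, so a RUN cp can read another stage and still fail loudly; the stage names, paths, and stand-in jar are illustrative assumptions, not the real Dockerfiles:

# syntax=docker/dockerfile:1
FROM alpine AS hadoop-builder
# Stand-in for the real Hadoop builder stage; the touch fakes the jar this sketch copies.
RUN mkdir -p /hadoop/lib && touch /hadoop/lib/azure-storage-7.0.1.jar

FROM alpine AS spark-builder
RUN mkdir -p /spark/jars
# Bind-mount the hadoop-builder stage so cp can read it, and fail the build
# unless the glob matches exactly one jar (a check plain COPY cannot express).
RUN --mount=type=bind,from=hadoop-builder,source=/hadoop/lib,target=/mnt/lib \
    count="$(ls /mnt/lib/azure-storage-*.jar 2>/dev/null | wc -l)"; \
    [ "$count" -eq 1 ] || { echo "expected 1 azure-storage jar, found $count" >&2; exit 1; }; \
    cp /mnt/lib/azure-storage-*.jar /spark/jars/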

sbernauer previously approved these changes Oct 6, 2025

@sbernauer sbernauer left a comment


Thanks for the links; only one small typo.

Co-authored-by: Sebastian Bernauer <sebastian.bernauer@stackable.de>
@razvan razvan enabled auto-merge October 6, 2025 12:33
@razvan razvan added this pull request to the merge queue Oct 6, 2025
@sbernauer sbernauer moved this from Development: In Review to Development: Done in Stackable Engineering Oct 6, 2025
Merged via the queue into main with commit 0667811 Oct 6, 2025
3 checks passed
@razvan razvan deleted the issues/1273-hdfs-update branch October 6, 2025 12:36