Conversation

@razvan razvan commented Oct 6, 2025

Description

See #1273 (comment)

Definition of Done Checklist

Note

Not all of these items apply to every PR; the author should update this template to leave in only the boxes that are relevant.

Please make sure all these things are done and tick the boxes

  • Changes are OpenShift compatible
  • All added packages (via microdnf or otherwise) have a comment on why they are added
  • Things not downloaded from Red Hat repositories should be mirrored in the Stackable repository and downloaded from there
  • All packages should have their signatures/hashes verified (where available)
  • Add an entry to the CHANGELOG.md file
  • Integration tests ran successfully
TIP: Running integration tests with a new product image

The image can be built and uploaded to the kind cluster with the following commands:

boil build <IMAGE> --image-version <RELEASE_VERSION> --strip-architecture --load
kind load docker-image <MANIFEST_URI> --name=<name-of-your-test-cluster>

See the output of boil to retrieve the image manifest URI for <MANIFEST_URI>.
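For example (the image name, version, and cluster name below are hypothetical; substitute your own, and take the manifest URI from the boil output):

boil build spark-k8s --image-version 4.0.1-stackable0.0.0-dev --strip-architecture --load
kind load docker-image <MANIFEST_URI> --name=integration-tests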

@razvan razvan self-assigned this Oct 6, 2025
@razvan razvan changed the title chore(spark): update hdfs dependencies chore(spark): update hadoop dependencies Oct 6, 2025
@razvan razvan requested a review from a team October 6, 2025 09:41

@sbernauer sbernauer left a comment


Thanks for the bump, LGTM!

Any particular reason you removed the links such as https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.4.1?
I find it complicated enough to find the correct versions by browsing Maven. Yes, the links are a maintenance burden, but I think they are very much worth it, as I don't know the dependency structure by heart.

@sbernauer sbernauer moved this to Development: In Review in Stackable Engineering Oct 6, 2025

razvan commented Oct 6, 2025

What do you need the links for?

Small update: the exact dependency versions must be obtained from the Hadoop image.

@sbernauer

Well... To determine the versions 😅
It's a chain: hadoop 3.4.2 -> hadoop-aws 3.4.2 -> aws-java-sdk-bundle-version 2.29.52
and
hadoop 3.4.2 -> hadoop-azure 3.4.2 -> azure-storage 7.0.1 -> azure-keyvault-core 2.15.2
etc.
We need to know the correct versions by walking these paths.
The Maven links let you easily traverse this tree. If you have a nice script or "mvn dependency:tree" or whatnot for it, that would be awesome :)
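For illustration, one scripted way to walk such a chain (a sketch, not part of this PR: it assumes Maven is installed and uses a throwaway pom so that dependency:tree prints the transitive versions of hadoop-aws):

cat > pom.xml <<'EOF'
<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>tmp</groupId>
  <artifactId>dep-check</artifactId>
  <version>0</version>
  <dependencies>
    <!-- the artifact whose transitive versions we want to walk -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>3.4.2</version>
    </dependency>
  </dependencies>
</project>
EOF
# The tree shows e.g. which software.amazon.awssdk:bundle version hadoop-aws pulls in.
mvn dependency:tree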

@sbernauer

sbernauer commented Oct 6, 2025

Ahh, I see the two different approaches now. I look at the declared dependency versions (as e.g. Maven does). You look in the file system of the Docker image.

I slightly prefer explicit versions as a double safety measure, as we messed up the AWS bundle version in the past. But back then we didn't copy it from the Hadoop image.
If we don't want to check the dependency versions and just use the ones from the Hadoop image, we can also get rid of all the version variables here and just copy azure-storage-version-*.jar from the Hadoop image.

@razvan

razvan commented Oct 6, 2025

You look in the file system of the Docker image.

Yes, because that is the source location for the Spark image, and it takes any Stackable patches into account.

@sbernauer

WDYT of getting rid of all these variables then?

@razvan

razvan commented Oct 6, 2025

If we don't want to check the dependency versions and just use the ones from the Hadoop image, we can also get rid of all the version variables here and just copy azure-storage-version-*.jar from the Hadoop image.

I would prefer that too, but it is not safe: the Docker COPY directive will not fail if the source file doesn't exist.

@sbernauer

sbernauer commented Oct 6, 2025

The Docker COPY directive will not fail if the source file doesn't exist.

At least for Hive we don't use COPY but RUN cp, so we have the flexibility. The command obviously needs to fail if anything other than exactly one file was copied.

BUT I noticed the Spark Dockerfile curls jackson-dataformat-xml, stax2-api and woodstox-core!
We really need the Maven links in there! E.g. you picked the wrong versions of jackson-dataformat-xml-version, stax2-api-version and woodstox-core-version for Spark 4.0.1; they need fixing.

I have seen too many mistakes of this kind; IMHO we should just stick to the tried-and-true approach and put the links in there. Too easy to get it wrong otherwise.
A scripted solution would obviously be nicer, but it's much more effort.
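On the hash-verification point from the checklist: a minimal sketch of pinning and verifying one of those curl-ed jars against the .sha1 file Maven Central publishes next to each artifact (the version below is an example only, not necessarily the one Spark 4.0.1 needs):

# Example version only; take the real one from the Maven links above.
STAX2_API_VERSION=4.2.2
URL=https://repo1.maven.org/maven2/org/codehaus/woodstox/stax2-api/${STAX2_API_VERSION}/stax2-api-${STAX2_API_VERSION}.jar
curl -fsSL -o stax2-api.jar "$URL"
curl -fsSL -o stax2-api.jar.sha1 "$URL.sha1"
# sha1sum -c expects "<hash>  <filename>".
echo "$(cat stax2-api.jar.sha1)  stax2-api.jar" | sha1sum -c -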

@razvan

razvan commented Oct 6, 2025

The Docker COPY directive will not fail if the source file doesn't exist.

At least for Hive we don't use COPY but RUN cp, so we have the flexibility. The command obviously needs to fail if anything other than exactly one file was copied.

RUN cp cannot copy files from a different stage.

BUT I noticed the Spark Dockerfile curls jackson-dataformat-xml, stax2-api and woodstox-core! We really need the Maven links in there! E.g. you picked the wrong versions of jackson-dataformat-xml-version, stax2-api-version and woodstox-core-version for Spark 4.0.1; they need fixing.

Fixed in 505f9d5.

@sbernauer

RUN cp cannot copy files from a different stage.

Good point! In Hive we copy from the hadoop builder to the hive builder. But that doesn't matter right now. Future optimization 😅
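For reference, a minimal sketch of that future optimization using a BuildKit bind mount, so a RUN cp can read another stage and still fail loudly; the stage names, paths, and stand-in jar are illustrative assumptions, not the real Dockerfiles:

# syntax=docker/dockerfile:1
FROM alpine AS hadoop-builder
# Stand-in for the real Hadoop builder stage; the touch fakes the jar this sketch copies.
RUN mkdir -p /hadoop/lib && touch /hadoop/lib/azure-storage-7.0.1.jar

FROM alpine AS spark-builder
RUN mkdir -p /spark/jars
# Bind-mount the hadoop-builder stage so cp can read it, and fail the build
# unless the glob matches exactly one jar (a check plain COPY cannot express).
RUN --mount=type=bind,from=hadoop-builder,source=/hadoop/lib,target=/mnt/lib \
    count="$(ls /mnt/lib/azure-storage-*.jar 2>/dev/null | wc -l)"; \
    [ "$count" -eq 1 ] || { echo "expected 1 azure-storage jar, found $count" >&2; exit 1; }; \
    cp /mnt/lib/azure-storage-*.jar /spark/jars/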

sbernauer previously approved these changes Oct 6, 2025

@sbernauer sbernauer left a comment


Thanks for the links; only one small typo.

Co-authored-by: Sebastian Bernauer <sebastian.bernauer@stackable.de>
@razvan razvan enabled auto-merge October 6, 2025 12:33
@razvan razvan added this pull request to the merge queue Oct 6, 2025
@sbernauer sbernauer moved this from Development: In Review to Development: Done in Stackable Engineering Oct 6, 2025
Merged via the queue into main with commit 0667811 Oct 6, 2025
3 checks passed
@razvan razvan deleted the issues/1273-hdfs-update branch October 6, 2025 12:36