
[Storage System] Support for IBM Cloud Object Storage #302

Closed · wants to merge 35 commits

Conversation

@guykhazma (Contributor) commented Jan 19, 2020

This PR adds support for IBM Cloud Object Storage (IBM COS) by creating COSLogStore, which extends HadoopFileSystemLogStore and relies on IBM COS's ability to perform atomic writes using ETags.

The support for IBM COS relies on the following properties:

  1. Write on COS is all-or-nothing, whether overwrite or not.
  2. List-after-write is consistent.
  3. Write is atomic when using the Stocator - Storage Connector for Apache Spark (v1.1.1+) and setting the configuration fs.cos.atomic.write to true.
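
For orientation, the properties above translate into a Spark storage configuration roughly like the following sketch. Only `fs.cos.atomic.write` is taken from this PR; the `spark.delta.logStore.class` key, the LogStore class path, and the Stocator filesystem setting are illustrative assumptions and should be verified against the Stocator and Delta Lake documentation for your versions.

```properties
# Hypothetical sketch: verify property names and the LogStore class
# path against the Stocator and Delta Lake docs for your versions.
spark.delta.logStore.class=io.delta.storage.IBMCOSLogStore
fs.cos.impl=com.ibm.stocator.fs.ObjectStoreFileSystem
# Required for atomic writes (Stocator v1.1.1+), per this PR:
fs.cos.atomic.write=true
```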

In addition, I propose adding the following documentation to the Storage Configuration page.


@guykhazma (Contributor Author)

The build fails because the license comment in the new class doesn't contain the copyright line.
Since the project was donated to the Linux Foundation, I wasn't sure which copyright to use.

@zsxwing (Member) commented Jan 21, 2020

@guykhazma Thanks a lot for your contribution. As we don't have an environment to test it, could you clarify what tests you have done in a real environment?

@guykhazma (Contributor Author) commented Jan 22, 2020

@zsxwing thanks for reviewing.
I have tested the LogStore implementation directly to verify that it achieves the mutual exclusion requirement, by running concurrent writes to the same location from multiple threads and verifying that only one succeeds.
The two other requirements are properties of the storage system, so they should be OK as well.
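
The shape of that mutual-exclusion test can be sketched with a local-filesystem analogue: an atomic create-if-absent (`O_EXCL`) plays the role of COS's conditional write, and several threads race to commit the same Delta log entry. This is an illustrative stand-in, not the PR's actual COS-backed test code; the file name and helper names are made up for the example.

```python
import os
import tempfile
import threading

def race_to_write(path, payload, results, idx):
    # Create-if-absent: O_EXCL makes open() fail when the file already
    # exists, analogous to a conditional PUT guarded by an ETag.
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        with os.fdopen(fd, "w") as f:
            f.write(payload)
        results[idx] = True
    except FileExistsError:
        results[idx] = False

def run_race(n_writers=8):
    # All writers target the same (hypothetical) Delta log entry;
    # mutual exclusion means exactly one of them may succeed.
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "_delta_log", "00000000000000000000.json")
        os.makedirs(os.path.dirname(target))
        results = [None] * n_writers
        threads = [
            threading.Thread(target=race_to_write,
                             args=(target, f"writer-{i}", results, i))
            for i in range(n_writers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return sum(results)  # number of writers that won the race

if __name__ == "__main__":
    print(run_race())
```

Regardless of thread scheduling, exactly one `O_EXCL` open can succeed, so `run_race()` should always return 1.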

@guykhazma (Contributor Author) commented Jan 29, 2020

Hi @zsxwing, can you please review the PR?

@guykhazma (Contributor Author)

@zsxwing gentle ping. @tdas @mukulmurthy I'd appreciate your review as well.

@guykhazma (Contributor Author) commented Apr 21, 2020

Hi @zsxwing, @tdas, @rahulsmahadev,
I have updated the PR to include the new copyright header, and the build is now passing.
Can you please review this PR?
Thanks!

@guykhazma (Contributor Author)

@zsxwing, @tdas, @rahulsmahadev can you please review the PR?

@zsxwing (Member) commented May 28, 2020

Sorry for the delay. We are working on adding a contrib directory and a new project to put all future LogStore implementations like this. We would like to keep the core project small because special LogStore implementations are not needed by everyone. Will report back when the project is ready.

@guykhazma (Contributor Author) commented Apr 27, 2021

@tdas I have made the changes but haven't updated the documentation yet.
I guess we want to document the new multi-module setup in the README, so I can suggest something, but I wasn't sure if you already had something in mind.

@guykhazma (Contributor Author)

Hi @tdas ,

Any updates?

Thanks!

@tdas (Contributor) commented May 6, 2021

My apologies for the delay. We have been trying to line up a few more features, like a new, more stable LogStore API that LogStore builders like you folks can use. However, there are a few more pieces that we still need to get in for the public API to be easily usable, and I don't want you folks to be blocked on those. So we can continue working with the existing LogStore API, which we will keep supporting for at least the 1.x releases. Later we can rewrite the IBMLogStore to use the new API; we are not going to worry about that for the 1.0 release we are targeting in a couple of weeks.

Let me leave some quick comments

(Four review comments on build.sbt, all outdated and resolved.)
@guykhazma (Contributor Author) commented May 6, 2021

@tdas thanks for the update and review. I made the needed changes.

Also, let me know when you want to have the change to use the new API and I'll open a PR for that.

Thanks

@tdas (Contributor) commented May 6, 2021

@guykhazma thank you for the speedy response. Regarding the new LogStore APIs, I think we first have to convert the HadoopLogStore base class to use the new APIs; only then can the subclasses extending HadoopLogStore also use them easily.

It can be done after this PR. If you are interested, you can take a crack at it if you want!

(Review comment on build.sbt, outdated and resolved.)
"fs.fake.impl" -> classOf[FakeFileSystem].getName,
"fs.fake.impl.disable.cache" -> "true")

protected def shouldUseRenameToWriteCheckpoint: Boolean = false
@tdas (Contributor):
This is not really testing on the actual IBM cloud... do you folks have full integration testing on your side to really test the atomic guarantees, etc.?

@guykhazma (Contributor Author):

Right, I used these tests since the same ones are used in the core library for the other log stores.
As for internal tests, we simulated concurrent writes by multiple log stores and verified that it worked fine.
I am not sure, though, how we can have tests here that check against IBM COS. Do you have any suggestions?

@tdas (Contributor):

Yeah, I don't have a good way either. Even our existing AzureLogStore and S3SingleDriverLogStore do not have cloud-specific unit tests, for the same reason: the complexity of maintaining them in a public repo. As long as you folks have integration tests, that's good enough for now.


import org.apache.spark.sql.delta.storage._

class IBMCOSLogStoreSuite extends LogStoreSuiteBase {
@tdas (Contributor):

Please make the file name match the class name; there may be other log stores and log store suites in the contrib module.

@guykhazma (Contributor Author)

@tdas thanks for the review!
I can try and help with converting the HadoopLogStore base class to use the new APIs sometime in the coming weeks.

@tdas (Contributor) commented May 6, 2021

Awesome, this looks good now. I will start the process of merging this! Thank you for the super quick response!

@tdas (Contributor) commented May 6, 2021

@guykhazma question for you! What is the git email address that you want to use for the commits? The Linux Foundation guidelines require an email address tied to each contribution; your GitHub profile does not have an email on it, and the email on the commits is "33684427+guykhazma@users.noreply.github.com", which is not a public contact email.

@tdas tdas closed this in d299062 May 6, 2021
@tdas tdas mentioned this pull request May 7, 2021

5 participants