Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend CMake to handle aws-sdk-cpp dependency needed for S3 connector #167

Closed

Conversation

majetideepak
Copy link
Collaborator

@majetideepak majetideepak commented Sep 7, 2021

This change currently extends the CMake script to find or download and install aws-sdk-cpp.
VELOX_ENABLE_S3 option is used to enable the S3 connector and is set by default to OFF.
The package (including headers, libraries, binaries) is installed in the directory specified by VELOX_DEPENDENCY_INSTALL_DIR. The default is the build directory.
If AWS SDK CPP is already installed but in a custom location, set AWS_SDK_CPP_INSTALL_DIR to the install location.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 7, 2021
@majetideepak majetideepak changed the title Extend CMake scripts to find or build aws-sdk-cpp library Add support for S3 connector Sep 7, 2021
Copy link
Contributor

@pedroerp pedroerp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome, thanks @majetideepak for paving the way! Tagging @funrollloops and @mshang816 who understand these things a lot better than me, but if this is just as portable I think we could try to use it for the other dependencies as well.

include(ExternalProject)
macro(build_awssdk)
message("Configured to download and build AWS-SDK-CPP version " ${AWS_SDK_VERSION})
externalproject_add(aws-sdk
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great! We currently have custom library installation logic wrapped in bash scripts (scripts/setup-*), but consolidating this with cmake will be a great cleanup. Especially considering (I'm assuming) this is portable between linux and macos?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is portable across linux, macos, and Windows. Some libraries (mostly on Windows) might need some custom linking. CMake allows us to do that as well.

macro(build_awssdk)
message("Configured to download and build AWS-SDK-CPP version " ${AWS_SDK_VERSION})
externalproject_add(aws-sdk
PREFIX aws-sdk-cpp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this require the external library to be built with cmake as well?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, https://cmake.org/cmake/help/latest/module/ExternalProject.html is pretty versatile. We can have a custom build command. It allows patching as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you evaluated using vcpkg?

ExternalProject_Add is a reasonable choice, though I would prefer not to have to replicate the find script (below).

ExternalProject_Get_Property(aws-sdk INSTALL_DIR)
add_library(aws-sdk-core STATIC IMPORTED)
add_library(aws-sdk-s3 STATIC IMPORTED)
set_target_properties(aws-sdk-core PROPERTIES IMPORTED_LOCATION ${INSTALL_DIR}/lib/${CMAKE_STATIC_LIBRARY_PREFIX}aws-cpp-sdk-core${CMAKE_STATIC_LIBRARY_SUFFIX})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we fit this line in 80 chars?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. will do.

CMakeLists.txt Outdated
set(CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/CMake" ${CMAKE_MODULE_PATH})

# Add all options below
option(VELOX_ENABLE_S3 "Build S3 Connector" ON)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we leave this OFF by default?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure.

# Set third-party dependency versions below
set(AWS_SDK_VERSION "1.9.96")

include(ThirdpartyDependencies)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do other projects usually organize these things physically? All the different projects in one file, or one file per external project?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have another level of indirection inside ThirdpartyDependencies for each external project. It makes sense to have per external project file.

@funrollloops
Copy link
Contributor

Is the intent to enable s3 on CircleCI? If so, we need to give some thought to dependency caching.

Caching when using ExternalProject_Add is possible, but awkward. In fact by default dependencies aren't even shared between build modes when using ExternalProject_Add, since the deps are in the build directory.

On balance IMO managing the deps outside the CMake script is simpler overall, but it's a debatable point. @pedroerp, @mshang816 thoughts?

@majetideepak
Copy link
Collaborator Author

aren't even shared between build modes when using ExternalProject_Add, since the deps are in the build directory.

When using ExternalProject_Add, the install directory is configurable. We can add a CMake variable that sets a shared install directory if sharing deps across build modes is desired.
The CMake approach can achieve what the current bash scripts achieve and more.

macro(build_awssdk)
message("Configured to download and build AWS-SDK-CPP version " ${AWS_SDK_VERSION})
externalproject_add(aws-sdk
PREFIX aws-sdk-cpp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you evaluated using vcpkg?

ExternalProject_Add is a reasonable choice, though I would prefer not to have to replicate the find script (below).

add_library(aws-sdk-s3 STATIC IMPORTED)
set_target_properties(aws-sdk-core PROPERTIES IMPORTED_LOCATION ${INSTALL_DIR}/lib/${CMAKE_STATIC_LIBRARY_PREFIX}aws-cpp-sdk-core${CMAKE_STATIC_LIBRARY_SUFFIX})
set_target_properties(aws-sdk-s3 PROPERTIES IMPORTED_LOCATION ${INSTALL_DIR}/lib/${CMAKE_STATIC_LIBRARY_PREFIX}aws-cpp-sdk-s3${CMAKE_STATIC_LIBRARY_SUFFIX})
set(AWS_INCLUDE_DIRS ${INSTALL_DIR}/include)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're having to effectively duplicate the Find script here.

Another pattern I've seen (and used) is to build the primary project through ExternalProject_Add as well. Since it uses a separate CMake invocation it can use find_package to find the dependencies built by the metaproject.

However this makes building awkward and would require invasive changes, so let's not do that now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This duplication is only limited to aws-sdk-cpp since they provide their own Find implementation which is only available after download. If we have the Find script apriori (usual for other packages), we can just use that here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aws-sdk-cpp has a complicated Find setup due to its dependencies.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot use find_package and ExternalProject_Add together. So you are right about the duplication.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking more on this yesterday. Since we already know where the package is installed via ExternalProject_Add, we don't really need the Find, do we?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's somewhat true, but if we want people to be able to provide their own dependencies we need to replicate at least part of the interface presented by the find script. It's another thing that can break if it's not tested.

set_target_properties(aws-sdk-s3 PROPERTIES IMPORTED_LOCATION ${INSTALL_DIR}/lib/${CMAKE_STATIC_LIBRARY_PREFIX}aws-cpp-sdk-s3${CMAKE_STATIC_LIBRARY_SUFFIX})
set(AWS_INCLUDE_DIRS ${INSTALL_DIR}/include)
set(AWS_LIBS aws-sdk-s3 aws-sdk-core)
include_directories(SYSTEM ${AWS_INCLUDE_DIRS})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use target_include_directories to add the include directories to the interface of the aws-sdk-s3 target? This will ensure that files that include the header have the library dependency properly set.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure.

@majetideepak
Copy link
Collaborator Author

I have not used vcpkg before. But a quick search seems to say that both are somewhat similar. Do you see any advantages of using that?
vcpkg A package manager from Microsoft, somewhat similar to CMakeExternalProjectAdd.
https://jonathan.temno.eu/blog/vcpkg.html

@funrollloops
Copy link
Contributor

funrollloops commented Sep 8, 2021

I have not used vcpkg before. But a quick search seems to say that both are somewhat similar. Do you see any advantages of using that?
vcpkg A package manager from Microsoft, somewhat similar to CMakeExternalProjectAdd.
https://jonathan.temno.eu/blog/vcpkg.html

That is a little reductionist, but not wrong. The vcpkg repo also has recipes for installation of packages, and presumably other features beyond ExternalProject. If you read that entire blog post, actually, the author ended up choosing vcpkg over ExternalProject, Conan, Hunter, and a few other more obscure options. That doesn't mean it's the best option for Velox, of course.

I haven't had a chance to use vcpkg (or other package managers like hunter and conan), so I don't have a strong opinion on it. I have used ExternalProject before however, and IMO it was superficially simple but ended up being overall more difficult to work with than a shell script. I am pretty comfortable with bash to be fair. Note we have another dependency management system that was partially implemented but not deployed: FB's getdeps.

The current state is definitely not perfect, but it was arrived at after several iterations. This is my wishlist for dependency management:

  1. Use system packages where it makes sense (which depends on platform). I don't believe there's any reasonable way to integrate this into CMake, so some shell scripts are likely to remain. Getdeps does handle this btw.
  2. Allow users to manage dependencies themselves (somewhat implied by 1). To minimize the maintenance burden this IMO implies we should use standard find scripts as much as possible and at most fall back to installation in the build system. Preferably, we would use the find scripts even when we install the dependency ourselves. Everything but ExternalProject gets this for free.
  3. Automatic rebuilds when version requirements change with caching (for switching branches). This is something that e.g. getdeps can provide, but see (4) below. Vcpkg provides neither (I think). Shell scripts can "cache" somewhat by re-using the src and build directories. ExternalProject can do automatic rebuilds, and if it retains the build directory it's on par with shell scripts here so overall the best option.
  4. Dependency caching in the CI for macos (using CircleCI caching) and linux (by baking the deps into the docker image). Currently we install all deps to a separate folder for macos and use the setup scripts + the CircleCI config as the cache key. We never implemented getdeps because I failed to find a useful dependency cache key. If we put all external CMake dependencies in one directory/file, we could probably mix that into the key without adding too much complexity. We'd need wiring in the CMakeLists.txt to allow the user to specify the installation directory (preferably one flag?). So this is not a blocker for ExternalProject IMO, but will take work to implement. For linux, we simply run the setup scripts in a fresh docker image and use the result for the CI.
  5. Debuggability. In general I would rather debug a (well written) shell script than CMake. On vcpkg, I don't have enough experience to make a judgement here. getdeps however was kind of slow and painful to debug.
  6. Some degree of consistency. At the moment all other dependencies are managed by shell script, though the current state isn't great either -- we should factor out a common library for the setup scripts. I suspect system deps will always be managed by shell script. I, personally, am not planning to make time to swap everything over to ExternalProject. @mshang816 , would you want to make time for this?
  7. Simple way to enable/disable features. Integrating the dependency management into CMake would allow it to be conditional on feature flags, which is definitely nice. However we get similar flexibility IMO if we just disable features by default if their dependencies are missing. That could be implemented even if we use ExternalProject, though, by requiring a cmake flag to enable automatic dependency installation.

Benefits of using ExternalProject: automatic rebuild on dependency version changes, a potentially simplified new developer experience particularly if we allow dependencies to be enabled/disabled via CMake flags, source-installed dependencies are isolated from the system by default (note the setup scripts for macos already support this, we just default to installing to /).

We also had a requirement from @spershin to have fine-grained control of dependency rebuilds; I added interactive prompts in the macos setup script at his request. I'm not sure how he would feel about switching over to ExternalProject.

If you would like to avoid this discussion, I would suggest adding the dependency to the setup scripts initially. It should be straightforward to do so.

@funrollloops
Copy link
Contributor

On a narrower note, this PR needs curl to be added to the linux docker image to enable ExternalProject. We'd want to update the image to include s3 anyway; unfortunately this has to be done by one of us.

@majetideepak
Copy link
Collaborator Author

Thank you for the detailed explanation.

Benefits of using ExternalProject: automatic rebuild on dependency version changes, a potentially simplified new developer experience particularly if we allow dependencies to be enabled/disabled via CMake flags

I feel this benefit will become very important/useful as we include more optional features/dependencies such as HDFS and S3 connectors. Having the dependencies be automatically inferred will also be a big factor as we move towards open-sourcing the project.
The bash script makes this impossible.

I would like to continue this discussion and plan a path forward. I am also happy to help with this.
Obviously, this does not tie into the S3 connector feature. I will plan something separately for that.

@funrollloops
Copy link
Contributor

Discussed offline: let's go forward with this PR. @majetideepak do you want to enable s3 in the CI (along with caching) in this PR, or defer to the next one?

We can also enable on one platform at a time.

@majetideepak
Copy link
Collaborator Author

majetideepak commented Sep 9, 2021

@funrollloops I will address the todo's in the review comments but will disable S3 in this PR. We can enable S3 and implement the CI caching in a separate PR. For the latter step, we need curl installed as well in the Linux CI images.

@majetideepak majetideepak mentioned this pull request Sep 10, 2021
@majetideepak majetideepak changed the title Add support for S3 connector Extend CMake to handle aws-sdk-cpp dependency needed for S3 connector Sep 10, 2021
@majetideepak majetideepak marked this pull request as ready for review September 10, 2021 17:01
@majetideepak
Copy link
Collaborator Author

@funrollloops @pedroerp I addressed the review comments. I also thought we could merge this PR as is given how busy (in a good way) this PR became. The S3 connector has the File System interface design and I expect some amount of discussion there as well. I plan to add that in a separate PR. Any thoughts?

Copy link
Contributor

@funrollloops funrollloops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need #202 in order for linux-build to succeed.


macro(build_awssdk)
message("Configured to download and build AWS-SDK-CPP version " ${AWS_SDK_VERSION})
externalproject_add(aws-sdk
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: capitalization on ExternalProject_Add

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

CMakeLists.txt Outdated
option(VELOX_ENABLE_S3 "Build S3 Connector" OFF)

# Set third-party dependency versions below
set(AWS_SDK_VERSION "1.9.96")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not urgent, but it would simplify caching if this dependency information was in the CMake/ directory.

When building the docker image, ideally all files we copy in to do dependency installation should change only when the deps change. Some refactoring may be required to get that to work.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the proposed requirement. But am still trying to understand how the dependency chain would look like.
I moved it to CMake/ThirdpartyDependencies.cmake. But not sure if that is what you exactly meant.

${INSTALL_DIR}/lib/${CMAKE_STATIC_LIBRARY_PREFIX}aws-cpp-sdk-core${CMAKE_STATIC_LIBRARY_SUFFIX})
set_target_properties(aws-sdk-s3 PROPERTIES IMPORTED_LOCATION
${INSTALL_DIR}/lib/${CMAKE_STATIC_LIBRARY_PREFIX}aws-cpp-sdk-s3${CMAKE_STATIC_LIBRARY_SUFFIX})
target_include_directories(aws-sdk-s3 PRIVATE ${AWS_INCLUDE_DIRS})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be PUBLIC not PRIVATE.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@majetideepak
Copy link
Collaborator Author

majetideepak commented Sep 13, 2021

It looks like curl is actually needed by aws-sdk-cpp and not ExternalProject_Add.

@majetideepak
Copy link
Collaborator Author

majetideepak commented Sep 16, 2021

After implementing the S3 Connector, I see that there are a lot more dependencies required for aws-sdk-cpp when the ExternalPoject_Add path is taken. Let not merge this now.
Let's address the build story separately for libhdfs3 and aws-sdk-cpp.

PHILO-HE pushed a commit to PHILO-HE/velox that referenced this pull request Apr 3, 2023
rui-mo added a commit to rui-mo/velox that referenced this pull request Apr 23, 2023
zhouyuan pushed a commit to zhouyuan/velox that referenced this pull request Jun 7, 2023
relative pr:

Check a fallback case in validation: using literal partition key in window function facebookincubator#148
Fix might_contain validate fallback and support struct literal facebookincubator#137
Implement datetime functions in velox/sparksql. facebookincubator#81
Parse options in SingularOrList correctly facebookincubator#48
Add SingularOrList support facebookincubator#45
Support if then in filter facebookincubator#74
Fix semi join output type and support existence join facebookincubator#67
Support decimal as partition column facebookincubator#167
Add the window support facebookincubator#61
Add expand operator facebookincubator#65
Support more cases of filter and its pushdown facebookincubator#14
Yohahaha pushed a commit to Yohahaha/velox that referenced this pull request Jul 4, 2023
relative pr:

Check a fallback case in validation: using literal partition key in window function facebookincubator#148
Fix might_contain validate fallback and support struct literal facebookincubator#137
Implement datetime functions in velox/sparksql. facebookincubator#81
Parse options in SingularOrList correctly facebookincubator#48
Add SingularOrList support facebookincubator#45
Support if then in filter facebookincubator#74
Fix semi join output type and support existence join facebookincubator#67
Support decimal as partition column facebookincubator#167
Add the window support facebookincubator#61
Add expand operator facebookincubator#65
Support more cases of filter and its pushdown facebookincubator#14
chenxu14 pushed a commit to chenxu14/velox that referenced this pull request Jul 5, 2023
relative pr:

Check a fallback case in validation: using literal partition key in window function facebookincubator#148
Fix might_contain validate fallback and support struct literal facebookincubator#137
Implement datetime functions in velox/sparksql. facebookincubator#81
Parse options in SingularOrList correctly facebookincubator#48
Add SingularOrList support facebookincubator#45
Support if then in filter facebookincubator#74
Fix semi join output type and support existence join facebookincubator#67
Support decimal as partition column facebookincubator#167
Add the window support facebookincubator#61
Add expand operator facebookincubator#65
Support more cases of filter and its pushdown facebookincubator#14
PHILO-HE pushed a commit to PHILO-HE/velox that referenced this pull request Jul 17, 2023
relative pr:

Check a fallback case in validation: using literal partition key in window function facebookincubator#148
Fix might_contain validate fallback and support struct literal facebookincubator#137
Implement datetime functions in velox/sparksql. facebookincubator#81
Parse options in SingularOrList correctly facebookincubator#48
Add SingularOrList support facebookincubator#45
Support if then in filter facebookincubator#74
Fix semi join output type and support existence join facebookincubator#67
Support decimal as partition column facebookincubator#167
Add the window support facebookincubator#61
Add expand operator facebookincubator#65
Support more cases of filter and its pushdown facebookincubator#14
rui-mo pushed a commit to rui-mo/velox that referenced this pull request Jul 21, 2023
relative pr:

Check a fallback case in validation: using literal partition key in window function facebookincubator#148
Fix might_contain validate fallback and support struct literal facebookincubator#137
Implement datetime functions in velox/sparksql. facebookincubator#81
Parse options in SingularOrList correctly facebookincubator#48
Add SingularOrList support facebookincubator#45
Support if then in filter facebookincubator#74
Fix semi join output type and support existence join facebookincubator#67
Support decimal as partition column facebookincubator#167
Add the window support facebookincubator#61
Add expand operator facebookincubator#65
Support more cases of filter and its pushdown facebookincubator#14
rui-mo pushed a commit to rui-mo/velox that referenced this pull request Jul 24, 2023
relative pr:

Check a fallback case in validation: using literal partition key in window function facebookincubator#148
Fix might_contain validate fallback and support struct literal facebookincubator#137
Implement datetime functions in velox/sparksql. facebookincubator#81
Parse options in SingularOrList correctly facebookincubator#48
Add SingularOrList support facebookincubator#45
Support if then in filter facebookincubator#74
Fix semi join output type and support existence join facebookincubator#67
Support decimal as partition column facebookincubator#167
Add the window support facebookincubator#61
Add expand operator facebookincubator#65
Support more cases of filter and its pushdown facebookincubator#14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants