-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend CMake to handle aws-sdk-cpp dependency needed for S3 connector #167
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome, thanks @majetideepak for paving the way! Tagging @funrollloops and @mshang816 who understand these things a lot better than me, but if this is just as portable I think we could try to use it for the other dependencies as well.
CMake/ThirdpartyDependencies.cmake
Outdated
include(ExternalProject) | ||
macro(build_awssdk) | ||
message("Configured to download and build AWS-SDK-CPP version " ${AWS_SDK_VERSION}) | ||
externalproject_add(aws-sdk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is great! We currently have custom library installation logic wrapped in bash scripts (scripts/setup-*), but consolidating this with cmake will be a great cleanup. Especially considering (I'm assuming) this is portable between linux and macos?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is portable across linux, macos, and Windows. Some libraries (mostly on Windows) might need some custom linking. CMake allows us to do that as well.
CMake/ThirdpartyDependencies.cmake
Outdated
macro(build_awssdk) | ||
message("Configured to download and build AWS-SDK-CPP version " ${AWS_SDK_VERSION}) | ||
externalproject_add(aws-sdk | ||
PREFIX aws-sdk-cpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this require the external library to be built with cmake as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, https://cmake.org/cmake/help/latest/module/ExternalProject.html is pretty versatile. We can have a custom build command. It allows patching as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you evaluated using vcpkg?
ExternalProject_Add is a reasonable choice, though I would prefer not to have to replicate the find script (below).
CMake/ThirdpartyDependencies.cmake
Outdated
ExternalProject_Get_Property(aws-sdk INSTALL_DIR) | ||
add_library(aws-sdk-core STATIC IMPORTED) | ||
add_library(aws-sdk-s3 STATIC IMPORTED) | ||
set_target_properties(aws-sdk-core PROPERTIES IMPORTED_LOCATION ${INSTALL_DIR}/lib/${CMAKE_STATIC_LIBRARY_PREFIX}aws-cpp-sdk-core${CMAKE_STATIC_LIBRARY_SUFFIX}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we fit this line in 80 chars?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. will do.
CMakeLists.txt
Outdated
set(CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/CMake" ${CMAKE_MODULE_PATH}) | ||
|
||
# Add all options below | ||
option(VELOX_ENABLE_S3 "Build S3 Connector" ON) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we leave this OFF by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure.
# Set third-party dependency versions below | ||
set(AWS_SDK_VERSION "1.9.96") | ||
|
||
include(ThirdpartyDependencies) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do other projects usually organize these things physically? All the different projects in one file, or one file per external project?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can have another level of indirection inside ThirdpartyDependencies for each external project. It makes sense to have per external project file.
Is the intent to enable s3 on CircleCI? If so, we need to give some thought to dependency caching. Caching when using ExternalProject_Add is possible, but awkward. In fact by default dependencies aren't even shared between build modes when using ExternalProject_Add, since the deps are in the build directory. On balance IMO managing the deps outside the CMake script is simpler overall, but it's a debatable point. @pedroerp, @mshang816 thoughts? |
When using ExternalProject_Add, the install directory is configurable. We can add a CMake variable that sets a shared install directory if sharing deps across build modes is desired. |
CMake/ThirdpartyDependencies.cmake
Outdated
macro(build_awssdk) | ||
message("Configured to download and build AWS-SDK-CPP version " ${AWS_SDK_VERSION}) | ||
externalproject_add(aws-sdk | ||
PREFIX aws-sdk-cpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you evaluated using vcpkg?
ExternalProject_Add is a reasonable choice, though I would prefer not to have to replicate the find script (below).
CMake/ThirdpartyDependencies.cmake
Outdated
add_library(aws-sdk-s3 STATIC IMPORTED) | ||
set_target_properties(aws-sdk-core PROPERTIES IMPORTED_LOCATION ${INSTALL_DIR}/lib/${CMAKE_STATIC_LIBRARY_PREFIX}aws-cpp-sdk-core${CMAKE_STATIC_LIBRARY_SUFFIX}) | ||
set_target_properties(aws-sdk-s3 PROPERTIES IMPORTED_LOCATION ${INSTALL_DIR}/lib/${CMAKE_STATIC_LIBRARY_PREFIX}aws-cpp-sdk-s3${CMAKE_STATIC_LIBRARY_SUFFIX}) | ||
set(AWS_INCLUDE_DIRS ${INSTALL_DIR}/include) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're having to effectively duplicate the Find script here.
Another pattern I've seen (and used) is to build the primary project through ExternalProject_Add as well. Since it uses a separate CMake invocation it can use find_package to find the dependencies built by the metaproject.
However this makes building awkward and would require invasive changes, so let's not do that now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This duplication is only limited to aws-sdk-cpp since they provide their own Find implementation which is only available after download. If we have the Find script apriori (usual for other packages), we can just use that here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aws-sdk-cpp has a complicated Find setup due to its dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot use find_package and ExternalProject_Add together. So you are right about the duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking more on this yesterday. Since we already know where the package is installed via ExternalProject_Add, we don't really need the Find, do we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's somewhat true, but if we want people to be able to provide their own dependencies we need to replicate at least part of the interface presented by the find
script. It's another thing that can break if it's not tested.
CMake/ThirdpartyDependencies.cmake
Outdated
set_target_properties(aws-sdk-s3 PROPERTIES IMPORTED_LOCATION ${INSTALL_DIR}/lib/${CMAKE_STATIC_LIBRARY_PREFIX}aws-cpp-sdk-s3${CMAKE_STATIC_LIBRARY_SUFFIX}) | ||
set(AWS_INCLUDE_DIRS ${INSTALL_DIR}/include) | ||
set(AWS_LIBS aws-sdk-s3 aws-sdk-core) | ||
include_directories(SYSTEM ${AWS_INCLUDE_DIRS}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use target_include_directories
to add the include directories to the interface of the aws-sdk-s3 target? This will ensure that files that include the header have the library dependency properly set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure.
I have not used vcpkg before. But a quick search seems to say that both are somewhat similar. Do you see any advantages of using that? |
That is a little reductionist, but not wrong. The vcpkg repo also has recipes for installation of packages, and presumably other features beyond ExternalProject. If you read that entire blog post, actually, the author ended up choosing vcpkg over ExternalProject, Conan, Hunter, and a few other more obscure options. That doesn't mean it's the best option for Velox, of course. I haven't had a chance to use vcpkg (or other package managers like hunter and conan), so I don't have a strong opinion on it. I have used ExternalProject before however, and IMO it was superficially simple but ended up being overall more difficult to work with than a shell script. I am pretty comfortable with bash to be fair. Note we have another dependency management system that was partially implemented but not deployed: FB's getdeps. The current state is definitely not perfect, but it was arrived at after several iterations. This is my wishlist for dependency management:
Benefits of using ExternalProject: automatic rebuild on dependency version changes, a potentially simplified new developer experience particularly if we allow dependencies to be enabled/disabled via CMake flags, source-installed dependencies are isolated from the system by default (note the setup scripts for macos already support this, we just default to installing to /). We also had a requirement from @spershin to have fine-grained control of dependency rebuilds; I added interactive prompts in the macos setup script at his request. I'm not sure how he would feel about switching over to ExternalProject. If you would like to avoid this discussion, I would suggest adding the dependency to the setup scripts initially. It should be straightforward to do so. |
On a narrower note, this PR needs curl to be added to the linux docker image to enable ExternalProject. We'd want to update the image to include s3 anyway; unfortunately this has to be done by one of us. |
Thank you for the detailed explanation.
I feel this benefit will become very important/useful as we include more optional features/dependencies such as HDFS and S3 connectors. Having the dependencies be automatically inferred will also be a big factor as we move towards open-sourcing the project. I would like to continue this discussion and plan a path forward. I am also happy to help with this. |
Discussed offline: let's go forward with this PR. @majetideepak do you want to enable s3 in the CI (along with caching) in this PR, or defer to the next one? We can also enable on one platform at a time. |
@funrollloops I will address the todo's in the review comments but will disable S3 in this PR. We can enable S3 and implement the CI caching in a separate PR. For the latter step, we need curl installed as well in the Linux CI images. |
@funrollloops @pedroerp I addressed the review comments. I also thought we could merge this PR as is given how busy (in a good way) this PR became. The S3 connector has the File System interface design and I expect some amount of discussion there as well. I plan to add that in a separate PR. Any thoughts? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll need #202 in order for linux-build to succeed.
CMake/AWSSDKCPP.cmake
Outdated
|
||
macro(build_awssdk) | ||
message("Configured to download and build AWS-SDK-CPP version " ${AWS_SDK_VERSION}) | ||
externalproject_add(aws-sdk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: capitalization on ExternalProject_Add
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
CMakeLists.txt
Outdated
option(VELOX_ENABLE_S3 "Build S3 Connector" OFF) | ||
|
||
# Set third-party dependency versions below | ||
set(AWS_SDK_VERSION "1.9.96") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not urgent, but it would simplify caching if this dependency information was in the CMake/ directory.
When building the docker image, ideally all files we copy in to do dependency installation should change only when the deps change. Some refactoring may be required to get that to work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the proposed requirement. But am still trying to understand how the dependency chain would look like.
I moved it to CMake/ThirdpartyDependencies.cmake
. But not sure if that is what you exactly meant.
CMake/AWSSDKCPP.cmake
Outdated
${INSTALL_DIR}/lib/${CMAKE_STATIC_LIBRARY_PREFIX}aws-cpp-sdk-core${CMAKE_STATIC_LIBRARY_SUFFIX}) | ||
set_target_properties(aws-sdk-s3 PROPERTIES IMPORTED_LOCATION | ||
${INSTALL_DIR}/lib/${CMAKE_STATIC_LIBRARY_PREFIX}aws-cpp-sdk-s3${CMAKE_STATIC_LIBRARY_SUFFIX}) | ||
target_include_directories(aws-sdk-s3 PRIVATE ${AWS_INCLUDE_DIRS}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be PUBLIC
not PRIVATE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
It looks like |
After implementing the S3 Connector, I see that there are a lot more dependencies required for aws-sdk-cpp when the ExternalPoject_Add path is taken. Let not merge this now. |
relative pr: Check a fallback case in validation: using literal partition key in window function facebookincubator#148 Fix might_contain validate fallback and support struct literal facebookincubator#137 Implement datetime functions in velox/sparksql. facebookincubator#81 Parse options in SingularOrList correctly facebookincubator#48 Add SingularOrList support facebookincubator#45 Support if then in filter facebookincubator#74 Fix semi join output type and support existence join facebookincubator#67 Support decimal as partition column facebookincubator#167 Add the window support facebookincubator#61 Add expand operator facebookincubator#65 Support more cases of filter and its pushdown facebookincubator#14
relative pr: Check a fallback case in validation: using literal partition key in window function facebookincubator#148 Fix might_contain validate fallback and support struct literal facebookincubator#137 Implement datetime functions in velox/sparksql. facebookincubator#81 Parse options in SingularOrList correctly facebookincubator#48 Add SingularOrList support facebookincubator#45 Support if then in filter facebookincubator#74 Fix semi join output type and support existence join facebookincubator#67 Support decimal as partition column facebookincubator#167 Add the window support facebookincubator#61 Add expand operator facebookincubator#65 Support more cases of filter and its pushdown facebookincubator#14
relative pr: Check a fallback case in validation: using literal partition key in window function facebookincubator#148 Fix might_contain validate fallback and support struct literal facebookincubator#137 Implement datetime functions in velox/sparksql. facebookincubator#81 Parse options in SingularOrList correctly facebookincubator#48 Add SingularOrList support facebookincubator#45 Support if then in filter facebookincubator#74 Fix semi join output type and support existence join facebookincubator#67 Support decimal as partition column facebookincubator#167 Add the window support facebookincubator#61 Add expand operator facebookincubator#65 Support more cases of filter and its pushdown facebookincubator#14
relative pr: Check a fallback case in validation: using literal partition key in window function facebookincubator#148 Fix might_contain validate fallback and support struct literal facebookincubator#137 Implement datetime functions in velox/sparksql. facebookincubator#81 Parse options in SingularOrList correctly facebookincubator#48 Add SingularOrList support facebookincubator#45 Support if then in filter facebookincubator#74 Fix semi join output type and support existence join facebookincubator#67 Support decimal as partition column facebookincubator#167 Add the window support facebookincubator#61 Add expand operator facebookincubator#65 Support more cases of filter and its pushdown facebookincubator#14
relative pr: Check a fallback case in validation: using literal partition key in window function facebookincubator#148 Fix might_contain validate fallback and support struct literal facebookincubator#137 Implement datetime functions in velox/sparksql. facebookincubator#81 Parse options in SingularOrList correctly facebookincubator#48 Add SingularOrList support facebookincubator#45 Support if then in filter facebookincubator#74 Fix semi join output type and support existence join facebookincubator#67 Support decimal as partition column facebookincubator#167 Add the window support facebookincubator#61 Add expand operator facebookincubator#65 Support more cases of filter and its pushdown facebookincubator#14
relative pr: Check a fallback case in validation: using literal partition key in window function facebookincubator#148 Fix might_contain validate fallback and support struct literal facebookincubator#137 Implement datetime functions in velox/sparksql. facebookincubator#81 Parse options in SingularOrList correctly facebookincubator#48 Add SingularOrList support facebookincubator#45 Support if then in filter facebookincubator#74 Fix semi join output type and support existence join facebookincubator#67 Support decimal as partition column facebookincubator#167 Add the window support facebookincubator#61 Add expand operator facebookincubator#65 Support more cases of filter and its pushdown facebookincubator#14
This change currently extends the CMake script to find or download and install aws-sdk-cpp.
VELOX_ENABLE_S3 option is used to enable the S3 connector and is set by default to OFF.
The package (including headers, libraries, binaries) is installed in the directory specified by VELOX_DEPENDENCY_INSTALL_DIR. The default is the build directory.
If AWS SDK CPP is already installed but in a custom location, set AWS_SDK_CPP_INSTALL_DIR to the install location.