Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Apache Accumulo connector and documentation #5030

Closed
wants to merge 66 commits into from

Conversation

adamjshook
Copy link
Member

A Presto connector for Apache Accumulo. See the RST in presto-docs for more details. I've completed the CLA.

@electrum
Copy link
Contributor

Thanks for sending this! A couple of initial questions:

  1. Why a separate checkstyle configuration? All the code in the Presto tree should conform to the same style. If there are style changes to make, we should do them globally.
  2. Can you remove the "Copyright" line in the license header? For a number of reasons, we don't add that to any files in Presto.

@adamjshook
Copy link
Member Author

The checkstyle.xml is a leftover artifact from where I had originally developed the connector. It is no different -- I'll remove it.

As far as the Copyright, I'll look into it.

@apaprocki
Copy link

@electrum Would you be open to creating the standard Apache NOTICE file in the root that achieves the same outcome? Apache recommends creating that file, moving third-party copyright notices into there are not putting them in individual source files. Getting this from:

http://www.apache.org/legal/src-headers.html#header-existingcopyright

Since contributors maintain their copyright ownership in the works they submit, it is misleading to place the ASF copyright at the top of each source file. The previous ASF copyright notice represented the ASF's ownership in the collective work of the entire distribution, but placing it at the top of each source file was causing a lot of confusion. To avoid this confusion, we are placing all copyright notices in a single file and leaving only the licensing information within each source file.

(emphasis mine)

@electrum
Copy link
Contributor

electrum commented Apr 14, 2016 via email

@martint
Copy link
Contributor

martint commented Apr 14, 2016

We already have a NOTICE file, but it's not at the root of the source tree (primarily because it needs to get packaged and distributed as part of the binary package) https://github.com/prestodb/presto/blob/master/presto-server/NOTICE


// Check all the column types, and throw an exception if the types of a map are complex
// While it is a rare case, this is not supported by the Accumulo connector
Set<String> columnNames = new HashSet<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ImmutableSet.Builder instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@maciejgrzybek
Copy link
Member

About tests (sorry, cannot comment on diff because GitHub complains about too long PR):
It looks like AccumuloQueryRunner expects Accumulo running on localhost.
Have you considered writing a Tempto (https://github.com/prestodb/tempto) end-to-end product test which would spin up Accumulo for your tests? You should probably provide an Accumulo docker (similar to Hadoop images we have in presto-product-tests) image and write a fulfiller for Tempto to spin it up as a prerequisite for tests.

@adamjshook
Copy link
Member Author

adamjshook commented Apr 15, 2016

The test suites use MiniAccumuloCluster to stand up a single-node instance of HDFS, ZooKeeper, and Accumulo on the node executing the tests, which sounds similar to the goal of Tempto (though I am unfamiliar with Tempto -- I can look into it).

Regarding the Travis build, it was failing earlier due to the TestAccumuloDistributedQueries and TestAccumuloIntegrationSmokeTest being initialized back-to-back and them sharing the same MiniAccumuloCluster. One tries to create tables that already exist. I've fixed that by always forking a VM for surefire. mvn test is functioning on my end. Unsure about the current build error in Travis. It could be some intermittent ZooKeeper error or something more sinister.

LOG.info("Shutting down MAC");
accumulo.stop();
LOG.info("Cleaning up MAC directory");
FileUtils.forceDelete(macDir);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if accumulo.stop() throws, won't you delete macDir?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a directory under /tmp, but I should move it to its own try/catch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But in one flow you delete it, in other not - it's inconsistent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Updated version to 0.151-SNAPSHOT, made API changes based on upstream
changes, increased AccumuloQueryRunner to use 4 workers, and updated
documentation.
Fixed new import locations in AccumuloStringFunctions.
- Removed unused variables
- Made many member variables 'final'
- Removed unnecessary Javadoc from self-explanatory methods
- Removed unnecessary 'throws' declarations
- Constructor arguments are now one per line
- Connector config properties now start with 'accumulo'
- Removed unused public static final variables in session properties
- Remove unused properties in presto-accumulo/pom.xml
- Remove zookeeper dependency and add runtime scope to hadoop artifacts
- Add exclusions to remove dependency version check plugin
- Add exclusions for duplicate classes
- Remove presto-main from pom.xml, including code updates to remove
  classes from the presto-main package
- Remove forking of test cases
- Remove provided scope from some dependencies
- Remove accumulo.properties file
- Move Logger to top of file
- Remove TPC-H test cases
- Remove log4j.properties
- Standardize new line wrapping
- Fix weird AccumuloSplit serialization problem
- More exclusions in pom file
- Apply API updates from rebase
- Add flush to AccumuloPageSink
For large table inserts, the AccumuloPageSink can cause a JVM crash due
to the client-side buffering of metrics.  This adds a temporary fix
while a better solution can be implemented.
- Rename FLOAT to REAL
- Remove serialVersionUID
- Use getConstructor.newInstance
- Use Guice to inject Connector vs static singleton
Connector remains singleton in testing code to ensure all tests use the
same MiniAccumuloCluster instance
- Remove setOptionalConfig
- Add exception clause for ZKMetadataManager race condition
- Remove provided scope from javax.inject and jackson deps
- Wrap Range objects in AccumuloSplit
- Fix NPE in listTables
- Documentation improvements
- Validate exception class
- Remove use of ConnectorFactoryContext
- Make io.airlift.units provided
Some predicates were being pushed to Accumulo for server-side filtering,
however the current implementation actually slows queries down.  These
iterators were stale code -- best to remove them for now until a
sufficient implementation surfaces.
- Use StandardErrorCodes where applicable
- Remove commented out tests
- Remove stale 'addColumn'
@cberner
Copy link
Contributor

cberner commented Oct 3, 2016

It looks like the accumulo.metadata.manager.class config isn't used. Can that be removed?

working with data. By default, the first column of the table definition
is set to the Accumulo row ID. This should be the primary key of your
table, and keep in mind that any ``INSERT statements`` containing the same
row ID is effectively an UPDATE as far as Accumulo is concerned, as any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this how Accumulo works in general? It seems to me that this should fail the INSERT, and the user should have to issue a DELETE first

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is how it works in general. All inserts will either insert a new row or overwrite any existing values.

@adamjshook
Copy link
Member Author

@cberner My original intent was having the metadata manager be pluggable, but I don't foresee people really doing that since ZooKeeper has to be there in the first place and is a reasonable way of managing the Presto/Accumulo metadata. I'll remove the property and pluggable nature of it.

@cberner
Copy link
Contributor

cberner commented Oct 3, 2016

👍 Cool, let me know when it's ready for another look. I think the rest of it looked good to me

@adamjshook
Copy link
Member Author

@cberner Back to you!

@cberner
Copy link
Contributor

cberner commented Oct 4, 2016

Squashed the commits together and merged. Thanks!

@cberner cberner closed this Oct 4, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants