Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow impl for LocationProvider for to be created during runtime #1531

Merged

Conversation

mickjermsurawong-stripe
Copy link
Contributor

@mickjermsurawong-stripe mickjermsurawong-stripe commented Sep 30, 2020

This PR allows users to supply implementation of LocationProvider to be created during runtime.
This is motivated from the proposal to solve data localization.
From the community sync, the feedback is to make to leave implementation to users so other use cases outside data localization could make use of this too

@mickjermsurawong-stripe mickjermsurawong-stripe changed the title Localized store based on partition [wip] Localized store based on partition Sep 30, 2020
@jackye1995
Copy link
Contributor

I have 2 concerns with this PR:

  1. I think we should in addition propose dynamically loading the LocationProvider implementation, instead of using a boolean to switch on or off each implementation. This will allow any user to easily plugin a custom implementation. I wonder if such proposal is enough to both satisfy your need and also keep the community version general.

  2. for this specific implementation, the approach of using table property to get localizedDataLocationLookUp map feels not scalable to me if we want to make it a useful generic implementation. For example, customers might want to have 1 HDFS cluster per AWS local zone to read and write data due to latency requirement, and this setup easily generate hundreds to thousands of table properties. I don't know if there is any bad impact for such use case, since table properties is not designed to hold a lot of data.

@mickjermsurawong-stripe
Copy link
Contributor Author

hi @jackye1995 thanks for the feedback here! This is aligned with the discussion we had during the sync for solution that folks can easily customizable their locations. I've re-purposed this PR to address this. If this is in, we can easily implement our data localization here.

@mickjermsurawong-stripe mickjermsurawong-stripe changed the title [wip] Localized store based on partition Allow impl for LocationProvider for to be created during runtime Oct 1, 2020
@jackye1995
Copy link
Contributor

Thank you! I created an issue for this, can you link it to this PR? #1538

@mickjermsurawong-stripe
Copy link
Contributor Author

cc @stripe/iceberg-reviewers, I iterated on this based on feedback to have pluggable impl for location provider instead

@rdblue
Copy link
Contributor

rdblue commented Oct 2, 2020

Thanks for working on this, @mickjermsurawong-stripe!

I like dynamically loading a location provider, and I like that the changes required are pretty small. That looks good to me.

What I'd like to change is how this is tested. There are two main issues with it. First, the tests are in an unrelated suite for Spark that is mainly testing that Spark works with a variety of different schemas. It doesn't seem appropriate to put the tests for injecting a location provider in a different module and into this suite. Second, the TestLocationProvider returns real paths and the tests work by writing data files and validating that files are underneath those paths. That does test that it works, but I think this makes validation much harder. This approach requires a lot of test code and runs a lot of unrelated cases through to validate the location provider.

I would prefer a more targeted test in a new suited in core, TestLocationProvider. That would have a case that injects a TestLocationProvider that keeps track of calls in static variables, so you'd be able to inject the provider, get it from table.operations().locationProvider(), call some methods, and validate that TestLocationProvider received the calls.

We also need negative tests for when the provider can't be found. I'd like to have a good error message like "Cannot find location provider class: %s".

@rdblue
Copy link
Contributor

rdblue commented Oct 2, 2020

@jackye1995, I think this dynamic loading approach might be a good idea for injecting S3FileIO as well. What do you think?

@jackye1995
Copy link
Contributor

@rdblue yes, that is exactly how I implement, just waiting for your branch to cross check all details about S3FileIO before opening a PR

@mickjermsurawong-stripe
Copy link
Contributor Author

@rdblue, ah thank you for the feedback on test! I guess the complicated test here is the artifacts of me trying to test the logic in my first data localization feature.

I created a separate test as suggested; I made use of the existing TableTestBase and construct tests based on table.locationProvider() instead of its underlying operations (though that also directly delegates to the ops). I also addressed feedback on error message.

String.format("Unable to instantiate the provided implementation %s for %s.",
nonExistentImpl,
LocationProvider.class));
this.table.locationProvider();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of exception rules, we like to use AssertHelpers.assertThrows. That way, the test doesn't need to end when the exception is thrown and you can run other validations.

Also, we don't typically use the prefix this. unless setting an instance field, unless it helps understanding what's going on. We always using when assigning, though.


// publicly visible for testing to be dynamically loaded
public static class InvalidDynamicallyLoadedLocationProvider implements LocationProvider {
// No public constructor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we support a no-arg constructor as well as the one that passes the table location and properties?

}

@Test
public void testInvalidDynamicallyLoadedLocationProvider() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the class doesn't implement LocationProvider?

Could you add a test case for that?

@@ -87,6 +87,8 @@ private TableProperties() {
public static final String OBJECT_STORE_ENABLED = "write.object-storage.enabled";
public static final boolean OBJECT_STORE_ENABLED_DEFAULT = false;

public static final String LOCATION_PROVIDER_IMPL = "write.location-provider.impl";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add this to the docs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

DynConstructors.builder(LocationProvider.class)
.impl(impl).build()
);
Optional<DynConstructors.Ctor<LocationProvider>> twoArgConstructor = Optional.empty();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the arguments passed to one constructor are a prefix of the args passed to the other, there is no need to use a different builder or call because the arguments actually passed are limited by the number of parameters defined by the constructor. That makes it possible to pass both arguments to either one, and any extras are ignored.

That helps simplify cases like this because you don't need the extra optional logic:

  ctor = DynConstructors.builder(LocationProvider.class)
      .impl(impl, String.class, Map.class)
      .impl(impl) // fall back to no-arg constructor
      .build()
  return ctor.newInstance(location, properties);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the explanation here

<td class="colFirst"><a name="org.apache.iceberg.TableProperties.WRITE_METADATA_LOCATION">
<!-- -->
</a><code>public&nbsp;static&nbsp;final&nbsp;java.lang.String</code></td>
<td><code><a href="org/apache/iceberg/TableProperties.html#WRITE_METADATA_LOCATION">WRITE_METADATA_LOCATION</a></code></td>
<td class="colLast"><code>"write.metadata.path"</code></td>
</tr>
<tr class="altColor">
<tr class="rowColor">
Copy link
Contributor

@rdblue rdblue Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove the html changes please? The docs that @aokolnychyi was referring to are built from markdown source located in site/docs/configuration.md.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh. ack!

@rdblue
Copy link
Contributor

rdblue commented Oct 5, 2020

Thanks for the update, @mickjermsurawong-stripe! I think the implementation can be quite a bit simpler and more like the original, even with support for the no-arg constructor. Sorry I didn't mention that earlier, before you went ahead and implemented it this way.

"has a public no-arg constructor or a two-arg constructor " +
"taking in the string base table location and its property string map.",
impl, LocationProvider.class));
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If provided class has the specified constructor signature, but not implementing LocationProvider, it can still be loaded into the dynConstructor. The following block differentiates this missing interface more explicitly.
Might be a bit more code churn, but it does give more signals to users.

@mickjermsurawong-stripe
Copy link
Contributor Author

thanks @rdblue for the review so far! I believe this should be ready for another review.

} catch (ClassCastException e) {
throw new IllegalArgumentException(
String.format("Provided implementation for dynamic instantiation should implement %s, " +
"but found dynamic constructor %s.", LocationProvider.class, ctor));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dynamic constructor?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here as well you should probably throw the exception with the original as a cause so the context isn't lost.

"Make sure the implementation is in classpath, and that it either " +
"has a public no-arg constructor or a two-arg constructor " +
"taking in the string base table location and its property string map.",
impl, LocationProvider.class));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're catching the exception, it probably makes more sense to use buildChecked so the exception class isn't generic. That exception also has good information about why all of the implementations failed, so I would recommend adding that exception as a cause to this one that is thrown. I like throwing this one for context, though!

@mickjermsurawong-stripe
Copy link
Contributor Author

mickjermsurawong-stripe commented Oct 5, 2020

Propagating the causes now should give stacktrace for these two cases, with more specific causes.

 java.lang.IllegalArgumentException: Unable to find a constructor for implementation org.apache.iceberg.TestLocationProvider$InvalidArgTypesDynamicallyLoadedLocationProvider of interface org.apache.iceberg.io.LocationProvider. Make sure the implementation is in classpath, and that it either has a public no-arg constructor or a two-arg constructor taking in the string base table location and its property string map.
        at org.apache.iceberg.LocationProviders.locationsFor(LocationProviders.java:48)
        at org.apache.iceberg.TestTables$TestTableOperations.locationProvider(TestTables.java:212)
        at org.apache.iceberg.BaseTable.locationProvider(BaseTable.java:199)
        at org.apache.iceberg.TestLocationProvider.testInvalidArgTypesDynamicallyLoadedLocationProvider(TestLocationProvider.java:208)

        Caused by:
        java.lang.NoSuchMethodException: Cannot find constructor for interface org.apache.iceberg.io.LocationProvider
                Missing org.apache.iceberg.TestLocationProvider$InvalidArgTypesDynamicallyLoadedLocationProvider() [java.lang.NoSuchMethodException: org.apache.iceberg.TestLocationProvider$InvalidArgTypesDynamicallyLoadedLocationProvider.<init>()]
                Missing org.apache.iceberg.TestLocationProvider$InvalidArgTypesDynamicallyLoadedLocationProvider(java.lang.String,java.util.Map) [java.lang.NoSuchMethodException: org.apache.iceberg.TestLocationProvider$InvalidArgTypesDynamicallyLoadedLocationProvider.<init>(java.lang.String, java.util.Map)]

and

java.lang.IllegalArgumentException: Provided implementation for dynamic instantiation should implement interface org.apache.iceberg.io.LocationProvider.
        at org.apache.iceberg.LocationProviders.locationsFor(LocationProviders.java:59)
        at org.apache.iceberg.TestTables$TestTableOperations.locationProvider(TestTables.java:212)
        at org.apache.iceberg.BaseTable.locationProvider(BaseTable.java:199)
        at org.apache.iceberg.TestLocationProvider.testInvalidNoInterfaceDynamicallyLoadedLocationProvider(TestLocationProvider.java:190)

        Caused by:
        java.lang.ClassCastException: org.apache.iceberg.TestLocationProvider$InvalidNoInterfaceDynamicallyLoadedLocationProvider cannot be cast to org.apache.iceberg.io.LocationProvider
            at org.apache.iceberg.LocationProviders.locationsFor(LocationProviders.java:56)
            ... 3 more

@rdblue rdblue linked an issue Oct 5, 2020 that may be closed by this pull request
@rdblue rdblue merged commit 61aa677 into apache:master Oct 6, 2020
@rdblue
Copy link
Contributor

rdblue commented Oct 6, 2020

Thanks, @mickjermsurawong-stripe! I merged this. Nice work, and thank you for your patience and quick turn-around with the review comments!

@mickjermsurawong-stripe
Copy link
Contributor Author

Thank you Ryan! This will help with our internal feature :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dynamically load LocationProvider implementation
4 participants