fix injection failure of StorageLocationSelectorStrategy objects #10363

Merged
merged 10 commits into from
Dec 8, 2020

FrankChen021
Member

This PR fixes #10348, which is caused by an injection failure of the storage location selector strategy.

Description

Currently, all four implementations of StorageLocationSelectorStrategy require a list of StorageLocation objects at construction time. StorageLocation is not the same as StorageLocationConfig: the latter is deserialized from the configuration file, while the former is instantiated by SegmentLoaderLocalCacheManager. As a result, StorageLocation objects cannot be injected into a StorageLocationSelectorStrategy while it is being constructed.

In this PR,

  1. the ctors of the StorageLocationSelectorStrategy implementations are removed; instead, the interface provides a setter for the storage locations so that SegmentLoaderLocalCacheManager can pass the objects to the strategy object

  2. based on the original code, the configuration property should be druid.segmentCache.locationSelectorStrategy.type; the related docs are updated accordingly

  3. some unit test cases are added to check whether these strategy objects are correctly instantiated from configuration properties.
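
To make the first point concrete, here is a minimal sketch of the setter-based hand-off, assuming simplified stand-ins for the Druid classes. Location, SelectorStrategy, LeastBytesUsed, CacheManager and the setter name setStorageLocations are all hypothetical; the real interface and its method names may differ.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SetterInjectionSketch {
    /** Simplified stand-in for StorageLocation: a path plus the bytes currently used. */
    static final class Location {
        final String path;
        final long usedBytes;

        Location(String path, long usedBytes) {
            this.path = path;
            this.usedBytes = usedBytes;
        }
    }

    /** Stand-in for StorageLocationSelectorStrategy; the setter name is hypothetical. */
    interface SelectorStrategy {
        void setStorageLocations(List<Location> locations);

        Location select();
    }

    /** No-arg construction, so a deserializer could create it before any locations exist. */
    static final class LeastBytesUsed implements SelectorStrategy {
        private List<Location> locations;

        @Override
        public void setStorageLocations(List<Location> locations) {
            this.locations = locations;
        }

        @Override
        public Location select() {
            if (locations == null || locations.isEmpty()) {
                throw new IllegalStateException("storage locations have not been set");
            }
            return locations.stream()
                .min(Comparator.comparingLong((Location l) -> l.usedBytes))
                .get();
        }
    }

    /** Stand-in for the cache manager, which owns the locations and hands them over. */
    static final class CacheManager {
        final SelectorStrategy strategy;

        CacheManager(SelectorStrategy strategy, List<Location> locations) {
            strategy.setStorageLocations(locations);
            this.strategy = strategy;
        }
    }

    public static void main(String[] args) {
        List<Location> locations = Arrays.asList(
            new Location("/mnt/a", 500L),
            new Location("/mnt/b", 100L)
        );
        CacheManager manager = new CacheManager(new LeastBytesUsed(), locations);
        System.out.println(manager.strategy.select().path);
    }
}
```

In this sketch the strategy can be created with no arguments, and it only becomes usable once the cache manager has passed in the locations.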


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Member

@asdf2014 asdf2014 left a comment

LGTM 👍

@jihoonson
Contributor

Tagged "Design Review" since it changes user-facing configurations.

Contributor

@suneet-s suneet-s left a comment

I have an alternate approach recommended that I think will be more robust.

}

@VisibleForTesting
LeastBytesUsedStorageLocationSelectorStrategy(List<StorageLocation> storageLocations)
Contributor

I think this approach is brittle and might break when a new implementation of StorageLocationSelectorStrategy is added. Instead, I think we should make List<StorageLocation> storageLocations injectable via a module, something like

@Provides
@Singleton
public List<StorageLocation> provideStorageLocations(SegmentLoaderConfig config)
{
  // Build one StorageLocation per configured location so the list can be injected.
  final List<StorageLocation> locations = new ArrayList<>();
  for (StorageLocationConfig locationConfig : config.getLocations()) {
    locations.add(
        new StorageLocation(
            locationConfig.getPath(),
            locationConfig.getMaxSize(),
            locationConfig.getFreeSpacePercent()
        )
    );
  }
  return locations;
}

Member Author

Good idea.

Member Author

Hi @suneet-s, I checked the code and found that this approach is a bit complex for the existing code.

The storage selector strategy object is constructed during deserialization of SegmentLoaderConfig, and only after SegmentLoaderConfig has been constructed can it be injected into other objects to obtain the list of StorageLocationConfig it holds.

If we want to inject StorageLocation objects the way you suggest, the strategy object must be moved out of SegmentLoaderConfig into a new config class so that both SegmentLoaderConfig and this new config class can be injected into SegmentLoaderLocalCacheManager. Many test cases would have to change to match the new ctor of SegmentLoaderLocalCacheManager, so I think this change involves more complexity.

Back to the concern you mentioned: I don't think we need to worry that a new implementation would break the constraint, because if it does, the problem would be easily detected by unit or integration tests.

What do you think?

@@ -1379,7 +1379,7 @@ These Historical configurations can be defined in the `historical/runtime.proper
|Property|Description|Default|
|--------|-----------|-------|
|`druid.segmentCache.locations`|Segments assigned to a Historical process are first stored on the local file system (in a disk cache) and then served by the Historical process. These locations define where that local cache resides. This value cannot be NULL or EMPTY. Here is an example `druid.segmentCache.locations=[{"path": "/mnt/druidSegments", "maxSize": "10k", "freeSpacePercent": 1.0}]`. "freeSpacePercent" is optional, if provided then enforces that much of free disk partition space while storing segments. But, it depends on File.getTotalSpace() and File.getFreeSpace() methods, so enable if only if they work for your File System.| none |
-|`druid.segmentCache.locationSelectorStrategy`|The strategy used to select a location from the configured `druid.segmentCache.locations` for segment distribution. Possible values are `leastBytesUsed`, `roundRobin`, `random`, or `mostAvailableSize`. |leastBytesUsed|
+|`druid.segmentCache.locationSelectorStrategy.type`|The strategy used to select a location from the configured `druid.segmentCache.locations` for segment distribution. Possible values are `leastBytesUsed`, `roundRobin`, `random`, or `mostAvailableSize`. |leastBytesUsed|
Contributor

Is this a doc fix? I don't see an associated change in the code. Am I missing something?

Member Author

It's a fix. In the original code, the annotation on StorageLocationSelectorStrategy specifies that the name of the JSON property is type. When druid.segmentCache.locationSelectorStrategy is used, Jackson tries to find a ctor with a String parameter, which causes the issue.

@FrankChen021
Member Author

FrankChen021 commented Sep 9, 2020

Hi @jihoonson @suneet-s, here's a clarification of the change to the user-facing configuration.

This PR does not change the configuration name in code; it corrects the doc, which gave the wrong configuration name for the selector strategy, druid.segmentCache.locationSelectorStrategy, to druid.segmentCache.locationSelectorStrategy.type.

When I first looked into the issue, the doc confused me too, and it took me some time to work out which name was right.

Looking at the annotation on StorageLocationSelectorStrategy, it is bound to a property named type, which determines which class is used during deserialization.

@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl =
    LeastBytesUsedStorageLocationSelectorStrategy.class)
@JsonSubTypes(value = {

Based on the code, the right configuration name should be druid.segmentCache.locationSelectorStrategy.type. The old implementation always fails to deserialize the strategy object in the default configuration but never throws an NPE, because its getter initializes a default strategy object when it finds the strategy object is null.

I don't know why druid.segmentCache.locationSelectorStrategy was put into the doc. Take another class, BalancerStrategyFactory, for example: its doc corresponds to its property name in code.

@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "strategy", defaultImpl = CostBalancerStrategyFactory.class)
@JsonSubTypes(value = {
        @JsonSubTypes.Type(name = "diskNormalized", value = DiskNormalizedCostBalancerStrategyFactory.class),
        @JsonSubTypes.Type(name = "cost", value = CostBalancerStrategyFactory.class),
        @JsonSubTypes.Type(name = "cachingCost", value = CachingCostBalancerStrategyFactory.class),
        @JsonSubTypes.Type(name = "random", value = RandomBalancerStrategyFactory.class),
})
public interface BalancerStrategyFactory
{

JsonConfigProvider.bind(binder, "druid.coordinator.balancer", BalancerStrategyFactory.class);

And the doc says the property name should be druid.coordinator.balancer.strategy instead of druid.coordinator.balancer.
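
One way to picture the mismatch is a small sketch that looks up the type discriminator as a key beneath a bound prefix. This is only a simplified stand-in written with java.util.Properties; the class PrefixBindingSketch and the helper subProperties are hypothetical, and Druid's JsonConfigProvider together with Jackson does considerably more than this.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

public class PrefixBindingSketch {
    /**
     * Collects the properties below a prefix into a map keyed by the remaining
     * path. Simplified stand-in for prefix-based config binding; not Druid's code.
     */
    static Map<String, String> subProperties(Properties props, String prefix) {
        Map<String, String> result = new TreeMap<>();
        String dotted = prefix + ".";
        for (String name : props.stringPropertyNames()) {
            if (name.startsWith(dotted)) {
                result.put(name.substring(dotted.length()), props.getProperty(name));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String prefix = "druid.segmentCache.locationSelectorStrategy";

        // The name as originally documented: a bare value on the prefix itself.
        Properties documented = new Properties();
        documented.setProperty(prefix, "roundRobin");

        // The name implied by the annotation: a "type" key below the prefix.
        Properties annotated = new Properties();
        annotated.setProperty(prefix + ".type", "roundRobin");

        System.out.println(subProperties(documented, prefix).get("type")); // prints "null"
        System.out.println(subProperties(annotated, prefix).get("type"));  // prints "roundRobin"
    }
}
```

With the documented name there is no type key for the discriminator to read, which matches the observation above that only the .type form lines up with the annotation.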

@FrankChen021
Member Author

@jihoonson @suneet-s I've come up with an idea that requires no change to the existing configuration.

  1. bind the configuration path druid.segmentCache to StorageLocationSelectorStrategy directly
  2. change the value of property in the JsonTypeInfo annotation of StorageLocationSelectorStrategy from type to locationSelectorStrategy

The disadvantage is that any properties added by future implementations of StorageLocationSelectorStrategy would all be placed under druid.segmentCache, mixed in with the properties of SegmentLoaderConfig. This approach might cause some confusion.

What do you think?

@suneet-s
Contributor

@FrankChen021 I'll look through these changes over the next couple of days and get back to you. Thanks for the fix and your patience :)

@FrankChen021
Member Author

The latest CI reports that

Error: found 1 missing licenses. These licenses are reported, but missing in the registry
druid_module: core, groupId: com.fasterxml.jackson.module, artifactId: jackson-module-guice, version: 2.10.2, license: Apache License version 2.0

jackson-module-guice was added to druid-server because the dependency check failed in the previous CI run.

I don't know how to handle it. Do you have any ideas? @suneet-s @asdf2014

@jihoonson
Contributor

jihoonson commented Sep 17, 2020

The latest CI reports that

Error: found 1 missing licenses. These licenses are reported, but missing in the registry
druid_module: core, groupId: com.fasterxml.jackson.module, artifactId: jackson-module-guice, version: 2.10.2, license: Apache License version 2.0

jackson-module-guice was added to druid-server because the dependency check failed in the previous CI run.

I don't know how to handle it. Do you have any ideas? @suneet-s @asdf2014

This is because of a mismatch between the version used in pom.xml and the one registered in licenses.yaml. I'm not sure why we haven't seen this error before, but you can fix it by removing this entry and adding jackson-module-guice here.

Other CI failures look unrelated. I just restarted them.

@@ -54,9 +54,6 @@
@JsonProperty("numBootstrapThreads")
private Integer numBootstrapThreads = null;

-@JsonProperty("locationSelectorStrategy")
-private StorageLocationSelectorStrategy locationSelectorStrategy;
Contributor

I think the previous implementation is better since you can use the configuration name druid.segmentCache.locationSelectorStrategy without type. Is there a reason that locationSelectorStrategy cannot be here?

Member Author

This is because StorageLocationSelectorStrategy depends on SegmentLoaderConfig.locations. If StorageLocationSelectorStrategy is placed here, then while SegmentLoaderConfig is being deserialized, the locations required by StorageLocationSelectorStrategy can't be found and injected into it.

After moving this property out of SegmentLoaderConfig, both SegmentLoaderConfig and StorageLocationSelectorStrategy are deserialized during construction of SegmentLoaderLocalCacheManager, and Jackson can find the location objects required by the strategy object through Guice and create the strategy object correctly.

I think using the configuration name without type is wrong. Please take a look at the configuration druid.coordinator.balancer.strategy, or at the clarification of the configuration name change I left above.

Contributor

Got it. Then, how about using naming similar to the coordinator balancer? For example, we can bind StorageLocationSelectorStrategy to druid.segmentCache.locationSelector, and use strategy as the property name for StorageLocationSelectorStrategy instead of type.

Member Author

Sorry for the late reply.

Your suggestion makes the naming more meaningful. I'll update this PR later today.
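
For reference, with the naming agreed on here the Historical configuration would presumably look like the fragment below. The locations value is copied from the documentation example quoted earlier in this thread, and roundRobin is just one of the four documented values.

```properties
druid.segmentCache.locations=[{"path": "/mnt/druidSegments", "maxSize": "10k", "freeSpacePercent": 1.0}]
druid.segmentCache.locationSelector.strategy=roundRobin
```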

Contributor

@jihoonson jihoonson left a comment

LGTM. @FrankChen021 thank you!

@jihoonson
Contributor

@suneet-s do you have more comments?

@@ -52,6 +55,7 @@ public void configure(Binder binder)
{
JsonConfigProvider.bind(binder, "druid.server", DruidServerConfig.class);
JsonConfigProvider.bind(binder, "druid.segmentCache", SegmentLoaderConfig.class);
+JsonConfigProvider.bind(binder, "druid.segmentCache.locationSelector.strategy", StorageLocationSelectorStrategy.class);
Contributor

Should this be druid.segmentCache.locationSelector instead?

Member Author

Oh sorry, this is a bug. I remember I tested it in my cluster, maybe I missed something then. The code has been updated in the latest commit.

binder.bind(Properties.class).toInstance(props);

JsonConfigProvider.bind(binder, "druid.segmentCache", SegmentLoaderConfig.class);
JsonConfigProvider.bind(binder, "druid.segmentCache.locationSelector", StorageLocationSelectorStrategy.class);
Contributor

This binding is different from what the actual module binds. I think this is why we missed the wrong binding. Can we use the same StorageNodeModule here? Or can we add a helper method that does the proper binding for both StorageNodeModule and tests?

Member Author

StorageNodeModule is not used here because it has some extra dependencies, which would introduce a lot of injection code in the test cases. A helper method has been extracted to do the correct binding.

@suneet-s
Contributor

suneet-s commented Oct 7, 2020

@FrankChen021 Can you describe how you tested this in your test cluster? What was the user-visible behavior before and after this change, so that we can update the release notes if needed?

I didn't see any integration tests added for this change, so it's possible that someone might break this behavior again in the future - would it be possible to add integration tests?

@FrankChen021
Member Author

Hi @suneet-s, in this PR a log statement is added to the ctor of SegmentLoaderLocalCacheManager to print the class name of the strategy object. The log shows the following if druid.segmentCache.locationSelector.strategy is set to roundRobin:

2020-10-07T15:22:55,251 INFO [main] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Using storage location strategy: [RoundRobinStorageLocationSelectorStrategy]

This way, I know whether the configuration takes effect.

As for tests, I've added some unit test cases that check whether this configuration takes effect by setting druid.segmentCache.locationSelector.strategy to different values. I think these cases will guard our code against future modifications.
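
The kind of check described above can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the tests from this PR: the registry, the fromProperties helper and the four empty strategy classes are hypothetical stand-ins, and only the property name and the four strategy names come from this thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.function.Supplier;

public class StrategySelectionSketch {
    /** Hypothetical stand-ins for the four strategy implementations. */
    interface SelectorStrategy {}
    static final class LeastBytesUsed implements SelectorStrategy {}
    static final class RoundRobin implements SelectorStrategy {}
    static final class RandomStrategy implements SelectorStrategy {}
    static final class MostAvailableSize implements SelectorStrategy {}

    static final String PROPERTY = "druid.segmentCache.locationSelector.strategy";

    /** Maps each configured name to a factory for the matching implementation. */
    private static final Map<String, Supplier<SelectorStrategy>> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put("leastBytesUsed", LeastBytesUsed::new);
        REGISTRY.put("roundRobin", RoundRobin::new);
        REGISTRY.put("random", RandomStrategy::new);
        REGISTRY.put("mostAvailableSize", MostAvailableSize::new);
    }

    /** Resolves the strategy from properties, defaulting to leastBytesUsed. */
    static SelectorStrategy fromProperties(Properties props) {
        String name = props.getProperty(PROPERTY, "leastBytesUsed");
        Supplier<SelectorStrategy> factory = REGISTRY.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("unknown strategy: " + name);
        }
        return factory.get();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Nothing configured: the default applies.
        System.out.println("(unset) -> " + fromProperties(props).getClass().getSimpleName());
        for (String name : new String[] {"leastBytesUsed", "roundRobin", "random", "mostAvailableSize"}) {
            props.setProperty(PROPERTY, name);
            System.out.println(name + " -> " + fromProperties(props).getClass().getSimpleName());
        }
    }
}
```

A test built this way asserts on the concrete class returned for each configured value, which is the same signal the log line above exposes at startup.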

@FrankChen021
Member Author

Hi @jihoonson @suneet-s , CI reports that compaction integration test fails while all other checks are ok. I don't know why, could you help me take a look at it?

@suneet-s
Contributor

suneet-s commented Oct 8, 2020

Hi @jihoonson @suneet-s , CI reports that compaction integration test fails while all other checks are ok. I don't know why, could you help me take a look at it?

@FrankChen021 It appears these tests are flaky. I re-triggered them and the job passed - could you file an issue for this flaky test please?

Contributor

@jihoonson jihoonson left a comment

LGTM. I tested the changed property on my local machine and it works.

@FrankChen021
Member Author

Hi @suneet-s , do you have any further comments ?

@suneet-s
Contributor

Hi @suneet-s , do you have any further comments ?

@FrankChen021 Sorry for the delay in this review. Since there are no integration tests, I'm trying to see what would happen on an upgrade, and whether or not this change would introduce a change in behavior. If you have integration tests that prove nothing has changed on the upgrade path, it would make this review a lot faster. Thanks for your patience

@FrankChen021
Member Author

@suneet-s Got it. I'm not familiar with ITs. One thing I don't understand is what we should expect from the integration tests. Is it to verify that the right type of selector object is injected for the corresponding configuration?

@suneet-s
Contributor

@suneet-s Got it. I'm not familiar with ITs. One thing I don't understand is what we should expect from the integration tests. Is it to verify that the right type of selector object is injected for the corresponding configuration?

@FrankChen021 I think the ITs should test that Druid uses the StorageLocationSelector that was configured by the user by looking at the user visible impact of changing this setting. So for example: A user who sets this to mostAvailableSize should see that segments are loaded on the historical with the most available size instead of the default which is least bytes used. Constructing a scenario where this behavior difference should occur may be tricky. I haven't had the time to think about how to construct this scenario yet. I will get back to you on my tests today. Thanks again for your patience
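
A scenario of the kind described here can be sketched with two locations on which the two policies disagree. This is only an illustration with hypothetical stand-ins (Location, leastBytesUsed, mostAvailableSize as plain static methods), reduced to picking a single first choice; the real strategies may differ in detail.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class StrategyDifferenceSketch {
    /** Simplified stand-in for a storage location with a size limit and current usage. */
    static final class Location {
        final String path;
        final long maxSize;
        final long usedBytes;

        Location(String path, long maxSize, long usedBytes) {
            this.path = path;
            this.maxSize = maxSize;
            this.usedBytes = usedBytes;
        }

        long availableBytes() {
            return maxSize - usedBytes;
        }
    }

    /** First choice when preferring the location with the fewest bytes in use. */
    static Location leastBytesUsed(List<Location> locations) {
        return locations.stream()
            .min(Comparator.comparingLong((Location l) -> l.usedBytes))
            .get();
    }

    /** First choice when preferring the location with the most free space. */
    static Location mostAvailableSize(List<Location> locations) {
        return locations.stream()
            .max(Comparator.comparingLong(Location::availableBytes))
            .get();
    }

    public static void main(String[] args) {
        // "small" has fewer bytes used (10 < 50) but far less free space (90 < 950).
        List<Location> locations = Arrays.asList(
            new Location("/mnt/small", 100L, 10L),
            new Location("/mnt/large", 1000L, 50L)
        );
        System.out.println("leastBytesUsed    -> " + leastBytesUsed(locations).path);
        System.out.println("mostAvailableSize -> " + mostAvailableSize(locations).path);
    }
}
```

Locations of unequal capacity are what make the two choices diverge here; with equally sized locations both policies would pick the same one.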

@FrankChen021
Member Author

@FrankChen021 I think the ITs should test that Druid uses the StorageLocationSelector that was configured by the user by looking at the user visible impact of changing this setting. So for example: A user who sets this to mostAvailableSize should see that segments are loaded on the historical with the most available size instead of the default which is least bytes used. Constructing a scenario where this behavior difference should occur may be tricky. I haven't had the time to think about how to construct this scenario yet. I will get back to you on my tests today. Thanks again for your patience

I see. It's an end-to-end test case. It is usually tricky to set up an environment for such a case.

I also studied some IT cases and found that they send HTTP requests to nodes to verify whether they run successfully. But historical nodes expose no such interface for observing the behavior of the selector strategy. As you say, it's a little tricky.

@FrankChen021
Member Author

Hi @suneet-s, is this PR ready to be merged, or is there anything I need to do?

@suneet-s
Contributor

@FrankChen021 - I got pulled in to some other issues and haven't had the time to look at this yet. I will look at this over the weekend and get back to you.

Contributor

@suneet-s suneet-s left a comment

LGTM thanks for the tests and fix @FrankChen021 - sorry for the delay in reviewing this.

I'm happy to merge the fix as soon as the conflict is resolved.

@FrankChen021
Member Author

Hi @suneet-s, the conflict has been resolved by rebasing the branch onto the latest master. CI failed, though; I checked, and the failure doesn't seem to be related to the changes in this PR. Could you re-trigger it to see if it passes?

@suneet-s suneet-s merged commit c410648 into apache:master Dec 8, 2020
@suneet-s
Contributor

suneet-s commented Dec 8, 2020

Thanks @FrankChen021 - looks like someone else beat me to the re-trigger.

@jihoonson jihoonson added this to the 0.21.0 milestone Jan 4, 2021
JulianJaffePinterest pushed a commit to JulianJaffePinterest/druid that referenced this pull request Jan 22, 2021
…che#10363)

* fix to allow customer storage location selector strategy

* add test cases to check instance of selector strategy

* update doc

* code format

* resolve code review comments

* inject StorageLocation

* fix CI

* fix mismatched license item reported by CI

* change property path from druid.segmentCache.locationSelectorStrategy.type to druid.segmentCache.locationSelector.strategy

* using a helper method to bind to correct property path
@FrankChen021 FrankChen021 deleted the bug_10348 branch March 9, 2021 13:05
Successfully merging this pull request may close these issues.

druid.segmentCache.locationSelectorStrategy config raise Exception
4 participants