Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove restriction for single physical dimension to multiple lookup dimension #112

Merged
merged 5 commits into from
Nov 30, 2016

Conversation

garyluoex
Copy link
Collaborator

@garyluoex garyluoex commented Nov 29, 2016

  • Modified physical dimension name to logical dimension name mapping into a Map<String, List<String>> instead of Map<String, String> in PhysicalTable.java

@garyluoex garyluoex changed the title Remove restriction for single physcial dimension to multiple lookup dimension Remove restriction for single physical dimension to multiple lookup dimension Nov 29, 2016
@@ -38,7 +40,7 @@
private Map<Column, Set<Interval>> workingIntervals;
private final Object mutex = new Object();
private final Map<String, String> logicalToPhysicalColumnNames;
private final Map<String, String> physicalToLogicalColumnNames;
private final Map<String, List<String>> physicalToLogicalColumnNames;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason we're making this a List rather than a Collection?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Set fit better (since it should be a unique set of logical dimensions)? I'm not sure what are the advantages of just using Collection, fill me in if you have time, thanks!

Copy link
Contributor

@archolewa archolewa Nov 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's really more about avoiding overspecification. The places we use it only needs a Collection (we're iterating over every element). And at least in the current implementation we don't seem overly concerned with uniqueness. Though having the same dimension twice could lead to weirdness if addNewDimensionColumn isn't idempotent.

I'm not too concerned about it, it's just that we usually default to Collection unless we have a good reason for using a more specific type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to have any headaches around nested types. Let's just use set for now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will make it Set since it makes sense to add a uniqueness constraint and doesn't make sense that there are duplicates.

continue;
}
DimensionColumn dimensionColumn = DimensionColumn.addNewDimensionColumn(this, dimension);
workingIntervals.put(dimensionColumn, nameIntervals.getValue());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can streamify this easily enough, I think:

 getLogicalColumnName(physicalName).stream()
     .map(dimensionDictionary::findByApiName)
     .filter(dimension -> dimension != null)
     .forEach(dimension -> {
           DimensionColumn dimensionColumn = DimensionColumn.addNewDimensionColumn(this, dimension);
           workingIntervals.put(dimensionColumn, nameIntervals.getValue()); 
     });

Note that you could put DimensionColumn.addNewDimensionColumn(this, dimension) in a map rather than the first statement of a forEach, but addNewDimensionColumn has side-effects, and side-effecting operations in a map operation feels ugly to me.

@@ -237,8 +244,8 @@ public String getPhysicalColumnName(String logicalName) {
* @param physicalName Physical name to lookup in physical table
* @return Translated physicalName if applicable
*/
private String getLogicalColumnName(String physicalName) {
return physicalToLogicalColumnNames.getOrDefault(physicalName, physicalName);
private List<String> getLogicalColumnName(String physicalName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this could probably be a Collection rather than a List.

.entrySet()
.stream()
.collect(StreamUtils.toLinkedMap(Map.Entry::getValue, Map.Entry::getKey))
this.logicalToPhysicalColumnNames.entrySet().stream().sequential().collect(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think removing the sequential() here is fine right? Will remove it in the next commit if no one complains.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe sequential just forces the stream to be sequential, even if it was originally parallel. However, by default the stream of a set is sequential, so sequential doesn't actually do anything.

So yes, I would suggest removing the sequential.

@michael-mclawhorn
Copy link
Contributor

The fact that this didn't break a test worries me. Please add a test to see that the correct set of logical column names is returned from the method in both the single-dimension-to-physical-name situation and the multiple-dimension-to-physical-name situation.

private List<String> getLogicalColumnName(String physicalName) {
return physicalToLogicalColumnNames.getOrDefault(physicalName, Arrays.asList(physicalName));
private Set<String> getLogicalColumnName(String physicalName) {
return physicalToLogicalColumnNames.getOrDefault(physicalName, new HashSet<>(Arrays.asList(physicalName)));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pluralize this method name. getLogicalColumnNames

@@ -14,6 +14,9 @@ Current

### Changed:

- [Remove restriction for single physical dimension to multiple lookup dimensions](https://github.com/yahoo/fili/pull/112)
* Modified physical dimension name to logical dimension name mapping into a `Map<String, List<String>>` instead of `Map<String, String>` in `PhysicalTable.java`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep in synch with the code change.

workingIntervals.put(dimensionColumn, nameIntervals.getValue());

getLogicalColumnName(physicalName).stream()
.sequential()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this is extra again, will remove in the next commit again.

Copy link
Collaborator

@cdeszaq cdeszaq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing big, just a few trivial cleanups. 👍


getLogicalColumnNames(physicalName).stream()
.map(dimensionDictionary::findByApiName)
.filter(dimension -> dimension != null)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Objects::isNotNull simplifies this I think

.filter(dimension -> dimension != null)
.forEach(
dimension -> {
DimensionColumn dimensionColumn = DimensionColumn.addNewDimensionColumn(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a .map() call on the stream, since you're just transforming the Dimension into a DimensionColumn

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the operations will cause side-affects right (adding things to existing things)? so if the stream runs in parallel things will get synchronization problem. forEach is a reduce operation that allows side-affects I believe. Let me know if I'm wrong or not making sense, thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm with @garyluoex on this one. It's not a big deal, especially since we don't run around randomly making streams parallel (and we shouldn't, not without carefully examining the operations in the stream first), but it just feels deeply and fundamentally wrong to me to put a side-effectful operation in a map.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To elaborate on why it feels wrong to me, I generally prefer to be hit across the head with a big neon sign when an operation has side-effects, especially when the operation returns a value as well. forEach does that, while a map operation does not.

def "test physical to logical mapping is constructed correctly"() {
setup:
def oneDimPhysicalTable = new PhysicalTable("test table", DAY.buildZonedTimeGrain(UTC), ['dimension':'druidDim'])
def twoDimPhysicalTable = new PhysicalTable("test table", DAY.buildZonedTimeGrain(UTC), ['dimension1':'druidDim', 'dimension2':'druidDim'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we have good reason to be taking advantage of Groovy's duck-typing, we should be explicitly specifying the types of the variables.

Copy link
Collaborator Author

@garyluoex garyluoex Nov 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make things shorter and simpler to read since the type should be obvious from the assignment initialization in this case (it feels like it is more intuitive to me), let me know if there are any consequences of doing this or it is just good practice to declare type?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least in the context of our project, it's considered good practice (I don't know enough about the larger Groovy community to know if that's true more generally).

If nothing else, it improves IDE support, because the IDE knows what the variable type is.

It can also make things more readable, because it's easy for readers to see what the variable type this (though this can be partially offset because def is more concise, so which is more readable is going to be a matter of opinion).

Duck typing has its advantages, but it has its disadvantages too. Here we're not using any of the advantages, but we're still saddled with the disadvantages.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok will change it then, thanks!

private String getLogicalColumnName(String physicalName) {
return physicalToLogicalColumnNames.getOrDefault(physicalName, physicalName);
private Set<String> getLogicalColumnNames(String physicalName) {
return physicalToLogicalColumnNames.getOrDefault(physicalName, new HashSet<>(Arrays.asList(physicalName)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use Collections.singleton.

.map(dimensionDictionary::findByApiName)
.filter(Objects::nonNull)
.forEach(
dimension -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indentation seems off here. dimension -> should be indented eight spaces, and everything inside the lambda should be indented four spaces past that.

@archolewa
Copy link
Contributor

Had a few small comments, 👍 once they've been addressed.

@archolewa archolewa merged commit 35b1557 into master Nov 30, 2016
@archolewa archolewa deleted the ManyPhysical branch November 30, 2016 17:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants