Better programmatic generation of metadata json in tests #412

Merged

Conversation

Contributor

@QubitPi QubitPi commented Jun 23, 2017

Address issue #392

The strategy is to have all metadata specs under com.yahoo.bard.webservice.metadata extend BaseDataSourceMetadataSpec, which is the base specification for all metadata tests. Each metadata Spec overrides childSetupSpec() and calls or overrides the generate*() methods to initialize metadata-related resources.

Take "intervals" as an example:

  1. If you don't need any of the resources (including "intervals") defined in the base Spec, you do nothing.
  2. If you need "intervals" in your test, you override childSetupSpec() and call the generation method in your Spec as follows:
@Override
def childSetupSpec() {
    intervals = generateIntervals()
}
  3. If you need "intervals" in your test but with a different set of intervals, you also override generateIntervals():
@Override
def childSetupSpec() {
    intervals = generateIntervals()
}

@Override
Map<String, Interval> generateIntervals() {
    [
        "interval1": Interval.parse("2015-01-01T00:00:00.000Z/2015-01-02T00:00:00.000Z"),
        "interval2": Interval.parse("2015-01-02T00:00:00.000Z/2015-01-03T00:00:00.000Z"),
        "interval3": Interval.parse("2015-01-03T00:00:00.000Z/2015-01-04T00:00:00.000Z"),
        "interval123": Interval.parse("2015-01-01T00:00:00.000Z/2015-01-04T00:00:00.000Z")
    ]
}
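
For context, here is a minimal sketch of what the base Spec looks like under this strategy (the field set, the default interval values, and the assumption that setupSpec() delegates to childSetupSpec() are illustrative, not necessarily the exact code in this PR):

    import org.joda.time.Interval
    import spock.lang.Shared
    import spock.lang.Specification

    abstract class BaseDataSourceMetadataSpec extends Specification {

        @Shared Map<String, Interval> intervals

        def setupSpec() {
            // Child Specs hook in here and initialize only the resources they need.
            childSetupSpec()
        }

        // No-op by default; child Specs override this and call (or override) the generate*() methods.
        def childSetupSpec() {}

        // Default resource package; a child Spec overrides this for a customized set of intervals.
        Map<String, Interval> generateIntervals() {
            [
                    interval1: Interval.parse("2015-01-01T00:00:00.000Z/2015-01-02T00:00:00.000Z"),
                    interval2: Interval.parse("2015-01-02T00:00:00.000Z/2015-01-03T00:00:00.000Z")
            ]
        }
    }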

@QubitPi QubitPi added the WIP label Jun 23, 2017
@QubitPi QubitPi force-pushed the better-programmatic-generation-of-metadata-json-in-tests branch from a094a2e to d1969ef on November 16, 2017 07:10
@QubitPi QubitPi force-pushed the better-programmatic-generation-of-metadata-json-in-tests branch from 1825d9c to d722376 on November 17, 2017 06:51
@QubitPi QubitPi force-pushed the better-programmatic-generation-of-metadata-json-in-tests branch from d722376 to 607c8b2 on November 17, 2017 06:55
Contributor

@archolewa archolewa left a comment

Overall, the ideas look reasonable. Mostly comments around trying to make the Groovy code a bit cleaner, more concise, and/or easier to read.

@Shared
String version1
Integer binversion1
Contributor

What is this?

Contributor

And why is it an Integer rather than an int?

Contributor Author

@QubitPi QubitPi Nov 20, 2017

@archolewa A binary version (binVersion) is used as a required argument to construct a DataSegment. That argument is an Integer.

@Shared
DataSegment segment1

long size2
Contributor

What are these the size of?

Contributor Author

@QubitPi QubitPi Nov 20, 2017

@archolewa Same reason as binVersion above


@Shared
List<String> dimensions123
def shutdownSpec() {
Contributor

This should be cleanupSpec. Also, if we're going to have a childSetupSpec, then we should have a childCleanupSpec as well, so that people can clean up whatever they set up.
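
For illustration, the paired hooks could look roughly like this in the base Spec (a sketch only; cleanupSpec() is the standard Spock fixture method, and the child hook names follow the suggestion above):

    // Spock runs cleanupSpec() once after all features; it delegates to the child hook.
    def cleanupSpec() {
        childCleanupSpec()
    }

    // No-op by default; a child Spec overrides this to tear down whatever its childSetupSpec() created.
    def childCleanupSpec() {}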

[
"interval1": Interval.parse("2015-01-01T00:00:00.000Z/2015-01-02T00:00:00.000Z"),
"interval2": Interval.parse("2015-01-02T00:00:00.000Z/2015-01-03T00:00:00.000Z"),
"interval12": Interval.parse("2015-01-01T00:00:00.000Z/2015-01-03T00:00:00.000Z")
Contributor

You don't need to wrap quotes around String keys in Groovy:

     interval1: Interval.parse(...),
     interval2: Interval.parse(...),
     ...

version2 = DateTimeFormat.fullDateTime().print(DateTime.now())
RangeSet<DateTime> generateRangeSet() {
rangeSet = TreeRangeSet.create()
rangeSet.add(Range.closedOpen(intervals["interval12"].getStart(), intervals["interval12"].getEnd()))
Contributor

Similarly, when accessing string keys in a map, you can just write map.key:

        rangeSet.add(Range.closedOpen(intervals.interval12.getStart(), intervals.interval12.getEnd()))

Though since you're pulling out interval12 twice, I'd probably pull it out into a variable:

       Interval interval12 = intervals.interval12

Contributor Author

@QubitPi QubitPi Nov 20, 2017

@archolewa Thank you very much for the tip! I've changed this, as well as the other places where intervals are accessed, in the same way.

[
new DataSegment(
tableName,
intervals["interval1"],
Contributor

Groovy has a syntactic shortcut for accessing string keys: intervals.interval1


def shutdownSpec() {
DateTimeZone.setDefault(currentTZ)
[
Contributor

Personally, I prefer to use an explicit return statement in functions/methods that are longer than one or two lines. Makes it clearer what you're doing with that last line.
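
For example, here is a self-contained sketch of that preference applied to a method like generateRangeSet() above (the interval parameter exists only to make the snippet standalone; in the Spec the interval comes from the shared intervals map):

    import com.google.common.collect.Range
    import com.google.common.collect.RangeSet
    import com.google.common.collect.TreeRangeSet
    import org.joda.time.DateTime
    import org.joda.time.Interval

    RangeSet<DateTime> generateRangeSet(Interval interval12) {
        RangeSet<DateTime> rangeSet = TreeRangeSet.create()
        rangeSet.add(Range.closedOpen(interval12.getStart(), interval12.getEnd()))
        // Explicit return; otherwise the method's last expression is rangeSet.add(...)'s boolean result.
        return rangeSet
    }

    assert !generateRangeSet(Interval.parse("2015-01-01T00:00:00.000Z/2015-01-03T00:00:00.000Z")).isEmpty()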

@@ -170,10 +145,10 @@ class DataSourceMetadataLoadTaskSpec extends Specification {
druidWS.jsonResponse = {fullDataSourceMetadataJson}

expectedIntervalsMap = [:]
dimensions123.each {
expectedIntervalsMap.put(new DimensionColumn(dimensionDict.findByApiName(it)), [interval123] as Set)
dimensions.each {
Contributor

Rather than building an empty map, and then using an each to populate it, I would initialize it from the beginning:

expectedIntervalsMap = dimensions123
              .collect { dimensionDict.findByApiName(it) }
              .collectEntries{[(it): [interval123] as Set]} 

And then use assignment syntax for the metrics:

    metrics123
            .collect { new MetricColumn(it)}
            .each { expectedIntervalsMap[it] = [intervals.interval123] as Set }

Contributor Author

@archolewa Thank you very much for teaching me this good practice! I'll make a note of this 😉

TestApiDimensionName dim1 = TestApiDimensionName.BREED
TestApiDimensionName dim2 = TestApiDimensionName.SPECIES
TestApiDimensionName dim3 = TestApiDimensionName.SEX
List<TestApiDimensionName> generateDimensions() {
Contributor

Rather than making this a list, I would make this a Map<String, TestApiDimensionName>. Then, in extending classes, you can use dimensions.dim3.asName() rather than dimensions[2].asName(). This allows people writing specs to give their dimensions meaningful names and then use them (though admittedly the names we typically use aren't all that more meaningful than dimensions[2]).

The same applies to metrics.
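
To make that concrete, a sketch of the map-based generator being suggested (the dim1/dim2/dim3 keys are placeholders; spec authors would pick meaningful names):

    Map<String, TestApiDimensionName> generateDimensions() {
        [
                dim1: TestApiDimensionName.BREED,
                dim2: TestApiDimensionName.SPECIES,
                dim3: TestApiDimensionName.SEX
        ]
    }

    // In an extending Spec:
    // dimensions.dim3.asName()   // instead of dimensions[2].asName()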

Contributor Author

@QubitPi QubitPi Nov 20, 2017

@archolewa Beautiful! 👍 This is exactly what I was considering before looking at this comment

And thank you for letting me know that we can index a map in this smart way in Groovy: dimensions.dim3.asName()

intervals = generateIntervals()
dimensions = generateDimensions()
metrics = generateMetrics()
versions = generateVersions()
Contributor

We keep doing this in child classes. Since this simply assigns fields that are in the parent class, why not move this into the parent class?

Contributor Author

@archolewa Would you like to move just versions = generateVersions(), or all of the other generate*() calls as well?

Contributor

I meant the entire method, sorry I wasn't clear.

Contributor Author

Well, if we move all of these into the parent and (in some cases) override generateIntervals() in a child Spec, for example, then another generate*() method might throw an error. For instance:

    @Override
    Map<String, Interval> generateIntervals() {
        [
                "interval1": Interval.parse(intervalString1),
                "interval3": Interval.parse(intervalString3),
                "interval13": Interval.parse(intervalString13),
                "interval23": Interval.parse(intervalString23)
        ]
    }

Then this default in the base Spec won't work, because intervals.interval2 no longer exists:

    Map<String, DataSegment> generateSegments() {
        NumberedShardSpec partition1 = Stub(NumberedShardSpec) {
            getPartitionNum() >> 0
        }

        NumberedShardSpec partition2 = Stub(NumberedShardSpec) {
            getPartitionNum() >> 1
        }

        return [
                segment1: new DataSegment(
                        tableName,
                        intervals.interval1,
                        versions.version1,
                        null,
                        dimensions.keySet().toList(),
                        metrics.keySet().toList(),
                        partition1,
                        binaryVersions.binaryVersion1,
                        sizes.size1
                ),
                segment2: new DataSegment(
                        tableName,
                        intervals.interval1,
                        versions.version2,
                        null,
                        dimensions.keySet().toList(),
                        metrics.keySet().toList(),
                        partition2,
                        binaryVersions.binaryVersion1,
                        sizes.size2
                ),
                segment3: new DataSegment(
                        tableName,
                        intervals.interval2,
                        versions.version1,
                        null,
                        dimensions.keySet().toList(),
                        metrics.keySet().toList(),
                        partition1,
                        binaryVersions.binaryVersion1,
                        sizes.size1
                ),
                segment4: new DataSegment(
                        tableName,
                        intervals.interval2,
                        versions.version2,
                        null,
                        dimensions.keySet().toList(),
                        metrics.keySet().toList(),
                        partition2,
                        binaryVersions.binaryVersion1,
                        sizes.size2
                )
        ]
    }

Contributor

@archolewa archolewa left a comment

I meant to request changes...

@QubitPi QubitPi removed the TESTING label Nov 20, 2017
Contributor Author

QubitPi commented Nov 20, 2017

@archolewa I applied the dictionary method to all shared variables in BaseDataSourceMetadataSpec, because I think your suggestion is such a good idea that it should be applied more broadly.

In addition, in the last commit I cleaned up all metadata tests a little (organized initialization into the setup() method, applied more idiomatic Groovy syntax, etc.).

}
def setup() {
dimensions13 = [
dimensions.(TestApiDimensionName.BREED.asName()).asName(),
Contributor

I didn't know you could do that (wrap TestApiDimensionName.BREED.asName() in parentheses to use the dot syntax rather than the [] syntax).

Contributor Author

@archolewa I did that just in case people change the BREED API name in the future, so they don't break this test 😌 Does that look good to you?

dimensions.(TestApiDimensionName.BREED.asName()).asName(),
dimensions.(TestApiDimensionName.SEX.asName()).asName()
]
quotedDimensions = dimensions13.collect() { '"' + it + '"'}
Contributor

@archolewa archolewa Nov 20, 2017

Any particular reason you can't do /"$it"/? (See the Groovy docs on slashy strings.)
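
That is, something along these lines (an equivalent rewrite of the line under discussion; slashy strings interpolate $it like GStrings but let the double quotes appear unescaped):

    quotedDimensions = dimensions13.collect { /"$it"/ }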

Contributor Author

I should do that, yes. Thanks for the doc link 😉

@QubitPi QubitPi force-pushed the better-programmatic-generation-of-metadata-json-in-tests branch from e90c411 to 9751ea9 on November 22, 2017 00:42
intervals.interval1,
versions.version1,
null,
dimensions.keySet().toList(),
Collaborator

Could use the more Groovy way: dimensions.keySet() as List. Same for all the following toList() calls.
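
A tiny standalone illustration of that coercion (the map contents here are arbitrary):

    Set<String> keys = [breed: 1, sex: 2].keySet()
    List<String> keyList = keys as List   // Groovy coercion; equivalent to keys.toList()
    assert keyList == ['breed', 'sex']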

Interval interval2 = Interval.parse("2015-01-02T00:00:00.000Z/2015-01-03T00:00:00.000Z")
Interval interval3 = Interval.parse("2015-01-03T00:00:00.000Z/2015-01-04T00:00:00.000Z")
Interval interval123 = Interval.parse("2015-01-01T00:00:00.000Z/2015-01-04T00:00:00.000Z")
List<String> dimensions13
Collaborator

In general, one thing about your strategy that confuses me is that, in the original Spec, we have something like:

List<String> dimensions123 = [dim1, dim2, dim3]*.asName()		
List<String> dimensions13 = [dim1, dim3]*.asName()

Your strategy chooses to use generateDimensions() to populate the old dimensions123 as the shared variable dimensions, but dimensions13 is populated in setup(). It's a bit inconsistent to me that two similar variables from the old Spec get set in different ways.

I might be wrong, but here is my suggestion:

  1. Define all generateBlah() methods in BaseDataSourceMetadataSpec and, for each generateBlah(), populate the map with all the data that will be used across the child Specs; you can name the keys to distinguish which child Spec uses each entry:
    Map<String, Interval> generateIntervals() {
        return [
                loadTaskSpecInterval1: Interval.parse("2015-01-01T00:00:00.000Z/2015-01-02T00:00:00.000Z"),
                loadTaskSpecInterval2: Interval.parse("2015-01-02T00:00:00.000Z/2015-01-03T00:00:00.000Z"),
                loadTaskSpecInterval3: Interval.parse("2015-01-01T00:00:00.000Z/2015-01-03T00:00:00.000Z"),
                SegmentSpecInterval1: Interval.parse("2015-01-01T00:00:00.000Z/2015-01-04T00:00:00.000Z"),
                ...
        ]
    }
  2. Don't override any generateBlah() in your child Spec; then, in your setup():
def setup() {
        Interval loadTaskSpecInterval1 = intervals.loadTaskSpecInterval1
        Interval loadTaskSpecInterval2 = intervals.loadTaskSpecInterval2
        ....

With these changes you can avoid calling those generateBlah() functions over and over again in your childSetupSpec(), as mentioned by @archolewa. You don't even need a childSetupSpec(), because the parent Spec does all the population for you. 😼

Contributor Author

@QubitPi QubitPi Nov 22, 2017

dimensions123 is gone. dimensions serves as a resource container for all dimensions that are to be used in a Spec

If we initialize all the resources in the way mentioned above, we haven't solved the original problem, because we would just be changing from piling up hardcoded variables to piling up hardcoded keys. In addition, the base Spec won't become pluggable and expressive. For example, suppose Andrew wrote a child spec with a resource called "expectedInterval": "2017/2018", and Tarrant writes a separate spec that also wants a resource called "expectedInterval", but with the value "2018/2019". Tarrant would then be forced to use either a different name or a different interval value, which makes him a little unhappy, and it gets worse when variable names affect the readability and maintainability of the child Specs. Moreover, if Tarrant compromises and reuses the same "expectedInterval", and Andrew then assigns expectedInterval a different value in his test logic, Tarrant will be angry to see his test break because someone outside of his test "polluted" his testing resources (this seriously undermines the decoupling of two separate classes). In the end, the base spec would again take a long time to load and initialize, and become hard to read and hard to maintain as more child specs are written, which is exactly the original problem we are trying to solve.

The reason for not generating all these resources up front, and instead having children extend and override, is that if you override generateIntervals(), for example, generateSegments() may throw errors in other Specs, since segments depend on the generated intervals. What we have in this new paradigm is a default resource package; if you want a customized package (intervals + dimensions + segments), you define your own clean world: your resources won't be seen by others, won't affect others, and others' customizations won't have any impact on you either.

Collaborator

Sounds reasonable

Contributor Author

QubitPi commented Nov 22, 2017

Thanks @archolewa !
Thanks @tarrantzhang for helping to review late at night 👍 !

@QubitPi QubitPi merged commit 888dd8c into master Nov 22, 2017
@QubitPi QubitPi deleted the better-programmatic-generation-of-metadata-json-in-tests branch November 22, 2017 07:24