Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

canonical-data.json: Shared data among test cases #699

Closed
rbasso opened this issue Mar 11, 2017 · 7 comments
Closed

canonical-data.json: Shared data among test cases #699

rbasso opened this issue Mar 11, 2017 · 7 comments

Comments

@rbasso
Copy link
Contributor

rbasso commented Mar 11, 2017

While working in #625, we discovered that some canonical-data.json files have data shared among test cases:

  • grep - Has a files object that encodes files to be "grepped" in multiples test cases.
  • pov - Has a trees object shared by multiple test cases, referenced by their names.

The current JSON Schema doesn't allow that, so I plan to sidestep this issue putting the offending structures in comments, temporarily, until we decide how deal with that.

I see a few option:

  • Replicate the information where it is needed: This has the advantage of keeping all test cases independent, but it hurts to ask a programmer to do it.
  • Change the schema to allow a resources/shared structure: This would avoid repetition, but also increase the complexity of test generators significantly.
  • Move from JSON to YAML: Using YAML anchors would be a clean and easy solution, but YAML is not well-supported as JSON, and that may be a problem for test generators in some languages.
  • Change the offending exercises, so that this feature isn't needed: The simple, trustworthy non-solution.

Any other option?

@petertseng
Copy link
Member

petertseng commented Mar 11, 2017

Doh. And I know that I was at least partially involved in both:


Let me pretend for a moment that we have decided to accept duplication in pov/canonical-data.json, and see what that looks like.

$ grep '"tree"' pov/canonical-data.json | sort | uniq -c                                             
      6         "tree": "cousins",
      1         "tree": "deeply_nested",
      1         "tree": "kids",
      2         "tree": "large_flat",
      2         "tree": "simple",
      2         "tree": "singleton",

So we won't care if deeply_nested and kids go away, and we probably can survive repeating simple and large_flat. cousins would hurt. What are the cases that use it?

  • Can reroot a tree with cousins: All right, I guess we have to keep it here.
  • Errors if target does not exist in a large tree: we probably do not have to use cousins for that. We can use a smaller tree that is less painful to write out in full.
  • Can find path to cousin: OK, I guess we have to keep it here as well.
  • Can find path from nodes other than x: we most definitely do not have to use cousins for that. We can use a smaller tree that is less painful to write out in full.
  • Errors if destination does not exist: same; doesn't require cousins.
  • Errors if source does not exist: same; doesn't require cousins.

In general I was having some doubts about whether the extra indirection of "tree": was making the canonical-data.json file difficult to understand for humans (a test generator obviously would have no problem, given that it has been taught about how to deal with the indirection.

Given this, I could support removing the indirection in "pov" (using smaller trees instead of cousins where possible)


grep... what would someone think about just placing the three files (iliad.txt, midsummer-night.txt, paradise-lost.txt) as normal .txt files in this repo rather than embedding them in JSON? That would remove the need for the files section.

Alternatively, perhaps we think it is acceptable to embed smaller files within each individual case. It could be that we don't need to use those exact three files for all our tests. Being able to tailor the files to each test might help each test be a better check of the respective behaviours.


Even if we think we can solve the problem with these two exercises today, what if another exercise comes along that does in fact need the shared data? Will we accept then that the schema will have to have it, and the test generators will have to deal with it?

I also see that it will be pretty rare to have exercises that need this, which means that a test generator that simply doesn't understand this section will only miss out on a handful of exercises, and maybe that is acceptable for them.

However, it seems the meaning of such a section will be very exercise-dependent. That is surely a negative.

@rbasso
Copy link
Contributor Author

rbasso commented Mar 11, 2017

Good analysis, @petertseng.

grep... what would someone think about just placing the three files (iliad.txt, midsummer-night.txt, paradise-lost.txt) as normal .txt files in this repo rather than embedding them in JSON?

Seems a good solution for this exercise.

Even if we think we can solve the problem with these two exercises today, what if another exercise comes along that does in fact need the shared data?

I'm trying to find an example where this is indeed needed.

Maybe that is a canonical-data smell. It could be the case that, instead of carefully designing data to test a specific thing, we are saving work by utilizing a bigger, shared data structure.

Makes sense?

@ErikSchierboom
Copy link
Member

grep... what would someone think about just placing the three files (iliad.txt, midsummer-night.txt, paradise-lost.txt) as normal .txt files in this repo rather than embedding them in JSON?

Seems like a very reasonable solution.

Given this, I could support removing the indirection in "pov" (using smaller trees instead of cousins where possible)

I think smaller trees could actually be beneficial, as they can help people better focus on what the actual problem is instead of figuring out the structure.

All in all, I think the benefit of reducing duplication by introducing a new, shared data concept does not outweight its advantages. With the existing tests, I feel that we are better off just accepting the duplication, and when possible, make the duplication less by cutting away unnecessary bits of data.

@rbasso
Copy link
Contributor Author

rbasso commented Mar 12, 2017

pov is solved, so the only thing we need to decide now is grep.

If we think that it is not worthy to patch the schema, we have two three two options:

  • Add the text files to x-common.
  • Duplicate the data and maybe simplify, as we did with pov.
  • Just put the files section in the comments, and leave it as it is. (edit)

I just checked and there is no exercise with additional files in x-common, which is a nice regularity. So, I'm leaning if favor of the second third option to avoid creating future problems.

Edit: The second option generates a huge file. It seems to make more sense to change as little as possible now and leave the problem for a future redesign of this exercise. Is anyone using a test generator for it?!

@ErikSchierboom
Copy link
Member

@rbasso I'm in favor of the third option. This is a bit of an odd one, but I don't think we can fully encode all edge-cases without getting into trouble somewhere. Better to have one exception, than to make things more complex for maintaining (2) or less obvious (1).

@rbasso
Copy link
Contributor Author

rbasso commented Mar 13, 2017

Is there something else to be decide here or can we close it?

@petertseng
Copy link
Member

I think we can close.

there is a possibility for more focused cases in grep, but those can be a separate issue.

@rbasso rbasso closed this as completed Mar 13, 2017
emcoding pushed a commit that referenced this issue Nov 19, 2018
Reformat 'about' documentation for nextercism
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants