
Should we use datatypes with more appropriate semantics at the cost of worse runtime performance and some code bloat? #2669

Closed
sjakobi opened this issue Oct 3, 2016 · 3 comments

Comments

@sjakobi
Member

sjakobi commented Oct 3, 2016

In many functions, such as Stack.Package.findCandidates, we currently use lists for parameters that should semantically be sets:

  • We don't care about the order of the elements.
  • Duplicate elements don't change the return value.
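For example (a hypothetical sketch, not the actual findCandidates signature), the same parameter expressed as a list versus as a Set might look like this; the Set type documents that order and duplicates don't matter, and deduplicates at construction time:

import           Data.Set (Set)
import qualified Data.Set as Set

-- List version: nothing stops a caller from passing ["pkg", "pkg"].
candidateFilesList :: [FilePath] -> [FilePath]
candidateFilesList dirs = [ dir ++ "/package.yaml" | dir <- dirs ]

-- Set version: duplicates collapse when the Set is built.
candidateFilesSet :: Set FilePath -> Set FilePath
candidateFilesSet = Set.map (++ "/package.yaml")

main :: IO ()
main = do
  print (candidateFilesList ["pkg", "pkg"])               -- two identical results
  print (candidateFilesSet (Set.fromList ["pkg", "pkg"])) -- one result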

I think this practice is quite common and not without benefits: lists have nice notation and can often be fused away during compilation, leading to fast code.

I also find that this practice has some downsides:

  • Reading the code, I'm often unsure if a parameter has list semantics or set semantics. What happens when there are duplicates? Is the order important?
  • In the particular case of the findCandidates function, a set parameter could have prevented a performance bug where duplicate elements caused a lot of unnecessary parsing and IO.

A similar argument can probably be made for other types too.

I currently think that we should put correctness and readability above performance considerations and use the most "correct" datatypes possible.

How does everyone else feel about this tradeoff?

@mgsloan mgsloan added this to the P3: Optional milestone Oct 4, 2016
@mgsloan
Contributor

mgsloan commented Oct 4, 2016

One option would be to introduce newtype ListSet a = ListSet [a], and delay the duplicate check until the main place it is relevant: toList :: Ord a => ListSet a -> [a]. We could also have a variant that performs the check in debug mode (dev mode) but otherwise skips it. I'm thinking:

toListUnsafe :: Ord a => ListSet a -> [a]
#ifdef DEBUG_MODE
-- Debug builds check for duplicates before unwrapping.
toListUnsafe (ListSet xs) =
    case getFirstDuplicate xs of
        Just _ -> error "toListUnsafe encountered a list with duplicates"
        Nothing -> xs
#else
-- Release builds skip the check entirely.
toListUnsafe (ListSet xs) = xs
#endif
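For completeness, here is a sketch of the supporting definitions that the snippet above assumes (the ListSet newtype, toList, and getFirstDuplicate); the names come from the suggestion above, but the bodies are just one possible implementation:

import qualified Data.Set as Set

newtype ListSet a = ListSet [a]

-- Deduplicating conversion, for the places where duplicates actually matter.
toList :: Ord a => ListSet a -> [a]
toList (ListSet xs) = Set.toList (Set.fromList xs)

-- Returns the first element that occurs more than once, if any.
getFirstDuplicate :: Ord a => [a] -> Maybe a
getFirstDuplicate = go Set.empty
  where
    go _ [] = Nothing
    go seen (x:xs)
      | x `Set.member` seen = Just x
      | otherwise           = go (Set.insert x seen) xs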

@sjakobi
Member Author

sjakobi commented Oct 5, 2016

Hm, I'm not sure that something like ListSet would be a good solution. While it would add some defense against (performance) bugs stemming from accidental duplicates, I think it would make the code even harder to understand: readers would have to look up the definition of ListSet to understand how it behaves. Also, we'd rely on beta-testers to run into the problematic cases where the bugs pop up, while users in the wild may still hit performance bugs without being able to easily diagnose the problem.

I have converted a bit of Stack.Packages to use Set parameters instead of lists and benchmarked that modified version of stack by running null builds in the path, stack, and amazonka projects using tekmo's bench tool. I couldn't find a significant slowdown compared to master.

Of course that doesn't mean that we couldn't run into a case where a conversion along these lines causes a performance regression. But I think that we should be able to catch such cases by running a few benchmarks if in doubt.
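If we ever want a quick sanity check at the data-structure level, separate from the whole-program null-build timings above, a minimal criterion sketch along these lines would do; the collection size and names here are made up for illustration:

import           Criterion.Main
import qualified Data.Set as Set

main :: IO ()
main = do
  let names   = [ "package-" ++ show (i :: Int) | i <- [1 .. 200] ]
      nameSet = Set.fromList names
  defaultMain
    [ bench "list elem"  (whnf (`elem` names) "package-200")
    , bench "Set member" (whnf (`Set.member` nameSet) "package-200")
    ]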

@mgsloan
Contributor

mgsloan commented Oct 6, 2016

Considering that the size of the data we're working on tends to be rather low, it seems fine to me to use Set. So I'm backtracking on my original concern, which was about the performance impact, particularly since you have done the work to benchmark!

In summary, I am in favor of using Set in more places, when appropriate.

@mgsloan mgsloan closed this as completed Oct 6, 2016