
Refactor of features tests #522

Merged
sgillies merged 1 commit into rasterio:master on Nov 10, 2015

Conversation

brendan-ward (Contributor)

This PR involves a major refactor of the tests for features.py and rio/features.py.

Goals:

  1. Make tests as small, simple, and fast as possible
  2. Use pytest fixtures for repeated inputs
  3. Increase test coverage

Along the way, I dropped WKB support from _features.pyx (it wasn't used) and moved as much of the validation logic for shapes and sieve as possible from the Python wrapper functions into the base Cython implementations, to keep these more tightly encapsulated. rasterize does too much in the Python layer to move easily into the Cython implementation. I also refactored the data type validation logic used by rasterize into functions in dtypes, for better accessibility and testability.

I also consolidated the several individual files with tests for various parts of features into a single test file; a one-to-one mapping between test files and source files is easier to manage.

I used pytest fixtures heavily, with very small test images and files; this gave the test suite a significant speedup - toward #450.
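
Not the literal fixtures from this branch, but a minimal sketch of the pattern (fixture name hypothetical):

```python
import numpy
import pytest


@pytest.fixture
def basic_image():
    """A tiny 10x10 test image: zeros with a small square of ones."""
    image = numpy.zeros((10, 10), dtype=numpy.uint8)
    image[2:5, 2:5] = 1
    return image


def test_image_shape(basic_image):
    assert basic_image.shape == (10, 10)
```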

I found and fixed a few isolated bugs along the way.

I made a couple of things a bit more strict (a sketch of the new behavior follows this list):

  1. bounds no longer returns None on exception; it now lets the exception propagate.
  2. rasterize no longer silently ignores invalid shapes; they now raise an exception.
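
A minimal sketch of the new failure mode (assuming ValueError here; the exact exception type is whatever the validation raises):

```python
import pytest
from rasterio.features import rasterize


def test_rasterize_invalid_shape():
    # A GeoJSON-like mapping missing its coordinates; previously it was
    # silently skipped, now it should raise.
    with pytest.raises(ValueError):
        rasterize([{'type': 'Polygon'}], out_shape=(10, 10))
```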

Note: the decrease in coverage is due to mis-reporting of coverage for the _features.pyx and rio/features.py files; much more of the functionality is covered than is indicated here. I believe actual coverage has increased over the prior implementation.

I've found awkward coverage behavior that is proving hard to explain, especially for the CLI tests: if I run the same call to the click test runner twice in the same test function, I get coverage; if I run it only once, I don't. This is partly why coverage was recorded higher previously: the old test functions did everything and the kitchen sink in a single function.
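
A stand-in reproduction of the quirk with a trivial command (not an actual rio command):

```python
import click
from click.testing import CliRunner


@click.command()
def info():
    """Stand-in for a rio command."""
    click.echo('ok')


def test_info_cli():
    runner = CliRunner()
    # With a single invoke, coverage missed the command body in my runs;
    # a second identical invoke made the lines register as covered.
    assert runner.invoke(info).exit_code == 0
    assert runner.invoke(info).exit_code == 0
```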

@brendan-ward (Contributor, Author)

In case it matters, I believe I've isolated the coverage issues for rio/features.py to interactions between click and coverage, since I can reproduce the behavior without pytest (so we can't blame pytest fixtures for importing things before coverage is ready).

Still digging in on that, which may take a while. I'd prefer to deal with the issues related to coverage in a separate PR to keep this one from growing too much more.

@sgillies (Member) commented Nov 9, 2015

I'm 👍 on merging this despite the coverage dip. Agree or disagree @perrygeo?

I'd like to run a few more tests locally before I do.

@sgillies sgillies added this to the pre 1.0 milestone Nov 9, 2015
@brendan-ward (Contributor, Author)

@sgillies, suggestions for testing by hand, since these were changes from the previous implementation:

  1. any suitably large, complex GeoJSON that may have contained invalid geometries silently ignored by rasterize. These will now fail with exceptions and could break people's data pipelines. My view is that geometry cleaning belongs elsewhere; we should insist on clean geometry.

  2. any case where you might have been relying on bounds returning None, also within any processing pipelines (a quick sketch follows).
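
For item 2, a quick hand-test sketch (exact exception type may vary):

```python
from rasterio.features import bounds

# Valid GeoJSON still returns (left, bottom, right, top):
bounds({'type': 'Point', 'coordinates': [1.0, 2.0]})

# Invalid input used to return None; it now lets the exception propagate,
# so pipelines testing `if bounds(geom) is None` need updating.
try:
    bounds({'type': 'Polygon'})  # missing coordinates
except Exception as exc:
    print('raises instead of returning None:', exc)
```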

        return numpy.allclose(values, values.astype(dtype))
    else:
        return numpy.array_equal(values, values.astype(dtype))
perrygeo (Contributor):

I wonder if there is a more efficient way to test cast safety without making a copy and doing an element-wise comparison. I've never used it, but http://docs.scipy.org/doc/numpy/reference/generated/numpy.can_cast.html looks like it might do what we need here.

brendan-ward (Contributor, Author):

@perrygeo It doesn't work against values in an array:

    numpy.can_cast(numpy.array([1, 2, 3]), numpy.int8) ==> False

and using other casting rules can produce undesirable results:

    numpy.can_cast(numpy.array([1, 2, 3, 65564]), numpy.int8, casting='same_kind') ==> True

Of course, we can use it in a loop and give it a single scalar value at a time, because then it actually checks values:

    [numpy.can_cast(numpy.array(v), numpy.int8) for v in [1, 2, 3]] ==> [True, True, True]

but that is less efficient than what we are doing now. Unfortunately it's less useful than its name suggests.

perrygeo (Contributor):

Bummer. Well, let's stick with this method for now but keep an eye out for more efficient approaches (possibly testing array.min and array.max?).

brendan-ward (Contributor, Author):

@perrygeo I thought about that, but min/max would be fooled by something like 1, 2, 3.4, 4.1, 5: the range fits an integer dtype, but 3.4 and 4.1 don't cast losslessly. I can't see an approach that avoids evaluating each value.

If this becomes a performance bottleneck, we can always move this over to Cython. I suspect it's low on our list of performance worst offenders, though...
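
For reference, a sketch of the value-based check under discussion, assembled from the diff above (the real helper in dtypes may differ in detail):

```python
import numpy


def can_cast_dtype(values, dtype):
    """Return True if every value survives a round trip through dtype."""
    values = numpy.asarray(values)
    if values.dtype == numpy.dtype(dtype):
        return True
    if values.dtype.kind == 'f':
        # Floats: allow for tiny representation error in the round trip.
        return numpy.allclose(values, values.astype(dtype))
    # Everything else must round-trip exactly.
    return numpy.array_equal(values, values.astype(dtype))
```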

@perrygeo (Contributor) commented Nov 9, 2015

I like it. The tests are much cleaner and faster. Overall it seems more robust and I'm always a fan of the fail-fast strategy.

I want to do some integration testing with some real data™, particularly in the context of rasterstats since it relies heavily on rasterize. I'll report back tomorrow with some results.

@perrygeo (Contributor)

I've tested it against a few datasets and it works as expected: same results with valid GeoJSON, and it now raises an exception on invalid features, which I think is a good thing. All the rasterstats tests pass.

Sticking with the current can_cast_dtype and removing the inverted raster logic sounds good to me.

👍 on merging.

sgillies added a commit that referenced this pull request Nov 10, 2015
@sgillies sgillies merged commit da93f2b into rasterio:master Nov 10, 2015
@sgillies (Member)

Old tests pass for me. That was my sanity check.
