Major rewrite for correctness, performance #295
Oh, another thing this does is warn packages that extend |
One more: @SimonDanisch, this rewrite revealed that there are still shader formats registered to GLAbstraction, which AFAICT is not a registered package. Delete? Or something else? |
Amazing, thanks for taking the time to do this :)
Oh wow, definitely delete! |
Super excited for this change! Out of curiosity, I'm not sure I understand why the following is true:
|
It's a lot like this:

    julia> foo(n) = Val(n)
    foo (generic function with 1 method)

    julia> @code_warntype foo(3)
    MethodInstance for foo(::Int64)
      from foo(n) in Main at REPL[1]:1
    Arguments
      #self#::Core.Const(foo)
      n::Int64
    Body::Val{_A} where _A
    1 ─ %1 = Main.Val(n)::Val{_A} where _A
    └──      return %1

(with lots of red). There's no way to infer the result type. In contrast, if you keep things in the value domain (use Using This still creates a |
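For anyone who wants to check this without reading `@code_warntype` output, here is a small sketch (not code from this PR) that asks the compiler for the inferred return types directly. The function names `to_type_domain` and `stay_in_value_domain` are made up for illustration, and `Base.return_types` is an internal reflection helper rather than a stable API.

```julia
# Moving a runtime value into the type domain: the result type depends on the
# *value* of `n`, so inference can only report `Val{_A} where _A`.
to_type_domain(n) = Val(n)

# Staying in the value domain: the result type depends only on the argument type.
stay_in_value_domain(n) = (n, :tag)

rt_type  = only(Base.return_types(to_type_domain, (Int,)))
rt_value = only(Base.return_types(stay_in_value_domain, (Int,)))

println("type domain:  ", rt_type,  "  concrete? ", isconcretetype(rt_type))
println("value domain: ", rt_value, "  concrete? ", isconcretetype(rt_value))
```

The first result is not a concrete type, so every caller has to fall back to dynamic dispatch; the second is an ordinary concrete tuple type.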
OK, after a bit of ugly Pkg-wrangling across Julia versions we're passing tests. I might be able to tackle the whole list above in the morning. Should I implement the precompiles, or should I hold off a while to give someone a chance to review it? I'm happy leaving this open a week or so if needed. However, I also know it's not a fun PR to review, so no worries either way. |
Codecov Report
    @@            Coverage Diff             @@
    ##           master     #295      +/-   ##
    ==========================================
    + Coverage   85.37%   88.68%   +3.30%
    ==========================================
      Files           9       10       +1
      Lines         588      592       +4
    ==========================================
    + Hits          502      525      +23
    + Misses         86       67      -19
Continue to review full report at Codecov.
|
I've got the docs ready but I think it's probably best to submit them as a second PR. The docs would detract from the reviewability of this PR since they are mostly a move from the README, making for big linecount changes with little practical effect. The same goes for the rest of my TODOs (with the possible exception of fixing the depwarns). So I might just wait and submit those separately. |
In the new file |
3ed09c5 to 44bbf99
I haven't taken more than a cursory glance at the diff here, but your rationale sounds good
|
OK! I'll fix the depwarns from the test suite and then merge. |
This package was started in 2015 (back in the Julia 0.3 or 0.4 days), and a lot has changed since then. It's never really gotten a serious freshening. This rewrite has several goals.

Improving the robustness of package/Module identification
---------------------------------------------------------

In modern versions of Julia, the package manager uses name/UUID combinations to identify packages. This is far more robust and flexible than older strategies for specifying packages. FileIO doesn't do this: it identifies modules by their name only. We should adopt the new approach here: going forward (once the deprecation period has passed and we release FileIO v2), all not-yet-loaded modules must be specified by name/UUID.

There are some cases--often used in tests or transiently during development of a new I/O package--where the handler *isn't* a registered package, and so there's no UUID available. Currently we try to look up the module based on a `name::Symbol`. It used to be that most modules were loaded into `Main`, then Julia switched to `Base.__toplevel__`; currently we search both, since modules defined in the REPL or tests might still live in `Main`. Of course, even back in the old days, sub-modules could not be found in `Main`, so the current system can't handle submodules.

To address the need for specifying modules that aren't packages, while improving both correctness and flexibility, this PR allows you to specify it by (duh) the module itself rather than the name of the module.

The combination of using either the module itself or a name/UUID combination means that we can replace a lot of brittle & slow code. When we have the module, we're done; when we have a name/UUID combination, we just call `Base.require` to get the module. It even checks for us whether the module is already loaded. End of story.

To help transition existing users to the new system, this has "depwarn"-code to look for the module based on its name. It searches:

1. the currently-loaded modules
2. `Main`
3. the user's current `Pkg` environment

One key (breaking) difference is that this lookup is now done during `add_format` rather than when the user tries to `load` or `save` a file. This is obviously better for runtime efficiency, but it does change the point in the code where an error occurs. One of the relatively few changes to the tests addresses this change.

**Summary**: the new system is strictly more flexible than the old one, since we could never previously support sub-modules. It is also strictly more correct since the registry now specifies precisely what it means by `ImageIO`. There is depwarn-code to help existing users transition, and the only known breakages only concern the specific point in the code from which an error would be thrown.

Improving performance and reducing latency with better inferrability
--------------------------------------------------------------------

In the original design of this package, `load` and `save` were designed to be specialized by packages. To allow format-specific dispatch, we encoded the file format into the type system using types like `DataFormat{:PNG}`. However, at a certain point we switched to calling module-specific unexported `load` and `save` methods. As a consequence, we don't really need to encode the format in the type system, we can just use a runtime value. Indeed, the downside of using the type system is that having each format be a separate type makes it impossible to infer types. This hurts the runtime performance, increases latency due to unnecessary method specialization by the compiler, and increases the risk of invalidation.

However, one way in which we may *under*-specialize is for the filename. PR #259 removed the type-specification of the filename to support types defined in `FilePathsBase`. That's a nice change, but this package does quite a lot of manipulation based on file name, and having the type be non-inferrable has some downsides.

Finally, several of the container types have historically been poorly-specified, e.g., `const magic_list = Vector{Pair}()`.

This rewrite tries to straddle two goals: improving internal inferrability while maintaining backwards compatibility. The strategy taken is to try to wait until the last possible moment to construct non-inferrable objects---to wait until the results are reported back to the caller.

In this rewrite, the data format is encoded internally just as a `Symbol`, and the file is passed around as a separate object. This prevents one from needing to specialize on the data format while preserving inferrability for the file.

There are a couple of minor changes to internal types, and this forced a couple of changes to the tests. Most significantly, `File{fmt}` is no longer a concrete type, because `File` got a second type-parameter to encode the filename type. To prevent inference failures due to varying-length tuples, this also transitions all magic bytes from `NTuple{N,UInt8}` to `Vector{UInt8}`.

As a case study, with the existing FileIO release, I get ~50us to load a 10x10 RGB png file. With this version, it's ~25us. It's remarkable that inference can compete with I/O as a source of slowness, but there you have it.
This is in preparation for adding Documenter docs, but it's useful on its own.
This will improve printing in documentation tables
This package was started in 2015 (back in the Julia 0.3 or 0.4 days), and a lot
has changed since then. It's not recently gotten a serious freshening.

This rewrite has several goals.
Improving the robustness of package/Module identification
In modern versions of Julia, the package manager uses name/UUID combinations
to identify packages. This is far more robust and flexible than older strategies
for specifying packages. FileIO doesn't do this: it identifies modules by their
name only. We should adopt the new approach here: going forward (once the
deprecation period has passed and we release FileIO v2), all not-yet-loaded
modules must be specified by name/UUID.
There are some cases--often used in tests or transiently during development of
a new I/O package--where the handler isn't a registered package, and so there's
no UUID available. Currently we try to look up the module based on a
`name::Symbol`. It used to be that most modules were loaded into `Main`, then
Julia switched to `Base.__toplevel__`; currently we search both, since modules
defined in the REPL or tests might still live in `Main`. Of course, even back in
the old days, sub-modules could not be found in `Main`, so the current system
can't handle submodules.
To address the need for specifying modules that aren't packages, while
improving both correctness and flexibility, this PR allows you to
specify it by (duh) the module itself rather than the name of the module.
The combination of using either the module itself or a name/UUID combination
means that we can replace a lot of brittle & slow code. When we have the module,
we're done; when we have a name/UUID combination, we just call `Base.require`
to get the module. It even checks for us whether the module is already loaded, and
when it is it's just a single `Dict` lookup. This should be more robust and
considerably faster than FileIO.jl/src/loadsave.jl, lines 7 to 35 in 1df5033.
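As a rough illustration of the two accepted forms (a sketch, not the code in this PR), the resolution step could look like the following. `resolve_handler` is a hypothetical name, and `Dates` is used only because it is a standard library and so loads without installing anything; `Base.PkgId` and `Base.require` are Julia internals whose behavior may vary between Julia versions.

```julia
import Dates  # a stdlib, used here only as a stand-in for an I/O package

# Hypothetical helper: a handler is either the module itself or a name/UUID id.
resolve_handler(m::Module) = m                      # we already have the module: done
resolve_handler(id::Base.PkgId) = Base.require(id)  # loads it, or returns the loaded module

dates_id = Base.PkgId(Dates)                        # the name/UUID pair for Dates
println(dates_id.name, " => ", dates_id.uuid)

println(resolve_handler(Dates) === Dates)
println(resolve_handler(dates_id) === Dates)
```

Both calls hand back the same module object, and neither needs to search `Main` or guess from a bare name.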
To help transition existing users to the new system, this has
"depwarn"-code to look for the module based on its name. It searches:
1. the currently-loaded modules
2. `Main`
3. the user's current `Pkg` environment
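A simplified sketch of what such a name-only fallback might look like is below. This is not the depwarn code from the PR: `find_module_by_name` is a hypothetical name, the `Pkg`-environment step is left out, and `Base.loaded_modules` is a Julia internal that I'm assuming is a dictionary keyed by package id.

```julia
import Dates  # only so there is a known loaded module to look up

# Hypothetical name-only lookup covering the loaded modules and `Main`;
# searching the user's current Pkg environment is omitted from this sketch.
function find_module_by_name(name::Symbol)
    # Search the currently-loaded top-level modules.
    for (id, mod) in Base.loaded_modules
        id.name == String(name) && return mod
    end
    # Search modules defined directly in Main (e.g. at the REPL or in tests).
    if isdefined(Main, name)
        m = getfield(Main, name)
        m isa Module && return m
    end
    return nothing
end

println(find_module_by_name(:Dates))
println(find_module_by_name(:NoSuchModuleForThisSketch))
```

Note that a name-only search like this cannot tell apart two packages that share a name, which is the ambiguity the name/UUID form removes.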
One key (breaking) difference is that this lookup is now done during
`add_format` rather than when the user tries to `load` or `save` a file. This is
obviously better for runtime efficiency, but it does change the point in the code
where an error occurs. One of the relatively few changes to the tests addresses
this change.
Summary: the new system is strictly more flexible than the old one, since we could
never previously support sub-modules. It is also strictly more correct since
the registry now specifies precisely what it means by `ImageIO`. There is
depwarn-code to help existing users transition, and the only known breakages
only concern the specific point in the code from which an error would be thrown.
Finally, it should be substantially more performant.
Improving performance and reducing latency with better inferrability
In the original design of this package, `load` and `save` were designed to be
specialized by packages. To allow format-specific dispatch, we encoded the
file format into the type system using types like `DataFormat{:PNG}`.
However, at a certain point we switched to calling module-specific
unexported `load` and `save` methods. As a consequence, we don't really
need to encode the format in the type system, we can just use a runtime
value. Indeed, the downside of using the type system is that having each
format be a separate type makes it impossible to infer types. This hurts the
runtime performance, increases latency due to unnecessary method specialization
by the compiler, and increases the risk of invalidation.
However, one way in which we may under-specialize is for the filename.
#259 removed the type-specification of the filename to support types
defined in `FilePathsBase`. That's a nice change, but this package does quite
a lot of manipulation based on file name, and having the type be non-inferrable
has some downsides.
Finally, several of the container types have historically been poorly-specified,
e.g., `const magic_list = Vector{Pair}()`.

This rewrite tries to straddle two goals: improving internal inferrability
while maintaining backwards compatibility. The strategy taken is to try to
wait until the last possible moment to construct non-inferrable objects---to wait
until the results are reported back to the caller.
In this rewrite, the data format is encoded internally just as a `Symbol`,
and the file is passed around as a separate object. This prevents one from
needing to specialize on the data format while preserving inferrability for the file.
To move towards a world in which we could infer the type of the filename,
this adds a parameter to our existing types.
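Here is a toy version of that strategy, to show where the non-inferrable step ends up. Everything in it is a stand-in: `DataFormat` and `File` are redefined locally with the same names but are not FileIO's actual definitions, and `ext2sym`, `detect`, and `query_sketch` are hypothetical.

```julia
# Local stand-ins that only mimic the shape of the real types.
struct DataFormat{sym} end
struct File{F<:DataFormat,Name}
    filename::Name
end

const ext2sym = Dict{String,Symbol}(".png" => :PNG, ".csv" => :CSV)

# Internal step: the format stays a runtime Symbol, so the return type
# is the concrete Tuple{Symbol,String}.
function detect(filename::String)
    _, ext = splitext(filename)
    return get(ext2sym, lowercase(ext), :UNKNOWN), filename
end

# Boundary step: only when reporting back to the caller is the Symbol
# moved into the type domain, which is the one non-inferrable construction.
function query_sketch(filename::String)
    sym, name = detect(filename)
    return File{DataFormat{sym},String}(name)
end

println(only(Base.return_types(detect, (String,))))
println(typeof(query_sketch("image.PNG")))
```

All the internal work on the filename and format happens in inferrable code; only the final wrapper has a return type the compiler cannot pin down.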
There are a couple of minor changes to internal types, and this forced a couple of
changes to the tests. Most significantly, `File{fmt}` is no longer a concrete
type, because `File` got a second type-parameter to encode the filename type.
To prevent inference failures due to varying-length tuples, this also transitions
all magic bytes from `NTuple{N,UInt8}` to `Vector{UInt8}`.
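To illustrate why the container element type matters, here is a sketch with a made-up registry (it is not FileIO's `magic_list`, and `match_magic` is a hypothetical helper). With tuples of differing lengths each entry would have a different `Pair` type, so no single concrete element type exists; with `Vector{UInt8}` one does.

```julia
# Hypothetical registries, for illustration only.
loose = Vector{Pair}()                    # abstract element type
tight = Pair{Vector{UInt8},Symbol}[]      # concrete element type

push!(tight, UInt8[0x89, 0x50, 0x4e, 0x47] => :PNG)   # 4 magic bytes
push!(tight, UInt8[0xff, 0xd8] => :JPEG)              # 2 magic bytes

# Return the Symbol of the first entry whose magic bytes are a prefix of `header`.
function match_magic(list::Vector{Pair{Vector{UInt8},Symbol}}, header::Vector{UInt8})
    for (magic, sym) in list
        n = length(magic)
        length(header) >= n && view(header, 1:n) == magic && return sym
    end
    return :UNKNOWN
end

println(isconcretetype(eltype(loose)), " ", isconcretetype(eltype(tight)))
println(match_magic(tight, UInt8[0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a]))
```

Because every entry of `tight` has the same concrete type, the loop in `match_magic` is fully inferrable and the function always returns a `Symbol`.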
Overall consequences
As a case study, with the existing FileIO release, I get ~50us to load a
10x10 RGB png file. With this version, it's ~25us. It's remarkable that some
of the current bad things about the code here can compete with I/O as a source
of slowness, but there you have it. Of course for a larger image it becomes
increasingly I/O dominated.
I've not measured latency yet, because this currently removes all `precompile`
directives. I'll add those back in after addressing review comments, since I expect
that could change the precompile directives too.
TODOs
These will probably be separate PRs to enhance the reviewability of this one.
Breaking or not?
Since I think all the breakages (check the changes to the test files) are essentially things that should
only be used internally, I don't think this needs to be FileIO v2.0.0; instead, it can be FileIO v1.5,
and when we delete the deprecations we can go to FileIO v2.0.0.
However, there is one caveat (see below): ImageIO exploited FileIO internals, and some of these internals are changing.
(See JuliaIO/ImageIO.jl#22.) I don't see any way of preventing a breakage for
people who have current versions of ImageIO. Perhaps we could retrospectively place an upper bound
on the FileIO version at `1.4` in the registry?