Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use mmap to read files when possible #106

Closed
StefanKarpinski opened this issue Jul 10, 2011 · 10 comments
Closed

use mmap to read files when possible #106

StefanKarpinski opened this issue Jul 10, 2011 · 10 comments
Assignees
Labels
performance Must go faster speculative Whether the change will be implemented is speculative

Comments

@StefanKarpinski
Copy link
Member

It occurs to me that we could potentially mmap files that are mmappable when reading them, potentially improving performance quite drastically. We'd have to test it out to make sure it didn't have major issues, but it could be a big win.

One really great example of where this could be an enormous performance improvement would be when using a regex to scan a file. The normal approach is to real in a chunk at a time and then scan that with the regex. However, that has issues if you can't be sure whether the regex has to match within the chunk. For example, when reading line-by-line, matching a regex that can only match within a line, this is fine — as long as the lines don't get prohibitively long. If the scan is through the entire file, e.g. in the case where the regex is being used to split the file into chunks for consumption, then this can get tricky. Using the mmap approach, you can just map the file and let the regex engine go to work — the kernel will handle getting the data when the regex engine is ready for it!

It would be nice to be able to apply the same trick to streamed data, which cannot be mmapped. However, I think a related trick might work: memory protect the page after the last bit of valid, read data from the file and trap accesses to that memory. Then if the regex engine (or whatever is doing the reading) reads past the valid buffer, you can go ahead and read more data on demand. This allows a very similar trick to be done even with streamed input.

@ghost ghost assigned StefanKarpinski Jul 10, 2011
@StefanKarpinski
Copy link
Member Author

@JeffBezanson: does this make any sense?

@JeffBezanson
Copy link
Member

Yeah, we can make Arrays with any data pointer, so you can make strings or numeric arrays backed by binary files. Seems to make the most sense when the entire file corresponds to a single object, especially a random-access object. We could define a kind of i/o stream that wraps an Array{Uint8,1}, then an mmapped file can be accessed by both indexing and i/o operations.

@StefanKarpinski
Copy link
Member Author

Cool. This is very 2.0, of course, but I think it could make it possible to do operations on files both efficiently and easily.

@ViralBShah
Copy link
Member

This could be very cool if it could be done. I presume that this is done in R. Perhaps matlab only focusses on windows.

@StefanKarpinski
Copy link
Member Author

I'm fairly certain this is not done in R. I don't know of any language that does this. For some reason, people seem to be allergic to mmap. I don't understand why because IMO it's one of the best syscalls every invented.

@JeffBezanson
Copy link
Member

In commit d65e897, we have it.

@ViralBShah
Copy link
Member

Can we close this one?

@timholy
Copy link
Member

timholy commented Aug 17, 2012

Good question. Stefan has some interesting ideas above that go above the current mmap implementation, so I'll let him comment.

@StefanKarpinski
Copy link
Member Author

Let's leave it open as an idea bookmark, although I'm not sure this fits too well with our current I/O model. Would need a lot of deep surgery to make it work, I think.

@quinnj
Copy link
Member

quinnj commented Oct 14, 2015

I think we can close this issue. We now have fairly accessible Mmap.mmap functionality, which returns an mmapped byte-array for any platform, and you can easily wrap it in an IOBuffer for IO operations. readdlm also utilizes this, as well as SharedArrays. I think we can open new issues for specific areas where we currently aren't leveraging mmap.

LilithHafner pushed a commit to LilithHafner/julia that referenced this issue Oct 11, 2021
Keno pushed a commit that referenced this issue Oct 9, 2023
REPL also uses sigatomic, but I don't expect to be interpreting REPL
anytime soon.
Keno pushed a commit that referenced this issue Oct 9, 2023
Run all Base users of `sigatomic` in compiled mode. Fixes #106
udesou pushed a commit to udesou/julia that referenced this issue Nov 22, 2023
…uliaLang#106)

Otherwise we may just observe `gc_n_threads = 0` (`jl_gc_collect` sets
it to 0 in the very end of its body) and this function becomes a no-op.
aviatesk added a commit that referenced this issue May 29, 2024
This commit makes it possible for `names` to return `using`-ed names
as well:
```julia
julia> using Base: @assume_effects

julia> Symbol("@assume_effects") in names(@__MODULE__; usings=true)
true
```

Currently, to find all names available in a module `A`, the following
steps are needed:
1. Use `names(A; all=true, imported=true)` to get the names defined by
   `A` and the names explicitly `import`ed by `A`.
2. Use `jl_module_usings(A)` to get the list of modules `A` has
   `using`-ed and then use `names()` to get the names `export`ed by
   those modules.

This method is implemented in e.g. REPL completions, but it has a
problem: it could not get the names explicitly `using`-ed by
`using B: ...` (#36529, #40356, JuliaDebug/Infiltrator.jl/#106, etc.).

This commit adds a new keyword argument `usings::Bool=false` to
`names(A; ...)`, which, when `usings=true` is specified, returns all
names introduced by `using` in `A`.
In other words, `usings=true` not only returns explicitly `using`-ed
names but also incorporates step 2 above into the implementation of
`names`.

By using this new option, we can now use
`names(A; all=true, imported=true, usings=true)` to know all names
available in `A`, without implementing the two-fold steps on application
side.
As example application, this new feature will be used to simplify and
enhance the implementation of REPL completions.

- fixes #36529

Co-authored-by: Nathan Daly <NHDaly@gmail.com>
Co-authored-by: Sebastian Pfitzner <pfitzseb@gmail.com>
aviatesk added a commit that referenced this issue May 29, 2024
This commit makes it possible for `names` to return `using`-ed names
as well:
```julia
julia> using Base: @assume_effects

julia> Symbol("@assume_effects") in names(@__MODULE__; usings=true)
true
```

Currently, to find all names available in a module `A`, the following
steps are needed:
1. Use `names(A; all=true, imported=true)` to get the names defined by
   `A` and the names explicitly `import`ed by `A`.
2. Use `jl_module_usings(A)` to get the list of modules `A` has
   `using`-ed and then use `names()` to get the names `export`ed by
   those modules.

This method is implemented in e.g. REPL completions, but it has a
problem: it could not get the names explicitly `using`-ed by
`using B: ...` (#36529, #40356, JuliaDebug/Infiltrator.jl/#106, etc.).

This commit adds a new keyword argument `usings::Bool=false` to
`names(A; ...)`, which, when `usings=true` is specified, returns all
names introduced by `using` in `A`.
In other words, `usings=true` not only returns explicitly `using`-ed
names but also incorporates step 2 above into the implementation of
`names`.

By using this new option, we can now use
`names(A; all=true, imported=true, usings=true)` to know all names
available in `A`, without implementing the two-fold steps on application
side.
As example application, this new feature will be used to simplify and
enhance the implementation of REPL completions.

- fixes #36529

Co-authored-by: Nathan Daly <NHDaly@gmail.com>
Co-authored-by: Sebastian Pfitzner <pfitzseb@gmail.com>
aviatesk added a commit that referenced this issue May 29, 2024
This commit makes it possible for `names` to return `using`-ed names
as well:
```julia
julia> using Base: @assume_effects

julia> Symbol("@assume_effects") in names(@__MODULE__; usings=true)
true
```

Currently, to find all names available in a module `A`, the following
steps are needed:
1. Use `names(A; all=true, imported=true)` to get the names defined by
   `A` and the names explicitly `import`ed by `A`.
2. Use `jl_module_usings(A)` to get the list of modules `A` has
   `using`-ed and then use `names()` to get the names `export`ed by
   those modules.

This method is implemented in e.g. REPL completions, but it has a
problem: it could not get the names explicitly `using`-ed by
`using B: ...` (#36529, #40356, JuliaDebug/Infiltrator.jl/#106, etc.).

This commit adds a new keyword argument `usings::Bool=false` to
`names(A; ...)`, which, when `usings=true` is specified, returns all
names introduced by `using` in `A`.
In other words, `usings=true` not only returns explicitly `using`-ed
names but also incorporates step 2 above into the implementation of
`names`.

By using this new option, we can now use
`names(A; all=true, imported=true, usings=true)` to know all names
available in `A`, without implementing the two-fold steps on application
side.
As example application, this new feature will be used to simplify and
enhance the implementation of REPL completions.

- fixes #36529

Co-authored-by: Nathan Daly <NHDaly@gmail.com>
Co-authored-by: Sebastian Pfitzner <pfitzseb@gmail.com>
aviatesk added a commit that referenced this issue May 29, 2024
This commit makes it possible for `names` to return `using`-ed names
as well:
```julia
julia> using Base: @assume_effects

julia> Symbol("@assume_effects") in names(@__MODULE__; usings=true)
true
```

Currently, to find all names available in a module `A`, the following
steps are needed:
1. Use `names(A; all=true, imported=true)` to get the names defined by
   `A` and the names explicitly `import`ed by `A`.
2. Use `jl_module_usings(A)` to get the list of modules `A` has
   `using`-ed and then use `names()` to get the names `export`ed by
   those modules.

This method is implemented in e.g. REPL completions, but it has a
problem: it could not get the names explicitly `using`-ed by
`using B: ...` (#36529, #40356, JuliaDebug/Infiltrator.jl/#106, etc.).

This commit adds a new keyword argument `usings::Bool=false` to
`names(A; ...)`, which, when `usings=true` is specified, returns all
names introduced by `using` in `A`.
In other words, `usings=true` not only returns explicitly `using`-ed
names but also incorporates step 2 above into the implementation of
`names`.

By using this new option, we can now use
`names(A; all=true, imported=true, usings=true)` to know all names
available in `A`, without implementing the two-fold steps on application
side.
As example application, this new feature will be used to simplify and
enhance the implementation of REPL completions.

- fixes #36529

Co-authored-by: Nathan Daly <NHDaly@gmail.com>
Co-authored-by: Sebastian Pfitzner <pfitzseb@gmail.com>
aviatesk added a commit that referenced this issue May 30, 2024
This commit makes it possible for `names` to return `using`-ed names
as well:
```julia
julia> using Base: @assume_effects

julia> Symbol("@assume_effects") in names(@__MODULE__; usings=true)
true
```

Currently, to find all names available in a module `A`, the following
steps are needed:
1. Use `names(A; all=true, imported=true)` to get the names defined by
   `A` and the names explicitly `import`ed by `A`.
2. Use `jl_module_usings(A)` to get the list of modules `A` has
   `using`-ed and then use `names()` to get the names `export`ed by
   those modules.

This method is implemented in e.g. REPL completions, but it has a
problem: it could not get the names explicitly `using`-ed by
`using B: ...` (#36529, #40356, JuliaDebug/Infiltrator.jl/#106, etc.).

This commit adds a new keyword argument `usings::Bool=false` to
`names(A; ...)`, which, when `usings=true` is specified, returns all
names introduced by `using` in `A`.
In other words, `usings=true` not only returns explicitly `using`-ed
names but also incorporates step 2 above into the implementation of
`names`.

By using this new option, we can now use
`names(A; all=true, imported=true, usings=true)` to know all names
available in `A`, without implementing the two-fold steps on application
side.
As example application, this new feature will be used to simplify and
enhance the implementation of REPL completions.

- fixes #36529

Co-authored-by: Nathan Daly <NHDaly@gmail.com>
Co-authored-by: Sebastian Pfitzner <pfitzseb@gmail.com>
aviatesk added a commit that referenced this issue May 30, 2024
This commit makes it possible for `names` to return `using`-ed names
as well:
```julia
julia> using Base: @assume_effects

julia> Symbol("@assume_effects") in names(@__MODULE__; usings=true)
true
```

Currently, to find all names available in a module `A`, the following
steps are needed:
1. Use `names(A; all=true, imported=true)` to get the names defined by
   `A` and the names explicitly `import`ed by `A`.
2. Use `jl_module_usings(A)` to get the list of modules `A` has
   `using`-ed and then use `names()` to get the names `export`ed by
   those modules.

This method is implemented in e.g. REPL completions, but it has a
problem: it could not get the names explicitly `using`-ed by
`using B: ...` (#36529, #40356, JuliaDebug/Infiltrator.jl/#106, etc.).

This commit adds a new keyword argument `usings::Bool=false` to
`names(A; ...)`, which, when `usings=true` is specified, returns all
names introduced by `using` in `A`.
In other words, `usings=true` not only returns explicitly `using`-ed
names but also incorporates step 2 above into the implementation of
`names`.

By using this new option, we can now use
`names(A; all=true, imported=true, usings=true)` to know all names
available in `A`, without implementing the two-fold steps on application
side.
As example application, this new feature will be used to simplify and
enhance the implementation of REPL completions.

- fixes #36529

Co-authored-by: Nathan Daly <NHDaly@gmail.com>
Co-authored-by: Sebastian Pfitzner <pfitzseb@gmail.com>
DilumAluthge added a commit that referenced this issue Dec 17, 2024
Stdlib: Distributed
URL: https://github.com/JuliaLang/Distributed.jl
Stdlib branch: master
Julia branch: master
Old commit: 6c7cdb5
New commit: c613685
Julia version: 1.12.0-DEV
Distributed version: 1.11.0(Does not match)
Bump invoked by: @DilumAluthge
Powered by:
[BumpStdlibs.jl](https://github.com/JuliaLang/BumpStdlibs.jl)

Diff:
JuliaLang/Distributed.jl@6c7cdb5...c613685

```
$ git log --oneline 6c7cdb5..c613685
c613685 Merge pull request #116 from JuliaLang/ci-caching
20e2ce7 Use julia-actions/cache in CI
9c5d73a Merge pull request #112 from JuliaLang/dependabot/github_actions/codecov/codecov-action-5
ed12496 Merge pull request #107 from JamesWrigley/remotechannel-empty
010828a Update .github/workflows/ci.yml
11451a8 Bump codecov/codecov-action from 4 to 5
8b5983b Merge branch 'master' into remotechannel-empty
729ba6a Fix docstring of `@everywhere` (#110)
af89e6c Adding better docs to exeflags kwarg (#108)
8537424 Implement Base.isempty(::RemoteChannel)
6a0383b Add a wait(::[Abstract]WorkerPool) (#106)
1cd2677 Bump codecov/codecov-action from 1 to 4 (#96)
cde4078 Bump actions/cache from 1 to 4 (#98)
6c8245a Bump julia-actions/setup-julia from 1 to 2 (#97)
1ffaac8 Bump actions/checkout from 2 to 4 (#99)
8e3f849 Fix RemoteChannel iterator interface (#100)
f4aaf1b Fix markdown errors in README.md (#95)
2017da9 Merge pull request #103 from JuliaLang/sf/sigquit_instead
07389dd Use `SIGQUIT` instead of `SIGTERM`
```

Co-authored-by: Dilum Aluthge <dilum@aluthge.com>
stevengj pushed a commit that referenced this issue Jan 2, 2025
Stdlib: Distributed
URL: https://github.com/JuliaLang/Distributed.jl
Stdlib branch: master
Julia branch: master
Old commit: 6c7cdb5
New commit: c613685
Julia version: 1.12.0-DEV
Distributed version: 1.11.0(Does not match)
Bump invoked by: @DilumAluthge
Powered by:
[BumpStdlibs.jl](https://github.com/JuliaLang/BumpStdlibs.jl)

Diff:
JuliaLang/Distributed.jl@6c7cdb5...c613685

```
$ git log --oneline 6c7cdb5..c613685
c613685 Merge pull request #116 from JuliaLang/ci-caching
20e2ce7 Use julia-actions/cache in CI
9c5d73a Merge pull request #112 from JuliaLang/dependabot/github_actions/codecov/codecov-action-5
ed12496 Merge pull request #107 from JamesWrigley/remotechannel-empty
010828a Update .github/workflows/ci.yml
11451a8 Bump codecov/codecov-action from 4 to 5
8b5983b Merge branch 'master' into remotechannel-empty
729ba6a Fix docstring of `@everywhere` (#110)
af89e6c Adding better docs to exeflags kwarg (#108)
8537424 Implement Base.isempty(::RemoteChannel)
6a0383b Add a wait(::[Abstract]WorkerPool) (#106)
1cd2677 Bump codecov/codecov-action from 1 to 4 (#96)
cde4078 Bump actions/cache from 1 to 4 (#98)
6c8245a Bump julia-actions/setup-julia from 1 to 2 (#97)
1ffaac8 Bump actions/checkout from 2 to 4 (#99)
8e3f849 Fix RemoteChannel iterator interface (#100)
f4aaf1b Fix markdown errors in README.md (#95)
2017da9 Merge pull request #103 from JuliaLang/sf/sigquit_instead
07389dd Use `SIGQUIT` instead of `SIGTERM`
```

Co-authored-by: Dilum Aluthge <dilum@aluthge.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster speculative Whether the change will be implemented is speculative
Projects
None yet
Development

No branches or pull requests

5 participants