-
Notifications
You must be signed in to change notification settings - Fork 185
List of strings (implementation ideas) #322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Labels
topic: strings
String processing
Comments
If we want alternative implementations, tuned to various usage scenarios,
then a common API is definitely desirable. You would simply replace the
type name by the one suitable for the intended use - quite possibly even by
using the rename feature of a module ;) (and for the real, lazy programmers
among us: use an intermediate module to even hide that renaming, so that
there is only one location in the source code that determines which type
will be actually used).
That said, I wonder how we can measure the consequences of the various
implementations: performance is one thing and while not all that easy to
get right, it is definitely measurable. But memory fragmentation is, I
guess, a very different thing. I have no idea how you can measure that - or
what a useful measure is.
We might make it part of the GSoC containers project or even a project in
its own right!
Op ma 15 feb. 2021 om 21:04 schreef Ivan Pribec <notifications@github.com>:
… I thought it might be best to create a new issue to discuss alternative
implementations for the list of strings proposed in #268
<#268>. As @arjenmarkus
<https://github.com/arjenmarkus> wrote there:
This proposal considers the features that would be useful for this derived
type, a list of strings. It does not prescribe the methods for implementing
the type. Given the ease by which arrays are handled in Fortran, it is
possible to use allocatable arrays, but perhaps a linked list may prove to
be more efficient, at least in certain scenarios of use. Therefore the
proposal concentrates on the usage.
I thought it might be helpful to expand on this topic using some analogies
with the C++ library. In C++, the standard template library provides the
special container std::string as a sequence of characters supporting fast
random access, and fast insert/delete at back of the container.
In Fortran we can reach similar *behavior/usage* using the character(len=:),
allocatable type:
character(len=:), allocatable :: str
! Automatic allocation upon initialization
str = "Hello"
! "Insertion" at the back
str = str//", world!" ! "Hello, world!"
! Fast random access
str(8:8) = "W" ! "Hello, World!"
! "Deletion"
str = str(1:len(str)-1) ! "Hello, World"
There are of course also very large differences between the C++
std::string and the Fortran allocatable character sequences in terms of
memory management and other things. Some of the differences can be inferred
from the numerous member functions(
http://www.cplusplus.com/reference/string/string/) of std::string for
access, modification, and string operations.
Moving one step further to a "lists of strings" in C++ there are several
options. For "fixed-size" lists one has:
// Built-in (fixed) size array
std::string colour[4] = { "Blue", "Red",
"Orange", "Yellow" };
// Array of strings (safer, and easier to use than the built-in array)
std::array<std::string, 4> colour { "Blue", "Red", "Orange",
"Yellow" };
For dynamic lists of strings the most commonly used option in C++ is
// Vector of strings (the "go to" container)
std::vector<std::string> colour {"Blue", "Red", "Orange"};// Strings can be added at any time with push_back
colour.push_back("Yellow")
However, the STL library also contains other specialized containers such
as: std::deque, std::list, and std::forward_list. Quoting from the
textbook *C++ Primer*:
The idea of having different containers is they offer different
performance trade-offs relative to
- The costs to add or delete elements to the container
- The costs to perform non-sequential access to elements of the
container
The performance aspects of different sequential containers are summarized
neatly in this table copied out of the same book:
Container Description
vector Flexible-size array. Supports fast random access.
Inserting or deleting elements other than at the back may be slow.
deque Double-ended queue. Supports fast random access.
Fast insert/delete at front or back.
list Doubly linked list. Supports only bidirectional sequential access.
Fast insert/delete at any point in the list.
forward_list Singly linked list. Supports only sequential access in one
direction.
Fast insert/delete at any point in the list.
array Fixed-size array. Supports fast random access.
Cannot add or remove elements.
string A specialized container, similar to vector, that contains
characters.
Fast random access. Fast insert/delete at back.
The beauty of the C++ containers is they share many common operations. If
you can avoid random access, and instead use only iterators to access the
elements you can switch easily between some containers. A table of the
supported operations can be found here
<http://www.cplusplus.com/reference/stl/>.
I believe the current implementation in #311
<#311> is most similar to the
std::vector and uses contiguous storage for the underlying string
elements. The comments from @esterjo <https://github.com/esterjo> (#268
(comment)
<#268 (comment)>
and #241 (comment)
<#241 (comment)>)
already address the fact that other implementations might be more suitable
in some usage cases in order to avoid memory fragmentation or support other
operations.
Given that there is not much precedence for such string containers in
Fortran, I find it hard to judge which implementation is most suitable for
typical Fortran usage cases. Getting some feedback from the community could
help guide further (experimental) implementation efforts.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#322>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAN6YR74MYKXW4MLOW2GSNLS7F46TANCNFSM4XVHVWXA>
.
|
Something I forgot to brig up here: I recently wrote a small module that
reads in an entire text file into memory and then splits it up in
individual lines. The memory used consists of one block of memory and
pointers to the start and stop of the lines. My intention is to expand
this, so that it can be used to get data from CSV files and the like. It
relies on a different memory layout than the stringlist_type, because I
know in advance how much memory is required.
Op di 16 feb. 2021 om 08:33 schreef Arjen Markus <arjen.markus895@gmail.com
…:
If we want alternative implementations, tuned to various usage scenarios,
then a common API is definitely desirable. You would simply replace the
type name by the one suitable for the intended use - quite possibly even by
using the rename feature of a module ;) (and for the real, lazy programmers
among us: use an intermediate module to even hide that renaming, so that
there is only one location in the source code that determines which type
will be actually used).
That said, I wonder how we can measure the consequences of the various
implementations: performance is one thing and while not all that easy to
get right, it is definitely measurable. But memory fragmentation is, I
guess, a very different thing. I have no idea how you can measure that - or
what a useful measure is.
We might make it part of the GSoC containers project or even a project in
its own right!
Op ma 15 feb. 2021 om 21:04 schreef Ivan Pribec ***@***.***
>:
> I thought it might be best to create a new issue to discuss alternative
> implementations for the list of strings proposed in #268
> <#268>. As @arjenmarkus
> <https://github.com/arjenmarkus> wrote there:
>
> This proposal considers the features that would be useful for this
> derived type, a list of strings. It does not prescribe the methods for
> implementing the type. Given the ease by which arrays are handled in
> Fortran, it is possible to use allocatable arrays, but perhaps a linked
> list may prove to be more efficient, at least in certain scenarios of use.
> Therefore the proposal concentrates on the usage.
>
> I thought it might be helpful to expand on this topic using some
> analogies with the C++ library. In C++, the standard template library
> provides the special container std::string as a sequence of characters
> supporting fast random access, and fast insert/delete at back of the
> container.
>
> In Fortran we can reach similar *behavior/usage* using the character(len=:),
> allocatable type:
>
> character(len=:), allocatable :: str
>
> ! Automatic allocation upon initialization
> str = "Hello"
>
> ! "Insertion" at the back
> str = str//", world!" ! "Hello, world!"
>
> ! Fast random access
> str(8:8) = "W" ! "Hello, World!"
>
> ! "Deletion"
> str = str(1:len(str)-1) ! "Hello, World"
>
> There are of course also very large differences between the C++
> std::string and the Fortran allocatable character sequences in terms of
> memory management and other things. Some of the differences can be inferred
> from the numerous member functions(
> http://www.cplusplus.com/reference/string/string/) of std::string for
> access, modification, and string operations.
>
> Moving one step further to a "lists of strings" in C++ there are several
> options. For "fixed-size" lists one has:
>
> // Built-in (fixed) size array
> std::string colour[4] = { "Blue", "Red",
> "Orange", "Yellow" };
> // Array of strings (safer, and easier to use than the built-in array)
> std::array<std::string, 4> colour { "Blue", "Red", "Orange",
> "Yellow" };
>
> For dynamic lists of strings the most commonly used option in C++ is
>
> // Vector of strings (the "go to" container)
> std::vector<std::string> colour {"Blue", "Red", "Orange"};// Strings can be added at any time with push_back
> colour.push_back("Yellow")
>
> However, the STL library also contains other specialized containers such
> as: std::deque, std::list, and std::forward_list. Quoting from the
> textbook *C++ Primer*:
>
> The idea of having different containers is they offer different
> performance trade-offs relative to
>
> - The costs to add or delete elements to the container
> - The costs to perform non-sequential access to elements of the
> container
>
> The performance aspects of different sequential containers are summarized
> neatly in this table copied out of the same book:
> Container Description
> vector Flexible-size array. Supports fast random access.
> Inserting or deleting elements other than at the back may be slow.
> deque Double-ended queue. Supports fast random access.
> Fast insert/delete at front or back.
> list Doubly linked list. Supports only bidirectional sequential access.
> Fast insert/delete at any point in the list.
> forward_list Singly linked list. Supports only sequential access in one
> direction.
> Fast insert/delete at any point in the list.
> array Fixed-size array. Supports fast random access.
> Cannot add or remove elements.
> string A specialized container, similar to vector, that contains
> characters.
> Fast random access. Fast insert/delete at back.
>
> The beauty of the C++ containers is they share many common operations. If
> you can avoid random access, and instead use only iterators to access the
> elements you can switch easily between some containers. A table of the
> supported operations can be found here
> <http://www.cplusplus.com/reference/stl/>.
>
> I believe the current implementation in #311
> <#311> is most similar to the
> std::vector and uses contiguous storage for the underlying string
> elements. The comments from @esterjo <https://github.com/esterjo> (#268
> (comment)
> <#268 (comment)>
> and #241 (comment)
> <#241 (comment)>)
> already address the fact that other implementations might be more suitable
> in some usage cases in order to avoid memory fragmentation or support other
> operations.
>
> Given that there is not much precedence for such string containers in
> Fortran, I find it hard to judge which implementation is most suitable for
> typical Fortran usage cases. Getting some feedback from the community could
> help guide further (experimental) implementation efforts.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#322>, or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/AAN6YR74MYKXW4MLOW2GSNLS7F46TANCNFSM4XVHVWXA>
> .
>
|
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
I thought it might be best to create a new issue to discuss alternative implementations for the list of strings proposed in #268. As @arjenmarkus wrote there:
I thought it might be helpful to expand on this topic using some analogies with the C++ library. In C++, the standard template library provides the special container
std::string
as a sequence of characters supporting fast random access, and fast insert/delete at back of the container.In Fortran we can reach similar behavior/usage using the
character(len=:), allocatable
type:There are of course also very large differences between the C++
std::string
and the Fortran allocatable character sequences in terms of memory management and other things. Some of the differences can be inferred from the numerous member functions ofstd::string
for access, modification, and string operations.Moving one step further to a "lists of strings" in C++ there are several options. For "fixed-size" lists one has:
For dynamic lists of strings the most commonly used option in C++ is
However, the STL library also contains other specialized containers such as:
std::deque
,std::list
, andstd::forward_list
. Quoting from the textbook C++ Primer:The performance aspects of different sequential containers are summarized neatly in this table copied out of the same book:
vector
Inserting or deleting elements other than at the back may be slow.
deque
Fast insert/delete at front or back.
list
Fast insert/delete at any point in the list.
forward_list
Fast insert/delete at any point in the list.
array
Cannot add or remove elements.
string
vector
, that contains characters.Fast random access. Fast insert/delete at back.
The beauty of the C++ containers is they share many common operations. If you can avoid random access, and instead use only iterators to access the elements you can switch easily between some containers. A table of the supported operations can be found here.
I believe the current implementation in #311 is most similar to the
std::vector
and uses contiguous storage for the underlying string elements. The comments from @esterjo (#268 (comment) and #241 (comment)) already address the fact that other implementations might be more suitable in some usage cases in order to avoid memory fragmentation or support other operations.Given that there is not much precedence for such string containers in Fortran, I find it hard to judge which implementation is most suitable for typical Fortran usage cases. Getting some feedback from the community could help guide further (experimental) implementation efforts.
The text was updated successfully, but these errors were encountered: