-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: ByteVec string implementation #11235
Conversation
@@ -615,8 +630,7 @@ static inline void gc_wb_back(void *ptr) // ptr isa jl_value_t* | |||
#define jl_gc_unpreserve() | |||
#define jl_gc_n_preserved_values() (0) | |||
|
|||
#define allocb(nb) malloc(nb) | |||
DLLEXPORT jl_value_t *allocobj(size_t sz); | |||
#define allocb(nb) malloc(nb) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this line changing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated whitespace cleanup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the DLLEXPORT allocobj
line also disappeared
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DLLEXPORT jl_bytevec_struct_t jl_bytevec(const uint8_t *data, size_t n) | ||
{ | ||
jl_bytevec_struct_t b; | ||
if (n < 2*sizeof(void*)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
n < sizeof(b.here.data)
to leave space for the null byte?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the length of the string is encoded in the last byte.
there seems to be some confusion in the codegen about the proper layout for this object composed of a recursive immutable:
it appears that the definition of ByteVec is corrupted and
|
just a thought: now that tuples are inline immutable, it might be interesting to just declare this:
and then make jeff go optimize it (for example #11187) |
Yes, @JeffBezanson had the same thought. I was hoping to get somewhere with this approach before having to wait for that optimization, but maybe it's better just to do it that way. Another thought is to just have a ByteVec type that always has a buffer but avoids all the overhead of Arrays. But we've already got too many vector-like types around, so that's kind of unappealing. |
This blog post may be of some interest: https://blog.mozilla.org/ejpbruel/2012/02/06/how-strings-are-implemented-in-spidermonkey-2/ |
@simonster Great post... interesting that 1) they consider string handling performance so important that they are willing to have some extra complexity 2) \0 termination is only guaranteed for certain types of strings |
My tl;dr interpretation is they merge our ByteString/ByteVec/RopeString/SubString types into a single unified interface called a JavaScript string, and use copy-on-read to ensure O(n) efficiency of the common += operations, while preserving memory efficiency and \0 termination at all user-visible points. |
@JeffBezanson, @Keno, @vtjnash, I'm having trouble debugging what's going on here – I'm pushing it in an incomplete state with lots of debugging output in hopes someone can help figure out what's going on here. So this croaks during system image build as soon as it hits code that uses a Regex:
What you can see here is that it's hitting the
r"..."
macro just fine here:https://github.com/JuliaLang/julia/blob/251202fa8f8445053f66fd/base/regex.jl#L72-L78
At this point,
pattern
is anASCIIString
with a data field of0xffffffffffffffeb0000000106c75770
, as expected – it's a negative length and a pointer to the actual string data, which starts with0x7c6a7c417c612825295e7c5d255e5b28
– or in other words, the data([^%]|^)%(a|A|j|
, which is the initial part of a regex pattern that gets passed to ther_str
macro. So far so good. This in turn invokes theRegex
constructor, here:https://github.com/JuliaLang/julia/blob/251202fa8f8445053f66fd/base/regex.jl#L18-L32
But now, surprisingly,
pattern
is no longer0xffffffffffffffeb0000000106c75770
but seems to be the value that this used to point to – i.e.0x7c6a7c417c612825295e7c5d255e5b28
. I have no idea how this is happening, since the same value ofpattern
is passed from the macro to the constructor function. Any ideas?