WIP: ByteVec string implementation #11235

Closed
wants to merge 8 commits into from

StefanKarpinski
Member

@JeffBezanson, @Keno, @vtjnash, I'm having trouble debugging this, so I'm pushing it in an incomplete state with lots of debugging output in the hope that someone can help figure out what's going on. It croaks during the system image build as soon as it hits code that uses a Regex:

libc.jl
"A"
ASCIIString
Int128(0xffffffffffffffeb0000000106c75770)
Int128(0x7c6a7c417c612825295e7c5d255e5b28)
()
"B"
ASCIIString
Int128(0x7c6a7c417c612825295e7c5d255e5b28)
"C"
ASCIIString
Int128(0x7c6a7c417c612825295e7c5d255e5b28)
"D"
Int128(0x7c6a7c417c612825295e7c5d255e5b28)

signal (11): Segmentation fault: 11
length at bytevec.jl:18
print_to_string at string.jl:24
compile at pcre.jl:95

What you can see is that it hits the r"..." macro just fine:

https://github.com/JuliaLang/julia/blob/251202fa8f8445053f66fd/base/regex.jl#L72-L78

At this point, pattern is an ASCIIString with a data field of 0xffffffffffffffeb0000000106c75770, as expected: the high 64 bits (0xffffffffffffffeb, i.e. -21) are a negative length and the low 64 bits (0x0000000106c75770) are a pointer to the actual string data. That data starts with 0x7c6a7c417c612825295e7c5d255e5b28, which, read least significant byte first, is ([^%]|^)%(a|A|j|, the initial part of a regex pattern that gets passed to the r_str macro. So far so good. This in turn invokes the Regex constructor, here:
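The two 128-bit values in the debug output can be checked by hand. Below is a minimal sketch in C, assuming the values are read in little-endian byte order; the type `u128_halves` and both helper names are invented for this illustration and are not part of the PR.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit value, split into two 64-bit halves. */
typedef struct {
    uint64_t lo;  /* low 64 bits  */
    uint64_t hi;  /* high 64 bits */
} u128_halves;

/* Reinterpret the high half as a two's-complement signed integer
   without relying on an implementation-defined conversion. */
static int64_t signed_high_word(u128_halves v)
{
    if (v.hi > (uint64_t)INT64_MAX)
        return -(int64_t)(~v.hi) - 1;
    return (int64_t)v.hi;
}

/* Write the 16 bytes of v, least significant byte first, into out
   and NUL-terminate it; out must have room for 17 chars. */
static void le_bytes_to_text(u128_halves v, char out[17])
{
    assert(out != NULL);
    for (int i = 0; i < 8; i++) {
        out[i]     = (char)((v.lo >> (8 * i)) & 0xff);
        out[8 + i] = (char)((v.hi >> (8 * i)) & 0xff);
    }
    out[16] = '\0';
}
```

With `{ 0x0000000106c75770, 0xffffffffffffffeb }` the signed high word is -21, and with `{ 0x295e7c5d255e5b28, 0x7c6a7c417c612825 }` the decoded text is `([^%]|^)%(a|A|j|`.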

https://github.com/JuliaLang/julia/blob/251202fa8f8445053f66fd/base/regex.jl#L18-L32

But now, surprisingly, pattern is no longer 0xffffffffffffffeb0000000106c75770 but seems to be the value that this used to point to – i.e. 0x7c6a7c417c612825295e7c5d255e5b28. I have no idea how this is happening, since the same value of pattern is passed from the macro to the constructor function. Any ideas?

@@ -615,8 +630,7 @@ static inline void gc_wb_back(void *ptr) // ptr isa jl_value_t*
 #define jl_gc_unpreserve()
 #define jl_gc_n_preserved_values() (0)

-#define allocb(nb) malloc(nb)
-DLLEXPORT jl_value_t *allocobj(size_t sz);
+#define allocb(nb) malloc(nb)
Member

why is this line changing?

Member Author

unrelated whitespace cleanup

Member

the DLLEXPORT allocobj line also disappeared

Member Author

It was declared twice:

DLLEXPORT jl_value_t *allocobj(size_t sz);

DLLEXPORT jl_value_t *allocobj(size_t sz);

Since it only needs to be declared once, I deleted one of the declarations.

DLLEXPORT jl_bytevec_struct_t jl_bytevec(const uint8_t *data, size_t n)
{
    jl_bytevec_struct_t b;
    if (n < 2*sizeof(void*)) {
Member

n < sizeof(b.here.data) to leave space for the null byte?

Member Author

No, the length of the string is encoded in the last byte.
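The idea in this exchange (short strings stored inline with the length in the last byte, longer ones behind a pointer with a negative length) can be sketched as below. This is a guess at the scheme, not the PR's `jl_bytevec_struct_t`, whose definition does not appear in this thread. The union `small_bytevec`, the `there` member, and the `sbv_*` helpers are all hypothetical, and the inline test assumes a little-endian target so that the last byte overlaps the sign byte of the negative length.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SBV_BYTES (2 * sizeof(void *))  /* 16 on a 64-bit target */

/* Hypothetical two-word byte vector; NOT the PR's actual struct. */
typedef union {
    struct { uint8_t data[SBV_BYTES - 1]; uint8_t length; } here;  /* inline */
    struct { uint8_t *data; intptr_t neglen; } there;              /* heap   */
} small_bytevec;

static small_bytevec sbv_make(const uint8_t *src, size_t n)
{
    small_bytevec b;
    assert(src != NULL || n == 0);
    memset(&b, 0, sizeof b);
    if (n < SBV_BYTES) {
        /* Short: bytes live in the value, length goes in the last byte. */
        if (n > 0)
            memcpy(b.here.data, src, n);
        b.here.length = (uint8_t)n;
    } else {
        /* Long: heap buffer plus a negated length in the second word. */
        b.there.data = malloc(n + 1);
        if (b.there.data == NULL)
            abort();
        memcpy(b.there.data, src, n);
        b.there.data[n] = '\0';
        b.there.neglen = -(intptr_t)n;
    }
    return b;
}

/* Little-endian assumption: the last byte is the top byte of neglen,
   so its high bit is set exactly when the negative length is in use. */
static int sbv_is_inline(const small_bytevec *b)
{
    return (b->here.length & 0x80) == 0;
}

static size_t sbv_length(const small_bytevec *b)
{
    return sbv_is_inline(b) ? b->here.length : (size_t)(-b->there.neglen);
}

static const uint8_t *sbv_data(const small_bytevec *b)
{
    return sbv_is_inline(b) ? b->here.data : b->there.data;
}

static void sbv_free(small_bytevec *b)
{
    if (!sbv_is_inline(b))
        free(b->there.data);
}
```

Under these assumptions, any input shorter than `2*sizeof(void*)` bytes never touches the allocator, which is why the `n < 2*sizeof(void*)` comparison does not need to reserve an extra byte for the length.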

@vtjnash
Member

vtjnash commented May 11, 2015

there seems to be some confusion in the codegen about the proper layout for this object composed of a recursive immutable:

# pattern.data.x
  %27 = extractvalue %ASCIIString %1, 0, !dbg !20
  %28 = bitcast %jl_value_t* %27 to i128*, !dbg !20
  %29 = load i128* %28, align 8, !dbg !20, !tbaa !21, !julia_type !23

(lldb) p *(jl_datatype_t*)jl_typeof(args[1])
(jl_datatype_t) $5 = {
  fieldptr0 = {}
  name = 0x0000000103a0d6d0
  super = 0x0000000103a4faf0
  parameters = 0x0000000103a08010
  types = 0x0000000103a562d0
  instance = 0x0000000000000000
  size = 16
  abstract = '\0'
  mutabl = '\0'
  pointerfree = '\x01'
  nfields = 1
  ninitialized = 1
  alignment = 8
  uid = 144
  struct_decl = 0x00000003082953f0
  ditype = 0x0000000000000000
  fields = {}
}
(lldb) p ((Type*)((jl_datatype_t*)jl_typeof(args[1]))->struct_decl)->dump()
%ASCIIString = type { %jl_value_t* }
(lldb) p jl_(((jl_datatype_t*)jl_typeof(args[1]))->name)
TypeName(name=:ASCIIString, module=Core, names=svec(:data), primary=ASCIIString, cache=svec(), linearcache=svec(), uid=143)
(lldb) p jl_(((jl_datatype_t*)jl_typeof(args[1]))->parameters)
svec()
(lldb) p jl_(((jl_datatype_t*)jl_typeof(args[1]))->types)
svec(ByteVec)

it appears that the definition of ByteVec is corrupted and pointerfree is unexpectedly set to false at this point:

(gdb) p jl_(0x7ffdf2c572b0)
ByteVec
$19 = void
(gdb) p *((jl_datatype_t*)0x7ffdf2c572b0)
$22 = {fieldptr0 = 0x7ffdf2c572b0, name = 0x7ffdf2c5ce10, super = 0x7ffdf2c57310, parameters = 0x7ffdf2c58010, types = 0x7ffdf2c66130, 
  instance = 0x0, size = 16, abstract = 0 '\000', mutabl = 0 '\000', pointerfree = 0 '\000', nfields = 1, ninitialized = 1, 
  alignment = 8, uid = 103, struct_decl = 0xa81f600, ditype = 0x0, fields = 0x7ffdf2c57300}
(gdb) p ((jl_datatype_t*)0x7ffdf2c572b0)->fields[0]
$23 = {offset = 0, size = 16, isptr = 0}
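One plausible reading of the extractvalue/bitcast/load sequence quoted in this comment is that the generated code takes the first word of the 16-byte field as an address and loads through it, which would produce exactly the symptom reported at the top of the thread: the value seen is what the field used to point to. The sketch below reproduces that mismatch in plain C. The `heap_field` struct and both function names are hypothetical and only model the effect, not Julia's codegen.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical two-word field: pointer to bytes plus a negated length. */
typedef struct {
    const unsigned char *data;
    intptr_t neglen;
} heap_field;

/* Intended read: the field's own bytes are the value.
   out must hold sizeof(heap_field) bytes. */
static void load_inline(const heap_field *f, unsigned char *out)
{
    assert(f != NULL && out != NULL);
    memcpy(out, f, sizeof *f);
}

/* Mismatched read: treat the first word of the field as an address and
   load sizeof(heap_field) bytes from there instead. The pointed-to
   buffer must be at least that long. */
static void load_through_first_word(const heap_field *f, unsigned char *out)
{
    const unsigned char *target;
    assert(f != NULL && out != NULL);
    memcpy(&target, f, sizeof target);  /* first word, taken as an address */
    memcpy(out, target, sizeof *f);     /* bytes of the string, not the field */
}
```

For a field pointing at a string, `load_inline` returns the pointer and negated length, while `load_through_first_word` returns the leading bytes of the string itself.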

@vtjnash
Member

vtjnash commented May 12, 2015

just a thought: now that tuples are inline immutable, it might be interesting to just declare this:

immutable ASCIIString <: AbstractString
  data::NTuple{UInt8}
end

and then make jeff go optimize it (for example #11187)

@StefanKarpinski
Member Author

just a thought: now that tuples are inline immutable...

Yes, @JeffBezanson had the same thought. I was hoping to get somewhere with this approach before having to wait for that optimization, but maybe it's better just to do it that way. Another thought is to just have a ByteVec type that always has a buffer but avoids all the overhead of Arrays. But we've already got too many vector-like types around, so that's kind of unappealing.

@simonster
Member

@ScottPJones
Contributor

@simonster Great post... interesting that 1) they consider string handling performance so important that they are willing to have some extra complexity 2) \0 termination is only guaranteed for certain types of strings

@vtjnash
Member

vtjnash commented May 23, 2015

My tl;dr interpretation is they merge our ByteString/ByteVec/RopeString/SubString types into a single unified interface called a JavaScript string, and use copy-on-read to ensure O(n) efficiency of the common += operations, while preserving memory efficiency and \0 termination at all user-visible points.

@DilumAluthge DilumAluthge deleted the sk/str branch March 25, 2021 22:12