WIP: ByteVec string implementation #11235

StefanKarpinski · 2015-05-11T20:57:19Z

@JeffBezanson, @Keno, @vtjnash, I'm having trouble debugging this, so I'm pushing it in an incomplete state with lots of debugging output in the hope that someone can help figure out what's going on. It croaks during the system image build as soon as it hits code that uses a Regex:

libc.jl
"A"
ASCIIString
Int128(0xffffffffffffffeb0000000106c75770)
Int128(0x7c6a7c417c612825295e7c5d255e5b28)
()
"B"
ASCIIString
Int128(0x7c6a7c417c612825295e7c5d255e5b28)
"C"
ASCIIString
Int128(0x7c6a7c417c612825295e7c5d255e5b28)
"D"
Int128(0x7c6a7c417c612825295e7c5d255e5b28)

signal (11): Segmentation fault: 11
length at bytevec.jl:18
print_to_string at string.jl:24
compile at pcre.jl:95

As you can see, it hits the r"..." macro just fine:

https://github.com/JuliaLang/julia/blob/251202fa8f8445053f66fd/base/regex.jl#L72-L78

At this point, pattern is an ASCIIString with a data field of 0xffffffffffffffeb0000000106c75770, as expected: the upper half is a negative length and the lower half is a pointer to the actual string data. That data starts with 0x7c6a7c417c612825295e7c5d255e5b28, or in other words the bytes ([^%]|^)%(a|A|j|, which is the initial part of the regex pattern that gets passed to the r_str macro. So far so good. This in turn invokes the Regex constructor, here:

https://github.com/JuliaLang/julia/blob/251202fa8f8445053f66fd/base/regex.jl#L18-L32

But now, surprisingly, pattern is no longer 0xffffffffffffffeb0000000106c75770 but seems to be the value that this used to point to – i.e. 0x7c6a7c417c612825295e7c5d255e5b28. I have no idea how this is happening, since the same value of pattern is passed from the macro to the constructor function. Any ideas?
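For readers following along, the two 128-bit values in the debug output can be decoded by hand. The following is only a sketch: the `dump128_t` type and both helper names are made up for illustration and are not part of the PR, and it assumes the values are printed most significant byte first while memory is little-endian.

```c
#include <stdint.h>

/* A 128-bit debug value split into halves: hi = upper 64 bits, lo = lower 64 bits.
   Hypothetical helper type, not something from the PR. */
typedef struct { uint64_t hi, lo; } dump128_t;

/* Recover the length from a negated length stored in the upper half.
   Uses unsigned two's-complement negation so the arithmetic is fully defined. */
uint64_t dump_length(dump128_t v)
{
    return ~v.hi + 1;
}

/* Write the 16 bytes of v in little-endian memory order into out,
   followed by a terminating NUL so the result can be used as a C string. */
void dump_bytes(dump128_t v, char out[17])
{
    for (int i = 0; i < 8; i++) {
        out[i]     = (char)((v.lo >> (8 * i)) & 0xff); /* memory bytes 0..7  */
        out[8 + i] = (char)((v.hi >> (8 * i)) & 0xff); /* memory bytes 8..15 */
    }
    out[16] = '\0';
}
```

Applied to 0xffffffffffffffeb0000000106c75770, `dump_length` gives 21 (the upper half is -21), and the lower half 0x0000000106c75770 is the pointer. Applied to 0x7c6a7c417c612825295e7c5d255e5b28, `dump_bytes` gives the first 16 characters of the pattern.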

vtjnash · 2015-05-11T21:11:07Z

src/julia.h

@@ -615,8 +630,7 @@ static inline void gc_wb_back(void *ptr) // ptr isa jl_value_t*
 #define jl_gc_unpreserve()
 #define jl_gc_n_preserved_values() (0)

-#define allocb(nb)    malloc(nb)
-DLLEXPORT jl_value_t *allocobj(size_t sz);
+#define allocb(nb) malloc(nb)


why is this line changing?

unrelated whitespace cleanup

the DLLEXPORT allocobj line also disappeared

It was declared twice:

julia/src/julia.h, line 563 in b458ea3:

DLLEXPORT jl_value_t *allocobj(size_t sz);

julia/src/julia.h, line 619 in b458ea3:

DLLEXPORT jl_value_t *allocobj(size_t sz);

Since it only needs to be declared once, I deleted one of the declarations.

vtjnash · 2015-05-11T21:36:59Z

src/alloc.c

+DLLEXPORT jl_bytevec_struct_t jl_bytevec(const uint8_t *data, size_t n)
+{
+    jl_bytevec_struct_t b;
+    if (n < 2*sizeof(void*)) {


n < sizeof(b.here.data) to leave space for the null byte?

No, the length of the string is encoded in the last byte.
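A rough sketch of the kind of two-word layout being discussed, for anyone reading the thread without the diff open. This is not the PR's code: the type and function names are invented, and the layout is an assumption pieced together from the comments in this thread (short strings stored inline with the length in the last byte, longer ones stored as a pointer plus a negated length). It also assumes a little-endian machine, where the last inline byte overlaps the most significant byte of the negated length, so its top bit tells the two cases apart.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical two-word byte vector; names are illustrative, not Julia's. */
typedef union {
    struct {
        uint8_t data[2 * sizeof(void *) - 1]; /* inline bytes */
        uint8_t length;                       /* last byte holds the inline length */
    } here;
    struct {
        const uint8_t *data;   /* pointer to the external bytes */
        intptr_t neg_length;   /* negated length; its sign bit marks "remote" */
    } there;
} sketch_bytevec_t;

sketch_bytevec_t sketch_bytevec(const uint8_t *data, size_t n)
{
    sketch_bytevec_t b;
    memset(&b, 0, sizeof b);
    if (n < 2 * sizeof(void *)) {
        /* n is at most 2*sizeof(void*) - 1, so it fits in here.data */
        memcpy(b.here.data, data, n);
        b.here.length = (uint8_t)n;
    } else {
        b.there.data = data;               /* borrowed, not copied, in this sketch */
        b.there.neg_length = -(intptr_t)n;
    }
    return b;
}

/* Inline lengths are small, so the top bit of the last byte is clear;
   a negative neg_length sets it (little-endian assumption). */
int sketch_is_inline(const sketch_bytevec_t *b)
{
    return (b->here.length & 0x80) == 0;
}

size_t sketch_length(const sketch_bytevec_t *b)
{
    return sketch_is_inline(b) ? (size_t)b->here.length
                               : (size_t)(-b->there.neg_length);
}
```

Under these assumptions the `n < 2*sizeof(void*)` test admits strings of up to 15 bytes on a 64-bit machine, because the 16th byte is taken by the length rather than by a terminator.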

vtjnash · 2015-05-11T23:24:19Z

there seems to be some confusion in the codegen about the proper layout for this object composed of a recursive immutable:

# pattern.data.x
  %27 = extractvalue %ASCIIString %1, 0, !dbg !20
  %28 = bitcast %jl_value_t* %27 to i128*, !dbg !20
  %29 = load i128* %28, align 8, !dbg !20, !tbaa !21, !julia_type !23

(lldb) p *(jl_datatype_t*)jl_typeof(args[1])
(jl_datatype_t) $5 = {
  fieldptr0 = {}
  name = 0x0000000103a0d6d0
  super = 0x0000000103a4faf0
  parameters = 0x0000000103a08010
  types = 0x0000000103a562d0
  instance = 0x0000000000000000
  size = 16
  abstract = '\0'
  mutabl = '\0'
  pointerfree = '\x01'
  nfields = 1
  ninitialized = 1
  alignment = 8
  uid = 144
  struct_decl = 0x00000003082953f0
  ditype = 0x0000000000000000
  fields = {}
}
(lldb) p ((Type*)((jl_datatype_t*)jl_typeof(args[1]))->struct_decl)->dump()
%ASCIIString = type { %jl_value_t* }
(lldb) p jl_(((jl_datatype_t*)jl_typeof(args[1]))->name)
TypeName(name=:ASCIIString, module=Core, names=svec(:data), primary=ASCIIString, cache=svec(), linearcache=svec(), uid=143)
(lldb) p jl_(((jl_datatype_t*)jl_typeof(args[1]))->parameters)
svec()
(lldb) p jl_(((jl_datatype_t*)jl_typeof(args[1]))->types)
svec(ByteVec)

it appears that the definition of ByteVec is corrupted and pointerfree is unexpectedly set to false at this point:

(gdb) p jl_(0x7ffdf2c572b0)
ByteVec
$19 = void
(gdb) p *((jl_datatype_t*)0x7ffdf2c572b0)
$22 = {fieldptr0 = 0x7ffdf2c572b0, name = 0x7ffdf2c5ce10, super = 0x7ffdf2c57310, parameters = 0x7ffdf2c58010, types = 0x7ffdf2c66130, 
  instance = 0x0, size = 16, abstract = 0 '\000', mutabl = 0 '\000', pointerfree = 0 '\000', nfields = 1, ninitialized = 1, 
  alignment = 8, uid = 103, struct_decl = 0xa81f600, ditype = 0x0, fields = 0x7ffdf2c57300}
(gdb) p ((jl_datatype_t*)0x7ffdf2c572b0)->fields[0]
$23 = {offset = 0, size = 16, isptr = 0}
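If the field really is treated as a boxed pointer rather than as 16 inline bytes, that would produce exactly the symptom in the original report. The sketch below illustrates the difference in plain C; the names are hypothetical, it assumes a 64-bit little-endian machine, and it is only a model of the two interpretations, not of what codegen actually emits.

```c
#include <stdint.h>
#include <string.h>

/* 16-byte value split into halves; lo is at the lower address. */
typedef struct { uint64_t lo, hi; } sketch_u128_t;

/* Inline (pointer-free) reading: the 16 bytes of the field are the value. */
sketch_u128_t sketch_read_inline(const void *field)
{
    sketch_u128_t v;
    memcpy(&v, field, sizeof v);
    return v;
}

/* Boxed reading, roughly what "bitcast to i128*, then load" amounts to:
   take the first word of the field as a pointer and load 16 bytes from
   wherever it points. */
sketch_u128_t sketch_read_boxed(const void *field)
{
    const void *target;
    sketch_u128_t v;
    memcpy(&target, field, sizeof target);
    memcpy(&v, target, sizeof v);
    return v;
}
```

With a field whose low word points at the pattern bytes and whose high word is -21, the inline reading returns the pointer and negative length, while the boxed reading returns the first 16 bytes of the string, matching the 0x7c6a7c417c612825295e7c5d255e5b28 value seen in the Regex constructor.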

vtjnash · 2015-05-12T03:10:32Z

just a thought: now that tuples are inline immutable, it might be interesting to just declare this:

immutable ASCIIString <: AbstractString
  data::NTuple{UInt8}
end

and then make jeff go optimize it (for example #11187)

StefanKarpinski · 2015-05-12T11:14:24Z

just a thought: now that tuples are inline immutable...

Yes, @JeffBezanson had the same thought. I was hoping to get somewhere with this approach before having to wait for that optimization, but maybe it's better just to do it that way. Another thought is to just have a ByteVec type that always has a buffer but avoids all the overhead of Arrays. But we've already got too many vector-like types around, so that's kind of unappealing.

simonster · 2015-05-23T20:52:01Z

This blog post may be of some interest: https://blog.mozilla.org/ejpbruel/2012/02/06/how-strings-are-implemented-in-spidermonkey-2/

ScottPJones · 2015-05-23T21:03:23Z

@simonster Great post... interesting that 1) they consider string handling performance so important that they are willing to have some extra complexity 2) \0 termination is only guaranteed for certain types of strings

vtjnash · 2015-05-23T21:28:47Z

My tl;dr interpretation is they merge our ByteString/ByteVec/RopeString/SubString types into a single unified interface called a JavaScript string, and use copy-on-read to ensure O(n) efficiency of the common += operations, while preserving memory efficiency and \0 termination at all user-visible points.

StefanKarpinski added 7 commits May 7, 2015 19:10

Str: immediate / remote string type based on ByteVec type.

8d8a4e5

remove Str type and methods

c3a4fe1

wip

2c1a69d

wip

727c440

wip

98d4198

wip

685e63e

wip

251202f

vtjnash reviewed May 11, 2015

wip

ed760e4

vtjnash reviewed May 11, 2015

StefanKarpinski closed this Jun 21, 2016

DilumAluthge deleted the sk/str branch March 25, 2021 22:12
