Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Big strings in string columns #154

Closed
wants to merge 76 commits into from
Closed

Conversation

astigsen
Copy link
Contributor

Note: Contains everything from https://github.com/Tightdb/tightdb/tree/breaking-updates

Here is what is really in this pull request: https://github.com/astigsen/tightdb/compare/Tightdb:breaking-updates...big_blobs

This adds support for big strings in string columns. It used to be that each leaf in a string column would contain MAX_LIST_SIZE strings, which made the leafs very big if the individual strings were big.

This caused problems, both in terms of performance, having to copy-on-write these big leafs on every change, but also by putting a relatively small limit on the sizes of the strings. The fact that all these strings was put together in one leaf, meant that the total length could not be longer that what could be written in the size field in the header (3 bytes).

This PR add a new step for "big" strings which puts each string in it's own Array. This fixes the performance issues and extends the possible size of individual strings to what fits in the Arrays size field (this limitation will be fixed in a future PR).

This same approach will also be used for binary columns.

NOTE: This is a non-breaking change in the file format.
NOTE: This is effectively a breaking change in the file format.

kspangsege and others added 30 commits April 23, 2013 02:00
Conflicts:
	src/tightdb/array_string.hpp
…eral).

Also got rid of Column::NodeGetOffsets(), Column::NodeGetRefs(), Array::GetSubArray(),
and the const-violating attempt at a moving constructor for Array.
Also, Table::to_dot() and friends had a major overhaul.
Also no more warnings from GCC 4.4 23/64-bit (Ubuntu 10.04).
Conflicts:
	src/tightdb/column_mixed_tpl.hpp
	test/test_array_blob.cpp
	test/test_table.cpp
@astigsen
Copy link
Contributor Author

I think that it will be most effective for you to do the merge as you have the best overview of how the b-tree handling has been changed. My changes should be limited to string columns.

@@ -133,6 +151,7 @@ class AdaptiveStringColumn: public ColumnBase {

private:
static const size_t short_string_max_size = 15;
static const size_t long_string_max_size = 63;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you come up with "63" as the maximum long string size?

Conflicts:
	test/test_column.cpp
	test/test_column_mixed.cpp
@kspangsege
Copy link
Contributor

Sorry to say, but there Valgrind finds some issues. See attached.

On Mon, Sep 23, 2013 at 4:04 AM, astigsen notifications@github.com wrote:

I think that it will be most effective for you to do the merge as you have
the best overview of how the b-tree handling has been changed. My changes
should be limited to string columns.


Reply to this email directly or view it on GitHubhttps://github.com//pull/154#issuecomment-24895893
.

@kspangsege
Copy link
Contributor

These memory bugs are relative to my last commit to your branch, where I
have merged in the latest stuff from master, and fixed a few compile-time
errors.

As far as I can tell, you have created a few more branches based on top of
"big_blobs". Please pull these changes into those other branches and check.

I'll continue to attempt a merge "big_blobs" and "node-consolidation", but
I need a clean slate from Valgrind to be sure that I have done it right.

By the way, "big_blobs" also has a memory leak reported by Valgrind. Since
it does not occur with test_query.cpp
as the only enabled test file, the leak probably occurs only in
test_array_blobs.cpp:

==27916== 640 bytes in 5 blocks are definitely lost in loss record 1 of 1
==27916== at 0x4C2B6CD: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27916== by 0x4F225D4: (anonymous
namespace)::DefaultAllocator::alloc(unsigned long) (alloc.cpp:43)
==27916== by 0x4F27F10:
tightdb::Array::create_empty_array(tightdb::Array::Type,
tightdb::Array::WidthType, tightdb::Allocator&) (array.cpp:1242)
==27916== by 0x410EE3:
tightdb::Array::create_empty_array(tightdb::Array::Type,
tightdb::Allocator&) (array.hpp:1580)
==27916== by 0x410AC7: tightdb::Array::create(tightdb::Array::Type)
(array.hpp:1072)
==27916== by 0x410A73: tightdb::Array::Array(tightdb::Array::Type,
tightdb::ArrayParent_, unsigned long, tightdb::Allocator&) (array.hpp:1018)
==27916== by 0x504A3D1:
tightdb::ArrayBlob::ArrayBlob(tightdb::ArrayParent_, unsigned long,
tightdb::Allocator&) (array_blob.hpp:77)
==27916== by 0x504B88F: tightdb::ArrayBigBlobs::add(tightdb::BinaryData,
bool) (array_blobs_big.cpp:34)
==27916== by 0x40E47E:
db_setup_big_blobsArrayBigBlobsEraseHelper::RunImpl()
(test_array_blobs_big.cpp:163)
==27916== by 0x412794: void
UnitTest::ExecuteTest<db_setup_big_blobsArrayBigBlobsEraseHelper>(db_setup_big_blobsArrayBigBlobsEraseHelper&,
UnitTest::TestDetails const&) (ExecuteTest.h:25)
==27916== by 0x40E1FC:
Testdb_setup_big_blobsArrayBigBlobsErase::RunImpl() const
(test_array_blobs_big.cpp:159)
==27916== by 0x413881: void
UnitTest::ExecuteTestUnitTest::Test(UnitTest::Test&,
UnitTest::TestDetails const&) (in
/home/kristian/tightdb/test/tightdb-tests-dbg)

Note that none of these memory errors occur in master.

On Mon, Sep 23, 2013 at 4:47 PM, Kristian Spangsege ks@tightdb.com wrote:

Sorry to say, but there Valgrind finds some issues. See attached.

On Mon, Sep 23, 2013 at 4:04 AM, astigsen notifications@github.comwrote:

I think that it will be most effective for you to do the merge as you
have the best overview of how the b-tree handling has been changed. My
changes should be limited to string columns.


Reply to this email directly or view it on GitHubhttps://github.com//pull/154#issuecomment-24895893
.

@kspangsege
Copy link
Contributor

I have merged node-consolidation into your branch. Here the status:

ArrayBigBlobs::to_dot() needs to be implemented.

Error in query engine. This testcase

TIGHTDB_TABLE_1(TupleTableType,
                second, String)

TEST(TestQueryStrEnum)
{
    TupleTableType t;
    for (size_t i = 0; i < 69; ++i)
        t.add("");
    t.optimize();
    t.where().second.equal("").count();
}

gets you to the ASSERT below

void StringNode<Equal>::clear_leaf_state()
{
    switch (m_leaf_type) {
        case AdaptiveStringColumn::leaf_type_Small:
            delete static_cast<ArrayString*>(m_leaf);
            return;
        case AdaptiveStringColumn::leaf_type_Medium:
            delete static_cast<ArrayStringLong*>(m_leaf);
            return;
        case AdaptiveStringColumn::leaf_type_Big:
            delete static_cast<ArrayBigBlobs*>(m_leaf);
            return;
    }
    TIGHTDB_ASSERT(false);
}

@kspangsege
Copy link
Contributor

Please pull my changes into your other branches that are based on this one.

@kspangsege
Copy link
Contributor

I have attached the latest from Valgrind against the current state of your
repo on GitHub (that is, with node-consolidation merged in).

Note that the memory errors might be related to the mentioned assert-bug in
query_engine. I don't know if they are.

Everything except the leaks comes from test_query.cpp.

On Mon, Sep 23, 2013 at 5:32 PM, Kristian Spangsege ks@tightdb.com wrote:

These memory bugs are relative to my last commit to your branch, where I
have merged in the latest stuff from master, and fixed a few compile-time
errors.

As far as I can tell, you have created a few more branches based on top of
"big_blobs". Please pull these changes into those other branches and check.

I'll continue to attempt a merge "big_blobs" and "node-consolidation", but
I need a clean slate from Valgrind to be sure that I have done it right.

By the way, "big_blobs" also has a memory leak reported by Valgrind. Since
it does not occur with test_query.cpp
as the only enabled test file, the leak probably occurs only in
test_array_blobs.cpp:

==27916== 640 bytes in 5 blocks are definitely lost in loss record 1 of 1
==27916== at 0x4C2B6CD: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27916== by 0x4F225D4: (anonymous
namespace)::DefaultAllocator::alloc(unsigned long) (alloc.cpp:43)
==27916== by 0x4F27F10:
tightdb::Array::create_empty_array(tightdb::Array::Type,
tightdb::Array::WidthType, tightdb::Allocator&) (array.cpp:1242)
==27916== by 0x410EE3:
tightdb::Array::create_empty_array(tightdb::Array::Type,
tightdb::Allocator&) (array.hpp:1580)
==27916== by 0x410AC7: tightdb::Array::create(tightdb::Array::Type)
(array.hpp:1072)
==27916== by 0x410A73: tightdb::Array::Array(tightdb::Array::Type,
tightdb::ArrayParent_, unsigned long, tightdb::Allocator&) (array.hpp:1018)
==27916== by 0x504A3D1:
tightdb::ArrayBlob::ArrayBlob(tightdb::ArrayParent_, unsigned long,
tightdb::Allocator&) (array_blob.hpp:77)
==27916== by 0x504B88F:
tightdb::ArrayBigBlobs::add(tightdb::BinaryData, bool)
(array_blobs_big.cpp:34)
==27916== by 0x40E47E:
db_setup_big_blobsArrayBigBlobsEraseHelper::RunImpl()
(test_array_blobs_big.cpp:163)
==27916== by 0x412794: void
UnitTest::ExecuteTest<db_setup_big_blobsArrayBigBlobsEraseHelper>(db_setup_big_blobsArrayBigBlobsEraseHelper&,
UnitTest::TestDetails const&) (ExecuteTest.h:25)
==27916== by 0x40E1FC:
Testdb_setup_big_blobsArrayBigBlobsErase::RunImpl() const
(test_array_blobs_big.cpp:159)
==27916== by 0x413881: void
UnitTest::ExecuteTestUnitTest::Test(UnitTest::Test&,
UnitTest::TestDetails const&) (in
/home/kristian/tightdb/test/tightdb-tests-dbg)

Note that none of these memory errors occur in master.

On Mon, Sep 23, 2013 at 4:47 PM, Kristian Spangsege ks@tightdb.comwrote:

Sorry to say, but there Valgrind finds some issues. See attached.

On Mon, Sep 23, 2013 at 4:04 AM, astigsen notifications@github.comwrote:

I think that it will be most effective for you to do the merge as you
have the best overview of how the b-tree handling has been changed. My
changes should be limited to string columns.


Reply to this email directly or view it on GitHubhttps://github.com//pull/154#issuecomment-24895893
.

@astigsen
Copy link
Contributor Author

I have fixed the issues in the unit tests.

I did try to merge it into the big_blobs_binary branch, but there were too many btree related conflicts needing understanding of your btree related changes. I think that you will be the most appropriate to do this merge.

@kspangsege
Copy link
Contributor

Merged into https://github.com/Tightdb/tightdb/tree/breaking-updates

This concludes this pull request.

@kspangsege kspangsege closed this Sep 24, 2013
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 25, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants