Parallelize index creation #346

erizocosmico · 2018-09-05T12:47:22Z

Goals

Now that we have partitions, we can parallelise index creation so that it takes much less time than building the indexes sequentially.

Challenges

Each index is a big matrix of 1s and 0s: the columns correspond to the database rows, and the rows correspond to the unique values of the particular table field.

We also keep a mapping from each value to its row ID in the bitmap, so the row IDs need to be tracked as well.

The problem comes from the columns: the column ID is incremented sequentially, so workers processing partitions in parallel would all need to agree on the next value.
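The layout described above can be pictured with a small in-memory sketch. This is only an illustration: `bitmapIndex` and its methods are hypothetical names, and plain Go maps stand in for the real bitmap storage and for the value-to-row-ID mapping.

```go
package main

import "fmt"

// bitmapIndex is a toy stand-in for one indexed field.
// All names here are hypothetical and not taken from the pilosa driver.
type bitmapIndex struct {
	rowIDs map[string]uint64          // value -> rowID (the mapping described above)
	bits   map[uint64]map[uint64]bool // rowID -> set of colIDs whose bit is 1
}

func newBitmapIndex() *bitmapIndex {
	return &bitmapIndex{
		rowIDs: make(map[string]uint64),
		bits:   make(map[uint64]map[uint64]bool),
	}
}

// rowID returns the row for a value, allocating the next one if the value is new.
func (b *bitmapIndex) rowID(value string) uint64 {
	if id, ok := b.rowIDs[value]; ok {
		return id
	}
	id := uint64(len(b.rowIDs))
	b.rowIDs[value] = id
	b.bits[id] = make(map[uint64]bool)
	return id
}

// set records that the database row stored at colID holds value.
func (b *bitmapIndex) set(value string, colID uint64) {
	b.bits[b.rowID(value)][colID] = true
}

// count returns how many columns have their bit set for value.
func (b *bitmapIndex) count(value string) int {
	id, ok := b.rowIDs[value]
	if !ok {
		return 0
	}
	return len(b.bits[id])
}

func main() {
	idx := newBitmapIndex()
	// colID is incremented sequentially, one per database row.
	for colID, v := range []string{"go", "rust", "go"} {
		idx.set(v, uint64(colID))
	}
	fmt.Println(idx.count("go"), idx.count("rust"), idx.count("java")) // prints: 2 1 0
}
```

Row IDs can be allocated per field without coordination, but `colID` identifies a database row and is therefore shared by every field of the index.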

Ideas

We could share a global counter protected by a mutex. That way, there would be no need to pass the colID around, and it would stay sequential. This has one downside, though: index creation is not idempotent, and the order in which the rows are stored (and thus returned) will be different every time you create the same index.
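A minimal sketch of the shared-counter idea, to make the downside concrete. `colCounter` and `indexPartitions` are hypothetical names, and string locations stand in for whatever the driver actually stores.

```go
package main

import (
	"fmt"
	"sync"
)

// colCounter is a hypothetical shared column ID allocator.
type colCounter struct {
	mu   sync.Mutex
	next uint64
}

// Next hands out the next sequential column ID.
func (c *colCounter) Next() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	id := c.next
	c.next++
	return id
}

// indexPartitions assigns a colID to every row location, using one
// goroutine per partition, and returns the location -> colID assignment.
func indexPartitions(partitions [][]string) map[string]uint64 {
	var (
		counter colCounter
		wg      sync.WaitGroup
		mu      sync.Mutex // guards colIDs
		colIDs  = make(map[string]uint64)
	)
	for _, rows := range partitions {
		wg.Add(1)
		go func(rows []string) {
			defer wg.Done()
			for _, location := range rows {
				id := counter.Next()
				mu.Lock()
				colIDs[location] = id
				mu.Unlock()
			}
		}(rows)
	}
	wg.Wait()
	return colIDs
}

func main() {
	colIDs := indexPartitions([][]string{
		{"p0/r0", "p0/r1"},
		{"p1/r0", "p1/r1"},
	})
	// The IDs handed out are always 0..3, but which location receives
	// which ID depends on goroutine scheduling and may change between runs.
	fmt.Println(colIDs)
}
```

The IDs stay dense and unique, yet two runs over the same data can produce different location-to-colID assignments.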

Also, we would need to stop saving batches and fields in the driver structure, as multiple operations might be taking place at the same time.

UPDATE: this can't be done, because otherwise indexes can't be combined.

erizocosmico · 2018-09-20T09:48:56Z

https://github.com/src-d/go-mysql-server/blob/master/sql/index/pilosalib/driver.go#L308
We need to change putLocation from indexName to fieldName.

That way, we don't need colID to be passed and we can store only what's inside the partition. Then, we can safely parallelise index creation across all partitions.

Updates

With this, we can safely just invalidate the fields and their mappings and recreate them if they changed.

kuba-- · 2018-10-16T09:30:24Z

@erizocosmico - regarding the latest proposal, if I understand correctly: if we save colID per field, how do we keep the same locations in sync across multiple partitions (fields), given that each field may have a different colID for the same location?

kuba-- · 2019-02-06T11:07:49Z

Because the main problem is related to column IDs (which are global per index), I suggest 2 alternative scenarios:

S1

Replace the iteration over columns with bucket.Sequence(). Currently, the mapping has one bucket per field (value - rowID) and one bucket per index (colID - location).
In other words, we'll first save the mapping (colID - location), which will use the bucket sequencer (for performance and thread safety), and then set the bit in the bitmap.
I think it's better than wrapping the colID loop in a mutex. Moreover, it keeps all the synchronization calls in the mapping.
So our existing loop:

for colID = offset; err == nil; colID++ {
...
  rowID, err := idx.mapping.getRowID(field.Name(), values[i])
  b.bitBatches[i].Add(rowID, colID)
  idx.mapping.putLocation(pilosaIndex.Name(), colID, location)
}

may look like:

for {
  ...
  rowID, err := idx.mapping.getRowID(field.Name(), values[i])

  // works like getRowID: returns the existing colID for the location, or creates a new one if it doesn't exist
  colID, err := idx.mapping.getColID(idx.Name(), location)

  b.bitBatches[i].Add(rowID, colID)
}
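The get-or-create behaviour proposed for getColID in S1 might look roughly like the sketch below. It is an assumption-heavy illustration: `colMapping` is a hypothetical in-memory stand-in for the per-index bucket, a mutex plus a counter plays the role of the bucket sequence, and the signature is simplified to take only the location.

```go
package main

import (
	"fmt"
	"sync"
)

// colMapping is a hypothetical in-memory stand-in for the per-index
// bucket that maps locations to column IDs.
type colMapping struct {
	mu     sync.Mutex
	seq    uint64            // plays the role of the bucket sequence
	colIDs map[string]uint64 // location -> colID
}

func newColMapping() *colMapping {
	return &colMapping{colIDs: make(map[string]uint64)}
}

// getColID returns the colID already assigned to location, or allocates
// the next one from the sequence if the location has not been seen yet.
func (m *colMapping) getColID(location string) uint64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	if id, ok := m.colIDs[location]; ok {
		return id
	}
	id := m.seq
	m.seq++
	m.colIDs[location] = id
	return id
}

func main() {
	m := newColMapping()
	var wg sync.WaitGroup
	// Two fields index the same three locations concurrently.
	for field := 0; field < 2; field++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for _, loc := range []string{"loc-a", "loc-b", "loc-c"} {
				m.getColID(loc)
			}
		}()
	}
	wg.Wait()
	fmt.Println(len(m.colIDs), m.seq) // prints: 3 3
}
```

Because the lookup and the allocation happen under one lock, every field that asks for the same location gets the same colID, regardless of which goroutine asked first.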

S2

The second scenario is totally different. If we want to populate N fields with data then:

We create N goroutines. Every goroutine receives the value and location (maybe over a channel) and saves them in the mapping and in pilosa or in a batch.
The main thread iterates over columns (or gets the next column ID from the mapping).
It assigns the location to the column ID and sends tuples (value, location) over the channel to every goroutine.

In other words, the main thread is responsible for mapping locations and generating column IDs, while the background goroutines save the mapping and set bits in their own fields.
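A rough sketch of the S2 fan-out, under several assumptions: `entry` and `buildFields` are hypothetical names, each field's state is a plain map rather than pilosa or a batch, and the worker is sent the colID directly instead of the location (the colID-to-location mapping kept by the main goroutine is omitted for brevity).

```go
package main

import (
	"fmt"
	"sync"
)

// entry is what the main goroutine sends to a field worker.
type entry struct {
	colID uint64
	value string
}

// buildFields indexes rows, where rows[i][f] is the value of field f in
// row i. It returns, for each field, a map from value to the colIDs set.
func buildFields(rows [][]string, numFields int) []map[string][]uint64 {
	fields := make([]map[string][]uint64, numFields)
	chans := make([]chan entry, numFields)
	var wg sync.WaitGroup

	// One worker per field; each worker touches only its own map.
	for f := 0; f < numFields; f++ {
		fields[f] = make(map[string][]uint64)
		chans[f] = make(chan entry)
		wg.Add(1)
		go func(f int) {
			defer wg.Done()
			for e := range chans[f] {
				fields[f][e.value] = append(fields[f][e.value], e.colID)
			}
		}(f)
	}

	// The main goroutine is the only place where column IDs are assigned,
	// so they stay sequential and identical across all fields.
	for colID, row := range rows {
		for f := 0; f < numFields; f++ {
			chans[f] <- entry{colID: uint64(colID), value: row[f]}
		}
	}
	for _, ch := range chans {
		close(ch)
	}
	wg.Wait()
	return fields
}

func main() {
	rows := [][]string{
		{"go", "mit"},
		{"rust", "mit"},
		{"go", "apache"},
	}
	fields := buildFields(rows, 2)
	fmt.Println(fields[0]["go"], fields[1]["mit"]) // prints: [0 2] [0 1]
}
```

With this split, parallelism is across fields rather than across partitions, and column ID assignment stays deterministic because it never leaves the main goroutine.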

erizocosmico added enhancement New feature or request performance Performance improvements labels Sep 5, 2018

erizocosmico self-assigned this Sep 5, 2018

erizocosmico removed their assignment Oct 19, 2018

kuba-- self-assigned this Feb 7, 2019

ajnavarro added this to the OKR-2019-Q1-P2 milestone Feb 8, 2019

erizocosmico assigned erizocosmico and unassigned kuba-- and erizocosmico Mar 20, 2019

erizocosmico mentioned this issue Mar 22, 2019

sql/index/pilosa: parallelize index creation #644

Merged

ajnavarro closed this as completed in #644 Apr 10, 2019
