limits tablets and offers bulk import as option for ingest #287
Merged
Conversation
Two new continuous ingest features are introduced in this change. First, options were added to limit the number of tablets written. Second, an option was added to use bulk ingest instead of a batch writer. These features support running a test like the following.

* create a continuous ingest table with 1000 tablets
* start 100 continuous ingest clients
* have each client continually bulk import data to 10 random tablets

This test situation will create a lot of bulk import and subsequent compaction activity for Accumulo to handle.

These changes add bulk import to the `cingest ingest` command. There is an existing `cingest bulk` command that runs a map reduce job to create bulk files. These changes do not remove the need for the existing map reduce job; they fill a different purpose. The map reduce job can generate a really large amount of data to bulk import, while these changes allow generating lots of bulk imports w/ small amounts of data. These changes could never generate the amount of data for a single bulk import that the map reduce job could. The following is an example of a test scenario that could use both.

* create a continuous ingest table with 1000 tablets
* use the map reduce bulk job to create an initial 10 billion entries in the table
* start 100 continuous ingest clients
* have each client continually bulk import data to 10 random tablets
* stop the clients after 12 hours and verify the data
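A minimal sketch of the bulk-import-into-random-tablets idea, using only the public Accumulo client API rather than the actual accumulo-testing code in this PR; the instance name, credentials, work directory, table name, and key layout are placeholder assumptions:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.rfile.RFile;
import org.apache.accumulo.core.client.rfile.RFileWriter;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class BulkToRandomTablets {
  public static void main(String[] args) throws Exception {
    // Placeholder table name and work directory; adjust for your cluster.
    String table = "ci";
    String workDir = "hdfs://namenode:8020/tmp/ci-bulk-0001";

    try (AccumuloClient client = Accumulo.newClient()
        .to("myInstance", "zookeeper:2181").as("root", "secret").build()) {

      // The table's split points define its tablets. Pick a few at random
      // to limit how many tablets a single bulk import touches.
      List<Text> splits = new ArrayList<>(client.tableOperations().listSplits(table));
      Collections.shuffle(splits);
      List<Text> chosen = splits.subList(0, Math.min(3, splits.size()));

      // Write one RFile per chosen tablet. Keys in an RFile must be appended
      // in sorted order; here each file gets a single key whose row equals
      // the tablet's split point, so the file lands in that tablet.
      int fileNum = 0;
      for (Text split : chosen) {
        try (RFileWriter writer =
            RFile.newWriter().to(workDir + "/f" + fileNum++ + ".rf").build()) {
          writer.append(new Key(split, new Text("cf"), new Text("cq")),
              new Value("value".getBytes()));
        }
      }

      // Bulk import everything in the work directory in one operation.
      client.tableOperations().importDirectory(workDir).to(table).load();
    }
  }
}
```

The real ingest client would loop over this pattern with a fresh work directory per pass, since a bulk import consumes the files it is given.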
The following is example output from running these changes, continually bulk importing into 3 random tablets on a table with 20 tablets. Around every 6 seconds it bulk imports 3 files to 3 tablets w/ one million total key values.
I have run the bulk ingest and the live ingest into the same table and then successfully ran the verify map reduce job.
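For completeness, the first step in the scenarios above is a pre-split table (20 tablets in the example run, 1000 in the larger scenario). A minimal sketch of pre-splitting such a table, again through the public Accumulo API with placeholder connection settings and assuming 16-character hex-encoded rows:

```java
import java.util.TreeSet;

import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.hadoop.io.Text;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    String table = "ci";   // placeholder table name
    int numTablets = 20;   // 20 tablets -> 19 split points

    try (AccumuloClient client = Accumulo.newClient()
        .to("myInstance", "zookeeper:2181").as("root", "secret").build()) {

      if (!client.tableOperations().exists(table)) {
        client.tableOperations().create(table);
      }

      // Spread split points evenly across the row space. This assumes rows
      // are 16-character lower-case hex strings; adjust to your key design.
      TreeSet<Text> splits = new TreeSet<>();
      for (int i = 1; i < numTablets; i++) {
        long split = (Long.MAX_VALUE / numTablets) * i;
        splits.add(new Text(String.format("%016x", split)));
      }

      client.tableOperations().addSplits(table, splits);
    }
  }
}
```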
ddanielr approved these changes Nov 18, 2024