
Data retention CLI #684

Closed
DaimonPl opened this issue May 28, 2020 · 34 comments

Comments

@DaimonPl
Contributor

Is there any easy way to configure automatic retention for Spline data? For example, automatically delete entries older than 3 months?

@wajda
Contributor

wajda commented May 28, 2020

No, not yet. It could be added in the future though; it has been briefly mentioned a couple of times in local discussions.

@DaimonPl
Contributor Author

OK, I think it might be useful in big setups. We have hundreds of jobs running every day and dozens of jobs running every hour, so I'm pretty sure we would need to scale ArangoDB quite often without retention.

Any idea if retention can somehow be added in ArangoDB itself?

@wajda
Contributor

wajda commented May 28, 2020

Well... not in ArangoDB natively (AFAIK), but you could use a custom script. However, this approach is fragile and dirty, and I wouldn't recommend it, especially in a busy environment.
To do it correctly and avoid breaking database consistency, you need to respect the graph structure and cascade the deletion properly.

On the other hand, if you can find a logical time gap in your data pipelines that is big enough (e.g. one hour), you can try to cut the DB in the middle, or even closer to the end, of that time gap. Simply delete all documents with _created < my_timestamp from all collections except dataSource, all in one transaction.
This is still a hacky approach, but with a large enough idle time frame it should not create any practical problems.
Also keep in mind that the described approach might not work in Spline 0.6+.
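
For illustration only, the per-collection deletion could look roughly like the AQL below. This is a minimal sketch assuming a bind parameter @my_timestamp; the same pattern would have to be repeated for every collection except dataSource, and the whole batch run in a single transaction as described above.

    // Remove everything created before the cut-off from one collection.
    // Repeat for each collection except dataSource (e.g. progress, progressOf, executionPlan, ...).
    FOR doc IN progress
        FILTER doc._created < @my_timestamp
        REMOVE doc IN progress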

@DaimonPl
Contributor Author

Thanks for the detailed answer :)

No way to find any time gap on our Hadoop; it's processing something 24/7/365, all the time :)

@wajda
Contributor

wajda commented May 28, 2020

Awesome. I can see your case will be a really great test case for Spline :))

@pratapmmmec

Thanks for responding. Do we have any tentative timeline for its release?

@wajda
Contributor

wajda commented Jul 3, 2020

We'll try to implement something simple in 0.5.4, which should be out within the month. We are addressing performance bottlenecks in that release, so this feature logically fits there.

@wajda wajda added this to the 0.5.4 milestone Jul 3, 2020
@wajda wajda self-assigned this Aug 10, 2020
@wajda
Contributor

wajda commented Aug 10, 2020

Solution:

  1. Add a command to the Admin CLI to remove lineage data older than a given date.
  2. Add an option to the Gateway to do the same thing periodically, based on a provided cron expression.

@wajda
Contributor

wajda commented Aug 17, 2020

On second thought, I realized that having a CLI command for removing old data makes it unnecessary to make this a Spline Server responsibility. System cron jobs can be used instead to simply run java -jar admin-cli.jar db-prune ...
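
For illustration, a system crontab entry along these lines could drive the cleanup. The schedule and jar path are assumptions, and the db-prune arguments are left elided here just as above.

    # Hypothetical crontab entry: run the Admin CLI prune nightly at 03:00.
    # The schedule and jar path are assumptions; db-prune arguments are omitted.
    0 3 * * * java -jar /opt/spline/admin-cli.jar db-prune ...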

wajda added a commit that referenced this issue Aug 18, 2020
wajda added a commit that referenced this issue Aug 20, 2020
wajda added a commit that referenced this issue Aug 20, 2020
wajda added a commit that referenced this issue Aug 20, 2020
@wajda wajda modified the milestones: 0.5.4, 0.5.5 Aug 20, 2020
wajda added a commit that referenced this issue Sep 3, 2020
wajda added a commit that referenced this issue Sep 3, 2020
wajda added a commit that referenced this issue Sep 8, 2020
…rvices (#763)

* spline #761 ArangoDB: WITH keyword is required in a cluster environment

* spline #684 Add Foxx service

* spline #684 Call Foxx service

* spline #761 ArangoDB: Add missing WITH keywords in a Foxx service

* spline #761 ArangoDB: Remove Spline AQL UDFs

* spline #761 Upgrade ArangoDB driver + method to register foxx service programmatically

* spline #761 Register Foxx services / migration

* spline #761 Undo changes to 0.5.0-0.5.4 migration script as we're in the version 0.5.5 now, so cannot change an older migration

* spline #761 Remove redundant extension class
wajda added a commit that referenced this issue Sep 14, 2020
wajda added a commit that referenced this issue Sep 14, 2020
wajda added a commit that referenced this issue Sep 16, 2020
@wajda
Contributor

wajda commented Aug 14, 2022

Just out of curiosity, what's the size of your collections?

@hugeshi

hugeshi commented Aug 22, 2022

Hi @wajda,
Thanks for your contribution to this issue; I have kept an eye on this feature for a long time. It's quite tough to clean up data on a big dataset, and I think that's the main reason why several people have requested the data retention feature.
What's the rough ETA for this feature?

@wajda
Contributor

wajda commented Aug 22, 2022

Unfortunately I cannot give any ETA. On one hand, it's not a priority feature for our company at the moment, but I can say that a lot of things are changing right now, and there is a chance the project will receive some boost very soon. On the other hand, as I tried to explain above, this feature depends on another feature that introduces a set of fundamental changes to the way transactions are managed, and that one has to be finished first. I hope to give some update on plans, say, next month.

@Aditya-Sood

hi @wajda
Apologies for the delay in response; we were caught up in some other tasks.

Our total Spline database size was around 100 GB on the system that we were trying to clean up.

We dealt with the deletion of dataSource objects in two steps:

  1. Trim the affects and depends collections down to the target retention period that we finally want, before initiating the search for stale data sources (this reduced the overall dataset that had to be traversed to identify the stale sources; sketched below).
  2. Rewrite the query used to identify stale dataSource objects - there was an optimisation opportunity, since using an equality operation on the non-indexed _id attribute was performing better than a string match on the indexed _key attribute.

Combining this with the previous script changes has allowed us to trim a day's worth of stale objects in 1-1.5 hours.
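
A rough, hypothetical sketch of step 1 (the actual queries live in the gist linked in the next comment); @purgeLimitTimestamp stands for the retention cut-off:

    // Hypothetical step-1 trim: drop old affects/depends edges up front,
    // so the later search for stale data sources has less data to traverse.
    FOR a IN affects
        FILTER a._created < @purgeLimitTimestamp
        REMOVE a IN affects

    FOR d IN depends
        FILTER d._created < @purgeLimitTimestamp
        REMOVE d IN depends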

@Aditya-Sood

Following is the final cleanup script that we are using: https://gist.github.com/Aditya-Sood/ecc07c9f296dbdf03d4946c5d1b4efce

Could you please review the script and let us know if there are any changes that you would like?

After that, we would like to discuss integrating the cleanup functionality into the project, possibly as a configurable service added to the Admin utility.

@wajda
Contributor

wajda commented Sep 26, 2022

Thank you @Aditya-Sood. I will review it this week and will get back to you.

@dk1844 dk1844 self-assigned this Oct 5, 2022
@dk1844 dk1844 modified the milestones: 1.1.0, 1.0.0 Oct 5, 2022
@wajda
Contributor

wajda commented Oct 13, 2022

@Aditya-Sood, we've reviewed your script and found it generally correct. But I've got a few minor questions/suggestions about it:

  1. Stage 1:
    • FILTER progEvent.timestamp vs. FILTER progEvent._created in the FOR progEvent IN progress loop. The difference is that timestamp denotes the event time, while _created denotes the processing time. With a large enough time window there is little to no practical difference, but logically it would be more correct to use the former.
    • Using _created on progressOf. Again, while it is highly unlikely, there is still a risk of grabbing an edge that belongs to a living node just because its processing time happened to be 1 ms earlier than that of the adjacent progress node. This could happen because there is no predetermined order in which the two entities are created, and the _created time might not be exactly the same for all nodes inserted by a single logical transaction. (This is the same reason why we can't just wipe all the docs from all collections in one go by the _created predicate alone.) I would suggest removing progressOf by looking at its relation to the progress being removed: use either the traversal 1 OUTBOUND prog progressOf or a simple document query FOR progOf IN progressOf FILTER progOf._from == prog._id, which could potentially be faster for this case.
  2. At stage 2, where you collect orphan execution plan IDs:
    • Why do you need a second loop? Wouldn't it be better to collect them at stage 1, when removing the progress and progressOf nodes?
    • When checking for node orphanage, why do you use NOT IN and require a two-step process? Can't you simply count the number of adjacent edges and compare it with 0?
  3. What do you mean by "using an equality operation on the non-indexed _id attribute was performing better than a string match on the indexed _key attribute"? In ArangoDB, the attributes _id, _key, _from and _to are always indexed.

dk1844 added a commit that referenced this issue Oct 18, 2022
dk1844 added a commit that referenced this issue Oct 18, 2022
…b.com/Aditya-Sood/ecc07c9f296dbdf03d4946c5d1b4efce - naively tested with test data (multiple lineages at different times - purge with time between - correct outcome - older purged, newer kept)
@wajda wajda removed their assignment Oct 19, 2022
dk1844 added a commit that referenced this issue Oct 21, 2022
dk1844 added a commit that referenced this issue Oct 21, 2022
@Aditya-Sood

hi @wajda
Thanks for the review! Please excuse the delay in response; I was busy with work and then Diwali over the past few days.

I've understood the changes required for the first three sub-points.
Regarding sub-point 4 ("When checking for node orphanage, ..."), if I'm unable to use your suggestion from sub-point 3 (i.e. collect orphan exec plans in Stage 1 itself), then I'll add this as well.

Regarding point (3), the dataSource collection does not have an index on the _id attribute in our DB:
[screenshot: list of indexes on the dataSource collection]
Also, I think the major difference arose because a sub-string search on an indexed attribute will not benefit from the index, since such an operation has to scan all values.
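
To illustrate the difference (a hypothetical contrast, not necessarily the exact queries from our script; @staleDataSourceId and @keyFragment are made-up bind parameters):

    // Equality on _id - can be answered from an index lookup.
    FOR ds IN dataSource
        FILTER ds._id == @staleDataSourceId
        RETURN ds._key

    // Substring match on _key - the index cannot help here,
    // so every _key value has to be inspected.
    FOR ds IN dataSource
        FILTER CONTAINS(ds._key, @keyFragment)
        RETURN ds._key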

Lastly, shall I continue working on this feature, or has it been re-assigned?
I was under the impression this requirement was on the back-burner and not urgent, but if that has changed and you would prefer to have @dk1844 continue, do let us know.

thanks

@wajda
Contributor

wajda commented Oct 26, 2022

the dataSource does not have an index on the _id attribute in our DB:

There is no need for a separate index on _id. In ArangoDB, _id (also called a document handle) is just a combination of the collection name and the primary key (_key). If you take a look at the explain output, you'll see that the same primary index is used for a selection by either the _id or the _key attribute.
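
For example (a minimal sketch with a made-up key), both of the following selections should show the same primary index being used in their explain output:

    // Lookup by _key - uses the primary index.
    FOR d IN dataSource
        FILTER d._key == "12345"
        RETURN d

    // Equivalent lookup by _id (collection name + "/" + _key) - same primary index.
    FOR d IN dataSource
        FILTER d._id == "dataSource/12345"
        RETURN d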

@wajda
Contributor

wajda commented Oct 26, 2022

lastly, shall I continue working on this feature or has it been re-assigned?
I was under the impression this requirement was on the back-burner and not urgent, but if that has changed and you would prefer having @dk1844 continue do let us know

As we didn't hear from you for some time, we decided to speed things up a bit. We took your gist and (with some minor changes) basically incorporated it into our older PR that had been started and abandoned quite a long time ago. @Aditya-Sood, it would be awesome if you could help us test and optimize @dk1844's PR by trying it on your large database.

@wajda wajda changed the title Spline - automatic data retention? Spline - automatic data retention Oct 27, 2022
@wajda wajda changed the title Spline - automatic data retention Spline - data retention Oct 27, 2022
@wajda wajda changed the title Spline - data retention Data retention CLI Oct 27, 2022
@dk1844
Contributor

dk1844 commented Nov 11, 2022

In reference to #684 (comment), I did some measurement runs for stage 1 and stage 2 with different AQL to get a sense of how they perform.

Common info

All measuring was done on the same 12200 lineage records generated by our data-gen using the following loop:

    for ($i = 0; $i -lt 1220; $i++){ echo "loop $i" ; docker run -e SPLINE_URL=http://host.docker.internal:8080/spline -e OPERATIONS=1-10@1 spline_datagen }

Arango profiling was always used, done on "my machine", so the results should only be interpreted as relative to each other.
Also, timestamp filtering was commented out each time.

Stage 1

Option A

This option comes from the original PR #762 on this very topic.

FOR p IN progress
    //FILTER p.timestamp < ${timestamp}
    REMOVE p IN progress
    FOR po IN progressOf
        FILTER po._from == p._id
        REMOVE po IN progressOf
        RETURN DISTINCT po._to

Option B

This option comes from @Aditya-Sood's script. It has the theoretical downside of possibly leaving orphaned progress and progressOf entries (which would get cleaned up on the next purge run).

FOR progEvent IN progress
    // FILTER progEvent._created < ${purgeLimitTimestamp}
    REMOVE progEvent IN progress

FOR progOf IN progressOf
    // FILTER progOf._created < ${purgeLimitTimestamp}
    REMOVE progOf IN progressOf

Option C

This option comes from the suggestion in the referenced comment.

FOR p IN progress
    //FILTER p.timestamp < ${timestamp}
    FOR prog, progOf IN 1 OUTBOUND p progressOf
        REMOVE p IN progress
        REMOVE progOf IN progressOf

Results of profiling

All results have max index selectivity (100%) on one or multiple indices.

Option A - Peak Mem 1376256B; median time 2.20633s (2.17843s, 2.20633s, 2.42307s)
Option B - Peak Mem 98304B; median time 1.51415s (1.53055s, 1.51415s, 1.53688s)
Option C - Peak Mem 1540096B - median time 2.48856s (1.82244s, 2.26623s, 2.47635s, 2.48856s, 2.60309s, 2.29054s, 3.11524s)

Possible conclusion

@Aditya-Sood's approach (Option B) is the fastest and the most memory-conservative. If we don't want to use it because of the possible inconsistency issue, the other two options are relatively similar, with Option A seeming a bit better.

Stage 2

Stage 2 was attempted on the same data after one of the Stage 1 options had been run (regardless of which one).

Option A

This is @Aditya-Sood's approach with the two commands run together (possibly faster).

    LET execPlanIDsWithProgressEventsArray = (
        FOR execPlan IN executionPlan
            FOR progOf IN progressOf
                FILTER progOf._to == execPlan._id // && execPlan._created < ${timestamp}
                COLLECT ids = execPlan._id
                RETURN ids
    )

    FOR execPlan IN executionPlan
        FILTER execPlan._id NOT IN execPlanIDsWithProgressEventsArray //&& execPlan._created < ${timestamp}
        RETURN execPlan._id

Option B

This option comes from the suggestion in the referenced comment.

    FOR execPlan IN executionPlan
        LET connectionCount = LENGTH(FOR progOf IN progressOf
            FILTER progOf._to == execPlan._id // && execPlan._created < ${timestamp}
            RETURN 1
        )
        FILTER connectionCount == 0
        RETURN execPlan._id

Results of profiling

All results have max index selectivity (100%) on one or multiple indices.

Option A - Peak Mem 327680B; median first-time run 0.70598s, median of subsequent runs (cached?): 0.03214s
Option B - Peak Mem 950272B; median first-time run 0.75554s, median of subsequent runs (cached?): 0.04231s

Possible conclusion

Option A seems a bit better (less than half the memory needed, and the run time seems slightly better).

dk1844 added a commit that referenced this issue Nov 16, 2022
dk1844 added a commit that referenced this issue Nov 21, 2022
dk1844 added a commit that referenced this issue Nov 23, 2022
* #684 data retention
- original code from PR #762 - brought up to date with current develop
- updated to reflect logic from https://gist.github.com/Aditya-Sood/ecc07c9f296dbdf03d4946c5d1b4efce
- Sonar/Codacy code quality updates
- db prune - stage 1 option A, stage 2 option A applied
@wajda wajda closed this as completed Nov 28, 2022