
Data retention CLI #684

Closed
DaimonPl opened this issue May 28, 2020 · 34 comments

Comments

@DaimonPl
Contributor

Is there any easy way to configure automatic retention for Spline data? For example, automatically delete entries older than 3 months?

@wajda
Contributor

wajda commented May 28, 2020

No, not yet. It could be added in the future though; it has been briefly mentioned a couple of times in local discussions.

@DaimonPl
Contributor Author

OK, I think it might be useful in big setups. We have hundreds of jobs running every day and dozens of jobs running every hour, so I'm pretty sure we would need to scale ArangoDB quite often without retention.

Any idea if retention can somehow be added in ArangoDB itself?

@wajda
Contributor

wajda commented May 28, 2020

Well... not in ArangoDB natively (AFAIK), but you could use a custom script. However, this approach is fragile and dirty, and I wouldn't recommend it, especially in a busy environment.
To do it correctly and avoid breaking database consistency, you need to respect the graph structure and cascade the deletion properly.

On the other hand, if you can find a logical time gap in your data pipelines that is big enough (e.g. one hour), you can try to cut the DB in the middle, or even closer to the end, of that time gap. Simply delete all documents with _created < my_timestamp from all collections except dataSource, all in one transaction.
This is still a hacky approach, but with a large enough idle time frame it should not create any practical problems.
Also keep in mind that the described approach might not work in Spline 0.6+.
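
For illustration only, the per-collection deletion could look roughly like the AQL below. This is a minimal sketch assuming a bind parameter @my_timestamp; the same pattern would have to be repeated for every collection except dataSource, and the whole batch run in a single transaction as described above.

    // Remove everything created before the cut-off from one collection.
    // Repeat for each collection except dataSource (e.g. progress, progressOf, executionPlan, ...).
    FOR doc IN progress
        FILTER doc._created < @my_timestamp
        REMOVE doc IN progress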

@DaimonPl
Contributor Author

Thanks for the detailed answer :)

No way to find any time gap on our Hadoop; it's processing something 24/7/365, all the time :)

@wajda
Contributor

wajda commented May 28, 2020

Awesome. I can see your case will be a really great test case for Spline :))

@pratapmmmec

Thanks for responding. Do we have any tentative timeline for its release?

@wajda
Contributor

wajda commented Jul 3, 2020

We'll try to implement something simple in 0.5.4, which should be out within the month. We are addressing performance bottlenecks in that release, so this feature logically fits there.

@wajda wajda added this to the 0.5.4 milestone Jul 3, 2020
@wajda wajda self-assigned this Aug 10, 2020
@wajda
Contributor

wajda commented Aug 10, 2020

Solution:

  1. Add a command to the Admin CLI to remove lineage data older than a given date.
  2. Add an option to the Gateway to do the same thing periodically, based on a provided cron expression.

@wajda
Contributor

wajda commented Aug 17, 2020

On second thought, I realized that having a CLI command for removing old data makes it unnecessary to make this a Spline Server responsibility. System cron jobs can be used instead to simply run java -jar admin-cli.jar db-prune ...
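
For illustration, a system crontab entry along these lines could drive the cleanup. The schedule and jar path are assumptions, and the db-prune arguments are left elided here just as above.

    # Hypothetical crontab entry: run the Admin CLI prune nightly at 03:00.
    # The schedule and jar path are assumptions; db-prune arguments are omitted.
    0 3 * * * java -jar /opt/spline/admin-cli.jar db-prune ...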

wajda added a commit that referenced this issue Aug 18, 2020
wajda added a commit that referenced this issue Aug 20, 2020
wajda added a commit that referenced this issue Aug 20, 2020
wajda added a commit that referenced this issue Aug 20, 2020
@wajda wajda modified the milestones: 0.5.4, 0.5.5 Aug 20, 2020
wajda added a commit that referenced this issue Sep 3, 2020
wajda added a commit that referenced this issue Sep 3, 2020
wajda added a commit that referenced this issue Sep 8, 2020
…rvices (#763)

* spline #761 ArangoDB: WITH keyword is required in a cluster environment

* spline #684 Add Foxx service

* spline #684 Call Foxx service

* spline #761 ArangoDB: Add missing WITH keywords in a Foxx service

* spline #761 ArangoDB: Remove Spline AQL UDFs

* spline #761 Upgrade ArangoDB driver + method to register foxx service programmatically

* spline #761 Register Foxx services / migration

* spline #761 Undo changes to 0.5.0-0.5.4 migration script as we're in the version 0.5.5 now, so cannot change an older migration

* spline #761 Remove redundant extension class
wajda added a commit that referenced this issue Sep 14, 2020
wajda added a commit that referenced this issue Sep 14, 2020
wajda added a commit that referenced this issue Sep 16, 2020
@wajda
Contributor

wajda commented Aug 14, 2022

Just out of curiosity, what's the size of your collections?

@hugeshi

hugeshi commented Aug 22, 2022

Hi @wajda,
Thanks for your contribution to this issue; I have kept an eye on this feature for a long time. It's quite tough to clean up data on a big dataset, and I think that's the main reason why several people have requested the data retention feature.
What's the rough ETA for this feature?

@wajda
Contributor

wajda commented Aug 22, 2022

Unfortunately I cannot give any ETA. On one hand, it's not a priority feature for our company at the moment, but I can say that a lot of things are changing right now, and there is a chance the project will receive some boost very soon. On the other hand, as I tried to explain above, this feature depends on another feature that introduces a set of fundamental changes to the way transactions are managed, and that one has to be finished first. I hope to give some update on plans, say, next month.

@Aditya-Sood

hi @wajda
Apologies for the delay in response; we were caught up in some other tasks.

Our total Spline database size was around 100 GB on the system that we were trying to clean up.

We dealt with the deletion of dataSource objects in two steps:

  1. Trim the affects and depends collections down to the target retention period that we finally want, before initiating the search for stale data sources (this reduced the overall dataset that had to be traversed to identify the stale sources; sketched below).
  2. Rewrite the query used to identify stale dataSource objects - there was an optimisation opportunity, since using an equality operation on the non-indexed _id attribute was performing better than a string match on the indexed _key attribute.

Combining this with the previous script changes has allowed us to trim a day's worth of stale objects in 1-1.5 hours.
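
A rough, hypothetical sketch of step 1 (the actual queries live in the gist linked in the next comment); @purgeLimitTimestamp stands for the retention cut-off:

    // Hypothetical step-1 trim: drop old affects/depends edges up front,
    // so the later search for stale data sources has less data to traverse.
    FOR a IN affects
        FILTER a._created < @purgeLimitTimestamp
        REMOVE a IN affects

    FOR d IN depends
        FILTER d._created < @purgeLimitTimestamp
        REMOVE d IN depends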

@Aditya-Sood

Following is the final cleanup script that we are using: https://gist.github.com/Aditya-Sood/ecc07c9f296dbdf03d4946c5d1b4efce

Could you please review the script and let us know if there are any changes that you would like?

After that, we would like to discuss integrating the cleanup functionality into the project, possibly as a configurable service added to the Admin utility.

@wajda
Contributor

wajda commented Sep 26, 2022

Thank you @Aditya-Sood. I will review it this week and will get back to you.

@dk1844 dk1844 self-assigned this Oct 5, 2022
@dk1844 dk1844 modified the milestones: 1.1.0, 1.0.0 Oct 5, 2022
@wajda
Contributor

wajda commented Oct 13, 2022

@Aditya-Sood, we've reviewed your script and found it generally correct. But I've got a few minor questions/suggestions about it:

  1. Stage 1:
    • FILTER progEvent.timestamp vs. FILTER progEvent._created in the FOR progEvent IN progress loop. The difference is that timestamp denotes the event time, while _created denotes the processing time. With a large enough time window there is little to no practical difference, but logically it would be more correct to use the former.
    • Using _created on progressOf. Again, while it is highly unlikely, there is still a risk of grabbing an edge that belongs to a living node just because its processing time happened to be 1 ms earlier than that of the adjacent progress node. This could happen because there is no predetermined order in which the two entities are created, and the _created time might not be exactly the same for all nodes inserted by a single logical transaction. (This is the same reason why we can't just wipe all the docs from all collections in one go by the _created predicate alone.) I would suggest removing progressOf by looking at its relation to the progress being removed: use either the traversal 1 OUTBOUND prog progressOf or a simple document query FOR progOf IN progressOf FILTER progOf._from == prog._id, which could potentially be faster for this case.
  2. At stage 2, where you collect orphan execution plan IDs:
    • Why do you need a second loop? Wouldn't it be better to collect them at stage 1, when removing the progress and progressOf nodes?
    • When checking for node orphanage, why do you use NOT IN and require a two-step process? Can't you simply count the number of adjacent edges and compare it with 0?
  3. What do you mean by "using an equality operation on the non-indexed _id attribute was performing better than a string match on the indexed _key attribute"? In ArangoDB, the attributes _id, _key, _from and _to are always indexed.

dk1844 added a commit that referenced this issue Oct 18, 2022
dk1844 added a commit that referenced this issue Oct 18, 2022
…b.com/Aditya-Sood/ecc07c9f296dbdf03d4946c5d1b4efce - naively tested with test data (multiple lineages at different times - purge with time between - correct outcome - older purged, newer kept)
@wajda wajda removed their assignment Oct 19, 2022
dk1844 added a commit that referenced this issue Oct 21, 2022
dk1844 added a commit that referenced this issue Oct 21, 2022
@Aditya-Sood

hi @wajda
Thanks for the review! Please excuse the delay in response; I was busy with work and then Diwali over the past few days.

I've understood the changes required for the first three sub-points.
Regarding sub-point 4 ("When checking for node orphanage, ..."), if I'm unable to use your suggestion from sub-point 3 (i.e. collect orphan exec plans in Stage 1 itself), then I'll add this as well.

Regarding point (3), the dataSource collection does not have an index on the _id attribute in our DB:
[screenshot: list of indexes on the dataSource collection]
Also, I think the major difference arose because a sub-string search on an indexed attribute will not benefit from the index, since such an operation has to scan all values.
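
To illustrate the difference (a hypothetical contrast, not necessarily the exact queries from our script; @staleDataSourceId and @keyFragment are made-up bind parameters):

    // Equality on _id - can be answered from an index lookup.
    FOR ds IN dataSource
        FILTER ds._id == @staleDataSourceId
        RETURN ds._key

    // Substring match on _key - the index cannot help here,
    // so every _key value has to be inspected.
    FOR ds IN dataSource
        FILTER CONTAINS(ds._key, @keyFragment)
        RETURN ds._key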

Lastly, shall I continue working on this feature, or has it been re-assigned?
I was under the impression this requirement was on the back-burner and not urgent, but if that has changed and you would prefer to have @dk1844 continue, do let us know.

thanks

@wajda
Contributor

wajda commented Oct 26, 2022

the dataSource does not have an index on the _id attribute in our DB:

There is no need for a separate index on _id. In ArangoDB, _id (also called a document handle) is just a combination of the collection name and the primary key (_key). If you take a look at the explain output, you'll see that the same primary index is used for a selection by either the _id or the _key attribute.
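
For example (a minimal sketch with a made-up key), both of the following selections should show the same primary index being used in their explain output:

    // Lookup by _key - uses the primary index.
    FOR d IN dataSource
        FILTER d._key == "12345"
        RETURN d

    // Equivalent lookup by _id (collection name + "/" + _key) - same primary index.
    FOR d IN dataSource
        FILTER d._id == "dataSource/12345"
        RETURN d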

@wajda
Contributor

wajda commented Oct 26, 2022

lastly, shall I continue working on this feature or has it been re-assigned?
I was under the impression this requirement was on the back-burner and not urgent, but if that has changed and you would prefer having @dk1844 continue do let us know

As we didn't hear from you for some time, we decided to speed things up a bit. We took your gist and (with some minor changes) basically incorporated it into our older PR that had been started and abandoned quite a long time ago. @Aditya-Sood, it would be awesome if you could help us test and optimize @dk1844's PR by trying it on your large database.

@wajda wajda changed the title Spline - automatic data retention? Spline - automatic data retention Oct 27, 2022
@wajda wajda changed the title Spline - automatic data retention Spline - data retention Oct 27, 2022
@wajda wajda changed the title Spline - data retention Data retention CLI Oct 27, 2022
@dk1844
Contributor

dk1844 commented Nov 11, 2022

In reference to #684 (comment), I did some measurement runs for stage 1 and stage 2 with different AQL to get a sense of how they perform.

Common info

All measuring was done on the same 12200 lineage records generated by our data-gen using the following loop:

    for ($i = 0; $i -lt 1220; $i++){ echo "loop $i" ; docker run -e SPLINE_URL=http://host.docker.internal:8080/spline -e OPERATIONS=1-10@1 spline_datagen }

Arango profiling was always used, done on "my machine", so the results should only be interpreted as relative to each other.
Also, timestamp filtering was commented out each time.

Stage 1

Option A

This option comes from the original PR #762 on this very topic.

FOR p IN progress
    //FILTER p.timestamp < ${timestamp}
    REMOVE p IN progress
    FOR po IN progressOf
        FILTER po._from == p._id
        REMOVE po IN progressOf
        RETURN DISTINCT po._to

Option B

This option comes from @Aditya-Sood's script. It has the theoretical downside of possibly leaving orphaned progress and progressOf entries (which would get cleaned up on the next purge run).

FOR progEvent IN progress
    // FILTER progEvent._created < ${purgeLimitTimestamp}
    REMOVE progEvent IN progress

FOR progOf IN progressOf
    // FILTER progOf._created < ${purgeLimitTimestamp}
    REMOVE progOf IN progressOf

Option C

This option comes from the suggestion in the referenced comment.

FOR p IN progress
    //FILTER p.timestamp < ${timestamp}
    FOR prog, progOf IN 1 OUTBOUND p progressOf
        REMOVE p IN progress
        REMOVE progOf IN progressOf

Results of profiling

All results have max index selectivity (100%) on one or multiple indices.

Option A - Peak Mem 1376256B; median time 2.20633s (2.17843s, 2.20633s, 2.42307s)
Option B - Peak Mem 98304B; median time 1.51415s (1.53055s, 1.51415s, 1.53688s)
Option C - Peak Mem 1540096B - median time 2.48856s (1.82244s, 2.26623s, 2.47635s, 2.48856s, 2.60309s, 2.29054s, 3.11524s)

Possible conclusion

@Aditya-Sood's approach (Option B) is the fastest and the most memory-conservative. If we don't want to use it because of the possible inconsistency issue, the other two options are relatively similar, with Option A seeming a bit better.

Stage 2

Stage 2 was attempted on the same data after one of the Stage 1 options had been run (regardless of which one).

Option A

This is @Aditya-Sood's approach with the two commands run together (possibly faster).

    LET execPlanIDsWithProgressEventsArray = (
        FOR execPlan IN executionPlan
            FOR progOf IN progressOf
                FILTER progOf._to == execPlan._id // && execPlan._created < ${timestamp}
                COLLECT ids = execPlan._id
                RETURN ids
    )

    FOR execPlan IN executionPlan
        FILTER execPlan._id NOT IN execPlanIDsWithProgressEventsArray //&& execPlan._created < ${timestamp}
        RETURN execPlan._id

Option B

This option comes from the suggestion in the referenced comment.

    FOR execPlan IN executionPlan
        LET connectionCount = LENGTH(FOR progOf IN progressOf
            FILTER progOf._to == execPlan._id // && execPlan._created < ${timestamp}
            RETURN 1
        )
        FILTER connectionCount == 0
        RETURN execPlan._id

Results of profiling

All results have max index selectivity (100%) on one or multiple indices.

Option A - Peak Mem 327680B; median first-time run 0.70598s, median of subsequent runs (cached?): 0.03214s
Option B - Peak Mem 950272B; median first-time run 0.75554s, median of subsequent runs (cached?): 0.04231s

Possible conclusion

Option A seems a bit better (less than half the memory needed, and the run time seems slightly better).

dk1844 added a commit that referenced this issue Nov 16, 2022
dk1844 added a commit that referenced this issue Nov 21, 2022
dk1844 added a commit that referenced this issue Nov 23, 2022
* #684 data retention
- original code from PR #762 - brought up to date with current develop
- updated to reflect logic from https://gist.github.com/Aditya-Sood/ecc07c9f296dbdf03d4946c5d1b4efce
- Sonar/Codacy code quality updates
- db prune - stage 1 option A, stage 2 option A applied
@wajda wajda closed this as completed Nov 28, 2022