Data retention CLI #684
No, not yet. It could be added in the future, though; it has been briefly mentioned a couple of times in local discussions.
Ok, I think it might be useful in big setups. We have hundreds of jobs running every day and dozens of jobs running every hour, so I'm pretty sure we would need to scale ArangoDB quite often without retention. Any idea if retention can somehow be added in ArangoDB itself?
Well... not in ArangoDB natively (afaik), but you could use your own custom script. However, this approach is kinda fragile and dirty, and I wouldn't recommend it, especially in a busy environment. On the other hand, if you can find a logical time gap in your data pipelines that is big enough (e.g. one hour), you can try to cut the DB in the middle of that time gap, or even closer to its end. Simply delete all documents with
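As a rough illustration of the "custom script" idea, the deletion could look something like the following AQL (a sketch only: the collection name `progress` and a numeric `timestamp` attribute are assumptions, and `@cutoff` is a bind parameter you would compute from your time gap):

```aql
// Hypothetical sketch: remove all documents older than a cutoff timestamp.
// Assumes `progress` documents carry a numeric `timestamp` (epoch millis).
FOR d IN progress
  FILTER d.timestamp < @cutoff
  REMOVE d IN progress
```

In a real setup you would run one such pass per collection, and delete edge documents (e.g. the edges pointing at removed vertices) as well, which is exactly why the approach is fragile.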
Thx for the detailed answer :) No way to find any time gap on our Hadoop; it's processing something 24/7/365, all the time :)
Awesome. I see your case will be a really great test case for Spline :))
Thanks for responding. Do we have any tentative timeline for its release?
We'll try to implement something simple in 0.5.4, which should be out in the course of this month. We are addressing performance bottlenecks in this release, so this feature fits here logically.
Solution:
On second thought, I realized that having a CLI command for removing old data makes it unnecessary to make it a Spline Server responsibility. System cron jobs can be used instead to simply run
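The cron-based approach could look roughly like this (a sketch only: the `spline-admin db-prune` command name, its flags, and the paths are all hypothetical stand-ins for whatever CLI ends up being implemented):

```
# Hypothetical crontab entry: every night at 02:00, purge lineage data
# older than 90 days and append the output to a log file.
0 2 * * * /opt/spline/spline-admin db-prune --retain-days 90 >> /var/log/spline-prune.log 2>&1
```

The appeal of this design is that retention policy stays an operational concern (schedule, retention window, logging) rather than server configuration.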
…rvices (#763)

* spline #761 ArangoDB: WITH keyword is required in a cluster environment
* spline #684 Add Foxx service
* spline #684 Call Foxx service
* spline #761 ArangoDB: Add missing WITH keywords in a Foxx service
* spline #761 ArangoDB: Remove Spline AQL UDFs
* spline #761 Upgrade ArangoDB driver + method to register Foxx service programmatically
* spline #761 Register Foxx services / migration
* spline #761 Undo changes to 0.5.0-0.5.4 migration script as we're in version 0.5.5 now, so cannot change an older migration
* spline #761 Remove redundant extension class
Just out of curiosity, what's the size of your collections?
Hi @wajda,
Unfortunately, I cannot give any ETA. On one side, it's not a priority feature for our company at the moment. That said, a lot of things are changing right now, and there is a chance the project will receive some boost very soon. On the other side, as I tried to explain above, this feature depends on another feature that introduces a set of fundamental changes to the way transactions are managed, and that one has to be finished first. I hope to give some update on plans, say, next month.
Hi @wajda, our total Spline database size was around ~100GB on the system that we were trying to clean up. We dealt with the
Combining this with the previous script changes has allowed us to trim a day's worth of stale objects in 1-1.5 hrs.
The following is the final cleanup script that we are using: https://gist.github.com/Aditya-Sood/ecc07c9f296dbdf03d4946c5d1b4efce Could you please review the script and let us know if there are any changes you would like? After that, we would like to discuss integrating the cleanup functionality into the project, possibly as a configurable service added to the Admin utility.
Thank you @Aditya-Sood. I will review it this week and get back to you.
@Aditya-Sood, we've reviewed your script and found it generally correct. However, I've got a few minor questions/suggestions about it:
…b.com/Aditya-Sood/ecc07c9f296dbdf03d4946c5d1b4efce

* naively tested with test data (multiple lineages at different times; purge with time in between; correct outcome: older purged, newer kept)
Hi @wajda, I've understood the changes required for the first 3 subpoints regarding point (3). Lastly, shall I continue working on this feature, or has it been re-assigned? Thanks
There is no need for a separate index for
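For reference, creating such an index in arangosh would look roughly like this (a sketch only: the collection name `progress` and the field name `timestamp` are assumptions, not the project's actual schema):

```js
// arangosh sketch: persistent index on a timestamp attribute,
// so range filters like `d.timestamp < @cutoff` can use the index
db.progress.ensureIndex({ type: "persistent", fields: ["timestamp"] });
```

Whether a dedicated index pays off depends on whether an existing index already covers the filtered attribute, which is presumably the point being made here.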
As we hadn't heard from you for some time, we decided to speed things up a bit. We took your gist and (with some minor changes) basically incorporated it into an older PR that was started and abandoned quite a long time ago. @Aditya-Sood, it would be awesome if you could help us test and optimize @dk1844's PR by trying it on your large database.
In reference to #684 (comment), I did some measurement runs for stage 1 and stage 2 with different AQL to get a sense of how they perform.

**Common info**

All measuring was done on the same 12200 lineage records generated by our data-gen. Arango profiling was always used, done on "my machine", so the results should only be interpreted as relative to each other.

**Stage 1**

**Option A**

This option comes from the original PR#762 on this very topic.
**Option B**

This option comes from @Aditya-Sood's script. It has the theoretical downside of possibly creating orphaned `progress` and `progressOf` entries (which would get cleaned up on the next purge run).
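The orphan sweep mentioned above could, in principle, look something like this AQL (a sketch only; it relies on `DOCUMENT()` returning `null` when the referenced vertex no longer exists):

```aql
// Sketch: remove progressOf edges whose target vertex was already purged
FOR e IN progressOf
  FILTER DOCUMENT(e._to) == null
  REMOVE e IN progressOf
```

A similar pass over orphaned `progress` documents would follow the same pattern.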
**Option C**

This option comes from the reference suggestion message.
**Results of profiling**

All results have max index selectivity (100%) on one or multiple indices.

Option A - Peak Mem 1376256B; median time 2.20633s (2.17843s, 2.20633s, 2.42307s)

**Possible conclusion**

@Aditya-Sood's approach (Option B) is the fastest and the most memory-conservative. If we don't want to use it because of the possible inconsistency issue, the other two options are relatively similar, with Option A seeming a bit better.

**Stage 2**

Stage 2 was attempted on the same data after one of the Stage 1 options had been run (regardless of which).

**Option A**

This is @Aditya-Sood's approach with the two commands passed together (possibly faster).
**Option B**

This option comes from the reference suggestion message.
**Results of profiling**

All results have max index selectivity (100%) on one or multiple indices.

Option A - Peak Mem 327680B; median first-time run 0.70598s; median of subsequent runs (cached?): 0.03214s

**Possible conclusion**

Option A seems a bit better (less than half the memory needed; run time also seems slightly better).
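For what it's worth, the medians quoted above are just the middle value of three runs; for example, the Stage 1 / Option A figure can be reproduced as:

```python
from statistics import median

# Three Stage 1 / Option A timing samples (seconds), as reported above
samples = [2.17843, 2.20633, 2.42307]
print(median(samples))  # -> 2.20633, the reported median time
```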
* #684 data retention
  - original code from PR#762
  - brought up to date with current develop
  - updated to reflect logic from https://gist.github.com/Aditya-Sood/ecc07c9f296dbdf03d4946c5d1b4efce
  - Sonar/Codacy code quality updates
  - db prune - stage 1 option A, stage 2 option A applied
Is there any easy way to configure automatic retention for Spline data? For example, automatically delete entries older than 3 months?