delete files in advance #29

Open
wants to merge 1 commit into
base: master
from

kekekenta

Delete the files that match the path prefix from the GCS bucket before output, if needed.

@sakama
Contributor

sakama commented Mar 3, 2020

Why do you want to delete files in advance? Could you provide the background?
In my understanding, the file will be overwritten if the same file path already exists there.
Doesn't that work?

@kekekenta
Author

kekekenta commented Mar 3, 2020

For example, when the Embulk task count changes, the number of output files changes too, so unnecessary files from the previous run remain.

$ embulk run -X min_output_tasks=4 -I lib/ example.yml
# This creates 4 files like the ones below in my environment.
# 
# test.000.01.csv, test.001.01.csv, test.002.01.csv, test.003.01.csv

$ embulk run -X min_output_tasks=2 -I lib/ example.yml
# This creates 2 files like the ones below in my environment. These files are overwritten, but test.002.01.csv and test.003.01.csv remain.
# 
# test.000.01.csv, test.001.01.csv

I would like to ensure that only the transferred files are present.
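The leftover-file situation described above can be sketched as a set difference between the object names written by the previous run and those written by the current run. This is only an illustration: the `outputNames` helper and its naming pattern are hypothetical, modeled on the file names in the example, and are not the plugin's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class StaleFileSketch {
    // Hypothetical helper: builds names like "test.000.01.csv", following the
    // pattern seen in the example above (one file per output task).
    static List<String> outputNames(String prefix, int taskCount) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            names.add(String.format(Locale.ROOT, "%s.%03d.01.csv", prefix, i));
        }
        return names;
    }

    // Names written by the previous run that the current run does not overwrite.
    static List<String> staleNames(List<String> previous, List<String> current) {
        List<String> stale = new ArrayList<>(previous);
        stale.removeAll(current);
        return stale;
    }

    public static void main(String[] args) {
        List<String> firstRun = outputNames("test", 4);
        List<String> secondRun = outputNames("test", 2);
        // prints [test.002.01.csv, test.003.01.csv]
        System.out.println(staleNames(firstRun, secondRun));
    }
}
```

Deleting everything under the prefix before writing makes this difference empty by construction, which is the behavior the pull request aims for.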

@kekekenta
Author

In my understanding, the number of output files varies with the input record size, among other factors.

Contributor

@sakama sakama left a comment

Thanks. I understand your intention, and it sounds good.
I added several comments.
Could you also add unit tests if possible?

You don't need to worry about the CI failure. It is caused by the encryption key settings on Travis CI.

logger.info("delete files");
Storage client = createClient(task);
try {
Objects listResult = client.objects().list(task.getBucket()).setDelimiter("/").setPrefix(task.getPathPrefix()).execute();
Contributor

Author

ok, thanks.

try {
Objects listResult = client.objects().list(task.getBucket()).setDelimiter("/").setPrefix(task.getPathPrefix()).execute();
if (listResult.getItems() == null) {
logger.info("no files found");
Contributor

minor: "no files were found" or "no file was found"

Author

thanks

}
for (StorageObject item : listResult.getItems()) {
logger.info("delete file: {}/{}", item.getBucket(), item.getName());
client.objects().delete(item.getBucket(), item.getName()).execute();
Contributor

It would be better to move logger.info("delete file: {}/{}", item.getBucket(), item.getName()); after sending the request.
Otherwise, the log message is shown even when the delete operation fails.

Contributor

Could you also use retry with exponential backoff?
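As a rough illustration of the retry request above, here is a minimal sketch of retrying with exponential backoff in plain Java. The `callWithBackoff` helper, its parameters, and the simulated operation are all hypothetical; the plugin may well use a different mechanism, such as a retry helper from Embulk's own libraries.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

final class BackoffRetrySketch {
    // Hypothetical helper, not part of the plugin or of Embulk.
    // Retries the operation on IOException, doubling the wait after each failure.
    static <T> T callWithBackoff(Callable<T> operation, int maxAttempts, long initialWaitMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long waitMillis = initialWaitMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (IOException ex) {
                if (attempt >= maxAttempts) {
                    throw ex; // give up and surface the last failure
                }
                Thread.sleep(waitMillis);
                waitMillis *= 2; // exponential backoff
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated delete request that fails twice before succeeding.
        String result = callWithBackoff(() -> {
            if (++calls[0] < 3) {
                throw new IOException("transient failure " + calls[0]);
            }
            return "deleted";
        }, 5, 10L);
        // prints "deleted after 3 attempts"
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In the plugin, the wrapped operation would be the delete request itself, and the "delete file" log line would be written only after the call returns successfully.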

}
}
catch (ConfigException | IOException ex) {
throw Throwables.propagate(ex);
Contributor

I don't think ConfigException is thrown from any part of this method.
But it would be nice to throw ConfigException if IOException happens.
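The suggestion above amounts to catching only `IOException` and rethrowing it as a configuration error. A minimal sketch follows, with loud assumptions: the nested `ConfigException` is a local stand-in so that the example compiles on its own (the real class is Embulk's `org.embulk.config.ConfigException`, whose constructors should be checked before use), and `IoAction` and `runOrThrowConfigException` are hypothetical names.

```java
import java.io.IOException;

final class ExceptionTranslationSketch {
    // Local stand-in for Embulk's ConfigException (assumption: an unchecked exception).
    static final class ConfigException extends RuntimeException {
        ConfigException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Hypothetical interface for an operation that may fail with an I/O error.
    interface IoAction {
        void run() throws IOException;
    }

    // Catches only IOException and rethrows it as ConfigException, keeping the cause.
    static void runOrThrowConfigException(IoAction action) {
        try {
            action.run();
        } catch (IOException ex) {
            throw new ConfigException("Failed to delete files in advance: " + ex.getMessage(), ex);
        }
    }

    public static void main(String[] args) {
        try {
            runOrThrowConfigException(() -> {
                throw new IOException("bucket not reachable");
            });
        } catch (ConfigException ex) {
            // prints "Failed to delete files in advance: bucket not reachable"
            System.out.println(ex.getMessage());
        }
    }
}
```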


@Config("delete_in_advance")
@ConfigDefault("false")
Boolean getDeleteInAdvance();
Contributor

boolean

Author

sorry..

@@ -78,6 +88,26 @@ public TransactionalFileOutput open(TaskSource taskSource, final int taskIndex)
return new GcsTransactionalFileOutput(task, client, taskIndex);
}

private void deleteFiles(PluginTask task)
{
logger.info("delete files");
Contributor

It isn't clear to me from this message what the plugin is doing.
Would "Start delete files operation" or something similar be better?

Author

Thanks, I changed this as you suggested.

Storage client = createClient(task);
try {
Objects listResult = client.objects().list(task.getBucket()).setDelimiter("/").setPrefix(task.getPathPrefix()).execute();
if (listResult.getItems() == null) {
Contributor

if (listResult.getItems().size() == 0)

Author

listResult.getItems() returns null when no files are found.
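Given that the listing can come back as null rather than as an empty list, one way to keep the delete loop safe is to normalize the result first. The sketch below uses a hypothetical `orEmpty` helper and plain strings in place of the storage objects; it illustrates the null handling only and is not the code in this pull request.

```java
import java.util.Collections;
import java.util.List;

final class NullableItemsSketch {
    // Hypothetical helper: treats a null listing result as an empty list.
    static <T> List<T> orEmpty(List<T> items) {
        return items == null ? Collections.<T>emptyList() : items;
    }

    public static void main(String[] args) {
        List<String> noMatches = null; // stands in for a listing with no matching objects
        for (String name : orEmpty(noMatches)) {
            System.out.println("would delete " + name); // never reached here
        }
        // prints 0
        System.out.println(orEmpty(noMatches).size());
    }
}
```

With this normalization, a check such as `size() == 0` and the `for` loop both work without a separate null test.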

@kekekenta
Author

@sakama Thanks a lot. I will check your review comments.

@kekekenta kekekenta force-pushed the delete-in-advance branch 4 times, most recently from 71ed3ba to 1b7e0a9 March 6, 2020 06:53
@kekekenta kekekenta requested a review from sakama March 6, 2020 06:53
@kekekenta
Author

@sakama I added some tests and fixed the issues you pointed out.
Please check my code again.

@kekekenta
Author

@sakama Sorry to bother you. Would you review my code again when you have time?

@sakama
Contributor

sakama commented Mar 16, 2020

Sorry to have kept you waiting. I'll review it in a few days.

@dmikurube
Member

dmikurube commented Feb 19, 2022

@kekekenta Sorry for leaving this pull-request for a long time.

The CI settings of this plugin were not working well (especially with Travis CI) for the past several months. We've gotten the CI back with GitHub Actions and a new GCS bucket configuration. It should be working now.

If you still intend to work on getting this merged, could you rebase it onto the latest master and then (force-)push it, or open another pull request? We are thinking of releasing it as v0.6.1 after we release v0.6.0 from the current master.

We had a lot of changes in v0.5.0 from v0.4.4, and are going to have more changes in v0.6.0 soon: for example, catching up with Embulk v0.10, relocating Java packages, and changing the license to the Apache License 2.0. Rebasing will probably not be trivial.

@dmikurube dmikurube added this to the v0.6.1 milestone Feb 19, 2022
@dmikurube dmikurube self-requested a review February 19, 2022 13:31
@dmikurube
Member

@chikamura

chikamura commented Oct 29, 2024

@dmikurube @sakama

We apologise for the very late response. 🙇
I have modified it to match the current master branch.
Could you please review this PR when you are available?

4 participants