Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Options to skip small files and not recurse on input paths #90

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

gsteelman
Copy link

Added support for a boolean configuration key "skip_indexing_small_files". If this is enabled, files smaller than one block in size will not be indexed. This is useful because indexing files smaller than a block is essentially wasteful. The default is false so the current behavior is preserved.

Added support for a boolean configuration key "recursive_indexing". If this is enabled, paths passed in on the command line will not be recursively searched for files to index. This allows for flexibility on specifying input paths for indexing. The default is true so the current behavior is preserved.

LOG.info("Unable to get status of path " + path);
return false;
}
return status.getLen() >= status.getBlockSize() ? true : false;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too restrictive. With high compression levels and a large FS block, you might still want to split the FS block to get a spill-less mapper. I would make the threshold configurable.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. Does a default value (if user's config has a bunk value like "abc") of block size make sense?

@dvryaboy
Copy link
Contributor

This is odd -- I swear we did this years ago. @rangadi do you remember what the deal is? Is this something we put into EB instead of hadoop-lzo?

@gsteelman
Copy link
Author

It looks like the build is failing when using -P hadoop-old due to:
[ERROR] Failed to execute goal on project hadoop-lzo: Could not resolve dependencies for project com.hadoop.gplcompression:hadoop-lzo:jar:0.4.20-SNAPSHOT: Could not transfer artifact org.apache.hadoop:hadoop-core:jar:1.0.4 from/to central (http://repo.maven.apache.org/maven2): GET request of: org/apache/hadoop/hadoop-core/1.0.4/hadoop-core-1.0.4.jar from central failed: Connection reset -> [Help 1]

@gsteelman
Copy link
Author

I've added a configuration option for what size of a file should be considered "small." By default it is Long.MIN_VALUE, which should preserve current behavior if it is not specified.

As it stands currently, the user configure lzo_skip_indexing_small_files = true and not configure lzo_small_file_size, which would leave the size as default Long.MIN_VALUE. In this case specifying to skip would not actually skip any files.

I see two possible remedies, any preferences on which one? I am leaning towards option 1.

  1. Ensure that skip and skip size are specified together (not just one or the other)
  2. Change default skip size to something like 1 block size.

@gsteelman
Copy link
Author

@gerashegalov @sjlee Thoughts?

@gsteelman
Copy link
Author

@dvryaboy It looks like a previous pull request #82 did something similar, but was also never merged. It's possible the change you're talking about is in elephantbird instead of hadoop-lzo, like you said.

private final String LZO_RECURSIVE_INDEXING_KEY = "lzo_recursive_indexing";
private final boolean LZO_SKIP_INDEXING_SMALL_FILES_DEFAULT = false;
private final boolean LZO_RECURSIVE_INDEXING_DEFAULT = true;
private final long LZO_SMALL_FILE_SIZE_DEFAULT = Long.MIN_VALUE;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

33-38: if these are meant as constants (which I think they are), they should be private static final's.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if Long.MIN_VALUE is the best default value here, as it would be -2**63. Note that this is printed in the usage as well. If the goal is to disable the small-file-skipping feature if this configuration is not set, isn't 0 fine as well?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, those are meant to be constants. I missed the static modifier. 0 seems reasonable, I'll work that into the next commit.

@sjlee
Copy link
Collaborator

sjlee commented Oct 26, 2015

Sorry it took me a super long time to revisit this. I went over the PR, and have some comments (some more major than others). Comments coming...

A high level comment: it would be great if you can add some unit tests that cover this.


private static final String LZO_EXTENSION = new LzopCodec().getDefaultExtension();

private static final String LZO_SKIP_INDEXING_SMALL_FILES_KEY = "lzo_skip_indexing_small_files";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think it's a common practice for hadoop and related code bases to use dots (".") as separators for config keys; e.g. "lzo_skip_indexing_small_files" -> "lzo.skip-indexing-small-files". By the same token, how about "lzo.skip-indexing-small-files.size" instead of "lzo_small_file_size", and "lzo.recursive-indexing.enabled" instead of "lzo_recursive_indexing"?

Another nit: let's pair each key definition and its default.

Another nit: have an empty line between the static members and the instance members.

…variable grouping, rename constants for grouping, rename constants values for hadoop naming style.
…Name instead of path.getString for extension filtering. Add comments. Reduce number of Path.getFileSystem and getFileStatus.
…nal access. Refactor configuration/job setup. Add unit tests. Remove unused variable in TestLzoRandData.
public static final long LZO_INDEXING_SMALL_FILE_SIZE_DEFAULT = 0;
public static final String LZO_INDEXING_RECURSIVE_KEY = "lzo.indexing.recursive.enabled";
public static final boolean LZO_INDEXING_RECURSIVE_DEFAULT = true;
private static final String TEMP_FILE_EXTENSION = "/_temporary";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't really an extension but rather a file/directory name. Should it be more like TEMP_FILE_NAME or TEMP_DIRECTORY_NAME (depending on whether it is a file or directory)?

return -1;
}
/**
* Determine based on previous configuration of this indexer whether a file
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Determine -> Determines

@sjlee
Copy link
Collaborator

sjlee commented Dec 17, 2015

Could you please add unit tests around the recursive behavior? There are quite a few tests around whether the file should be indexed, but I don't see tests for the recursion.

Also, it would be great if you can test this code against real data to see if there is any surprise that isn't caught by the unit tests (and review). Thanks again!

@CLAassistant
Copy link

CLAassistant commented Jul 18, 2019

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants