Options to skip small files and not recurse on input paths #90

gsteelman · 2014-08-11T21:03:32Z

Added support for a boolean configuration key "skip_indexing_small_files". If this is enabled, files smaller than one block in size will not be indexed. This is useful because indexing files smaller than a block is essentially wasteful. The default is false so the current behavior is preserved.

Added support for a boolean configuration key "recursive_indexing". If this is disabled, paths passed in on the command line will not be recursively searched for files to index. This allows for flexibility when specifying input paths for indexing. The default is true so the current behavior is preserved.
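
A minimal sketch of how the two flags and their behavior-preserving defaults could be read. It uses `java.util.Properties` as a stand-in for the Hadoop configuration object, and the class `IndexerOptions` and its members are hypothetical names, not code from this patch.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not code from this pull request.
final class IndexerOptions {
  // Key names as quoted in the description above.
  static final String SKIP_SMALL_FILES_KEY = "skip_indexing_small_files";
  static final String RECURSIVE_KEY = "recursive_indexing";

  final boolean skipSmallFiles;
  final boolean recursive;

  private IndexerOptions(boolean skipSmallFiles, boolean recursive) {
    this.skipSmallFiles = skipSmallFiles;
    this.recursive = recursive;
  }

  // Reads both flags, falling back to the defaults that keep the old behavior:
  // index every file (skip = false) and descend into directories (recursive = true).
  // Boolean.parseBoolean treats anything other than "true" (ignoring case) as false.
  static IndexerOptions from(Properties conf) {
    boolean skip = Boolean.parseBoolean(conf.getProperty(SKIP_SMALL_FILES_KEY, "false"));
    boolean recursive = Boolean.parseBoolean(conf.getProperty(RECURSIVE_KEY, "true"));
    return new IndexerOptions(skip, recursive);
  }
}
```

With an empty `Properties` object, `IndexerOptions.from` yields `skipSmallFiles == false` and `recursive == true`, which matches the defaults described above.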

gerashegalov · 2014-08-12T21:45:23Z

src/main/java/com/hadoop/compression/lzo/DistributedLzoIndexer.java

+        LOG.info("Unable to get status of path " + path);
+        return false;
+      }
+      return status.getLen() >= status.getBlockSize() ? true : false;


This is too restrictive. With high compression levels and a large FS block, you might still want to split the FS block to get a spill-less mapper. I would make the threshold configurable.

That's a good point. Does it make sense to fall back to the block size as the default, for example when the user's config has a bogus value like "abc"?
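
A minimal sketch of the configurable threshold discussed here, assuming the raw setting arrives as a string and that the file system block size is the fallback. `SmallFileThreshold` and its methods are hypothetical names used only for illustration.

```java
// Hypothetical helper for illustration; not code from this pull request.
final class SmallFileThreshold {
  private SmallFileThreshold() {}

  // Returns the configured threshold in bytes, or blockSize when the value
  // is missing or not a valid long (for example "abc").
  static long parseOrDefault(String raw, long blockSize) {
    if (raw == null) {
      return blockSize;
    }
    try {
      return Long.parseLong(raw.trim());
    } catch (NumberFormatException e) {
      return blockSize;
    }
  }

  // A file is indexed when it is at least as large as the threshold.
  static boolean shouldIndex(long fileLength, long threshold) {
    return fileLength >= threshold;
  }
}
```

For example, `parseOrDefault("abc", blockSize)` returns `blockSize`, so a malformed setting behaves like the original one-block rule instead of failing the job.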

dvryaboy · 2014-08-12T22:07:44Z

This is odd -- I swear we did this years ago. @rangadi do you remember what the deal is? Is this something we put into EB instead of hadoop-lzo?

gsteelman · 2014-08-12T23:47:06Z

It looks like the build is failing when using -P hadoop-old due to:
[ERROR] Failed to execute goal on project hadoop-lzo: Could not resolve dependencies for project com.hadoop.gplcompression:hadoop-lzo:jar:0.4.20-SNAPSHOT: Could not transfer artifact org.apache.hadoop:hadoop-core:jar:1.0.4 from/to central (http://repo.maven.apache.org/maven2): GET request of: org/apache/hadoop/hadoop-core/1.0.4/hadoop-core-1.0.4.jar from central failed: Connection reset -> [Help 1]

gsteelman · 2014-08-14T21:17:30Z

I've added a configuration option for what size of a file should be considered "small." By default it is Long.MIN_VALUE, which should preserve current behavior if it is not specified.

As it stands, a user can set lzo_skip_indexing_small_files = true without setting lzo_small_file_size, which leaves the size at its default of Long.MIN_VALUE. In that case, asking to skip small files would not actually skip any files.

I see two possible remedies, any preferences on which one? I am leaning towards option 1.

1. Ensure that skip and skip size are specified together (not just one or the other).
2. Change default skip size to something like 1 block size.
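
A minimal sketch of the problem and of option 1, assuming an unset size is represented as `null`. `SkipSettings` and its methods are hypothetical names, not code from this patch.

```java
// Hypothetical helper for illustration; not code from this pull request.
final class SkipSettings {
  private SkipSettings() {}

  // Default discussed in this comment.
  static final long SMALL_FILE_SIZE_DEFAULT = Long.MIN_VALUE;

  // A file is skipped only when skipping is enabled and the file is smaller
  // than the configured size. No length is below Long.MIN_VALUE, so with the
  // default size nothing is ever skipped.
  static boolean wouldSkip(boolean skipEnabled, long smallFileSize, long fileLength) {
    return skipEnabled && fileLength < smallFileSize;
  }

  // Option 1: the flag and the size must be given together or not at all.
  // smallFileSize is null when the user did not set it.
  static void validate(boolean skipEnabled, Long smallFileSize) {
    boolean sizeGiven = smallFileSize != null;
    if (skipEnabled != sizeGiven) {
      throw new IllegalArgumentException(
          "skip flag and small-file size must be specified together");
    }
  }
}
```

Here `wouldSkip(true, SMALL_FILE_SIZE_DEFAULT, 0)` is `false`, which is the silent no-op described above, while `validate(true, null)` rejects that combination up front.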

gsteelman · 2014-08-29T00:05:47Z

@gerashegalov @sjlee Thoughts?

gsteelman · 2014-08-29T01:23:31Z

@dvryaboy It looks like a previous pull request #82 did something similar, but was also never merged. It's possible the change you're talking about is in elephantbird instead of hadoop-lzo, like you said.

sjlee · 2014-08-29T16:58:51Z

src/main/java/com/hadoop/compression/lzo/DistributedLzoIndexer.java

+  private final String LZO_RECURSIVE_INDEXING_KEY = "lzo_recursive_indexing";
+  private final boolean LZO_SKIP_INDEXING_SMALL_FILES_DEFAULT = false;
+  private final boolean LZO_RECURSIVE_INDEXING_DEFAULT = true;
+  private final long LZO_SMALL_FILE_SIZE_DEFAULT = Long.MIN_VALUE;


33-38: if these are meant as constants (which I think they are), they should be private static final's.

I'm not sure if Long.MIN_VALUE is the best default value here, as it would be -2**63. Note that this is printed in the usage as well. If the goal is to disable the small-file-skipping feature if this configuration is not set, isn't 0 fine as well?

Yes, those are meant to be constants. I missed the static modifier. 0 seems reasonable, I'll work that into the next commit.

sjlee · 2015-10-26T20:42:11Z

Sorry it took me a super long time to revisit this. I went over the PR, and have some comments (some more major than others). Comments coming...

A high level comment: it would be great if you can add some unit tests that cover this.

sjlee · 2015-10-26T20:44:29Z

src/main/java/com/hadoop/compression/lzo/DistributedLzoIndexer.java


+  private static final String LZO_EXTENSION = new LzopCodec().getDefaultExtension();
+
+  private static final String LZO_SKIP_INDEXING_SMALL_FILES_KEY = "lzo_skip_indexing_small_files";


Nit: I think it's a common practice for hadoop and related code bases to use dots (".") as separators for config keys; e.g. "lzo_skip_indexing_small_files" -> "lzo.skip-indexing-small-files". By the same token, how about "lzo.skip-indexing-small-files.size" instead of "lzo_small_file_size", and "lzo.recursive-indexing.enabled" instead of "lzo_recursive_indexing"?

Another nit: let's pair each key definition and its default.

Another nit: have an empty line between the static members and the instance members.
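
A minimal sketch of what these nits could look like when applied together: dot-separated key names, each key next to its default, and the static members grouped on their own. The key strings are the ones suggested in this comment, and the defaults follow the earlier discussion (false, 0, true). The class and constant names are hypothetical, and a later revision quoted further down uses `lzo.indexing.recursive.enabled` instead.

```java
// Hypothetical constants holder for illustration; not code from this pull request.
final class LzoIndexerKeys {
  private LzoIndexerKeys() {}

  // Each key is paired with its default.
  static final String SKIP_SMALL_FILES_KEY = "lzo.skip-indexing-small-files";
  static final boolean SKIP_SMALL_FILES_DEFAULT = false;

  static final String SMALL_FILE_SIZE_KEY = "lzo.skip-indexing-small-files.size";
  static final long SMALL_FILE_SIZE_DEFAULT = 0L;

  static final String RECURSIVE_KEY = "lzo.recursive-indexing.enabled";
  static final boolean RECURSIVE_DEFAULT = true;
}
```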

sjlee · 2015-12-17T22:28:09Z

src/main/java/com/hadoop/compression/lzo/DistributedLzoIndexer.java

+  public static final long LZO_INDEXING_SMALL_FILE_SIZE_DEFAULT = 0;
+  public static final String LZO_INDEXING_RECURSIVE_KEY = "lzo.indexing.recursive.enabled";
+  public static final boolean LZO_INDEXING_RECURSIVE_DEFAULT = true;
+  private static final String TEMP_FILE_EXTENSION = "/_temporary";


This isn't really an extension but rather a file/directory name. Should it be more like TEMP_FILE_NAME or TEMP_DIRECTORY_NAME (depending on whether it is a file or directory)?
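
A minimal sketch of the distinction being drawn, assuming the constant is used to detect paths that lie under a `_temporary` directory (how the patch actually uses it is not shown in this excerpt). `TempPaths` and its members are hypothetical names. Matching whole path components avoids false positives that a plain substring or suffix check would produce.

```java
// Hypothetical helper for illustration; not code from this pull request.
final class TempPaths {
  private TempPaths() {}

  // Named as a directory name, as the review suggests, without the leading slash.
  static final String TEMP_DIRECTORY_NAME = "_temporary";

  // True when any component of the slash-separated path equals "_temporary".
  static boolean isUnderTemporary(String path) {
    for (String component : path.split("/")) {
      if (component.equals(TEMP_DIRECTORY_NAME)) {
        return true;
      }
    }
    return false;
  }
}
```

With this check, `/data/_temporary/part-0.lzo` is treated as temporary, while `/data/_temporary_backup/a.lzo` is not.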

sjlee · 2015-12-17T22:42:13Z

src/main/java/com/hadoop/compression/lzo/DistributedLzoIndexer.java

-      return -1;
-    }
+  /**
+   * Determine based on previous configuration of this indexer whether a file


Determine -> Determines

sjlee · 2015-12-17T22:52:51Z

Could you please add unit tests around the recursive behavior? There are quite a few tests around whether the file should be indexed, but I don't see tests for the recursion.

Also, it would be great if you can test this code against real data to see if there is any surprise that isn't caught by the unit tests (and review). Thanks again!
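
A minimal sketch of the behavior such recursion tests would pin down, using the local file system through `java.nio.file` as a stand-in for HDFS. `LzoFileLister` is a hypothetical name, the `.lzo` suffix is assumed to be the codec's default extension, and the indexer's real traversal may differ in its details.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not code from this pull request.
final class LzoFileLister {
  private LzoFileLister() {}

  // Lists .lzo files under root as sorted paths relative to root.
  // With recursive = false only the direct children of root are considered.
  static List<String> list(Path root, boolean recursive) throws IOException {
    int maxDepth = recursive ? Integer.MAX_VALUE : 1;
    try (Stream<Path> paths = Files.walk(root, maxDepth)) {
      return paths
          .filter(Files::isRegularFile)
          .map(p -> root.relativize(p).toString().replace('\\', '/'))
          .filter(name -> name.endsWith(".lzo"))
          .sorted()
          .collect(Collectors.toList());
    }
  }
}
```

A test can build a temporary directory containing `top.lzo` and `sub/nested.lzo`, then check that the non-recursive listing contains only `top.lzo` while the recursive listing contains both.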

CLAassistant · 2019-07-18T15:08:10Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

gsteelman and others added 2 commits August 11, 2014 09:56

Added option to skip indexing small files. Added option to not recurs…

3492976

…e for files to index from input paths.

Removed hard-coded default values in run().

1a12f89

gerashegalov reviewed Aug 12, 2014

Added config option for file size to be considered 'small'.

8759f78

sjlee reviewed Aug 29, 2014

gsteelman added 2 commits September 1, 2014 15:39

Added static modifiers to constants, replaced unnecessary PathFilter …

37381da

…with helper method.

Removed 'this' modifier from static variables.

a42e041

sjlee reviewed Oct 26, 2015

gsteelman added 5 commits December 14, 2015 15:45

Merge remote-tracking branch 'twitter/master' into add_distributed_lz…

e8a9352

…o_indexer_skip_small_files

Remove unnecessary this keyword scoping, fix style for static/member …

6d6dda1

…variable grouping, rename constants for grouping, rename constants values for hadoop naming style.

Make recursive behavior mimic Hadoop recursive behavior. Use path.get…

b0fef1d

…Name instead of path.getString for extension filtering. Add comments. Reduce number of Path.getFileSystem and getFileStatus.

Change many methods and constants to public for testing and for exter…

f76ef07

…nal access. Refactor configuration/job setup. Add unit tests. Remove unused variable in TestLzoRandData.

Comment cleanup in tests, remove useless setUp and tearDown methods. …

00fc2bf

…Add javadoc to DistributedLzoIndexer.

sjlee reviewed Dec 17, 2015

Rename TEMP_FILE_EXTENSION to TEMP_FILE_PATH and use toString for che…

c1968fa

…cking Path names.

sjlee reviewed Dec 17, 2015

sjlee mentioned this pull request Aug 16, 2016

Embed the native libraries in the hadoop-lzo jar #73

Closed
