@maropu
Member

@maropu maropu commented Mar 30, 2016

What changes were proposed in this pull request?

This PR adds a config to control the maximum number of files in a partition, since even small files have a non-trivial fixed cost. The current packing can put many small files together, which causes straggler tasks.
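
As an illustration only, the idea of capping the number of files per partition can be sketched as follows. This is not the code from the patch: `FileSlice`, `MaxFilesPacking`, `maxBytes`, and `maxNumFiles` are hypothetical names chosen for this example, and the real planner works on Spark's own file and partition types.

```scala
// Hypothetical stand-in for a file to be read; not a Spark class.
final case class FileSlice(path: String, sizeInBytes: Long)

object MaxFilesPacking {
  // Greedily packs files in order. The current partition is closed when adding the
  // next file would exceed the byte budget, or when it already holds maxNumFiles files.
  def pack(files: Seq[FileSlice], maxBytes: Long, maxNumFiles: Int): Seq[Seq[FileSlice]] = {
    val partitions = Seq.newBuilder[Seq[FileSlice]]
    var current = Vector.empty[FileSlice]
    var currentBytes = 0L
    for (file <- files) {
      val overBytes = currentBytes + file.sizeInBytes > maxBytes
      val overCount = current.size >= maxNumFiles
      if (current.nonEmpty && (overBytes || overCount)) {
        partitions += current
        current = Vector.empty
        currentBytes = 0L
      }
      current :+= file
      currentBytes += file.sizeInBytes
    }
    if (current.nonEmpty) partitions += current
    partitions.result()
  }
}
```

With a byte budget alone, a hundred one-byte files would all land in a single partition; with the cap set to 32 they are spread over four partitions, which is the behavior this config is meant to allow.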

How was this patch tested?

I added tests to FileSourceStrategySuite that check whether many files get split across partitions.

@SparkQA

SparkQA commented Mar 30, 2016

Test build #54536 has finished for PR 12068 at commit 67cd08f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nongli
Contributor

nongli commented Mar 30, 2016

LGTM. I think there's more we can do here to bin pack a bit better (i.e. checking if small files can fit in existing partitions) but it would be good to get this in and have more experience with how to configure this.
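
One possible reading of "checking if small files can fit in existing partitions" is a first-fit strategy. The sketch below is an assumption made for illustration, not something proposed in this PR: `FirstFitPacking` is a hypothetical name, and files are reduced to their sizes in bytes.

```scala
import scala.collection.mutable.ArrayBuffer

object FirstFitPacking {
  // First-fit: place each file into the first existing partition that still has room,
  // and open a new partition only when none of the existing ones can take it.
  def pack(sizes: Seq[Long], maxBytes: Long): Seq[Seq[Long]] = {
    val bins = ArrayBuffer.empty[ArrayBuffer[Long]]
    val used = ArrayBuffer.empty[Long] // bytes already assigned to each partition
    for (size <- sizes) {
      val idx = used.indexWhere(_ + size <= maxBytes)
      if (idx >= 0) {
        bins(idx) += size
        used(idx) += size
      } else {
        bins += ArrayBuffer(size)
        used += size
      }
    }
    bins.map(_.toList).toList
  }
}
```

For sizes 6, 5, 4, 3, 2 and a budget of 10, this yields two partitions, (6, 4) and (5, 3, 2), whereas closing a partition as soon as the next file does not fit would yield three.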

@asfgit asfgit closed this in dadf013 Mar 30, 2016
isPublic = true)

val FILES_MAX_NUM_IN_PARTITION = longConf("spark.sql.files.maxNumInPartition",
  defaultValue = Some(32),
Contributor

How is the default determined?

Member Author

Currently, I have no particular reason for choosing this default value.

asfgit pushed a commit that referenced this pull request Apr 4, 2016
… opening

## What changes were proposed in this pull request?

This PR basically redoes what #12068 did, but with a different model, which should work better for small files of different sizes.

## How was this patch tested?

Updated existing tests.

Ran a query on thousands of small partitioned files locally with all default settings (the cost to open a file should be overestimated). Task durations became smaller and smaller, which is good (the last few tasks are the shortest).
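
A rough sketch of what a cost-based model could look like, under stated assumptions: each file is charged its size plus a fixed open cost, and files are packed largest first. `OpenCostPacking` and the parameter name `openCostInBytes` are chosen for this illustration and are not taken from the commit message; the descending sort is also an assumption, picked because it is consistent with the last tasks being the shortest.

```scala
object OpenCostPacking {
  // Charges every file its size plus an assumed fixed per-file open cost, so that many
  // tiny files fill a partition roughly as quickly as a few large ones.
  def pack(sizes: Seq[Long], maxBytes: Long, openCostInBytes: Long): Seq[Seq[Long]] = {
    val partitions = Seq.newBuilder[Seq[Long]]
    var current = Vector.empty[Long]
    var currentCost = 0L
    // Largest files first, so the final partitions hold the smallest leftovers.
    for (size <- sizes.sortBy(-_)) {
      val cost = size + openCostInBytes
      if (current.nonEmpty && currentCost + cost > maxBytes) {
        partitions += current
        current = Vector.empty
        currentCost = 0L
      }
      current :+= size
      currentCost += cost
    }
    if (current.nonEmpty) partitions += current
    partitions.result()
  }
}
```

With ten one-byte files, a budget of 16, and an open cost of 4, each file is charged 5, so at most three files fit in a partition. With an open cost of 0 all ten files end up in one partition, which is the straggler case the model is trying to avoid.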

Author: Davies Liu <davies@databricks.com>

Closes #12095 from davies/file_cost.
@maropu maropu deleted the SPARK-14259 branch July 5, 2017 11:43