[SPARK-17159] [streaming]: optimise check for new files in FileInputDStream #14731
Conversation
Test build #64140 has finished for PR 14731 at commit
LGTM. Does this sort of change make sense elsewhere, where similar list calls are made?
I'm going to scan through and tune them elsewhere; really I'm going by uses of the listFiles calls. There's actually no significant use elsewhere that I can see, just a couple of uses which filter on filename, so there is no cost penalty.

Returning to this patch, should I cut out the caching? I think it is superfluous:

```scala
// Read-through cache of file mod times, used to speed up mod time lookups
@transient private var fileToModTime = new TimeStampedHashMap[String, Long](true)
```
Why is the caching superfluous -- because no file is evaluated more than once here?
To be precise: the caching of file modification times is superfluous. It's there to avoid the cost of executing getFileStatus() calls.
Ah right, you already have the modification time for free. Sounds good, remove the caching.
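To make the point concrete, here is a minimal sketch (not the actual FileInputDStream code) of a scan that filters directly on the FileStatus entries returned by the listing, so the modification time is already in hand and no per-file lookup or cache is needed:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustrative only: list a directory once and keep the files modified at or
// after a threshold; the mod time comes from the FileStatus returned by the
// listing, so there is no separate getFileStatus() call to cache.
def filesNewerThan(dir: Path, threshold: Long): Seq[Path] = {
  val fs = dir.getFileSystem(new Configuration())
  fs.listStatus(dir).toSeq
    .filter(s => s.isFile && s.getModificationTime >= threshold)
    .map(_.getPath)
}
```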
Test build #64142 has finished for PR 14731 at commit
indent is wrong here
Also, fs is pretty confusing, because in this context it is often used to refer to a FileSystem. We should pick a different name.
I'll fix this
Test build #64156 has finished for PR 14731 at commit
This is ready to go, right @steveloughran? LGTM
LGTM. I was trying to see if there was a way to create a good test here by triggering the takes-too-long codepath and having a counter, but there's no obvious way to do that deterministically. I am doing a test for this against S3 in the spark-cloud module I'm writing; I can look at the printed counts of getFileStatus before/after the patch to see the difference, but the actual (testable) metrics are only accessible with the forthcoming Hadoop 2.8 release. TL;DR: no easy test, so there's nothing left to do.
Actually, I've just noticed that the DStream behaviour isn't in sync with the streaming programming guide, which says "(files written in nested directories not supported)". That is: SPARK-14796 didn't patch the docs. It may as well be fixed in this patch. How about putting it in the bullet points underneath "Special points for object stores"?

There's another optimisation worth considering here. Finally, any exception in the scan is caught and triggers a log at warning level and a reset(). It looks to me that this would include the FNFE raised by the directory not existing. I think a better message can be displayed there and the reset() operation skipped; that's not going to solve the problem in the filesystem.
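A sketch of the FNFE-specific handling suggested above; the names (scan, reset) are placeholders, not the actual FileInputDStream members:

```scala
import java.io.FileNotFoundException
import org.apache.hadoop.fs.{FileStatus, Path}

// Illustrative only: a missing monitored directory just means "no new files
// yet", so report it clearly and skip the reset that other failures trigger.
def safeScan(dir: Path,
             scan: Path => Array[FileStatus],
             reset: () => Unit): Array[FileStatus] = {
  try {
    scan(dir)
  } catch {
    case _: FileNotFoundException =>
      println(s"Monitored directory $dir does not exist (yet); no files found")
      Array.empty[FileStatus]
    case e: Exception =>
      println(s"Error while scanning $dir: $e")
      reset()
      Array.empty[FileStatus]
  }
}
```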
I've now done the s3a streaming test/example. It uses a pattern of s3a/path/sub* as the directory path, creates a file in a directory, renames the dir to match the path, and then verifies that the file was found in the time period allocated: https://gist.github.com/steveloughran/c8b39a7b87a9bd63d7a383bda8687e7e Notably, the scan of the empty dir took 150ms; once there's data in the tree the time jumps up to 500ms, once there are two entries under the tree (one dir and one file). Summary stats show 72 getFileStatus calls at the FS API, mapping to 140 HEAD calls and 88 LIST operations, on Hadoop branch-2. I'm going to do a test run with the modification here and see what it does to listing and status.
b08e3c9 to 79b57a2
It's a bit hard to get some of that phrasing of what the wildcard does right; it needs careful review. Tested using my s3 streaming test, which did use a * in the wildcard. All works, but no improvements in speed on what is a fairly unrealistic structure. The time to recursively list object stores remotely is tangibly slow. Maybe that should go in the text too: "it can take seconds to scan object stores for new data, with the time being proportional to directory depth and the number of files in a directory. Shallow and wide directory trees are faster".
Test build #64296 has finished for PR 14731 at commit
docs/streaming-programming-guide.md
Outdated
This isn't a regular expression though, what was the intent here?
wasn't it? My mistake. I wanted to show something fairly complex.
(2015|2016) would mean "the string '2015' or '2016'", and .. would mean "any two characters".
going with s3a://bucket/logs/(2015,2016)-*-friday
You'll need (2015|2016) rather than (2015,2016). Also this is going to match zero or more hyphens followed by "-friday". I think you mean ".." or ".{2}" or at least ".+" instead of "*" if this is a regex.
Round () or curly {} brackets? The wildcard character set used in isGlobPath() is "{}[]*?\\", so curly only.
Oh, is this not a regular expression? I'd change the doc then to not describe it as one. It sounds like it's following some other kind of glob syntax.
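For reference, a small sketch of how these patterns behave as Hadoop globs rather than regular expressions (the bucket and path here are made up):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hadoop glob syntax: '*' matches any run of characters, '?' a single
// character, and {a,b} is alternation; there is no regex-style '|'.
val pattern = new Path("s3a://bucket/logs/{2015,2016}-*-friday")
val fs: FileSystem = pattern.getFileSystem(new Configuration())

// globStatus may return null (e.g. if the path does not exist), so guard it.
val matches = Option(fs.globStatus(pattern)).getOrElse(Array.empty[FileStatus])
matches.foreach(status => println(status.getPath))
```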
The logic has got complex enough that it merits unit tests. Pulling it into SparkHadoopUtils itself and writing some for the possible cases: simple path, glob matches one, glob matches 1+, glob doesn't match, file not found.
Having looked at the source code: for the docs, I'll just use a wildcard * in the example, rather than try anything more sophisticated.
Test build #64368 has finished for PR 14731 at commit
b63abfe to 9bc0ea9
Test build #64486 has finished for PR 14731 at commit
Test build #64488 has finished for PR 14731 at commit
I'm concerned that this method is necessarily ambiguous, because you can't actually distinguish globs from other paths. Is this really needed? That's why the FS API exposes two methods.
It goes with the globPathIfNecessary call; you'd want to rename it to be consistent.

Regarding the FS APIs: there are way too many list operations, each with different flaws.

- The simple list(path, filter): Array[FileStatus] operations don't scale to a directory with hundreds of thousands of files, hence the remote-iterator versions.
- None of them provide any consistency guarantees, which is worth knowing. This is more of an issue with the remote iterators, as the iteration window is bigger, but even in those that return arrays, things may change during the enumeration of a large enough directory.
- Anything that treewalks is very suboptimal on blobstores, and somewhat inefficient for deep trees.
- listFiles(path, recursive=true) is the sole one which object stores can currently optimise by avoiding the treewalk and just doing a bulk list; HADOOP-13208 has added that for S3A (see the sketch below). But that method filters out all directories, which means that apps which do want directories too are out of luck.
- globStatus() is even less efficient than the others ... have a look at the source to see why.
- In HADOOP-13371 I'm exploring an optimised globber, but I don't want to write one which collapses at scale (i.e. in production).

I've added some comments in HADOOP-13371 about what to do there; I will probably adopt the "no regexp -> simple return" strategy implemented in this patch. But that will only benefit s3a on Hadoop 2.8+; patching Spark benefits everything.
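As a sketch of that recursive listing call (assuming a Hadoop 2.x FileSystem; this is not Spark code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path}
import scala.collection.mutable.ArrayBuffer

// listFiles(path, recursive = true) returns a RemoteIterator of files only
// (no directories); object store connectors such as S3A can serve it as a
// flat bulk listing instead of a treewalk.
def listAllFiles(root: Path): Seq[LocatedFileStatus] = {
  val fs: FileSystem = root.getFileSystem(new Configuration())
  val it = fs.listFiles(root, true)
  val results = ArrayBuffer[LocatedFileStatus]()
  while (it.hasNext) {
    results += it.next()
  }
  results.toSeq
}
```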
Here I'm narrowly concerned with the ambiguity of the behavior of a single method, because you can't distinguish between a path with a "?" in it and a glob wildcard for example. The rest seems orthogonal?
The change as it stood to resolve the issue in the OP seemed OK. This is bigger now and I'm not as sure about the rest of the change.
Essentially, if anything which might be a wildcard is hit, it gets handed off to the globber for the full interpretation. The same goes for ^ and ], which are only part of a pattern within the context of an opening [.

It's only those strings which can be verified to be regexp-free in a simple context-free string scan that say "absolutely no patterns here".

Regarding the bigger change: most of it is isolation of the sensitive code, plus the tests to verify behaviour.
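A rough illustration of that kind of scan (this is not the Spark implementation; the character set is the one quoted earlier for isGlobPath()):

```scala
// Pessimistic check: if the string contains any character that could start
// or form part of a glob pattern, hand the path to the globber; only strings
// with none of these characters are treated as plain paths.
val globChars: Set[Char] = "{}[]*?\\".toSet

def mightBeGlob(path: String): Boolean = path.exists(globChars.contains)

// Example: mightBeGlob("/logs/2016-08-*") == true
//          mightBeGlob("/logs/2016-08-19") == false
```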
Yeah, but then that's wrong if, for example, my path actually has a ? or ^ or ] in it. It doesn't seem essential, and it seems even problematic to add this behavior change to an otherwise clear fix.
I recognise your concerns.

- The actual check is the one already used in globPathIfNecessary; that's used in DataSource and FileStreamSource and is pretty much the main way wildcards are used to locate files. If it was broken, things would have surfaced.
- It is essentially an optimisation to bypass a bit of code which, if you looked at its internals, isn't very efficient. The check for expansion characters is pessimistic, so that anything which might be a wildcard pattern is still treated as one.
- ...but there are no standalone tests for globPathIfNecessary, so it may be that there are some failure modes that haven't yet surfaced; regressions waiting to happen.

Given it's a less significant optimisation than dodging all the getFileStatus calls, how about I split that into its own patch, one which should also add those checks for globPathIfNecessary? I'll trim this patch down to the changes in the streaming package: production code and associated tests.
Test build #64534 has finished for PR 14731 at commit
4134620 to b60f175
Test build #64662 has finished for PR 14731 at commit
The Hadoop FS spec has now been updated to declare exactly what HDFS does w.r.t. timestamps, and to warn that what other filesystems and object stores do is an implementation- and installation-specific feature: filesystem.md. That is the documentation update associated with this one; some of the content there was originally here, but it moved over to the Hadoop docs for the HDFS team to take the blame for when it changes.
Any more comments?
srowen
left a comment
I have to start from scratch every time I review this ... so this looks like it does more than just optimize a check for new files. It adds docs too. I don't know if the examples are essential. The extra info about how streaming works could be useful but isn't that separate? It's easier to get in small directed changes. This one has been going on for months and I know that's not in anyone's interest.
We don't already have this defined and available elsewhere?
Yes, but in a module that isn't the one where these tests were, so it'd need more dependency logic or pulling it up into a common module, which, if done properly, makes for a big diff.
I see, it's only otherwise defined in the SQL test utils class. Well something we could unify one day, maybe not such a big deal now here.
Just use Files.write?
I'll look at that; I think I went with IOU as it was on the CP and I'd never had bad experiences with it. Guava, well...
Actually, there's a straightforward reason: the test is using the Hadoop FS APIs, opening an output stream from a Path and writing to it; Files.write works with a local file. It doesn't work with the Hadoop FileSystem and Path classes, so it could only be used by abusing knowledge of path URLs. Going through FileSystem/Path uses the same API as you'd use in production, so it is the more rigorous test.
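A minimal sketch of the point being made, writing through the Hadoop FileSystem/Path API rather than java.nio.file.Files (the helper name is illustrative):

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.{FileSystem, Path}

// Works against any Hadoop FileSystem (local, HDFS, s3a, ...), which is why
// the test prefers this over Files.write, which only handles local files.
def writeTextFile(fs: FileSystem, path: Path, text: String): Unit = {
  val out = fs.create(path, true) // overwrite if present
  try {
    out.write(text.getBytes(StandardCharsets.UTF_8))
  } finally {
    out.close()
  }
}
```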
… filters and into filtering of the FileStatus instances returned in the results, so avoiding the need to create FileStatus instances.
- This doesn't add overhead to the filtering process; that's done as post-processing in FileSystem anyway. At worst it may result in larger lists being built up and returned.
- For every glob match, the code saves 2 RPC calls to the HDFS NN.
- The code saves 1-3 HTTP calls to S3 for the directory check (including a slow List call whenever the directory has children, as it doesn't exist as a blob any more).
- For the modtime check of every file, it saves an HTTP GET.
The whole modtime cache can be eliminated; it's a performance optimisation to avoid the overhead of the file checks, one that is no longer needed.
… time costs 0 to evaluate; caching it actually consumes memory and the time for a lookup.
…sues. Also note that 1s granularity is the resolution from HDFS; other filesystems may have a different resolution. The only one I know that is worse is FAT16/FAT32, which is accurate to 2s, but nobody should be using that except on SD cards and USB sticks.
…carded listing; handle FNFE specially, add the docs
… existing/similar method. Add tests for the behaviour. Update docs with suggested fixes, and review/edit.
…es and made more robust)
…taken for a public method and so needing to declare a return type.
…which doesn't shortcut on a non-wildcard operation
…s of how HDFS doesn't update file length or modtime until close or a block boundary is reached.
…ate the mtime field; this is covered in the streaming section to emphasise why write + rename is the strategy for streaming in files in HDFS. That strategy does also work in object stores, though the rename operation is O(data)
… tighten documentation
…a temp dir. Docs now refer the reader to the Hadoop FS spec for any details about what object stores do.
…and what is just punctuation
…store text and update slightly to make things a bit clearer. The more I learn about object stores, the less they resemble file systems.
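To illustrate the write-plus-rename strategy referred to in the commit messages above (a sketch only; the helper and path names are made up):

```scala
import java.io.IOException
import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.{FileSystem, Path}

// Write to a temporary path that the stream does not monitor, close the
// stream so length and mod time are final, then rename into the scanned
// directory so the file appears with its final contents and timestamp.
// On HDFS the rename is a cheap metadata operation; on object stores it is O(data).
def publish(fs: FileSystem, tmp: Path, dest: Path, data: String): Unit = {
  val out = fs.create(tmp, true)
  try {
    out.write(data.getBytes(StandardCharsets.UTF_8))
  } finally {
    out.close()
  }
  if (!fs.rename(tmp, dest)) {
    throw new IOException(s"Failed to rename $tmp to $dest")
  }
}
```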
724495b to a3aaf26
Test build #74990 has finished for PR 14731 at commit
Is there anything else I need to do here?
@srowen anything else I need to do here?
srowen
left a comment
I've timed out on this change. It changes every time and still doesn't match the title. I don't think this is a great way to pursue changes like this.
OK, I shall start again with a whole new PR of the current state.
Steve, I think the main point is you should also respect the time of reviewers. The way most of your pull requests manifest has been suboptimal: they often start with a very early WIP (which is not necessarily a problem), and then once in a while (e.g. a month or two later) you update it to almost completely change it. The time itself is a problem; it requires a lot of context switching to review your pull requests. In addition, every time you update it, it looks like a completely new giant pull request.
Reynold, I know very much about the time of reviewers; I put 1+ hours a day into reviewing on the Hadoop codebase, generally trying to review the work of non-colleagues, so as to pull in the broad set of contributions which are needed. I have been trying to get some object-store-related patches into Spark alongside the foundational work in fundamentally transforming how we work with object storage, especially S3, in Hadoop. Without the Spark-side changes, a lot gets lost: here the performance is approx 100-300ms/file when scanning an object store. Here I've split things in two: docs and diff. Both are independent, and both are reasonably tractable. If they can be reviewed fast and added, there's no problem of patches ageing and everyone having to resync. We can get this out of the way, and you'll have fewer reasons to be unhappy with me.
What changes were proposed in this pull request?
This PR optimises the filesystem metadata reads in FileInputDStream, by moving the filters used in FileSystem.globStatus and FileSystem.listStatus into filtering of the FileStatus instances returned in the results, so avoiding the need to create FileStatus instances within the FileSystem operation. The filtering is done as post-processing within the FileSystem glob/list operations anyway. As well as saving the getFileStatus() calls on the listed files, it should reduce the risk of AWS S3 throttling the HTTP requests, as it does when too many requests are made to parts of a single S3 bucket.

How was this patch tested?
Running the Spark streaming tests as a regression suite. In the SPARK-7481 cloud code, I could add a test against S3 which prints to stdout the exact number of HTTP requests made to S3 before and after the patch, so as to validate the speedup. (The S3A metrics in Hadoop 2.8+ are accessible at the API level, but as they are only accessible via a new API added in 2.8, using them would stop that proposed module building against Hadoop 2.7. Logging and manual assessment is the only cross-version strategy.)
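A rough before/after sketch of the change described under "What changes were proposed in this pull request?" (the helper names and the filter predicates are illustrative, not the actual FileInputDStream code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path, PathFilter}

// Before: a PathFilter is passed into globStatus(); every accepted Path may
// later need its own getFileStatus() call just to read the modification time.
def scanWithPathFilter(pattern: Path, acceptPath: Path => Boolean): Array[Path] = {
  val fs: FileSystem = pattern.getFileSystem(new Configuration())
  val filter = new PathFilter {
    override def accept(p: Path): Boolean = acceptPath(p)
  }
  Option(fs.globStatus(pattern, filter)).getOrElse(Array.empty[FileStatus]).map(_.getPath)
}

// After: list without a filter and select on the FileStatus entries directly,
// so the modification time comes straight from the listing results.
def scanOnStatuses(pattern: Path, isNew: FileStatus => Boolean): Array[FileStatus] = {
  val fs: FileSystem = pattern.getFileSystem(new Configuration())
  Option(fs.globStatus(pattern)).getOrElse(Array.empty[FileStatus]).filter(isNew)
}
```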