added chicken param to remove methods, consistent apis #628
Conversation
return True

if '//' in path:
    raise ValueError('Funny path, double slash: %s' % path)
Why do you care about this at all?
You mean the '//' check? It's somewhat of a kludge to simplify the following code.
Also, since ultimately this path string might be passed to a hadoop fs -rmr process, I would definitely be concerned at the behavior of:
hadoop fs -rmr //
Not sure if the hadoop CLI would try to delete the root directory, or exit 1.
I don't think this code is necessary, but it can stay if you're worried about cases like '//'. The code is at least easy to digest here.
You should amend your commits. Other than that this looks good, I think.
False
"""
if not path:
    return True
What happens if we don't have this? I mean, can't we assume that the user passes in a string? Can something bad happen if we don't guard against this?
The method itself doesn't know that.
The original impetus of this whole check is to help prevent developers writing Luigi tasks from shooting themselves in the foot and giving some sysadmin a very long night restoring from backups (if they exist!).
I've seen (and also caught myself) that, through whatever obscure luigi command-line mechanism or programmatic calls, the ultimate path passed in to the remove method can be basically anything -- and when such a luigi task is invoked by a cron job or some other automatic process, the anything could be doubly anything. :)
The path could actually be None or a blank string, depending on the calling method and how that method got its parameters (through a luigi.Parameter via some shell command-line argument, whether said parameter had default=None, etc.).
In case something bad happens... presumably the caller (the remove method) asks whether path=None or path='' is dangerous.
If it is not flagged as dangerous, the invoked CLI process might look like:
1) hadoop fs -rmr None (if the caller did naive string concatenation)
or
2) hadoop fs -rmr
For #1, best case is we get a file not found error from hadoop cli. Worst case is we deleted someone's file called 'None' :(
For #2, best case is hadoop cli complains that a required command line argument is missing. Worst case is some hadoop distribution (or maybe someone created a hadoop wrapper), assumed something else and inadvertently deletes something.
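To make the failure modes above concrete, here is a tiny sketch of how case 1) arises from naive string concatenation (the helper name is hypothetical, not anything in Luigi):

```python
def build_rm_command(path):
    # naive string concatenation, as described in case 1) above
    return 'hadoop fs -rmr %s' % path

# A None path silently becomes the literal argument 'None' (case 1),
# and an empty path produces a command with no argument at all (case 2).
print(build_rm_command(None))  # hadoop fs -rmr None
print(build_rm_command(''))
```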
Yea, your worst case for 2) is completely unlikely. You mean the default would be to delete /? And your worst case for 1) is not that bad. Besides, case 1) will not happen. What will happen is that you get an exception thrown from path.strip().
I don't like this line as it's logic that gives no value, and I get confused reading it since I expect lines of code to be useful. :)
Responded to the various comments from @Tarrasch, thanks for the feedback! I think this is worth looking into, but maybe add checks for that sort of thing in a separate commit?
Yea, wait with the
Code updated per suggestions, hopefully it is more to your liking :)
The commits are messed up now. The "updates to is_dangerous_rm_path(): " commit is shown twice. Can you properly rebase your commits and force push? :)
Ugh, you might have to help me on that. How's the code aside from the source control mess?
try Then you can also try
I woke up this morning to find that very few jobs had run overnight because something weird happened with our hadoop cluster that caused all of our jobs to hang permanently. When I checked in the morning, all workers were busy running jobs that had been going for about 10 hours, and thousands of pending jobs that would normally have been run overnight had piled up. After killing and restarting all of my workers, everything started running fine again because whatever weird issue had caused everything to hang was over. In order to automate this fix, I've added a worker_timeout property to tasks that tells the worker how long to attempt to run the task before killing it. This property defaults to 0, meaning do not time tasks out. It can be changed individually for a task by override or globally for all non-overridden tasks via client.cfg. This only applies when running with more than 1 worker process, as we need to run the task in a separate process in order to be able to kill it after the timeout.
as opposed to expecting lists always; also cleared up the docs a bit
I've noticed that my visualiser has become unusable recently when it has too many done tasks scheduled. The tab freezes up for a long time while processing all the tasks, and then is very sluggish afterward. When I have this many done tasks, it's too much to really browse through anyway, so this limits the information shown to just the count. This trades off available information with usability of the tool. The maximum number of shown tasks is controlled by the scheduler and can be adjusted in client.cfg. It may be better to make this part of the query so that it could potentially be controlled by the end user. We may also in the future want to add additional server queries to update larger sections for searches, as this removes useful search information.
Add sphinx docs about how to enable the db task history, and what functionality it provides (including screen shots). Adds a bit more context to the existing docs on configuring the scheduler.
Adds support for tasks with non-alphanumeric values (e.g. dates with dashes in them).
I've been seeing half-second get_work runtimes recently on my scheduler. The cause of this has been a large backfill that scheduled a lot of done tasks. With about 100k done tasks and about 350 pending tasks, a lot of time is wasted looping over done tasks just to ignore them. This essentially caches these computations by keeping a dictionary to group tasks by status and updating it whenever task statuses are changed. This has sped my scheduling up by an order of magnitude, now taking about 20ms per get_work. In order to ensure updates happen every time, the task state has to take over any functions that modify status. This is a necessary for remote task storage anyway, so it had to be done eventually. There are no new tests here because there are no functionality changes, only improvements in efficiency and different internal representations.
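The status-grouping idea described above can be sketched as follows (a toy illustration, not the scheduler's actual task-state code):

```python
from collections import defaultdict

class TaskState(object):
    """Keep a status -> task_ids index, updated on every status change,
    so get_work never has to loop over done tasks just to ignore them."""

    def __init__(self):
        self._tasks = {}                    # task_id -> status
        self._by_status = defaultdict(set)  # status -> set of task_ids

    def set_status(self, task_id, status):
        # All status changes must go through here so the index stays
        # consistent with the per-task statuses.
        old = self._tasks.get(task_id)
        if old is not None:
            self._by_status[old].discard(task_id)
        self._tasks[task_id] = status
        self._by_status[status].add(task_id)

    def get_pending(self):
        # O(pending tasks) instead of O(all tasks)
        return self._by_status['PENDING']
```

Funneling every status change through one method is what makes the cached index safe, and as noted above it is also a prerequisite for remote task storage.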
* Use sphinx-build
* User should not read tox.ini in order to understand how to use it, so use html as default output format.
* Kept same output dir for html just in case.
$ tox -e docsapi
$ git add doc/api
added docstring.
This class adds support for "taskless" parameters
Also made the default Config task include the section in the command line arguments
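The "taskless" parameter pattern can be sketched like this (a minimal stand-in: the class name acts as the config section, class attributes are the options, and a plain dict stands in for the parsed config file -- all names here are illustrative, not Luigi's actual Config implementation):

```python
class Config(object):
    """Taskless parameter holder: options come from class attributes,
    overridable by the matching section of a config mapping."""

    def __init__(self, config):
        # Look up the section named after the subclass, e.g. [hdfs]
        section = config.get(self.__class__.__name__, {})
        for name, default in vars(self.__class__).items():
            if name.startswith('_') or callable(default):
                continue
            setattr(self, name, section.get(name, default))

class hdfs(Config):
    namenode_host = 'localhost'
    namenode_port = 8020

conf = hdfs({'hdfs': {'namenode_port': 9000}})
# conf.namenode_port comes from the "file"; conf.namenode_host keeps its default
```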
…e does not exist and cleanup pep8 warnings
This new style makes it possible to override on runtime (using eg. --hdfs-namenode-port) but also use configuration file like previously
Adapted from code submitted to Luigi user list by Felix Gao: * https://groups.google.com/forum/#!topic/luigi-user/BIqAChl8sdw
* ignore autopep8 E309, since if a class has a docstring it will be separated by a blank line
* changed sqlalchemy import statements; this is necessary to fix a bug in sphinx-doc: sphinx-doc/sphinx#1530
* fixed luigi.interface.Interface.run sphinx complaints
* improved luigi.interface.run docstring
* fixed luigi.lock.getpcmd sphinx complaints
* fixed luigi.contrib.mysqldb.MySqlTarget docstring complaints and made 1st character of parameter descriptions lowercase
* fixed luigi.contrib.redshift.RedshiftManifestTask sphinx complaints
* fixed sphinx "SEVERE: Title level inconsistent"
* fixed sphinx "ERROR: Unknown target name" errors
* fixed sphinx "WARNING: Explicit markup ends without a blank line" errors
* fixed sphinx "Duplicate explicit target name: 'spotify'" error
* fixed broken URL to Yarn REST API
* fixed URL to oozie
* fixed URL for luigi-user Google group
* fixed URL to spotify
* changed URL to source code to make sphinx linkcheck pass
* added new readme.rst to exclude_patterns to fix sphinx complaint that it's not part of a table of contents
* removed the html_static_path to fix sphinx build issue
* sphinx warnings now make the build break
* added docs-test tox environment to the travis build; problems with documentation will now make the build break
* added more rules to .gitignore based on gitignore.io: Python, vim, OSX, Vagrant, PyCharm
* fixed sphinx warning to include luigi.tools
* merged tox docsapi into tox docs: the docsapi environment creates new files that might break documentation generation, so its command was moved into the docs environment
* synchronized README.rst between github and readthedocs: README.rst is copied to the doc directory, URLs pointing to github changed to point to local files, and badges removed from doc/README.rst
* added comments to build doc commands
* whitelisted cp and sed commands
* README.rst added to sphinx exclude_patterns
this bypasses the warning sphinx gives because README.rst is not part of a table of contents.
fixes 3 landscape errors on luigi.contrib.redshift.
During docs generation we copy README.rst to doc/ and clean it up a bit. Let's delete it afterwards.
$ tox -e isort
such as mkdir and remove
Bah, I think I really messed things up on my branch, will try one more time on a new pull.
I added the chicken parameter to the HdfsClient remove methods.
By default, this is True, which tells the remove method to raise an exception if the requested remove path is dangerous.
Such paths include the root and single-level root paths, i.e. '/' or '/etc'.
Hopefully this prevents embarrassing 'hadoop fs -rmr /' attempts. :)
In addition, the HdfsClientCdh3 mkdir and remove methods were API-incompatible with the base class's methods, so I went ahead and fixed those.
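As a rough illustration of the described behavior (this is a hypothetical sketch, not the actual HdfsClient code from this PR):

```python
class ChickenHdfsClient(object):
    """Sketch of remove() with a ``chicken`` guard (default True)."""

    def remove(self, path, recursive=True, chicken=True):
        # With chicken=True, refuse dangerous paths instead of
        # handing them to 'hadoop fs -rmr'.
        if chicken and self._is_dangerous_rm_path(path):
            raise RuntimeError('chickening out: refusing to remove %r' % (path,))
        return ('removed', path)  # stand-in for the real CLI invocation

    def _is_dangerous_rm_path(self, path):
        # Dangerous: None/empty, '/', or a single root-level dir like '/etc'
        if not path or not path.strip():
            return True
        p = path.strip().rstrip('/')
        return p == '' or (p.startswith('/') and p.count('/') <= 1)
```

A caller that really does want to clear a top-level path can opt out explicitly with chicken=False, which keeps the dangerous behavior deliberate rather than accidental.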