
added chicken param to remove methods, consistent apis #628

Closed
wants to merge 169 commits into from

Conversation

@coderfi commented Jan 22, 2015

I added a chicken parameter to the HdfsClient remove methods.
By default it is True, which tells the remove method to raise an exception if the requested remove path is dangerous.
Such paths include the root and single-level paths directly under it, i.e. '/' or '/etc'.
Hopefully this prevents embarrassing 'hadoop fs -rmr /' attempts. :)
In addition, the HdfsClientCdh3 mkdir and remove methods were API-incompatible with the base class's methods, so I went ahead and fixed those.
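
To make the behaviour concrete, here is a minimal sketch of what the opt-out looks like from the remove method's side (illustrative only, not the exact code in this PR; is_dangerous_rm_path is the guard discussed in the review comments below):

```python
# Illustrative sketch of the behaviour described above -- not the exact PR code.
def remove(self, path, recursive=True, chicken=True):
    """Remove `path`; refuse obviously dangerous targets unless chicken=False."""
    if chicken and is_dangerous_rm_path(path):
        raise ValueError("Refusing to remove dangerous path: %r" % (path,))
    # ... otherwise hand off to the underlying client / `hadoop fs -rmr` call ...
```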

@coderfi changed the title from "added chicken param to remove methods, consistent remove api" to "added chicken param to remove methods, consistent apis" on Jan 22, 2015
return True

if '//' in path:
    raise ValueError('Funny path, double slash: %s' % path)
Contributor

Why do you care about this at all?

Author


You mean the '//' check? It's somewhat of a kludge to simplify the following code.
Also, since this path string might ultimately be passed to a hadoop fs -rmr process, I would definitely be concerned about the behavior of:
hadoop fs -rmr //

Not sure whether the hadoop CLI would try to delete the root directory or exit 1.

Contributor


I don't think this code is necessary, but it can stay if you're worried about cases like //. The code is at least easy to digest here.

@Tarrasch (Contributor)

You should amend your commits. Other than that this looks good I think.

False
"""
if not path:
    return True
Contributor


What happens if we don't have this? I mean, can't we assume that the user passes in a string? Can something bad happen if we don't guard against this?

Author


The method itself doesn't know that.

The original impetus of this whole check is to help prevent developers writing Luigi tasks from shooting themselves in the foot and giving some sysadmin a very long night restoring from backups (if they exist!).

I've seen (and also caught myself) that, through whatever obscure luigi command-line mechanism or programmatic call, the path ultimately passed to the remove method can be basically anything -- and when such a luigi task is invoked by a cron job or some other automatic process, that "anything" could be doubly anything. :)

The path could actually be None or a blank string, depending on the calling method and how that method got its parameters (through a luigi.Parameter via some shell command-line argument, whether said parameter had default=None, etc.).

As for what bad can happen: presumably the caller (the remove method) asks whether path=None or path='' is dangerous.

If it is deemed not dangerous, the invoked CLI process might look like:
1) hadoop fs -rmr None (if the caller did naive string concatenation)
or
2) hadoop fs -rmr

For #1, the best case is we get a file-not-found error from the hadoop CLI. The worst case is we delete someone's file called 'None'. :(
For #2, the best case is the hadoop CLI complains that a required command-line argument is missing. The worst case is some hadoop distribution (or a wrapper someone wrote around hadoop) assumes something else and inadvertently deletes something.
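
For concreteness, here is an illustrative reconstruction of the guard being discussed, pieced together from the fragments shown in this review (not necessarily the exact PR code):

```python
def is_dangerous_rm_path(path):
    """Return True for paths that should never be removed blindly."""
    if not path:
        # None or '' would turn into 'hadoop fs -rmr None' or a bare 'hadoop fs -rmr'
        return True
    path = path.strip()
    if not path:
        return True
    if '//' in path:
        raise ValueError('Funny path, double slash: %s' % path)
    # '/', '/etc' and other single-level paths are too close to the root to trust
    parts = [p for p in path.split('/') if p]
    return len(parts) <= 1
```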

Contributor


Yeah, your worst case for 2) is completely unlikely. You mean the default would be to delete /? And your worst case for 1) is not that bad. Besides, case 1) will not happen; what will happen is that you get an exception thrown from path.strip().

I don't like this line, as it's logic that adds no value, and I get confused reading it since I expect lines of code to be useful. :)

@coderfi (Author) commented Jan 23, 2015

Responded to the various comments from @Tarrasch, thanks for the feedback!
One thing I did not think to check is whether the path is a full HDFS URI, i.e. hdfs:///, hdfs://, or hdfs://master/.
All three cases could potentially trigger a disastrous consequence.

I think this is worth looking into, but maybe add checks for that sort of thing in a separate commit?

@Tarrasch (Contributor)

Yeah, please hold off on the hdfs:/// stuff for now. Meanwhile, read through my comments and let me know what you think. :)

@coderfi (Author) commented Jan 29, 2015

Code updated per suggestions; hopefully it is more to your liking. :)
I noticed conflicts since my branch was too far behind your master, so I merged those in as well.

@Tarrasch (Contributor)

The commits are messed up now. The "updates to is_dangerous_rm_path(): " commit is shown twice. Can you properly rebase your commits and force push? :)

@coderfi (Author) commented Jan 30, 2015

Ugh, you might have to help me on that.
I'm an old school svn guy, and it was miraculous for me to even get the
branches working -_-

How's the code aside from the source control mess?

@erikbern (Contributor)

Try git rebase master.

Then you can also try git rebase -i HEAD~9 to squash the last few commits (it will let you cherry-pick which ones to keep).

daveFNbuck and others added 17 commits February 4, 2015 23:45
I woke up this morning to find that very few jobs had run overnight because
something weird happened with our hadoop cluster that caused all of our jobs to
hang permanently. When I checked in the morning, all workers were busy running
jobs that had been going for about 10 hours, and thousands of pending jobs that
would normally have been run overnight had piled up.

After killing and restarting all of my workers, everything started running fine
again because whatever weird issue had caused everything to hang was over. In
order to automate this fix, I've added a worker_timeout property to tasks that
tells the worker how long to attempt to run the task before killing it.

This property defaults to 0, meaning do not time tasks out. It can be changed
individually for a task by override or globally for all non-overridden tasks
via client.cfg. This only applies when running with more than 1 worker process,
as we need to run the task in a separate process in order to be able to kill it
after the timeout.
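A hedged sketch of how the property described above might be set on a single task (the name worker_timeout follows this commit message; the task name is made up, and the exact client.cfg section/key for the global default is not shown here):

```python
import luigi

class SlowExternalLoad(luigi.Task):  # hypothetical task name for illustration
    # Kill this task if it has been running for more than an hour;
    # per the commit message, the default of 0 means "never time out".
    worker_timeout = 3600

    def run(self):
        pass  # ... the actual work ...
```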
as opposed to expecting lists always
also cleared up the docs a bit
I've noticed that my visualiser has become unusable recently when it has too
many done tasks scheduled. The tab freezes up for a long time while processing
all the tasks, and then is very sluggish afterward. When I have this many done
tasks, it's too much to really browse through anyway, so this limits the
information shown to just the count. This trades off available information with
usability of the tool.

The maximum number of shown tasks is controlled by the scheduler and can be
adjusted in client.cfg. It may be better to make this part of the query so that
it could potentially be controlled by the end user. We may also in the future
want to add additional server queries to update larger sections for searches,
as this removes useful search information.
Add sphinx docs about how to enable the db task history, and what
functionality it provides (including screen shots). Adds a bit more
context to the existing docs on configuring the scheduler.
Adds support for tasks with non-alphanumeric values (e.g. dates
with dashes in them).
I've been seeing half-second get_work runtimes recently on my scheduler. The
cause of this has been a large backfill that scheduled a lot of done tasks. With
about 100k done tasks and about 350 pending tasks, a lot of time is wasted
looping over done tasks just to ignore them. This essentially caches these
computations by keeping a dictionary to group tasks by status and updating it
whenever task statuses are changed. This has sped my scheduling up by an order
of magnitude, now taking about 20ms per get_work.

In order to ensure updates happen every time, the task state has to take over
any functions that modify status. This is necessary for remote task storage
anyway, so it had to be done eventually. There are no new tests here because
there are no functionality changes, only improvements in efficiency and
different internal representations.
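A minimal sketch of the status-index idea described in that commit (illustrative only; the scheduler's real state object has far more responsibilities):

```python
from collections import defaultdict

class SimpleState(object):
    """Group tasks by status so get_work never has to scan DONE tasks."""

    def __init__(self):
        self._status_tasks = defaultdict(dict)  # status -> {task_id: task}

    def set_status(self, task, new_status):
        # Every status change goes through here so the index stays consistent.
        old_status = getattr(task, 'status', None)
        if old_status is not None:
            self._status_tasks[old_status].pop(task.id, None)
        task.status = new_status
        self._status_tasks[new_status][task.id] = task

    def get_pending_tasks(self):
        # get_work only looks at PENDING/RUNNING, skipping the ~100k DONE tasks.
        return [t for status in ('PENDING', 'RUNNING')
                for t in self._status_tasks[status].values()]
```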
steenzout and others added 25 commits February 4, 2015 23:48
* Use sphinx-build
* User should not read tox.ini in order to understand how to use it,
  so use html as default output format.
* Kept same output dir for html just in case.
$ tox -e docsapi
$ git add doc/api
This class adds support for "taskless" parameters.
Also made the default Config task include the section in the command-line arguments.
This new style makes it possible to override at runtime (using e.g. --hdfs-namenode-port) but also to use a configuration file as before.
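A hedged sketch of the "taskless parameter" style that commit describes (the class and parameter names here are made up for illustration; only the --hdfs-namenode-port example comes from the commit message):

```python
import luigi

class hdfs(luigi.Config):  # hypothetical config class for illustration
    # The class name doubles as the config section, so these can be set either
    # in the [hdfs] section of the config file or as --hdfs-namenode-port etc.
    # on the command line.
    namenode_host = luigi.Parameter(default='localhost')
    namenode_port = luigi.IntParameter(default=8020)

# Read anywhere, without belonging to any particular task:
port = hdfs().namenode_port
```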
Adapted from code submitted to Luigi user list by Felix Gao:
* https://groups.google.com/forum/#!topic/luigi-user/BIqAChl8sdw
ignore autopep8 E309 since, if a class has a docstring, it will be separated by a blank line.

changed sqlalchemy import statements.

this is necessary to fix bug in sphinx-doc issue#1530:
sphinx-doc/sphinx#1530

fixed luigi.interface.Interface.run sphinx complaints.
improved luigi.interface.run docstring.

fixed luigi.lock.getpcmd sphinx complaints.

fixed luigi.contrib.mysqldb.MySqlTarget docstring complaints and made 1st character of parameter description lowercase.

fixed luigi.contrib.redshift.RedshiftManifestTask sphinx complaints.

fixed sphinx SEVERE: Title level inconsistent.

fixed sphinx "ERROR: Unknown target name" errors.

fixed sphinx "WARNING: Explicit markup ends without a blank line" errors.

fixed sphinx "WARNING: Explicit markup ends without a blank line" errors.

fixed sphinx "Duplicate explicit target name: 'spotify'" error.

fixed broken URL to Yarn REST API.

fixed URL to oozie.

fixed URL for luigi-user Google group.
fixed URL to spotify.

changed URL to source code to make sphinx linkcheck pass.

added new readme.rst to exclude_pattern to fix sphinx complaint that it's not part of a table of content.
removed the html_static_path to fix sphinx build issue.

sphinx warnings now make the build break.

added docs-test tox environment to the travis build.
problems with documentation will now make the build break.

added more rules to .gitignore based on gitignore.io: Python, vim, OSX, Vagrant, PyCharm.

fixed sphinx warning to include luigi.tools.

merge tox docsapi with tox docs.
the tox docsapi environment creates new files that might break the documentation generation, so the command was moved into the tox docs environment.

synchronize README.rst in github and readthedocs.

copy of README.rst to doc directory.
URLs pointing to github changed to point to local files.
badges removed from the doc/README.rst.
added comments to build doc commands.
whitelist cp and sed commands.

README.rst added to sphinx exclude_patterns.
this bypasses the warning sphinx gives because README.rst is not part of a table of contents.
fixes 3 landscape errors on luigi.contrib.redshift.
During docs generation we copy README.rst to doc/ and clean it up
a bit. Let's delete it afterwards.
@coderfi (Author) commented Feb 5, 2015

Bah, I think I really messed things up on my branch, will try one more time on a new pull.
