Merge branch 'upstream-master' into feature/email-attachment

* upstream-master: (82 commits)
  S3 client refactor (spotify#2482)
  Rename to rpc_log_retries, and make it apply to all the logging involved
  Factor log_exceptions into a configuration parameter
  Fix attribute forwarding for tasks with dynamic dependencies (spotify#2478)
  Add a visiblity level for luigi.Parameters (spotify#2278)
  Add support for multiple requires and inherits arguments (spotify#2475)
  Add metadata columns to the RDBMS contrib (spotify#2440)
  Fix race condition in luigi.lock.acquire_for (spotify#2357) (spotify#2477)
  tests: Use RunOnceTask where possible (spotify#2476)
  Optional TOML configs support (spotify#2457)
  Added default port behaviour for Redshift (spotify#2474)
  Add codeowners file with default and specific example (spotify#2465)
  Add Data Revenue to the `blogged` list (spotify#2472)
  Fix Scheduler.add_task to overwrite accepts_messages attribute. (spotify#2469)
  Use task_id comparison in Task.__eq__. (spotify#2462)
  Add stale config
  Move github templates to .github dir
  Fix transfer config import (spotify#2458)
  Additions to provide support for the Load Sharing Facility (LSF) job scheduler (spotify#2373)
  Version 2.7.6
  ...
dlstadther · Aug 14, 2018 · 328c6bf
2 parents 70336bc + c696f40
commit 328c6bf
Showing 90 changed files with 4,611 additions and 935 deletions.
diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
@@ -0,0 +1,12 @@
+# The following patterns are used to auto-assign review requests
+# to specific individuals. Order is important; the last matching
+# pattern takes the most precedence.
+
+# These owners will be the default owners for everything in
+# the repo. Unless a later match takes precedence,
+* @dlstadther @Tarrasch @ulzha
+
+# Specific files, directories, paths, or file types can be
+# assigned more specifically.
+contrib/redshift*.py @dlstadther
+
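The "last matching pattern takes the most precedence" rule described in the CODEOWNERS comments can be sketched as follows. This is only an illustration: `owners_for` and `RULES` are hypothetical names, and `fnmatchcase` is a simplified stand-in for the gitignore-style matching that GitHub actually applies.

```python
from fnmatch import fnmatchcase

# Rules in file order, mirroring the CODEOWNERS file in this diff.
RULES = [
    ("*", ["@dlstadther", "@Tarrasch", "@ulzha"]),
    ("contrib/redshift*.py", ["@dlstadther"]),
]


def owners_for(path, rules=RULES):
    """Return the owners of the last rule whose pattern matches `path`.

    Hypothetical helper: real CODEOWNERS matching follows gitignore-style
    rules, which fnmatchcase only approximates.
    """
    owners = []
    for pattern, rule_owners in rules:
        if fnmatchcase(path, pattern):
            owners = rule_owners  # a later match overrides an earlier one
    return owners
```

Under these assumptions, a file that matches only `*` falls back to the default owners, while a path that also matches `contrib/redshift*.py` is assigned to the more specific owner.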
diff --git a/ISSUE_TEMPLATE.md b/.github/ISSUE_TEMPLATE.md
diff --git a/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
diff --git a/.github/stale.yml b/.github/stale.yml
@@ -0,0 +1,20 @@
+# Number of days of inactivity before an issue becomes stale
+daysUntilStale: 120
+# Number of days of inactivity before a stale issue is closed
+daysUntilClose: 14
+# Issues with these labels will never be considered stale
+exemptLabels:
+ - pinned
+ - security
+# Label to use when marking an issue as stale
+staleLabel: wontfix
+# Comment to post when marking an issue as stale. Set to `false` to disable
+markComment: >
+ This issue has been automatically marked as stale because it has not had
+ recent activity. It will be closed if no further activity occurs.
+ If closed, you may revisit when your time allows and reopen!
+ Thank you for your contributions.
+# Comment to post when closing a stale issue. Set to `false` to disable
+closeComment: false
+# Limit to only `issues` or `pulls`
+# only: issues
diff --git a/.gitignore b/.gitignore
@@ -15,15 +15,17 @@ pig_property_file
 
 packages.tar
 
+# Ignore the data files
+data
 test/data
+examples/data
 
 Vagrantfile
 
 *.pickle
 *.rej
 *.orig
 
-
 # Created by https://www.gitignore.io
 
 ### Python ###

diff --git a/.travis.yml b/.travis.yml
@@ -15,6 +15,9 @@ env:
  - BQ_TEST_PROJECT_ID=luigi-travistestenvironment
  - BQ_TEST_INPUT_BUCKET=luigi-bigquery-test
  - GOOGLE_APPLICATION_CREDENTIALS=test/gcloud-credentials.json
+ - AWS_DEFAULT_REGION=us-east-1
+ - AWS_ACCESS_KEY_ID=accesskey
+ - AWS_SECRET_ACCESS_KEY=secretkey
  matrix:
  - TOXENV=flake8
  - TOXENV=docs

diff --git a/README.rst b/README.rst
@@ -148,6 +148,8 @@ or held presentations about Luigi:
 * `Open Targets <https://www.opentargets.org/>`_ `(blog, 2017) <https://blog.opentargets.org/using-containers-with-luigi>`__
 * `Leipzig University Library <https://ub.uni-leipzig.de>`_ `(presentation, 2016) <https://de.slideshare.net/MartinCzygan/build-your-own-discovery-index-of-scholary-eresources>`__ / `(project) <https://finc.info/de/datenquellen>`__
 * `Synetiq <https://synetiq.net/>`_ `(presentation, 2017) <https://www.youtube.com/watch?v=M4xUQXogSfo>`__
+* `Glossier <https://www.glossier.com/>`_ `(blog, 2018) <https://medium.com/glossier/how-to-build-a-data-warehouse-what-weve-learned-so-far-at-glossier-6ff1e1783e31>`__
+* `Data Revenue <https://www.datarevenue.com/>`_ `(blog, 2018) <https://www.datarevenue.com/en/blog/how-to-scale-your-machine-learning-pipeline>`_
 
 Some more companies are using Luigi but haven't had a chance yet to write about it:
 
@@ -163,6 +165,7 @@ Some more companies are using Luigi but haven't had a chance yet to write about
 * `Deloitte <https://www.Deloitte.co.uk/>`_
 * `Stacktome <https://stacktome.com/>`_
 * `LINX+Neemu+Chaordic <https://www.chaordic.com.br/>`_
+* `Foxberry <https://www.foxberry.com/>`_
 
 We're more than happy to have your company added here. Just send a PR on GitHub.
 

diff --git a/codecov.yml b/codecov.yml
@@ -1,21 +1,32 @@
-# First just blindly copy paste what is default values from the docs page
-# https://github.com/codecov/support/wiki/codecov.yml
 coverage:
- precision: 2
- round: down
- range: "70...100"
+ precision: 2 # Just copied from default
+ round: down # Just copied from default
+ range: "70...100" # Just copied from default
 
  status:
  project:
+ default: false # disable the default status that measures entire project
+ core:
+ target: 92%
+ paths: "luigi/*.py" 
+ patch: # Just copied from default
  default:
- target: auto
  if_no_uploads: error
 
- patch:
- default:
- if_no_uploads: error
-
- changes: true
+ changes: true # Just copied from default
+
+ ignore:
+ - "examples/"
+ - "luigi/tools" # These are tested as actual run commands without coverage
+ # List modules whose tests are not run by Travis or
+ # are run in subprocesses (like on a cluster).
+ - "luigi/contrib/gcs.py"
+ - "luigi/contrib/bigquery.py"
+ - "luigi/contrib/bigquery_avro.py"
+ - "luigi/contrib/hdfs/"
+ - "luigi/contrib/hadoop.py"
+ - "luigi/contrib/mrrunner.py"
+ - "luigi/contrib/kubernetes.py"
 
-# But for luigi we do not want any comments
+# For luigi we do not want any comments
 comment: false
diff --git a/doc/command_line.rst b/doc/command_line.rst
diff --git a/doc/configuration.rst b/doc/configuration.rst
@@ -1,18 +1,35 @@
 Configuration
 =============
 
-All configuration can be done by adding configuration files. They are looked for in:
+All configuration can be done by adding configuration files.
 
- * ``/etc/luigi/client.cfg``
- * ``luigi.cfg`` (or its legacy name ``client.cfg``) in your current working directory
- * ``LUIGI_CONFIG_PATH`` environment variable
+Supported config parsers:
+* ``cfg`` (default)
+* ``toml``
 
-in increasing order of preference. The order only matters in case of key conflicts (see docs for ConfigParser.read_). These files are meant for both the client and ``luigid``. If you decide to specify your own configuration you should make sure that both the client and ``luigid`` load it properly.
+You can choose the parser via the ``LUIGI_CONFIG_PARSER`` environment variable. For example, ``LUIGI_CONFIG_PARSER=toml``.
+
+Config files for the default (cfg) parser are looked for in:
+
+* ``/etc/luigi/client.cfg`` (deprecated)
+* ``/etc/luigi/luigi.cfg``
+* ``client.cfg`` (deprecated)
+* ``luigi.cfg``
+* ``LUIGI_CONFIG_PATH`` environment variable
+
+Config files for the `TOML <https://github.com/toml-lang/toml>`_ parser are looked for in:
+
+* ``/etc/luigi/luigi.toml``
+* ``luigi.toml``
+* ``LUIGI_CONFIG_PATH`` environment variable
+
+Both config lists increase in priority (from low to high). The order only matters in case of key conflicts (see docs for ConfigParser.read_). These files are meant for both the client and ``luigid``. If you decide to specify your own configuration you should make sure that both the client and ``luigid`` load it properly.
 
 .. _ConfigParser.read: https://docs.python.org/3.6/library/configparser.html#configparser.ConfigParser.read
 
-The config file is broken into sections, each controlling a different part of the config. Example configuration file:
+The config file is broken into sections, each controlling a different part of the config.
 
+Example cfg config:
 
 .. code:: ini
 
@@ -23,6 +40,17 @@ The config file is broken into sections, each controlling a different part of th
  [core]
  scheduler_host=luigi-host.mycompany.foo
 
+Example toml config:
+
+.. code:: python
+
+ [hadoop]
+ version = "cdh4"
+ streaming-jar = "/usr/lib/hadoop-xyz/hadoop-streaming-xyz-123.jar"
+
+ [core]
+ scheduler_host = "luigi-host.mycompany.foo"
+
 
 .. _ParamConfigIngestion:
 
@@ -154,6 +182,7 @@ parallel_scheduling
  If true, the scheduler will compute complete functions of tasks in
  parallel using multiprocessing. This can significantly speed up
  scheduling, but requires that all tasks can be pickled.
+ Defaults to false.
 
 parallel-scheduling-processes
  The number of processes to use for parallel scheduling. If not specified
@@ -270,6 +299,12 @@ check_unfulfilled_deps
  resource-intensive.
  Defaults to true.
 
+force_multiprocessing
+ By default, luigi uses multiprocessing when *more than one* worker process is
+ requested. When set to true, multiprocessing is used independent of the
+ number of workers.
+ Defaults to false.
+
 
 [elasticsearch]
 ---------------
@@ -716,6 +751,15 @@ worker_disconnect_delay
  scheduler before removing it and marking all of its running tasks as
  failed. Defaults to 60.
 
+pause_enabled
+ If false, disables pause/unpause operations and hides the pause toggle from
+ the visualiser.
+
+send_messages
+ When true, the scheduler is allowed to send messages to running tasks and
+ the central scheduler provides a simple prompt per task to send messages.
+ Defaults to true.
+
 
 [sendgrid]
 ----------
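The configuration.rst change above says that the config lists increase in priority and that order only matters for key conflicts, pointing to ConfigParser.read. A minimal sketch of that merge behaviour using only the standard library is shown below. It does not use luigi's own loader; `LOW`, `HIGH`, and `merge` are hypothetical names, and the `low-priority-host` value is made up for illustration.

```python
from configparser import ConfigParser

# Stand-in for a lower-priority file such as /etc/luigi/luigi.cfg.
LOW = """
[core]
scheduler_host = low-priority-host
parallel_scheduling = false
"""

# Stand-in for a higher-priority file such as ./luigi.cfg.
HIGH = """
[core]
scheduler_host = luigi-host.mycompany.foo
"""


def merge(*sources):
    """Read sources in order; on key conflicts, later sources win."""
    parser = ConfigParser()
    for text in sources:
        parser.read_string(text)
    return parser


merged = merge(LOW, HIGH)
print(merged.get("core", "scheduler_host"))
print(merged.getboolean("core", "parallel_scheduling"))
```

Keys that appear only in the lower-priority source survive the merge, which is why the ordering is relevant only when the same key is set in more than one place.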

diff --git a/doc/index.rst b/doc/index.rst
@@ -15,7 +15,7 @@ Table of Contents
  workflows.rst
  tasks.rst
  parameters.rst
- command_line.rst
+ running_luigi.rst
  central_scheduler.rst
  execution_model.rst
  luigi_patterns.rst

diff --git a/doc/luigi_patterns.rst b/doc/luigi_patterns.rst
@@ -226,6 +226,33 @@ the task parameters or other dynamic attributes:
 Since, by default, resources have a usage limit of 1, no two instances of Task A 
 will now run if they have the same `important_file_name` property.
 
+Decreasing resources of running tasks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+At scheduling time, the luigi scheduler needs to be aware of the maximum
+resource consumption a task might have once it runs. For some tasks, however,
+it can be beneficial to decrease the amount of consumed resources between two
+steps within their run method (e.g. after some heavy computation). In this
+case, a different task waiting for that particular resource can already be
+scheduled.
+
+.. code-block:: python
+
+ class A(luigi.Task):
+
+ # set maximum resources a priori
+ resources = {"some_resource": 3}
+
+ def run(self):
+ # do something
+ ...
+
+ # decrease consumption of "some_resource" by one
+ self.decrease_running_resources({"some_resource": 1})
+
+ # continue with reduced resources
+ ...
+
 Monitoring task pipelines
 ~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -290,3 +317,39 @@ built-in solutions. In the case of you're dealing with a file system
 :meth:`~luigi.target.FileSystemTarget.temporary_path`. For other targets, you
 should ensure that the way you're writing your final output directory is
 atomic.
+
+Sending messages to tasks
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The central scheduler is able to send messages to particular tasks. When a running task accepts 
+messages, it can access a `multiprocessing.Queue <https://docs.python.org/3/library/multiprocessing.html#pipes-and-queues>`__
+object storing incoming messages. You can implement custom behavior to react and respond to
+messages:
+
+.. code-block:: python
+
+ class Example(luigi.Task):
+
+ # common task setup
+ ...
+
+ # configure the task to accept all incoming messages
+ accepts_messages = True
+
+ def run(self):
+ # this example runs some loop and listens for the
+ # "terminate" message, and responds to all other messages
+ for _ in some_loop():
+ # check incoming messages
+ if not self.scheduler_messages.empty():
+ msg = self.scheduler_messages.get()
+ if msg.content == "terminate":
+ break
+ else:
+ msg.respond("unknown message")
+
+ # finalize
+ ...
+
+Messages can be sent right from the scheduler UI which also displays responses (if any). Note that
+this feature is only available when the scheduler is configured to send messages (see the :ref:`scheduler-config` config), and the task is configured to accept them.
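The message-handling loop from the "Sending messages to tasks" section can be exercised without a scheduler. The sketch below makes several assumptions: `FakeMessage` and `drain` are hypothetical names, `queue.Queue` stands in for the `multiprocessing.Queue` that the documentation describes, and only the `content` attribute and `respond` method shown in the documented example are modelled.

```python
import queue


class FakeMessage:
    """Hypothetical stand-in for a scheduler message."""

    def __init__(self, content):
        self.content = content
        self.response = None

    def respond(self, text):
        # Record the reply instead of sending it back to a scheduler.
        self.response = text


def drain(messages):
    """Process all queued messages; return True once "terminate" is seen."""
    while not messages.empty():
        msg = messages.get()
        if msg.content == "terminate":
            return True
        msg.respond("unknown message")
    return False
```

A task's run loop could call a helper like `drain(self.scheduler_messages)` once per iteration and break when it returns True, which keeps the message handling separate from the actual work.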
diff --git a/doc/parameters.rst b/doc/parameters.rst
@@ -25,7 +25,7 @@ i.e.
 .. code:: python
 
  d = DailyReport(datetime.date(2012, 5, 10))
- print d.date
+ print(d.date)
 
 will return the same date that the object was constructed with.
 Same goes if you invoke Luigi on the command line.
@@ -88,6 +88,25 @@ are not the same instance:
  >>> hash(c) == hash(d)
  True
 
+Parameter visibility
+^^^^^^^^^^^^^^^^^^^^
+
+Using :class:`~luigi.parameter.ParameterVisibility` you can configure parameter visibility. By default, all
+parameters are public, but you can also set them hidden or private.
+
+.. code:: python
+
+ >>> import luigi
+ >>> from luigi.parameter import ParameterVisibility
+ 
+ >>> luigi.Parameter(visibility=ParameterVisibility.PRIVATE)
+
+``ParameterVisibility.PUBLIC`` (default) - visible everywhere
+
+``ParameterVisibility.HIDDEN`` - ignored in the web view, but saved to the database if saving db_history is enabled
+
+``ParameterVisibility.PRIVATE`` - visible only inside task.
+
 Parameter types
 ^^^^^^^^^^^^^^^