Factorize tasks #24

delhomer · 2018-05-18T17:29:08Z

This PR factorizes bordeaux.py and lyon.py into a unified city.py, which makes adding a new city easier.

The whole Luigi pipeline has been reviewed with this in mind, from station data downloading to machine learning algorithm execution.

Some improvements are still possible, especially regarding the handling of the download URLs. I suggest opening a further PR to address such points, if necessary.

This PR fixes issue #5, by definition, and #14, since the features of the database tables and of the intermediary csv files have been normalized.

…eline This commit introduces a new module that factorizes lyon.py and bordeaux.py, so as to accomplish the first pipeline steps (downloading and unzipping the station zip file). For the moment, the download URLs are hard-coded as module parameters; a further improvement may be to design a params_factory function as in lyon.py.
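
As a rough illustration of what "hard-coded as module parameters" can look like, here is a minimal sketch. The `STATION_URLS` dictionary, the `station_url` helper and the placeholder URLs are all hypothetical; they are not the names or endpoints used in city.py.

```python
# Hypothetical per-city download parameters. The URLs are placeholders,
# not the real open-data endpoints.
STATION_URLS = {
    "bordeaux": "https://example.org/bordeaux/stations.zip",
    "lyon": "https://example.org/lyon/stations.zip",
}


def station_url(city):
    """Return the station archive URL for `city`, or fail loudly."""
    if city not in STATION_URLS:
        raise ValueError("{} is an unknown city.".format(city))
    return STATION_URLS[city]
```

With such a layout, adding a city would mostly amount to adding one entry to the dictionary.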

…ion downloading for Lyon and Bordeaux This commit introduces a task `BikeAvailability` that merges `VelovStationAvailability` (Lyon) and `BicycleStationAvailability` (Bordeaux). It has a `city` argument that lets it handle any supported city. Some minor changes may still be needed, as we get a `json` file in one case and an `xml` file in the other.

This commit moves the bike availability column names into config.ini, to keep `city.py` as independent of the input data as possible. To add a new data source, the feature names must first be defined in the config file.
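
A minimal sketch of the idea, using only the standard library. Only the `feature_avl_id` key appears in the reviewed code; the `feature_avl_bikes` key, the sample values and the `normalize` helper are assumptions made for illustration.

```python
import configparser

# Hypothetical excerpt of config.ini: apart from 'feature_avl_id',
# the key names and all the values are made up.
SAMPLE = """
[lyon]
feature_avl_id = number
feature_avl_bikes = available_bikes

[bordeaux]
feature_avl_id = ident
feature_avl_bikes = nbvelos
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)


def normalize(record, city):
    """Map a raw availability record to city-independent column names."""
    section = config[city]
    return {
        "id": record[section["feature_avl_id"]],
        "available_bikes": record[section["feature_avl_bikes"]],
    }
```

The code never mentions a source-specific column name: supporting a new data source only requires a new section in the configuration file.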

garaud · 2018-05-24T10:34:25Z

Two important tasks to do after that:

- update the tables in our db to have the same names/columns
- update the SQL queries in controller.py, since the table and column names will change

garaud

I think I'll work on these comments

garaud · 2018-05-24T09:53:07Z

jitenshea/tasks/city.py

-  - TB_STVEL_P: bicycle-station geoloc
-  - CI_VCUB_P: bicycle-station real-time occupation data
+* Bordeaux
+  - stations URL:


Are the URLs missing from the docstring?

Yes, absolutely. I wrote the docstring before finding the right URLs, and did not go back to it when I finally got them.

garaud · 2018-05-24T09:53:48Z

jitenshea/tasks/city.py

-import sh
-
-import requests
-
 import numpy as np


My linter says that the numpy and sklearn packages are imported but not used.

You're right, it comes from a bad copy-paste.

garaud · 2018-05-24T09:54:51Z

jitenshea/tasks/city.py

+                                address=config[self.city]['feature_address'],
+                                city=config[self.city]['feature_city'],
+                                nb_stations=config[self.city]['feature_nb_stations'])
+        print(sql)


There is a leftover print here.

garaud · 2018-05-24T09:56:45Z

jitenshea/tasks/city.py

+
+    @property
+    def projection(self):
+        return config[self.city]['srid']


I think you forgot to update and commit the config.ini file: this new parameter srid should appear in the configuration file.

garaud · 2018-05-24T09:59:23Z

jitenshea/tasks/city.py

+
+    @property
+    def typename(self):
+        return config[self.city]['typename']


This parameter is missing in the configuration file.

garaud · 2018-05-24T10:12:26Z

jitenshea/tasks/city.py

+        connection = self.output().connect()
+        cursor = connection.cursor()
+        sql = self.query.format(schema=self.city,
+                                id=config[self.city]['feature_id'],


These parameters do not appear in the config.ini file. I think we should update it.
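
One possible way to catch this kind of omission early is a small check run against the configuration before the pipeline starts. The sketch below is only an illustration: the key names are the ones read through `config[self.city][...]` in the snippets reviewed here, while the `missing_keys` helper itself is hypothetical.

```python
import configparser

# Keys read through config[self.city][...] in the reviewed snippets.
REQUIRED_KEYS = ("srid", "typename", "feature_id", "feature_address",
                 "feature_city", "feature_nb_stations", "feature_avl_id")


def missing_keys(config, city, required=REQUIRED_KEYS):
    """Return the required keys that the `city` section does not define."""
    if not config.has_section(city):
        return list(required)
    return [key for key in required if not config.has_option(city, key)]
```

Calling `missing_keys(config, "bordeaux")` on a parsed `ConfigParser` would then list whatever still has to be added to config.ini for that city.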

garaud · 2018-05-24T10:18:13Z

jitenshea/tasks/city.py

+                df = pd.DataFrame(data['values'], columns=data['fields'])
+            else:
+                raise ValueError(("{} is an unknown city.".format(self.city)))
+        df = df[[config[self.city]['feature_avl_id'],


Several more new parameters to add to an updated config.ini file.

Which parameter(s) do you mean?

garaud · 2018-05-24T10:27:52Z

jitenshea/tasks/city.py

-               ('available_bike', 'INT'),
-               ('ts', 'TIMESTAMP')]
+
+    columns = [('id', 'INT'),


I think we should use the same ID type as in the station normalization task, i.e. varchar.

garaud · 2018-05-24T13:08:53Z

jitenshea/tasks/city.py

    columns = [('station_id', 'INT'),
               ('start', 'DATE'),
               ('stop', 'DATE'),
               ('cluster_id', 'INT')]

+    @property
+    def table(self):
+        return '{schema}.clustered_stations'.format(schema=self.city)


The name of the cluster table should come from the configuration file.

I don't quite agree, as we have no reason to make the name city-dependent.

Do you think that it is necessary?

You're right. We shouldn't have a table name for each city. But I'd like to be able to choose these table names. We could move the names to the database section.

delhomer

Your last commits look fine to me. I just have two remaining remarks:

- Could you explain why it is necessary to keep the clustering/prediction table names in the configuration file?
- You mention new parameters to add to the configuration file: which ones? If needed, let's add them and update the code accordingly before merging.

delhomer · 2018-05-24T16:03:53Z

jitenshea/tasks/city.py

+                df = pd.DataFrame(data['values'], columns=data['fields'])
+            else:
+                raise ValueError(("{} is an unknown city.".format(self.city)))
+        df = df[[config[self.city]['feature_avl_id'],


Which parameter(s) do you mean?

garaud · 2018-05-24T18:40:10Z

The parameters were set in this commit bb58789

garaud · 2018-05-25T09:25:14Z

I moved some table names specified in the configuration file into the 'database' section instead of having specific table names for each city.
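
To make the idea concrete, here is a minimal sketch of a city-independent table name read from a `database` section. The option names and the `table` helper are hypothetical; only the `clustered_stations` name and the `'{schema}.<table>'` pattern come from the reviewed code.

```python
import configparser

# Hypothetical config.ini excerpt: option names and values are illustrative.
SAMPLE = """
[database]
clustering = clustered_stations
centroids = centroids
daily_transaction = daily_transaction
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)


def table(city, kind):
    """Build '<schema>.<table>': schema from the city, name from [database]."""
    return "{schema}.{tablename}".format(schema=city,
                                         tablename=config["database"][kind])
```

The schema still depends on the city, but the table name is chosen once for all cities.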

garaud · 2018-05-25T12:38:03Z

Note that I changed the way the table is normalized. The query is now more robust, especially regarding the projection of the Geometry type.

See 711c782

delhomer · 2018-05-25T16:31:06Z

That's perfect for me, except that, in my humble opinion, we should generalize this choice to every table involved (raw_stations, stations and timeseries were not yet covered). I've just pushed a commit to fix that.

If everything is OK with this last commit, we can merge the PR.

delhomer added 13 commits May 7, 2018 10:32

tasks: create a new common task to create a raw_station table

703fed8

This commit introduces a task that creates <city>.raw_station tables. Such tables store the raw station information (as contained in the downloaded resources), before any normalization effort.

tasks: add a new task to normalize the station features

4c0d5f5

This commit transforms the `raw_stations` tables into `stations` tables, for each city. This step ensures that column names are the same for every city before continuing through the pipeline.

tasks: create a task that transform bike availability data to csv format

68e851a

This commit introduces a factorized way of converting availability data to `csv` files. The major difference that must be handled is that Bordeaux data are in `xml` format, whilst Lyon data are in `json` format.
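
The format difference can be handled by dispatching on the city before writing the CSV. The sketch below is an illustration under assumptions, not the code of this commit: the JSON layout (`fields` / `values`) follows the snippet reviewed in this PR, whereas the XML layout (one `<station>` element per row, one child per field) and all function names are made up.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def parse_json(text):
    """JSON payload shaped as {'fields': [...], 'values': [[...], ...]}."""
    data = json.loads(text)
    return [dict(zip(data["fields"], row)) for row in data["values"]]


def parse_xml(text):
    """XML payload (assumed layout): one <station> element per row,
    one child element per field."""
    root = ET.fromstring(text)
    return [{child.tag: child.text for child in station}
            for station in root.iter("station")]


PARSERS = {"lyon": parse_json, "bordeaux": parse_xml}


def to_csv(text, city, columns):
    """Convert a raw availability payload to CSV text with `columns`."""
    if city not in PARSERS:
        raise ValueError("{} is an unknown city.".format(city))
    rows = PARSERS[city](text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Everything after the parsing step is shared, so a third format would only need one more entry in `PARSERS`.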

tasks : create a factorized task for adding bike availability data in…

012e2d4

… database This commit creates a new Luigi task that factorizes the insertion of bike availability data into the database, for the cities of Bordeaux and Lyon.

tasks : create factorized tasks for transaction handling

e7c093d

tasks: fix typos

f13faee

tasks: simplify the references to city

c43e40f

tasks: factorize clustering and XGBoost training related tasks fix #5

ea901e0

tasks: delete obsolete task modules

f6d3836

tasks: factorize url when downloading bike availability data

40fff3b

delhomer requested a review from garaud May 18, 2018 17:29

garaud mentioned this pull request May 24, 2018

improve the documentation of the configuration file #25

Open

garaud approved these changes May 24, 2018

garaud added 2 commits May 24, 2018 14:50

update the config.ini.sample file

bb58789

Take into account the last parameters used in the city tasks refactoring

tasks: some clean-up for the city.py module

51904d3

* newlines between functions or classes
* remove a remaining print
* update the module docstring with some URLs

garaud suggested changes May 24, 2018

garaud added 4 commits May 24, 2018 15:11

tasks: the name for the clustring should come from the config file

d426ade

tasks: normalize the values for the 'timeseries.status' column

127fdcc

In this way, this column will always hold one of the same two values: 'open' or 'close'.

* update the value of the status column before writing the data into a CSV file
* update the daily transaction SQL query
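
A minimal sketch of such a normalization step, applied before the CSV is written. The raw labels in `OPEN_LABELS` are assumptions (the real feeds may use other values), and mapping every unrecognized label to 'close' is a design choice of this sketch, not necessarily of the commit.

```python
# Assumed raw labels meaning "the station is in service".
# The real data sources may use different values.
OPEN_LABELS = {"open", "opened", "connectee", "1"}


def normalize_status(raw):
    """Map a raw status label to 'open' or 'close'."""
    return "open" if str(raw).strip().lower() in OPEN_LABELS else "close"
```

Downstream SQL queries can then filter on a single, predictable pair of values.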

tasks: set the station_id type to VARCHAR

e6e988b

It was INT.

* In the future, we could have station ids which aren't integers
* this is consistent with the column type of the city.stations table

tasks: rename some column names for the centroid table

a3bfd08

Prepend a '0' to all hours before 10h, e.g. h1 -> h01
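
The renaming can be sketched with a zero-padded format string. The `hour_column` helper and the assumption that hours run from 0 to 23 are illustrative only.

```python
def hour_column(hour):
    """Return the centroid column name for an hour of the day, zero-padded."""
    return "h{:02d}".format(hour)


# Assumption: one column per hour of the day, from 0 to 23.
columns = [hour_column(h) for h in range(24)]
```

One side effect of the padding is that sorting the column names alphabetically also sorts them chronologically, which is not the case with `h1`, `h10`, `h2`.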

delhomer commented May 24, 2018

garaud added 2 commits May 25, 2018 11:22

tasks: name of some tables do not depend on the city name

f6c67eb

Move the daily_transaction, clustering and centroid table names into the 'database' section of the configuration file.

update the configuration file example

29db89b

garaud mentioned this pull request May 25, 2018

Update the Web API after the data/table/tasks refactoring #26

Merged

tasks: fix the SRID projection problem when normalizing the station t…

711c782

…able also update the configuration file where the SRID for Bordeaux was wrong.

tasks: extend the tablename definitions in config.ini for every impli…

6a5e837

…ed table Generalize the previous commit (on cluster or prediction tables) to every table involved: stations, raw_stations and timeseries were not yet handled in this way.

tasks: select data for clustering only when the status is open

ec1b881

garaud approved these changes May 26, 2018

garaud merged commit 6781118 into master May 26, 2018

garaud deleted the factorize_tasks branch May 26, 2018 08:55
