Skip to content
This repository has been archived by the owner on Jan 22, 2021. It is now read-only.

Factorize tasks #24

Merged
merged 24 commits into from
May 26, 2018
Merged

Factorize tasks #24

merged 24 commits into from
May 26, 2018

Conversation

delhomer
Copy link
Collaborator

This PR aims at factorizing bordeaux.py and lyon.py into a unified city.py, and as a consequence, to make the addition of a new city easier.

The whole Luigi pipeline has been reviewed in this view, from station data dowloading to machine learning algorithm execution.

Some improvements are still possible, especially regarding the downloading URLs handling. I suggest to open a further PR to address such points, if necessary.

This PR fixes issue #5 , by definition, and #14 as database table features as well as csv intermediary file features have been normalized.

delhomer added 13 commits May 7, 2018 10:32
…eline

This commit introduces a new module that factorize lyon.py and
bordeaux.py, so as to accomplish the pipeline first steps (station zip
file downloading and unzipping). For the moment, the download URLs are
hard-written as module parameters; a further module improvement may be
to design a params_factory function as in lyon.py.
This commit introduces a task that create <city>.raw_station
tables. Such tables allow to store station raw information (as
contained in downloaded resources), before any normalization effort.
This commit transform the `raw_stations` tables in `stations` tables,
for each city. This step ensures that column names are the same for
every city before to continue into the pipeline.
…ion downloading for Lyon and Bordeaux

This commit introduces a task `BikeAvailability` that merge
`VelovStationAvailability` (Lyon) and `BicycleStationAvailability`
(Bordeaux). It has a `city` arguments that allows to consider whatever
city. Some minor evolutions may be expected as we get a `json` file in
one case and a `xml` file in the other case...
This commit introduces a factorized way of converting availability
data to `csv` files. The major difference that must be handled is that
Bordeaux data are in `xml` format, whilst Lyon data are in `json`
format.
This commit moves bike availability columns in config.ini, to let
`city.py` as independant from the input data as possible. If we want
to add a new data source, it will be necessary to define the feature
name in the config file before.
… database

This commit creates a new Luigi task for factorizing bike availability
data insertion into database, for cities of Bordeaux and Lyon.
@garaud
Copy link
Owner

garaud commented May 24, 2018

Two important tasks to do after that:

  • update the tables in our db to have the same names/columns
  • update the SQL queries from the controller.py since the name of the tables and columns will be modified

Copy link
Owner

@garaud garaud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll work on these comments

- TB_STVEL_P: bicycle-station geoloc
- CI_VCUB_P: bicycle-station real-time occupation data
* Bordeaux
- stations URL:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are URLS missing in the docstring?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, absolutely. I wrote the docstring before to find the good URLs, and did not go back to it when I finally got them.

import sh

import requests

import numpy as np
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My linter said that numpy and sklearn packages are imported but not used.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're write, it comes from a wrong copy-paste process.

address=config[self.city]['feature_address'],
city=config[self.city]['feature_city'],
nb_stations=config[self.city]['feature_nb_stations'])
print(sql)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a remaining print here.


@property
def projection(self):
return config[self.city]['srid']
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you forget to update and commit the config.ini file because this new parameter srid should occur in the configuration file.


@property
def typename(self):
return config[self.city]['typename']
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parameter is missing in the configuration file.

connection = self.output().connect()
cursor = connection.cursor()
sql = self.query.format(schema=self.city,
id=config[self.city]['feature_id'],
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These parameters are not occurred in the config.ini file. I think we should udpate it.

df = pd.DataFrame(data['values'], columns=data['fields'])
else:
raise ValueError(("{} is an unknown city.".format(self.city)))
df = df[[config[self.city]['feature_avl_id'],
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another several new parameters to write in a updated config.ini file

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which parameter(s) do you mean?

('available_bike', 'INT'),
('ts', 'TIMESTAMP')]

columns = [('id', 'INT'),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we should set the same ID type as the normalize station table task, i.e. varchar.

garaud added 2 commits May 24, 2018 14:50
Take into account the last parameters used in the city tasks refactoring
* newlines between functions or classes
* remove a remaining print
* update the docstring module with some URLs
columns = [('station_id', 'INT'),
('start', 'DATE'),
('stop', 'DATE'),
('cluster_id', 'INT')]

@property
def table(self):
return '{schema}.clustered_stations'.format(schema=self.city)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the name of the cluster table should come from the configuration file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not that agree, as we have no reason to make the name city-dependent.

Do you think that it is necessary?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. We shouldn't have a tablename for each city. But I like to have the possibility to choose these table names. We could move the name to the database section.

garaud added 4 commits May 24, 2018 15:11
In this way, this column will have always the same two values: 'open' or 'close'.

* update the value of the column status before writing the data into a CSV file
* update the daily transaction SQL query
It was INT.

* In the future, we could have station ids which aren't integers
* this is consistent with the column type of the city.stations table
preprend a '0' for all hours before 10h, e.g. h1 -> h01
Copy link
Collaborator Author

@delhomer delhomer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your last commits are fine for me. I just still have two remarks:

  • Could you explain why it is necessary to keep the clustering/prediction table names into the configuration file?
  • You mention new parameters to add the configuration files: which ones? If needed let's add them and update the code accordingly before to merge.

df = pd.DataFrame(data['values'], columns=data['fields'])
else:
raise ValueError(("{} is an unknown city.".format(self.city)))
df = df[[config[self.city]['feature_avl_id'],
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which parameter(s) do you mean?

@garaud
Copy link
Owner

garaud commented May 24, 2018

The parameters were set in this commit bb58789

garaud added 2 commits May 25, 2018 11:22
Move the daily_transaction, clustering and centroid table names into the 'database'
section of the configuration file.
@garaud
Copy link
Owner

garaud commented May 25, 2018

I moved some table names specified in the configuration file into the 'database' section instead of having specific table names for each city.

…able

also update the configuration file where the SRID for Bordeaux was wrong.
@garaud
Copy link
Owner

garaud commented May 25, 2018

Note I change the way to normalize the table. The query is just more robust, specially about the projection of the Geometry type.

See 711c782

…ed table

Generalize the previous commit (on cluster or prediction tables) to
every implied tables: stations, raw_stations and timeseries were not
yet considered in this way.
@delhomer
Copy link
Collaborator Author

That's perfect for me, except the fact that in my humble opinion, we should generalize this choice to every implied table (raw_stations, stations and timeseries were not yet considered). I've just pushed a commit to fix that.

From now, if everything is OK with this last commit, we can merge the PR.

@garaud garaud merged commit 6781118 into master May 26, 2018
@garaud garaud deleted the factorize_tasks branch May 26, 2018 08:55
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants