…eline This commit introduces a new module that factorizes lyon.py and bordeaux.py in order to carry out the first pipeline steps (downloading and unzipping the station zip files). For the moment, the download URLs are hard-coded as module parameters; a further improvement may be to design a params_factory function as in lyon.py.
This commit introduces a task that creates <city>.raw_station tables. Such tables store raw station information (as contained in the downloaded resources) before any normalization effort.
This commit transforms the `raw_stations` tables into `stations` tables, for each city. This step ensures that column names are the same for every city before continuing through the pipeline.
…ion downloading for Lyon and Bordeaux This commit introduces a task `BikeAvailability` that merges `VelovStationAvailability` (Lyon) and `BicycleStationAvailability` (Bordeaux). It has a `city` argument that allows any city to be considered. Some minor changes may be expected, as we get a `json` file in one case and an `xml` file in the other...
This commit introduces a factorized way of converting availability data to `csv` files. The major difference that must be handled is that Bordeaux data are in `xml` format, whilst Lyon data are in `json` format.
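A minimal sketch of that format handling, using only the standard library. The JSON layout (`fields`/`values`) mirrors what the reviewed snippet passes to `pd.DataFrame`; the XML element names (`station` and its children) and the helper names `availability_rows` and `to_csv` are hypothetical, since the actual Bordeaux schema is not shown in this conversation.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def availability_rows(raw, fmt):
    """Return (header, rows) from a raw availability payload.

    `fmt` is 'json' (Lyon-like payload) or 'xml' (Bordeaux-like payload).
    """
    if fmt == 'json':
        data = json.loads(raw)
        return list(data['fields']), [list(row) for row in data['values']]
    if fmt == 'xml':
        root = ET.fromstring(raw)
        # Hypothetical layout: one <station> element per station,
        # with one child element per feature.
        stations = root.findall('station')
        header = [child.tag for child in stations[0]] if stations else []
        rows = [[station.findtext(name) for name in header]
                for station in stations]
        return header, rows
    raise ValueError("{} is an unknown format.".format(fmt))


def to_csv(header, rows):
    """Serialize a header and its rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```

Both branches return the same `(header, rows)` shape, so the CSV-writing step does not need to know which city the data came from.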
This commit moves the bike availability columns into config.ini, to keep `city.py` as independent from the input data as possible. To add a new data source, the feature names must first be defined in the config file.
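As an illustration, the lookup could work roughly as follows. This is only a sketch: `feature_avl_id` is a key used in `city.py`, whereas `feature_timestamp`, the placeholder values, and the helper `availability_columns` are assumptions made for the example.

```python
import configparser

# Hypothetical excerpt of config.ini; the values are placeholders,
# not the real column names of the Lyon or Bordeaux feeds.
CONFIG_TEXT = """
[lyon]
feature_avl_id = lyon_id_column
feature_timestamp = lyon_ts_column

[bordeaux]
feature_avl_id = bordeaux_id_column
feature_timestamp = bordeaux_ts_column
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)


def availability_columns(city):
    """Return the raw column names to select for `city`."""
    if city not in config:
        raise ValueError("{} is an unknown city.".format(city))
    section = config[city]
    return [section['feature_avl_id'], section['feature_timestamp']]
```

With this layout, supporting a new city only requires a new section in the configuration file, with no change to the selection code.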
… database This commit creates a new Luigi task that factorizes the insertion of bike availability data into the database, for the cities of Bordeaux and Lyon.
Two important tasks to do after that:
I think I'll work on these comments
jitenshea/tasks/city.py
Outdated
- TB_STVEL_P: bicycle-station geoloc
- CI_VCUB_P: bicycle-station real-time occupation data
* Bordeaux
- stations URL:
Are the URLs missing from the docstring?
Yes, absolutely. I wrote the docstring before finding the right URLs, and did not go back to it once I finally had them.
jitenshea/tasks/city.py
Outdated
import sh

import requests

import numpy as np
My linter says that the numpy and sklearn packages are imported but not used.
You're right, it comes from a faulty copy-paste.
jitenshea/tasks/city.py
Outdated
address=config[self.city]['feature_address'],
city=config[self.city]['feature_city'],
nb_stations=config[self.city]['feature_nb_stations'])
print(sql)
There is a leftover print here.
jitenshea/tasks/city.py
Outdated
@property
def projection(self):
    return config[self.city]['srid']
I think you forgot to update and commit the config.ini file, because this new parameter srid should appear in the configuration file.
@property
def typename(self):
    return config[self.city]['typename']
This parameter is missing in the configuration file.
jitenshea/tasks/city.py
Outdated
connection = self.output().connect()
cursor = connection.cursor()
sql = self.query.format(schema=self.city,
                        id=config[self.city]['feature_id'],
These parameters do not appear in the config.ini file. I think we should update it.
jitenshea/tasks/city.py
Outdated
    df = pd.DataFrame(data['values'], columns=data['fields'])
else:
    raise ValueError(("{} is an unknown city.".format(self.city)))
df = df[[config[self.city]['feature_avl_id'],
Several more new parameters to add to an updated config.ini file.
Which parameter(s) do you mean?
jitenshea/tasks/city.py
Outdated
('available_bike', 'INT'),
('ts', 'TIMESTAMP')]

columns = [('id', 'INT'),
I think we should set the same ID type as in the station table normalization task, i.e. varchar.
Take into account the last parameters used in the city tasks refactoring
* newlines between functions or classes
* remove a leftover print
* update the module docstring with some URLs
jitenshea/tasks/city.py
Outdated
columns = [('station_id', 'INT'),
           ('start', 'DATE'),
           ('stop', 'DATE'),
           ('cluster_id', 'INT')]

@property
def table(self):
    return '{schema}.clustered_stations'.format(schema=self.city)
The name of the cluster table should come from the configuration file.
I don't quite agree, as we have no reason to make the name city-dependent.
Do you think it is necessary?
You're right. We shouldn't have a table name for each city. But I'd like to be able to choose these table names. We could move the names to the database section.
In this way, this column will always have the same two values: 'open' or 'close'.
* update the value of the status column before writing the data into a CSV file
* update the daily transaction SQL query
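A possible shape for that status normalization, as a sketch only. The raw labels used by each feed are not given in this conversation, so the set of labels meaning "open" is passed in as a parameter; `normalize_status` and `open_labels` are hypothetical names.

```python
def normalize_status(raw_status, open_labels):
    """Map a raw status label to 'open' or 'close'.

    `open_labels` is the set of lowercase raw labels that mean "open"
    for a given city (an assumption: the real feeds may differ).
    Any other label is treated as 'close'.
    """
    if raw_status.strip().lower() in open_labels:
        return 'open'
    return 'close'
```

Applying this before the CSV is written means the downstream SQL queries only ever have to compare against two known values.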
It was INT.
* In the future, we could have station ids which aren't integers
* this is consistent with the column type of the city.stations table
Prepend a '0' to all hours before 10h, e.g. h1 -> h01
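The zero-padding can be sketched as below. The helper name `pad_hour_label` is hypothetical, and the code assumes labels made of a single-letter prefix followed by the hour, as in the `h1 -> h01` example.

```python
def pad_hour_label(label):
    """Zero-pad the hour part of a label such as 'h1' to two digits."""
    prefix, hour = label[0], int(label[1:])
    return "{}{:02d}".format(prefix, hour)
```

Padded labels also sort correctly as plain strings, whereas `h10` would otherwise come before `h2`.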
Your last commits are fine with me. I still have two remarks:
- Could you explain why it is necessary to keep the clustering/prediction table names in the configuration file?
- You mention new parameters to add to the configuration file: which ones? If needed, let's add them and update the code accordingly before merging.
jitenshea/tasks/city.py
Outdated
    df = pd.DataFrame(data['values'], columns=data['fields'])
else:
    raise ValueError(("{} is an unknown city.".format(self.city)))
df = df[[config[self.city]['feature_avl_id'],
Which parameter(s) do you mean?
The parameters were set in this commit bb58789
Move the daily_transaction, clustering and centroid table names into the 'database' section of the configuration file.
I moved some table names specified in the configuration file into the 'database' section instead of having specific table names for each city.
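For illustration, building a schema-qualified table name from a shared 'database' section might look like the sketch below. The key names, the `centroids` value and the helper `table_name` are assumptions; only `clustered_stations` and the `'{schema}.…'.format(schema=self.city)` pattern come from the reviewed code.

```python
import configparser

# Hypothetical excerpt of config.ini: one 'database' section shared
# by every city, instead of one table name per city section.
CONFIG_TEXT = """
[database]
daily_transaction = daily_transaction
clustering = clustered_stations
centroid = centroids
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)


def table_name(city, key):
    """Return '<city schema>.<configured table name>' for `key`."""
    tablename = config['database'][key]
    return '{schema}.{tablename}'.format(schema=city, tablename=tablename)
```

The city still selects the schema, while the table name itself stays configurable in a single place.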
…able also update the configuration file where the SRID for Bordeaux was wrong.
Note that I changed the way the table is normalized. The query is simply more robust, especially regarding the projection of the Geometry type. See 711c782
…ed table Generalize the previous commit (on cluster or prediction tables) to every table involved: stations, raw_stations and timeseries were not yet handled in this way.
That's perfect for me, except that, in my humble opinion, we should generalize this choice to every table involved ( From now on, if everything is OK with this last commit, we can merge the PR.
This PR aims at factorizing `bordeaux.py` and `lyon.py` into a unified `city.py` and, as a consequence, at making the addition of a new city easier.
The whole Luigi pipeline has been reviewed with this in mind, from station data downloading to machine learning algorithm execution.
Some improvements are still possible, especially regarding the handling of the download URLs. I suggest opening a further PR to address such points, if necessary.
This PR fixes issue #5, by definition, and #14, as database table features as well as `csv` intermediary file features have been normalized.