Personal mobility trajectories/traces can benefit a number of practical applications, e.g., pandemic control, transportation system management, user analysis, and product recommendation. For example, the Google COVID-19 Community Mobility Reports[1] show daily movement trends and have been used to predict the pandemic's impact on communities[2] such as travel agencies and retail enterprises. On the other hand, such mobility traces are privacy-critical to the users: they contain, or can be used to infer, highly private personal information such as home/work addresses and activity patterns. Therefore, how to effectively utilize this data with a high degree of privacy preservation, while still benefiting real-world applications, remains challenging.
In this project, we apply the Federated Learning (FL)[3] framework to the transportation mode prediction task under a privacy-preserving service-level requirement.
As a distributed deep-learning (DL) training framework, Federated Learning enables model training on users' local devices without uploading their private data, greatly enhancing privacy preservation while maintaining convergence accuracy comparable to centralized training. Applying FL to personal-mobility use scenarios therefore has three major benefits:
### (1) High Privacy-Preserving Capability: The personal mobility data stays on the users' local devices and does not need to be sent to a central server, greatly reducing the risk of personal data leakage;
### (2) Implementation Efficiency: Since no raw data is transmitted to the central server, both the communication cost and the effort of encrypting transmitted data are saved, yielding higher implementation efficiency;
### (3) Flexible User Participation: The distributed training capability of FL allows a scalable number of users to flexibly participate in the training process, contributing to and enhancing the overall application performance.
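The core FL idea above, local training plus server-side parameter averaging (FedAvg), can be illustrated with a toy one-parameter model. All names below are illustrative stand-ins; the repo's real training loop lives in main.py:

```python
# Minimal FedAvg sketch: each client trains on data that never leaves
# the client; the server only ever sees model parameters.

def local_update(w, local_data, lr=0.1):
    """Toy 'training': one gradient step of a 1-D least-squares model
    y = w * x. Stands in for the per-node SGD epochs run by main.py."""
    grad = sum(2 * (w * x - y) * x for x, y in local_data) / len(local_data)
    return w - lr * grad

def fedavg(client_weights, client_sizes):
    """Server-side aggregation: average parameters weighted by the
    number of local samples held by each client."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total

# Two simulated clients; the raw (x, y) pairs stay in these local lists.
clients = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
w_global = 0.0
for _ in range(50):  # communication rounds
    local_ws = [local_update(w_global, data) for data in clients]
    w_global = fedavg(local_ws, [len(d) for d in clients])
print(round(w_global, 2))  # converges toward the true slope 2.0
```

The weighted average is what distinguishes FedAvg from a plain mean: clients holding more samples pull the global model proportionally harder.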
The project is implemented in PyTorch and tested under the following environments:
Ubuntu 16.04
NVIDIA Driver == 440.64
CUDA == 10.2
PyTorch == 1.5.0
torchvision == 0.6.0
Scikit-learn == 0.23.2
Tensorboard is recommended but not required to visualize the training log.
- Geolife Dataset: Raw Data
- Preprocessed trajectory dataset (in numpy format)

After downloading the preprocessed data, place images.npy & labels.npy into the \data folder.
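As a quick sanity check after placing the files, the two arrays can be loaded with NumPy and checked for consistency. The shapes below are synthetic stand-ins written by the script itself, since the actual preprocessed format is not documented here; the real downloaded arrays will differ:

```python
import os
import numpy as np

os.makedirs("data", exist_ok=True)
# Synthetic stand-ins with assumed shapes (trajectory "images" plus one
# transportation-mode label per sample); the real files replace these.
np.save("data/images.npy", np.zeros((8, 1, 28, 28), dtype=np.float32))
np.save("data/labels.npy", np.arange(8) % 4)

images = np.load("data/images.npy")
labels = np.load("data/labels.npy")
assert len(images) == len(labels), "each trajectory image needs a label"
print(images.shape, np.unique(labels))
```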
Baseline: Centralized training.
In this case, all training data is sent to and stored on the central node, which conducts centralized training.
python main.py --lr 0.1 --node 1
Federated Training: Simulate federated training.
In this case, the training data is split across 2, 4, or 8 nodes, which conduct federated learning with FedAvg.
python main.py --lr 0.1 --nodes 2 --bs 32 # Fed Learning with 2 nodes.
python main.py --lr 0.1 --nodes 4 --bs 32 # Fed Learning with 4 nodes.
python main.py --lr 0.1 --nodes 8 --bs 32 # Fed Learning with 8 nodes.
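A hedged sketch of what the --nodes simulation plausibly does under the hood: shard the sample indices across nodes, then merge the per-node models with a FedAvg weighted average over their parameter dicts. The function names here are illustrative, not the actual main.py API, and plain NumPy arrays stand in for PyTorch state_dict tensors:

```python
import numpy as np

def split_nodes(n_samples, nodes, seed=0):
    """Evenly shard sample indices across the simulated nodes."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(idx, nodes)

def average_states(states, sizes):
    """FedAvg merge: per-parameter average of the node models,
    weighted by how many samples each node trained on."""
    total = sum(sizes)
    return {k: sum(s[k] * n for s, n in zip(states, sizes)) / total
            for k in states[0]}

shards = split_nodes(100, 4)
# Four dummy node models whose single parameter "w" is 0, 1, 2, 3:
states = [{"w": np.full(3, float(i))} for i in range(4)]
merged = average_states(states, [len(s) for s in shards])
print(merged["w"])  # equal 25-sample shards -> plain mean 1.5
```

With PyTorch models, the same averaging would be applied key-by-key to the tensors returned by `model.state_dict()`.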
Evaluation: Evaluate trained models. By default, trained models are saved in the \checkpoint folder.
Run the following commands to evaluate the model performance.
python eval.py --model ckpt_1 # model name.
python eval.py --model ckpt_2
python eval.py --model ckpt_4
python eval.py --model ckpt_8
Released Models: We have released our centralized and FL-trained models in the \checkpoint folder, including 4 models:
ckpt_1node_67.52.pth, corresponding to the centralized training model;
ckpt_2node_64.46.pth, corresponding to FL training with 2 clients;
ckpt_4node_66.88.pth, corresponding to FL training with 4 clients;
ckpt_8node_67.13.pth, corresponding to FL training with 8 clients.
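The number in each checkpoint name appears to be that model's test accuracy in percent. A minimal sketch of how such a figure is computed from predictions, using hypothetical labels rather than eval.py's actual outputs:

```python
import numpy as np

# Hypothetical predicted vs. ground-truth transportation-mode labels:
preds = np.array([0, 1, 2, 2, 1, 0, 3, 3])
truth = np.array([0, 1, 1, 2, 1, 0, 3, 2])

accuracy = 100.0 * (preds == truth).mean()  # fraction correct, in %
print(f"{accuracy:.2f}")  # 75.00
```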
If you have any questions, please reach out to the author (email: fyu2@gmu.edu).