Data science is a field whose purpose is to extract knowledge from large-scale data. It draws on techniques from several domains, such as data mining, machine learning, artificial intelligence, visualization, and optimization. These techniques are adapted to large-scale datasets through parallel data processing, distributed systems, and suitable databases.
These techniques are applied in various domains such as:
- Computer security: spam filtering, network monitoring, anomaly detection, intrusion detection, etc.
- Social network analysis: community detection, trend analysis and prediction, etc.
- Marketing: targeted advertising, recommender systems, etc.
- Epidemiology and public health: determining risk factors, drug response prediction, etc.
Using the Python programming language, this course addresses the following topics:
- Data acquisition, visualisation, and analysis
- Machine learning: supervised learning (classification, regression), unsupervised learning (clustering, decomposition)
- Network analysis: PageRank, mining social-network graphs
- Recommender systems
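
To give a flavour of the network analysis topic listed above, here is a minimal sketch of PageRank computed by power iteration in plain Python. It is an illustration only, not course material: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the four-page toy graph are all assumptions chosen for the example.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    `graph` maps each node to the list of nodes it links to.
    Every link target must also appear as a key of `graph`.
    """
    nodes = list(graph)
    n = len(nodes)
    # Start from a uniform distribution over all nodes.
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Each node first receives its "teleport" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                # Split this node's rank evenly among its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web graph, used only for illustration.
toy_graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
scores = pagerank(toy_graph)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph, page `c` collects links from three other pages and ends up with the highest score, while `d`, which nobody links to, keeps only its teleport share. For real datasets a library implementation working on sparse matrices would be the more practical choice.
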
By the end of the course, students should be able to:
- Understand key algorithms and techniques of data science
- Implement these techniques in Python
- Understand their limitations
- Select appropriate techniques for a particular problem
- Apply these techniques to model and analyse large-scale datasets
Lecture and lab materials are based on the following resources:
- Boston University CS591 "Tools and Techniques for Data Mining and Applications" course
- Mining of Massive Datasets, by Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman, Cambridge University Press, 2014
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, Springer, 2009
- Data Science from Scratch, by Joel Grus, O'Reilly, 2015
- Data Science and Prediction, by Vasant Dhar, Communications of the ACM, Vol. 56, No. 12, December 2013
- https://cloud.google.com/bigquery/public-data/