
Data Science Bowl 2017 for lung cancer prediction with Keras


Lung Cancer Data Science Bowl 2017


Repository of the Vila del Pingui team for the Data Science Bowl 2017 (Feb 2017 to Apr 2017). The competition ($1M in prizes) was about predicting early-stage lung cancer from CT scan images. The training set had 1397 + 200 patients and the test set 500 patients. The result is an ensemble of 3 convolutional neural networks (ResNet) for feature generation and XGBoost for the final ensemble.

The team finished in 34th position out of 2000 teams (top 2%), with the best model scoring in 17th position.


Access to latest results of each team and to documentation

  1. Preprocessing and datasets (README TBD)
     1. Utils (git)
     2. Bad segmentation spreadsheet (gdocs)
     3. Preprocessed v3 (AWS): /mnt/hd2/preprocessed3
  2. DL (README)
     1. Slices: TBD
     2. Segmentation: TBD
  3. Final model (README TBD)
     1. New features: TBD
     2. XGBoost: TBD
     3. Final learner - submission: TBD
  4. Literature:
     1. Preprocessing (google drive)
     2. DL (google drive)
     3. Features (google drive)

References quick start

Basic references to understand the problem and the data:

  1. [Video] How to detect lung cancer from a physician's perspective (15 min).
  2. Notebooks (Kaggle Kernels): understanding the data set and dealing with DICOM files.
  3. Preprocessing tutorial: understanding DICOM files, pixel values, standardization, ...
  4. Exploratory data analysis: basic exploration of the given data set.
  5. [Kaggle tutorial] Code for training a CNN using the U-net network for medical image segmentation, based on the external (annotated) LUNA data set.
  6. [TensorFlow ppt] A quickstart focused on convnets, with code included. After it, you can take the official TF tutorial as sample code.


[TBD] 1 - Download the repo:

$ git clone

2 - Create a virtual environment (see virtualenvwrapper) and install the Python requirements

$ mkvirtualenv lung
(lung) $ pip install -r requirements.txt


  • The packages in requirements.txt are already installed with the python2 kernel.
  • Each user can git pull/commit/push from an ssh session or with !git commit .. from the Jupyter console. No password is requested; the user is identified by email.
  • Each user has a private ~/lung_cancer_ds_bowl directory, except for the ~/lung_cancer_ds_bowl/data folder, which is shared by everyone.
  • All users have sudo permission, so if packages need to be installed they can run !sudo pip install package from Jupyter, and the packages will then be available to everyone.

Available datasets

See docs/


The preprocessed images are stored at /mnt/hd2/preprocessed/. To open the compressed files from Python, use np.load(file)['arr_0']. There is one file per patient. Each file is a 4-dimensional numpy array: [type,slice,height,width]. The type dimension contains the preprocessed image at index 0, the lung segmentation at index 1 and, when available (LUNA dataset), the nodule segmentation at index 2. All images have a height and width of 512x512.
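
As a rough illustration of the layout described above, the following sketch loads one patient file and splits it along the type axis. The helper name load_patient and the index constants are hypothetical (they are not part of this repository); only np.load(file)['arr_0'] and the [type,slice,height,width] layout come from the description.

```python
import numpy as np

# Positions along the first ("type") axis, as described in the text.
# The constant names are illustrative, not taken from the repository.
IMAGE, LUNG_MASK, NODULE_MASK = 0, 1, 2


def load_patient(path):
    """Load one preprocessed patient file and split it by type.

    Returns (image, lung_mask, nodule_mask). nodule_mask is None when the
    file carries no nodule segmentation (patients outside the LUNA dataset).
    """
    with np.load(path) as archive:
        volume = archive["arr_0"]  # shape: [type, slice, height, width]
    if volume.ndim != 4:
        raise ValueError("expected a 4-D array, got shape %s" % (volume.shape,))
    image = volume[IMAGE]
    lung_mask = volume[LUNG_MASK]
    # The third entry only exists when a nodule segmentation is available.
    nodule_mask = volume[NODULE_MASK] if volume.shape[0] > NODULE_MASK else None
    return image, lung_mask, nodule_mask
```

Each returned array is indexed as [slice,height,width], so image[0] would be the first 2-D slice of the scan.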

General guidelines

  • The analysis files should start with the author initials.
  • Avoid storing files larger than 50 MB in Git. In particular, images from the data folder should be kept outside the Git repository.
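
One possible way to check the size guideline before committing is sketched below. The function name find_large_files is hypothetical, and the limit is assumed to mean 50 * 1024 * 1024 bytes; adjust it if a different definition of 50 MB is intended.

```python
import os

# Assumed interpretation of the 50 MB guideline.
LIMIT_BYTES = 50 * 1024 * 1024


def find_large_files(root, limit=LIMIT_BYTES):
    """Return (path, size) pairs for files under root larger than limit."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into Git's own object store.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks, e.g. links to a shared data folder
            size = os.path.getsize(path)
            if size > limit:
                large.append((path, size))
    # Largest files first.
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Running find_large_files(".") from the repository root lists candidates that probably belong outside Git.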

File structure

├──          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
├── docs               <- A default Sphinx project; see for details
├── models             <- Trained and serialized models, model predictions, or model summaries
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
├── src                <- Source code for use in this project.
│   ├──    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └──
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └──
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├──
│   │   └──
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └──
└── tox.ini            <- tox file with settings for running tox; see
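
The notebook naming convention in the tree above (a number, the creator's initials, and a `-` delimited description) could be checked with a small helper like the one below. This is only a sketch: the exact pattern, including lowercase initials and an optional .ipynb suffix, is an assumption inferred from the single example `1.0-jqp-initial-data-exploration`, and is_valid_notebook_name is a hypothetical name.

```python
import re

# number (e.g. "1.0"), initials, then a "-"-delimited description.
# The pattern is an assumption based on the one example in the tree.
NOTEBOOK_NAME = re.compile(r"^\d+(\.\d+)*-[a-z]+-[a-z0-9]+(-[a-z0-9]+)*$")


def is_valid_notebook_name(name):
    """Check a notebook file name (with or without .ipynb) against the pattern."""
    stem = name[:-len(".ipynb")] if name.endswith(".ipynb") else name
    return NOTEBOOK_NAME.match(stem) is not None
```

For instance, is_valid_notebook_name("1.0-jqp-initial-data-exploration.ipynb") is accepted, while a name without the leading number, such as "exploration.ipynb", is rejected.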


"Could not find a version that satisfies the requirement SimpleITK==0.10.0"

The solution is to manually download the egg from the official website and install it with easy_install.

"Fatal Python error: PyThreadState_Get: no current thread"

>>> import SimpleITK as sitk
"Fatal Python error: PyThreadState_Get: no current thread"

The solution is to relink the

$ otool -L ~/virtualenvs/lung/lib/python2.7/site-packages/SimpleITK/ 
	/System/Library/Frameworks/Python.framework/Versions/2.7/Python (compatibility version 2.7.0, current version 2.7.1)
	/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 635.19.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 159.1.0)
	/usr/lib/libstdc++.6.dylib (compatibility version 7.0.0, current version 52.0.0)
$ sudo install_name_tool -change /System/Library/Frameworks/Python.framework/Versions/2.7/Python ~/virtualenvs/lung/.Python ~/virtualenvs/lung/lib/python2.7/site-packages/SimpleITK/

