Code for the paper: "FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence" by Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel.
This is not an officially supported Google product.
Important: `ML_DATA` is a shell environment variable that should point to the location where the datasets are installed. See the Install datasets section for more details.
Install dependencies:

sudo apt install python3-dev python3-virtualenv python3-tk imagemagick
virtualenv -p python3 --system-site-packages env3
. env3/bin/activate
pip install -r requirements.txt
export ML_DATA="path to where you want the datasets saved"
export PYTHONPATH=$PYTHONPATH:"path to the FixMatch"
# Download datasets
CUDA_VISIBLE_DEVICES= ./scripts/create_datasets.py
cp $ML_DATA/svhn-test.tfrecord $ML_DATA/svhn_noextra-test.tfrecord
# Create unlabeled datasets
CUDA_VISIBLE_DEVICES= scripts/create_unlabeled.py $ML_DATA/SSL2/svhn $ML_DATA/svhn-train.tfrecord $ML_DATA/svhn-extra.tfrecord &
CUDA_VISIBLE_DEVICES= scripts/create_unlabeled.py $ML_DATA/SSL2/svhn_noextra $ML_DATA/svhn-train.tfrecord &
CUDA_VISIBLE_DEVICES= scripts/create_unlabeled.py $ML_DATA/SSL2/cifar10 $ML_DATA/cifar10-train.tfrecord &
CUDA_VISIBLE_DEVICES= scripts/create_unlabeled.py $ML_DATA/SSL2/cifar100 $ML_DATA/cifar100-train.tfrecord &
CUDA_VISIBLE_DEVICES= scripts/create_unlabeled.py $ML_DATA/SSL2/stl10 $ML_DATA/stl10-train.tfrecord $ML_DATA/stl10-unlabeled.tfrecord &
wait
# Create semi-supervised subsets
for seed in 0 1 2 3 4 5; do
for size in 10 20 30 40 100 250 1000 4000; do
CUDA_VISIBLE_DEVICES= scripts/create_split.py --seed=$seed --size=$size $ML_DATA/SSL2/svhn $ML_DATA/svhn-train.tfrecord $ML_DATA/svhn-extra.tfrecord &
CUDA_VISIBLE_DEVICES= scripts/create_split.py --seed=$seed --size=$size $ML_DATA/SSL2/svhn_noextra $ML_DATA/svhn-train.tfrecord &
CUDA_VISIBLE_DEVICES= scripts/create_split.py --seed=$seed --size=$size $ML_DATA/SSL2/cifar10 $ML_DATA/cifar10-train.tfrecord &
done
for size in 400 1000 2500 10000; do
CUDA_VISIBLE_DEVICES= scripts/create_split.py --seed=$seed --size=$size $ML_DATA/SSL2/cifar100 $ML_DATA/cifar100-train.tfrecord &
done
CUDA_VISIBLE_DEVICES= scripts/create_split.py --seed=$seed --size=1000 $ML_DATA/SSL2/stl10 $ML_DATA/stl10-train.tfrecord $ML_DATA/stl10-unlabeled.tfrecord &
wait
done
CUDA_VISIBLE_DEVICES= scripts/create_split.py --seed=1 --size=5000 $ML_DATA/SSL2/stl10 $ML_DATA/stl10-train.tfrecord $ML_DATA/stl10-unlabeled.tfrecord
The codebase for the ImageNet experiments is located in the `imagenet` subdirectory.
All commands must be run from the project root. The following environment variables must be defined:
export ML_DATA="path to where you want the datasets saved"
export PYTHONPATH=$PYTHONPATH:.
For example, to train FixMatch with 32 filters on cifar10 shuffled with `seed=3`, 40 labeled samples and 1 validation sample:
CUDA_VISIBLE_DEVICES=0 python fixmatch.py --filters=32 --dataset=cifar10.3@40-1 --train_dir ./experiments/fixmatch
Available labeled sizes are 10, 20, 30, 40, 100, 250, 1000, 4000. For validation, available sizes are 1 and 5000. Possible shuffling seeds are 1, 2, 3, 4, 5, and 0 for no shuffling (0 is not used in practice since the data needs to be shuffled for gradient descent to work properly).
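As a minimal sketch (not part of the repo), the value passed to `--dataset` is just these pieces joined as <name>.<seed>@<labeled size>-<validation size>:

# Minimal sketch (not part of the repo): compose the string passed to --dataset.
def dataset_name(name, seed, size, valid):
    return f"{name}.{seed}@{size}-{valid}"

print(dataset_name("cifar10", 3, 40, 1))  # -> cifar10.3@40-1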
Just expose more GPUs and FixMatch automatically scales to them; here we assign GPUs 4-7 to the program:
CUDA_VISIBLE_DEVICES=4,5,6,7 python fixmatch.py --filters=32 --dataset=cifar10.3@40-1 --train_dir ./experiments/fixmatch
To list the available flags:

python fixmatch.py --help
# The following option might be too slow to be really practical.
# python fixmatch.py --helpfull
# So instead I use this hack to find the flags:
fgrep -R flags.DEFINE libml fixmatch.py
The `--augment` flag can use a little more explanation. It is composed of 3 values, for example `d.d.d` (`d` = default augmentation, for example shift/mirror; `x` = identity, i.e. no augmentation; `ra` = rand-augment; `rac` = rand-augment + cutout); a short sketch illustrating the triple follows the list:
- the first `d` refers to the data augmentation applied to the labeled example.
- the second `d` refers to the data augmentation applied to the weakly augmented unlabeled example.
- the third `d` refers to the data augmentation applied to the strongly augmented unlabeled example. For the strong augmentation, `d` is followed by `CTAugment` for `fixmatch.py` and the code inside the `cta/` folder.
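For illustration, a hypothetical sketch (not from the codebase) of how such a triple decomposes:

# Hypothetical sketch: split an --augment value such as "d.d.rac" into the
# three slots described above.
labeled_aug, weak_aug, strong_aug = "d.d.rac".split(".")
assert labeled_aug == "d"   # default shift/mirror on the labeled example
assert weak_aug == "d"      # default augmentation as the weak view
assert strong_aug == "rac"  # rand-augment + cutout as the strong view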
These loops enumerate all valid dataset names:

for dataset in cifar10 svhn svhn_noextra; do
for seed in 0 1 2 3 4 5; do
for valid in 1 5000; do
for size in 10 20 30 40 100 250 1000 4000; do
echo "${dataset}.${seed}@${size}-${valid}"
done; done; done; done
for seed in 1 2 3 4 5; do
for valid in 1 5000; do
echo "cifar100.${seed}@10000-${valid}"
done; done
for seed in 1 2 3 4 5; do
for valid in 1 5000; do
echo "stl10.${seed}@1000-${valid}"
done; done
echo "stl10.1@5000-1"
You can point TensorBoard to the training folder (by default `--train_dir=./experiments`) to monitor the training process:
tensorboard.sh --port 6007 --logdir ./experiments
In the paper we report the median accuracy of the last 20 checkpoints; this is computed with the following command:
# Following the previous example in which we trained cifar10.3@40-1, extract the accuracy:
./scripts/extract_accuracy.py ./experiments/fixmatch/cifar10.d.d.d.3@40-1/CTAugment_depth2_th0.80_decay0.990/FixMatch_archresnet_batch64_confidence0.95_filters32_lr0.03_nclass10_repeat4_scales3_uratio7_wd0.0005_wu1.0/
# The command above will create a stats/accuracy.json file in the model folder.
# The format is JSON so you can either see its content as a text file or process it to your liking.
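If you want to post-process that file yourself, here is a hypothetical sketch (the JSON schema is an assumption, so inspect your accuracy.json first):

import json
import statistics

# Hypothetical sketch: assumes accuracy.json maps checkpoint steps to test
# accuracies; the real schema may differ.
with open('stats/accuracy.json') as f:
    accuracy = json.load(f)

# Median over the last 20 checkpoints, matching the paper's reporting.
last_20 = sorted(accuracy.items(), key=lambda kv: int(kv[0]))[-20:]
print('median accuracy:', statistics.median(value for _, value in last_20))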
You can add custom datasets into the codebase by taking the following steps:
- Add a function to acquire the dataset to `scripts/create_datasets.py`, similar to the present ones, e.g. `_load_cifar10`. You need to call `_encode_png` to create encoded strings from the original images. The created function should return a dictionary of the format `{'train': {'images': <encoded 4D NHWC>, 'labels': <1D int array>}, 'test': {'images': <encoded 4D NHWC>, 'labels': <1D int array>}}` (a hedged sketch of such a loader follows this list).
- Add the dataset to the `CONFIGS` variable in `scripts/create_datasets.py` with the previous function as loader. You can now run the `create_datasets` script to obtain a tf record for it.
- Use the `create_unlabeled` and `create_split` scripts to create the unlabeled and differently split tf records, as above in the Install datasets section.
- In `libml/data.py`, add your dataset in the `create_datasets` function. The specified "label" for the dataset has to match the created splits for your dataset. You will need to specify the corresponding variables if your dataset has a different number of classes than 10, or a different resolution and number of channels than 32x32x3.
- In `libml/augment.py`, add your dataset to the `DEFAULT_AUGMENT` variable. Primitives "s", "m", "ms" represent mirror, shift and mirror+shift.
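As promised above, a hedged sketch of such a loader (`_load_mydataset` and the random data are placeholders; `_encode_png` is the helper already present in `scripts/create_datasets.py`, so this is meant to live inside that script):

import numpy as np

# Hypothetical loader sketch for scripts/create_datasets.py; the dataset here
# is fake random data and _encode_png is the helper defined in that script.
def _load_mydataset():
    def fake_split(n):
        images = np.random.randint(0, 256, size=(n, 32, 32, 3), dtype=np.uint8)
        labels = np.random.randint(0, 10, size=(n,), dtype=np.int64)
        return {'images': _encode_png(images), 'labels': labels}

    return {'train': fake_split(50000), 'test': fake_split(10000)}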
To cite this work:

@article{sohn2020fixmatch,
title={FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence},
author={Kihyuk Sohn and David Berthelot and Chun-Liang Li and Zizhao Zhang and Nicholas Carlini and Ekin D. Cubuk and Alex Kurakin and Han Zhang and Colin Raffel},
journal={arXiv preprint arXiv:2001.07685},
year={2020},
}