Image_Deepfake_Detection

Implementation and improvement of a research paper that proposes a new theory for image deepfake detection with high generalization, covering Generative Adversarial Networks (GANs), diffusion models, and any other new generative techniques.

Approach 1: Learn the low level cues left behind by an Image Generator

It is believed that image generators leave behind perceptual cues (colors, textures, brightness, and other low-level visual artifacts). This approach tries to identify those generator-specific low-level cues by training on diverse images. It uses pretrained ResNet-50 and ResNet-18 backbones, each extended with a new dense layer for binary classification of real versus fake images.

ResNet-50 is implemented in Keras and ResNet-18 in PyTorch. Image augmentation is used in the ResNet-50 implementation. Both models are trained on the ImageNet dataset. Key implementation details: a batch size of 128, a learning rate of 0.00001, Stochastic Gradient Descent (SGD) as the optimizer with momentum set to 0.9, cross-entropy loss, and images resized to 256 x 256. Training runs for 80 epochs.

Flaws with this approach:

Accuracy on the training dataset was decent, but the classifier failed to classify accurately when given an image produced by a different image generation technique.
Training the model takes too much time, and the result is still not entirely accurate.
There was a sink in the learned parameters: real images were classified well, but a fake image from a method not used in training was treated as real. In effect, any image not generated by the method used during training is classified as real.

Approach 2: Use a generalized backbone for feature maps and then use Classifiers

Approach 1 showed that training a neural network on a single generation technique makes it work only for that technique. So instead of training a network for the specific task of separating real from fake for one generator, this approach uses a pretrained backbone to extract feature maps and then applies classifiers such as linear probes, KNN, SVMs, etc.

Section 1: Implementation details:

For this purpose I tried,
I. Datasets:

Laion vs LDM100 (1000 images each = 2000 Total)
ImageNet vs LDM200 (1000 images each = 2000 Total)
bigGAN Real vs bigGAN Fake (2000 images each = 4000 Total)

II. Transformations on image before training:

Transform 0: No Change
Transform 1: Adding Gaussian Blur
Transform 2: Adding Jitter
Transform 3: Adding both Gaussian Blur and Jitter
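The four transform settings could look roughly like the NumPy sketch below. The README does not say what "Jitter" means or which blur parameters were used, so this assumes a random brightness scaling for jitter and a separable Gaussian blur with `sigma=1.0`; all function names here are hypothetical.

```python
import numpy as np


def gaussian_kernel(sigma, radius):
    """1-D Gaussian kernel normalized to sum to 1."""
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_blur(image, sigma=1.0):
    """Separable Gaussian blur on an (H, W, C) float image in [0, 1]."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    # Reflect-pad so the output keeps the input height and width.
    padded = np.pad(image, ((radius, radius), (radius, radius), (0, 0)), mode="reflect")
    blur_1d = lambda v: np.convolve(v, kernel, mode="valid")
    blurred = np.apply_along_axis(blur_1d, 0, padded)   # along height
    blurred = np.apply_along_axis(blur_1d, 1, blurred)  # along width
    return blurred


def brightness_jitter(image, rng, strength=0.2):
    """Assumed form of 'jitter': scale brightness by a random factor."""
    factor = 1.0 + rng.uniform(-strength, strength)
    return np.clip(image * factor, 0.0, 1.0)


def apply_transform(image, transform_id, rng):
    """Transform 0: none, 1: blur, 2: jitter, 3: blur and jitter."""
    if transform_id in (1, 3):
        image = gaussian_blur(image)
    if transform_id in (2, 3):
        image = brightness_jitter(image, rng)
    return image


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # stand-in for a real image
outputs = {t: apply_transform(image, t, rng) for t in range(4)}
```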

III. Backbone Models:

DINO ViT-B/16
DINO ResNet-50
CLIP ViT-B/16

IV. As for the classifiers I used,

Bagging Classifier
Decision Tree
Random Forest
Linear Discriminant Analysis
Quadratic Discriminant Analysis
KNN (1 neighbour)
KNN (3 neighbours)
KNN (5 neighbours)
Linear Probe
Support Vector Machine
Gradient Boosting
Naive Bayes
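To make the "classifier on top of frozen features" idea concrete, here is a small NumPy sketch of the KNN variants (1, 3 and 5 neighbours). It is an illustration only: the features are synthetic Gaussian clusters standing in for backbone embeddings, and `knn_predict` is a hypothetical helper rather than the code used in this repository.

```python
import numpy as np


def knn_predict(train_x, train_y, test_x, k):
    """Majority vote among the k nearest training features (Euclidean)."""
    # Pairwise squared distances, shape (n_test, n_train).
    dists = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = train_y[nearest]  # (n_test, k) array of 0/1 labels
    # With odd k and binary labels there are no ties.
    return (votes.mean(axis=1) > 0.5).astype(int)


# Synthetic stand-ins for backbone features: 0 = real, 1 = fake.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(100, 16))
fake = rng.normal(3.0, 1.0, size=(100, 16))

train_x = np.concatenate([real[:80], fake[:80]])
train_y = np.concatenate([np.zeros(80, dtype=int), np.ones(80, dtype=int)])
test_x = np.concatenate([real[80:], fake[80:]])
test_y = np.concatenate([np.zeros(20, dtype=int), np.ones(20, dtype=int)])

accuracies = {
    k: float((knn_predict(train_x, train_y, test_x, k) == test_y).mean())
    for k in (1, 3, 5)
}
```

The other classifiers in the list plug into the same place: they take the feature matrix and labels, and never touch the backbone weights.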

V. I also tried 3 dimensionality reduction variants:

No Reduction
Principal Component Analysis
Autoencoding
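As an illustration of the PCA option, the sketch below reduces a feature matrix with a plain SVD in NumPy. The feature sizes, the number of components, and the helper names `pca_fit` and `pca_transform` are assumptions for the example; the autoencoder variant is not shown.

```python
import numpy as np


def pca_fit(features, n_components):
    """Return the feature mean and the top principal directions."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(features, mean, components):
    """Project features onto the fitted principal directions."""
    return (features - mean) @ components.T


rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))  # stand-in for backbone features
train, test = features[:150], features[150:]

# Fit on training features only, then apply the same projection to test.
mean, components = pca_fit(train, n_components=8)
train_reduced = pca_transform(train, mean, components)
test_reduced = pca_transform(test, mean, components)
```

Fitting on the training split alone keeps test information out of the projection, which matters when comparing reduction methods by test accuracy.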

Section 2: Findings and Inference:

1. Impact of applying transforms during training and during testing:

NOTE: The Goodness Factor was calculated by subtracting min(accuracy) from every data point on each line. Each line represents a different dataset-backbone combination. The names have been omitted for clarity.
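The Goodness Factor computation is simple enough to show directly. The accuracies below are made-up numbers for one hypothetical dataset-backbone line across Transforms 0 to 3, used only to illustrate the arithmetic.

```python
def goodness_factor(accuracies):
    """Subtract the line's minimum accuracy from every point on that line."""
    lowest = min(accuracies)
    return [acc - lowest for acc in accuracies]


# Hypothetical accuracies for one line (Transform 0, 1, 2, 3); not real results.
line = [0.97, 0.92, 0.96, 0.91]
factors = goodness_factor(line)

# The worst setting on each line maps to 0, so lines with different
# absolute accuracy levels can be compared on the same axis.
```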

Inference 1: Training with no transform works best, though Transform 2 (jitter) also shows good results. Gaussian blur can hurt accuracy (by around 0.05 on average).

Inference 2: When no transform was applied during training but transforms were applied during testing, the models were robust to those transforms. This means that if a user edits an image, for example by adding jitter or Gaussian blur (as can happen during compression), the models would still be very accurate.

2. Impact of Dimensionality Reduction:

Inference: In many cases, no dimensionality reduction is the best choice. In some cases, however, autoencoding performs better than both no reduction and PCA.

3. Impact of classifier used:

Inference: SVM, Linear Probe, and Linear Discriminant Analysis seem to be good classifiers.

4. Impact of backbone model used:

Inference: CLIP ViT-B/16 is the best backbone across all cases, followed by DINO ResNet-50, then DINO ViT-B/16.

Section 3: Going Beyond ...

After this I tried to split the feature extraction across multiple backbones instead of just one. I made the following combinations:

DINO ViT-B/16 and DINO ResNet-50 and CLIP ViT-B/16
DINO ViT-B/16 and DINO ResNet-50
CLIP ViT-B/16 and DINO ResNet-50
CLIP ViT-B/16 and DINO ViT-B/16
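One plausible way to combine backbones is to concatenate their per-image feature vectors before classification. The README does not state how the combination was done, so the sketch below is an assumption; the embedding sizes (768, 2048, 512) are the typical output widths for these architectures, and the random arrays stand in for real features.

```python
import numpy as np


def l2_normalize(features):
    """Scale each row to unit length so no backbone dominates by magnitude."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, 1e-12)


def combine_backbones(feature_sets):
    """Concatenate normalized features from several backbones per image."""
    return np.concatenate([l2_normalize(f) for f in feature_sets], axis=1)


rng = np.random.default_rng(0)
n_images = 10
dino_vit = rng.normal(size=(n_images, 768))      # assumed DINO ViT-B/16 width
dino_resnet = rng.normal(size=(n_images, 2048))  # assumed DINO ResNet-50 width
clip_vit = rng.normal(size=(n_images, 512))      # assumed CLIP ViT-B/16 width

combined = combine_backbones([dino_vit, dino_resnet, clip_vit])
```

The combined matrix then feeds the same classifiers as a single-backbone feature matrix would.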

I then trained each combination over the entire dataset and took the best classifier for each. I used randomised jitter (p=0.5) and Gaussian blur (p=0.5).

Inference: No combination was able to beat the previous best model, i.e. CLIP ViT-B/16.

Lastly, I tested the models across datasets (training on dataset A and testing on dataset B) to check how well they generalize.
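The cross-dataset protocol amounts to filling a train-by-test accuracy table. The sketch below shows that loop on synthetic data, with a nearest-centroid classifier swapped in as a simple stand-in for the classifiers listed earlier. The datasets "A" and "B" and every helper name are hypothetical.

```python
import numpy as np


def fit_centroids(x, y):
    """Mean feature vector per class (0 = real, 1 = fake)."""
    return np.stack([x[y == 0].mean(axis=0), x[y == 1].mean(axis=0)])


def predict(centroids, x):
    """Assign each sample to the nearest class centroid."""
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def cross_dataset_accuracy(datasets):
    """table[i, j] = accuracy when training on dataset i, testing on j."""
    names = list(datasets)
    table = np.zeros((len(names), len(names)))
    for i, train_name in enumerate(names):
        centroids = fit_centroids(*datasets[train_name]["train"])
        for j, test_name in enumerate(names):
            x_test, y_test = datasets[test_name]["test"]
            table[i, j] = (predict(centroids, x_test) == y_test).mean()
    return names, table


def make_dataset(rng, shifted_dims, n_per_class=100, dim=8):
    """Synthetic features: fakes differ from reals only in shifted_dims."""
    real = rng.normal(0.0, 1.0, size=(n_per_class, dim))
    fake = rng.normal(0.0, 1.0, size=(n_per_class, dim))
    fake[:, shifted_dims] += 3.0
    x = np.concatenate([real, fake])
    y = np.concatenate([np.zeros(n_per_class, dtype=int), np.ones(n_per_class, dtype=int)])
    return x, y


rng = np.random.default_rng(0)
# Two hypothetical generators whose artifacts only partly overlap.
datasets = {
    "A": {"train": make_dataset(rng, [0, 1, 2, 3]), "test": make_dataset(rng, [0, 1, 2, 3])},
    "B": {"train": make_dataset(rng, [2, 3, 4, 5]), "test": make_dataset(rng, [2, 3, 4, 5])},
}
names, table = cross_dataset_accuracy(datasets)
```

The diagonal of `table` is within-dataset accuracy and the off-diagonal entries measure generalization to an unseen generator.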

Section 4: Results and Conclusion:

I. Results:

The best accuracy was achieved with no transformation, the CLIP ViT-B/16 backbone, and Linear Discriminant Analysis as the classifier with no dimensionality reduction. The test accuracy was 98.1875% with a train accuracy of 98.75% when trained and tested over the combined datasets.
On testing across datasets, the model with the CLIP ViT-B/16 backbone, a Support Vector Machine classifier, an autoencoder for dimensionality reduction, and randomised jitter and blur as transformations gave the best results.

i. Best Model for GANs: Generalization was best when the model was trained on bigGAN and tested on the other datasets (Laion, ImageNet, LDM100, LDM200). When trained and tested on bigGAN, accuracy was 98.875%. When tested on Laion vs LDM100, accuracy was 79.1%. When tested on ImageNet vs LDM200, accuracy was 81.1%.

ii. Best Model for Diffusion Models: Generalization was best when the model was trained on ImageNet vs LDM200 and tested on the other datasets (Laion, bigGAN, LDM100). When trained and tested on ImageNet vs LDM200, accuracy was 98.5%. When tested on Laion vs LDM100, accuracy was 94.44%. When tested on bigGAN Real vs bigGAN Fake, accuracy was 72.1%.

II. Conclusion:

The best results are achieved when a generalized backbone is used with SVM or LDA as the classifier. A well-trained autoencoder can also make a significant difference.
Although accuracy is good across different diffusion models or across different GAN-based generation techniques, it is still difficult to get good accuracy on unknown image generation techniques.

Section 5: Future Scope:

Explore using text embeddings for classification of real and fake images (explore using the semantic information within an image)
Try to enhance the autoencoder hyperparameters
Use larger datasets for training the models
Try exploring other backbone models
Try building a neural network that combines multiple backbones to produce the feature maps
