UnbiasedGenImage

🌟 Fake or JPEG? Revealing Common Biases in Generated Image Detection Datasets

This 🖥️📦 Repository accompanies our 📚📄 Paper on biases in datasets for AI-generated image detection. As discussed in detail in the paper, all experiments are conducted on the GenImage dataset.

Download

⬇️ We provide an easy GenImage download here (~500GB): DOWNLOAD. In this download, we removed corrupted files and added a metadata CSV. This CSV is required by our training and validation code and contains additional information, such as the content class of each image, that is not part of the original dataset.
Since the web interface doesn't allow downloading all files at once, use our download script as follows:

python download_genimage.py <--continue> <--destination {path}>
cat GenImage.z* > ../GenImage_restored.zip
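
If `cat` is not available (for example on Windows), the downloaded parts can also be joined in Python. The following is a minimal sketch and not part of this repository: `join_parts` is a hypothetical helper, and it assumes that the lexicographic order of the file names matching `GenImage.z*` is the correct order of the archive parts.

```python
import glob
import shutil
from pathlib import Path


def join_parts(pattern: str, output: str) -> list[str]:
    """Concatenate all files matching `pattern`, in sorted order, into `output`.

    Returns the list of part files that were joined (hypothetical helper).
    """
    out_path = Path(output).resolve()
    # Sort by name and skip the output file in case it matches the pattern too.
    parts = [p for p in sorted(glob.glob(pattern)) if Path(p).resolve() != out_path]
    if not parts:
        raise FileNotFoundError(f"no files match {pattern!r}")
    with open(out_path, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so large parts are never fully held in memory.
                shutil.copyfileobj(src, dst)
    return parts
```

Calling `join_parts("GenImage.z*", "../GenImage_restored.zip")` from the download folder should then have the same effect as the `cat` command above.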

ℹ️ NOTE: By now, there’s an easy GenImage download on Google Drive. We recommend downloading the GenImage dataset there and only downloading the metadata.csv from our dataverse. ℹ️

Code details

We provide code for training and validating ResNet50 and Swin-T detectors. It aims to show that:

  1. Detectors trained on the raw GenImage dataset actually learn from existing biases in compression and image size.
  2. Mitigating these biases leads to significantly improved cross-generator performance and robustness towards JPEG compression, achieving state-of-the-art results.

As in the original GenImage paper, we use forks of timm and Swin-Transformer. We only changed the dataset (create_dataset.py) to make it more suitable for our experiments. This dataset uses get_data.py to select the right data from the CSV file and get_transform.py for transformations, such as JPEG compression, that are applied before the original transformations/augmentations. More details on how to start experiments can be found in the corresponding detector folders.
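
To illustrate what a JPEG-compression transformation applied before the usual augmentations could look like, here is a minimal sketch using Pillow. It is not the code from get_transform.py: the class name `JPEGCompress` and the default quality of 90 are assumptions chosen for illustration only.

```python
import io

from PIL import Image


class JPEGCompress:
    """Re-encode a PIL image as JPEG at a fixed quality and decode it again.

    Hypothetical transform for illustration; the quality value is arbitrary.
    """

    def __init__(self, quality: int = 90):
        self.quality = quality

    def __call__(self, img: Image.Image) -> Image.Image:
        buffer = io.BytesIO()
        # JPEG has no alpha channel, so convert to RGB before encoding.
        img.convert("RGB").save(buffer, format="JPEG", quality=self.quality)
        buffer.seek(0)
        out = Image.open(buffer)
        out.load()  # force decoding now instead of lazily on first access
        return out
```

Because the object is a plain callable on PIL images, it can be placed in front of an existing pipeline, for example as the first entry of a `torchvision.transforms.Compose` list, so that every image passes through the same compression step before resizing or cropping.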


To run inference on your own dataset, you have to create a CSV file for it and slightly adjust get_data.py, as we did for the ffhq dataset.
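
Creating such a CSV file could look roughly like the sketch below. Note that the column names `path` and `label`, the helper `write_metadata_csv`, and the list of image suffixes are all hypothetical; the columns that get_data.py actually expects are defined by our metadata.csv and may differ, so adapt the fields accordingly.

```python
import csv
from pathlib import Path

# Assumed set of image file extensions; extend as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def write_metadata_csv(image_root: str, csv_path: str, label: str) -> int:
    """List all images under `image_root` in a CSV file and return the row count."""
    root = Path(image_root)
    rows = [
        # Hypothetical columns: path relative to the dataset root, and one label.
        {"path": p.relative_to(root).as_posix(), "label": label}
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    ]
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "label"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

For a folder containing only generated images, a call such as `write_metadata_csv("my_dataset", "my_dataset.csv", "ai")` would write one row per image, all with the same label.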

Results

ResNet50


Cross-generator performance when training ResNet50 on the constrained dataset



Difference compared with training on the raw dataset



Swin-T


Cross-generator performance when training Swin-T on the constrained dataset



Difference compared with training on the raw dataset