This 🖥️📦 repository corresponds to our 📚📄 paper on biases in datasets for AI-generated images detection. As discussed in detail in the paper, the experiments are carried out on the GenImage dataset.
To use our Unbiased GenImage dataset, you first need to download the original GenImage dataset and our metadata CSV, which contains additional information about the JPEG quality factor (QF), size, and content of each image. This CSV is required by our training and validation code.
⬇️ We provide an easy download of GenImage (and the metadata CSV) here (~500 GB): DOWNLOAD. Furthermore, we removed corrupted files from the Baidu GenImage download.
Use our download script as follows, since the web interface does not allow downloading all files at once:
python download_genimage.py [--continue] [--destination {path}]
- `--continue`: Optional. Skip files that already exist. Default is to start a new download.
- `--destination {path}`: Optional. Specify a custom directory where the files should be downloaded. Default is `./GenImage_download`.
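For example, to resume an interrupted download into a custom directory (the destination path below is just a placeholder):

python download_genimage.py --continue --destination /data/GenImage_download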
Once all parts are downloaded, merge the split archive into a single zip file:

cat GenImage.z* > ../GenImage_restored.zip
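If the concatenated archive is a standard zip file, it can afterwards be extracted as usual, for example (the target directory is a placeholder):

unzip ../GenImage_restored.zip -d ../GenImage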
ℹ️ NOTE: There is now an easy GenImage download on Google Drive. We recommend downloading the GenImage dataset from there and only downloading the metadata.csv from our dataverse. ℹ️
As shown in the training code of our detectors (see get_data.py and get_transform.py), you can create our Unbiased GenImage dataset by selecting the subset of images within a specific size range (or by content class) and then aligning the JPEG QF using jpeg_augment.py (a sketch of this alignment step follows the example below).
Example of creating the Wukong subset (512x512 px) that is unbiased with respect to size and compression:
import pandas as pd

# Load the metadata CSV (JPEG QF, image size and content class for every image)
df = pd.read_csv("metadata.csv")

# Natural (real) images: roughly 512x512 px and JPEG quality factor 96
df_unbiased_natural = df[
    (df["generator"] == "nature")
    & (df["width"] >= 450) & (df["height"] >= 450)
    & (df["width"] <= 550) & (df["height"] <= 550)
    & (df["compression_rate"] == 96)
]

# AI-generated images: all Wukong images
df_unbiased_ai = df[df["generator"] == "wukong"]

# Combined unbiased subset of real and AI-generated images
df_unbiased = pd.concat([df_unbiased_natural, df_unbiased_ai])
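The QF alignment itself is handled by jpeg_augment.py. As a rough sketch of the underlying idea (not the repository's actual implementation), re-compressing an image to a target JPEG quality factor with Pillow looks like this:

from io import BytesIO
from PIL import Image

def align_jpeg_qf(image_path: str, quality: int = 96) -> Image.Image:
    # Re-encode an image with a fixed JPEG quality factor (illustrative sketch only).
    with Image.open(image_path) as img:
        buffer = BytesIO()
        img.convert("RGB").save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return Image.open(buffer)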
We provide code for training and validating ResNet50 and Swin-T detectors. This allows comparing detectors trained on the constrained (unbiased) dataset with detectors trained on the raw dataset.
As in the original GenImage paper, we use forks of timm and Swin-Transformer. We only changed the dataset code (create_dataset.py) to better suit our experiments. This dataset uses get_data.py to select the right data from the CSV file and get_transform.py for transformations such as JPEG compression, which are applied before the original transformations/augmentations. More details on how to start the experiments can be found in the corresponding detector folders.
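As a minimal sketch of how such a CSV-driven dataset ties these pieces together (the class name, the "path" column and the label mapping are illustrative assumptions, not the repository's actual API):

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class CsvImageDataset(Dataset):
    # Loads the images listed in a metadata dataframe -- illustrative sketch only.

    def __init__(self, df: pd.DataFrame, pre_transform=None, transform=None):
        self.df = df.reset_index(drop=True)
        self.pre_transform = pre_transform  # e.g. JPEG compression, applied first
        self.transform = transform          # original transformations/augmentations

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = Image.open(row["path"]).convert("RGB")       # assumes a "path" column
        label = 0 if row["generator"] == "nature" else 1   # real vs. AI-generated
        if self.pre_transform is not None:
            img = self.pre_transform(img)   # JPEG compression before everything else
        if self.transform is not None:
            img = self.transform(img)       # resize, crop, flip, normalization, ...
        return img, label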
To run inference on your own datasets, you have to create a CSV file and slightly adjust get_data.py, as we did for the FFHQ dataset.
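For instance, a metadata CSV for a custom dataset could be assembled along the following lines; the column names mirror the ones used above, but the exact columns expected by get_data.py should be checked in the code, and the paths are placeholders:

from pathlib import Path
import pandas as pd
from PIL import Image

rows = []
for path in sorted(Path("my_dataset/images").glob("*.jpg")):  # placeholder directory
    with Image.open(path) as img:
        width, height = img.size
    rows.append({
        "path": str(path),
        "generator": "nature",    # or the generator name for AI-generated images
        "width": width,
        "height": height,
        "compression_rate": 96,   # JPEG quality factor, if known
    })

pd.DataFrame(rows).to_csv("my_dataset_metadata.csv", index=False)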
Cross-generator performance when training ResNet50 on the constrained dataset
Difference compared to training on the raw dataset
Cross-generator performance when training Swin-T on the constrained dataset
Difference compared to training on the raw dataset