The architecture of the proposed HyperCGAN: linear layers are used as hypernetworks. Given text embeddings and a noise vector, the hypernetworks generate parameters that modulate the weights of the INR-based decoder.
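The following is a minimal sketch, not the authors' implementation, of the mechanism described above: a linear hypernetwork maps a text embedding and a noise vector to modulation parameters that are applied to a layer of a coordinate-based (INR) decoder. All names and dimensions (`HyperModulator`, `INRLayer`, `text_dim`, `noise_dim`, `hidden_dim`) are illustrative assumptions, and the FiLM-style scale/shift stands in for the paper's weight modulation.

```python
# Hedged sketch: hypernetwork-conditioned modulation of an INR decoder layer.
import torch
import torch.nn as nn


class HyperModulator(nn.Module):
    """Linear hypernetwork: (text embedding, noise) -> per-layer scale/shift."""

    def __init__(self, text_dim=256, noise_dim=128, hidden_dim=512):
        super().__init__()
        # One linear map per modulated INR layer; a single layer here for brevity.
        self.to_scale = nn.Linear(text_dim + noise_dim, hidden_dim)
        self.to_shift = nn.Linear(text_dim + noise_dim, hidden_dim)

    def forward(self, text_emb, z):
        cond = torch.cat([text_emb, z], dim=-1)
        return self.to_scale(cond), self.to_shift(cond)


class INRLayer(nn.Module):
    """One layer of a coordinate-based decoder whose output is modulated."""

    def __init__(self, in_dim=2, hidden_dim=512):
        super().__init__()
        self.linear = nn.Linear(in_dim, hidden_dim)

    def forward(self, coords, scale, shift):
        # Apply the hypernetwork-predicted scale/shift to the layer activations.
        h = self.linear(coords)
        return torch.sin(scale.unsqueeze(1) * h + shift.unsqueeze(1))


# Usage: one batch, a flattened 64x64 grid of (x, y) coordinates.
hyper, layer = HyperModulator(), INRLayer()
text_emb, z = torch.randn(1, 256), torch.randn(1, 128)
coords = torch.rand(1, 64 * 64, 2)           # (batch, num_pixels, xy)
scale, shift = hyper(text_emb, z)
features = layer(coords, scale, shift)        # (1, 4096, 512)
```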
Qualitative results on three datasets: MS-COCO 256², CUB 256², and ArtEmis 256².
Here, the input noise z is kept fixed while varying color names in the prompt "a small {color} bird with white and dark gray wingbars and white breast and long tail", aiming to assess the model's sensitivity to word-level modulation.
The capability of our models in terms of continuous image synthesis: extrapolation and super-resolution.
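A minimal sketch of why an INR-based decoder supports continuous synthesis: the decoder maps (x, y) coordinates to RGB, so super-resolution amounts to sampling a denser coordinate grid and extrapolation amounts to querying coordinates outside the training range. `inr_decoder` is a hypothetical placeholder for the modulated generator; only the coordinate grids are the point of this example, and the resolutions and ranges are assumed for illustration.

```python
# Hedged sketch: continuous querying of a coordinate-based decoder.
import torch


def make_grid(h, w, lo=0.0, hi=1.0):
    """Build an (h*w, 2) grid of (x, y) coordinates in [lo, hi]."""
    ys = torch.linspace(lo, hi, h)
    xs = torch.linspace(lo, hi, w)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([xx, yy], dim=-1).reshape(-1, 2)


def inr_decoder(coords):
    # Placeholder for the modulated INR generator: coords -> RGB in [0, 1].
    return torch.sigmoid(coords @ torch.randn(2, 3))


# Trained at 256x256; query a 1024x1024 grid for 4x super-resolution.
sr_rgb = inr_decoder(make_grid(1024, 1024)).reshape(1024, 1024, 3)

# Extrapolation: coordinates outside [0, 1] extend the canvas beyond its borders.
ext_rgb = inr_decoder(make_grid(256, 256, lo=-0.25, hi=1.25)).reshape(256, 256, 3)
```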
Attention maps for word-level weight modulation.
@InProceedings{Haydarov_2024_CVPR,
author = {Haydarov, Kilichbek and Muhamed, Aashiq and Shen, Xiaoqian and Lazarevic, Jovana and Skorokhodov, Ivan and Galappaththige, Chamuditha Jayanga and Elhoseiny, Mohamed},
title = {Adversarial Text to Continuous Image Generation},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {6316-6326}
}