Researchers at the NYU Tandon School of Engineering have uncovered critical flaws in recently proposed methods meant to make powerful text-to-image generative AI systems safer for public use.
The study, to be presented at the Twelfth International Conference on Learning Representations (ICLR) in Vienna, May 7-11, 2024, shows how techniques that claim to remove the ability of models like Stable Diffusion to generate explicit, copyrighted, or otherwise harmful visual content can be bypassed with simple attacks. The paper is also available on the arXiv preprint server.
Stable Diffusion is a publicly available AI tool that can generate highly realistic images from text descriptions alone, with examples of such images showcased on GitHub.
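For context, generating an image with a public Stable Diffusion checkpoint takes only a few lines of code. The following is a minimal sketch using the Hugging Face diffusers library; the model ID, prompt, and output filename are illustrative and not taken from the paper.

```python
# Minimal text-to-image sketch with Stable Diffusion via Hugging Face diffusers.
# Model ID, prompt, and filename are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # a publicly released checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# One text prompt in, one photorealistic image out.
image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```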
The paper's senior author, Chinmay Hegde, an associate professor in the Electrical and Computer Engineering Department at NYU Tandon, noted that text-to-image models have surged in popularity because they can render diverse visual scenes from text descriptions alone. But that same capability lets individuals create and spread photorealistic images that can be deeply manipulative, offensive, or even unlawful, including celebrity deepfakes and copyright-infringing content.
The researchers examined seven recent concept-erasure methods and showed that all of them can be bypassed using concept inversion attacks.
By learning specialized word embeddings and supplying them as input, the researchers induced the "sanitized" Stable Diffusion models to regenerate the very concepts that were supposed to have been erased, including hate symbols, trademarked objects, and celebrity likenesses. In effect, they could reproduce nearly any unsafe image the original Stable Diffusion model was capable of producing, despite claims that those concepts had been removed.
The findings suggest that these methods perform little more than input filtering rather than genuinely removing unsafe knowledge from the models' internal representations, meaning adversaries can apply such inversion prompts to "cleaned" models to produce harmful or illicit content.
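To make the attack concrete, here is a simplified sketch of a concept inversion attack in the spirit of textual inversion, the general family of techniques behind the learned word embeddings described above. This is not the authors' released code: the model path, placeholder token, prompt, and hyperparameters are illustrative assumptions. The idea is to freeze the sanitized model entirely and optimize only a new token embedding so that the model learns to denoise reference images of the supposedly erased concept.

```python
# Simplified concept-inversion sketch in the spirit of textual inversion.
# NOT the authors' released code: the model path, token name, prompt, and
# hyperparameters below are illustrative assumptions.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("path/to/sanitized-model").to(device)
tokenizer, text_encoder = pipe.tokenizer, pipe.text_encoder
unet, vae, scheduler = pipe.unet, pipe.vae, pipe.scheduler

# Register a placeholder token whose embedding we will learn from scratch.
tokenizer.add_tokens(["<erased-concept>"])
text_encoder.resize_token_embeddings(len(tokenizer))
token_id = tokenizer.convert_tokens_to_ids("<erased-concept>")

# Freeze the entire model; only the embedding table receives gradients.
vae.requires_grad_(False)
unet.requires_grad_(False)
text_encoder.requires_grad_(False)
embeddings = text_encoder.get_input_embeddings()
embeddings.weight.requires_grad_(True)

# weight_decay=0 so frozen (zero-gradient) rows are not perturbed by AdamW.
optimizer = torch.optim.AdamW([embeddings.weight], lr=5e-4, weight_decay=0.0)

def inversion_step(pixel_values: torch.Tensor) -> float:
    """One denoising-loss step on a batch of reference images of the concept."""
    latents = vae.encode(pixel_values).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=device)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    ids = tokenizer("a photo of <erased-concept>", padding="max_length",
                    max_length=tokenizer.model_max_length, truncation=True,
                    return_tensors="pt").input_ids.to(device)
    text_states = text_encoder(ids.expand(latents.shape[0], -1))[0]

    # Standard epsilon-prediction loss; gradients reach only the embedding table.
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_states).sample
    loss = F.mse_loss(noise_pred, noise)
    loss.backward()

    # Zero out gradients for every embedding row except the placeholder token's.
    grad = embeddings.weight.grad
    keep = torch.zeros_like(grad)
    keep[token_id] = 1.0
    grad.mul_(keep)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Once the embedding converges, simply including the placeholder token in an ordinary prompt (e.g., "a photo of <erased-concept>") causes the sanitized pipeline to regenerate the concept it was supposed to have forgotten.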
The findings cast doubt on deploying these scrubbing approaches as a quick safety fix for powerful generative AI. Hegde stressed that preventing text-to-image models from producing objectionable content requires changing how the models are trained, rather than relying on post hoc fixes, and said it is likely infeasible to erase specific concepts from these models once they have learned them.
Hegde argued that evaluations of proposed concept-erasure methods must include adversarial concept inversion attacks, not just generic test samples.
Joining Hegde on the study were lead author Minh Pham, a Ph.D. candidate at NYU Tandon, along with Govind Mittal, Kelly O. Marshall, and Niv Cohen, all of NYU Tandon.
The research continues Hegde's line of work on improving AI models for applications in imaging, materials design, and transportation, while also identifying weaknesses in existing models. His team also recently introduced an AI method that can alter a person's apparent age in images while preserving their distinctive identifying features, going beyond typical AI models that modify age without retaining biometric markers.