Pre-trained Vision-Language Models (VLMs) exhibit strong generalization capabilities, enabling them to recognize a wide range of objects across diverse domains without additional training. However, they often retain irrelevant information beyond the requirements of specific target downstream tasks, raising concerns about computational efficiency and potential information leakage. This has motivated growing interest in approximate unlearning, which aims to selectively remove unnecessary knowledge while preserving overall model performance. Existing approaches to approximate unlearning have primarily focused on class unlearning, where a VLM is retrained to fail to recognize specified object classes while maintaining accuracy for others. However, merely forgetting object classes is often insufficient in practical applications. For instance, an autonomous driving system should accurately recognize real cars, while avoiding misrecognition of illustrated cars depicted in roadside advertisements as real cars, which could be hazardous. In this paper, we introduce Approximate Domain Unlearning (ADU), a novel problem setting that requires reducing recognition accuracy for images from specified domains (e.g., illustration) while preserving accuracy for other domains (e.g., real). ADU presents new technical challenges: due to the strong domain generalization capability of pre-trained VLMs, domain distributions are highly entangled in the feature space, making naive approaches based on penalizing target domains ineffective. To tackle this limitation, we propose a novel approach that explicitly disentangles domain distributions and adaptively captures instance-specific domain information. Extensive experiments on three multi-domain benchmark datasets demonstrate that our approach significantly outperforms strong baselines built upon state-of-the-art VLM tuning techniques, paving the way for practical and fine-grained unlearning in VLMs.
Approximate Domain Unlearning (ADU) is a novel approximate unlearning problem introduced in this paper. Unlike existing approximate class unlearning tasks, ADU requires retraining a pre-trained Vision-Language Model (VLM) so that it cannot recognize images from specified domains (painting, clipart, sketch in the figure) while preserving its ability to recognize images from other domains (real in the figure).
We are given a set of training data \( \{(\mathbf{x}, y, d)\} \), where \( \mathbf{x} \in \mathcal{X} \) is an input image, \( y \in \mathcal{C} \) is the class label, and \( d \in \mathcal{D} \) is the domain label, with \( \mathcal{C} \) and \( \mathcal{D} \) denoting the sets of all classes and domains, respectively. We define \( \mathcal{D}_{\text{memorize}} \subset \mathcal{D} \) as the set of domains to be preserved and \( \mathcal{D}_{\text{forget}} = \mathcal{D} \setminus \mathcal{D}_{\text{memorize}} \) as the set of domains to be forgotten. Our goal is to retrain a pre-trained vision-language model \( f \) so that it maintains classification accuracy on \( \{(\mathbf{x}, y, d) \mid d \in \mathcal{D}_{\text{memorize}}\} \) while reducing it on \( \{(\mathbf{x}, y, d) \mid d \in \mathcal{D}_{\text{forget}}\} \).
The common approach to conventional approximate class unlearning is to use two different loss functions: one for retaining classification accuracy on the classes to be memorized, and the other for reducing accuracy on those to be forgotten.
Building on this idea, a straightforward approach to ADU would be to adapt these two loss functions to domains, that is, minimizing \( \mathcal{L}_{\mathrm{memorize}} \) for \( \{(\mathbf{x}, y, d) \mid d \in \mathcal{D}_{\text{memorize}}\} \) and \( \mathcal{L}_{\mathrm{forget}} \) for \( \{(\mathbf{x}, y, d) \mid d \in \mathcal{D}_{\text{forget}}\} \).
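To make this baseline concrete, the sketch below shows one possible instantiation in PyTorch. The exact forms of \( \mathcal{L}_{\mathrm{memorize}} \) and \( \mathcal{L}_{\mathrm{forget}} \) are not fixed here; the sketch assumes standard cross-entropy on class labels for the memorized domains and its negation for the forgotten domains, which is only one possible choice.

```python
# A minimal sketch of the straightforward two-loss baseline, assuming
# L_memorize is the standard cross-entropy on class labels and L_forget is
# its negation (pushing predictions away from the true class); these exact
# forms are an assumption, not prescribed by the formulation above.
import torch
import torch.nn.functional as F

def baseline_losses(logits, class_labels, domain_labels, forget_domains):
    """logits: (B, |C|) class logits; domain_labels: (B,) ints;
    forget_domains: 1-D tensor with the indices of D_forget."""
    forget_mask = torch.isin(domain_labels, forget_domains)  # d in D_forget
    memorize_mask = ~forget_mask                              # d in D_memorize

    per_sample_ce = F.cross_entropy(logits, class_labels, reduction="none")

    zero = logits.new_zeros(())
    # Retain accuracy on D_memorize: usual cross-entropy.
    loss_memorize = per_sample_ce[memorize_mask].mean() if memorize_mask.any() else zero
    # Reduce accuracy on D_forget: negated cross-entropy (one possible choice).
    loss_forget = -per_sample_ce[forget_mask].mean() if forget_mask.any() else zero
    return loss_memorize, loss_forget
```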
However, as we will show later in our experiments, this straightforward approach alone is insufficient to achieve satisfactory ADU performance.
This is primarily due to the strong domain generalization capability of pre-trained VLMs.
As evidenced by their robustness to domain shifts, the latent space of VLMs strongly aligns data distributions across different domains, so the covariate shift between domains is minimal.
Consequently, the feature distributions across different domains are highly entangled, making it difficult to effectively control memorization and forgetting on a per-domain basis.
To address this issue, we propose Domain Disentangling Loss (DDL), which aims to explicitly disentangle the feature distributions among different domains in the latent feature space.
The core idea is that if the feature distributions of individual domains are well-separated, the domain label \( d \) of a given sample \( \mathbf{x} \) can be accurately predicted, and vice versa.
Based on this insight, DDL encourages the domain label of a sample to be predictable from its latent feature through an auxiliary domain classifier.
More specifically, we introduce a standard cross-entropy loss that requires the model to correctly predict the domain labels of the samples:
\[ \mathcal{L}_{\mathrm{CE}}(\mathcal{B})=-\frac{1}{|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|} \sum_{j=1}^{|\mathcal{D}|} d_{ij}\log p^d_{ij}, \]
where \( \mathbf{p}_i^d=(p^d_{i1}, p^d_{i2}, \dots, p^d_{i|\mathcal{D}|}) \) represents the confidence scores of a sample \( \mathbf{x}_i \) output by the domain classifier (a fully connected layer), and \( \mathbf{d}_i=(d_{i1}, d_{i2}, \dots, d_{i|\mathcal{D}|}) \) is the one-hot encoding of the domain label \( d_i \).
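As a concrete illustration, the following sketch implements the auxiliary domain classifier and \( \mathcal{L}_{\mathrm{CE}} \); the feature dimension and how the latent features are extracted from the VLM are assumptions.

```python
# A sketch of the auxiliary domain classifier (a single fully connected
# layer) and the cross-entropy term L_CE over domain labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainClassifier(nn.Module):
    def __init__(self, feat_dim: int, num_domains: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_domains)  # the fully connected layer

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.fc(features)  # (B, |D|) domain logits

def domain_ce_loss(domain_logits: torch.Tensor, domain_labels: torch.Tensor) -> torch.Tensor:
    # Equivalent to -(1/|B|) sum_i sum_j d_ij log p_ij with one-hot d_i.
    return F.cross_entropy(domain_logits, domain_labels)
```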
To further enhance domain separability, we additionally incorporate the maximum mean discrepancy (MMD) into DDL as an auxiliary loss term. MMD estimates the pairwise distance between domain distributions in a reproducing kernel Hilbert space (RKHS) as:
\[ \text{MMD}^2(\mathcal{B}) = \frac{2}{|\mathcal{D}|(|\mathcal{D}|-1)} \sum_{1 \leq d < d' \leq |\mathcal{D}|} \left\| \frac{1}{|\mathcal{B}_d|} \sum_{\mathbf{x}_i \in \mathcal{B}_d} \phi(\mathbf{x}_i) - \frac{1}{|\mathcal{B}_{d'}|} \sum_{\mathbf{x}_j \in \mathcal{B}_{d'}} \phi(\mathbf{x}_j) \right\|_{\mathcal{H}}^2, \]
where \( \phi \) denotes a kernel-induced feature mapping and \( \mathcal{B}_d \) is the subset of the mini-batch \( \mathcal{B} \) belonging to domain \( d \in \mathcal{D} \). Intuitively, maximizing the MMD increases the inter-domain divergence in the latent space.
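For reference, the sketch below computes this term with the kernel trick; an RBF kernel with a fixed bandwidth is an assumption, since the formulation only requires some kernel-induced mapping \( \phi \).

```python
# A sketch of the squared MMD averaged over all domain pairs in a mini-batch,
# computed with the kernel trick and an (assumed) RBF kernel.
import torch
from itertools import combinations

def rbf_kernel(a: torch.Tensor, b: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # a: (n, dim), b: (m, dim) latent features -> (n, m) kernel matrix
    dist2 = torch.cdist(a, b, p=2).pow(2)
    return torch.exp(-dist2 / (2.0 * sigma ** 2))

def mmd2(features: torch.Tensor, domain_labels: torch.Tensor,
         num_domains: int, sigma: float = 1.0) -> torch.Tensor:
    """Average squared MMD over all domain pairs present in the mini-batch."""
    total, num_pairs = features.new_zeros(()), 0
    for d, d_prime in combinations(range(num_domains), 2):
        f_d = features[domain_labels == d]
        f_dp = features[domain_labels == d_prime]
        if len(f_d) == 0 or len(f_dp) == 0:
            continue  # skip pairs not represented in this batch
        # ||mean phi(x in B_d) - mean phi(x in B_d')||^2 in the RKHS
        k_dd = rbf_kernel(f_d, f_d, sigma).mean()
        k_pp = rbf_kernel(f_dp, f_dp, sigma).mean()
        k_dp = rbf_kernel(f_d, f_dp, sigma).mean()
        total = total + k_dd + k_pp - 2.0 * k_dp
        num_pairs += 1
    return total / max(num_pairs, 1)
```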
Given the above formulations, our final DDL loss is defined as:
\[ \mathcal{L}_{\mathrm{domain}}(\mathcal{B}) = \gamma \mathcal{L}_{\mathrm{CE}}(\mathcal{B}) - \lambda \text{MMD}^2(\mathcal{B}), \]
where \( \gamma \) and \( \lambda \) are balancing hyperparameters.
Combining with the standard loss functions \( \mathcal{L}_{\mathrm{memorize}} \) and \( \mathcal{L}_{\mathrm{forget}} \), the learnable prompts and the domain classifier are jointly optimized by minimizing the total loss:
\[ \mathcal{L}_{\mathrm{total}}(\mathcal{B}) = \mathcal{L}_{\mathrm{memorize}}(\mathcal{B}) + \mathcal{L}_{\mathrm{forget}}(\mathcal{B}) + \mathcal{L}_{\mathrm{domain}}(\mathcal{B}). \]
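One optimization step on this total loss could look like the sketch below, which reuses the helper functions from the earlier sketches; the interface `model(images) -> (class_logits, latent_features)` and the choice of optimizer are assumptions, while keeping the VLM backbone frozen follows standard prompt tuning.

```python
# A sketch of a single training step: only the learnable prompts (inside
# `model`) and the domain classifier receive gradients; the backbone is frozen.
def training_step(batch, model, domain_clf, optimizer, forget_domains,
                  gamma: float = 1.0, lam: float = 1.0) -> float:
    images, class_labels, domain_labels = batch
    logits, features = model(images)  # class logits and latent features (assumed interface)

    loss_mem, loss_for = baseline_losses(logits, class_labels,
                                         domain_labels, forget_domains)
    domain_logits = domain_clf(features)
    loss_domain = gamma * domain_ce_loss(domain_logits, domain_labels) \
        - lam * mmd2(features, domain_labels, domain_clf.fc.out_features)

    loss_total = loss_mem + loss_for + loss_domain
    optimizer.zero_grad()
    loss_total.backward()
    optimizer.step()
    return loss_total.item()
```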
Domain boundaries are often ambiguous. The term “illustration,” for example, covers a broad spectrum of styles, from highly realistic renderings that closely resemble real-world photographs to highly stylized depictions resembling clipart, and the style varies from image to image.
Given this ambiguity, a learnable prompt that is shared uniformly across all images cannot account for such instance-level variation, which may degrade domain unlearning performance.
To address this problem, we introduce an Instance-wise Prompt Generator (InstaPG) to adjust the learnable vision prompts according to the input image patches.
InstaPG is embedded in intermediate layers (i.e., Transformer blocks) of the image encoder and generates additional instance-wise prompts, fed to the subsequent layer, via a cross-attention mechanism in which the learnable vision prompts serve as queries and the image patch features act as keys and values. The generated prompts are conditioned on the input patch features, allowing the model to effectively capture the properties of each input image.
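The following sketch illustrates one way to realize this cross-attention in PyTorch; the embedding dimension, number of heads, and prompt initialization are illustrative assumptions.

```python
# A sketch of the InstaPG cross-attention: learnable vision prompts serve as
# queries and patch features as keys/values, producing instance-wise prompts.
import torch
import torch.nn as nn

class InstaPG(nn.Module):
    def __init__(self, dim: int, num_prompts: int, num_heads: int = 8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim) patch features from an intermediate block
        B = patch_tokens.size(0)
        queries = self.prompts.unsqueeze(0).expand(B, -1, -1)  # (B, P, dim)
        instance_prompts, _ = self.cross_attn(
            query=queries, key=patch_tokens, value=patch_tokens)
        return instance_prompts  # (B, P, dim), appended to the next layer's input
```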
The average performance over all possible combinations of domains to be forgotten/retained is reported for various numbers of forgotten domains \( |\mathcal{D}_{\text{forget}}| \in \{1, 2, 3\} \). ImageNet has only two domains, so only the case \( |\mathcal{D}_{\text{forget}}| = 1 \) is tested. Performance is evaluated using three metrics: (i) Mem: classification accuracy on data from the memorized domains, (ii) For: classification error on data from the forgotten domains, and (iii) H: the harmonic mean of Mem and For. Higher values indicate better performance. The Baseline in the table refers to the method that trains the vision prompt using only \( \mathcal{L}_{\mathrm{memorize}} \) and \( \mathcal{L}_{\mathrm{forget}} \).
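For clarity, the three metrics can be computed as in the short sketch below, assuming per-sample correctness flags and domain labels collected from an evaluation loop.

```python
# A sketch of the evaluation metrics: Mem (accuracy on memorized domains),
# For (error on forgotten domains), and H (their harmonic mean).
def evaluation_metrics(correct, domain_labels, forget_domains):
    mem_flags = [c for c, d in zip(correct, domain_labels) if d not in forget_domains]
    for_flags = [c for c, d in zip(correct, domain_labels) if d in forget_domains]
    mem = sum(mem_flags) / len(mem_flags)          # Mem: accuracy on memorized domains
    forg = 1.0 - sum(for_flags) / len(for_flags)   # For: error on forgotten domains
    h = 2 * mem * forg / (mem + forg) if (mem + forg) > 0 else 0.0  # harmonic mean
    return mem, forg, h
```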
We showcase the attention heatmaps before and after applying our method, where the domain to be forgotten is real.
For data from the domain to be forgotten (i.e., real), the zero-shot CLIP attention concentrates on the objects; after applying our method, this attention disappears or is significantly weakened.
For data from the domains to be memorized (i.e., painting, clipart, and sketch), our method maintains or even strengthens the original attention on the objects.
Our method suppresses prediction sensitivity for data from domains to be forgotten while preserving or enhancing sensitivity for data from domains to be memorized, enabling the model to effectively forget unwanted domain information while maintaining high accuracy on the retained domains.