BlobCtrl
Taming Controllable Blobs for Element-level Image Editing
1Peking University, 2ARC Lab, Tencent PCG, 3The Chinese University of Hong Kong
Overview: As user expectations for image editing continue to rise, the demand for flexible, fine-grained manipulation of specific visual elements presents a challenge for current diffusion-based methods. In this work, we present BlobCtrl, a framework for element-level image editing based on a probabilistic blob-based representation. Treating blobs as visual primitives, BlobCtrl disentangles layout from appearance, affording fine-grained, controllable object-level manipulation. Our key contributions are twofold: (1) an in-context dual-branch diffusion model that separates foreground and background processing, incorporating blob representations to explicitly decouple layout and appearance, and (2) a self-supervised disentangle-then-reconstruct training paradigm with an identity-preserving loss function, along with tailored strategies to efficiently leverage blob-image pairs. To foster further research, we introduce BlobData for large-scale training and BlobBench, a benchmark for systematic evaluation. Experimental results demonstrate that BlobCtrl achieves state-of-the-art performance in a variety of element-level editing tasks, such as object addition, removal, scaling, and replacement, while maintaining computational efficiency.
BlobCtrl Capabilities
(1) Element-level Image Manipulation: precise and flexible visual content creation with fine-grained control over individual elements.
(2) Fidelity & Diversity: high-quality and diverse visual content generation.
(3) Unified Framework: seamless layout and appearance control in both generation and editing tasks.
BlobCtrl Method
(1) Blob-based representation: Treating blobs as visual primitives to disentangle layout from appearance for fine-grained, controllable object-level manipulation.
(2) In-context dual-branch diffusion model: Separates foreground and background processing, incorporating blob representations to explicitly decouple layout and appearance.
(3) Self-supervised disentangle-then-reconstruct training: Identity-preserving loss function with tailored strategies to efficiently leverage blob-image pairs.
BlobData & BlobBench
(1) BlobData: Large-scale dataset (1.86M samples) with images, masks, ellipse parameters, and text descriptions;
(2) BlobBench: Benchmark with 100 curated images for evaluating element-level operations across diverse scenarios;
(3) Evaluation: Framework for assessing identity preservation, grounding accuracy and generation quality.
The dataset curation process involves multiple steps:
• Image Filtering: We filter source images to: (1) Retain images whose shorter side exceeds 480 pixels; (2) Keep only images with valid instance segmentation masks; (3) Apply mask filtering to preserve masks with area ratios between 0.01 and 0.9 of the total image area; (4) Exclude masks touching the image boundary (see the filtering sketch after this list).
• Parameter Extraction: For the filtered masks, we: (1) Fit ellipse parameters using OpenCV's ellipse fitting algorithm; (2) Derive the corresponding 2D Gaussian distributions; (3) Remove invalid samples with covariance values below 1e-5 (see the ellipse-fitting sketch after this list).
• Annotation: We generate detailed image descriptions using InternVL-2.5, providing rich textual context for each sample in the dataset.
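To make the filtering rules concrete, here is a minimal sketch in Python. It assumes images and instance masks are provided as NumPy arrays; the helper names (keep_image, keep_mask) are illustrative rather than part of the released pipeline, while the thresholds (480-pixel shorter side, 0.01-0.9 area ratio, boundary exclusion) follow the description above.

import numpy as np

def keep_image(image: np.ndarray) -> bool:
    """Retain only images whose shorter side exceeds 480 pixels."""
    h, w = image.shape[:2]
    return min(h, w) > 480

def keep_mask(mask: np.ndarray) -> bool:
    """Keep masks whose area ratio lies in [0.01, 0.9] and that do not
    touch the image boundary."""
    mask = mask.astype(bool)
    area_ratio = mask.sum() / mask.size
    if not (0.01 <= area_ratio <= 0.9):
        return False
    # Exclude masks touching any image border.
    touches_border = (mask[0, :].any() or mask[-1, :].any()
                      or mask[:, 0].any() or mask[:, -1].any())
    return not touches_border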
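The ellipse-fitting step can be sketched as follows, again assuming binary NumPy masks. The use of cv2.fitEllipse and the 1e-5 covariance cutoff follow the description above; the exact ellipse-to-Gaussian convention (here, semi-axes treated as standard deviations) is an assumption and may differ from the released pipeline.

import cv2
import numpy as np

def mask_to_gaussian(mask: np.ndarray):
    """Fit an ellipse to a binary mask and derive a 2D Gaussian;
    return None for invalid samples."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    if not contours:
        return None
    contour = max(contours, key=cv2.contourArea)
    if len(contour) < 5:  # cv2.fitEllipse needs at least 5 points
        return None
    (cx, cy), (width, height), angle_deg = cv2.fitEllipse(contour)

    # Assumed convention: treat the semi-axes as standard deviations of an
    # axis-aligned Gaussian, then rotate by the fitted angle.
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    cov = rot @ np.diag([(width / 2.0) ** 2, (height / 2.0) ** 2]) @ rot.T

    # Drop degenerate samples whose variances fall below 1e-5
    # (our reading of the covariance cutoff described above).
    if np.any(np.diag(cov) < 1e-5):
        return None
    return np.array([cx, cy]), cov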
Our evaluation framework assesses multiple aspects:
• Identity Preservation: We evaluate element-level appearance preservation using: (1) CLIP-I scores to measure appearance similarity; (2) DINO scores to assess feature-level preservation between generated and reference images (see the CLIP/DINO sketch after this list).
• Grounding Accuracy: We evaluate layout control by: (1) Extracting masks from generated images using SAM; (2) Fitting ellipses/bounding boxes to these masks; (3) Computing the MSE between the fitted annotations and the ground truth (see the grounding sketch after this list).
• Quality Metrics: We assess generation and harmonization quality using: (1) FID for distribution similarity; (2) PSNR and SSIM for pixel-level fidelity; (3) LPIPS for perceptual quality (see the quality-metrics sketch after this list).
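A hedged sketch of the identity-preservation metrics: cosine similarity between image embeddings of the generated and reference object crops. The Hugging Face checkpoints named below ("openai/clip-vit-base-patch32", "facebook/dino-vits16") are assumptions for illustration; BlobBench may use different backbones or crop strategies.

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

@torch.no_grad()
def clip_i_score(gen: Image.Image, ref: Image.Image) -> float:
    """Cosine similarity between CLIP image embeddings (CLIP-I)."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(images=[gen, ref], return_tensors="pt")
    feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])

@torch.no_grad()
def dino_score(gen: Image.Image, ref: Image.Image) -> float:
    """Cosine similarity between DINO [CLS] token features."""
    processor = AutoImageProcessor.from_pretrained("facebook/dino-vits16")
    model = AutoModel.from_pretrained("facebook/dino-vits16")
    inputs = processor(images=[gen, ref], return_tensors="pt")
    feats = model(**inputs).last_hidden_state[:, 0]
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])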
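Grounding accuracy can be sketched as below, assuming the object mask has already been extracted from the generated image (e.g. with SAM) and that the ground-truth layout is given as normalized ellipse parameters (cx, cy, width, height, angle); the normalization and parameter ordering here are illustrative assumptions.

import cv2
import numpy as np

def grounding_mse(pred_mask: np.ndarray, gt_params: np.ndarray) -> float:
    """MSE between the ellipse fitted to the predicted mask and the
    ground-truth ellipse parameters."""
    h, w = pred_mask.shape[:2]
    contours, _ = cv2.findContours(pred_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea)
    (cx, cy), (ew, eh), angle = cv2.fitEllipse(contour)

    # Normalize center and axes by the image size, and the angle by 180 degrees.
    pred = np.array([cx / w, cy / h, ew / w, eh / h, angle / 180.0])
    return float(np.mean((pred - gt_params) ** 2))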
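For the quality metrics, a sketch using scikit-image and the lpips package is given below; FID is typically computed over the whole image set (e.g. with pytorch-fid or torchmetrics) rather than per pair, so it is omitted here. These library choices are assumptions, not necessarily the benchmark's implementation.

import lpips
import numpy as np
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def quality_metrics(gen: np.ndarray, ref: np.ndarray) -> dict:
    """PSNR/SSIM/LPIPS between generated and reference uint8 HxWx3 images."""
    psnr = peak_signal_noise_ratio(ref, gen, data_range=255)
    ssim = structural_similarity(ref, gen, channel_axis=-1, data_range=255)

    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    def to_tensor(x: np.ndarray) -> torch.Tensor:
        return torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0

    loss_fn = lpips.LPIPS(net="alex")
    lpips_val = float(loss_fn(to_tensor(gen), to_tensor(ref)))

    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lpips_val}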