BlobCtrl

Taming Controllable Blob for Element-level Image Editing

Yaowei Li¹, Lingen Li³, Zhaoyang Zhang², Xiaoyu Li², Guangzhi Wang², Hongxiang Li¹, Xiaodong Cun², Ying Shan², Yuexian Zou¹*

¹Peking University  ²ARC Lab, Tencent PCG  ³The Chinese University of Hong Kong

SIGGRAPH Asia 2025

Overview: As user expectations for image editing continue to rise, the demand for flexible, fine-grained manipulation of specific visual elements presents a challenge for current diffusion-based methods. In this work, we present BlobCtrl, a framework for element-level image editing based on a probabilistic blob-based representation. Treating blobs as visual primitives, BlobCtrl disentangles layout from appearance, affording fine-grained, controllable object-level manipulation. Our key contributions are twofold: (1) an in-context dual-branch diffusion model that separates foreground and background processing, incorporating blob representations to explicitly decouple layout and appearance, and (2) a self-supervised disentangle-then-reconstruct training paradigm with an identity-preserving loss function, along with tailored strategies to efficiently leverage blob-image pairs. To foster further research, we introduce BlobData for large-scale training and BlobBench, a benchmark for systematic evaluation. Experimental results demonstrate that BlobCtrl achieves state-of-the-art performance in a variety of element-level editing tasks, such as object addition, removal, scaling, and replacement, while maintaining computational efficiency.

BlobCtrl Capabilities

(1) Element-level Image Manipulation for precise and flexible visual content creation with fine-grained control over individual elements.

(2) Fidelity & Diversity for generating high-quality and diverse visual content.

(3) Unified Framework for seamless layout and appearance control in both generation and editing tasks.

🔧 Model & Method

(1) Blob-based representation: Treating blobs as visual primitives to disentangle layout from appearance for fine-grained, controllable object-level manipulation (a minimal code sketch follows this list).

(2) In-context dual-branch diffusion model: Separates foreground and background processing, incorporating blob representations to explicitly decouple layout and appearance.

(3) Self-supervised disentangle-then-reconstruct training: Identity-preserving loss function with tailored strategies to efficiently leverage blob-image pairs.
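To make the blob representation in (1) concrete, here is a minimal sketch that converts ellipse parameters (center, semi-axes, rotation) into a 2D Gaussian and splats it into a soft opacity map. The function names and exact parameterization are illustrative assumptions, not the released implementation.

    import numpy as np

    def blob_to_gaussian(cx, cy, a, b, theta):
        """Map ellipse parameters (center, semi-axes, rotation in radians)
        to the mean and covariance of a 2D Gaussian."""
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        # Semi-axis lengths act as standard deviations along the rotated axes.
        cov = R @ np.diag([a ** 2, b ** 2]) @ R.T
        return np.array([cx, cy]), cov

    def splat_blob(mean, cov, height, width):
        """Rasterize the Gaussian into a soft opacity map in [0, 1]."""
        ys, xs = np.mgrid[0:height, 0:width]
        coords = np.stack([xs, ys], axis=-1).astype(np.float64)  # (H, W, 2)
        diff = coords - mean
        # Per-pixel Mahalanobis distance d^T * Sigma^{-1} * d.
        maha = np.einsum('hwi,ij,hwj->hw', diff, np.linalg.inv(cov), diff)
        return np.exp(-0.5 * maha)

    # Example: a blob centered at (128, 96), semi-axes 40 and 20, rotated 30 deg.
    mean, cov = blob_to_gaussian(128, 96, 40, 20, np.deg2rad(30))
    opacity = splat_blob(mean, cov, height=192, width=256)

Under this parameterization, moving, scaling, or replacing an element reduces to editing a handful of blob parameters, while appearance is handled separately by the dual-branch model.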

📊 Data & Evaluation

(1) BlobData: Large-scale dataset (1.86M samples) with images, masks, ellipse parameters, and text descriptions.

(2) BlobBench: Benchmark of 100 curated images for evaluating element-level operations across diverse scenarios.

(3) Evaluation: Framework for assessing identity preservation, grounding accuracy, and generation quality.

Demonstration Video

😎 We recommend watching in full screen and with sound on. 😎
Watch the demo on YouTube.
BlobCtrl Capabilities
[Gallery: paired input/output images demonstrating each capability (two examples per operation): Moving, Scaling, Removal, Replacement, and Composition.]
BlobData Overview
BlobData is a large-scale dataset containing 1.86M samples sourced from BrushData, featuring images, segmentation masks, fitted ellipse parameters with derived 2D Gaussians, and descriptive texts.
The dataset curation process involves multiple steps:
• Image Filtering: We filter source images to: (1) Retain images whose shorter side exceeds 480 pixels; (2) Keep only images with valid instance segmentation masks; (3) Preserve masks covering between 0.01 and 0.9 of the total image area; (4) Exclude masks touching image boundaries.
• Parameter Extraction: For each filtered mask, we: (1) Fit ellipse parameters using OpenCV's ellipse-fitting algorithm; (2) Derive the corresponding 2D Gaussian distribution; (3) Remove degenerate samples with covariance values below 1e-5. A code sketch of the filtering and fitting steps appears after this list.
• Annotation: We generate detailed image descriptions using InternVL-2.5, providing rich textual context for each sample in the dataset.
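The sketch below illustrates the filtering and ellipse-fitting steps with OpenCV. The thresholds (480-pixel shorter side, 0.01-0.9 area ratio, 1e-5 covariance) follow the description above; the function names, contour handling, and the reading of the covariance check as a diagonal-entry threshold are assumptions.

    import cv2
    import numpy as np

    MIN_SHORT_SIDE = 480
    AREA_RATIO = (0.01, 0.9)
    MIN_COV = 1e-5

    def passes_filters(image, mask):
        """Image/mask filters described above; mask is a binary {0, 1} array."""
        if min(image.shape[:2]) <= MIN_SHORT_SIDE:
            return False
        ratio = mask.sum() / float(mask.size)
        if not (AREA_RATIO[0] <= ratio <= AREA_RATIO[1]):
            return False
        # Exclude masks touching the image boundary.
        if mask[0].any() or mask[-1].any() or mask[:, 0].any() or mask[:, -1].any():
            return False
        return True

    def extract_blob(mask):
        """Fit an ellipse to the largest mask contour and derive a 2D Gaussian;
        return None for degenerate fits."""
        contours, _ = cv2.findContours(mask.astype(np.uint8),
                                       cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
        if not contours:
            return None
        largest = max(contours, key=cv2.contourArea)
        if len(largest) < 5:  # cv2.fitEllipse needs at least 5 points
            return None
        (cx, cy), (w, h), angle_deg = cv2.fitEllipse(largest)
        theta = np.deg2rad(angle_deg)
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        # Full axis lengths from fitEllipse are halved to get semi-axes.
        cov = R @ np.diag([(w / 2) ** 2, (h / 2) ** 2]) @ R.T
        if np.diag(cov).min() < MIN_COV:  # drop near-degenerate blobs
            return None
        return {"mean": np.array([cx, cy]), "cov": cov,
                "ellipse": (cx, cy, w, h, angle_deg)}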
BlobBench Overview
BlobBench is a comprehensive benchmark containing 100 curated images evenly distributed across element-level operations (composition, movement, resizing, removal, and replacement). Each image is annotated with ellipse parameters, a foreground mask, and an expert-written text description. The benchmark spans both real-world and AI-generated images across diverse scenarios such as indoor/outdoor scenes, animals, and landscapes.
Our evaluation framework assesses multiple aspects:
• Identity Preservation: We evaluate element-level appearance preservation using: (1) CLIP-I scores to measure appearance similarity; (2) DINO scores to assess feature-level preservation between generated and reference images.
• Grounding Accuracy: We evaluate layout control by: (1) Extracting masks from generated images using SAM; (2) Fitting ellipses/bounding boxes to these masks; (3) Computing the MSE between the fitted annotations and the ground truth. A sketch of this procedure follows the list.
• Quality Metrics: We assess generation and harmonization quality using: (1) FID for distribution similarity; (2) PSNR and SSIM for pixel-level fidelity; (3) LPIPS for perceptual quality.
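As an illustration of the grounding-accuracy metric, this sketch fits ellipses to the SAM-extracted mask and the ground-truth mask and computes the MSE over the resulting parameter vectors. The normalization by image size and the parameter ordering are assumptions, not the official BlobBench protocol.

    import cv2
    import numpy as np

    def ellipse_params(mask):
        """Fit an ellipse to a binary mask (assumed non-degenerate) and return
        (cx, cy, w, h, angle) normalized by image size for scale invariance."""
        H, W = mask.shape
        contours, _ = cv2.findContours(mask.astype(np.uint8),
                                       cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
        largest = max(contours, key=cv2.contourArea)
        (cx, cy), (w, h), angle = cv2.fitEllipse(largest)
        # Note: the angle wraps at 180 degrees; a robust metric would
        # account for that equivalence.
        return np.array([cx / W, cy / H, w / W, h / H, angle / 180.0])

    def grounding_mse(pred_mask, gt_mask):
        """MSE between ellipse parameters fitted to the mask extracted from
        the generated image (e.g., via SAM) and to the ground-truth mask."""
        diff = ellipse_params(pred_mask) - ellipse_params(gt_mask)
        return float(np.mean(diff ** 2))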