BlobCtrl

Taming Controllable Blob for Element-level Image Editing

Yaowei Li¹, Lingen Li³, Zhaoyang Zhang², Xiaoyu Li², Guangzhi Wang², Hongxiang Li¹, Xiaodong Cun², Ying Shan², Yuexian Zou^1*

¹Peking University, ²ARC Lab, Tencent PCG, ³The Chinese University of Hong Kong

SIGGRAPH Asia 2025

Overview: As user expectations for image editing continue to rise, the demand for flexible, fine-grained manipulation of specific visual elements presents a challenge for current diffusion-based methods. In this work, we present BlobCtrl, a framework for element-level image editing based on a probabilistic blob-based representation. Treating blobs as visual primitives, BlobCtrl disentangles layout from appearance, affording fine-grained, controllable object-level manipulation. Our key contributions are twofold: (1) an in-context dual-branch diffusion model that separates foreground and background processing, incorporating blob representations to explicitly decouple layout and appearance, and (2) a self-supervised disentangle-then-reconstruct training paradigm with an identity-preserving loss function, along with tailored strategies to efficiently leverage blob-image pairs. To foster further research, we introduce BlobData for large-scale training and BlobBench, a benchmark for systematic evaluation. Experimental results demonstrate that BlobCtrl achieves state-of-the-art performance in a variety of element-level editing tasks, such as object addition, removal, scaling, and replacement, while maintaining computational efficiency.

BlobCtrl Capabilities

(1) Element-level Image Manipulation for precise and flexible visual content creation with fine-grained control over individual elements.

(2) Fidelity & Diversity for generating high-quality and diverse visual content.

(3) Unified Framework for seamless layout and appearance control in both generation and editing tasks.

🔧 Model & Method

(1) Blob-based representation: Treating blobs as visual primitives to disentangle layout from appearance for fine-grained, controllable object-level manipulation.

(2) In-context dual-branch diffusion model: Separates foreground and background processing, incorporating blob representations to explicitly decouple layout and appearance.

(3) Self-supervised disentangle-then-reconstruct training: Identity-preserving loss function with tailored strategies to efficiently leverage blob-image pairs.

📊 Data & Evaluation

(1) BlobData: Large-scale dataset (1.86M samples) with images, masks, ellipse parameters and text descriptions;

(2) BlobBench: Benchmark with 100 curated images for evaluating element-level operations across diverse scenarios;

(3) Evaluation: Framework for assessing identity preservation, grounding accuracy and generation quality.

Demonstration Video

😎 We recommend watching in full screen and with sound on. 😎

YouTube

BlobCtrl Capabilities

Moving

Moving

Scaling

Removal

Replacement

Composition

BlobData Overview

BlobData is a large-scale dataset containing 1.86M samples sourced from BrushData, featuring images, segmentation masks, fitted ellipse parameters with derived 2D Gaussians, and descriptive texts.
The dataset curation process involves multiple steps:
• Image Filtering: We filter source images to: (1) Retain images with shorter sides exceeding 480 pixels; (2) Keep only images with valid instance segmentation masks; (3) Apply mask filtering to preserve masks with area ratios between 0.01-0.9 of total image area; (4) Exclude masks touching image boundaries.
• Parameter Extraction: For the filtered masks, we: (1) Fit ellipse parameters using OpenCV's ellipse fitting algorithm; (2) Derive corresponding 2D Gaussian distributions; (3) Remove invalid samples with covariance values below 1e-5.
Annotation: We generate detailed image descriptions using InternVL-2.5, providing rich textual context for each sample in the dataset.

BlobBench Overview

BlobBench is a comprehensive benchmark containing 100 curated images evenly distributed across different element-level operations (composition, movement, resizing, removal, and replacement). Each image is annotated with ellipse parameters, foreground masks, and expert-written text descriptions. The benchmark includes both real-world and AI-generated images across diverse scenarios like indoor/outdoor scenes, animals, and landscapes.
Our evaluation framework assesses multiple aspects:
• Identity Preservation: We evaluate element-level appearance preservation using: (1) CLIP-I scores to measure appearance similarity; (2) DINO scores to assess feature-level preservation between generated and reference images.
• Grounding Accuracy: We evaluate layout control by: (1) Extracting masks from generated images using SAM; (2) Fitting ellipses/bounding boxes to these masks; (3) Computing MSE between fitted annotations and ground truth.
• Quality Metrics: We assess generation and harmonization quality using: (1) FID for distribution similarity; (2) PSNR and SSIM for pixel-level fidelity; (3) LPIPS for perceptual quality.

BlobCtrl Structure

BlobCtrl employs a dual-branch diffusion architecture for element-level visual manipulation. The foreground branch preserves element identity and appearance through blob-based representation, enabling precise spatial control and seamless composition. It takes concatenated inputs of noisy latents and reference conditions including: (1) opacity maps from blob splatting for layout control, (2) DINO V2 features splatted with blobs for spatial-aware semantics, and (3) VAE latents for appearance preservation. The background branch takes similar inputs as the foreground branch except for the splatted features, utilizing a complete diffusion backbone with cross-attention layers to maintain scene context while harmoniously integrating foreground elements through hierarchical feature fusion.
The model is trained in a self-supervised manner where any image can serve as training data by detecting target elements and generating random source blobs. During training, we apply extensive data augmentation (color jittering, scaling, rotation, erasing, perspective changes) on foreground elements to prevent simple copy-paste solutions and ensure natural integration. We introduce an identity preservation score function that operates within foreground regions to effectively decouple the branches. We also implement strategic dropout of branch weights and semantic/VAE features to balance appearance fidelity and creative diversity. Through these carefully designed components and self-supervised training approach, BlobCtrl achieves precise element-level control while maintaining visual coherence and identity preservation.

BlobCtrl

Taming Controllable Blob for Element-level Image Editing

Demonstration Video

📢 Dataset Notice

BibTeX