BlobCtrl

A Unified and Flexible Framework for Element-level Image Generation and Editing

Yaowei Li1, Lingen Li3, Zhaoyang Zhang2, Xiaoyu Li2, Guangzhi Wang2, Hongxiang Li1, Xiaodong Cun2, Ying Shan2, Yuexian Zou1*

1Peking University, 2ARC Lab, Tencent PCG, 3The Chinese University of Hong Kong

Under Review

Overview: Element-level visual manipulation is essential in digital content creation, but current diffusion-based methods lack the precision and flexibility of traditional tools. In this work, we introduce BlobCtrl, a framework that unifies element-level generation and editing using a probabilistic blob-based representation. By employing blobs as visual primitives, our approach effectively decouples and represents spatial location, semantic content, and identity information, enabling precise element-level manipulation. Our key contributions include: 1) a dual-branch diffusion architecture with hierarchical feature fusion for seamless foreground-background integration; 2) a self-supervised training paradigm with tailored data augmentation and score functions; and 3) controllable dropout strategies to balance fidelity and diversity. To support further research, we introduce BlobData for large-scale training and BlobBench for systematic evaluation. Experiments show that BlobCtrl excels in various element-level manipulation tasks, offering a practical solution for precise and flexible visual content creation.

BlobCtrl Capabilities

(1) Element-level Image Manipulation for precise and flexible visual content creation with fine-grained control over individual elements.

(2) Fidelity & Diversity for generating high-quality and diverse visual content.

(3) Unified Framework for seamless layout and appearance control in both generation and editing tasks.

🔧 Model & Method                              

(1) Blob-based representation: Decouples spatial location, semantic content, and identity information for flexible manipulation (see the sketch after this list).

(2) Dual-branch diffusion model: Seamlessly integrates foreground and background information with hierarchical feature fusion.

(3) Self-supervised training: Tailored data augmentation and score functions, combined with controllable dropout strategies.
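For intuition, here is a minimal sketch of the blob representation: an ellipse's parameters are mapped to a 2D Gaussian and rasterized into a dense spatial map. The function names (`blob_to_gaussian`, `splat_blob`) and the 1-sigma contour convention are our own illustrative assumptions, not BlobCtrl's actual interface.

```python
import numpy as np

def blob_to_gaussian(cx, cy, a, b, theta):
    """Map ellipse parameters (center, semi-axes, rotation) to a 2D Gaussian.

    Returns the mean and covariance of the Gaussian whose 1-sigma contour
    matches the ellipse. Names and conventions here are illustrative only.
    """
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    S = np.diag([a ** 2, b ** 2])          # axis-aligned variances
    cov = R @ S @ R.T                      # rotate into image coordinates
    return np.array([cx, cy]), cov

def splat_blob(mean, cov, height, width):
    """Rasterize the Gaussian into a dense [H, W] opacity map."""
    ys, xs = np.mgrid[0:height, 0:width]
    coords = np.stack([xs, ys], axis=-1).reshape(-1, 2) - mean
    inv = np.linalg.inv(cov)
    maha = np.einsum('ni,ij,nj->n', coords, inv, coords)  # Mahalanobis distance
    return np.exp(-0.5 * maha).reshape(height, width)

# A blob centered at (64, 48) with semi-axes 20 and 10, rotated 30 degrees.
mean, cov = blob_to_gaussian(64, 48, 20, 10, np.deg2rad(30))
opacity = splat_blob(mean, cov, height=96, width=128)
```

Because location and shape live entirely in the ellipse parameters while appearance is carried separately, each factor can be edited independently, which is what enables element-level control.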

📊 Data & Evaluation                

(1) BlobData: Large-scale dataset (1.86M samples) with images, masks, ellipse parameters, and text descriptions.

(2) BlobBench: Benchmark with 100 curated images for evaluating element-level operations across diverse scenarios.

(3) Evaluation: Framework for assessing identity preservation, grounding accuracy, and generation quality.

Demonstration Video

😎 We recommend watching in full screen and with sound on. 😎
BlobCtrl Capabilities
[Image gallery: paired input/output examples for each element-level operation (Moving, Scaling, Removal, Replacement, Composition).]
BlobData Overview
BlobData is a large-scale dataset containing 1.86M samples sourced from BrushData, featuring images, segmentation masks, fitted ellipse parameters with derived 2D Gaussians, and descriptive texts.
The dataset curation process involves multiple steps:
• Image Filtering: We filter source images as follows: (1) Retain images whose shorter side exceeds 480 pixels; (2) Keep only images with valid instance segmentation masks; (3) Preserve masks whose area covers between 0.01 and 0.9 of the total image area; (4) Exclude masks touching image boundaries.
• Parameter Extraction: For the filtered masks, we: (1) Fit ellipse parameters using OpenCV's ellipse fitting algorithm; (2) Derive corresponding 2D Gaussian distributions; (3) Remove invalid samples with covariance values below 1e-5. A sketch of these two steps follows this list.
• Annotation: We generate detailed image descriptions using InternVL-2.5, providing rich textual context for each sample in the dataset.
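As a rough sketch of the filtering and parameter-extraction steps above: the thresholds come from the text, but the helper name `curate_sample`, the binary-mask input format, and our reading of the 1e-5 covariance filter (as a minimum-eigenvalue check) are assumptions.

```python
import cv2
import numpy as np

def curate_sample(image, mask, min_side=480, min_ratio=0.01, max_ratio=0.9):
    """Apply BlobData-style filters and fit an ellipse to a binary mask.

    Returns (ellipse, covariance) or None if the sample is rejected.
    Illustrative helper; only the thresholds follow the paper.
    """
    h, w = mask.shape
    if min(image.shape[:2]) <= min_side:          # shorter side must exceed 480 px
        return None
    ratio = mask.sum() / float(h * w)
    if not (min_ratio <= ratio <= max_ratio):     # mask area ratio in [0.01, 0.9]
        return None
    if mask[0].any() or mask[-1].any() or mask[:, 0].any() or mask[:, -1].any():
        return None                               # exclude boundary-touching masks

    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    if not contours:
        return None
    pts = max(contours, key=cv2.contourArea)
    if len(pts) < 5:                              # cv2.fitEllipse needs >= 5 points
        return None
    (cx, cy), (major, minor), angle = cv2.fitEllipse(pts)  # axes are full lengths

    # Derive a 2D Gaussian whose 1-sigma contour matches the fitted ellipse.
    theta = np.deg2rad(angle)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    cov = R @ np.diag([(major / 2) ** 2, (minor / 2) ** 2]) @ R.T
    if np.linalg.eigvalsh(cov).min() < 1e-5:      # drop near-degenerate Gaussians
        return None
    return ((cx, cy), (major, minor), angle), cov
```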
BlobBench Overview
BlobBench is a comprehensive benchmark containing 100 curated images evenly distributed across different element-level operations (composition, movement, resizing, removal, and replacement). Each image is annotated with ellipse parameters, foreground masks, and expert-written text descriptions. The benchmark includes both real-world and AI-generated images across diverse scenarios like indoor/outdoor scenes, animals, and landscapes.
Our evaluation framework assesses multiple aspects:
• Identity Preservation: We evaluate element-level appearance preservation using: (1) CLIP-I scores to measure appearance similarity; (2) DINO scores to assess feature-level preservation between generated and reference images.
• Grounding Accuracy: We evaluate layout control by: (1) Extracting masks from generated images using SAM; (2) Fitting ellipses/bounding boxes to these masks; (3) Computing the mean squared error (MSE) between the fitted parameters and ground-truth annotations (see the sketch below).
• Quality Metrics: We assess generation and harmonization quality using: (1) FID for distribution similarity; (2) PSNR and SSIM for pixel-level fidelity; (3) LPIPS for perceptual quality.
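To make the grounding metric concrete, here is a minimal sketch under our own assumptions: the helper names (`ellipse_params`, `grounding_mse`) and the normalization of center/axes by image size are ours; the text specifies only that masks are extracted with SAM, ellipses are fitted, and an MSE against ground truth is computed.

```python
import numpy as np

def ellipse_params(ellipse):
    """Flatten an OpenCV-style ellipse ((cx, cy), (major, minor), angle)
    into a parameter vector for comparison."""
    (cx, cy), (major, minor), angle = ellipse
    return np.array([cx, cy, major, minor, angle], dtype=np.float64)

def grounding_mse(pred_ellipse, gt_ellipse, image_size):
    """MSE between predicted and ground-truth ellipse parameters.

    Normalizing positions/axes by image size and the angle by 180 degrees
    keeps all terms on a comparable scale; this choice is our assumption.
    """
    h, w = image_size
    scale = np.array([w, h, w, h, 180.0])   # cv2.fitEllipse angles lie in [0, 180)
    p = ellipse_params(pred_ellipse) / scale
    g = ellipse_params(gt_ellipse) / scale
    return float(np.mean((p - g) ** 2))

# Usage: fit an ellipse to a SAM-extracted mask (e.g. with cv2.fitEllipse,
# as in the curation sketch) and compare it against the benchmark annotation.
score = grounding_mse(((64, 48), (40, 20), 30.0),
                      ((60, 50), (42, 18), 25.0),
                      image_size=(96, 128))
```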