IC-Custom: Diverse Image Customization via In-Context Learning

IC-Custom Team

Abstract

Image customization, a crucial technique for industrial media production, aims to generate content that is consistent with reference images. However, current approaches split the task into two separate paradigms, position-aware and position-free customization, and lack a universal framework that covers both, which limits their use across scenarios.

To overcome these limitations, we propose IC-Custom, a unified framework that seamlessly integrates position-aware and position-free image customization through in-context learning. IC-Custom concatenates reference images with target images into a polyptych, leveraging DiT's multi-modal attention mechanism for fine-grained token-level interactions. We introduce the In-context Multi-Modal Attention (ICMA) mechanism, which uses learnable task-oriented register tokens and boundary-aware positional embeddings so that the model can handle different task types correctly and distinguish the various inputs in polyptych configurations. To bridge the data gap, we curate a high-quality dataset of 12k identity-consistent samples, 8k from real-world sources and 4k from high-quality synthetic data, avoiding the overly glossy and over-saturated appearance typical of synthetic images.
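The polyptych layout described above can be illustrated with a minimal sketch for the two-panel (diptych) case. This is not the IC-Custom implementation: the function names `make_diptych` and `panel_ids`, the left-to-right panel order, and the use of nested lists as images are all assumptions made for illustration. The per-pixel panel labels only stand in for the kind of information a boundary-aware positional embedding would need to tell the reference apart from the target.

```python
from typing import List

# Hypothetical image type: a grayscale image as rows of pixel values.
Image = List[List[int]]


def make_diptych(reference: Image, target: Image) -> Image:
    """Concatenate a reference and a target image side by side.

    Assumption: the reference panel is placed on the left.
    """
    if len(reference) != len(target):
        raise ValueError("reference and target must have the same height")
    return [ref_row + tgt_row for ref_row, tgt_row in zip(reference, target)]


def panel_ids(reference: Image, target: Image) -> List[List[int]]:
    """Label every pixel by the panel it belongs to (0 = reference, 1 = target).

    Illustrative stand-in for boundary information; not the paper's embedding.
    """
    ref_width, tgt_width = len(reference[0]), len(target[0])
    return [[0] * ref_width + [1] * tgt_width for _ in reference]


reference = [[1, 2], [3, 4]]        # 2x2 reference image
target = [[0, 0, 0], [0, 0, 0]]     # 2x3 target canvas to be filled
diptych = make_diptych(reference, target)
ids = panel_ids(reference, target)
```

In this toy example the diptych is two rows of five pixels each, and `ids` marks the first two columns as reference and the last three as target. A real pipeline would work on latent tokens rather than raw pixels.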

IC-Custom supports various industrial applications, including try-on, accessory placement, furniture arrangement, and IP customization. Extensive evaluations on our proposed ProductBench and the publicly available DreamBench demonstrate that IC-Custom significantly outperforms community workflows, closed-source models such as GPT-4o, and state-of-the-art open-source approaches. IC-Custom achieves approximately 73% higher human preference across identity consistency, harmonicity, and text alignment metrics, while training only 0.4% of the original model parameters.


Position-aware Image Customization

Our model seamlessly integrates reference content into target scenes.


Position-free Image Customization

Our model creates images based on text prompts while maintaining reference identity.

Reference Image Generated Image

"Soft plush toy is joyfully wandering through a lush jungle..."

"The long-haired dachshund lies in a sunny garden..."

"A cat is lying on some ancient books..."

"...rests on a rustic wooden table, filled with fresh blueberries that glisten in the morning sunlight..."

"The bright yellow alarm clock is perched on a snowy mountain peak at sunrise..."

"The elegant vase stands in the center of a dining table..."

"A Lego figure is sitting on a weathered wooden park bench, surrounded by lush green grass and blooming flowers"

"The crocheted gingerbread man is perched on a tree branch in a dense forest..."

"The kitten is lounging on a lush, green meadow surrounded by wildflowers..."