DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

https://arxiv.org/pdf/2208.12242v1.pdf

Abstract

From only a few images of a subject, the method can synthesize countless images that preserve the subject's key visual features while fitting naturally into new environments.

 

Our method takes as input a few images (typically 3 − 5 images suffice, based on our experiments) of a subject (e.g., a specific dog) and the corresponding class name (e.g. “dog”), and returns a fine-tuned/“personalized” text-to-image model that encodes a unique identifier that refers to the subject. Then, at inference, we can implant the unique identifier in different sentences to synthesize the subjects in different contexts.

 

 

Method

Given 3–5 images of a specific dog and its class name as input, the method produces a personalized text-to-image model that encodes a unique identifier for the subject. At inference time, the unique identifier can be implanted in different sentences to synthesize the subject in different contexts.
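The paper personalizes Imagen, which is not publicly released, so as a rough illustration only, the sketch below shows how such a personalized model might be prompted at inference with the Hugging Face diffusers library. The model path and the token "sks" (a rare token commonly used in open-source reimplementations in place of [V]) are placeholders, not the authors' code.

```python
# Minimal sketch: prompting a DreamBooth-personalized model at inference.
# "path/to/dreambooth-model" and the identifier token "sks" are
# illustrative placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/dreambooth-model", torch_dtype=torch.float16
).to("cuda")

# Implant the unique identifier in different sentences to place the
# same subject in different contexts.
prompts = [
    "a sks dog in the Acropolis",
    "a sks dog swimming",
    "a sks dog sleeping on a red blanket",
]
for p in prompts:
    image = pipe(p).images[0]
    image.save(f"{p.replace(' ', '_')}.png")
```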

 

Figure 4: Fine-tuning. Given ∼ 3 − 5 images of a subject we fine-tune a text-to-image diffusion model in two steps: (a) fine-tuning the low-resolution text-to-image model with the input images paired with a text prompt containing a unique identifier and the name of the class the subject belongs to (e.g., “A [V] dog”); in parallel, we apply a class-specific prior preservation loss, which leverages the semantic prior that the model has on the class and encourages it to generate diverse instances belonging to the subject’s class using the class name in a text prompt (e.g., “A dog”). (b) fine-tuning the super-resolution components with pairs of low-resolution and high-resolution images taken from our input images set, which enables us to maintain high fidelity to small details of the subject.

The text-to-image diffusion model is fine-tuned in two steps. First, the low-resolution model is trained on the input images paired with a prompt containing the unique identifier and the name of the subject's class; in parallel, a class-specific prior preservation loss is applied, prompting with the class name alone so the model keeps generating diverse instances of that class. Second, the super-resolution components are fine-tuned on pairs of low- and high-resolution images taken from the input set, which maintains high fidelity to the subject's fine details.
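The prior preservation loss pairs the usual denoising loss on the subject images (prompted with “A [V] dog”) with a second denoising term on class images (prompted with “A dog”). Below is a minimal PyTorch sketch of this combined objective in the style of diffusers; `unet`, `noise_scheduler`, and the latent/embedding tensors are assumed inputs for illustration, not the authors' actual training code.

```python
# Sketch of DreamBooth's prior-preservation objective, assuming an
# epsilon-prediction diffusion model in the diffusers style. All
# arguments are placeholders supplied by the training loop.
import torch
import torch.nn.functional as F

def dreambooth_loss(unet, noise_scheduler,
                    subject_latents, subject_emb,   # "A [V] dog" batch
                    class_latents, class_emb,       # "A dog" batch
                    prior_weight=1.0):
    """Denoising loss on subject images plus a class-specific
    prior-preservation term on class images."""
    def denoise_loss(latents, text_emb):
        noise = torch.randn_like(latents)
        t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                          (latents.shape[0],), device=latents.device)
        noisy = noise_scheduler.add_noise(latents, noise, t)
        pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
        return F.mse_loss(pred, noise)

    subject_term = denoise_loss(subject_latents, subject_emb)
    prior_term = denoise_loss(class_latents, class_emb)
    return subject_term + prior_weight * prior_term
```

Per the paper, the class images for the prior term are themselves generated by the frozen pretrained model from the class-name prompt before fine-tuning begins.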

 

Results

Figure 5: Recontextualization of a backpack, vase, and teapot subject instances. By fine-tuning a model with our approach, we are able to generate images of the subject instance in different environments, with high preservation of subject details and realistic interaction between the scene and the subject. We display the conditioning prompts below each image. Image credit (input images): Unsplash.
Figure 6: Artistic renderings of a dog instance in the style of famous painters. We remark that many of the generated poses, e.g., the Michelangelo renditions, were not seen in the training set. We also note that some renditions seem to have novel compositions and faithfully imitate the style of the painter. Image credit (input images): Unsplash.

The dog instance is rendered artistically in the styles of famous painters. Many of the generated poses were never seen in the training set, and some renditions show novel compositions while faithfully imitating each painter's style.

 

Figure 8: Text-guided view synthesis. Our technique can synthesize images with specified viewpoints for a subject cat (left to right: top, bottom, side, and back views). Note that the generated poses are different from the input poses, and the background changes in a realistic manner given a pose change. We also highlight the preservation of complex fur patterns on the subject cat’s forehead. Image credit (input images): Unsplash.

Images of the subject can be synthesized from specified viewpoints. The generated poses differ from the input poses, and the background changes realistically to match each pose change.

 

Figure 10: Modification of subject properties while preserving their key features. We show color modifications in the first row (using prompts “a [color] [V] car”), and crosses between a specific dog and different animals in the second row (using prompts “a cross of a [V] dog and a [target species]”). We highlight the fact that our method preserves unique visual features that give the subject its identity or essence, while performing the required property modification. Image credit (input images): Unsplash.

The figure shows color modifications of the car and crosses between the specific dog and other animal species, with the subject's unique identifying features preserved throughout.
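Taken together, Figures 5–10 all condition on simple prompt templates built around the unique identifier. The helper below is hypothetical, collecting the templates quoted in the captions; "sks" again stands in for [V].

```python
# Hypothetical helper reproducing the prompt templates from the
# figure captions; "sks" stands in for the unique identifier [V].
def dreambooth_prompt(kind: str, cls: str = "dog", **kw) -> str:
    templates = {
        # Fig. 5: recontextualization in a new environment
        "recontext": f"a sks {cls} {kw.get('context', '')}",
        # Fig. 6: artistic renderings in a painter's style
        "style": f"a painting of a sks {cls} in the style of {kw.get('artist', '')}",
        # Fig. 8: text-guided view synthesis
        "view": f"a sks {cls}, {kw.get('view', '')} view",
        # Fig. 10, row 1: "a [color] [V] car"
        "color": f"a {kw.get('color', '')} sks {cls}",
        # Fig. 10, row 2: "a cross of a [V] dog and a [target species]"
        "cross": f"a cross of a sks {cls} and a {kw.get('species', '')}",
    }
    return templates[kind]

print(dreambooth_prompt("color", cls="car", color="purple"))
# -> "a purple sks car"
print(dreambooth_prompt("cross", species="panda"))
# -> "a cross of a sks dog and a panda"
```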

 

 
