LLM Multi-Modal

    Training Diffusion Models with RL🔗

    OpenReview: https://openreview.net/pdf/c6a24bc50ce18fe080ef17ee8b448a66bd060e63.pdf 4 Jan 2024

    1. Normalization over contrastive prompts.
    2. Prompt synthesis via LLM.
    3. Incorporating textual inconsistency into the score (computed as a distance in embedding space), to avoid prompts that are syntactically close but semantically different; see the sketch below.
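
    A minimal sketch of how such an embedding-distance term could look, assuming a generic sentence encoder (the sentence-transformers model name and example prompts below are illustrative, not the paper's choices):

    ```python
    from sentence_transformers import SentenceTransformer
    import numpy as np

    # Any general-purpose text encoder could be used; this model name is just an example.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def textual_inconsistency(text_a: str, text_b: str) -> float:
        """Cosine distance between two texts (e.g., the input prompt and a caption
        of the generated image) in embedding space."""
        a, b = encoder.encode([text_a, text_b])
        cos_sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        return 1.0 - cos_sim

    # Example usage with two hypothetical prompts:
    print(textual_inconsistency("a red cube on a blue sphere",
                                "a blue cube on a red sphere"))
    ```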

    [DPOK] RL for Fine-tuning Text-to-Image Diffusion Models🔗

    Arxiv: https://arxiv.org/abs/2305.16381 25 May 2023

    We focus on diffusion models, defining the fine-tuning task as an RL problem, and updating the pre-trained text-to-image diffusion models using policy gradients to maximize the feedback-trained reward. Our approach, coined DPOK, integrates policy optimization with KL regularization.
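
    Roughly, the KL-regularized objective can be written as below (a sketch only; notation and weighting are approximate, with z the prompt, x_{0:T} the denoising trajectory, r the learned reward, and p_pre the pre-trained model):

    ```latex
    \min_{\theta}\;
    \mathbb{E}_{p(z)}\,\mathbb{E}_{p_\theta(x_{0:T}\mid z)}
    \Big[-\alpha\, r(x_0, z)
      + \beta \sum_{t=1}^{T}
        \mathrm{KL}\big(p_\theta(x_{t-1}\mid x_t, z)\,\big\|\,p_{\text{pre}}(x_{t-1}\mid x_t, z)\big)\Big]
    ```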

    Additional data generation includes producing n-1 negative samples and leveraging a contrastive loss, as well as generating more images to increase diversity.
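
    A minimal illustration of the n-1 in-batch negatives idea with an InfoNCE-style contrastive loss (a generic formulation, not necessarily the paper's exact loss):

    ```python
    import torch
    import torch.nn.functional as F

    def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, tau: float = 0.07):
        """InfoNCE-style loss: row i of each (n, d) tensor is a positive pair,
        and the remaining n-1 rows serve as negatives."""
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / tau                       # (n, n) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)  # positives on the diagonal
        return F.cross_entropy(logits, targets)
    ```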

    In fine-tuning, the loss is an expectation over the binary human-labeled dataset plus a loss on the pre-training data (weighted by a coefficient β) to maintain the model's original quality and avoid catastrophic forgetting. Ideally the reward term would use the exact log-likelihood, but that is intractable for diffusion models, so a reward-weighted MSE (denoising) loss is minimized instead.
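
    A hedged sketch of that combined objective in PyTorch-style code; the unet signature, batch layout, and β value are assumptions for illustration, not the paper's implementation:

    ```python
    import torch.nn.functional as F

    def finetune_loss(unet, batch, pretrain_batch, beta=1.0):
        """Reward-weighted denoising MSE on labeled samples + beta * pre-training loss."""
        noisy_x, t, text_emb, noise, reward = batch       # reward: shape (batch_size,)
        pred = unet(noisy_x, t, text_emb)
        per_sample_mse = ((pred - noise) ** 2).mean(dim=(1, 2, 3))
        reward_term = (reward * per_sample_mse).mean()    # stands in for the intractable log-likelihood

        # Standard diffusion loss on pre-training data to limit catastrophic forgetting.
        p_noisy_x, p_t, p_text_emb, p_noise = pretrain_batch
        pretrain_term = F.mse_loss(unet(p_noisy_x, p_t, p_text_emb), p_noise)

        return reward_term + beta * pretrain_term
    ```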

    Setup: pre-trained Stable Diffusion 1.5, fine-tuned with a frozen CLIP text encoder; the reward model is an MLP on ViT-L/14 CLIP image/text embeddings; dataset of 2,700 prompts and 27k images, with 16k unlabeled and 625k for pre-training.

    SFT: model is updated on a fixed dataset generated by the pre-trained model.

    RL: model is updated using new samples from the previously trained model during online RL fine-tuning.

    Based on the results, adding KL regularization helps improve both image fidelity and accuracy (mostly image fidelity).

    [Point-E] A System for Generating 3D Point Clouds from Complex Prompts🔗

    Arxiv: https://arxiv.org/abs/2212.08751 16 Dec 2022 OpenAI

    In this paper, we explore an alternative method for 3D object generation which produces 3D models in only 1-2 minutes on a single GPU. Our method first generates a single synthetic view using a text-to-image diffusion model, and then produces a 3D point cloud using a second diffusion model which conditions on the generated image.

    Uses a GLIDE-based text-to-image model for the 2D view (fine-tuned on 3D renderings).
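
    A sketch of the two-stage pipeline; the model callables here are hypothetical stand-ins, not the actual point_e package API:

    ```python
    from typing import Any, Callable

    def text_to_point_cloud(
        prompt: str,
        text_to_image: Callable[[str], Any],   # GLIDE-style diffusion model fine-tuned on 3D renderings
        image_to_points: Callable[[Any], Any], # image-conditioned point-cloud diffusion model
    ) -> Any:
        # Stage 1: generate a single synthetic view of the object from the text prompt.
        image = text_to_image(prompt)
        # Stage 2: condition a second diffusion model on that view to produce the 3D point cloud.
        return image_to_points(image)
    ```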

    [CLIP] Connecting text and images🔗

    Arxiv: https://arxiv.org/abs/2103.00020 26 Feb 2021 OpenAI

    CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in our dataset. We then use this behavior to turn CLIP into a zero-shot classifier. We convert all of a dataset's classes into captions such as "a photo of a dog" and predict the class of the caption CLIP estimates best pairs with a given image.
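
    A short zero-shot classification sketch using the Hugging Face transformers CLIP classes (model name, image path, and class list are illustrative):

    ```python
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    classes = ["dog", "cat", "car"]
    captions = [f"a photo of a {c}" for c in classes]   # turn class names into captions
    image = Image.open("example.jpg")

    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    logits_per_image = model(**inputs).logits_per_image  # image-text similarity scores
    probs = logits_per_image.softmax(dim=-1)             # predicted class = best-matching caption
    print(dict(zip(classes, probs[0].tolist())))
    ```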