Skip to content

adds support for torch.Tensor input in inpaint pipeline#1128

Closed
vvvm23 wants to merge 4 commits into
huggingface:mainfrom
vvvm23:inpainting-pipeline-tensor-input-fix
Closed

adds support for torch.Tensor input in inpaint pipeline#1128
vvvm23 wants to merge 4 commits into
huggingface:mainfrom
vvvm23:inpainting-pipeline-tensor-input-fix

Conversation

@vvvm23
Copy link
Copy Markdown
Contributor

@vvvm23 vvvm23 commented Nov 3, 2022

I took the inpaint pipeline example from the README and used with a tensor input using the following code:

import PIL
import requests
import torch
from io import BytesIO

from diffusers import StableDiffusionInpaintPipeline
from torchvision.transforms.functional import to_tensor # <<<<<< NEW

def download_image(url):
    response = requests.get(url)
    return PIL.Image.open(BytesIO(response.content)).convert("RGB")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))

init_image = to_tensor(init_image) # <<<<<< NEW
mask_image = to_tensor(mask_image)[0] # <<<<<< NEW

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    revision="fp16",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "A cute rabbit, high resolution, sitting on a park bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]

but this fails with an error when calling image = np.array(image.convert("RGB")).
I don't think this is expected as the docstrings make reference to a tensor input.

This PR changes it to check if the input is a torch.Tensor and if it is, converts it to a PIL.Image.Image.

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

import torch

import PIL
from torchvision.transforms.functional import to_pil_image
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry we cannot use this here -> we don't have a dependency on torchvision . Could you instead use the pipe.numpy_to_pil function?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, I thought it would be OK as it is listed as a dependency here (though maybe I have misunderstood this file. I will change this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @vvvm23,

That's a good point. The list you linked as a list of "soft-dependencies" which are not required for core functionality but for certain use cases which in the case of torchvision is training.

The "hard-dependencies" can be found here:

install_requires = [

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could instead make prepare_mask_and_masked_image a method of the pipeline and instead use self.numpy_to_pil?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it being in or out the pipeline would work, as numpy to pil is a static method so can be accessed without actually passing a reference to the pipe.

I prefer it as a method of the pipe I'll try and get something to you tomorrow ~

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still cannot have a dependency on torchvision though ;-)

Copy link
Copy Markdown
Contributor Author

@vvvm23 vvvm23 Nov 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the torchvision dependency.

I noticed also the docstrings make reference to accepting batched inputs. So I'll add that too soon.

edit: I decided against using numpy_to_pil after experimenting with that approach.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, not sure if adding batching is easily doable here. Might be worth having a different issue for that. Can you take a look and see what you think?

I'll fix the conflicts for what I have so far.

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Copy link
Copy Markdown
Member

@pcuenca pcuenca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach looks fine, but I feel we need some clarification on the expected input shapes for the image and the mask - the docstrings mention "batches" and shapes like (B, H, W), but in the code we are assuming a single image and a single mask.

One option is to accept batches too. In this case, the batch dimension of both image and mask must match the cardinality of the prompts list. This would be useful for multiple in-painting tasks (for the same or different prompts), but it'd add some complexity.

If we keep the current behaviour (single image and mask, but optionally several prompts or predictions), then we need to clarify that in the docstrings :)

Thanks a lot!

repainted, while black pixels will be preserved. If `mask_image` is a PIL image, it will be converted
to a single channel (luminance) before use. If it's a tensor, it should contain one color channel (L)
instead of 3, so the expected shape would be `(B, H, W, 1)`.
to a single channel (luminance) before use. If it's a tensor, it should be of shape `(B, H, W)`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this. The code in _prepare_mask_and_masked_image unconditionally prepends a batch dimension to the image. We also need to ensure that the mask has a single channel, so the shape in my opinion would have to be (H, W, 1) and then adapt the permutes accordingly.

We also need to document the expected shape of the image itself: (H, W, 3), I guess?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused too by the docstrings. Before I even began work it said it supported batches and tensor inputs, but I couldn't get either to work.

I disagree about the mask input, I feel it would be cleaner to implement with it being shape (H, W). But if there is some convention I am unaware of then I can change it.

I think for now I can update the doc strings for the single image case, then create a separate issue for multi batch. Would you agree? I have a local fork that adds batching to _prepare_mask_and_masked_image, but it fails further on in the code. It seems more involved and probably needs a separate issue.

@vvvm23
Copy link
Copy Markdown
Contributor Author

vvvm23 commented Nov 14, 2022

@patrickvonplaten just noticed #1003 also handles this issue, what is the status on this?

@patrickvonplaten
Copy link
Copy Markdown
Contributor

Hey @vvvm23,

Thanks for the PR!
Linking this to #1003 as well ->

We need a test for these changes, as they touch highly used pipelines.
Could you add a code snippet show-casing which use case is currently not possible, but will be enabled by this PR?

@vvvm23
Copy link
Copy Markdown
Contributor Author

vvvm23 commented Nov 17, 2022

I can handle that later today – but won't this just do the same thing as #1003? Maybe better to just use #1003 🤔 I'll defer to you though.

@patrickvonplaten
Copy link
Copy Markdown
Contributor

Hey @vvvm23,

Thanks for your answer, we just merged #1003 - would it be ok to close this one? 😅

@vvvm23
Copy link
Copy Markdown
Contributor Author

vvvm23 commented Nov 21, 2022

Yes totally fine! #1003 did what this does and then some, so it makes sense ~

@vvvm23 vvvm23 closed this Nov 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants