Overcoming token size limitations, custom model loading, LoRA support, textual inversion support, and more
Stable Diffusion WebUI from AUTOMATIC1111 has proven to be a powerful tool for generating high-quality images using the Diffusion model. However, while the WebUI is easy to use, data scientists, machine learning engineers, and researchers often require more control over the image generation process. That is where the diffusers package from Hugging Face comes in, providing a way to run the Diffusion model in Python and allowing users to customize their models and prompts to generate images tailored to their specific needs.
Despite its potential, the Diffusers package has several limitations that prevent it from producing images as good as those produced by the Stable Diffusion WebUI. The most significant of these limitations include:
- The inability to use custom models in the .safetensors file format;
- The 77-prompt-token limitation;
- A lack of LoRA support;
- The absence of image upscaling functionality (known as HighRes in Stable Diffusion WebUI);
- Low performance and high VRAM usage by default.
This article aims to address these limitations and enable the Diffusers package to generate high-quality images comparable to those produced by the Stable Diffusion WebUI. With the enhancement solutions provided, data scientists, machine learning engineers, and researchers can enjoy greater control and flexibility in their image generation processes while also achieving exceptional results. In the following sections, we will explore the various techniques and methods that can be used to overcome these limitations and unlock the full potential of the Diffusers package.
Note: if this is your first time running Stable Diffusion, please follow this link to install all required CUDA and Python packages.
1. Load Local Model Files in .safetensors Format
Users can easily spin up Diffusers to generate an image like this:
from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipeline.to("cuda")
image = pipeline("A cute cat playing piano").images[0]
image.save("image_of_cat_playing_piano.png")
You may not be satisfied with either the output image or the performance. Let's deal with the problems one by one. First, let's load a custom model in .safetensors format located anywhere on your machine. You cannot simply load the model file like this:
pipeline = DiffusionPipeline.from_pretrained("/mannequin/custom_model.safetensors")
Here are the detailed steps to convert a .safetensors file to the Diffusers format:
Step 1. Pull all the Diffusers code from GitHub:
git clone https://github.com/huggingface/diffusers.git
Step 2. Under the scripts folder, locate the file convert_original_stable_diffusion_to_diffusers.py.
In your terminal, run this command to convert the .safetensors file to the Diffusers format. Remember to change the --checkpoint_path value to match your case.
python convert_original_stable_diffusion_to_diffusers.py --from_safetensors --checkpoint_path="D:\stable-diffusion-webui\models\Stable-diffusion\deliberate_v2.safetensors" --dump_path='D:\sd_models\deliberate_v2' --device='cuda:0'
Step 3. Now you can load the pipeline using the newly converted model files; here is the complete code:
from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained(
r"D:sd_modelsdeliberate_v2"
)
pipeline.to("cuda")
image = pipeline("A cute cat playing piano").images[0]
image.save("image_of_cat_playing_piano.png")
You should be able to convert and use any models you download from Hugging Face or civitai.com.
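Note that, depending on your Diffusers version, you may be able to skip the conversion step entirely. Below is a minimal sketch assuming a newer Diffusers release that exposes StableDiffusionPipeline.from_single_file; if your installed version predates it, use the conversion script above instead.
import torch
from diffusers import StableDiffusionPipeline

# Assumes a Diffusers release that provides from_single_file;
# older releases require the conversion script shown above.
pipeline = StableDiffusionPipeline.from_single_file(
    r"D:\stable-diffusion-webui\models\Stable-diffusion\deliberate_v2.safetensors"
)
pipeline.to("cuda")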
2. Boost the Performance of Diffusers
Generating high-quality images can be a time-consuming process even for the latest 3xxx and 4xxx Nvidia RTX GPUs. By default, the Diffusers package comes with non-optimized settings. Two solutions can be applied to greatly boost performance.
Here is the iteration speed before applying the following solutions: only about 2.x iterations per second on an RTX 3070 Ti (8 GB VRAM) to generate a 512×512 image.
- Use Half Precision Weights
The first solution is to use half-precision weights. Half-precision weights use 16-bit floating-point numbers instead of the usual 32-bit numbers. This reduces the memory required for storing weights and speeds up computation, which can significantly improve the performance of the Diffusers package.
According to this video, reducing float precision from FP32 to FP16 will also enable the Tensor Cores.
I wrote another article testing how much GPU Tensor Cores can boost computation speed.
Here is how to enable FP16 in Diffusers. Adding just two lines of code boosts performance by 500%, with almost no impact on image quality.
from diffusers import DiffusionPipeline
import torch # <----- Line 1 added
pipeline = DiffusionPipeline.from_pretrained(
r"D:sd_modelsdeliberate_v2"
,torch_dtype = torch.float16 # <----- Line 2 Added
)
pipeline.to("cuda")
image = pipeline("A cute cat playing piano").images[0]
image.save("image_of_cat_playing_piano.png")
Now the iteration speed jumps to 10.x iterations per second, about 5 times faster.
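If you want to verify the speedup on your own machine, here is a minimal timing sketch (assuming the converted model folder from section 1 and a fixed number of inference steps; the it/s figure in the Diffusers progress bar is also a quick reference):
import time
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2",  # your converted model folder
    torch_dtype=torch.float16,
)
pipeline.to("cuda")

steps = 25
start = time.perf_counter()
pipeline("A cute cat playing piano", num_inference_steps=steps)
torch.cuda.synchronize()  # wait for all GPU work to finish before reading the clock
elapsed = time.perf_counter() - start
print(f"{steps / elapsed:.2f} iterations per second (rough, includes pipeline overhead)")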
- Use Xformers
Xformers is an open-source library that provides a set of high-performance transformer building blocks for various natural language processing (NLP) tasks. It is built on top of PyTorch and aims to offer efficient and scalable transformer models that can be easily integrated into existing NLP pipelines. (Nowadays, are there any models that don't use a Transformer? :P)
Install Xformers with pip install xformers, then we can easily switch Diffusers to use xformers with one line of code.
...
pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention() <--- one line added
...
This one line of code boosts performance by another 20%.
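Putting the two optimizations together, a typical optimized setup looks like this (a sketch reusing the converted model path from section 1):
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    r"D:\sd_models\deliberate_v2",  # your converted model folder
    torch_dtype=torch.float16,      # half-precision weights
)
pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention()  # xformers attention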
3. Remove the 77-Prompt-Token Limitation
In the current version of Diffusers, there is a limitation of 77 prompt tokens that can be used in the generation of images.
Fortunately, there is a solution to this problem. By using the community-provided "lpw_stable_diffusion" pipeline, you can unlock the 77-prompt-token limitation and generate high-quality images with longer prompts.
To use the "lpw_stable_diffusion" pipeline, you can use the following code:
pipeline = DiffusionPipeline.from_pretrained(
model_path,
custom_pipeline="lpw_stable_diffusion", #<--- code added
torch_dtype=torch.float16
)
In this code, we initialize a new DiffusionPipeline object using the "from_pretrained" method, specifying the path to the pre-trained model and setting the "custom_pipeline" argument to "lpw_stable_diffusion". This tells Diffusers to use the "lpw_stable_diffusion" pipeline, which unlocks the 77-prompt-token limitation.
Now, let's use a longer prompt string to test it out. Here is the complete code:
from diffusers import DiffusionPipeline
import torch
pipeline = DiffusionPipeline.from_pretrained(
r"D:sd_modelsdeliberate_v2"
,custom_pipeline = "lpw_stable_diffusion" #<--- code added
,torch_dtype = torch.float16
)
pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention()
immediate = """
Babel tower falling down, strolling on the starlight, dreamy extremely broad shot
, atmospheric, hyper reasonable, epic composition, cinematic, octane render
, artstation panorama vista images by Carr Clifton & Galen Rowell, 16K decision
, Panorama veduta photograph by Dustin Lefevre & tdraw, detailed panorama portray by Ivan Shishkin
, DeviantArt, Flickr, rendered in Enscape, Miyazaki, Nausicaa Ghibli, Breath of The Wild
, 4k detailed put up processing, artstation, rendering by octane, unreal engine
"""
picture = pipeline(immediate).photographs[0]
picture.save("goodbye_babel_tower.png")
And you will get an image like this:
You may still see a warning message like: Token indices sequence length is longer than the specified maximum sequence length for this model (*** > 77). Running this sequence through the model will result in indexing errors.
It's normal; you can just ignore it.
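To see where the number 77 comes from, you can count tokens with the pipeline's own CLIP tokenizer. A minimal sketch, assuming the pipeline and prompt objects from the snippet above:
# Count how many CLIP tokens the prompt uses; the CLIP text encoder's
# context window is 77 tokens, including the special start/end tokens.
token_ids = pipeline.tokenizer(prompt).input_ids
print(f"{len(token_ids)} tokens (model limit: {pipeline.tokenizer.model_max_length})")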
4. Use Custom LoRA with Diffusers
Despite the claims of LoRA support in Diffusers, users still face limitations when it comes to loading local LoRA files in the .safetensors file format. This can be a significant obstacle for users who want to use community LoRAs.
To overcome this limitation, I created a function that allows users to load a LoRA file with a weight number in real time. This function can be used to load LoRA files and their corresponding weights into a Diffusers model, enabling the generation of high-quality images with LoRA data.
Here is the function body:
import torch
from safetensors.torch import load_file

def __load_lora(
    pipeline
    ,lora_path
    ,lora_weight = 0.5
):
    state_dict = load_file(lora_path)
    LORA_PREFIX_UNET = 'lora_unet'
    LORA_PREFIX_TEXT_ENCODER = 'lora_te'
    alpha = lora_weight

    visited = []

    # directly update weight in diffusers model
    for key in state_dict:
        # we have set the alpha beforehand, so just skip
        if '.alpha' in key or key in visited:
            continue

        if 'text' in key:
            layer_infos = key.split('.')[0].split(LORA_PREFIX_TEXT_ENCODER + '_')[-1].split('_')
            curr_layer = pipeline.text_encoder
        else:
            layer_infos = key.split('.')[0].split(LORA_PREFIX_UNET + '_')[-1].split('_')
            curr_layer = pipeline.unet

        # find the target layer
        temp_name = layer_infos.pop(0)
        while len(layer_infos) > -1:
            try:
                curr_layer = curr_layer.__getattr__(temp_name)
                if len(layer_infos) > 0:
                    temp_name = layer_infos.pop(0)
                elif len(layer_infos) == 0:
                    break
            except Exception:
                if len(temp_name) > 0:
                    temp_name += '_' + layer_infos.pop(0)
                else:
                    temp_name = layer_infos.pop(0)

        # org_forward(x) + lora_up(lora_down(x)) * multiplier
        pair_keys = []
        if 'lora_down' in key:
            pair_keys.append(key.replace('lora_down', 'lora_up'))
            pair_keys.append(key)
        else:
            pair_keys.append(key)
            pair_keys.append(key.replace('lora_up', 'lora_down'))

        # update weight; the matmul is done in float32, then cast back to
        # the layer dtype so the in-place add also works for fp16 pipelines
        if len(state_dict[pair_keys[0]].shape) == 4:
            weight_up = state_dict[pair_keys[0]].squeeze(3).squeeze(2).to(torch.float32)
            weight_down = state_dict[pair_keys[1]].squeeze(3).squeeze(2).to(torch.float32)
            update = alpha * torch.mm(weight_up, weight_down).unsqueeze(2).unsqueeze(3)
        else:
            weight_up = state_dict[pair_keys[0]].to(torch.float32)
            weight_down = state_dict[pair_keys[1]].to(torch.float32)
            update = alpha * torch.mm(weight_up, weight_down)
        curr_layer.weight.data += update.to(curr_layer.weight.data.dtype)

        # update visited list
        for item in pair_keys:
            visited.append(item)

    return pipeline
The logic is extracted from convert_lora_safetensor_to_diffusers.py in the Diffusers git repo.
Take one of the well-known LoRAs, MoXin, for example. You can use the __load_lora function like this:
from diffusers import DiffusionPipeline
import torch
pipeline = DiffusionPipeline.from_pretrained(
r"D:sd_modelsdeliberate_v2"
,custom_pipeline = "lpw_stable_diffusion"
,torch_dtype = torch.float16
)
lora = (r"D:sd_modelsLoraMoxin_10.safetensors",0.8)
pipeline = __load_lora(pipeline=pipeline,lora_path=lora[0],lora_weight=lora[1])
pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention()immediate = """
shukezouma,destructive house,shuimobysim
a department of flower, conventional chinese language ink portray
"""
picture = pipeline(immediate).photographs[0]
picture.save("a department of flower.png")
The prompt will generate an image like this:
You can call __load_lora() multiple times to load several LoRAs for one generation, as in the sketch below.
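A minimal sketch, assuming the __load_lora function defined above and a hypothetical second LoRA file:
# Hypothetical LoRA paths and weights; replace them with your own files.
loras = [
    (r"D:\sd_models\Lora\Moxin_10.safetensors", 0.8),
    (r"D:\sd_models\Lora\another_style.safetensors", 0.5),
]
for lora_path, lora_weight in loras:
    pipeline = __load_lora(pipeline=pipeline, lora_path=lora_path, lora_weight=lora_weight)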
With this function, you can now load LoRA files with weight numbers in real time and use them to generate high-quality images with Diffusers. Loading a LoRA is pretty fast, usually taking only 1–2 seconds, which is far better than converting and merging (which generates another gigabyte-sized model file).
5. Use Custom Textual Inversions with Diffusers
Using a custom Textual Inversion with the Diffusers package can be a powerful way to generate high-quality images. However, the official Diffusers documentation suggests that users need to train their own Textual Inversions, which can take up to an hour on a V100 GPU. This may not be practical for many users who want to generate images quickly.
So I investigated and found a solution that enables Diffusers to use a textual inversion just like in Stable Diffusion WebUI. Below is the function I created to load a custom Textual Inversion.
def load_textual_inversion(
    learned_embeds_path
    , text_encoder
    , tokenizer
    , token = None
    , weight = 0.5
):
    '''
    Use this function to load a textual inversion model
    at the model initialization stage or the image generation stage.
    '''
    loaded_learned_embeds = torch.load(learned_embeds_path, map_location="cpu")
    string_to_token = loaded_learned_embeds['string_to_token']
    string_to_param = loaded_learned_embeds['string_to_param']

    # separate token and the embeds
    trained_token = list(string_to_token.keys())[0]
    embeds = string_to_param[trained_token]
    embeds = embeds[0] * weight

    # cast to dtype of text_encoder
    dtype = text_encoder.get_input_embeddings().weight.dtype
    embeds = embeds.to(dtype)

    # add the token to the tokenizer
    token = token if token is not None else trained_token
    num_added_tokens = tokenizer.add_tokens(token)
    if num_added_tokens == 0:
        raise ValueError(f"The tokenizer already contains the token {token}. Please pass a different `token` that is not already in the tokenizer.")

    # resize the token embeddings
    text_encoder.resize_token_embeddings(len(tokenizer))

    # get the id for the token and assign the embeds
    token_id = tokenizer.convert_tokens_to_ids(token)
    text_encoder.get_input_embeddings().weight.data[token_id] = embeds
    return (tokenizer, text_encoder)
In the load_textual_inversion() function, you need to provide the following arguments:
- learned_embeds_path: Path to the pre-trained textual inversion model file in .pt or .bin format.
- text_encoder: Text encoder object obtained from the Diffusion Pipeline.
- tokenizer: Tokenizer object obtained from the Diffusion Pipeline.
- token: Optional argument specifying the prompt token. By default, it is set to None. It is the keyword that will trigger the textual inversion in your prompt.
- weight: Optional argument specifying the weight of the textual inversion. By default, I set it to 0.5. You can change it to another value as needed.
You can now use the function with a Diffusers pipeline like this:
from diffusers import DiffusionPipeline
import torch
pipeline = DiffusionPipeline.from_pretrained(
r"D:sd_modelsdeliberate_v2"
,custom_pipeline = "lpw_stable_diffusion"
,torch_dtype = torch.float16
,safety_checker = None
)
textual_inversion_path = r"D:\sd_models\embeddings\style-empire.pt"
tokenizer = pipeline.tokenizer
text_encoder = pipeline.text_encoder
load_textual_inversion(
learned_embeds_path = textual_inversion_path
, tokenizer = tokenizer
, text_encoder = text_encoder
, token = 'styleempire'
)
pipeline.to("cuda")
pipeline.enable_xformers_memory_efficient_attention()
immediate = """
styleempire,award successful stunning avenue, storm,((darkish storm clouds))
, fluffy clouds within the sky, shaded flat illustration, digital artwork
, trending on artstation, extremely detailed, tremendous element, intricate
, ((lens flare)), (backlighting), (bloom)
"""
neg_prompt = """
cartoon, 3d, ((disfigured)), ((unhealthy artwork)), ((deformed)), ((poorly drawn))
, ((further limbs)), ((shut up)), ((b&w)), bizarre colours, blurry
, hat, cap, glasses, sun shades, lightning, face
"""
generator = torch.Generator("cuda").manual_seed(1)
image = pipeline(
    prompt
    ,negative_prompt = neg_prompt
    ,generator = generator
).images[0]
image.save("tv_test.png")
Here is the result of applying the Empire Style Textual Inversion.
The modern street on the left turns into an old London style.
6. Upscale Images
The Diffusers package is great at generating high-quality images, but image upscaling is not its primary function. However, Stable Diffusion WebUI offers a feature called HighRes that allows users to upscale their generated images to 2x or 4x. It would be great if Diffusers users could enjoy the same feature. After some research and testing, I found that the SwinIR model is an excellent option for image upscaling, and it can easily upscale images to 2x or 4x after they are generated.
To use the SwinIR model for image upscaling, we can use the code from the GitHub repository of JingyunLiang/SwinIR. If you just want the code, downloading models/network_swinir.py, utils/util_calculate_psnr_ssim.py, and main_test_swinir.py is enough. Following the README guidelines, you can upscale images like magic; a sample invocation is sketched below.
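As a rough sketch (the exact flags may differ between repository versions, so check the README), a 4x real-world upscale run looks like this, where --model_path points to the downloaded pre-trained weights and --folder_lq to the folder holding your generated images (both paths here are hypothetical):
python main_test_swinir.py --task real_sr --scale 4 --model_path 003_realSR_BSRGAN_DFO_s64w8_SwinIR-M_x4_GAN.pth --folder_lq ./generated_images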
Here is a sample of how well SwinIR can scale up an image.
Many other open-source solutions can be used to improve image quality. Here are three other models I tried that return fine results.
RealSR can scale up an image 4 times almost as well as SwinIR, and its execution performance is the fastest of the three: instead of invoking PyTorch and CUDA from Python, the author compiles the code and CUDA usage directly into a binary. My observations show that RealSR can upscale an image in just about 2–4 seconds.
CodeFormer is good at restoring blurred or broken faces, and it can also remove noise and enhance background details. This solution and algorithm is widely used in other applications, including Stable Diffusion WebUI.
GFPGAN is another powerful open-source solution that achieves amazing face restoration results, and it is fast too. GFPGAN is also integrated into Stable Diffusion WebUI.
7. Optimize Diffusers CUDA Memory Usage
When using Diffusers to generate images, it's important to consider the CUDA memory usage, especially when you want to load other models to further process the generated images. If you try to load another model like SwinIR to upscale images, you might encounter a RuntimeError: CUDA out of memory because the Diffusers model still occupies the CUDA memory.
To mitigate this problem, there are several solutions for optimizing CUDA memory usage. The following two solutions worked best for me:
- Sliced Attention for Additional Memory Savings
Sliced attention is a technique that reduces the memory usage of the self-attention mechanisms in transformers. By partitioning the attention matrix into smaller blocks, the memory requirement is lowered. This technique can be used with the Diffusers package to reduce the memory footprint of the Diffusers model.
To use it in Diffusers, it takes just one line of code:
pipeline.enable_attention_slicing()
- Model Offloading to CPU
Usually, you won't have two models running at the same time. The idea is to temporarily offload the model data to CPU memory to free up CUDA memory space for other models, and only load it back to VRAM when you start using the model.
To dynamically offload data to CPU memory in Diffusers, use this line of code:
pipeline.enable_model_cpu_offload()
After applying this, whenever Diffusers finishes the image generation task, the model data will be offloaded to CPU memory automatically until the next call.
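If you need to free the VRAM completely before loading an upscaler such as SwinIR, a minimal sketch (assuming the pipeline object from earlier) is to drop the pipeline and clear PyTorch's CUDA cache:
import gc
import torch

# release the Diffusers model before loading another model
del pipeline              # drop the last reference to the pipeline
gc.collect()              # let Python collect the released objects
torch.cuda.empty_cache()  # return cached VRAM to the driver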
Summary
This article discusses how to improve the performance and capabilities of the Diffusers package. It covers solutions to several common issues faced by Diffusers users, including loading local .safetensors models, boosting performance, removing the 77-prompt-token limitation, using custom LoRA and Textual Inversion, upscaling images, and optimizing CUDA memory usage.
By applying these solutions, Diffusers users can generate high-quality images with better performance and more control over the process. The article also includes code snippets and detailed explanations for each solution.
If you can successfully apply these solutions and code in your case, there is an additional benefit, one from which I profited a lot: you may implement your own solutions by reading the Diffusers source code and come to understand better how Stable Diffusion works. To me, learning, discovering, and implementing these solutions has been a fun journey. I hope these solutions help you too, and I wish you joy with Stable Diffusion and the Diffusers package.
Here is the prompt that generated the heading image:
Babel tower falling down, walking on the starlight, dreamy ultra wide shot
, atmospheric, hyper realistic, epic composition, cinematic, octane render
, artstation landscape vista photography by Carr Clifton & Galen Rowell, 16K resolution
, Landscape veduta photo by Dustin Lefevre & tdraw, detailed landscape painting by Ivan Shishkin
, DeviantArt, Flickr, rendered in Enscape, Miyazaki, Nausicaa Ghibli, Breath of The Wild
, 4k detailed post processing, artstation, rendering by octane, unreal engine
Size: 600 × 800
Seed: 3977059881
Scheduler (or sampling method): DPMSolverMultistepScheduler
Sampling steps: 25
CFG scale (or guidance scale): 7.5
SwinIR model: 003_realSR_BSRGAN_DFO_s64w8_SwinIR-M_x4_GAN.pth
License and Code Reuse
The solutions provided in this article were achieved through extensive source reading, late-night testing, and logical design. It is important to note that at the time of writing (April 2023), the LoRA and Textual Inversion loading solutions and code included in this article were the only working versions across the internet.
If you find the code presented in this article useful and want to reuse it in your project, paper, or article, please reference back to this Medium article. The code presented here is licensed under the MIT license, which permits you to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software, subject to the conditions of the license.
Please note that the solutions presented in this article may not be the optimal or most efficient way to achieve the desired results, and they are subject to change as new developments and improvements are made. It is always recommended to thoroughly test and validate any code before implementing it in a production environment.