All K quants comparison using fp16/fp8 T5

#15
by Nelathan

I did a quick comparison across all quant levels (K quants where possible) and want to share the results with you.

Prompt: A fierce female warrior with long, flowing black hair, her armor etched with glowing rainbow colored circuits, lining strip of RGB led lighting and electronic chip, stands in the heart of a digital nexus. leaning on object, head slightly tilted, looking at viewer smiling, long fingernails, fingerless gauntlet, Surrounding her is a vortex of luminous circuitry, casting intricate patterns of light on her armor. The scene is rendered with photorealistic detail, the lighting capturing the interplay between the metallic textures and the pulsating rainbow colored energy of the digital backdrop.
3d render in unreal engine

FLUX.1-dev-Q8_0

t5xxl_fp16: FLUX_00057_.png
t5xxl_fp8: FLUX_00063_.png

FLUX.1-dev-Q6_K

t5xxl_fp16: FLUX_00058_.png
t5xxl_fp8: FLUX_00064_.png

FLUX.1-dev-Q5_K_S

t5xxl_fp16: FLUX_00059_.png
t5xxl_fp8: FLUX_00065_.png

FLUX.1-dev-Q4_K_S

t5xxl_fp16: FLUX_00060_.png
t5xxl_fp8: FLUX_00066_.png

FLUX.1-dev-Q3_K_S

t5xxl_fp16: FLUX_00061_.png
t5xxl_fp8: FLUX_00067_.png

FLUX.1-dev-Q2_K

t5xxl_fp16: FLUX_00062_.png
t5xxl_fp8: FLUX_00068_.png

Hi @Nelathan , thanks so much for your work!

  1. Are you sure about the assignment of the photos between fp16 and fp8 for each model?
    Personally, given the details, I would have swapped the photos between fp16 and fp8, especially for Q8 (shoulder armor detail, fingerless gloves, ...).

  2. Also, could you post pictures with the same prompt made with the reference flux1.dev models in fp16 and fp8, please?

Thanks so much. Have a good day.

I assume "FLUX.1-dev-Q8_0" is the equivalent of "Flux.1-dev fp8 e5m2" ?

Q8_0 is not equal to fp8 e5m2; it's int8-based.
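
To make the difference concrete, here is a rough sketch of the Q8_0 idea (this assumes the usual GGUF block layout of 32 int8 values plus one fp16 scale per block; it is an illustration, not the actual llama.cpp code):

```python
import numpy as np

def quantize_q8_0(weights: np.ndarray, block_size: int = 32):
    """Q8_0-style quantization: blocks of 32 weights stored as int8 plus
    one fp16 scale per block (integer-based, unlike fp8 e5m2)."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0                      # avoid division by zero
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_q8_0(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate weights: int8 value times the block scale."""
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_q8_0(w)
print("max Q8_0 round-trip error:", np.abs(w - dequantize_q8_0(q, s)).max())
# fp8 e5m2 is different: a true floating-point format (1 sign, 5 exponent,
# 2 mantissa bits) applied per value, with no per-block scale at all.
```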

You can open any image in ComfyUI to check my workflow and the models. Sorry, I don't have the original model since I only have 12 GB of VRAM. If someone could add those images for fp16 and fp8 e5m2, that would be great.
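
If you want to inspect the workflow without loading the image into ComfyUI, a small Pillow sketch like the one below should work, assuming ComfyUI's usual behaviour of embedding the workflow/prompt as PNG text chunks (the filename is just one of the files from this thread):

```python
import json
from PIL import Image  # pip install pillow

def read_comfy_workflow(path: str) -> dict:
    """Read the workflow JSON that ComfyUI embeds in a PNG's text chunks."""
    info = Image.open(path).info                  # tEXt chunks land in .info
    raw = info.get("workflow") or info.get("prompt")
    if raw is None:
        raise ValueError("no embedded ComfyUI workflow found in this image")
    return json.loads(raw)

# e.g. with one of the files from this thread:
wf = read_comfy_workflow("FLUX_00057_.png")
print(type(wf), len(wf))
```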

Interesting, thanks for the comparison. Looks like Q3 with t5xxl_fp8 can still pull off a pretty convincing image even if it's slightly rough looking; good news for lower end machines. Q2 is where it really degrades to the point of "nobody should use this one". More or less falls in line with my experience using ggufs with LLMs: Q3 variants can still be decent, Q2 is awful.

There is still hope for IQ quants.

Also, we lose some precision when dequantizing to bfloat16. If we could run inference directly in int8 or even int4, there would be less loss, I think.

> If someone could add those images for fp16 and fp8 e5m2, that would be great.

Sure!

Flux Dev fp8_e5m2:

t5xxl_fp16:
fp8_e5m2_t5xxl_fp16_00001_.png

t5xxl_fp8_e4m3fn:
fp8_e5m2_t5xxl_fp8_e4m3fn_00001_.png


Flux Dev fp8_e4m3fn:

t5xxl_fp16:
fp8_e4m3fn_t5xxl_fp16_00002_.png

t5xxl_fp8_e4m3fn:
fp8_e4m3fn_t5xxl_fp8_e4m3fn_00001_.png

I used your workflow embedded in the images, so it's a faithful comparison.

Thanks for sharing, it is amazing that Q3 maintains its quality!
GGUF conversion and quantization look like a breakthrough that will improve accessibility and performance. It is strange that quantization and partial offloading have not been used for image generation before.

Yeah, but did you check the speed? The speeds are a major letdown.

@3blackbar Because each tensor is dequantized to bfloat16 every time it is used.
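
Roughly speaking, the hot path looks like the sketch below (PyTorch, toy shapes; an illustration of the idea, not the actual ComfyUI-GGUF code): the quantized weight is expanded to bfloat16 on every forward pass before the matmul.

```python
import torch

def forward_with_dequant(q_weight: torch.Tensor, scales: torch.Tensor,
                         x: torch.Tensor) -> torch.Tensor:
    """Toy GGUF-style layer: the int8 weight is expanded to bfloat16 on
    every call before the matmul, so the dequant cost is paid per use."""
    w = q_weight.to(torch.bfloat16) * scales.to(torch.bfloat16)   # dequantize
    return x.to(torch.bfloat16) @ w.t()

q_weight = torch.randint(-127, 128, (4096, 4096), dtype=torch.int8)
scales = torch.rand(4096, 1)          # one scale per output row (illustrative)
x = torch.randn(8, 4096)
print(forward_with_dequant(q_weight, scales, x).shape)
# Native int8/int4 kernels would multiply on the quantized data directly,
# skipping the bfloat16 expansion -- that's where the speed would come back.
```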

That's how engineering works: first make it work, then make it fast.

Using a qualitative analysis of these quantization methods done for Llama 3 8B as a reference, we can get a good idea of the best cost-benefit trade-off (https://github.com/ggerganov/llama.cpp/blob/master/examples/perplexity/README.md#llama-3-8b-scoreboard). I plotted a chart that identifies the point beyond which further quantization gives diminishing returns. Please see below:

chart.png

This indicates that models 'Q6_K' and 'Q5_K_S' would yield the best compromise between performance and quality loss.
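
For anyone who wants to redo this analysis, a tiny sketch of the cost-benefit calculation is below. The bits-per-weight and perplexity numbers are rough placeholders only; substitute the real values from the scoreboard linked above.

```python
# Placeholder numbers only -- substitute the real bits-per-weight and
# perplexity-increase values from the llama.cpp scoreboard linked above.
quants = {
    # name: (bits_per_weight, perplexity_increase_vs_fp16)
    "Q8_0":   (8.50, 0.01),
    "Q6_K":   (6.56, 0.04),
    "Q5_K_S": (5.50, 0.11),
    "Q4_K_S": (4.58, 0.25),
    "Q3_K_S": (3.50, 0.80),
    "Q2_K":   (2.96, 3.00),
}

for name, (bpw, dppl) in quants.items():
    bits_saved = 16.0 - bpw            # saved relative to fp16 weights
    ratio = dppl / bits_saved          # quality lost per bit saved
    print(f"{name:8s} {bpw:5.2f} bpw  +{dppl:.2f} ppl  {ratio:.3f} ppl/bit saved")
# The 'knee' is where this ratio starts climbing steeply; with numbers in
# this ballpark it lands around Q6_K / Q5_K, consistent with the chart.
```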

> This indicates that models 'Q6_K' and 'Q5_K_S' would yield the best compromise between performance and quality loss.

Excuse me, but are you serious? It's clear from the graph that 'Q6_K' and 'Q5_K_M' would yield the best compromise between performance and quality loss.

@deepfree Sorry, I jumped to a conclusion. When I referred to 'Q5_K_S', I was already taking into account that we don't have a 'Q5_K_M' available for Flux.

Got it. Looking forward to Q4_K_M and Q5_K_M.

Looks like Q5_K_S is the best choice for people with low VRAM. The quality loss compared to Q8 is barely noticeable.

May I ask about the VRAM consumption of Q8 versus FP8? Should I switch to Q8 or Q6 to save more VRAM?

> May I ask about the VRAM consumption of Q8 versus FP8? Should I switch to Q8 or Q6 to save more VRAM?

I haven't looked into the VRAM usage of either of them, but when fp8 first got released it used to crash on me all the time in ComfyUI at the VAE decode stage (4070 Super, 32 GB DDR5 RAM); I had to close all browser tabs and any background programs for it to run. It also takes a long time to load the model from my WD Black SSD, which has really fast read/write speeds. The Q8 GGUF, however, doesn't crash as often and loads in a reasonable amount of time.
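
As a rough back-of-the-envelope (assuming ~12B parameters for the Flux transformer and typical GGUF bits-per-weight figures; the T5/CLIP encoders, VAE and activations come on top of this), the weight sizes work out roughly like this:

```python
# Back-of-the-envelope weight sizes (transformer only; T5/CLIP, VAE and
# activations are extra). Assumes ~12B parameters for FLUX.1-dev.
params = 12e9
bits_per_weight = {
    "fp16":   16.0,
    "fp8":     8.0,
    "Q8_0":    8.5,   # 32 int8 values + one fp16 scale per block
    "Q6_K":    6.56,  # approximate GGUF figures
    "Q5_K_S":  5.5,
    "Q4_K_S":  4.6,
}
for name, bpw in bits_per_weight.items():
    gib = params * bpw / 8 / 1024**3
    print(f"{name:8s} ~{gib:5.1f} GiB of weights")
```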
