All K quants comparison using fp16/fp8 T5

#15
by Nelathan

I did a quick comparison across all quant levels (K quants where possible) and want to share the results with you.

Prompt: A fierce female warrior with long, flowing black hair, her armor etched with glowing rainbow colored circuits, lining strip of RGB led lighting and electronic chip, stands in the heart of a digital nexus. leaning on object, head slightly tilted, looking at viewer smiling, long fingernails, fingerless gauntlet, Surrounding her is a vortex of luminous circuitry, casting intricate patterns of light on her armor. The scene is rendered with photorealistic detail, the lighting capturing the interplay between the metallic textures and the pulsating rainbow colored energy of the digital backdrop.
3d render in unreal engine

FLUX.1-dev-Q8_0

t5xxl_fp16: FLUX_00057_.png
t5xxl_fp8: FLUX_00063_.png

FLUX.1-dev-Q6_K

t5xxl_fp16: FLUX_00058_.png
t5xxl_fp8: FLUX_00064_.png

FLUX.1-dev-Q5_K_S

t5xxl_fp16: FLUX_00059_.png
t5xxl_fp8: FLUX_00065_.png

FLUX.1-dev-Q4_K_S

t5xxl_fp16: FLUX_00060_.png
t5xxl_fp8: FLUX_00066_.png

FLUX.1-dev-Q3_K_S

t5xxl_fp16: FLUX_00061_.png
t5xxl_fp8: FLUX_00067_.png

FLUX.1-dev-Q2_K

t5xxl_fp16: FLUX_00062_.png
t5xxl_fp8: FLUX_00068_.png

Hi @Nelathan , thanks so much for your work!

  1. Are you sure about the assignment of the photos between fp16 and fp8 for each model?
    Personally, given the details, I would have swapped the photos between fp16 and fp8, especially for Q8 (shoulder armor detail, fingerless gloves, ...).

  2. Also, could you post pictures with the same prompt made with the reference flux1.dev models in fp16 and fp8, please?

Thanks so much. Have a good day.

I assume "FLUX.1-dev-Q8_0" is the equivalent of "Flux.1-dev fp8 e5m2" ?

Q8_0 is not equal to fp8 e5m2; it's int8-based.
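
To make the difference concrete, here is a rough sketch of the Q8_0 idea (this assumes the usual GGUF block layout of 32 int8 values plus one fp16 scale per block; it is an illustration, not the actual llama.cpp code):

```python
import numpy as np

def quantize_q8_0(weights: np.ndarray, block_size: int = 32):
    """Q8_0-style quantization: blocks of 32 weights stored as int8 plus
    one fp16 scale per block (integer-based, unlike fp8 e5m2)."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0                      # avoid division by zero
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_q8_0(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate weights: int8 value times the block scale."""
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_q8_0(w)
print("max Q8_0 round-trip error:", np.abs(w - dequantize_q8_0(q, s)).max())
# fp8 e5m2 is different: a true floating-point format (1 sign, 5 exponent,
# 2 mantissa bits) applied per value, with no per-block scale at all.
```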

You can open any image in ComfyUI to check my workflow and the models. Sorry, I don't have the original model since I only have 12 GB of VRAM. If someone could add those images for fp16 and fp8 e5m2, that would be great.
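
If you want to inspect the workflow without loading the image into ComfyUI, a small Pillow sketch like the one below should work, assuming ComfyUI's usual behaviour of embedding the workflow/prompt as PNG text chunks (the filename is just one of the files from this thread):

```python
import json
from PIL import Image  # pip install pillow

def read_comfy_workflow(path: str) -> dict:
    """Read the workflow JSON that ComfyUI embeds in a PNG's text chunks."""
    info = Image.open(path).info                  # tEXt chunks land in .info
    raw = info.get("workflow") or info.get("prompt")
    if raw is None:
        raise ValueError("no embedded ComfyUI workflow found in this image")
    return json.loads(raw)

# e.g. with one of the files from this thread:
wf = read_comfy_workflow("FLUX_00057_.png")
print(type(wf), len(wf))
```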

Interesting, thanks for the comparison. Looks like Q3 with t5xxl_fp8 can still pull off a pretty convincing image even if it's slightly rough looking; good news for lower end machines. Q2 is where it really degrades to the point of "nobody should use this one". More or less falls in line with my experience using ggufs with LLMs: Q3 variants can still be decent, Q2 is awful.

There is still hope for IQ quants.

Also, we lose some precision when dequantizing to bfloat16. If we could run inference directly in int8 or even int4, there would be less loss, I think.

> If someone could add those images for fp16 and fp8 e5m2, that would be great.

Sure!

Flux Dev fp8_e5m2:

t5xxl_fp16:
fp8_e5m2_t5xxl_fp16_00001_.png

t5xxl_fp8_e4m3fn:
fp8_e5m2_t5xxl_fp8_e4m3fn_00001_.png


Flux Dev fp8_e4m3fn:

t5xxl_fp16:
fp8_e4m3fn_t5xxl_fp16_00002_.png

t5xxl_fp8_e4m3fn:
fp8_e4m3fn_t5xxl_fp8_e4m3fn_00001_.png

I used your workflow embedded in the images, so it's a faithful comparison.

Thanks for sharing, it is amazing that Q3 maintains its quality!
GGUF conversion and quantization look like a breakthrough that will improve accessibility and performance. It is strange that quantization and partial offloading have not been used for image generation before.

Yeah, but did you check the speed? The speeds are a major letdown.

@3blackbar Because each tensor is dequantized to bfloat16 every time it is used.
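
Roughly speaking, the hot path looks like the sketch below (PyTorch, toy shapes; an illustration of the idea, not the actual ComfyUI-GGUF code): the quantized weight is expanded to bfloat16 on every forward pass before the matmul.

```python
import torch

def forward_with_dequant(q_weight: torch.Tensor, scales: torch.Tensor,
                         x: torch.Tensor) -> torch.Tensor:
    """Toy GGUF-style layer: the int8 weight is expanded to bfloat16 on
    every call before the matmul, so the dequant cost is paid per use."""
    w = q_weight.to(torch.bfloat16) * scales.to(torch.bfloat16)   # dequantize
    return x.to(torch.bfloat16) @ w.t()

q_weight = torch.randint(-127, 128, (4096, 4096), dtype=torch.int8)
scales = torch.rand(4096, 1)          # one scale per output row (illustrative)
x = torch.randn(8, 4096)
print(forward_with_dequant(q_weight, scales, x).shape)
# Native int8/int4 kernels would multiply on the quantized data directly,
# skipping the bfloat16 expansion -- that's where the speed would come back.
```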

That's how engineering works: first make it work, then make it fast.

Using a qualitative analysis of these quantization methods done for Llama 3 8B as a reference, we can get a good idea of the best cost-benefit trade-off (https://github.com/ggerganov/llama.cpp/blob/master/examples/perplexity/README.md#llama-3-8b-scoreboard). I plotted a chart that identifies the point beyond which further quantization gives diminishing returns. Please see below:

chart.png

This indicates that models 'Q6_K' and 'Q5_K_S' would yield the best compromise between performance and quality loss.
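
For anyone who wants to redo this analysis, a tiny sketch of the cost-benefit calculation is below. The bits-per-weight and perplexity numbers are rough placeholders only; substitute the real values from the scoreboard linked above.

```python
# Placeholder numbers only -- substitute the real bits-per-weight and
# perplexity-increase values from the llama.cpp scoreboard linked above.
quants = {
    # name: (bits_per_weight, perplexity_increase_vs_fp16)
    "Q8_0":   (8.50, 0.01),
    "Q6_K":   (6.56, 0.04),
    "Q5_K_S": (5.50, 0.11),
    "Q4_K_S": (4.58, 0.25),
    "Q3_K_S": (3.50, 0.80),
    "Q2_K":   (2.96, 3.00),
}

for name, (bpw, dppl) in quants.items():
    bits_saved = 16.0 - bpw            # saved relative to fp16 weights
    ratio = dppl / bits_saved          # quality lost per bit saved
    print(f"{name:8s} {bpw:5.2f} bpw  +{dppl:.2f} ppl  {ratio:.3f} ppl/bit saved")
# The 'knee' is where this ratio starts climbing steeply; with numbers in
# this ballpark it lands around Q6_K / Q5_K, consistent with the chart.
```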

> This indicates that models 'Q6_K' and 'Q5_K_S' would yield the best compromise between performance and quality loss.

Excuse me, but are you serious? It's clear from the graph that 'Q6_K' and 'Q5_K_M' would yield the best compromise between performance and quality loss.

@deepfree Sorry, I jumped to a conclusion. When I referred to 'Q5_K_S', I was already taking into account that we don't have a 'Q5_K_M' available for Flux.

Got it. Looking forward to Q4_K_M and Q5_K_M.

Looks like Q5_K_S is the best choice for people with low VRAM. The quality loss compared to Q8 is barely noticeable.

May I ask about the VRAM consumption of Q8 versus FP8? Should I switch to Q8 or Q6 to save more VRAM?

> May I ask about the VRAM consumption of Q8 versus FP8? Should I switch to Q8 or Q6 to save more VRAM?

I haven't looked into the VRAM usage of either of them, but when fp8 first got released it used to crash on me all the time in ComfyUI at the VAE decode stage (4070 Super, 32 GB DDR5 RAM); I had to close all browser tabs and any background programs for it to run. It also takes a long time to load the model from my WD Black SSD, which has really fast read/write speeds. The Q8 GGUF, however, doesn't crash as often and loads in a reasonable amount of time.
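
As a rough back-of-the-envelope (assuming ~12B parameters for the Flux transformer and typical GGUF bits-per-weight figures; the T5/CLIP encoders, VAE and activations come on top of this), the weight sizes work out roughly like this:

```python
# Back-of-the-envelope weight sizes (transformer only; T5/CLIP, VAE and
# activations are extra). Assumes ~12B parameters for FLUX.1-dev.
params = 12e9
bits_per_weight = {
    "fp16":   16.0,
    "fp8":     8.0,
    "Q8_0":    8.5,   # 32 int8 values + one fp16 scale per block
    "Q6_K":    6.56,  # approximate GGUF figures
    "Q5_K_S":  5.5,
    "Q4_K_S":  4.6,
}
for name, bpw in bits_per_weight.items():
    gib = params * bpw / 8 / 1024**3
    print(f"{name:8s} ~{gib:5.1f} GiB of weights")
```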
