[bug] KoboldCpp 1.71.1 misdetects quant types (not only for llama models) and causes quality loss

#1
by softfluffyboy - opened

Hi, sorry for reporting the bug here and not on GitHub (I got shadow-banned there over a dirty free-VPN IP; I just forgot to turn the VPN off before logging in).
After the patch that added the llama 3.1 RoPE fix, the quant type is detected wrongly, and not only for llama models, which causes quality loss.
For example, Llama-3.1-8B-Instruct-abliterated.Q4_K_S is detected as Q3_K - Large:

llm_load_print_meta: model ftype = Q3_K - Large
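One way to check from the outside which quant the file itself claims to be: read the raw general.file_type value straight from the GGUF header, bypassing the loader's printing entirely. Below is a minimal sketch (my own, not part of KoboldCpp), assuming ggml's gguf C API as bundled with llama.cpp/KoboldCpp around this version. For reference, in llama.h the enum has LLAMA_FTYPE_MOSTLY_Q3_K_L = 13 and LLAMA_FTYPE_MOSTLY_Q4_K_S = 14, so a genuine Q4_K_S file should carry the value 14:

// check_ftype.cpp - dump the raw general.file_type from a GGUF header.
// Sketch only; at the time the gguf API was declared in ggml.h
// (newer llama.cpp trees declare it in gguf.h instead).
#include <cstdio>
#include "ggml.h"

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) { fprintf(stderr, "failed to open %s\n", argv[1]); return 1; }

    const int key_id = gguf_find_key(ctx, "general.file_type");
    if (key_id < 0) {
        // No explicit ftype key: the loader then has to guess from the
        // tensor types, which is one way a wrong label can get printed.
        printf("general.file_type: not present\n");
    } else {
        printf("general.file_type = %u\n", gguf_get_val_u32(ctx, key_id));
    }

    gguf_free(ctx);
    return 0;
}

If this prints 14 while the loader reports Q3_K - Large, the header is intact and only the label is being mismapped somewhere in the loader.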

Full llama 3.1 log:

Welcome to KoboldCpp - Version 1.71.1
For command line arguments, please refer to --help
Loading model: /media/user/sdc1/Llama-3.1-8B-Instruct-abliterated.Q4_K_S.gguf

The reported GGUF Arch is: llama
Arch Category: 0
llama_model_loader: loaded meta data with 33 key-value pairs and 292 tensors from /media/user/sdc1/llama3/Llama-3.1-8B-Instruct-abliterated.Q4_K_S.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q3_K - Large
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.36 GiB (4.67 BPW)
llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct Abliterated
llm_load_print_meta: BOS token = 128000 <|begin_of_text|>
llm_load_print_meta: EOS token = 128009 <|eot_id|>
llm_load_print_meta: LF token = 128 Ä
llm_load_print_meta: EOT token = 128009 <|eot_id|>
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.32 MiB
llm_load_tensors: offloading 13 repeating layers to GPU
llm_load_tensors: offloaded 13/33 layers to GPU
llm_load_tensors: CPU buffer size = 4467.80 MiB
llm_load_tensors: OpenCL buffer size = 1521.46 MiB
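A quick sanity check on the numbers the loader itself printed above: recompute bits-per-weight from model size and model params. This is plain arithmetic, shown here as a hypothetical standalone snippet:

// bpw_check.cpp - recompute BPW from the values in the log above.
#include <cstdio>

int main() {
    const double size_gib = 4.36;    // "model size = 4.36 GiB" from the log
    const double n_params = 8.03e9;  // "model params = 8.03 B"
    const double bpw = size_gib * 1024.0 * 1024.0 * 1024.0 * 8.0 / n_params;
    printf("BPW = %.2f\n", bpw);     // ~4.66; matches the log's 4.67 up to
    return 0;                        // rounding of the two inputs
}

Roughly 4.67 BPW is in the usual range for Q4_K_S files (around 4.6), while Q3_K_L files typically land nearer 4.3, so the data on disk still looks like Q4_K_S and only the printed ftype appears wrong.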

The same thing happens with a non-llama arch, qwen2:

Welcome to KoboldCpp - Version 1.71.1
For command line arguments, please refer to --help

Loading model: /media/user/sdc1/qwen2/Qwen2-7B-Instruct-abliterated-Q4_K_M-imat.0.gguf

The reported GGUF Arch is: qwen2
Arch Category: 5


Identified as GGUF model: (ver 6)
Attempting to Load...

llama_model_loader: loaded meta data with 25 key-value pairs and 339 tensors from /media/user/sdc1/qwen2/Qwen2-7B-Instruct-abliterated-Q4_K_M-imat.0.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens cache size = 421
llm_load_vocab: token to piece cache size = 0.9352 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 3584
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 28
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18944
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q3_K - Large
llm_load_print_meta: model params = 7.62 B
llm_load_print_meta: model size = 4.36 GiB (4.91 BPW)
llm_load_print_meta: general.name = Qwen2-7B-Instruct-abliterated
llm_load_print_meta: BOS token = 151643 <|endoftext|>
llm_load_print_meta: EOS token = 151645 <|im_end|>
llm_load_print_meta: PAD token = 151643 <|endoftext|>
llm_load_print_meta: LF token = 148848 ÄĬ
llm_load_print_meta: EOT token = 151645 <|im_end|>
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.35 MiB
llm_load_tensors: offloading 13 repeating layers to GPU
llm_load_tensors: offloaded 13/29 layers to GPU
llm_load_tensors: CPU buffer size = 4460.45 MiB
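The same arithmetic on this log (4.36 GiB over 7.62 B params gives roughly 4.91 BPW, matching what the loader printed) is consistent with the Q4_K_M in the file name rather than with Q3_K - Large, so here too only the label looks wrong.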

Koboldcpp org

Please try 1.72, which has been released.
Also, please report any issues on GitHub.

concedo changed discussion status to closed
Koboldcpp org

As an alternative to GitHub reports, we have a very active Discord community you can join at https://koboldai.org/discord to talk to us directly.
