Update Llama.cpp Submodule to #9fb13f by AuLaSW · Pull Request #1007 · abetlen/llama-cpp-python

AuLaSW · 2023-12-13T17:05:15Z

This pull request is small and simple: update the Llama.cpp submodule to #9fb13f. The submodule has been updated enough to include support for MoE models (such as the new Mixtral-8x7B-v0.1 that came out yesterday). I have tested this on WSL and it works with the quantized version of that model from TheBloke.

The latest commit allows for MoE models thanks to commit #799a1cb. This should update the connector to use the new llama.cpp files and allow for MoE models (such as Mixtral-8x7B-v0.1) to be used.

Update the Llama.cpp submodule to include commit #799a1cb, which expands Llama.cpp to include MoE models such as Mixtral-8x7B-v0.1.
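For anyone who wants to try a MoE model once the updated submodule is built, a minimal sketch using the `Llama` class from llama-cpp-python could look like the following. The helper name `load_gguf` and the file name in the usage comment are illustrative assumptions, not part of this PR; `model_path`, `n_gpu_layers` and `n_ctx` are existing constructor parameters.

```python
from pathlib import Path


def load_gguf(model_path, n_gpu_layers=0, n_ctx=512):
    """Load a GGUF model with llama-cpp-python (hypothetical helper).

    MoE checkpoints such as Mixtral need a llama-cpp-python build whose
    vendored llama.cpp already includes MoE support.
    """
    path = Path(model_path)
    if not path.is_file():
        # Fail early with a clear message instead of a low-level loader error.
        raise FileNotFoundError(f"GGUF file not found: {path}")

    # Imported lazily so the path check above works without the package.
    from llama_cpp import Llama

    return Llama(model_path=str(path), n_gpu_layers=n_gpu_layers, n_ctx=n_ctx)


# Example usage (the file name is a placeholder, adjust to your download):
# llm = load_gguf("mixtral-8x7b-v0.1.Q4_K_M.gguf", n_gpu_layers=10)
# print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```

Setting `n_gpu_layers` above zero only has an effect when the package was built with GPU support; with the default of 0 everything runs on the CPU.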

shell-skrimp · 2023-12-13T18:58:31Z

Hi @AuLaSW. I tested your PR and it seems to work fine. I used mixtral-8x7b-Q5_K_M.gguf; the output is below. I also tested llama2 and mistral, and they worked fine too.

llama_model_loader: loaded meta data with 24 key-value pairs and 995 tensors from mixtral-8x7b-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q5_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:          blk.0.ffn_gate.0.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    2:          blk.0.ffn_down.0.weight q5_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_up.0.weight q5_K     [  4096, 14336,     1,     1 ]

llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:                         llama.expert_count u32              = 8
llama_model_loader: - kv  10:                    llama.expert_used_count u32              = 2
llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:                          general.file_type u32              = 17
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  913 tensors
llama_model_loader: - type q6_K:   17 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 8
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q5_K - Medium
llm_load_print_meta: model params     = 46.70 B
llm_load_print_meta: model size       = 29.93 GiB (5.50 BPW)
llm_load_print_meta: general.name     = .
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.39 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 21128.36 MiB
llm_load_tensors: offloading 10 repeating layers to GPU
llm_load_tensors: offloaded 10/33 layers to GPU
llm_load_tensors: VRAM used: 9518.71 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 117.85 MiB
llama_new_context_with_model: VRAM scratch buffer: 114.54 MiB
llama_new_context_with_model: total VRAM used: 9633.25 MiB (model: 9518.71 MiB, context: 114.54 MiB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

llama_print_timings:        load time =    3605.30 ms
llama_print_timings:      sample time =      31.00 ms /   139 runs   (    0.22 ms per token,  4484.45 tokens per second)
llama_print_timings: prompt eval time =    3605.24 ms /    22 tokens (  163.87 ms per token,     6.10 tokens per second)
llama_print_timings:        eval time =   26828.24 ms /   138 runs   (  194.41 ms per token,     5.14 tokens per second)
llama_print_timings:       total time =   30731.81 ms

I'm looking at getting one of these soon and I have two questions that need to be answered. First how much power does the stock turbo make on the supra, is it true it has less boost then my 90 celica gt (12psi). Secondly what are some good bolt-on mods for this car. What kind of power can I get from a full exhaust and maybe a chip? How about an intercooler?

Thanks in advance.

------------------
Glenn
90 Celica gt, 15psi, HKS cams, B&M FPR, Full Exhaust
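As a sanity check on the log in this comment, the reported throughput and K cache size can be recomputed from the other printed values. The sketch below is only an illustration: the regular expression and both function names are assumptions made here, and the cache formula assumes each layer stores `n_embd / n_gqa` f16 values per token, which is consistent with the `K (f16): 32.00 MiB` line.

```python
import re

# Timing lines copied from the log above.
TIMINGS = """\
llama_print_timings: prompt eval time =    3605.24 ms /    22 tokens (  163.87 ms per token,     6.10 tokens per second)
llama_print_timings:        eval time =   26828.24 ms /   138 runs   (  194.41 ms per token,     5.14 tokens per second)
"""

# Captures the phase label, the elapsed milliseconds and the token/run count.
LINE_RE = re.compile(
    r"llama_print_timings:\s+(?P<label>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms /"
    r"\s+(?P<n>\d+) (?:tokens|runs)"
)


def throughput(log_text):
    """Return {label: tokens per second}, recomputed from ms and counts."""
    result = {}
    for m in LINE_RE.finditer(log_text):
        result[m["label"]] = int(m["n"]) / (float(m["ms"]) / 1000.0)
    return result


def k_cache_mib(n_layer, n_ctx, n_embd, n_head, n_head_kv, bytes_per_elem=2):
    """Size in MiB of the K cache; the V cache has the same size."""
    n_gqa = n_head // n_head_kv      # 32 // 8 = 4, as printed in the log
    n_embd_kv = n_embd // n_gqa      # width of K per token and layer
    return n_layer * n_ctx * n_embd_kv * bytes_per_elem / 2**20


rates = throughput(TIMINGS)
print(round(rates["prompt eval"], 2), round(rates["eval"], 2))
print(k_cache_mib(n_layer=32, n_ctx=512, n_embd=4096, n_head=32, n_head_kv=8))
```

Both results agree with the log: roughly 6.10 and 5.14 tokens per second, and 32 MiB each for K and V, which adds up to the 64.00 MiB reported as `KV self size`.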

AuLaSW · 2023-12-13T19:05:22Z

Does this also cover #1000? I read through it and it seems like it would.

shell-skrimp · 2023-12-13T19:06:14Z

I believe it does

pabl-o-ce · 2023-12-13T19:54:52Z

I love it!

mclassen · 2023-12-14T00:44:16Z

Please merge it! 🙏

abetlen · 2023-12-14T02:56:18Z

@AuLaSW thank you for this! I've merged the latest llama.cpp release into main and published a new release (v0.2.23) to PyPI.
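To confirm that an environment has picked up this release, the installed version can be compared against v0.2.23 with the standard library. This is a rough sketch: the function names are made up for illustration, the parser assumes plain dotted integer versions, and treating 0.2.23 as the minimum is an assumption based on the comment above.

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.2.23' into (0, 2, 23); assumes plain dotted integers."""
    return tuple(int(part) for part in text.split("."))


def is_at_least(installed, minimum="0.2.23"):
    """True if the installed version is the given minimum or newer."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(package="llama-cpp-python"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("llama-cpp-python is not installed")
else:
    print(version, "ok" if is_at_least(version) else "older than 0.2.23")
```

Versions with suffixes such as pre-release tags would need a real version parser; the tuple comparison here is deliberately simple.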

shell-skrimp · 2023-12-14T03:37:14Z

Tested new release, seems good.

AuLaSW and others added 2 commits December 13, 2023 09:26

Update llama.cpp to be latest commit

a8a3dcc

The latest commit allows for MoE models thanks to commit #799a1cb. This should update the connector to use the new llama.cpp files and allow for MoE models (such as Mixtral-8x7B-v0.1) to be used.

Update Llama.cpp to latest commit

77fa274

Update the Llama.cpp submodule to include commit #799a1cb, which expands Llama.cpp to include MoE models such as Mixtral-8x7B-v0.1.

abetlen closed this Dec 14, 2023

pseudotensor mentioned this pull request Dec 14, 2023

Error: Mixtral-8x7B-Instruct-v0.1 issue with Llama-cpp-python h2oai/h2ogpt#1202

Closed

AuLaSW wants to merge 2 commits into abetlen:main from AuLaSW:main

5 participants
