tts : implement sesame CSM + Mimi decoder #12648


Open · wants to merge 38 commits into base: master

Commits (38)
- `24a07ab` tts : implement mimi decoder (ngxson, Mar 29, 2025)
- `efeaa57` fix llama-tts (ngxson, Mar 29, 2025)
- `a98f199` put mimi_model into a shared header (ngxson, Mar 29, 2025)
- `891273c` mimi : non-transposed input codes (ngxson, Mar 29, 2025)
- `6dca237` tts : add sesame csm (ngxson, Mar 29, 2025)
- `2d743b6` wip (ngxson, Mar 29, 2025)
- `f9162e7` wip (ngxson, Mar 30, 2025)
- `eae5f0e` add mimi_model::transpose_input (ngxson, Mar 30, 2025)
- `43bf237` fix build (ngxson, Mar 30, 2025)
- `e618405` fix build (2) (ngxson, Mar 30, 2025)
- `e185e0a` fix build (3) (ngxson, Mar 30, 2025)
- `ce83041` fix strcmp (ngxson, Mar 30, 2025)
- `61d8ad6` fix compilation on linux (ngxson, Mar 30, 2025)
- `4012054` clean up (ngxson, Mar 30, 2025)
- `b97fd3e` Merge branch 'xsn/mimi_dec' into xsn/csm_tts (ngxson, Mar 30, 2025)
- `7ecce76` working now (ngxson, Mar 30, 2025)
- `6976682` update readme (ngxson, Mar 30, 2025)
- `1e9afd9` nits (ngxson, Mar 30, 2025)
- `9f05741` Merge branch 'master' into xsn/csm_tts (ngxson, Mar 30, 2025)
- `40ab1ab` fix mul_mat_id read out-of-bound (ngxson, Mar 30, 2025)
- `eaba2bf` will this fix windows build? (ngxson, Mar 30, 2025)
- `5fe27ef` (try) fixing problem with long text (ngxson, Mar 30, 2025)
- `c796ee0` mimi: fix frame splitting (ngxson, Mar 30, 2025)
- `e31a75c` fix mimi example dummy1 (ngxson, Mar 31, 2025)
- `5be8e7d` add top-k and temp sampling (ngxson, Mar 31, 2025)
- `90231cc` much better on long generation (ngxson, Apr 1, 2025)
- `156b528` Merge branch 'master' into xsn/csm_tts (ngxson, Apr 2, 2025)
- `e9dc476` fix tts-csm (ngxson, Apr 2, 2025)
- `c681257` ability to do multi-turns (ngxson, Apr 2, 2025)
- `142b545` Merge branch 'master' into xsn/csm_tts (ngxson, Apr 3, 2025)
- `d178099` add audio EOS token (ngxson, Apr 3, 2025)
- `0b55d8b` Merge branch 'master' into xsn/csm_tts (ngxson, Apr 5, 2025)
- `1219827` Merge branch 'master' into xsn/csm_tts (ngxson, Apr 9, 2025)
- `d1de6cc` add speaker reference (ngxson, Apr 9, 2025)
- `31b5d22` Merge branch 'master' into xsn/csm_tts (ngxson, Apr 23, 2025)
- `9533fb7` fix build_attn (ngxson, Apr 23, 2025)
- `e5bb560` rm print (ngxson, Apr 23, 2025)
- `c1cd710` fix pyright (ngxson, Apr 23, 2025)
1 change: 1 addition & 0 deletions .gitignore
@@ -107,6 +107,7 @@ examples/server/*.gz.hpp
!examples/*/*/*.kts
!examples/sycl/*.bat
!examples/sycl/*.sh
/*.wav

# Server Web UI temporary files
node_modules
28 changes: 28 additions & 0 deletions common/common.cpp
@@ -1565,3 +1565,31 @@ common_control_vector_data common_control_vector_load(const std::vector<common_c

return result;
}

//
// Audio utils
//

bool save_wav16(const std::string & fname, const std::vector<float> & data, int sample_rate) {
std::ofstream file(fname, std::ios::binary);
if (!file) {
LOG_ERR("%s: Failed to open file '%s' for writing.\n", __func__, fname.c_str());
return false;
}

wav_header header;
header.sample_rate = sample_rate;
header.byte_rate = header.sample_rate * header.num_channels * (header.bits_per_sample / 8);
header.block_align = header.num_channels * (header.bits_per_sample / 8);
header.data_size = data.size() * (header.bits_per_sample / 8);
header.chunk_size = 36 + header.data_size;

file.write(reinterpret_cast<const char*>(&header), sizeof(header));

for (const auto & sample : data) {
int16_t pcm_sample = static_cast<int16_t>(std::clamp(sample * 32767.0, -32768.0, 32767.0));
file.write(reinterpret_cast<const char*>(&pcm_sample), sizeof(pcm_sample));
}

return file.good();
}
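The per-sample conversion above (scale by 32767, clamp, truncate to `int16_t`) can be mirrored in Python for illustration; this is only a sketch of the same arithmetic, not part of the actual implementation:

```python
import struct

def float_to_pcm16(samples):
    """Mirror of the save_wav16 loop: scale float samples in [-1, 1]
    to little-endian 16-bit PCM, clamping out-of-range values."""
    out = bytearray()
    for s in samples:
        # same as std::clamp(s * 32767.0, -32768.0, 32767.0), then truncate
        v = int(max(-32768.0, min(32767.0, s * 32767.0)))
        out += struct.pack('<h', v)
    return bytes(out)
```

Note that values slightly outside [-1, 1] are clamped rather than wrapped, which avoids the harsh wrap-around distortion a plain cast would produce.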
22 changes: 22 additions & 0 deletions common/common.h
@@ -662,3 +662,25 @@ const char * const LLM_KV_SPLIT_COUNT = "split.count";
const char * const LLM_KV_SPLIT_TENSORS_COUNT = "split.tensors.count";

}

//
// Audio utils
//

struct wav_header {
char riff[4] = {'R', 'I', 'F', 'F'};
uint32_t chunk_size;
char wave[4] = {'W', 'A', 'V', 'E'};
char fmt[4] = {'f', 'm', 't', ' '};
uint32_t fmt_chunk_size = 16;
uint16_t audio_format = 1; // PCM
uint16_t num_channels = 1; // Mono
uint32_t sample_rate;
uint32_t byte_rate;
uint16_t block_align;
uint16_t bits_per_sample = 16;
char data[4] = {'d', 'a', 't', 'a'};
uint32_t data_size;
};

bool save_wav16(const std::string & fname, const std::vector<float> & data, int sample_rate);
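For reference, the 44-byte layout of `wav_header` can be reproduced with Python's `struct` module. The field order and sizes follow the C++ struct above; writing the struct directly from memory works here because every member sits on its natural alignment, so no padding is inserted:

```python
import struct

def pack_wav_header(sample_rate, num_samples, num_channels=1, bits_per_sample=16):
    """Pack a 44-byte RIFF/WAVE header matching the wav_header struct."""
    bytes_per_sample = bits_per_sample // 8
    data_size   = num_samples * num_channels * bytes_per_sample
    byte_rate   = sample_rate * num_channels * bytes_per_sample
    block_align = num_channels * bytes_per_sample
    return struct.pack(
        '<4sI4s4sIHHIIHH4sI',
        b'RIFF', 36 + data_size, b'WAVE',
        b'fmt ', 16, 1, num_channels,   # fmt chunk: size 16, audio_format 1 (PCM)
        sample_rate, byte_rate, block_align, bits_per_sample,
        b'data', data_size,
    )
```

This can be handy for sanity-checking an `output.wav` produced by the example: the first 44 bytes of the file should match a header packed with the same parameters.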
17 changes: 17 additions & 0 deletions examples/tts/CMakeLists.txt
@@ -3,3 +3,20 @@ add_executable(${TARGET} tts.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE llama common ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

add_library(mimi-model STATIC mimi-model.h mimi-model.cpp)
target_link_libraries(mimi-model PRIVATE llama common ${CMAKE_THREAD_LIBS_INIT})
# for using C++ designated initializers, TODO: can be changed back to C++17 in the future
target_compile_features(mimi-model PRIVATE cxx_std_20)

set(TARGET llama-mimi)
add_executable(${TARGET} mimi.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE llama common mimi-model ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

set(TARGET llama-tts-csm)
add_executable(${TARGET} tts-csm.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE llama common mimi-model ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)
47 changes: 47 additions & 0 deletions examples/tts/README-csm.md
@@ -0,0 +1,47 @@
# Sesame CSM

This demo shows how to run [Sesame CSM](https://github.com/SesameAILabs/csm) inference using llama.cpp / GGML.

It consists of 3 components, each with its own GGUF file:
1. Backbone LLM
2. Decoder LLM
3. Mimi decoder
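The three components chain text into semantic codes, then acoustic codes, then a waveform. The structural sketch below uses dummy stand-ins only; none of these function names come from llama.cpp, and the frame count, the 2048-entry codebook, and the 1920-samples-per-frame figure (80 ms at 24 kHz) are assumptions about a Mimi-style RVQ setup, not values taken from this PR:

```python
# Structural sketch of the CSM pipeline with dummy stand-in stages.

def backbone_generate(text):
    # stage 1: backbone LLM maps text to one semantic code per frame
    return [(len(text) * 37 + i) % 2048 for i in range(4)]

def decoder_generate(semantic_codes):
    # stage 2: decoder LLM expands each frame into acoustic codebook entries
    return [[(c + k) % 2048 for k in range(8)] for c in semantic_codes]

def mimi_decode(frames):
    # stage 3: Mimi decoder turns code frames into PCM float samples
    return [0.0 for _ in frames for _ in range(1920)]

def tts(text):
    semantic = backbone_generate(text)
    acoustic = decoder_generate(semantic)
    return mimi_decode(acoustic)
```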

## Quick start

By default, all GGUF files are downloaded from the [ggml-org Hugging Face account](https://huggingface.co/ggml-org/sesame-csm-1b-GGUF).

```sh
# build (make sure to have LLAMA_CURL enabled)
cmake -B build -DLLAMA_CURL=ON
cmake --build build -j --target llama-tts-csm

# run it
./build/bin/llama-tts-csm -p "[0]Hi, my name is Xuan Son. I am a software engineer at Hugging Face."
```

## Convert the model yourself

To get the GGUF:

```sh
python examples/tts/convert_csm_to_gguf.py

# default output files:
# sesame-csm-backbone.gguf
# sesame-csm-decoder.gguf

# optionally, quantize it
# (q8_0 is the lowest scheme; quantizing further is not worth it, quality degrades too much)
python examples/tts/convert_csm_to_gguf.py --outtype q8_0
```

Run the example using local files:

```sh
./build/bin/llama-tts-csm -m sesame-csm-backbone.gguf -mv kyutai-mimi.gguf -p "[0]Hello world."
# sesame-csm-decoder.gguf will be loaded automatically
# make sure to place both backbone and decoder GGUF files in the same directory

# output file: output.wav
```
50 changes: 50 additions & 0 deletions examples/tts/README-mimi.md
@@ -0,0 +1,50 @@
# llama.cpp/examples/mimi

This demonstrates running [Kyutai's Mimi](https://huggingface.co/kyutai/mimi) model via GGML.

## Quickstart

Convert the model to GGUF (no need to download anything manually; the script fetches the `safetensors` file automatically):

```sh
python examples/tts/convert_mimi_to_gguf.py

# output file: kyutai-mimi.gguf

# optionally, use q8_0 quantization for faster speed
python examples/tts/convert_mimi_to_gguf.py --outtype q8_0
```

Then compile and run it:

```sh
cmake --build build -j --target llama-mimi

./build/bin/llama-mimi kyutai-mimi.gguf codes.txt

# output: output.wav

# alternatively, use "dummy1" to get a "wah hello there" sample output file
./build/bin/llama-mimi kyutai-mimi.gguf dummy1
```

Example of a codes file (one code per line):

```
1263
1597
1596
1477
1540
1720
1433
118
1066
1968
1096
232
418
566
1653
2010
```
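A codes file in this format is trivial to generate or parse. A minimal sketch, assuming Mimi's codebook size of 2048 (so valid codes are 0..2047):

```python
def write_codes(path, codes):
    # one integer code per line, as read by llama-mimi
    with open(path, 'w') as f:
        f.write('\n'.join(str(c) for c in codes) + '\n')

def read_codes(path):
    with open(path) as f:
        codes = [int(line) for line in f if line.strip()]
    assert all(0 <= c < 2048 for c in codes), "code out of Mimi codebook range"
    return codes
```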