added rudimentary support for outetts v0.3 500m and 1b models #11287

Open
wants to merge 2 commits into
base: master

LostRuins
Collaborator

Hi @ggerganov @edwko

This PR adds rudimentary support for the newly released OuteTTS v0.3 500m and 1b models, found at https://huggingface.co/OuteAI/OuteTTS-0.3-500M-GGUF and https://huggingface.co/OuteAI/OuteTTS-0.3-1B-GGUF

This will allow loading and generating with the new models, although crucially it ignores the new punctuation tokens. I had previously added them in my own fork, but they come with a lot of edge cases that may not be so easy to untangle, since they are grouped with other tokens and there are degenerate cases (e.g. www..!...google....com??) that will cause problems if they are simply swapped in as is.

The model types are differentiated by attempting to tokenize <|space|>, which is a single token in v0.3 but not in earlier versions. For the 1B model, the token <|0|> has a different offset, so the offset is now determined dynamically. The existing speaker voice is retained, but I swapped out your hardcoded token array for runtime tokenization for the same reason (and also to adapt it to the v0.3 format).

The PR remains compatible with v0.2 and should be able to load all 3 model types.
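The detection described above can be sketched roughly as follows. This is an illustration in Python rather than the PR's C++ code: `tokenize` is a hypothetical stand-in for the model tokenizer with special-token parsing enabled, and the toy vocabularies and token ids are invented for the example.

```python
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List[int]]


def make_toy_tokenizer(specials: Dict[str, int]) -> Tokenizer:
    """Hypothetical stand-in for a real tokenizer: known special tokens map to
    a single id, everything else falls back to one id per character."""
    def tokenize(text: str) -> List[int]:
        if text in specials:
            return [specials[text]]
        return [ord(ch) for ch in text]
    return tokenize


def is_v0_3(tokenize: Tokenizer) -> bool:
    # <|space|> is a single token only in the v0.3 vocabulary.
    return len(tokenize("<|space|>")) == 1


def audio_code_offset(tokenize: Tokenizer) -> int:
    # Look up the id of <|0|> at runtime instead of hardcoding it,
    # since the 1B model places it at a different offset.
    ids = tokenize("<|0|>")
    if len(ids) != 1:
        raise ValueError("<|0|> is not a single token in this vocabulary")
    return ids[0]


# Made-up ids, for illustration only.
toy_v02 = make_toy_tokenizer({"<|0|>": 1000})
toy_v03 = make_toy_tokenizer({"<|0|>": 2000, "<|space|>": 1999})
```

With the toy tokenizers, `is_v0_3(toy_v03)` is true and `is_v0_3(toy_v02)` is false, while `audio_code_offset` returns whatever id the given vocabulary assigns to <|0|>.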

It is actually ready to merge as-is, but feel free to make whatever changes you deem necessary. Cheers!

@LostRuins LostRuins requested a review from ggerganov January 18, 2025 10:58
@edwko

edwko commented Jan 18, 2025

Yeah, that’s why in the library I grouped them before and after words. It might not be the best solution, but it works:

Input: www..!...google....com??

Converts to:

<|im_start|>
<|text_start|>www<|period|><|period|><|exclamation_mark|><|period|><|period|><|period|><|space|>google<|period|><|period|><|period|><|period|><|space|>com<|question_mark|><|question_mark|><|text_end|>
<|audio_start|>
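A minimal sketch of how a list of words with attached punctuation could be rendered into the prompt shown above. The names `PUNCT_TOKENS`, `render_word`, and `build_text_prompt` are hypothetical and not taken from the OuteTTS library, and the token map covers only the punctuation tokens that appear in this thread.

```python
# Only the punctuation tokens named in this thread; the real v0.3 set is larger.
PUNCT_TOKENS = {
    ".": "<|period|>",
    "!": "<|exclamation_mark|>",
    "?": "<|question_mark|>",
}


def render_word(entry: dict) -> str:
    """Render one word with the punctuation grouped before and after it.
    Raises KeyError for punctuation that is not in PUNCT_TOKENS."""
    before = "".join(PUNCT_TOKENS[c] for c in entry.get("before", ""))
    after = "".join(PUNCT_TOKENS[c] for c in entry.get("after", ""))
    return before + entry["word"] + after


def build_text_prompt(words: list) -> str:
    """Join rendered words with <|space|> and wrap them in the prompt markers."""
    body = "<|space|>".join(render_word(w) for w in words)
    return "\n".join([
        "<|im_start|>",
        "<|text_start|>" + body + "<|text_end|>",
        "<|audio_start|>",
    ])


# The input www..!...google....com?? after grouping punctuation onto words.
example_words = [
    {"word": "www", "after": "..!..."},
    {"word": "google", "after": "...."},
    {"word": "com", "after": "??"},
]
example_prompt = build_text_prompt(example_words)
```

Note that the separator between words is always a single <|space|>, even where the input had no whitespace, which is how the example above ends up with <|space|> before google.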

@LostRuins
Collaborator Author

LostRuins commented Jan 18, 2025

Yeah, but even if the TTC (text-to-codes) part works, I think the CTS (codes-to-speech) part might fail. I can definitely do that if you think it's better.

@@ -371,7 +371,7 @@ static std::string replace_numbers_with_words(const std::string & input_text) {
 }

 // Based on: https://github.com/edwko/OuteTTS/blob/a613e79c489d8256dd657ea9168d78de75895d82/outetts/version/v1/prompt_processor.py#L39
-static std::string process_text(const std::string & text) {
+static std::string process_text(const std::string & text, bool is_version_0_3) {
Collaborator


btw to check if the version is 0.3, you can use:

bool is_version_0_3 = common_get_builtin_chat_template(model) == "outetts-0.3";

@edwko I planned to add this as a dedicated GGUF meta key, but it turns out I still haven't had the time to implement it. I'll try to do this next week! And btw congrats on the release of v0.3 😄

@LostRuins
Collaborator Author

@edwko how is this case currently handled for you:

google .. . . com

I had issues when encountering fragments that contain only spaces and punctuation but no readable text. The narration breaks down once such a fragment is encountered.

@edwko

edwko commented Jan 19, 2025

@LostRuins All punctuation is merged into the closest word in cases like google .. . . com:

<|im_start|>
<|text_start|>google<|period|><|period|><|period|><|period|><|space|>com<|text_end|>
<|audio_start|>

Speech generation works fine if you follow this format. I just tested both google .. . . com and www..!...google....com??, and everything was generated correctly.

@LostRuins
Collaborator Author

<|text_start|>google<|period|><|period|><|period|><|period|><|space|>com<|text_end|>

I noticed you removed the in-between spaces. What are the rules for that? The naive approach would generate

<|text_start|>google<|period|><|period|><|space|><|period|><|space|><|space|><|period|><|space|>com<|text_end|>

@edwko

edwko commented Jan 19, 2025

It processes the text like this:
google .. . . com -> google.... com -> to prompt
For example, if the text was:
google .. . . ..com . . -> google.... ..com.. -> to prompt

Here’s the implementation for this:
_process_text; in addition, self.normalize_token_spacing constructs the spacing correctly.
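As a rough illustration of the merging behaviour described above, here is a minimal sketch. It is not the OuteTTS `_process_text` implementation: the rules are inferred only from the three examples in this thread, the names `group_punctuation` and `normalize` are hypothetical, and only `.`, `!` and `?` are treated as punctuation.

```python
import re

PUNCT = ".!?"  # only the marks used in this thread

# A run is a stretch of punctuation, a stretch of whitespace, or a word.
_RUN = re.compile(r"[.!?]+|\s+|[^\s.!?]+")


def group_punctuation(text: str) -> list:
    """Attach every punctuation run to a neighbouring word.

    Inferred rules: punctuation touching the previous word goes after it;
    otherwise punctuation touching the next word goes before that word;
    free-floating punctuation goes after the previous word.
    """
    runs = _RUN.findall(text)
    words = []    # each entry: {"before": str, "word": str, "after": str}
    pending = ""  # punctuation waiting to be placed before the next word
    for idx, run in enumerate(runs):
        if run.isspace():
            continue
        if run[0] not in PUNCT:
            words.append({"before": pending, "word": run, "after": ""})
            pending = ""
            continue
        # Adjacent runs never share a class, so a non-space neighbour is a word.
        prev_is_word = idx > 0 and not runs[idx - 1].isspace()
        next_is_word = idx + 1 < len(runs) and not runs[idx + 1].isspace()
        if prev_is_word:
            words[-1]["after"] += run
        elif next_is_word or not words:
            pending += run
        else:
            words[-1]["after"] += run
    # Input with punctuation but no word at all yields an empty list.
    return words


def normalize(text: str) -> str:
    """Rebuild the text with punctuation merged onto words."""
    return " ".join(w["before"] + w["word"] + w["after"] for w in group_punctuation(text))
```

Under these assumptions, `normalize("google .. . . com")` gives `google.... com` and `normalize("google .. . . ..com . .")` gives `google.... ..com..`, matching the two conversions listed above. Text that contains punctuation but no word produces an empty word list, so the caller has to decide how to handle that case.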

When constructing the words back to create the audio prompt, it joins the punctuation like this:

word = i["word"]
if i.get("before", []):
    word = "".join(i["before"]) + word
if i.get("after", []):
    word += "".join(i["after"])

@LostRuins
Collaborator Author

Yeah, anyway this is exactly what I meant by the various edge cases that may need to be untangled regarding punctuation, which is why I initially excluded it.

Perhaps we can consider starting with this, and then expanding the implementation? Happy for someone to improve upon it here, either before or after merging.

recommended way to check if the version is 0.3, as requested by ngxson
@LostRuins LostRuins requested a review from ngxson January 19, 2025 13:44