I created a CLI wrapper for Kitten TTS: <a href="https://github.com/newptcai/purr" rel="nofollow">https://github.com/newptcai/purr</a><p>BTW, it seems that kitten (the Python package) has the following chain of dependencies: kittentts → misaki[en] → spacy-curated-transformers<p>So if you install it directly via uv, it will pull torch and NVIDIA CUDA packages (several GB), which are not needed to run kitten.
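For anyone hitting the same thing: uv can be pointed at the CPU-only torch wheel index so the transitive dependency doesn't drag in the CUDA packages. A sketch of a pyproject.toml, assuming uv's index-pinning feature (project name and layout here are illustrative, not from the kittentts repo):

```toml
# Hypothetical pyproject.toml: pin torch to the CPU-only wheel index
# so spacy-curated-transformers doesn't pull several GB of CUDA deps.
[project]
name = "kitten-demo"
version = "0.1.0"
requires-python = ">=3.10,<3.13"   # kittentts declares <3.13
dependencies = ["kittentts"]

[tool.uv.sources]
torch = { index = "pytorch-cpu" }

[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
explicit = true   # only use this index for packages pinned to it
```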
by dawdler-purge
|
Mar 21, 2026, 12:13:18 PM
What I love about OpenClaw is that I was able to send it a message on Discord with just this GitHub URL, and it started sending me voice messages using it within a few minutes. It also gave me a bunch of different benchmarks and sample audio.<p>I'm impressed with the quality given the size. I don't love the voices, but it's not bad. Running on an Intel 9700 CPU, it's about 1.5x realtime using the 80M model. It wasn't any faster running on a 3080 GPU, though.
by kevin42
|
Mar 21, 2026, 12:13:18 PM
I created a demo running in the browser, on your device: <a href="https://next-voice.vercel.app" rel="nofollow">https://next-voice.vercel.app</a>
by g58892881
|
Mar 21, 2026, 12:13:18 PM
Was playing around a bit, and for its size it's very impressive. It just has issues pronouncing numbers. I tried to get it to generate "Startup finished in 135 ms."<p>I didn't expect it to pronounce 'ms' correctly, but the number sounded just like noise. Eventually I got an acceptable result for the string "Startup finished in one hundred and thirty five seconds."
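A caller can work around this by spelling digits out before synthesis. A minimal pure-Python sketch of that kind of text normalization (the `normalize` helper is hypothetical, not part of KittenTTS; the repo's own dependency list includes num2words, which would do this more thoroughly):

```python
import re

# Tiny number-to-words expander (0-9999) used as a pre-synthesis
# normalization pass, so the model never sees raw numerals.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen",
        "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
        "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n: int) -> str:
    if n < 20:
        return ONES[n]
    if n < 100:
        return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")
    if n < 1000:
        word = ONES[n // 100] + " hundred"
        return word + (" and " + number_to_words(n % 100) if n % 100 else "")
    word = number_to_words(n // 1000) + " thousand"
    return word + (" " + number_to_words(n % 1000) if n % 1000 else "")

def normalize(text: str) -> str:
    # Replace every digit run with its spelled-out form.
    return re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)

print(normalize("Startup finished in 135 ms."))
# -> Startup finished in one hundred and thirty-five ms.
```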
by __fst__
|
Mar 21, 2026, 12:13:18 PM
A very clear improvement from the first set of models you released some time ago. I'm really impressed. Thanks for sharing it all.
by daneel_w
|
Mar 21, 2026, 12:13:18 PM
Very cool :)
Looking forward to trying it out.<p>Maybe a dumb and slightly tangential question (I don't mean this as a criticism!): why not release a command-line executable?<p>Even the API reads like what you'd see in a manpage.<p>I get that it wouldn't be too much work for a user to actually make something like that; I'm just curious what the thought process is.
by geokon
|
Mar 21, 2026, 12:13:18 PM
You should post examples comparing the four models you released: the same text spoken by each.
by ks2048
|
Mar 21, 2026, 12:13:18 PM
I'd love to see a monolingual Japanese model sometime in the future. Qwen3-TTS works for Japanese in general, but from time to time it mixes in some Mandarin, making it unusable.
by _hzw
|
Mar 21, 2026, 12:13:18 PM
Good on-device TTS is an amazing accessibility tool. Thank you for building this. Way too many devices that need TTS rely on online services; this is much preferred.
by jacquesm
|
Mar 21, 2026, 12:13:18 PM
They sound like cartoon voices... but I really like them. I could listen to a book with those.
by nsnzjznzbx
|
Mar 21, 2026, 12:13:18 PM
I ran the install instructions and it pulled 7.1GB of deps; what do you mean, "tiny"?
by PunchyHamster
|
Mar 21, 2026, 12:13:18 PM
The size/quality tradeoff here is interesting. 25MB for a TTS model that's usable is a real achievement, but the practical bottleneck for most edge deployments isn't model size -- it's the inference latency on low-power hardware and the audio streaming architecture around it. Curious how this performs on something like a Raspberry Pi 4 for real-time synthesis. The voice quality tradeoff at that size usually shows up most in prosody and sentence-final intonation rather than phoneme accuracy.
by bobokaytop
|
Mar 21, 2026, 12:13:18 PM
One of the core features I look for is expressive control.<p>Either via the API, with pitch/speed/volume parameters for more deterministic control.<p>Or via expressive tags such as [coughs], [urgently], or [laughs in melodic ascending and descending arpeggiated gibberish babbles].<p>The 25MB model is amazingly good for being 25MB. How does it handle expressive tags?
by altruios
|
Mar 21, 2026, 12:13:18 PM
There are a number of recent, good-quality, small TTS models.<p>If the author doesn't describe some detail about the data, the training, a novel architecture, etc., I can only assume they took another one, did a little finetuning, and repackaged it as a new product.
by ks2048
|
Mar 21, 2026, 12:13:18 PM
To the folks here and the Kitten team: I'm working on TTS for an application and trying to find the best model at my latency/cost point for inference. I'm currently settling for Gemini TTS, which allows for a lot of expressiveness, but ~150ms per word starts to hurt when the content is a few sentences.<p>My current best approach is wrapping around gemini-flash native and having the model speak the text I send it, which gets me end-to-end latency under a second.<p>Are there other models at this pricing or better that I should be looking at?
by anilgulecha
|
Mar 21, 2026, 12:13:18 PM
The GitHub readme doesn't list this: what data was this trained on? Was it the voices of the creators, or was it trained on data scraped from the internet or other archives?
by jamamp
|
Mar 21, 2026, 12:13:18 PM
Great stuff. Is your team interested in the STT problem?
by boutell
|
Mar 21, 2026, 12:13:18 PM
Fingers crossed for a normal-sounding voice this time around. The cute Kitten voices are nice, but I want something I can take seriously when I'm listening to an audiobook.
by arcanemachiner
|
Mar 21, 2026, 12:13:18 PM
the dependency chain issue is a real barrier for edge deployment. i've been running tts models on a raspberry pi for a home automation project and anything that pulls torch + cuda makes the whole thing a non-starter. 25MB is genuinely exciting for that use case.<p>curious about the latency characteristics though. 1.5x realtime on a 9700 is fine for batch processing but for interactive use you need first-chunk latency under 200ms or the conversation feels broken. does anyone know if it supports streaming output or is it full-utterance only?<p>the phoneme-based approach should help with pronunciation consistency too. the models i've tried that work on raw text tend to mispronounce technical terms unpredictably — same word pronounced differently across runs.
by baibai008989
|
Mar 21, 2026, 12:13:18 PM
This is awesome, well done. I've been doing a lot of work with voice assistants; if you can replicate Qwen3-TTS-style voice cloning in this small form factor, you will be absolute legends!
by armcat
|
Mar 21, 2026, 12:13:18 PM
The example.py file says "it will run blazing fast on any GPU. But this example will run on CPU."<p>I couldn't locate how to run it on a GPU anywhere in the repo.
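Since the models ship as ONNX, the usual onnxruntime pattern would be to pass an execution-provider list when creating the session; whether the kittentts package exposes this is something I couldn't verify, so treat the wiring below as a guess (the .onnx filename is made up):

```python
def provider_preference(prefer_gpu: bool):
    # onnxruntime tries providers in order and falls back to the
    # next one if a provider isn't available on this machine.
    providers = ["CPUExecutionProvider"]
    if prefer_gpu:
        providers.insert(0, "CUDAExecutionProvider")
    return providers

# Hypothetical usage, assuming onnxruntime-gpu is installed and the
# model file has been downloaded locally:
#
#   import onnxruntime as ort
#   sess = ort.InferenceSession("kitten_tts.onnx",
#                               providers=provider_preference(True))

print(provider_preference(True))
```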
by pumanoir
|
Mar 21, 2026, 12:13:18 PM
A lot of good small TTS models have appeared recently. Most seem to struggle hard with prosody, though.<p>Kokoro TTS, for example, has a very good Norwegian voice, but the rhythm and emphasis are often so out of whack that the generated speech is almost incomprehensible.<p>I haven't had time to check this model out yet; how does it fare here? What's needed to improve models in this area now that the voice part is more or less solved?
by magicalhippo
|
Mar 21, 2026, 12:13:18 PM
How did you make a very small AI model (14M) sound more natural and expressive than even bigger models?
by swaminarayan
|
Mar 21, 2026, 12:13:18 PM
A lot of these models struggle with small text strings, like "next button" that screen readers are going to speak a lot.
by devinprater
|
Mar 21, 2026, 12:13:18 PM
Did they train this on @lauriewired's voice? The demo video sounds exactly like her at 0:18
by stbtrax
|
Mar 21, 2026, 12:13:18 PM
How much work would it be to use the C++ ONNX run-time with this instead of Python? Is it a Claudeable amount of work?<p>The iOS version is Swift-based.
by fwsgonzo
|
Mar 21, 2026, 12:13:18 PM
Would an Android app of this be able to replace the built-in TTS?
by vezycash
|
Mar 21, 2026, 12:13:18 PM
I thought they were going to make kitten sounds instead of speech
by agnishom
|
Mar 21, 2026, 12:13:18 PM
Thanks for open-sourcing this.<p>Is there any way to create a custom voice as a DIY, or do we need to go through you? If the latter, would you consider making a pricing page for purchasing a license/alternative voice? All but one of the voices are unusable in a business context.
by ilaksh
|
Mar 21, 2026, 12:13:18 PM
Nice, but it's weird that neither a language nor "English" is mentioned on the GitHub page; only from the "Release multilingual TTS" roadmap item could I guess it's probably English-only for now.
by spyder
|
Mar 21, 2026, 12:13:18 PM
How long until I can buy this as a chip for my Arduino projects?
by amelius
|
Mar 21, 2026, 12:13:18 PM
Only American voices? For some reason I'm only interested in Irish, British or Welsh accents. American is a no
by tim-projects
|
Mar 21, 2026, 12:13:18 PM
Found they struggle with numbers. Give it a random four-digit number in a sentence and it fumbles.
by Stevvo
|
Mar 21, 2026, 12:13:18 PM
Is this open-source or open-weights ML?
by pabs3
|
Mar 21, 2026, 12:13:18 PM
This would be great as a JS package; 25MB is small enough that I think it'd be worth it (in-browser TTS is still pretty bad and varies by browser).
by DavidTompkins
|
Mar 21, 2026, 12:13:18 PM
Thanks for working on this!<p>Is there any way to get this running on an iPhone? I would love to have it read articles to me like a podcast.
by great_psy
|
Mar 21, 2026, 12:13:18 PM
It is based on ONNX, so can I use it with transformers.js in the browser?
by sroussey
|
Mar 21, 2026, 12:13:18 PM
I'm still looking for the "perfect" setup to clone my voice and use it locally to send voice replies in Telegram via openclaw. Does anyone have such a setup?<p>I want to be my own personal assistant...<p>EDIT: I can provide it an RTX 3080 Ti.
by sschueller
|
Mar 21, 2026, 12:13:18 PM
Really cool to see innovation in terms of quality of tiny models. Great work!
by schopra909
|
Mar 21, 2026, 12:13:18 PM
Are there plans to output text alignment?
by gabrielcsapo
|
Mar 21, 2026, 12:13:18 PM
The <25MB figure is what stands out. Been wanting to add TTS to a few Next.js projects for offline/edge scenarios but model sizes have always made it impractical to ship.<p>At 25MB you can actually bundle it with the app. Going to test whether this works in a Vercel Edge Function context -- if latency is acceptable there it opens up a lot of use cases that currently require a round-trip to a hosted API.
by rsmtjohn
|
Mar 21, 2026, 12:13:18 PM
What's the actual install size for a working example? Like similar "tiny" projects, do these models actually require installing 1GB+ of dependencies?
by janice1999
|
Mar 21, 2026, 12:13:18 PM
I'm thinking of giving a "voice" to my virtual pets (think Pokemon, but fewer than a dozen). The pets are made-up animals based on real ones, like Mouseier from Mouse (something like that). Is this possible?<p>Tl;dr: generate a human-like voice based on an animal sound. Maybe it doesn't make sense.
by wiradikusuma
|
Mar 21, 2026, 12:13:18 PM
How noticeable is the difference in quality between the 4M model and the 80M model?
by erkoo
|
Mar 21, 2026, 12:13:18 PM
Is it English only?
by Tacite
|
Mar 21, 2026, 12:13:18 PM
This is great. Demo looks awesome.
by whitepaper27
|
Mar 21, 2026, 12:13:18 PM
So, one thing I noticed, and this could easily be user error, is that if I set the text & voice in the example to:<p><pre><code> text ="""
Hello world. This is Kitten TTS.
Look, it's working!
"""
voice = 'Luna'
</code></pre>
On macOS, I get "Kitten TTS", but on Linux, I get "Kit… TTS". Both OSes generate the same phonemes of,<p><pre><code> Phonemes: ðɪs ɪz kˈɪʔn ̩ tˌiːtˌiːˈɛs ,
</code></pre>
which makes me really confused as to where it's going off the rails on Linux, since from there it should just be invoking the model.<p>edit: it really helps to use the same model <i>facepalm</i>. It's the 80M model, and it happens on both OSes. Wildly, the nano model gets it better? I'm going to join the Discord lol.
by deathanatos
|
Mar 21, 2026, 12:13:18 PM
sounds amazing! does it stream? or is it so fast you don't need to?
by exe34
|
Mar 21, 2026, 12:13:18 PM
What's the training data for this?
by pabs3
|
Mar 21, 2026, 12:13:18 PM
Wow, what an amazing feat. Congratulations!
by moralestapia
|
Mar 21, 2026, 12:13:18 PM
This is something I've been looking for (the <50MB models in particular). Unfortunately my feedback is as follows:<p><pre><code> Downloading https://github.com/KittenML/KittenTTS/releases/download/0.8.1/kittentts-0.8.1-py3-none-any.whl (22 kB)
Collecting num2words (from kittentts==0.8.1)
Using cached num2words-0.5.14-py3-none-any.whl.metadata (13 kB)
Collecting spacy (from kittentts==0.8.1)
Using cached spacy-3.8.11-cp314-cp314-win_amd64.whl.metadata (28 kB)
Collecting espeakng_loader (from kittentts==0.8.1)
Using cached espeakng_loader-0.2.4-py3-none-win_amd64.whl.metadata (1.3 kB)
INFO: pip is looking at multiple versions of kittentts to determine which version is compatible with other requirements. This could take a while.
ERROR: Ignored the following versions that require a different python version: 0.7.10 Requires-Python >=3.8,<3.13; 0.7.11 Requires-Python >=3.8,<3.13; 0.7.12 Requires-Python >=3.8,<3.13; 0.7.13 Requires-Python >=3.8,<3.13; 0.7.14 Requires-Python >=3.8,<3.13; 0.7.15 Requires-Python >=3.8,<3.13; 0.7.16 Requires-Python >=3.8,<3.13; 0.7.17 Requires-Python >=3.8,<3.13; 0.7.5 Requires-Python >=3.8,<3.13; 0.7.6 Requires-Python >=3.8,<3.13; 0.7.7 Requires-Python >=3.8,<3.13; 0.7.8 Requires-Python >=3.8,<3.13; 0.7.9 Requires-Python >=3.8,<3.13; 0.8.0 Requires-Python >=3.8,<3.13; 0.8.1 Requires-Python >=3.8,<3.13; 0.8.2 Requires-Python >=3.8,<3.13; 0.8.3 Requires-Python >=3.8,<3.13; 0.8.4 Requires-Python >=3.8,<3.13; 0.9.0 Requires-Python >=3.8,<3.13; 0.9.2 Requires-Python >=3.8,<3.13; 0.9.3 Requires-Python >=3.8,<3.13; 0.9.4 Requires-Python >=3.8,<3.13; 3.8.3 Requires-Python >=3.9,<3.13; 3.8.5 Requires-Python >=3.9,<3.13; 3.8.6 Requires-Python >=3.9,<3.13; 3.8.7 Requires-Python >=3.9,<3.14; 3.8.8 Requires-Python >=3.9,<3.14; 3.8.9 Requires-Python >=3.9,<3.14
ERROR: Could not find a version that satisfies the requirement misaki>=0.9.4 (from kittentts) (from versions: 0.1.0, 0.3.0, 0.3.5, 0.3.9, 0.4.0, 0.4.4, 0.4.5, 0.4.6, 0.4.7, 0.4.8, 0.4.9, 0.5.0, 0.5.1, 0.5.2, 0.5.3, 0.5.4, 0.5.5, 0.5.6, 0.5.7, 0.5.8, 0.5.9, 0.6.0, 0.6.1, 0.6.2, 0.6.3, 0.6.4, 0.6.5, 0.6.6, 0.6.7, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.7.4)
ERROR: No matching distribution found for misaki>=0.9.4
</code></pre>
I realize that I can run multiple versions of Python on my system and use venv to manage them (or whatever equivalent is now trendy), but as I near retirement age all these deep dependency nets required by modern software really depress me. Have you ever tried to build a Node app that hasn't been updated in 18 months? It can't be done. Old man yelling at cloud, I guess <i>shrugs</i>.
by tredre3
|
Mar 21, 2026, 12:13:18 PM
25MB is impressive. What's the tradeoff vs the 80M model — is it mainly voice quality or does it also affect pronunciation accuracy on less common words?
by Remi_Etien
|
Mar 21, 2026, 12:13:18 PM