It doesn’t look like much is available beyond what the included text-to-speech plugin does. There are only a few local voices, and although the cloud offers more, the selection still seems limited. Maybe it’s hard to make one?
From what I can tell, there are two parts to doing text to speech:
1. Convert the text to a list of phonemes (the individual sounds) that make up the words. In principle that means applying all the English pronunciation rules to the text, which could be tedious. A shortcut would be to run all the dialog through a website that does the conversion beforehand, which would also make the code simpler.
2. Have a recorded sound for each phoneme, along with its length, so you can play back that list. English has about 44 phonemes (the exact count depends on the dialect), so it’s mostly busywork to record and trim the recordings. Better playback would vary volume, pitch and speed to replicate speech more closely, but knowing how to vary them would take more expertise. An advantage of doing it with plain sounds is that you can use any feature the audio plug-in provides.
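To make the two steps a bit more concrete, here is a rough Python sketch, not a working voice. Everything in it is an assumption for illustration: the tiny `PRONUNCIATIONS` dictionary (ARPAbet-style symbols standing in for the pre-converted dialog), the clip lengths in `CLIP_LENGTH_MS`, and the function names. It only produces a timed list of phonemes; the actual playback would still be up to the audio plug-in.

```python
# Hypothetical mini pronunciation dictionary using ARPAbet-style symbols.
# A real project would need the full dialog converted beforehand.
PRONUNCIATIONS = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

# Assumed clip lengths in milliseconds; in practice these would be
# measured from the trimmed recordings.
CLIP_LENGTH_MS = {
    "HH": 80, "AH": 120, "L": 90, "OW": 180,
    "W": 90, "ER": 150, "D": 70,
}

PAUSE = "_"      # marker for a short silence between words
PAUSE_MS = 100   # assumed length of that silence


def text_to_phonemes(text):
    """Step 1: turn text into a flat list of phoneme symbols."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?;:\"'")
        if not word:
            continue
        if word not in PRONUNCIATIONS:
            raise KeyError(f"no pronunciation for {word!r}")
        if phonemes:
            phonemes.append(PAUSE)  # separate words with a pause
        phonemes.extend(PRONUNCIATIONS[word])
    return phonemes


def build_playlist(phonemes):
    """Step 2: give each phoneme a start time and length in ms."""
    playlist = []
    start = 0
    for symbol in phonemes:
        length = PAUSE_MS if symbol == PAUSE else CLIP_LENGTH_MS[symbol]
        playlist.append((symbol, start, length))
        start += length
    return playlist


if __name__ == "__main__":
    for entry in build_playlist(text_to_phonemes("Hello, world!")):
        print(entry)
```

Keeping the lookup and the timing separate means the dictionary could be swapped for pre-converted dialog without touching the playback side.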
A prototype of the idea could be to record a few phonemes, enough to build some words, and see how it sounds. It would likely be fairly monotone and robotic.
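A prototype along those lines could look something like the Python sketch below, which uses only the standard library. Since I don't have recordings to share, `placeholder_clip` generates plain sine tones as stand-ins for recorded phonemes; the sample rate, function names and tone frequencies are all assumptions. With real recordings you would load the samples from the trimmed files instead.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 22050  # assumed sample rate shared by all clips


def placeholder_clip(freq_hz, length_ms):
    """Stand-in for a recorded phoneme: a sine tone as 16-bit samples."""
    count = SAMPLE_RATE * length_ms // 1000
    return [
        int(12000 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(count)
    ]


def join_clips(clips):
    """Play the clips back to back by concatenating their samples."""
    samples = []
    for clip in clips:
        samples.extend(clip)
    return samples


def write_wav(target, samples):
    """Write mono 16-bit samples to a file path or file-like object."""
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit audio
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))


if __name__ == "__main__":
    # Three made-up "phonemes" joined into one "word".
    word = join_clips([
        placeholder_clip(220, 100),
        placeholder_clip(330, 150),
        placeholder_clip(262, 120),
    ])
    write_wav("prototype.wav", word)
```

Joining the clips with no blending at all is the simplest thing that can work, so the result should be expected to sound abrupt at every boundary.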
The state of the art seems to use neural networks: one to extract phonemes from a sample of speech and a different one to blend the phonemes together so they sound less robotic. But that’s outside the scope of my knowledge.
That said, I’m just sharing some ideas. I don’t have the time or expertise to make a complete solution right now.
Edit: I tried a simple test where I recorded the individual sounds and then combined them manually. It came out pretty rough. More research is needed.