You mean like a talking book for children?
It can be done; the hard bit is synchronising the text and the audio.
It should be possible (though realistically difficult in C2) to analyse the audio to find the pauses, especially if you speak slowly, and then match these up with your text split on spaces. The result would be a JSON structure containing, for each word, the start point and length of the corresponding audio. This is easiest if the sound is in some sort of raw audio format, i.e. a sequence of byte, word or long values sampled at a constant rate, which would show the waveform of the sound if you plotted them. That is easier than decoding WAV etc. yourself, and you can then look for periods of silence in the samples. I don't think this is a job for C2 itself. Possibly Python has a library that handles sound? Java maybe? sox could probably do the format conversion for you. Not sure.
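To give a rough idea of the silence-hunting step, here is a minimal Python sketch. It is only an illustration under some assumptions of mine: the audio is headerless 16-bit signed little-endian mono, there is exactly one pause between each pair of words, and the function names (read_raw_pcm16, find_spoken_segments, align_words) plus the threshold and min_silence values are made up for the example and would need tuning against a real recording.

```python
import json
import struct


def read_raw_pcm16(path):
    """Read headerless 16-bit signed little-endian mono samples (assumed format)."""
    with open(path, "rb") as f:
        data = f.read()
    count = len(data) // 2
    return struct.unpack("<%dh" % count, data[:count * 2])


def find_spoken_segments(samples, rate, threshold=500, min_silence=0.15):
    """Return (start_seconds, length_seconds) for each run of non-silent audio.

    A sample counts as "loud" if its absolute value reaches threshold.
    A segment ends once min_silence seconds pass without a loud sample,
    so brief zero crossings inside a word do not split it.
    """
    min_gap = int(min_silence * rate)
    segments = []
    start = None      # index of the first loud sample in the current segment
    last_loud = None  # index of the most recent loud sample
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            if start is None:
                start = i
            last_loud = i
        elif start is not None and i - last_loud >= min_gap:
            segments.append((start / rate, (last_loud + 1 - start) / rate))
            start = None
    if start is not None:
        # The audio ended while a segment was still open.
        segments.append((start / rate, (last_loud + 1 - start) / rate))
    return segments


def align_words(text, segments):
    """Pair each space-separated word with one detected segment, in order."""
    words = text.split()
    if len(words) != len(segments):
        raise ValueError(
            "found %d segments for %d words; adjust threshold/min_silence"
            % (len(segments), len(words))
        )
    return [
        {"word": w, "start": round(s, 3), "length": round(n, 3)}
        for w, (s, n) in zip(words, segments)
    ]


if __name__ == "__main__":
    # Synthetic stand-in for a recording: two bursts of sound with gaps.
    rate = 1000
    burst = [1000, -1000]
    samples = [0] * 100 + burst * 100 + [0] * 300 + burst * 50 + [0] * 100
    segments = find_spoken_segments(samples, rate)
    print(json.dumps(align_words("hello world", segments), indent=2))
```

For a real recording you would replace the synthetic samples with read_raw_pcm16("speech.raw") and pass the actual sample rate. Something along the lines of `sox speech.wav -t raw -r 16000 -e signed-integer -b 16 -c 1 speech.raw` should produce that format, but check the sox man page for your version. If the number of segments does not match the number of words, the sketch just raises an error rather than guessing, since a wrong alignment would be worse than none.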