That seems very complex, but interesting.
If you want to analyse an audio file (like Shazam does, for example), you need some pretty complex math, like applying Fourier transforms to split the audio into its component sine waves (frequencies). From those frequency components you could distinguish phonetic values using some machine learning algorithm, and then map each one to a series of transformations of the mouth.
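To give a rough idea of just the first step, here's a minimal sketch of a Fourier transform in plain Python. This is not how Shazam or any real tool does it (they would use a fast FFT library and a lot more processing on top). The function names, the 400 Hz sample rate and the synthetic two-tone signal are all made up for illustration.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Naive O(n^2) discrete Fourier transform.

    Returns the magnitude of each frequency bin from 0 up to n/2.
    """
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(samples):
            # Correlate the signal with a complex sinusoid of frequency bin k.
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc))
    return mags


def dominant_frequency(samples, sample_rate):
    """Return the frequency (in Hz) of the strongest non-DC bin."""
    mags = dft_magnitudes(samples)
    # Skip bin 0, which only holds the constant (DC) offset.
    peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])
    return peak_bin * sample_rate / len(samples)


# Hypothetical test signal: one second of a 50 Hz tone plus a quieter 120 Hz tone.
SAMPLE_RATE = 400
signal = [
    math.sin(2 * math.pi * 50 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 120 * t / SAMPLE_RATE)
    for t in range(SAMPLE_RATE)
]

print(dominant_frequency(signal, SAMPLE_RATE))
```

And that only tells you which frequencies are in a clean, artificial signal. Real speech is noisy and changes over time, so you'd have to run this over short overlapping windows before any phoneme detection could even start.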
I actually think you severely underestimate the complexity and math behind this.
No offense, though.