I've had a bit of time so I was able to experiment with some stuff relating to this.
Here's a reference capx
Here are some suggestions about how to approach it using the capx as a reference.
Sequencing audio was done using the Audio "Schedule next play" action. The example sounds I've used are very short, but in your case it is speech, so the important thing is to know how long each speech audio file is so you can schedule the next one appropriately. In the capx you will see "Schedule next play for Audio.CurrentTime+1". The +1 is 1 second. Replace it with however many seconds need to pass before the next Play action should start, which is normally the length of the audio that plays before it.
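To make the timing concrete, here is a rough sketch in Python (not Construct events) of how the offsets add up when sounds play back to back. The function name and the durations are made up for illustration, and `start` just stands in for Audio.CurrentTime at the moment you schedule.

```python
def schedule_times(durations, start=0.0):
    """Return the time at which each clip should start so they play in sequence.

    durations: length of each clip in seconds, in playback order.
    start: stand-in for Audio.CurrentTime when scheduling begins (assumption).
    """
    times = []
    t = start
    for d in durations:
        times.append(t)
        t += d  # the next clip starts once this one has finished
    return times


# Three hypothetical speech clips of 1.0 s, 2.5 s and 0.5 s
offsets = schedule_times([1.0, 2.5, 0.5], start=10.0)
```

Each value in `offsets` is what you would put in the "Schedule next play" action just before the matching Play action.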
Next, an overview of how to organise the project's logic. It's a bit difficult to explain why something is organised the way it is, but the way I thought best to solve the problem was to figure out the basic functionality that you want from the project.
1.) you want to be able to sequence sounds.
2.) you want to be able to stop _only_ that sequence of sounds on a key press.
3.) you want to play sounds before you switch scenes.
4.) you want to play a background sound that doesn't change.
For #1 you should be able to work out the technique to sequence the sounds. In the capx, the group "AUDIO FUNCTIONS" shows how I approached it.
For #2 you should be able to identify those sounds that are 'stoppable' by a user keypress. I've done this by using unique tags, and setting an integer variable to keep track of how many sounds are actually playing. For example, in StartFGAudio, I have 4 sounds playing, and tagged them p1, p2, p3, p4. Then I set a global variable ndxs to 4. This records that there are 4 sounds playing.
When I decide to stop the sounds, I use a for loop, concatenating 'p'&ndxs, which gives me p1, p2, p3, p4, and I use those strings to stop the audio.
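As a rough illustration of the tag idea (in Python rather than Construct events), the sketch below builds the tag names from a count and stops only those. The set of playing tags, the helper names and the "bg" tag are assumptions for the example; the capx uses the Audio object's own tags instead.

```python
def stoppable_tags(count, prefix="p"):
    """Build the tags p1..p<count> by joining the prefix and a running index."""
    return [prefix + str(i) for i in range(1, count + 1)]


def stop_foreground(playing, ndxs):
    """Remove only the p1..p<ndxs> tags from the set of playing tags.

    Any other tag (for example a background sound) is left untouched.
    """
    for tag in stoppable_tags(ndxs):
        playing.discard(tag)  # stands in for "stop audio with this tag"
    return playing


# "bg" is a hypothetical tag for the background sound
still_playing = stop_foreground({"p1", "p2", "p3", "p4", "bg"}, 4)
```

Because the loop only ever touches tags that start with the prefix, anything with a different tag keeps playing.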
For #3 you want the same thing as #2, but you need a trigger to let you know when a certain audio has ended. In the function StartTransitionAudio, 2 sounds are played (this is to simulate your 'footstep' audio). I play them as a sequence, with the last audio having a unique tag called "transition_last". This is the tag that I'm going to be looking for in order to transition to the next scene.
In the group AUDIO TRIGGERS, you will see the trigger for this. Again, it only fires after the sound has ended. If you tried to change scenes inside the StartTransitionAudio function instead, the scene would change before the audio has finished playing. If you like that behaviour then you can place the GoToScene call there.
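Here is a small Python sketch of that trigger logic, just to show the idea of reacting only to the last tag. The handler name, the callback and the "transition_first" tag are hypothetical; only "transition_last" comes from the capx.

```python
def on_audio_ended(tag, go_to_scene, next_scene):
    """Change scene only when the last sound of the transition has ended.

    tag: the tag of the sound that just finished.
    go_to_scene: callback that performs the scene change (assumption).
    Returns True if the scene change was triggered.
    """
    if tag == "transition_last":
        go_to_scene(next_scene)
        return True
    return False  # some other sound ended, so do nothing


visited = []
on_audio_ended("transition_first", visited.append, "secondscene")  # ignored
on_audio_ended("transition_last", visited.append, "secondscene")   # triggers
```

Calling the scene change from the "ended" handler rather than from the function that starts the audio is what keeps the scene on screen until the footsteps finish.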
For #4, the background sound is played with its own unique tag. As long as you don't stop the audio with this tag, it will keep playing.
Now, lastly, the overall organisation is about putting things into their proper context, and I think that's why you had thought of using an FSM. The basic idea you were going for was to tie things to a state, like a scene or layout, and have everything related to that scene appear. That's fine, but the problem is that you still need to control everything else. So the FSM is not really necessary, but having a clear indicator of your current state or scene is important.
For me, I thought of using a 'scene' global variable to keep track of my current scene. When I change this scene variable to something else, all the other functions that need context will change as needed. For example, when I call fn.Call("StartFGAudio",scene), the function will know what sound to play, because I told it to play the 'machine gun' audio when the current scene is "secondscene".
But then I prefer to make a more general function called fn.Call("GoToScene",scene), because I can call fn.Call("StartFGAudio",scene) as well as load any other layout that's related to that scene.
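A minimal Python sketch of that arrangement could look like the following. The dictionary and the `state` object are assumptions for illustration; the only mapping taken from the capx is "secondscene" playing the 'machine gun' audio.

```python
# Which foreground sounds belong to which scene (only this entry is from the capx)
FG_AUDIO = {"secondscene": ["machine gun"]}


def start_fg_audio(scene):
    """Return the foreground sounds to play for the given scene."""
    return FG_AUDIO.get(scene, [])  # unknown scenes play nothing


def go_to_scene(scene, state):
    """Set the current scene and start everything that depends on it."""
    state["scene"] = scene                   # the 'scene' global variable
    state["playing"] = start_fg_audio(scene)
    # loading the layout for this scene would also go here
    return state


state = go_to_scene("secondscene", {})
```

The point is that the caller only names the scene, and each function decides for itself what that scene means.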
Anyway, I hope the capx explains it better than I have written.