What you basically need is a way to trigger actions at specified timed cues. You can use the System "Compare two values" condition and compare the cue time against the playback time.
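One thing to watch out for: the playback time is only sampled once per tick, so it will almost never be exactly equal to a cue time, and an "equal to" comparison tends to miss. Here is a minimal sketch of the idea in TypeScript (not C2 events; `crossedCue` and its parameters are hypothetical names I made up for illustration). It checks whether the cue falls between the previous and the current sample:

```typescript
// Hypothetical helper: true when the cue time lies between the previous
// and the current playback-time samples (all values in seconds).
function crossedCue(prevTime: number, currentTime: number, cueTime: number): boolean {
  return prevTime < cueTime && cueTime <= currentTime;
}

// Simulated playback-time samples at roughly 60 ticks per second.
// The cue sits at 0.5 s, which none of the samples hits exactly.
const samples = [0.483, 0.499, 0.516, 0.533];
const hits: number[] = [];
for (let i = 1; i < samples.length; i++) {
  if (crossedCue(samples[i - 1], samples[i], 0.5)) {
    hits.push(samples[i]);
  }
}
console.log(hits); // the cue fires once, on the 0.516 sample
```

In event terms that would roughly be two comparisons (previous time less than the cue, current time greater or equal), with the previous time stored in a variable at the end of every tick.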
The thing is that C2 or HTML5 sound isn't reliable when it comes to timing or synchronizing audio. Take a look at the attached example. Even though I use an "All preloads complete" condition to trigger the music, it takes half a second for it to actually start. That makes syncing a tedious and unreliable task.
Of course, you will have to figure out a different system to trigger actions than the one I have in the example, because otherwise you will lose count after just a few seconds/events...
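One possible way to avoid losing count, sketched in TypeScript rather than C2 events: keep the cues sorted by time and advance a single index, firing every cue whose time is already behind the current playback time. The `CueQueue` class and the cue names are hypothetical. The sketch also assumes that the value passed to `update` is the audio's own playback time, not a separate timer, so a delayed start should not shift the cues.

```typescript
// A cue: a time in seconds and the action to run when playback reaches it.
interface Cue {
  time: number;
  action: () => void;
}

// Hypothetical cue list: sorted once, then walked with a single index.
class CueQueue {
  private cues: Cue[];
  private next = 0; // index of the first cue that has not fired yet

  constructor(cues: Cue[]) {
    // Copy before sorting so the caller's array is left untouched.
    this.cues = cues.slice().sort((a, b) => a.time - b.time);
  }

  // Call once per tick with the audio's current playback time.
  // Fires every pending cue at or before that time, so a slow tick
  // that skips past several cues still triggers all of them in order.
  update(playbackTime: number): void {
    while (this.next < this.cues.length && this.cues[this.next].time <= playbackTime) {
      this.cues[this.next].action();
      this.next++;
    }
  }
}

const fired: string[] = [];
const queue = new CueQueue([
  { time: 2.0, action: () => { fired.push("flash"); } },
  { time: 0.5, action: () => { fired.push("spawn"); } },
  { time: 1.0, action: () => { fired.push("shake"); } },
]);

queue.update(0.4); // nothing is due yet
queue.update(1.1); // "spawn" and "shake" both fire on this tick
queue.update(2.5); // "flash" fires
console.log(fired);
```

Because the index only ever moves forward, each cue fires exactly once and there is no per-event counter to keep in sync. The trade-off is that seeking backwards in the track would need the index to be reset.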