Thanks for the sample file.
I was afraid it was the GPU-CPU data transfer...
Unfortunately, I can't delay the color data collection by a tick, as the number of points to be sampled varies and the sampled data is used to construct just a portion of the final rendered frame.
To give a bit more info, the colors are sampled from a depth pass of the scene and determine whether a "light locator" (just a uniformly colored Sprite) is occluded by other objects. If the sample indicates it's not, a lens flare is drawn; if it is, nothing is drawn and the loop moves on to the next locator.
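In case it helps to see the logic spelled out, here is a rough Python sketch of that per-locator check. It is not the actual engine code: the depth pass is assumed to be a plain 2D list of floats, and `visible_locators`, `locator_value` and `tol` are made-up names for illustration. A locator counts as visible when the sample at its coordinates still shows the locator's own uniform value.

```python
from typing import List, Tuple

Locator = Tuple[int, int]  # hypothetical: screen-space (x, y) of a light locator


def visible_locators(depth_pass: List[List[float]],
                     locators: List[Locator],
                     locator_value: float = 1.0,
                     tol: float = 1e-3) -> List[Locator]:
    """Return the locators whose sampled value matches the locator's own value.

    Assumption: an un-occluded locator shows up in the depth pass as
    `locator_value`; anything else means another object covers it.
    """
    visible = []
    for x, y in locators:
        sample = depth_pass[y][x]          # one sample per locator
        if abs(sample - locator_value) <= tol:
            visible.append((x, y))         # not occluded: a flare would be drawn here
        # occluded: do nothing and move on to the next locator
    return visible
```

The expensive part in practice is reading each sample back to the CPU, which this sketch hides behind a simple list lookup.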
Since I posted, I came up with a cruder way of getting this "occlusion" data: the Capture Canvas still grabs a part of the depth pass at the Light Locator coordinates and the lens flare is always drawn, but then the Capture Canvas gets stretched to fit the entire screen and is multiplied over the flare. Non-occluded Capture Canvases will be white, giving no change, while occluded ones will be black, blacking out the flare.
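The multiply trick is simple enough to show in a few lines of Python. This is only a sketch of the arithmetic, not what the GPU actually runs: the flare layer is assumed to be a 2D list of RGB tuples in the 0..1 range, the stretched Capture Canvas is assumed to collapse to a single uniform value, and `multiply_mask` is a made-up helper name.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]  # RGB, each channel in 0..1


def multiply_mask(flare: List[List[Pixel]], mask_value: float) -> List[List[Pixel]]:
    """Multiply every pixel of the flare layer by one uniform mask value.

    Assumption: the stretched capture is a flat color, so white (1.0)
    leaves the flare unchanged and black (0.0) blacks it out.
    """
    return [[tuple(channel * mask_value for channel in pixel) for pixel in row]
            for row in flare]


flare_layer = [[(0.5, 0.25, 1.0)]]
shown = multiply_mask(flare_layer, 1.0)   # non-occluded: flare unchanged
hidden = multiply_mask(flare_layer, 0.0)  # occluded: flare goes black
```

The point is that the occlusion decision never leaves the GPU, so there is no readback at all.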
It's an extra set of drawing and pasting instructions, but so far performance doesn't suffer. I just waste 2MB of VRAM for the fullscreen flare Canvas.
Thanks for your help.