Thank you all for your suggestions. I think I'll go with the built-in sorting method newt suggested. The layer and optimization tips will surely come in handy too. We are aiming for a minimalist graphics style, so animations are currently not a concern.
I ran through the possible logic in my head today (I haven't had time to try it in a prototype yet). However, I'm still not sure how I'll get the "walking behind buildings" part to work. Does the sorting take care of this as well? Or should I carefully assign the origin point of each object so that the sorting handles it? We're still not sure about the angle we're going to use for the images.
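To make the question concrete, here is a rough sketch of what I have in mind. It is plain Python rather than the engine's actual sorting call, and the `Sprite` class, the `origin_y` property, and `draw_order` are names I made up for illustration. It assumes the origin sits at the bottom edge of each image and that y grows downward, which may not match what the built-in method does.

```python
from dataclasses import dataclass


@dataclass
class Sprite:
    name: str
    x: float
    y: float       # top-left corner of the image; y grows downward
    height: float  # image height in pixels

    @property
    def origin_y(self) -> float:
        # Assumed origin: the bottom edge, where the object touches the ground.
        return self.y + self.height


def draw_order(sprites):
    # Objects whose origin is higher on screen are farther back, so draw them first.
    return sorted(sprites, key=lambda s: s.origin_y)


building = Sprite("building", x=100, y=40, height=160)      # base at y = 200
player_behind = Sprite("player", x=120, y=150, height=40)   # feet at y = 190
player_in_front = Sprite("player", x=120, y=180, height=40) # feet at y = 220

order_behind = [s.name for s in draw_order([building, player_behind])]
order_in_front = [s.name for s in draw_order([building, player_in_front])]
```

If this is right, the player is drawn before the building whenever its feet are above the building's base line, so the building covers it, and after the building otherwise.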
Regarding collisions, is it enough if I just set the collision polygon for the base square of a building?