Because video can use temporal compression and spatial compression.
First off, the PNG (of 2 MB) is not compressed that much, and it also contains a transparency (black-and-white) image, the alpha channel.
So the PNG is really 2 images in one file: the fill image and the transparency image.
To really compare, let's talk about JPEG. If you have 1 image, the compression done within that 1 image is called spatial compression.
But video is many images in a row. So if you store the first image with good spatial compression, you can compare the second image to that first image and store only the differences. This is temporal compression. If image 2 is the same as image 1, then only that info is stored: "the same".
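To make the "only store the differences" idea concrete, here is a minimal sketch in Python. It assumes a frame is just a flat list of pixel values, and the helper names diff_frame and apply_diff are made up for this illustration; real codecs work on blocks and motion vectors, not single pixels.

```python
def diff_frame(previous, current):
    """Return {pixel_index: new_value} for every pixel that changed."""
    return {
        i: new
        for i, (old, new) in enumerate(zip(previous, current))
        if old != new
    }


def apply_diff(previous, diff):
    """Rebuild the current frame from the previous frame plus the stored diff."""
    frame = list(previous)
    for i, value in diff.items():
        frame[i] = value
    return frame


# Hypothetical 8-pixel frames: a bright spot moves a little to the right.
frame1 = [10, 10, 10, 200, 200, 10, 10, 10]
frame2 = [10, 10, 10, 10, 200, 200, 10, 10]
frame3 = list(frame2)  # identical to frame2

delta_2 = diff_frame(frame1, frame2)  # only the changed pixels are stored
delta_3 = diff_frame(frame2, frame3)  # nothing changed: "the same"

print(delta_2)
print(delta_3)
```

An empty diff is the "the same" case: the decoder simply repeats the previous frame.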
So MP4 has keyframes with spatial compression, comparable to JPEG, followed by a number of frames that contain only the differences from that keyframe. This goes on until there is too much difference; then a new keyframe is written. That is the temporal compression.
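The keyframe scheme described above could be sketched roughly like this. It is a toy, not how an actual MP4 encoder is implemented: the function names, the ("key", ...) / ("delta", ...) tuples and the max_changed_fraction threshold are all assumptions made up for the example, and frames are again flat lists of pixel values.

```python
def encode(frames, max_changed_fraction=0.5):
    """Toy encoder: emit ("key", full_frame) or ("delta", {index: value})."""
    encoded = []
    keyframe = None
    for frame in frames:
        if keyframe is not None:
            # Differences relative to the last keyframe.
            delta = {
                i: v
                for i, (k, v) in enumerate(zip(keyframe, frame))
                if k != v
            }
            if len(delta) <= max_changed_fraction * len(frame):
                encoded.append(("delta", delta))
                continue
        # First frame, or too much changed: write a new keyframe.
        keyframe = list(frame)
        encoded.append(("key", keyframe))
    return encoded


def decode(encoded):
    """Rebuild all frames from keyframes and deltas."""
    frames = []
    keyframe = None
    for kind, payload in encoded:
        if kind == "key":
            keyframe = payload
            frames.append(list(keyframe))
        else:
            frame = list(keyframe)
            for i, v in payload.items():
                frame[i] = v
            frames.append(frame)
    return frames


# Hypothetical 4-pixel video with a scene cut at the fourth frame.
video = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],  # same as the keyframe
    [0, 9, 0, 0],  # one pixel differs
    [7, 7, 7, 7],  # everything differs: forces a new keyframe
    [7, 7, 7, 8],
]
stream = encode(video)
kinds = [kind for kind, _ in stream]
print(kinds)
```

Only the two keyframes carry a full image here; the other three frames carry at most one changed pixel each, which is where the size saving comes from.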
Of course, you cannot do any temporal compression with only 1 frame (image).