Thanks Steven.
Perhaps the normal image does some undersampling, specifically of the colour information. Maybe the pixel shift just corrects that, but there seems to be no oversampling.
Would there be any advantage in moving the pixels by 1/2 instead of 1? This would mean that you would need to process 8 images instead of 4 (or would that be 9 images?).
"Oversampling" simply means to take more samples and output them at a lower resolution. You could say that the normal image is under-sampled in that it doesn't have enough accuracy. But really the pixel shift is oversampling, as it uses ~170MP of recorded data (4 x 42.4MP) to produce a 42.4MP output.
This is the same thing as taking a high resolution image and displaying it at half size. That's effectively 2x oversampling and it causes a reduction of image noise and an increase in apparent sharpness/DR.
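Here's a quick sketch of that noise-reduction effect (a toy example, not sensor data: just a flat grey frame with made-up Gaussian noise, downsampled by averaging 2x2 blocks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical flat grey "image" with Gaussian noise (sigma = 0.05).
full = 0.5 + rng.normal(0, 0.05, size=(1000, 1000))

# Display at half size: average each 2x2 block into one output pixel.
half = full.reshape(500, 2, 500, 2).mean(axis=(1, 3))

print(full.std())  # ~0.050
print(half.std())  # ~0.025 -- averaging 4 samples halves the noise (sqrt of 4)
```

Each output pixel is the mean of 4 independent samples, so the noise standard deviation drops by a factor of 2, which is the cleaner look you see when a high-res file is viewed at half size.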
Half-stepping the shift would simulate a pixel of 1/2 dimensions (1/4 area) and could theoretically provide benefits. But only if there is something to differentiate w/in that step size... highly unlikely IMO. In order to get the maximum your sensor is already capable of, you need to be at an aperture no smaller than f/5 with a lens that is diffraction limited at f/5 (i.e. doesn't get sharper when stopped down farther). The reality is that everything you are doing (pixel shift/focus stack/etc) is in an attempt to actually output 42MP worth of detail, which would be amazing.
The reality is that due to technique/lens/lighting/subject/etc we are often struggling to record even 12MP worth of actual detail (which would be a sharp/detailed image). And it is important to understand that this has nothing to do with image pixel dimensions; what matters is the amount of detail w/in an output image's physical size (i.e. A4) and the physical size it was originally recorded at (sensor size)...
This is how it usually works in reality; I'm going to use a real-life scenario based on my use of 35mm (FF) sensors.
I have the best lens I can afford, and due to the subject requirements (higher SS, smaller apertures, handheld or unlocked tripod, lower light) I struggle to output 12MP worth of actual resolution using the D3 (12MP sensor w/ AA filter; getting the full 12MP simply isn't possible)... Maybe I'm only getting about 6-8MP output resolution (marginal "OK" images).
So I buy the D4 (16MP w/ AA filter). This does not jump me up to 16MP of resolution. Instead what I am doing is using a 4MP oversample compared to the D3, and I get maybe 10-12MP of actual detail (sharp images).
Then I upgrade to the D5 (21MP w/ AA filter) and now I'm getting 12-14MP... really detailed images (on a good day).
Or I use my D810 instead (36MP w/o AA filter) and now, if I'm really lucky, I get maybe 20MP of detail actually recorded... Amazingly detailed/sharp; such an image has so much detail that it can't all be seen unless it is viewed magnified (enlarged).
What I have actually done throughout this process is increase the sensor resolution in order to "oversample" the 35mm area. The reason I am calling it oversampling is because the limit to recorded resolution is not primarily due to the sensor, it is elsewhere. And because the limit is elsewhere, oversampling never gives me a 1:1 increase in actual recorded detail. So what has occurred is that I have managed to incrementally increase the recorded detail that exists in the 35mm sensor area; and what now matters is how much I am going to enlarge it (i.e. to 297mm/A4 print), and how close I will view it (normal or enlarged/magnified).
This last factor is why larger formats/sensors always generate better results... they need less enlargement for a given output size.
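To put numbers on the enlargement point, here's a rough calculation (my own illustration, assuming a 3:2 image printed borderless on A4's 297mm long edge) of the print resolution you get from different amounts of *actual* recorded detail:

```python
import math

A4_LONG_EDGE_IN = 297 / 25.4  # ~11.69 inches

def ppi_at_a4(effective_mp: float, aspect=(3, 2)) -> float:
    """PPI on the long edge when a 3:2 image is printed on A4."""
    w, h = aspect
    long_edge_px = math.sqrt(effective_mp * 1e6 * w / h)
    return long_edge_px / A4_LONG_EDGE_IN

for mp in (6, 12, 20, 42):
    print(f"{mp:>2} MP effective -> ~{ppi_at_a4(mp):.0f} PPI at A4")
```

Even ~12MP of genuine detail lands above 300 PPI at A4, which is why real recorded detail and enlargement ratio matter far more than the nominal pixel count on the spec sheet.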