In my opinion, the problem with going to such high pixel densities is not a lens issue but a technique issue. Smaller and smaller pixels require ever higher shutter speeds and less vibration on a tripod to fully extract all the detail. Movement and vibration become the killers of sharpness here, and I'm not sure there is a tripod stable enough to keep a semi-long lens from exhibiting shake unless your shutter speed is 1/(3x the focal length) of a second or faster. I'm talking about getting pixel-level sharpness (otherwise why bother with more pixels?).
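To put numbers on that rule of thumb, here is a minimal sketch. The function name and the default factor of 3 are just illustrative (the factor comes from the post above), and it ignores crop factor and stabilization entirely.

```python
def slowest_safe_exposure_s(focal_length_mm, safety_factor=3.0):
    """Longest exposure time (seconds) allowed by the rule of thumb
    t <= 1 / (safety_factor * focal length in mm).

    safety_factor=1 is the classic reciprocal rule; 3 is the stricter
    figure suggested in the post above. Both are rough guides, not physics.
    """
    if focal_length_mm <= 0 or safety_factor <= 0:
        raise ValueError("focal length and safety factor must be positive")
    return 1.0 / (safety_factor * focal_length_mm)


if __name__ == "__main__":
    for focal in (50, 200, 400):
        t = slowest_safe_exposure_s(focal)
        print(f"{focal} mm: 1/{round(1 / t)} s or faster")
```

So a 200 mm lens would call for roughly 1/600 s under this rule, against 1/200 s under the plain reciprocal rule.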
Good point. That said, more pixels still make your unsharp picture sharper than larger pixels would. (With multiple-exposure integration sensors and heavy processing, that problem will also largely disappear, as they can follow and compensate for movement throughout the exposure time.)
As for pixel-level sharpness: most people take pictures, not pixels. (Would you expect every grain in a film negative to be clearly defined against the next grain?) Moreover, as we get more pixels we don't actually want pixel-level sharpness. It is fairly useful to have so many pixels that we're well into diffraction-limited territory, where the fuzziness caused by diffraction effectively replaces the function of the anti-aliasing filter.
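For a rough feel of when diffraction blur starts to cover more than one pixel, here is a small sketch. It uses the standard Airy disk diameter (to the first dark ring) of about 2.44 x wavelength x f-number. The 550 nm green wavelength, the 36 mm sensor width and the 8000-pixel row are assumptions picked purely for illustration, and the function names are made up.

```python
def airy_disk_diameter_um(f_number, wavelength_nm=550.0):
    """Airy disk diameter to the first minimum, in micrometres.

    d = 2.44 * wavelength * f_number; 550 nm (green) is an assumed default.
    """
    return 2.44 * wavelength_nm * f_number / 1000.0


def pixel_pitch_um(sensor_width_mm, pixels_across):
    """Centre-to-centre pixel spacing in micrometres."""
    return sensor_width_mm * 1000.0 / pixels_across


if __name__ == "__main__":
    # Hypothetical sensor: 36 mm wide, 8000 pixels across (4.5 um pitch).
    pitch = pixel_pitch_um(36.0, 8000)
    for n in (2.8, 5.6, 8, 11, 16):
        disk = airy_disk_diameter_um(n)
        print(f"f/{n}: Airy disk {disk:.1f} um, "
              f"about {disk / pitch:.1f} pixels wide")
```

Once the disk is a couple of pixels wide, the optics are already smoothing out detail finer than the pixel grid, which is the sense in which diffraction can stand in for an anti-aliasing filter. Where exactly that threshold sits is a judgment call, not something this sketch settles.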