I think there is some confusion regarding FF sensors and the statement that they capture 2,56 times more light. (1,6 x 1,6)
That holds only for the same exposure; it is a conditional statement.
-Yes FF sensors are 2,56 times bigger (area)
-exposure is always the same when settings are the same. (And the sensors are equally sensitive to light)
Exposure has nothing at all to do with a sensor's quantum efficiency (QE) or its photosite capacity. Exposure is about what the lens projects onto the sensor, before that light passes through the sensor's surface.
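To make the conditional explicit, here is a minimal sketch of the arithmetic, assuming exposure means light per unit of sensor area and that both cameras use the same f-number and shutter speed on the same scene. The function names and the arbitrary units are my own, purely for illustration.

```python
# Hypothetical helper names; exposure is in arbitrary units of light per unit area.
CROP_FACTOR = 1.6  # linear crop factor quoted above


def area_ratio(crop_factor):
    """Full-frame sensor area divided by crop-sensor area."""
    return crop_factor ** 2


def total_light(exposure, area):
    """Total light on the sensor: light per unit area times area."""
    return exposure * area


crop_area = 1.0                                  # arbitrary unit
ff_area = crop_area * area_ratio(CROP_FACTOR)    # 1.6 x 1.6 times larger
exposure = 1.0                                   # identical on both cameras

ratio = total_light(exposure, ff_area) / total_light(exposure, crop_area)
print(round(ratio, 2))
```

The ratio only comes out as 2,56 because `exposure` is held equal for both sensors; change the exposure on one side and the ratio changes with it.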
-It's all about pixel size. If you have a crop sensor and a FF sensor with the same individual pixel size, they would be equally good. (If made the same way)
Not really. Pixel-level noise should be similar, but in practice we have mostly seen higher pixel-level noise on FF cameras than on APS-C cameras with the same pixel size. For example, the 7D2 is cleaner than the 5DSR, and the Nikon D500 is cleaner than the D850.
The problem is that crop sensors are always packed tighter compared to FF sensors when regarding individual pixel size. This is why FF sensors generally perform better.
"Per unit of sensor area", there is no generic noise benefit to larger pixels, whatsoever, other than with very low f-numbers with front-side-illuminated sensors. In fact, they are more vulnerable to downstream noises because downstream noises are independent of sensor charge and any first-stage amplification, and one bad noise impulse (or hot pixel) ruins a larger sensor area, with larger pixels.
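A rough way to see the "per unit of sensor area" point is the sketch below. It assumes photon shot noise only (Poisson statistics) and deliberately leaves out read noise and the other downstream noises, so it illustrates just the first half of the argument. The photon counts are made-up numbers.

```python
import math


def patch_snr(photons_per_pixel, pixels_in_patch):
    """Shot-noise-only SNR of a sensor patch.

    The sum of independent Poisson counts is Poisson, so the patch
    has mean = total photons and standard deviation = sqrt(total).
    """
    total = photons_per_pixel * pixels_in_patch
    return total / math.sqrt(total)


# Same patch of sensor area, same exposure (made-up photon counts):
one_big_pixel = patch_snr(photons_per_pixel=4000, pixels_in_patch=1)
four_small_pixels = patch_snr(photons_per_pixel=1000, pixels_in_patch=4)

print(one_big_pixel, four_small_pixels)
```

Both patches collect the same total number of photons, so under this assumption they have the same SNR regardless of how the area is divided into pixels.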
Almost all of the differences and wildcards that we still see when normalizing noise per unit of sensor area are due to readout technology and pixel clocks, and have nothing to do with photon efficiency, which really does not vary much at all between sensors of the same general time period, despite widely varied pixel densities and sensor sizes.
The underlying rules implied by the facts on the ground of existing sensors are that larger sensors especially are difficult to read out at a fast pixel clock without introducing extra noise, and that larger sensors' noise tends to be more spatially correlated, and therefore easier to see, than smaller sensors' noise.
Larger sensors have better noise potential at a given exposure level and ISO only because their larger area captures more total light. To benefit from this in real-world practice requires either full exposure at base ISO with no shutter speed challenge, or, when shutter speed needs prevent base ISO, a lens with a larger pupil than what you would use with a smaller sensor.
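The pupil requirement can be sketched numerically, assuming the same angle of view on both formats (so the FF focal length is the crop focal length times 1,6) and using the usual relation pupil diameter = focal length / f-number. The 50 mm f/2 lens is just an example I picked.

```python
import math

CROP_FACTOR = 1.6


def pupil_diameter_mm(focal_length_mm, f_number):
    """Entrance pupil diameter from focal length and f-number."""
    return focal_length_mm / f_number


def pupil_area_mm2(diameter_mm):
    """Area of a circular pupil of the given diameter."""
    return math.pi * (diameter_mm / 2) ** 2


# Example crop-sensor lens (an assumption for illustration): 50 mm at f/2
crop_pupil = pupil_diameter_mm(50.0, 2.0)

# FF lens with the same angle of view and the same f-number: larger pupil
ff_same_f_number = pupil_diameter_mm(50.0 * CROP_FACTOR, 2.0)

# FF lens stopped down by the crop factor: same pupil as the crop lens
ff_same_pupil = pupil_diameter_mm(50.0 * CROP_FACTOR, 2.0 * CROP_FACTOR)

light_gain = pupil_area_mm2(ff_same_f_number) / pupil_area_mm2(crop_pupil)
print(crop_pupil, ff_same_f_number, ff_same_pupil, round(light_gain, 2))
```

At the same shutter speed, the FF camera only collects more total light from the scene in the first case, where the pupil really is larger. With the same pupil it collects about the same total light as the crop camera.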
Here are four different sensors with different sensor sizes and pixel densities, all receiving approximately the same number of photons in the exposure, in each of the four windows:
Studio Comparison Tool