If I understand correctly, each color pixel is made of three photosites: one sensitive to light in the red part of the spectrum, another to green, and the last one to blue.
If you make a sensor from photosites that are all sensitive to the same light (so that 1 photosite equals 1 pixel), my instinct tells me that either the pixels will be 200% bigger or you can squeeze 200% more pixels into the same sensor area.
However, I have heard that there is only a slight difference between monochrome sensors and color ones when it comes to noise or resolution. Why is that?
Just because someone does a comparison does not mean that it is meaningful.
For example, setting two systems to the same manual ISO in Av-priority mode is a TOTALLY BOGUS comparison.
There are really only two things that interest me when comparing monochrome output from a color sensor to monochrome output from a monochrome sensor.
One is ETTR (exposing to the right) at base ISO, and seeing how far down the shadows are usable. The other is how they compare for noise way above base ISO, with the exact same exposure, meaning the same Av and Tv values. That is all that matters for "high ISO low-light situations"; comparing the two systems in Av mode at the same manual ISO has absolutely nothing to do with anything practical, whatsoever, because the metering will nearly normalize, and so nearly mask, any differences in QE. No environment dictates the practice of using a specific ISO setting.
Most comparisons that people do are too diluted to give any clarity about potential differences. Comparing noise at base ISO, for example, tells you very little about the difference or ratio of SNRs, because the noise is mostly subliminal and lost in processing. The color of the subject matter and of the light source also makes all the difference in the world between color and monochrome sensors. You will see very little difference, metered for the same ISO, when the light or subject matter is a pinkish magenta, because all the photowells fill almost equally, regardless of color channel. In daylight, with white highlights, the red channel falls about a stop behind the green channel in filling the wells, and the blue channel about 1/2 stop behind; still not a huge difference at lower ISOs. What if the light is a red LED, though? Now the difference in QE is huge, about 12:1 or more.
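To put rough numbers on those figures, here is a minimal Python sketch that converts "stops behind" into relative well fill, and the 12:1 red-LED ratio into stops. The function name `relative_fill` is just an illustration, and the inputs are the approximate values quoted above, not measurements from any particular sensor.

```python
import math

def relative_fill(stops_behind: float) -> float:
    """Signal relative to the reference channel for a channel lagging by N stops."""
    # Each stop is a factor of two in collected light.
    return 2.0 ** (-stops_behind)

# Approximate daylight figures from the post: red ~1 stop and
# blue ~1/2 stop behind green (green is the reference channel).
daylight_stops = {"green": 0.0, "red": 1.0, "blue": 0.5}
fill = {channel: relative_fill(s) for channel, s in daylight_stops.items()}

# The "about 12:1" red-LED ratio from the post, expressed in stops.
red_led_stops = math.log2(12.0)

for channel, value in fill.items():
    print(f"{channel}: {value:.2f} of the green-channel fill")
print(f"12:1 ratio is about {red_led_stops:.1f} stops")
```

Under these assumptions, red wells reach about half, and blue wells about 0.71, of the green-channel fill in daylight, while a 12:1 ratio corresponds to roughly 3.6 stops.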
There are no simple answers here. You have to understand a lot of details to estimate how and why the photowells will fill with various wavelengths, and how metering works.
As to resolution, the red LED is a good example of an extreme. With a color sensor, almost nothing will be recorded in 3/4 of the photowells, and you get low resolution with potentially egregious aliasing. With the monochrome sensor, the photowells all fill nearly equally to their neighbors, and fill much faster.
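The 3/4 figure follows from the layout of the color filter array. A minimal Python sketch, assuming an RGGB Bayer mosaic and idealized filters that pass no red light through the green and blue sites (real filters leak somewhat, so this is the limiting case); `bayer_color` and `lit_fraction` are hypothetical helper names used only for illustration:

```python
def bayer_color(row: int, col: int) -> str:
    """Filter color at (row, col) in an assumed RGGB Bayer mosaic."""
    if row % 2 == 0:
        return "R" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "B"

def lit_fraction(height: int, width: int, responsive=("R",)) -> float:
    """Fraction of photosites whose filter passes the light source."""
    lit = sum(
        1
        for r in range(height)
        for c in range(width)
        if bayer_color(r, c) in responsive
    )
    return lit / (height * width)

# Pure red light on an 8x8 patch of a color sensor: only the R sites respond.
color_fraction = lit_fraction(8, 8)

# A monochrome sensor has no color filters, so every site responds.
mono_fraction = 1.0

print(f"color sensor: {color_fraction:.2f} of photosites record the red LED")
print(f"monochrome sensor: {mono_fraction:.2f}")
```

With only one photosite in four recording signal, the color sensor samples the scene on a grid with twice the pitch in each direction, which is where the lost resolution and the aliasing come from.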