Why not RYGB, especially given how much yellow comes up (skin tones and browns)? I know Sony tried RGBE (E being an emerald/teal-ish filter) a while back, but that seems like the wrong direction.
Is there an advantage to using the same green filter twice? Or, more to the point, I suppose, is there no advantage to not using the same green filter twice?
Or have I missed the plot altogether? Are the two green filters not actually the same, just closer together in response than yellow and green would be?
It's mostly a question of unknowns...
More accurately, a question of how many unknowns you can have in a system before the solution space becomes unstable. When you have two identical greens per four pixels, your number of unknowns is halved for most intents and purposes - more than halved in many cases.
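A hypothetical 1D sketch of why the doubled green matters (this is my illustration, not anything from a real demosaicer): treat green as the densely sampled "shared" channel. In RGGB, green is known at every other pixel; in a CFA with four distinct filters, each channel is known only once per four pixels. Reconstructing a detail signal by simple linear interpolation from those two sampling densities shows how quickly the sparser grid starts guessing:

```python
import numpy as np

# Illustrative only: a detail signal sampled every 2nd pixel (like green
# in RGGB) vs every 4th pixel (like each channel in a 4-distinct-filter
# CFA), reconstructed by linear interpolation. The period-7 component
# sits above the Nyquist limit of the 1-in-4 grid, so it aliases.
n = 256
x = np.arange(n)
signal = np.sin(2 * np.pi * x / 16) + 0.3 * np.sin(2 * np.pi * x / 7)

def reconstruct(step):
    known = x[::step]
    return np.interp(x, known, signal[known])

err2 = np.abs(reconstruct(2) - signal).mean()  # RGGB-green-like density
err4 = np.abs(reconstruct(4) - signal).mean()  # unique-filter density
print(f"mean error, 1-in-2 sampling: {err2:.3f}")
print(f"mean error, 1-in-4 sampling: {err4:.3f}")
```

The denser shared channel leaves far fewer unknowns per pixel, which is exactly the stability margin the duplicated green buys.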
You have to include the ultimate destabilizer here, noise.
Consider that you have four colours in a 2x2 repeating grid. If the optical resolution vs the digitizing resolution is kept low, so that the best the lens can do is to get you a Rayleigh distance of about 2 p-p (pixel-pitch) distances, most scenarios can be said to contain stable solutions - solutions where you can compute the luminance part AND the chrominance part to a reasonable accuracy for each pixel.
Add in some noise, and that stable solution falls apart. The noise destabilizes the system so that you have to choose whether to treat deviations in the estimation result as "luminance errors" or as "chrominance errors", and then compensate accordingly.
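You can see this luminance/chrominance confusion with a toy NumPy experiment (my own sketch, using a deliberately naive per-channel bilinear demosaic - real converters are smarter, but face the same ambiguity): photograph a perfectly neutral gray patch through an RGGB mosaic with a little sensor noise, and some of that noise comes out as false colour, because nothing in the data says whether a deviation was luminance or chrominance.

```python
import numpy as np

# Illustrative sketch: a neutral gray scene (zero chrominance) sampled
# through an RGGB mosaic with Gaussian sensor noise, then demosaiced by
# naive per-channel bilinear interpolation. The reconstruction shows
# colour speckle even though the scene had none.
rng = np.random.default_rng(1)
h = w = 64
raw = 0.5 + rng.normal(0, 0.02, (h, w))  # gray + noise, one sample/pixel

def bilinear_channel(raw, mask):
    """Fill a sparsely sampled channel by normalized 3x3 weighted averaging."""
    k = np.array([[0.25, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 0.25]])
    num = np.zeros_like(raw)
    den = np.zeros_like(raw)
    vals = raw * mask
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            wgt = k[dy + 1, dx + 1]
            num += wgt * np.roll(np.roll(vals, dy, 0), dx, 1)
            den += wgt * np.roll(np.roll(mask, dy, 0), dx, 1)
    return num / den

yy, xx = np.mgrid[0:h, 0:w]
r_mask = ((yy % 2 == 0) & (xx % 2 == 0)).astype(float)
g_mask = ((yy + xx) % 2 == 1).astype(float)
b_mask = ((yy % 2 == 1) & (xx % 2 == 1)).astype(float)

R = bilinear_channel(raw, r_mask)
G = bilinear_channel(raw, g_mask)
B = bilinear_channel(raw, b_mask)

chroma_noise = np.std(R - G)  # would be 0 for a truly neutral result
print(f"false-colour (R-G) std: {chroma_noise:.4f}")
```

The nonzero R-G spread is pure noise being rendered as chrominance - the converter had to put the deviation somewhere.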
Now add in "better optical sharpness"... Which is the condition most consumer cameras - except the smallest-sensor, high-res smartphone modules - work under today. Better sharpness than 2 p-p distances means that there are NO stable solutions for the chrominance / luminance parts. Even without noise... Even PERFECT, noiseless data will require the interpolation algorithm to make very unstable guesses about what the conditions causing the given data were, since there will be many (countless, up to the value resolution limit...) solutions to the problem. This is why raw converters differ from each other in how they render small detail. There IS no correct solution, given that the raw converter can only know what the data tells it - it has no knowledge of what really was in front of the lens.
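The "countless solutions" point is easy to make concrete with a small construction of my own: two completely different scenes - one flat gray, one with loud pixel-level colour texture - that produce bit-identical RGGB raw data. Since each pixel records only one channel, any scene that agrees on the *sampled* channel at every pixel is indistinguishable from the data alone:

```python
import numpy as np

# Illustrative sketch: two different scenes, identical RGGB raw data.
# No demosaicing algorithm can tell them apart from the data.
h = w = 4
yy, xx = np.mgrid[0:h, 0:w]
r_site = (yy % 2 == 0) & (xx % 2 == 0)
g_site = (yy + xx) % 2 == 1
b_site = (yy % 2 == 1) & (xx % 2 == 1)

def mosaic(rgb):
    """Sample one channel per pixel, RGGB layout."""
    raw = np.where(r_site, rgb[..., 0], 0.0)
    raw = np.where(g_site, rgb[..., 1], raw)
    raw = np.where(b_site, rgb[..., 2], raw)
    return raw

scene_a = np.full((h, w, 3), 0.5)             # flat neutral gray

scene_b = np.empty((h, w, 3))                 # wild colour texture...
scene_b[..., 0] = np.where(r_site, 0.5, 0.9)  # ...except each pixel's
scene_b[..., 1] = np.where(g_site, 0.5, 0.1)  # *sampled* channel still
scene_b[..., 2] = np.where(b_site, 0.5, 0.8)  # reads exactly 0.5

identical = np.array_equal(mosaic(scene_a), mosaic(scene_b))
print("raw data identical:", identical)                          # True
print("scenes identical:  ", np.array_equal(scene_a, scene_b))   # False
```

Below the optical resolution limit, blur rules such scenes out; at pixel-level sharpness it cannot, and the converter is left guessing.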
Now take the two worst cases together (oversharp images AND noise) - and you get total chaos. At that point, you have no stable solutions no matter what you do. It's all guesswork, and you've effectively lowered the resolution of the image by a factor of 2.