But the code can be reduced in size as well as optimized for speed. As such, you may get 10-20 FPS more just by switching on a bunch of CPU-specific compiler switches (the GPU doesn't count here).
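To make the "compiler switches" part a bit more concrete, here is a rough sketch in C, assuming GCC or Clang on an x86 target. The function names enabled_simd and sum_squares are made up for illustration. The first reports which SIMD extensions the compiler was allowed to assume at build time (via the macros those compilers predefine), and the second is the kind of plain loop the optimizer may vectorize differently depending on those switches.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append a label to buf without overflowing it (helper for this sketch). */
static void append_label(char *buf, size_t size, const char *label)
{
    strncat(buf, label, size - strlen(buf) - 1);
}

/* Fill buf with the SIMD extensions the compiler was told it may use.
   The __SSE2__, __SSE4_2__, __AVX__ and __AVX2__ macros are predefined by
   GCC and Clang depending on switches such as -march=... or -mavx2.
   On other compilers or targets none of them may be defined. */
const char *enabled_simd(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    buf[0] = '\0';
#ifdef __SSE2__
    append_label(buf, size, "SSE2 ");
#endif
#ifdef __SSE4_2__
    append_label(buf, size, "SSE4.2 ");
#endif
#ifdef __AVX__
    append_label(buf, size, "AVX ");
#endif
#ifdef __AVX2__
    append_label(buf, size, "AVX2 ");
#endif
    if (buf[0] == '\0')
        append_label(buf, size, "none");
    return buf;
}

/* A plain loop with no CPU-specific code in it. Whether it gets
   vectorized, and how, is up to the optimizer and the switches used. */
float sum_squares(const float *v, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += v[i] * v[i];
    return total;
}
```

Building the same file with something like gcc -O2 and then gcc -O2 -march=native should change what enabled_simd reports on most recent CPUs, without touching the source. Whether that turns into extra FPS depends entirely on the workload.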
For instance, I tried to run Crysis 2 on a P4 machine, and by the time multiplayer loaded, the game was over. Had it been compiled with the P4 architecture in mind, there could (and I emphasize 'could') have been a significant performance increase. At the same time, that particular build might have performed badly on an AMD chip, and it would still be limited by the GPU.
This is why these practices aren't used in commercial applications: you'd need to distribute separate binaries for each architecture, which is costly and bandwidth-consuming (and time-consuming for the developer).
Oh yeah, almost forgot. Code with no changes that is compiled for x86-64 could, in the best case, be up to twice as fast as its x86 counterpart (x86-64 gives the compiler twice as many general-purpose registers, for one). On top of that, the maximum memory it can allocate goes far beyond what 32-bit addressing offers. Again, that's just the CPU side.
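For the memory side of that, a small sketch in C of the arithmetic. The names pointer_bits and address_space_gib are made up for this example, and the numbers are the theoretical size of a flat address space, not what any OS or CPU actually lets a process allocate.

```c
#include <assert.h>
#include <limits.h>

/* Width of a pointer, in bits, for the target this file was built for.
   Typically 32 for an x86 build and 64 for an x86-64 build. */
unsigned pointer_bits(void)
{
    return (unsigned)(sizeof(void *) * CHAR_BIT);
}

/* Theoretical size of a flat address space with the given pointer
   width, in GiB: 2^bits bytes divided by 2^30 bytes per GiB. */
double address_space_gib(unsigned bits)
{
    assert(bits >= 30 && bits <= 64);
    double gib = 1.0;
    for (unsigned i = 30; i < bits; i++)
        gib *= 2.0;
    return gib;
}
```

With 32-bit pointers this works out to 4 GiB in total, while 64-bit pointers give 2^34 GiB on paper. With GCC on a machine that has both sets of libraries installed, building once with -m32 and once with -m64 should make pointer_bits return 32 and 64 respectively.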
Just my 2 perhaps pointless cents.