The worldspace is stream loaded, so the "load screens" are hidden in the background as you move. That can't be done completely in the city, since they don't know which building you are going into.
Sure they do. There aren't many secret entrances, are there?
You only need to start loading an interior once the player is within a certain range of its entrance. Think of a large semicircle on the ground outside your door. Divide that semicircle into concentric strips, and as you approach the door, each strip you step on triggers more detail to be loaded into memory. There are also probably objects common to all buildings that could stay in a small cache. Combine that with culling, which they first introduced in a basic form in FO3, and it's not like the GPU is going to be rendering all the objects in all the houses at once. You don't need to stream the contents of the house down the road if you're standing right next to the door of another house.
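To make the semicircle idea concrete, here's a rough sketch in Python. Everything in it is my own assumption for illustration: the band radii, the detail levels, and the `detail_level` function are made up, and this is not how Bethesda's engine actually does it.

```python
import math

# Hypothetical band radii (in game units) and detail levels, purely illustrative.
BANDS = [(30.0, 1), (20.0, 2), (10.0, 3)]  # (outer radius, detail level)


def detail_level(player, door, facing):
    """Return the interior detail level to have loaded for a player position.

    player, door: (x, y) ground positions.
    facing: unit vector pointing outward from the door.
    A result of 0 means nothing needs to be loaded yet.
    """
    dx, dy = player[0] - door[0], player[1] - door[1]
    # Only the half-plane in front of the door counts (the semicircle).
    if dx * facing[0] + dy * facing[1] < 0:
        return 0
    dist = math.hypot(dx, dy)
    level = 0
    # The closer the player, the more bands they are inside of.
    for radius, band_level in BANDS:
        if dist <= radius:
            level = max(level, band_level)
    return level
```

With a door at the origin facing up the y axis, a player 25 units out would get level 1, 15 units out level 2, and 5 units out level 3. Anyone standing behind the door's wall or further than 30 units away triggers nothing.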
The problem, I think, is keeping the exterior entirely in cache while you remain indoors. Or they could do the opposite of what I described above: as you get further inside the house, start unloading the city from memory. I know that when I play, though, I usually don't even *get* a load screen after leaving an interior.
I imagine, though, that the above is hard to accomplish with the memory constraints on consoles.
With that said, I'm fine with interior load screens, and as you said, this is probably a good explanation as to why they don't bother:
More importantly, the buildings in the cities in Oblivion are compressed in size. The interiors are almost always larger than the exteriors, and sometimes don't even match the shape of the outside. Making the buildings big enough to contain their own insides would require making the cities take up more landscape and leave less room for wilderness. It would also require the interior and exterior designers to work much more closely together instead of creating their sections separately based on concept designs.
Edit:
It's like in Oblivion: I imagine that when you were riding a horse in a certain direction, they weren't focusing on loading the cells behind you. They were probably predictively loading the content ahead of you, so it could fade in as naturally as possible. Of course, that engine did it poorly enough that there was always stutter and pop-in.
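Here's roughly what I mean by predictive loading, as a small Python sketch. The grid layout, the `cells_to_prefetch` name, and the `radius` parameter are all hypothetical; I'm only guessing at the general approach, not describing what the Oblivion engine really does.

```python
def cells_to_prefetch(cell, heading, radius=2):
    """Return nearby grid cells that lie ahead of the player, best first.

    cell: (col, row) of the cell the player is currently in.
    heading: (dx, dy) direction of travel.
    radius: how many cells out to consider (an assumed tuning value).
    """
    cx, cy = cell
    ahead = []
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            if (dx, dy) == (0, 0):
                continue
            # A positive dot product means the cell is in front of the player.
            score = dx * heading[0] + dy * heading[1]
            if score > 0:
                ahead.append((score, (cx + dx, cy + dy)))
    # Queue the cells most directly ahead first; cells behind are never queued.
    ahead.sort(key=lambda item: -item[0])
    return [c for _, c in ahead]
```

Riding due east from cell (0, 0) with `radius=1`, this would queue only the three cells in the column to the east and skip everything beside or behind you.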