Strange Failure Modes

When things aren’t working, I can’t help but think about what has actually gone wrong to cause the behaviour I’m observing. Here are a couple of failure modes whose root causes still perplex me.

Hot Water

A couple of years ago, I was staying at a hotel in Scotland, in a small town north of Glasgow. It was late fall, which means it was relatively cold: highs were maybe 15C/60F and it got down near freezing at night. It would have been nice to come back to the hotel in the evening and take a nice, warm shower, except there was one problem: the hotel room had no cold water! This meant that every shower was incredibly, unbearably scalding hot. First-degree burns aren’t particularly relaxing.

I spent the entire trip trying to figure out how that could have happened. Not having hot water, I could have understood. Old boilers, lack of insulation on pipes in the cold weather, underprovisioned supply in a building which had clearly been expanded several times. But none of that applies to cold water! Cold water is the default if you simply don’t do anything — especially in Scotland in the fall and winter. This place was obviously doing something, and doing it very wrong.

To make matters more confusing, some friends staying in a different wing of the hotel had no issues, so it wasn’t systemic to the whole complex. The only explanations I could come up with involved seriously messed up plumbing — maybe my room, or my entire wing, connected both the “hot” and “cold” pipes back to the “hot” pipes in the building’s main supply?

The front desk managers just shrugged and mumbled something about too many people taking a shower at once when I asked, which was helpful for neither resolving nor explaining the situation. It was clearly a complaint they were used to getting, and they clearly didn’t think much about how weird it was compared to the “no hot water” complaint you see at other, less interesting, hotels!

The Lifts

Another strange “how does that happen” failure involves the displays inside the lifts at my office.

The lifts’ destinations are set via a control panel in the hall, outside the lift carriage, so there are no floor buttons inside the lifts themselves. Instead, there is a small computer display embedded in the wall of the carriage which lists each floor and highlights the ones the lift will be stopping at.

The lift has 8 possible destinations, and so the display in the lift has two rows of four indicators, numbered -1 through 6, for each of the building’s floors from the basement up to the top.

-1 0 1 2
 3 4 5 6

That is, if it’s working correctly. Which is pretty rare. Sometimes the indicator lists floors 00 through 08, which I guess makes sense as a “default fallback” or something. But often, it lists completely random numbers for the floors, something like:

5 G 3 1
0 4 Y 6

My counting can be pretty bad sometimes, but I’m at least pretty sure our building does not have floor 3 directly below floor 1, nor a floor numbered “Y” anywhere at all. Actually, I’m not sure there’s a building anywhere with a floor numbered “Y”.

It’s consistently wrong enough that those of us who work in the office have developed a game where you post photos of the display and score points based on the difference between the numbers currently shown and what they should say.

As an office of software engineers, we can’t help but think about what causes this; when we first moved into the building, it was a frequent topic of conversation. I’ve never heard of anyone seeing the numbers actually change (though clearly they do from time to time), so I suppose it must be some sort of periodic configuration sent from the master controller which is getting corrupted? The best conjecture I’ve heard is that the configuration is sent via a very dumb protocol with no error detection or correction, over an unshielded wire which passes too close to some powerful bit of lift machinery. Interference from that machinery frequently scrambles the message in transit, and the display, having no way to tell that anything is wrong, dutifully spits the garbage back out. I guess that mostly makes sense, though it’s still such a bizarre failure.
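To make that conjecture concrete, here is a toy simulation. The real protocol is unknown, so everything in it is an assumption: the comma-separated ASCII framing, the bit-flip noise model, and all of the function names are made up for illustration. It compares a display that trusts whatever arrives with one that checks a CRC-32 (a technique swapped in here, not anything the manufacturer is known to use) and keeps the last good labels when the check fails. Note that a CRC only detects corruption; it does not correct it.

```python
import random
import zlib

# The eight floor labels described in the post.
LABELS = ["-1", "0", "1", "2", "3", "4", "5", "6"]


def encode(labels):
    """Hypothetical wire format: comma-separated ASCII labels."""
    return ",".join(labels).encode("ascii")


def add_crc(payload):
    """Append a 4-byte CRC-32 so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_crc(frame):
    """Return the payload if its CRC-32 matches, otherwise None."""
    payload, crc = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") == crc:
        return payload
    return None


def noisy_wire(frame, flip_probability, rng):
    """Assumed noise model: flip each bit independently."""
    out = bytearray(frame)
    for i in range(len(out)):
        for bit in range(8):
            if rng.random() < flip_probability:
                out[i] ^= 1 << bit
    return bytes(out)


def naive_display(frame):
    """No checking at all: render whatever bytes arrived."""
    return frame.decode("ascii", errors="replace").split(",")


def checked_display(frame, last_good):
    """Render the frame only if the CRC matches; else keep the old labels."""
    payload = check_crc(frame)
    if payload is None:
        return last_good
    return payload.decode("ascii").split(",")


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so runs are repeatable
    shown = LABELS
    for _ in range(5):
        print("naive:  ", naive_display(noisy_wire(encode(LABELS), 0.02, rng)))
        frame = noisy_wire(add_crc(encode(LABELS)), 0.02, rng)
        shown = checked_display(frame, shown)
        print("checked:", shown)
```

With the naive display, a single flipped bit is enough to turn one floor label into another character, which is at least consistent with the sort of nonsense we see in the lift. The checked display simply ignores damaged frames, which is roughly the cheapest fix one could imagine.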

I’m also a bit curious how no one at the manufacturer noticed this during testing. Or maybe they just didn’t care.

Interestingly, no matter how messed up the display is, the lift always manages to stop at the right floors. So the main controller is unaffected; only the display inside the carriage is wrong. At least that means it’s not a safety issue. (These same lifts randomly and suddenly dropping 20cm: that was a safety issue. But the manufacturer did manage to fix that one.)

A building full of software engineers has fun with this sort of thing. I wonder what the less technically inclined think when they see something like this. Do they, too, wonder what has gone wrong? Or is this just part of the way the world works? Or, like the front desk managers in Scotland, do they not even think about how unusual this is at all?