When it comes to computers in my house, it never rains but it pours. Microsoft released their monthly Patch Tuesday updates on 8 November 2016. I duly installed them, rebooted, and then turned the computer off. When I turned it on the following day, the computer failed to boot. It didn’t even get as far as the blue Windows logo that appears when you try to boot Windows 10. “Very odd”, I thought, “I wonder what caused that?” and went to do some digging.
The offending update was KB3200970, the November cumulative update for the Windows 10 Anniversary Update. The fact that the Windows screen didn’t appear led me to think that it was actually the boot loader that was the problem, so I booted into the EFI Shell (thankfully my Supermicro X10SAE workstation board has a copy of that in its firmware, unlike some of the consumer boards) and attempted to boot bootx64.efi off the EFI System Partition by hand. And, lo and behold, it crashed the machine solid when I tried to run it.
The next thing to try was a clean install of Windows 10, to see if it still did it. (Protip: Never store any important files on a Windows box, especially on a boot disk, in case you have to do things like this.) A clean install of Windows 10 Anniversary Update from a USB stick worked initially, no matter how many times I booted it, so I assumed from this it wasn’t a hardware fault – and just to prove it, I installed Debian. That worked fine too. But as soon as KB3200970 was installed, the machine went bang on the reboot after the updates had been installed, every time.
To cut a long story short, talking to Microsoft support was worse than useless. I couldn’t get anything out of them apart from vague promises that ‘our engineers are looking into this’, and I gave up all hope of anyone bothering to deliver on that promise, especially as they weren’t asking the right questions about my hardware or software configuration. So in the end I disabled the installation of KB3200970 and left it at that. At least I had a working computer. In the meantime, I read up on Windows PE and how to make .WIM files of the hard disk, so at least I had a backup of both the ESP and the boot disk NTFS partition if it happened again. The problem was, the boot loader was changed as part of this update because of a security hole, CVE-2016-7247, and therefore either MS had to undo the bad fix, or I had to work out what it was about my system that was causing this behaviour.
Fast forward a few days and a very curious note appeared on the Microsoft support page for KB3200970. It would appear that certain Lenovo-branded servers were having similar symptoms to me, but on Windows Server 2016, and the official Microsoft advice and the advice on the Lenovo article was ‘Lenovo are looking into it’ and ‘don’t install the update’. This started to ring a bell, because this usually results in a new UEFI firmware being issued, and sure enough, that’s what happened in the case of Lenovo.
For mainly sound-card related reasons, I had to downgrade to UEFI firmware 2.0 on my motherboard because the sound card stuttered when I installed 3.0. I did start to wonder whether whatever was breaking the Lenovo servers was also breaking my board as well.
So, I re-flashed my motherboard with the 3.0 UEFI firmware. Last time I did this, this caused my sound card to stutter so badly that it became unusable. (You should probably know that my sound card is actually PCI, as it’s an expensive RME HDSP 9632 card, so to replace it with a PCIe version for the sake of it would have cost about twice as much as the motherboard, and therefore it’s connected to the rest of the motherboard via an onboard PCIe bridge.)
So, more digging. What could be causing the stuttering? First I tried the PCI latency timer, which is normally set at 32 PCI clocks by default – raising this value gives each PCI card a longer timeslot to send data, but this can cause performance problems. Either way, changing the value to a much larger value, like 128 or 224, didn’t help so that couldn’t have been it.
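To put those latency timer values into perspective, here is a rough back-of-the-envelope conversion from PCI clocks to microseconds. It is only a sketch: it assumes a conventional PCI bus running at a nominal 33.33 MHz, which I haven't verified for the slot behind the bridge, and `latency_timer_us` is just a name made up for this illustration.

```python
# Assumption: conventional PCI clocked at a nominal 33.33 MHz.
PCI_CLOCK_HZ = 33_333_333


def latency_timer_us(clocks: int, clock_hz: int = PCI_CLOCK_HZ) -> float:
    """Convert a PCI latency timer value (in bus clocks) to microseconds."""
    if clocks < 0:
        raise ValueError("latency timer value cannot be negative")
    return clocks / clock_hz * 1_000_000


if __name__ == "__main__":
    # The default and the two larger values tried above.
    for clocks in (32, 128, 224):
        print(f"{clocks:3d} clocks = {latency_timer_us(clocks):.2f} us")
```

Under that assumption the default of 32 clocks is just under a microsecond of bus time per grant, and even 224 clocks is only a handful of microseconds.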
After much head-scratching and guessing, what it turned out to be in the end was a bit obscure. I’m not sure if this is a new option in the 3.0 UEFI firmware, but there is a set of options to configure something called ASPM (Active State Power Management). Reading Wikipedia or otherwise on this subject I’ll leave as an exercise for the reader, but it would appear that my RME sound card doesn’t like having the power management enabled on it, and it’s possible in the 2.0 UEFI firmware it was locked to power management off, but in the 3.0 version the option was exposed and defaulted to Auto. So, (mainly in desperation at this point), I turned the power management off on the PCI slots only to see if it improved things. In trepidation, I saved the changes in the UEFI settings and booted Windows, and fired up some sound. No more stutters…
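If you want to see what the operating system thinks about ASPM, as opposed to what the firmware setup screen says, Linux exposes its current policy in `/sys/module/pcie_aspm/parameters/policy`, with the active choice shown in square brackets. The sketch below only reads that file; it is Linux-only (so no help on the Windows install), the helper names are made up for this example, and it says nothing about what the firmware has locked on or off for an individual slot.

```python
from pathlib import Path
from typing import Optional

# Linux-only: the kernel's global ASPM policy, e.g.
# "[default] performance powersave powersupersave"
ASPM_POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(text: str) -> Optional[str]:
    """Return the bracketed (active) policy name, or None if none is marked."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_aspm_policy(path: Path = ASPM_POLICY_PATH) -> Optional[str]:
    """Read the policy file; return None if it is missing or unreadable."""
    try:
        return active_aspm_policy(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    policy = read_aspm_policy()
    print(policy if policy is not None else "ASPM policy not available")
```

On a machine without that file (Windows, or a kernel built without ASPM support) the script simply reports that the policy is not available.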
With that problem out of the way, I thought, perhaps this 3.0 UEFI firmware is usable after all. So then I had the silly idea of trying to reinstall KB3200970 to see if it would crash my machine again. After all, what could possibly go wrong (!) and in any event I had a .WIM file backup of my machine that I could restore if it went wrong again.
So I installed it from an .msu file rather than using Windows Update proper and rebooted. The first reboot would always have worked, because the new boot loader gets installed as Windows comes up from the first reboot.
So, the machine booted, and I rebooted it again, expecting to get a crashed machine and a blank screen. But this time I didn’t. So I rebooted it again just to make sure it wasn’t a fluke. It rebooted properly again. Then I tried it again, to make sure the last reboot wasn’t a fluke, and it booted properly again.
So, whatever fixes were made between the 2.0 and 3.0 versions of the UEFI firmware for the Supermicro X10SAE board made this work again. Had I not figured out what was causing the sound card stuttering, this version of the firmware would have been very problematic for me, but now I seem to have a happy computer again.
So, if your boot loader crashes on you, it might just be your UEFI firmware (or BIOS if you have one of those instead) in need of an update. As anyone who has read them knows, UEFI and BIOS changelogs are notoriously bad at actually explaining what they fix (mainly because the firmwares themselves seem to be so badly written it would be like hanging out all your dirty washing in public if they admitted to anything). And I’m sure Microsoft will be blaming the firmware writers and the firmware writers will be blaming Microsoft, and no-one will admit to what really happened here, but if it’s been this much bother for me, what on earth is the ordinary punter in the street supposed to do about it when Windows fails to boot and their sound card stutters in this manner?
I’ve encountered the sort of weirdness you’re describing with PCIe to PCI bridges before.
The trouble is that PCIe ASPM state transitions take time, and the PCI card expects to be able to DMA at any moment. If your card was a PCIe card, it’d prevent ASPM L0s/L1 entry when in use. Disabling ASPM for the bridge solves this with the big hammer – if you weren’t latency-bound, you could enable it again for the power savings.