In spring 2024, I completed the transition of Vivaldi’s developer build assistance from Goma to the newer Reclient system.
Goma and Reclient are remote build assistance systems that compile the files you need to build on remote computers (workers) managed by a backend server. The results are cached and can be shared by multiple developers, reducing the time spent building the source.
Goma was the original system implemented by the Chromium team for this, and we implemented our own system for that some years ago, although to get it working with our environment we had to do some patching of the server and the client that would be running on the developers’ machines.
Reclient is the replacement for Goma, and has some significant improvements in that it can communicate directly with the caching backend (called a CAS Server), which also distributes tasks to the workers. With Goma, the backend server facilitated that communication, meaning that there was an intermediate step that might cause a performance bottleneck. Reclient is also based on the communication protocol used for the Bazel build system, which actually combines what Reclient does with what the Ninja and GN build tools do for Chromium (but just adding Reclient to the project is far less complicated than changing the whole system to Bazel).
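To make the shared-cache idea concrete, here is a minimal sketch of a content-addressed build cache. It is not Reclient's or Goma's actual API; the class `SharedBuildCache` and the functions `action_key`, `build` and `fake_compile` are hypothetical names invented for illustration, and a real backend is a network service rather than an in-memory dictionary. The sketch only assumes that a build step is identified by hashing its command line and input, so a second developer asking for the same step gets the stored result instead of recompiling.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content address: the SHA-256 hex digest of the bytes themselves."""
    return hashlib.sha256(data).hexdigest()


class SharedBuildCache:
    """Toy stand-in for a shared cache backend (hypothetical, in-memory)."""

    def __init__(self):
        self.blobs = {}    # content digest -> stored bytes
        self.actions = {}  # action key -> digest of the build output

    def store(self, data: bytes) -> str:
        key = digest(data)
        self.blobs[key] = data
        return key


def action_key(command, source: bytes) -> str:
    """Key a build step by its command line plus the digest of its input."""
    return digest("\0".join(command).encode() + b"\0" + digest(source).encode())


def build(cache, command, source, compile_fn):
    """Return (output, cache_hit); compile only when no cached result exists."""
    key = action_key(command, source)
    if key in cache.actions:
        return cache.blobs[cache.actions[key]], True
    output = compile_fn(source)
    cache.actions[key] = cache.store(output)
    return output, False


calls = []  # records every real (uncached) compilation


def fake_compile(source: bytes) -> bytes:
    """Pretend compiler used only for this illustration."""
    calls.append(source)
    return b"OBJ:" + source


cache = SharedBuildCache()
cmd = ["clang", "-c", "foo.cc"]
first = build(cache, cmd, b"int main() {}", fake_compile)   # first developer
second = build(cache, cmd, b"int main() {}", fake_compile)  # second developer
print(first[1], second[1], len(calls))  # prints: False True 1
```

The second request is served from the cache, so the pretend compiler runs only once; that is the time saving a shared cache is meant to provide.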
Soon after we transitioned in early May, I started noticing some kind of memory leak that seemed related to Reclient usage: the more builds I ran, the more memory was used and never released. Eventually more than half the memory of my PC was consumed, and I had to reboot the computer at least once or twice a week. Previously that had only happened once a month (when we “celebrated” Patch Tuesday together with Microsoft; in fact, at about the time we transitioned to Reclient we had just held another such “celebration”, more about that later).
This naturally started to get “irritating” when builds started to fail because they ran out of memory (some tasks performed when building Vivaldi require dozens of gigabytes of RAM), and I also suspected that it caused some apps to crash in the background.
A curious observation did nag at me: We are running a second Reclient cache-only system in our build cluster, and those machines did not experience the issue at all, nor did my colleagues’ machines (although a couple of colleagues reported issues that might be related). So what was causing the issue on my computer?
Using a Microsoft tool to analyze RAM usage, I found hundreds if not thousands (it actually turned out to be tens or even hundreds of thousands) of processes that were still using RAM even when they were no longer running. At first, this seemed to be a problem with Reclient.
I filed a bug report with the Reclient team, but nothing much happened, so I kept rebooting my machine several times a week.
During the autumn I started to notice something else: after a while, Vivaldi and the Slack App (which is also a Chromium app based on Electron) started freezing up at irregular intervals, to the extent that Windows would put up a banner on them saying that “[App] is not responding”. I eventually concluded that the freezes happened when the apps were allocating memory from the system, and that when this started happening it was time to reboot the machine. Again.
Moving into December I finally got irritated enough that I started looking more into this issue, running various tests to try to isolate the problem, without luck. Separately I also discovered a case of the Git tool leaking memory in the same fashion when I was running a massive Git script (I actually had to move that job over to a Linux machine to be able to complete it).
Eventually I did a search for info about persisting processes like what I was seeing. One of the top results was a 2018 article by Bruce Dawson (who recently retired from Google) about zombie processes. That article identified a Windows process that (then) had a tendency to keep processes from properly closing by not releasing a Windows System Handle (Handles are used to identify resources, like open files and processes), so they remained in memory (aka “zombies”), occupying about 64 KB each. 64 KB does not sound like much, but when you are dealing with hundreds of thousands of dangling processes it will add up. The problem in my case was that the process responsible for Bruce’s 2018 problem was nowhere to be found among the Windows processes on my machine (probably gone after the original bug was fixed by Microsoft), and no other process fitted the signature of the problem.
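As a rough back-of-the-envelope check of how 64 KB per zombie adds up, here is a small sketch. The function name `zombie_memory_gib` and the sample counts are made up for illustration; the only figure taken from the article is the approximate 64 KB per zombie process, so the results are estimates rather than measurements.

```python
KIB = 1024
GIB = 1024 ** 3


def zombie_memory_gib(zombie_count: int, kib_per_zombie: int = 64) -> float:
    """Approximate memory held by zombie processes, in GiB."""
    return zombie_count * kib_per_zombie * KIB / GIB


# Illustrative counts only, not measured values.
for count in (1_000, 100_000, 500_000):
    print(f"{count:>7} zombies ~ {zombie_memory_gib(count):.2f} GiB")
```

At a thousand zombies the cost is negligible (about 0.06 GiB), but at a hundred thousand it is already around 6 GiB, and at half a million it is around 30 GiB.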
Downloading and running a tool Bruce had made available came up empty. I then downloaded the source of the tool and started adding debug code to try to discover the problem, but still came up empty. No process that the tool could check was causing the problem.
While discussing this with Bruce, he started thinking that a device driver was causing it. I wasn’t really inclined to believe that, since I have installed few device drivers myself; most of them were pre-installed when the machine was bought, or installed by the OS.
Eventually, I did a new search, including device drivers as a keyword, and I started hitting reports from April and later (e.g. [1] [2]) about an AMD Adrenalin Edition graphics driver for the Radeon GPU causing lots of zombie processes. Certain newer versions of the driver seemed to allocate handles to the processes that were never released.
That was very interesting in relation to my problem, since my machine is an AMD machine, and I found that it actually had that software installed (installed by the shop that built the computer for us). However, my graphics card uses an Nvidia chip …
The forum threads indicated that v24 of the tool was involved in the problem, but my installed version was 23.10.x, so it should not have been affected. That impression did not change even when further reports indicated that 23.12 was causing the problem, too, while 23.11 (still newer than mine) was not affected.
I then looked in the Windows Device Manager app for the installed graphics drivers and found that I had a Radeon GPU (which comes shipped onboard with the CPU) with a driver installed. The driver itself had a completely unrelated (not v24) version number, but it was signed by Microsoft in early January 2024, several months after I set up the machine, which meant that it had to have been installed by Windows Update. My guess is that it was installed as part of the May Patch Tuesday update (since I hadn’t noticed any memory usage issues before May).
The forum threads I found suggested several possible solutions, such as uninstalling the current software and installing an older version, or alternatively disabling the onboard GPU in the BIOS.
Given my suspicions about Windows Update I don’t think that reverting to an older version would work for long, unless one also took steps to freeze the driver version, so that Windows Update couldn’t smuggle it back in during a later “celebration”.
Disabling the Radeon onboard chip might work for my machine, since I am not using that HDMI port, having an Nvidia graphics card (for multiple displays), but it would definitely cause problems for those using that GPU for their display; they would have to go with the version revert option.
At the time I discovered this, I wasn’t able to do a BIOS setting change, being out of office for several weeks, but I decided to try a different option: Disable the Radeon device in the Windows Device Manager app.
After doing that, and rebooting, I ran several build jobs, including a full rebuild of one of my environments. Results so far: No zombie processes. Yay!
(Update: Bruce has updated the article to describe this case, and has also added extra information to his tool to make the user aware of drivers as a possible source of the problem.)
What appears to be causing this is a MAJOR bug in this AMD driver. And I find it very problematic that AMD apparently had not yet fixed the problem (at least there seems to be no version 25) or at least requested that Microsoft push out an update with a fix of the driver, despite the fact that they have to have known about the issue for over 8 months now.
Another, related issue is the question of why this driver had any business whatsoever setting up a handle to these processes, which apparently did not use the GPU at all (and in particular not an unused graphics device). Perhaps it happens if/when the application (or its environment) uses a GPU to calculate something. But at this point it is up to AMD to discover why they are creating zombie processes this way, and provide a fix.
In my opinion, while AMD is mainly to blame, Microsoft is not getting off the hook either, since they actually shipped that driver out using an automatic update, without discovering the zombie process issue during testing. Microsoft should definitely improve their driver testing procedures.
So, AMD, you had better break out the software chainsaws and flamethrowers and go hunt up those zombie plague-spots and remove them ASAP.
Update Jan 8 2025: An AMD dev responding to Bruce Dawson’s Bluesky post about the issue says that the problem should be fixed in recent versions of the driver, but it is probably too early for Windows Update to be shipping that fix out yet. So, a manual update of the driver is currently the only way to fix the issue, which means you have to be aware of the cause of the problems you are experiencing to do so. In my case, I don’t need that GPU, so I am just going to disable it in BIOS sometime in the next week.