Dealing with bad RAM on Linux
I have a server with a single defective byte of RAM. Usually you would just RMA the affected stick, but that felt kind of wasteful for a single byte on an 8GiB stick that was otherwise still perfectly fine.
Under Linux you have basically three methods of telling the kernel that you don't want to use the defective memory anymore: the BadRAM patch, the memmap kernel parameter, and Grub 2's badram command.
As the first option would require patching the kernel, I wanted to stay away from it if possible. The second one is a bit flaky: the addresses are in MiB increments, and most places that describe the syntax for excluding a specific region mention that it is unstable.
Telling Grub would obviously be the easiest variant. It turns out that using that is actually pretty simple, but there are a few caveats, especially if you run a 64-bit system (as most people currently do).
First you have to run Memtest86+ and note the addresses that are defective. People generally recommend using the badram output option to print the addresses, but that output option cuts off addresses larger than 2^32. So what you actually need to do is take the defective address from the screen (in the example below 003ba0b5e20).
That address (or each address, if there are several) now needs to be stripped of excess zeroes and formatted like this: 0x00000003ba0b5e24. Together with a mask, for example 0xffffffffffffff00, we can put this address into the Grub config.
badram 0x00000003ba0b5e24,0xffffffffffffff00
Further address/mask pairs can be appended to the same line, each separated by another comma.
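The formatting step and the effect of the mask can be sketched in Python. This is only an illustration: the helper names badram_entry and covered_range are made up for this post, and the range calculation assumes the usual BadRAM convention that an address matches when it agrees with the pattern in every bit where the mask is 1. Under that assumption the example mask describes a 256-byte window. The kernel log further down shows a 1KiB hole, so the region that actually gets excluded can be larger than what the mask alone describes.

```python
U64 = 0xFFFFFFFFFFFFFFFF  # keep everything within 64 bits


def badram_entry(address, mask=0xFFFFFFFFFFFFFF00):
    """Format an address/mask pair as two 16-digit hex values (hypothetical helper)."""
    return f"0x{address & U64:016x},0x{mask & U64:016x}"


def covered_range(address, mask):
    """Return the inclusive (first, last) byte range described by address/mask.

    Assumption: the mask is a run of 1 bits followed by 0 bits, and bits
    that are 0 in the mask are "don't care" bits.
    """
    mask &= U64
    base = address & mask
    size = (~mask & U64) + 1
    return base, base + size - 1


# Address as read from the Memtest86+ screen, without the 0x prefix.
addr = int("003ba0b5e24", 16)

print("badram " + badram_entry(addr))
first, last = covered_range(addr, 0xFFFFFFFFFFFFFF00)
print(f"covers {first:#x}-{last:#x} ({last - first + 1} bytes)")
```

A wider mask (fewer 1 bits at the low end) would describe a larger window around the same address.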
After rebooting the system, we can see that the memory map in the kernel log has changed:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009d3ff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009d400-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000b82f4fff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000b82f5000-0x00000000b82fbfff] ACPI NVS
[ 0.000000] BIOS-e820: [mem 0x00000000b82fc000-0x00000000b8748fff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000b8749000-0x00000000b8b98fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000b8b99000-0x00000000cc6a9fff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000cc6aa000-0x00000000cc8b1fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000cc8b2000-0x00000000cc8c8fff] ACPI data
[ 0.000000] BIOS-e820: [mem 0x00000000cc8c9000-0x00000000cce0afff] ACPI NVS
[ 0.000000] BIOS-e820: [mem 0x00000000cce0b000-0x00000000cdffefff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000cdfff000-0x00000000cdffffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000cf000000-0x00000000df1fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000f8000000-0x00000000fbffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fed00000-0x00000000fed03fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x00000003ba0b5bff] usable
[ 0.000000] BIOS-e820: [mem 0x00000003ba0b6000-0x000000041fdfffff] usable
The last two entries are proof that our Grub parameter indeed had the effect of disabling the defective memory. The kernel blocked 1KiB of memory between the two large usable blocks, and that 1KiB gap contains the defective RAM.
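If you would rather not compare hex ranges by eye, the check can be scripted. The following is a minimal sketch, not a polished tool: it assumes the log lines look exactly like the BIOS-e820 lines above, the function names parse_e820 and hole_containing are invented for this example, and only the last two lines of the log are pasted in to keep it short.

```python
import re

# Matches the "[mem 0x...-0x...] type" part of a BIOS-e820 log line.
E820_LINE = re.compile(r"\[mem 0x([0-9a-f]+)-0x([0-9a-f]+)\]\s+(.+)$")


def parse_e820(log):
    """Return a list of (start, end, type) tuples from e820 log text."""
    entries = []
    for line in log.splitlines():
        match = E820_LINE.search(line)
        if match:
            start = int(match.group(1), 16)
            end = int(match.group(2), 16)
            entries.append((start, end, match.group(3).strip()))
    return entries


def hole_containing(entries, address):
    """Return the (first, last) bytes of the unlisted gap holding address, or None."""
    entries = sorted(entries)
    for (_, prev_end, _), (next_start, _, _) in zip(entries, entries[1:]):
        if prev_end < address < next_start:
            return prev_end + 1, next_start - 1
    return None


LOG = """\
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x00000003ba0b5bff] usable
[ 0.000000] BIOS-e820: [mem 0x00000003ba0b6000-0x000000041fdfffff] usable
"""

hole = hole_containing(parse_e820(LOG), 0x3BA0B5E24)
if hole is None:
    print("address is still inside a listed region")
else:
    first, last = hole
    print(f"hole {first:#x}-{last:#x}, {last - first + 1} bytes")
```

For the two lines above this reports a hole from 0x3ba0b5c00 to 0x3ba0b5fff, which is the 1KiB gap, and the defective address 0x3ba0b5e24 sits inside it.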