24 August 2018

Help, ZFS ate my storage server (kernel segfaults on SL6)


At Edinburgh our storage test server (SL6) just updated its kernel and had to reboot. Unfortunately it did not come back: it suffered a kernel segfault during the reboot.

This was traced to the filesystem mounting stage of the init scripts. Specifically, it was caused by modprobe-ing the zfs module, which had just been built by dkms.

The newer SL6 Red Hat kernels (2.6.32-754....) appear to have broken part of the ABI used by the ZFS modules built by dkms.
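One rough way to spot a module that was not built for the kernel you are about to boot is to compare the module's vermagic string with the kernel release. The sketch below is only an illustration: the function name built_for_kernel is made up for this post, and a matching vermagic does not prove the ABI is compatible. It only flags a module that was built against some other kernel.

```shell
# Hypothetical helper (not part of dkms or ZFS): succeeds if a module's
# vermagic string names the given kernel release.
# A vermagic string looks like "<kernel-release> SMP mod_unload modversions".
built_for_kernel() {
    local vermagic="$1" release="$2"
    # The first space-separated field of vermagic is the kernel release
    # the module was compiled against.
    [ "${vermagic%% *}" = "$release" ]
}

# Example use on a live system (commented out so that the sketch has no
# side effects when pasted into a shell):
#   if ! built_for_kernel "$(modinfo -F vermagic zfs)" "$(uname -r)"; then
#       echo "zfs module was not built for the running kernel"
#   fi
```

modinfo looks at the modules of the running kernel by default. To inspect the modules installed for a different kernel before booting into it, modinfo's -k option (where your version supports it) takes the kernel release to look under.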

The fix was found to be:
  1. Reboot into the old kernel (anything with a version 2.6.32-696... or older).
  2. Check dkms for builds of the zfs/spl modules:   dkms status
  3. Run:   dkms uninstall zfs/0.7.9; dkms uninstall spl/0.7.9
  4. Make sure dkms removed these for ALL kernel versions. If needed, run   dkms uninstall zfs/0.7.9 -k 2.6.32-754   to remove a build for a specific kernel.
  5. Remove all traces of these modules:
     for i in /lib/modules/*; do
      for j in extra weak-updates; do
       for k in avl icp nvpair spl splat unicode zcommon zfs zpios ; do
         # -f: not every module directory exists under every kernel
         rm -rf "${i}/${j}/${k}";
       done;
      done;
     done
  6. Reboot back into the new kernel and reinstall the modules (spl first, since zfs builds against it):
    dkms install spl/0.7.9; dkms install zfs/0.7.9
  7. Check that you've saved everything important.
  8. Now load the new modules:   modprobe zfs
  9. Re-import your pools:   zpool import -a
Alternatively: remove all of the zfs modules (steps 3 and 5) after installing the new kernel but before you reboot. dkms will then re-install everything on the next reboot.
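The clean-up in step 5 can also be wrapped in a small function so that it can be previewed before anything is deleted. This is a sketch rather than a tested procedure: the function name purge_zfs_modules and the DO_DELETE switch are made up for this post, and the directory and module names are simply the ones from step 5.

```shell
# Hypothetical helper: list (or, with DO_DELETE=1, remove) leftover ZFS/SPL
# module directories for every kernel under a modules root.
# The root defaults to /lib/modules; pass another path to try it safely.
purge_zfs_modules() {
    local root="${1:-/lib/modules}"
    local kdir sub mod target
    for kdir in "$root"/*; do
        [ -d "$kdir" ] || continue
        for sub in extra weak-updates; do
            for mod in avl icp nvpair spl splat unicode zcommon zfs zpios; do
                target="$kdir/$sub/$mod"
                # Skip module directories that do not exist for this kernel.
                [ -e "$target" ] || continue
                if [ "${DO_DELETE:-0}" = "1" ]; then
                    rm -rf -- "$target"
                    echo "removed $target"
                else
                    # Dry run (the default): only report what would go.
                    echo "would remove $target"
                fi
            done
        done
    done
}
```

Calling purge_zfs_modules with no arguments prints what it would remove under /lib/modules. Once the list looks right, and after the dkms uninstall commands from step 3, running DO_DELETE=1 purge_zfs_modules performs the removal.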

For more info: https://github.com/zfsonlinux/zfs/issues/7704


TL;DR  After a kernel update dkms doesn't always rebuild external modules safely. Remove the old builds of these modules when you perform a kernel update so that everything is rebuilt cleanly.