Breaking and unbreaking my Arch installation
An unsolved mystery and a solution?
A couple of weeks ago, I somehow managed to break my Arch Linux installation
through a simple pacman -Syu. I’m still not sure what caused it, but after a
quite normal package update, I rebooted the system and was met with a message
stating something along these lines, before I could even input my LUKS
passphrase:
Failed to execute /bin/init (No such file or directory)
Failed to execute /sbin/init (No such file or directory)
Failed to execute /init (No such file or directory)
Since the kernel had loaded and was looking for an init program to run, this was likely a problem with the initramfs. For the uninitiated, the initial RAM filesystem (often loosely called initrd, for “initial ramdisk,” after the older mechanism it replaced) is a temporary filesystem that is loaded into memory during the boot process to prepare the system before the real init system is started.
On a system like mine, with an encrypted root filesystem, the initramfs is
responsible for unlocking the encrypted LUKS volume so the Linux kernel image
can then mount the root filesystem. This should all happen before the system
root is mounted and /bin/init is executed. The init system could be one of
many, but on a default Arch Linux installation like mine, it is systemd:
$ file /bin/init
/bin/init: symbolic link to ../lib/systemd/systemd
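One way to test the "broken initramfs" theory is to list the contents of the image and look for /init. The following is a minimal sketch, not something I ran at the time: it assumes the default image path /boot/initramfs-linux.img, assumes lsinitcpio (shipped with mkinitcpio) prints one path per line, and listing_has_init is a made-up helper name.

```shell
# Hypothetical helper: succeeds if a file listing read from stdin contains
# a top-level "init" entry (tolerating a leading "./" or "/").
listing_has_init() {
    grep -Eq '^(\./|/)?init$'
}

# Intended use on the affected system or inside a chroot
# (the image path is an assumption):
#   lsinitcpio /boot/initramfs-linux.img | listing_has_init \
#       && echo "image contains /init" \
#       || echo "image has no /init"
```

If the listing has no init entry, rebuilding the images is a reasonable next step; if it does, the problem probably lies elsewhere.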
So something seemed wrong with the initramfs, and thus my first instinct was to
chroot from another system and rebuild all the images with mkinitcpio -P. I
wouldn’t need a live USB for this, as I had a dual-boot setup with NixOS, to
which I was in the admittedly slow process of migrating. Lucky me, I thought.
Visualizing the state of my system, I had:
NAME          SIZE
nvme0n1       931.5G
├─nvme0n1p1   2G      <--- /boot (EFI, broken initramfs?)
├─nvme0n1p2   16G     <--- swap
├─nvme0n1p3   100G    <--- LUKS
│ └─arch      100G    <- Arch Linux root (good?)
└─nvme0n1p4   150G    <--- LUKS
  └─nixos     150G    <- NixOS root (good)
Booting up NixOS was working fine, so I mounted the Arch root and the boot
partition under /mnt, then tried to chroot into it, but instead of being
thrown into a shell within the Arch root, I was met with a strange error that
disturbed my calm:
# mount /dev/mapper/arch /mnt
# mount /dev/nvme0n1p1 /mnt/boot
# mount -o bind /dev /mnt/dev
# mount -o bind /proc /mnt/proc
# mount -o bind /sys /mnt/sys
# chroot /mnt /bin/bash
chroot: failed to run command '/bin/bash': Input/output error
This was strange. I could mount the root filesystem, but not chroot into it.
Had I run out of write cycles on the SSD?! No, this NVMe SSD was almost
brand new, and I could write to and read from it just fine, so that
couldn’t be it. I was also able to execute the binary from NixOS
by running /mnt/bin/bash, so it wasn’t a problem with the binary itself.
Running strace -f chroot /mnt /bin/bash didn’t give me any more information,
but it did show a ton of errors about accessing locale files in the
/nix/store directory, which was odd to me.
Thinking perhaps this was a problem with NixOS itself, I looked up chroot on
the NixOS wiki, finding my way to this article, which
recommends the following commands to change root into another NixOS
installation:
mount -o bind /dev /mnt/dev
mount -o bind /proc /mnt/proc
mount -o bind /sys /mnt/sys
chroot /mnt /nix/var/nix/profiles/system/activate
chroot /mnt /run/current-system/sw/bin/bash
This latter part is very important. I knew it made no sense to run that
“activate” script on something that is not NixOS, but I thought perhaps this
was the way chrooting from NixOS was meant to be done. So without backing
anything up, I bind-mounted /nix from the outer NixOS installation at /mnt/nix
and ran it.
DO NOT DO WHAT I DID, because this was a huge mistake. Not only did I still
get the same error, I also managed to further break the Arch Linux installation,
as the NixOS activate script overwrote files under /etc, /var, /bin, /lib,
and many other places with symlinks to the Nix store, which, needless to say,
does not exist in the Arch root unless you’re using Nix on Arch (which
I am not).
Alright, so I had a broken system and I had just poked additional holes in it. If NixOS wasn’t going to help me, perhaps Arch Linux installation media would. Let’s create one and try to chroot from it.
The saga of the live USB
I looked for a spare USB stick and found one with a capacity of 16 GB. Trust
me, this detail will be relevant later. As per usual, I downloaded the latest
Arch Linux ISO from the official website and flashed it
to the USB stick using dd:
# dd if=archlinux-2024.04.01-x86_64.iso of=/dev/sdb status=progress
# sync
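A bad write is hard to spot from dd's output alone, so reading the image back from the stick is a cheap sanity check. This is a minimal sketch, assuming GNU cmp (for the -n byte limit) and a POSIX shell; verify_flash is a hypothetical helper name, and the device path is whatever the stick shows up as on your machine.

```shell
# Hypothetical helper: compare the image with the first <image size> bytes
# of the target. The device is usually larger than the image, so the
# comparison has to be limited to the image's length.
verify_flash() {
    image=$1
    target=$2
    # Arithmetic expansion strips any padding wc may add around the number.
    size=$(( $(wc -c < "$image") )) || return 2
    cmp -n "$size" -- "$image" "$target"
}

# Intended use, as root (names taken from the dd command above):
#   verify_flash archlinux-2024.04.01-x86_64.iso /dev/sdb && echo "write verified"
```

Unplugging and re-plugging the stick before running the check avoids comparing the image against pages that are still in the kernel's cache rather than what actually landed on the flash.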
Pretty standard procedure, right? Well, I booted off this USB stick and after the loud and disturbing beep that plays upon successfully booting up the installation media, I picked the first option on GRUB and…
error: invalid magic number
error: you need to load the kernel first
Press any key to continue...
Sadly there was no prompt for me to input 3. Pressing any key to continue would just go down the boot menu. I tried the other options to load the installation media, but they were all afflicted by this same error. Looking it up online only brought me to forums where replies amounted to “your live USB is broken, make a new one.” Quite insightful.
Maybe I did screw something up when flashing this guy, I thought, so I decided to flash another ISO to the USB stick: the GParted live ISO. My partitions were in need of a resize anyway, so I might as well see if the problem magically went away. It didn’t: the same error was there, and I couldn’t get past the bootloader.
A-ha! NixOS must be the problem! It is an immutable system, so if there is such a thing as a randomly generated magic number, it is likely to be the same for every NixOS installation…? I don’t know, this was a simple hypothesis and I was desperate. My work machine is a MacBook, so let’s just flash the Arch Linux ISO from there!
macOS absolutely sucks
For whatever reason I believed dd was not available on macOS. This is
incorrect: it is available and I should’ve tried using it, so take this section
with a grain of salt. I decided to use balenaEtcher instead, which my
wife had used before to create a bootable USB stick for Ubuntu. Everything
looked promising, I granted the stupid permissions it asked for, and it failed
to flash, saying /dev/disk2 was not writable. Even running the program
with sudo had the same result.
Okay, perhaps UNetbootin would work. I downloaded the macOS version, ran it, and it failed to detect the USB stick no matter which USB port I plugged it into. I checked permissions for the inodes, ran everything I could as root, but it’s like the system was working against me. After struggling with this for a few hours, I gave up and called my wife to ask if I could use her Windows machine to flash the USB stick.
The sudden death of a USB stick
After being granted permission, I downloaded the Arch Linux ISO and flashed it
using Rufus. The process was quick and painless, and I was able to
boot off the USB stick without any issues. My first order of business now that
I could boot off the installation media was to create Btrfs snapshots to back up
my data. The one thing I wanted to separate from those snapshots was the
/home directory, so I needed to make it into a separate subvolume.
# cryptsetup open /dev/nvme0n1p3 archlinux
# mount /dev/mapper/archlinux /mnt
# mv /mnt/home /mnt/old-home
# btrfs subvolume create /mnt/home
# cp -rp --reflink=always /mnt/old-home/* /mnt/home
!!! No space left on device !!!
Strange, this cp command was supposed to create a copy-on-write clone of the
directory, meaning no new space would be consumed to begin with. Perhaps
metadata needed to be created anyway, and I was running out of space in the
100 GB root partition. So I decided it was time to resize these partitions and
maybe get rid of NixOS for good; time to flash the GParted live ISO.
I moved back to the Windows machine, plugged in the USB stick, and… its name and filenames looked like Minecraft incantations. Against my better judgment, I flashed the GParted live ISO to the USB stick using Rufus, and afterwards its reported capacity went mysteriously from 16 GB to 32 GB.
I plugged it into my computer expecting the worst, but I was able to boot off of it without any issues.
GParted and LUKS
GParted is perfectly capable of unlocking LUKS volumes and mounting the partitions that it finds within. I unlocked the Arch Linux root partition and clicked around to schedule its resize. All it needed to do was:
- Increase the size of the containing partition in the partition table;
- Unlock the LUKS volume to reveal the partition within;
- Extend the LUKS volume’s awareness of the containing partition;
- Resize the filesystem within the contained partition.
Except GParted refused to resize LUKS volumes for whatever reason. The error I got was said to have been fixed a few years ago, and I was on the latest version of GParted already. I thought perhaps if I went back to the Arch Linux live installation media, I could install GParted and resize the partitions from there. Except now the USB stick no longer worked, so I needed to find yet another USB stick from which to boot. This took longer than I would’ve liked, and in the end I got the same error anyway.
The “fix”?
In the end I used the new installation media (on a new USB stick) to resize the partition manually:
# fdisk /dev/nvme0n1 # Delete and recreate the partition!
# cryptsetup open /dev/nvme0n1p3 archlinux
# cryptsetup resize archlinux
# mount /dev/mapper/archlinux /mnt
# btrfs filesystem resize max /mnt
This was enough for me to be able to create the subvolume I wanted and the
copy-on-write clone of the /home directory. I then proceeded to create more
subvolumes for the root filesystem and take snapshots of what I had before.
None of what I tried doing to fix the existing system worked, but how could it when I wiped a lot of it when activating a NixOS profile? I still don’t know why I tried that, but I did, and I paid the price. I’m still not sure what went wrong, but all I could do in the end was create a completely fresh installation of Arch Linux and restore my data from the snapshots I took.
# cryptsetup open /dev/nvme0n1p3 archlinux
# mount /dev/mapper/archlinux /mnt
# mkdir /mnt/snap
# btrfs subvolume snapshot /mnt /mnt/snap/archlinux-broken-root
A copy of the Pacman database was also taken, so I could then reinstall all the
packages I had before. While this resulted in mostly the same setup as before,
I am now using the sd-encrypt hook for mkinitcpio instead of the encrypt
one, since the latter wasn’t working for me anymore, and some mysterious
problems came up with Plex Media Player where I am unable to play any media.
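For anyone making the same switch: the two hooks read different kernel parameters, so the boot entry has to change along with mkinitcpio.conf. Here is a sketch of the mapping, assuming the old command line identified the LUKS container by UUID. cryptdevice_to_rd_luks is a hypothetical helper, the UUID is a placeholder, and any extra cryptdevice options (a third colon-separated field) are dropped rather than translated.

```shell
# encrypt hook:     cryptdevice=UUID=<luks-uuid>:<mapper-name>
# sd-encrypt hook:  rd.luks.name=<luks-uuid>=<mapper-name>
cryptdevice_to_rd_luks() {
    param=$1
    spec=${param#cryptdevice=UUID=}
    # Bail out if the expected prefix was absent...
    [ "$spec" != "$param" ] || return 1
    # ...or if there is no ":<mapper-name>" part.
    case $spec in
        *:*) ;;
        *) return 1 ;;
    esac
    uuid=${spec%%:*}
    rest=${spec#*:}
    name=${rest%%:*}   # ignore a trailing ":<options>" field
    [ -n "$uuid" ] && [ -n "$name" ] || return 1
    printf 'rd.luks.name=%s=%s\n' "$uuid" "$name"
}

# Example with a placeholder UUID:
cryptdevice_to_rd_luks 'cryptdevice=UUID=0a1b2c3d-1111-2222-3333-444455556666:archlinux'
```

The sd-encrypt hook also needs the systemd hook in place of udev in the HOOKS array of mkinitcpio.conf, and the images have to be rebuilt with mkinitcpio -P afterwards.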