[Complete] Upcoming Maintenance: Upgrade to Lemmy 0.19.0

russjr08@outpost.zeuslink.net · edit-2 2 years ago

[Complete] Upcoming Maintenance: Upgrade to Lemmy 0.19.0

russjr08@outpost.zeuslink.net · 2 years ago

Well, that took a lot more blood, sweat, and tears than I thought it would.

Usually when performing an update, I do the following:

Take a snapshot of the VM
Change the version number in Lemmy docker-compose.yml file for both the lemmy and lemmy-ui containers
Re-create the containers, and start following the logs
If the database migration (if any) appears to be taking longer than expected, I temporarily disable the reverse-proxy so that Lemmy isn’t getting slammed while trying to perform database migrations (and then re-enable it once complete)
Upon any issues, examine where things might’ve gone wrong, adjust if needed, and worse-case scenario rollback to the snapshot created at the start

Everything was going to plan until step 4, the database migrations. After about 30 minutes of database migrations running, I shut off external access to the instance. Once we got to the hour and a half mark, I went ahead and stopped the VM and began rolling back to the snapshot… Except normally a snapshot restore doesn’t take all that long (maybe an hour at most), so when I stepped back 3 hours later and saw that it had performed about 20% of the restore that’s where things started going wrong. It seemed like the whole hypervisor was practically buckling while attempting to perform the restore. So I thought, okay I’ll move it back to hypervisor “A” (“Zeus”)… except then I forgot why I initially migrated it to hypervisor “B” (“Atlas”) which was that Zeus was running critically low on storage, and could no longer host the VM for the instance. I thought “Okay, sure we’ll continue running it on Atlas then, let me go re-enable the reverse-proxy (which is what allows external traffic into Lemmy, since the containers/VM is on an internal network)”… which then lead me to find out that the reverse-proxy VM was… dead. It was running Nginx, nothing seemed to show any errors, but I figured “Let’s try out Caddy” (which I’ve started using on our new systems) - that didn’t work either. It was at that point that I realized I couldn’t even ping that VM from its public IP address - even after dropping the firewall. Outbound traffic worked fine, none of the configs had changed, no other firewalls in place… just nothing. Except I could get 2 replies to a continuous ping in between the time the VM was initializing and finished starting up, after that it was once again silent.

So, I went ahead and made some more storage available on Zeus by deleting some VMs (including my personal Mastodon instance, which thankfully I had already migrated my account over to our new Mastodon instance a week before) and attempted to restore Lemmy onto Zeus. Still, I noticed that the same behavior of a slow restore was happening even on this hypervisor, and everything on the hypervisor was coming to a crawl while it was on-going.

This time I just let the restore go on, which took numerous hours. Finally it completed, and I shut down just about every other VM and container on the hypervisor, once again followed my normal upgrade paths, and crossed my fingers. It still took about 30 minutes for database migrations to complete, but it did end up completing. Enabled the reverse-proxy config, and updated the DNS record for the domain to point back to Zeus, and within 30 seconds I could see federation traffic coming in once again.

What an adventure, to say the least. I still haven’t been able to determine why both hypervisors come to a crawl with very little running on them. I suspect one or more drives are failing, but its odd for to occur on both hypervisors at around the same time, and SMART data for none of the drives show any indications of failure (or even precursors to failure) so I honestly do not know. It does however tell me that its pretty much time to sunset these systems sooner rather than later since the combination of the systems and the range of IP addresses that I have for them comes out to about ~$130 a month. While I could probably request most of the hardware to be swapped out and completely rebuild them from scratch, it’s just not worth the hassle considering that my friend and I have picked up a much newer system (the one mentioned in my previous announcement post and with us splitting the cost it comes out to about the same price.

Given this, the plan at this point is to renew these two systems for one more month when the 5th comes around, meaning that they will both be decommissioned on the 5th of February. This is to give everyone a chance to migrate their profile settings from The Outpost over to The BitForged Space as both instances are now running Lemmy 0.19.0 (to compare, the instance over at BitForged took not even five minutes to complete its database migrations - I spent more time verifying everything was alright) and to also give myself a bit more time to ensure I can get all of my other personal services migrated over, along with any important data.

I’ve had these systems for about three years now, and they’ve served me quite well! However, its very clear that the combination of the dated specs, and lack of setting things up in a more coherent way (I was quite new to server administration at the time) is showing that its time to mark this chapter, and turn the page.

Oh, and to top off the whole situation, my status page completely died during the process too - the container was running (as I was still receiving numerous notifications as various services went up and down), however inbound access was also not working either… So I couldn’t even provide an update on what was going on. I am sorry to have inconvenienced everyone with how long the update process took, and it wasn’t my intention to make it seem as if The Outpost completely vanished off the planet. However I figured it was worth it to spend my time focusing on bringing the instance back online instead of side-tracking to investigate what happened to the status page.

Anyways, with all that being said, we’re back for now! But it is time for everyone to finish their last drink while we wrap things up.