My issues, particularly with my network, always seem to come in clusters.
I caught the flu on New Year’s Eve and was out of commission for about a week. During the first day or two of being sick, my R710 running Proxmox and all of my VMs kernel panicked. I didn’t get any details about the exact cause because I was too out of it. I did try to restart it, but had no luck; it never managed to fully boot. While I was sick, I was able to re-install Proxmox to a new RAID 1 array (PVE was previously installed on a flash drive, which I suspect had something to do with the problem) and restore all of my backed-up VMs. I was still pretty out of it while I did this, but everything worked fine afterward and I was relieved that everything was working again – Home Assistant was controlling all of the outside lights, the telephone system was working, and the websites I host were back up.
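For anyone curious about the restore step: Proxmox can restore vzdump backups onto new storage with qmrestore. Here’s a minimal sketch of how you might batch-restore a directory of backups – the /mnt/backups path and the local-lvm storage name are placeholders, not my actual setup.

```python
#!/usr/bin/env python3
"""Sketch: restore every vzdump qemu backup found in a directory.

Assumptions: backups live in /mnt/backups and the new RAID 1 storage
is named 'local-lvm' -- both are placeholders, not my actual values.
"""
import re
import subprocess
from pathlib import Path

BACKUP_DIR = Path("/mnt/backups")  # hypothetical backup location
TARGET_STORAGE = "local-lvm"       # hypothetical Proxmox storage ID

# vzdump archives are named like vzdump-qemu-<vmid>-<timestamp>.vma.zst
PATTERN = re.compile(r"vzdump-qemu-(\d+)-.*\.vma(\.(zst|gz|lzo))?$")

for archive in sorted(BACKUP_DIR.iterdir()):
    match = PATTERN.match(archive.name)
    if not match:
        continue
    vmid = match.group(1)
    # qmrestore is the Proxmox CLI for restoring qemu backups; it
    # recreates the VM config and disks on the target storage.
    subprocess.run(
        ["qmrestore", str(archive), vmid, "--storage", TARGET_STORAGE],
        check=True,
    )
```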
Around this time, the server I shipped off to colocation was installed, and I was looking forward to getting services moved over before I had another issue with my infrastructure at home. This couldn’t happen soon enough. The next day (Saturday) I was feeling better, and the universe decided to test just how much better. Sometime around one PM, the power went out. I got the generator up and running within a couple of minutes, but found out that three of my four UPS units will not run on generator power. After about twenty minutes, I had to power down the newly-rescued Proxmox server and the file server with over 180 days of uptime. I was not happy about this. My plan had been to work on migrating services to the newly installed colocation server, but I couldn’t do that with the primary server down. So, with the power out and most of the network down, I spent an hour or so cleaning up the cabling in the back of the rack, and when the power came back on, it looked a bit better. I watched everything come back online for the second time in two days, and once it was all working, I figured I wouldn’t have to deal with this again for a while, as the R710 used to be very stable.
Everything was stable Sunday, so I thought I was in the clear.
The following day (Monday), I decided to spend another hour or so in the lab cleaning up the cabling for the client network. I didn’t take a before picture, but it was definitely a mess. I’m pretty happy with the way it came out. Again, everything seemed stable, so I thought I was in the clear – the cluster of issues was over.
Nope. I woke up on Tuesday to devices having a hard time connecting to WiFi, or not connecting at all, and my IP phones showing as unregistered. I went to the lab and saw that the R710 was completely off. Looking down, so was the UPS that powers it. I have no idea what caused this. The cats can’t turn it off because they can’t hold down the power button, but something weird must have happened. Regardless, I turned it back on and watched all of my services come back online for the third time in a week.

Now on to the WiFi issue: devices were either taking a long time to connect or not connecting at all. The UniFi dashboard showed one of my APs as disconnected from the controller. I unplugged that AP and the WiFi issues seemed to stop. A bit later I tried connecting the offending AP to a different switch port and the issue went away, so I must have plugged it into a port configured for something weird when I cleaned up the client network cabling the previous day. Fortunately, the cluster seems to be over now and everything is running smoothly. Fingers crossed it stays that way.