We've installed a new QFX member and, after initial configuration, restored full redundancy. All systems are now fully redundant again. We'll continue monitoring the infrastructure and close this issue later today.
Posted Nov 10, 2023 - 14:34 EET
Update
IPv6 connectivity has now also been restored.
We're proceeding with replacing the failed Virtual Chassis member.
Posted Nov 10, 2023 - 11:39 EET
Update
We've identified the main problem: it was related to one of the members of our Juniper QFX Virtual Chassis. We currently believe that this Virtual Chassis member suffered a catastrophic failure of its internal storage. Unfortunately, this specific Juniper QFX device was the active master in our VC configuration. As per best practices, we run multiple devices that can take over mastership when the current master fails; in this case, however, that failover did not happen, because JunOS was still partially running on the failing device. The failing master dropped the internal routing sessions, and no packets were being transmitted via the secondary devices. Once we determined the root cause, we performed a manual switchover to another VC member and connectivity was restored.
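For readers who want to picture what this looks like operationally, below is a minimal, illustrative sketch (not our exact tooling; the hostname and credentials are placeholders) of how Virtual Chassis member health can be checked remotely with Juniper's PyEZ library (junos-eznc):

    # Illustrative sketch only: query a QFX Virtual Chassis over NETCONF
    # using Juniper's PyEZ library. Host and credentials are placeholders.
    from jnpr.junos import Device

    with Device(host="qfx-vc.sofia.example.net", user="netops", passwd="secret") as dev:
        # Runs the operational command "show virtual-chassis status", which
        # lists each member's role (Master/Backup/Linecard) and presence --
        # this is where a failed or wedged member becomes visible.
        print(dev.cli("show virtual-chassis status", warning=False))

The manual switchover itself is typically performed from the JunOS CLI (for example, "request chassis routing-engine master switch" on a VC with a backup routing engine), though the exact procedure depends on the platform and configuration.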
We're working on replacing the failed device and bringing it back into our VC to restore redundancy.
Service redundancy should still be considered at risk.
IPv6 connectivity is not yet fully restored, as our priority is bringing IPv4 redundancy back into place.
Further updates to follow.
Posted Nov 10, 2023 - 11:04 EET
Identified
Hello,
We've identified an issue in our Sofia location affecting our network and have isolated it to our QFX Virtual Chassis core. Connectivity has been restored on a single member of the Virtual Chassis, so no redundancy is present at the moment. We're working on bringing the rest of our VC members online to restore redundancy.
Posted Nov 10, 2023 - 08:09 EET
This incident affected: Sofia, Bulgaria (Network Infrastructure in Sofia).