The team is preparing the cluster's master nodes for a hardware reset, as part of our ongoing effort to resolve the issue that causes random, temporary unavailability of file uploads and the S3 API. During the last maintenance window on March 22, we completely replaced the hardware of a single master node, which ruled out faulty hardware components as the root cause. Based on the information collected from testing, the available system logs and known bugs specific to the AMD platform (on which Sonic is built), we will perform this reset to boot the systems' kernels with the "iommu=pt" flag. This flag puts the IOMMU (AMD-Vi, AMD's technology for virtualising I/O resources) into pass-through mode. Should this attempt fail to resolve the issue, we have decided to switch all master nodes to a different type of server powered by Intel.
The expected unavailability during this maintenance window is 15 minutes. It could be extended if we hit a boot issue with the kernel flag enabled and need to revert the configuration.
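
For readers who want to confirm the same flag on their own Linux machines, the check after a reboot could look roughly like the sketch below. It is an illustration only, not the procedure our team follows: the function names are hypothetical, and it assumes a Linux host where the running kernel's boot parameters are readable from /proc/cmdline.

```python
from pathlib import Path


def has_kernel_flag(cmdline: str, flag: str = "iommu=pt") -> bool:
    """Return True if `flag` appears as a whole token in a kernel command line."""
    # Splitting on whitespace avoids false matches such as "amd_iommu=pt".
    return flag in cmdline.split()


def running_kernel_has_flag(flag: str = "iommu=pt",
                            path: str = "/proc/cmdline") -> bool:
    """Check the command line of the currently running kernel (Linux only)."""
    # /proc/cmdline holds the parameters the running kernel was booted with.
    return has_kernel_flag(Path(path).read_text(), flag)
```

Calling running_kernel_has_flag() on a freshly rebooted node returns True only if the kernel was actually started with the flag, which also makes it a quick way to confirm that a revert took effect.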