PUSHR - SFS storage disk array failure - Incident details

PUSHR's global system status is updated automatically when our monitoring systems detect an issue with any of our services. If you are aware of an ongoing issue that is not listed here, please report it to us by clicking the link above or by opening a ticket from your account's dashboard.

SFS storage disk array failure

Resolved
Major outage
Started almost 3 years ago · Lasted 2 days
Updates
  • Resolved

    We are closing this incident now.

  • Monitoring

    The team has managed to restore the entire system and services are now back up. No data loss has been reported for this incident. We continue to monitor the situation and still need to bring some auxiliary services up. We want to thank all customers for their patience and understanding during this incident.

  • Update

    We have managed to recover all arrays and are now verifying that content is actually in place and not lost. At this moment we do not see any missing data. The team is facing issues with booting into the on-disk OS, which is preventing us from bringing the service back up. Efforts will continue, and the next update will follow after 10AM CET or, if progress is made, before that time.

  • Update

    The array that holds customers' data has now completed the recovery process. Our course of action at this point is to decide between making the customers' data available immediately by mounting the array in the rescue OS, or attempting to recover the OS and /boot arrays, which would allow us to boot into the on-disk operating system. The latter, if successful, will allow us to completely restore the SFS service without further prolonging the downtime. We have decided to attempt this approach at the expense of a short additional downtime and are now starting the required procedures. The arrays that remain to be recovered are small in size and their expected recovery time should be very short. However, this incident should still be considered ongoing to its full extent, and until further notice the SFS service remains unavailable. (A short note on the rescue-OS mount option appears at the end of this page.)

  • Update

    Array recovery is now nearing 90%. We continue to await its completion before proceeding with the next steps.

  • Update

    We are continuing to monitor and wait for the RAID rebuild to complete. The process is currently at 70%.

  • Update

    The current ETA for the rebuild of the array is 16 hours. The team will let the processes run, and we will temporarily suspend further updates until 10AM CET, ~8 hours from this update.

  • Update

    The team is observing the rebuild process. Unfortunately, it is currently not possible to restart the affected systems with their default operating system images, and the recovery processes are taking place on a live (rescue) OS. During this process, which may be lengthy, the SFS storage service will remain unavailable. Due to the nature of the incident it remains unclear whether any data has actually been lost, but our findings so far indicate that data should be intact once all rebuild processes are completed. At this point in time, customers who use SFS as the primary origin for their content are advised to switch to their alternative storage source via a pull zone to avoid extended content unavailability, and to switch back to SFS once this issue is resolved (see the note on switching back at the end of this page). Updates will follow as we see progress on the rebuild process.

  • Update

    Physical drive replacements have now been completed. We are evaluating the scope of the incident in terms of data loss and are starting a rebuild of the array (a note on how rebuild progress can be tracked appears at the end of this page). We cannot confirm whether there is any data loss at this stage, but we still have no indications of any. Updates will continue.

  • Update

    Physical interventions have now begun. Temporary unavailability of content that is not cached on the edge of our CDN network is expected. Updates will follow.

  • Update

    Preparations for the physical replacement of the faulty drives have been completed and a request has been sent to the remote hands service in the data centre. We are now awaiting the replacement. We do not have any indications of data loss, and content continues to be available in a read-only state. Updates to follow.

  • Identified

    The team is now working to isolate the faulty drives from the RAID array as we prepare for their physical replacement. During this stage the storage system will enter a read-only state: writing new data, as well as appending to or changing existing data on the system, will not be possible (see the note on the read-only window at the end of this page). At this time we still do not have a reason to suspect any data loss. Updates to follow.

  • Investigating

    We are currently investigating a RAID array failure related to our storage product, SFS. There is no evidence of data loss at this time, but drive replacements need to take place. Due to this incident, some customers may find that they cannot log in to their storage spaces. We are working to resolve this issue and will follow up with an update as soon as one is available.
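
Notes

On the rescue-OS option mentioned above: making a recovered array readable from a rescue environment is a standard Linux software-RAID (md) workflow. The sketch below is a minimal illustration of that option only, under the assumption that the array is managed by mdadm; the device name /dev/md0 and the mount point /mnt/sfs-data are placeholders, not PUSHR's actual layout.

    # Minimal sketch: expose a recovered md array, read-only, from a rescue OS.
    # Assumptions: mdadm is available and the array metadata is intact.
    # /dev/md0 and /mnt/sfs-data are placeholder names, not PUSHR's real layout.
    import subprocess
    from pathlib import Path

    ARRAY_DEVICE = "/dev/md0"      # hypothetical md device
    MOUNT_POINT = "/mnt/sfs-data"  # hypothetical mount point

    def mount_recovered_array() -> None:
        # Assemble any arrays mdadm can find from on-disk metadata.
        subprocess.run(["mdadm", "--assemble", "--scan"], check=False)
        Path(MOUNT_POINT).mkdir(parents=True, exist_ok=True)
        # Mount read-only first so data can be inspected without risking writes.
        subprocess.run(["mount", "-o", "ro", ARRAY_DEVICE, MOUNT_POINT], check=True)

    if __name__ == "__main__":
        mount_recovered_array()

Mounting read-only keeps the verification step safe while deciding whether to repair the /boot arrays instead.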
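
On switching origins via a pull zone: the switch itself is performed in the pull zone settings, and the advice above is to switch back to SFS once service is restored. The snippet below is only a hypothetical client-side probe one might run to decide when the SFS origin is reachable again; the URL is a placeholder, not a real PUSHR endpoint.

    # Hypothetical probe: wait until the primary (SFS) origin responds again,
    # as a cue to switch the pull zone origin back. The URL is a placeholder.
    import time
    import urllib.error
    import urllib.request

    PRIMARY_ORIGIN = "https://sfs-origin.example.invalid/healthcheck"  # placeholder

    def origin_is_up(url: str, timeout: float = 5.0) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 400
        except (urllib.error.URLError, OSError):
            return False

    if __name__ == "__main__":
        while not origin_is_up(PRIMARY_ORIGIN):
            time.sleep(60)  # re-check once a minute
        print("SFS origin reachable again; switch the pull zone origin back.")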
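
On the rebuild percentages quoted above (70%, 90%): on a Linux software-RAID (md) setup, rebuild progress is exposed in /proc/mdstat, and the sketch below shows one way to read it. Whether the affected arrays are md-based is an assumption on our part; hardware RAID controllers report progress through their own tools instead.

    # Sketch: read RAID rebuild/resync progress from /proc/mdstat (Linux md only).
    import re
    from pathlib import Path

    PROGRESS_RE = re.compile(r"(recovery|resync)\s*=\s*([0-9.]+)%")

    def rebuild_progress(mdstat: str = "/proc/mdstat") -> dict:
        """Return {operation: percent} for any rebuild/resync currently running."""
        text = Path(mdstat).read_text()
        return {op: float(pct) for op, pct in PROGRESS_RE.findall(text)}

    if __name__ == "__main__":
        progress = rebuild_progress()
        if progress:
            for op, pct in progress.items():
                print(f"{op}: {pct:.1f}% complete")
        else:
            print("No rebuild or resync in progress.")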
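
On the read-only window: while the faulty drives were being isolated, writes to the storage were rejected. On a POSIX mount this typically surfaces to clients as an EROFS ("read-only file system") error. The sketch below shows a hypothetical client-side pattern for deferring writes during such a window; the path is a placeholder.

    # Sketch: tolerate a temporary read-only window on a mounted storage path.
    # Writes failing with EROFS are retried later instead of crashing the client.
    # The path below is a placeholder, not a real SFS mount point.
    import errno
    import time

    UPLOAD_PATH = "/mnt/sfs/uploads/example.bin"  # placeholder path

    def write_with_readonly_retry(path: str, data: bytes,
                                  retries: int = 5, delay: float = 30.0) -> bool:
        for _ in range(retries):
            try:
                with open(path, "wb") as fh:
                    fh.write(data)
                return True
            except OSError as exc:
                if exc.errno != errno.EROFS:
                    raise  # unrelated error: do not mask it
                time.sleep(delay)  # storage is read-only; wait and retry
        return False

    if __name__ == "__main__":
        ok = write_with_readonly_retry(UPLOAD_PATH, b"example payload")
        print("write succeeded" if ok else "still read-only; deferring upload")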