PUSHR - SFS storage disk array failure - Incident details

PUSHR's global system status is updated automatically when our monitoring systems detect an issue with any of our services. If you are aware of an ongoing issue that is not listed here, please report it to us by clicking the link above or by opening a ticket from your account's dashboard.

SFS storage disk array failure

Resolved
Major outage
Started almost 3 years ago · Lasted 2 days
Updates
  • Resolved

    We are closing this incident now.

  • Monitoring

    The team has managed to restore the entire system and services are now back up. No data loss has been reported for this incident. We continue to monitor the situation and still need to bring some auxiliary services up. We want to thank all customers for their patience and understanding during this incident.

  • Update

    We have managed to recover all arrays and are now verifying that content is actually in place and not lost. At this moment we do not see any missing data. The team is facing issues with booting into the on-disk OS, which is preventing us from bringing the service back up. Efforts will continue, and the next update will follow after 10AM CET or, if progress is made, before that time.

  • Update

    The array that holds customers' data has now completed the recovery process. Our course of action at this point is to decide between making the customers' data available immediately by mounting the array in the rescue OS, or attempting to recover the OS and /boot arrays, which would allow us to boot into the on-disk operating system. The latter, if successful, will allow us to completely restore the SFS service without further prolonging the downtime. We have decided to attempt this approach at the expense of a short additional downtime and are now starting the required procedures. The arrays that remain to be recovered are small in size and their expected recovery time should be very short. However, this incident should still be considered ongoing to its full extent, and until further notice the SFS service remains unavailable. (A short note on the rescue-OS mount option appears at the end of this page.)

  • Update

    Array recovery is now nearing 90%. We continue to await its completion before proceeding with the next steps.

  • Update

    We are continuing to monitor and wait for the RAID rebuild to complete. The process is currently at 70%.

  • Update

    The current ETA for the rebuild of the array is 16 hours. The team will let the processes run, and we will temporarily suspend further updates until 10AM CET, ~8 hours from this update.

  • Update

    The team is observing the rebuild process. Unfortunately, it is currently not possible to restart the affected systems with their default operating system images, and the recovery processes are taking place on a live (rescue) OS. During this process, which may be lengthy, the SFS storage service will remain unavailable. Due to the nature of the incident it remains unclear whether any data has actually been lost, but our findings so far indicate that data should be intact once all rebuild processes are completed. At this point in time, customers who use SFS as the primary origin for their content are advised to switch to their alternative storage source via a pull zone to avoid extended content unavailability, and to switch back to SFS once this issue is resolved (see the note on switching back at the end of this page). Updates will follow as we see progress on the rebuild process.

  • Update

    Physical drive replacements have now been completed. We are evaluating the scope of the incident in terms of data loss and are starting a rebuild of the array (a note on how rebuild progress can be tracked appears at the end of this page). We cannot confirm whether there is any data loss at this stage, but we still have no indications of any. Updates will continue.

  • Update

    Physical interventions have now begun. Temporary unavailability of content that is not cached on the edge of our CDN network is expected. Updates will follow.

  • Update

    Preparations for the physical replacement of the faulty drives have been completed and a request has been sent to the remote hands service in the data centre. We are now awaiting the replacement. We do not have any indications of data loss, and content continues to be available in a read-only state. Updates to follow.

  • Identified

    The team is now working to isolate the faulty drives from the RAID array as we prepare for their physical replacement. During this stage the storage system will enter a read-only state: writing new data, as well as appending to or changing existing data on the system, will not be possible (see the note on the read-only window at the end of this page). At this time we still do not have a reason to suspect any data loss. Updates to follow.

  • Investigating

    We are currently investigating a RAID array failure related to our storage product, SFS. There is no evidence of data loss at this time, but drive replacements need to take place. Due to this incident, some customers may find that they cannot log in to their storage spaces. We are working to resolve this issue and will follow up with an update as soon as one is available.
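
Notes

On the rescue-OS option mentioned above: making a recovered array readable from a rescue environment is a standard Linux software-RAID (md) workflow. The sketch below is a minimal illustration of that option only, under the assumption that the array is managed by mdadm; the device name /dev/md0 and the mount point /mnt/sfs-data are placeholders, not PUSHR's actual layout.

    # Minimal sketch: expose a recovered md array, read-only, from a rescue OS.
    # Assumptions: mdadm is available and the array metadata is intact.
    # /dev/md0 and /mnt/sfs-data are placeholder names, not PUSHR's real layout.
    import subprocess
    from pathlib import Path

    ARRAY_DEVICE = "/dev/md0"      # hypothetical md device
    MOUNT_POINT = "/mnt/sfs-data"  # hypothetical mount point

    def mount_recovered_array() -> None:
        # Assemble any arrays mdadm can find from on-disk metadata.
        subprocess.run(["mdadm", "--assemble", "--scan"], check=False)
        Path(MOUNT_POINT).mkdir(parents=True, exist_ok=True)
        # Mount read-only first so data can be inspected without risking writes.
        subprocess.run(["mount", "-o", "ro", ARRAY_DEVICE, MOUNT_POINT], check=True)

    if __name__ == "__main__":
        mount_recovered_array()

Mounting read-only keeps the verification step safe while deciding whether to repair the /boot arrays instead.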
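
On switching origins via a pull zone: the switch itself is performed in the pull zone settings, and the advice above is to switch back to SFS once service is restored. The snippet below is only a hypothetical client-side probe one might run to decide when the SFS origin is reachable again; the URL is a placeholder, not a real PUSHR endpoint.

    # Hypothetical probe: wait until the primary (SFS) origin responds again,
    # as a cue to switch the pull zone origin back. The URL is a placeholder.
    import time
    import urllib.error
    import urllib.request

    PRIMARY_ORIGIN = "https://sfs-origin.example.invalid/healthcheck"  # placeholder

    def origin_is_up(url: str, timeout: float = 5.0) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 400
        except (urllib.error.URLError, OSError):
            return False

    if __name__ == "__main__":
        while not origin_is_up(PRIMARY_ORIGIN):
            time.sleep(60)  # re-check once a minute
        print("SFS origin reachable again; switch the pull zone origin back.")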
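
On the rebuild percentages quoted above (70%, 90%): on a Linux software-RAID (md) setup, rebuild progress is exposed in /proc/mdstat, and the sketch below shows one way to read it. Whether the affected arrays are md-based is an assumption on our part; hardware RAID controllers report progress through their own tools instead.

    # Sketch: read RAID rebuild/resync progress from /proc/mdstat (Linux md only).
    import re
    from pathlib import Path

    PROGRESS_RE = re.compile(r"(recovery|resync)\s*=\s*([0-9.]+)%")

    def rebuild_progress(mdstat: str = "/proc/mdstat") -> dict:
        """Return {operation: percent} for any rebuild/resync currently running."""
        text = Path(mdstat).read_text()
        return {op: float(pct) for op, pct in PROGRESS_RE.findall(text)}

    if __name__ == "__main__":
        progress = rebuild_progress()
        if progress:
            for op, pct in progress.items():
                print(f"{op}: {pct:.1f}% complete")
        else:
            print("No rebuild or resync in progress.")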
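
On the read-only window: while the faulty drives were being isolated, writes to the storage were rejected. On a POSIX mount this typically surfaces to clients as an EROFS ("read-only file system") error. The sketch below shows a hypothetical client-side pattern for deferring writes during such a window; the path is a placeholder.

    # Sketch: tolerate a temporary read-only window on a mounted storage path.
    # Writes failing with EROFS are retried later instead of crashing the client.
    # The path below is a placeholder, not a real SFS mount point.
    import errno
    import time

    UPLOAD_PATH = "/mnt/sfs/uploads/example.bin"  # placeholder path

    def write_with_readonly_retry(path: str, data: bytes,
                                  retries: int = 5, delay: float = 30.0) -> bool:
        for _ in range(retries):
            try:
                with open(path, "wb") as fh:
                    fh.write(data)
                return True
            except OSError as exc:
                if exc.errno != errno.EROFS:
                    raise  # unrelated error: do not mask it
                time.sleep(delay)  # storage is read-only; wait and retry
        return False

    if __name__ == "__main__":
        ok = write_with_readonly_retry(UPLOAD_PATH, b"example payload")
        print("write succeeded" if ok else "still read-only; deferring upload")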