Our data-centre experienced a connectivity outage from 24/09/09 23:00 to 25/09/09 03:30 due to emergency maintenance on its routers.
Unfortunately, due to the critical nature of the work, we received virtually no notice of the maintenance window. The maintenance was expected to take no longer than 45 minutes, but issues with the required router upgrade extended the outage to a total of 4.5 hours.
We are currently waiting for a more detailed explanation from our data-centre and will update this page as soon as we receive it. Please accept our sincere apologies for the outage.
UPDATE: The following information has been received from our data-centre explaining the outage in more detail:-
This is a further update to our earlier message regarding the problems with the scheduled maintenance.
As mentioned in our previous message, there was a complication with the new firmware which required additional troubleshooting. During this time there was no connectivity to Spectrum House North Side (RSH Nth), one of the zones in one of our datacentres in Maidenhead. The network, including the London fibre ring and all external peering points, remained in full working order.
Full service to RSH Nth should have been restored by approximately 03:30. Small outages of less than 5 minutes may still be experienced by individual subnets as final configuration work is completed. These will be completed by 07:00. If anyone is still experiencing any problems please contact us immediately and we’ll do our best to resolve them for you.
Date: 24/09/09
Time: 23:00
Duration: <4.5hrs
The main cause of the extended outage was a problem in getting the VSS cluster to accept the new firmware.
Because the new firmware was required to avoid the memory leak issue, the situation had to be resolved. A decision was made to correct a potentially debilitating network problem rather than revert to a flawed firmware version, and consequently this evening’s outage was extended. We have tried to provide a brief but accurate summary of the events below.
1. New firmware image is loaded and prepared for use on reboot.
2. First router is rebooted.
3. Router finishes booting into new firmware, but the configuration has been wiped and it is no longer part of the cluster.
4. Router is reinitialised with cluster settings and rebooted again, to apply these changes.
5. The router hangs during the boot process, shortly after decompressing the image.
6. On consultation with Cisco, it is agreed to boot back into the old firmware to try to restore a solid boot. This works.
7. The old image is removed from the boot memory and the new image is again prepared for use on boot.
8. This time both the image and the minor temporary configuration hold.
9. The backup of the configuration is restored to the router and a reboot applied to test that it holds, which it does.
10. The boot process includes bringing up each line card one at a time. During this boot, two of the line cards fail to initialise, citing an error.
11. Following consultation with Cisco, one of the two line cards is brought back online. This restores connectivity to the remaining racks in RSH.
12. Due to the reconfiguration of the cluster on the first router, these changes have to be replicated on the rest of the cluster. This is a time-consuming process.
13. The previously scheduled maintenance that had been prevented by the memory leak needs to be completed. This process is ongoing and should be completed by 07:00.
UPDATE 14:20 - We have received the following information from our data-centre explaining that further emergency maintenance needs to be carried out:-
In order to try to correct the problems being experienced by the RSH North router cluster, we are making fundamental changes to its configuration. These changes will result in a loss of connectivity to all servers within the RSH North zone, which could last for up to 90 minutes.
We recognise that this is a very significant action but we also need to address the seriousness of the situation. It requires prompt and effective action which is why we are taking this step. As soon as we have updates on the effectiveness of the solution we will provide them to you.
UPDATE 15:35 - We have now received further information from our data-centre:-
In order to try to correct the issues affecting the RSH North router cluster, we have made some fundamental changes to the configuration. Whilst making these changes we were forced to reboot the cluster, and this caused a loss of connectivity to all servers within the RSH North zone.
This work has now been completed and we are focusing on restoring all connectivity. Over 80% of the servers in the RSH North zone are already back online and we are working on the remainder. Please bear with us whilst we restore your connectivity.
The process has removed the VSS cluster configuration from the RSH North zone routers. These routers are now performing their role in a similar configuration to the one which served RHC stably for over two years. However, this means routing is operating in a non-redundant manner for the time being. Until we have had more time to assess the impact of this work we cannot issue an all clear on the problems experienced in the last two days.
UPDATE 16:00 - All servers are now back online, and we are monitoring the situation closely in conjunction with the data-centre. Once again, please accept our apologies for the inconvenience caused by this issue.
UPDATE 21:05 - More information on this continuing problem has been provided by the data-centre:-
This is a quick update to keep you informed of the situation. The CPU usage on the RSH North cluster is still at a high level, which will result in degraded network performance. We are in constant contact with Cisco engineers and are continuing to troubleshoot the issue. This work will continue to cause interruptions to connectivity.
As soon as there is any further information we will provide this to you.
UPDATE 26/09/09 12:45 - We have now received the following statement from our data-centre:-
Please be advised that we are now able to officially provide an all-clear notification to all clients regarding the network disturbances, outages, interruptions to service and emergency maintenance experienced by clients in RSH-North since midday, Thursday 24th September.
To clarify, normal service has been in place for the large majority of clients since before 3am; however, we are now able to issue a full all clear.
Please accept our apologies for the problems experienced over the last couple of days.