Planned Mail Server Upgrade 20/12/2009 22:00

We will be running an upgrade to our mail servers on Sunday 20th December 2009 at 22:00. This will bring our SmarterMail servers up to the latest version, 6.6. This is the first stage of an upgrade which will provide many more email features for customers.

We expect the upgrades to take only about 2 hours in total. As each server is upgraded, POP, SMTP and Webmail will be unavailable on that server. During the upgrades our backup mail servers will store all email received, so no email will be lost. Once each mail server comes back online, the backup mail servers will deliver all stored email to it, ready for collection.
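For readers curious how the backup servers avoid losing mail, the behaviour described above can be modelled roughly as a store-and-forward queue. This is only an illustrative sketch: the class name, method names and the "mail1" server label are hypothetical and do not describe how our mail software is actually implemented.

```python
from collections import defaultdict, deque


class BackupMailQueue:
    """Toy model of a backup mail server holding mail for offline primaries."""

    def __init__(self):
        # Hypothetical structure: one first-in, first-out queue per primary server.
        self.held = defaultdict(deque)

    def accept(self, server, message, server_online, inbox):
        """Deliver directly if the primary is online, otherwise hold the message."""
        if server_online:
            inbox.append(message)
        else:
            self.held[server].append(message)

    def flush(self, server, inbox):
        """Deliver held mail in arrival order once the primary is back online."""
        delivered = 0
        while self.held[server]:
            inbox.append(self.held[server].popleft())
            delivered += 1
        return delivered


backup = BackupMailQueue()
inbox = []

# Two messages arrive while the (hypothetical) server "mail1" is being upgraded.
backup.accept("mail1", "msg A", server_online=False, inbox=inbox)
backup.accept("mail1", "msg B", server_online=False, inbox=inbox)
held_during_upgrade = len(backup.held["mail1"])

# The server comes back online and the stored mail is delivered in order.
flushed = backup.flush("mail1", inbox)
```

In this model nothing reaches the inbox while the server is offline, and every held message is delivered in its original order once it returns.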

We will keep this post updated as the upgrades are carried out to inform customers of progress.

Update 20/12/2009 22:05 - The planned upgrades are now commencing.  Please note that as each mail server is upgraded sending and receiving email, plus webmail, will be unavailable on that server.  We will post further information here once available.

Update 21/12/2009 00:08 -  The planned upgrades are now complete and all mail servers are back online.  An email will be sent out to all customers later today with news of some new email features.

Webserver Qbic11 and mySQL5

Webserver Qbic11 and mySQL5 are currently both offline. We are looking into this as a matter of urgency and will post more information here once it is available.

Update 1:57pm - Qbic11 and mySQL5 are both back online. The issue appeared to be with the RAM, which has been replaced, and a full check has been performed. Please accept our apologies for this outage.

Scheduled Switch Maintenance on 04/12/2009 at 02:00

Our data-centre will be performing some maintenance soon. The details are as follows.

Maintenance Type: Switch Maintenance
Expected effect on your service: No Connectivity for primary mySQL4 database server (mysql.controldns.co.uk)
Expected duration: <10 Minutes
This will occur between: 02:00 and 03:00 on 04/12/2009
(All times are UK) 

This maintenance is to reboot the edge switch in one of our racks, which is needed to resolve a technical problem. It will only affect our primary mySQL4 database server (mysql.controldns.co.uk).

We apologise for any inconvenience this may cause. Please do not hesitate to contact us if you have any questions regarding this maintenance window.

Update 04/12/2009 11:30 – Our data-centre has confirmed that this maintenance was completed successfully. The maintenance lasted no longer than 10 minutes and no issues occurred during the process.

Scheduled Network Maintenance on 20/10/2009 at 23:00

Our data-centre will be performing some maintenance soon. The details are as follows.

Maintenance Type: Network Maintenance
Expected effect on your service: No network access
Expected duration: 30 seconds
This will occur between: 23:00 on Tuesday 20th Oct 2009 and 05:00 on Wednesday 21st Oct 2009 (All times are UK)

The nature of the maintenance window is to reconfigure the core routers as a VSS-1440 cluster. This will provide an increase in redundancy for your service and will provide additional capacity in terms of port aggregation and network resilience. The migration has been carefully planned and our data-centre are using the multiple uplinks that each rack has to stage the process and minimise the impact to service.

All of the racks in the data-centre have two uplinks. During the 30 second window each rack will, in turn, have one uplink migrated to the chassis that is configured as a cluster. The link connected to the existing configuration will then be shut down. Once all the racks are connected to the cluster chassis, the secondary chassis will be configured as part of the cluster. At this point the secondary ports will be brought back online and a fully redundant configuration will once again be achieved.
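The staged process described above can be pictured with a small simulation. This is an illustrative sketch only: the rack names, the state labels ("legacy", "cluster", "down") and the function itself are hypothetical, and it does not represent the data-centre's actual tooling or router configuration.

```python
def migrate_racks(rack_names):
    """Return a log of steps and the final uplink state for each rack."""
    # Each rack starts with two uplinks on the existing (legacy) configuration.
    racks = {name: {"primary": "legacy", "secondary": "legacy"} for name in rack_names}
    log = []

    # Stage 1: one rack at a time, move one uplink to the cluster chassis,
    # then shut down the link still attached to the existing configuration.
    for name in rack_names:
        racks[name]["primary"] = "cluster"
        racks[name]["secondary"] = "down"
        log.append(f"{name}: primary -> cluster, secondary shut down")

    # Stage 2: only once every rack is on the cluster chassis does the
    # secondary chassis get configured as part of the cluster.
    if not all(r["primary"] == "cluster" for r in racks.values()):
        raise RuntimeError("not every rack reached the cluster chassis")
    log.append("secondary chassis joined cluster")

    # Stage 3: bring the secondary ports back online, restoring redundancy.
    for name in rack_names:
        racks[name]["secondary"] = "cluster"
        log.append(f"{name}: secondary -> cluster")

    return log, racks


log, final_state = migrate_racks(["rack1", "rack2", "rack3"])
fully_redundant = all(
    r["primary"] == "cluster" and r["secondary"] == "cluster"
    for r in final_state.values()
)
```

The point of the ordering is that every rack keeps at least one working uplink at each step, and redundancy is only restored once all racks have been moved.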

The issues previously experienced with the Cisco systems have been completely ironed out and our data-centre are confident that this work will be the last major network maintenance for the foreseeable future. They have, however, prepared the process so that a complete rollback is quick and easy to achieve should it be required.

We apologise for any inconvenience this may cause and we would like to reiterate at this time that the patience shown by all clients continues to be much-appreciated.

Update 21/10/2009 9:40 – Our data-centre has confirmed that this maintenance was completed successfully. A stable operating position has been achieved, and the network and the VSS cluster are operating well within expected parameters. They will now be monitoring the cluster closely.

Scheduled Network Maintenance on 14/10/2009 at 16:00

Our data-centre will be performing maintenance which will put our service in an at-risk period. This will last about 60 minutes, during which our service will be at a higher risk of interruption if a failure occurs.

Date: 14 October 2009

Window: 16:00 for 2 Hours

Duration: < 60 minutes

This maintenance is to improve the network and is necessary under our data-centre’s programme for increasing the reliability and redundancy of the network.

Update 14/10/2009 17:20 – The maintenance has been completed without any impact on the connectivity.

Webserver Qbic11 and mySQL5

Webserver Qbic11 and mySQL5 are currently both offline. We are in communication with our data-centre to determine if this is a problem with the hardware or whether it is connected to the data-centre connectivity problems a couple of days ago.

Please bear with us while we try to find out the cause and we will post further information here once it is available.

UPDATE 11:44 - Both Qbic11 and mySQL5 are back online. Further information on the issue will follow shortly.

UPDATE 12:05 - It appears that this outage was caused by the server hanging, which required investigation. Initial results indicate a possible issue with the hard disks. We will be running further diagnostics on these to see if any problems can be found. We are preparing another server to take over from the current Qbic11 hardware in case a complete hardware change is required. If that happens, or if any other maintenance that will take services offline is required for this server, we will post the scheduled maintenance on this site. Thank you for your patience.

Data-centre Connectivity Outage

Our data-centre experienced an outage of connectivity from 24/09/09 23:00 to 25/09/09 03:30 due to emergency maintenance on the routers in the data-centre.

Unfortunately, due to the critical nature of the maintenance, we received virtually no notice of the maintenance window. It was anticipated that the maintenance would take no longer than 45 minutes, but issues with the required upgrade to the routers meant that the outage extended to a total of 4.5 hours.
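For anyone wishing to check the figures, the outage length follows directly from the start and end times quoted above. The short calculation below is just a worked example; the variable names are our own.

```python
from datetime import datetime

# Start and end of the connectivity outage, as stated above (UK time).
start = datetime(2009, 9, 24, 23, 0)
end = datetime(2009, 9, 25, 3, 30)

outage = end - start
outage_hours = outage.total_seconds() / 3600

# The maintenance was expected to take no longer than 45 minutes.
expected_minutes = 45
overrun_minutes = outage.total_seconds() / 60 - expected_minutes
```

Here `outage_hours` comes to 4.5, which is 225 minutes beyond the anticipated 45.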

We are currently waiting for a more detailed explanation from our data-centre at which time we will update this page. Please accept our sincere apologies for the outage.

UPDATE: The following information has been received from our data-centre explaining the outage in more detail:-

This is a further update to our earlier message regarding the problems with the scheduled maintenance.

As mentioned in our previous message, there was a complication with the new firmware which required additional troubleshooting. During this time there was no connectivity to Spectrum House North Side (RSH Nth), one of the zones in one of our datacentres in Maidenhead. The network, including the London fibre ring and all external peering points, remained in full working order.

Full service to RSH Nth should have been restored by approximately 03:30. Small outages of less than 5 minutes may still be experienced by individual subnets as final configuration work is completed. These will be completed by 07:00. If anyone is still experiencing any problems please contact us immediately and we’ll do our best to resolve them for you.

Date: 24/09/09

Time: 23:00

Duration: <4.5hrs

The main cause of the extended outage was a problem in getting the VSS cluster to accept the new firmware. We have tried to provide a brief but accurate summary of the events below.

Given that the new firmware was required to avoid the memory leak issue, the situation had to be resolved. A decision was made to focus on correcting a potentially debilitating problem to the network and subsequently this evening’s outage was extended, rather than revert to a flawed firmware version.

1. New firmware image is loaded and prepared for use on reboot.
2. First router is rebooted.
3. Router finishes booting into new firmware, but the configuration has been wiped and it is no longer part of the cluster.
4. Router is reinitialised with cluster settings and rebooted again, to apply these changes.
5. The router hangs during the boot process, shortly after decompressing the image.
6. On consultation with Cisco it is agreed to boot back into the old firmware to try and restore a solid boot. This works.
7. The old image is removed from the boot memory and the new image is again prepared for use on boot.
8. This time both the image and the minor temporary configuration hold.
9. The backup of the configuration is restored to the router and a reboot applied to test that it holds, which it does.
10. The boot process includes bringing up each line card one at a time. During this boot two of the line cards are not initialised, citing an error.
11. Following consultation with Cisco, one of the two line cards is brought back online. This restores connectivity to the remaining racks in RSH.
12. Due to the reconfiguration of the cluster on the first router, these changes have to be replicated on the rest of the cluster. This is a time-consuming process.
13. The previously scheduled maintenance that had been prevented by the memory leak needs to be completed. This process is ongoing and should be completed by 07:00.

UPDATE 14:20 - We have received the following information from our data-centre explaining that further emergency maintenance needs to be carried out:-

In order to try and correct the problems being experienced by the RSH North router cluster, we are performing changes to the configuration. These changes are fundamental configuration changes and will result in a loss of connectivity to all servers within the RSH North zone. This could last for anything up to 90 minutes.

We recognise that this is a very significant action but we also need to address the seriousness of the situation. It requires prompt and effective action which is why we are taking this step. As soon as we have updates on the effectiveness of the solution we will provide them to you.

UPDATE 15:35 - We have now received further information from our data-centre:-

In order to try and correct the issues affecting the RSH North router cluster we have made some fundamental changes to the configuration. Whilst making these changes we were forced to reboot the cluster and this caused a loss of connectivity to all servers within the RSH North zone.

This work has now been completed and we are focusing on restoring all connectivity. Over 80% of the servers in the RSH North zone are already back online and we are working on the remainder. Please bear with us whilst we restore your connectivity.

The process has removed the VSS cluster configuration from the RSH North zone routers. These routers are now performing their role in a similar configuration to the one which served RHC stably for over two years. However, this means routing is operating in a non-redundant manner for the time being. Until we have had more time to assess the impact of this work we cannot issue an all clear on the problems experienced in the last two days.

UPDATE 16:00 - All servers are now back online and we are monitoring the situation closely in conjunction with the data-centre. Once again, please accept our apologies for the inconvenience caused by this issue.

UPDATE 21:05 - More information on this continuing problem has been provided by the data-centre:-

This is a quick update to keep you informed of the situation. The CPU usage on the RSH North cluster is still at a high level which will result in degraded network performance. We are in permanent contact with Cisco engineers and are continuing to troubleshoot the issue. This work will continue to cause interruptions to connectivity.

As soon as there is any further information we will provide this to you.

UPDATE 26/09/09 12:45 – We have now received the following statement from our data-centre:-

Please be advised that we are now able to officially provide an all-clear notification to all clients regarding the network disturbances, outages, interruptions to service and emergency maintenance experienced by clients in RSH-North since Midday, Thursday 24th September.

To clarify, normal service has been resumed for the large majority of clients since before 3am; however, we are now able to issue a full all clear.

Please accept our apologies for the problems experienced over the last couple of days.
