Author Topic: Network Outage at Noon today (Status update)  (Read 364 times)

Offline Doug Hazard

  • Richweb Staff (Admin)
  • Newbie
  • *
  • Posts: 28
  • CFB / MBB Junkie
    • http://twocentsradio.net
Network Outage at Noon today (Status update)
« on: March 22, 2017, 06:06:30 pm »
Around noon a Richweb engineer made an incorrect adjustment to the VLAN allow list on the Hyper-V cluster switch. This broke internet and LAN access for about 20 minutes, until the mistake was discovered and manually corrected.

Storage was not affected. This was an operator error and not any issue with the hardware or systems.
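
To illustrate the kind of pre-change check that can catch an allow-list mistake like this one, here is a minimal sketch. It is not Richweb's actual tooling; the function name and the VLAN IDs are hypothetical, and it assumes you can list the current and proposed allowed VLANs as plain integers.

```python
def removed_required_vlans(current_allowed, proposed_allowed, required):
    """Return the required VLAN IDs that the proposed allow list would drop."""
    current = set(current_allowed)
    proposed = set(proposed_allowed)
    needed = set(required)
    # At risk: VLANs that are required, allowed today, and missing from the proposal.
    return sorted((needed & current) - proposed)


# Hypothetical values, for illustration only.
current = [10, 20, 30, 40]
proposed = [10, 20, 40]
required = [10, 30]

dropped = removed_required_vlans(current, proposed, required)
if dropped:
    print("Refusing change; required VLANs would be dropped:", dropped)
```

A check like this only compares lists. It does not talk to the switch, so it would sit in front of whatever process actually applies the change.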

I will explain the reason for the recent changes (their frequency and timing) below. We are investing a lot of energy and money in our cloud to make sure that it offers excellent performance for our customers.

Some of the MS Windows memory settings as originally configured last fall were found to be sub-optimal, and we are fixing them one VM at a time, after hours (each fix requires a VM reboot).

In addition, Richweb has been wrestling with storage issues over the past 2 months, where VMs disconnect from their storage. We are pretty sure the issue is a problem with Infrascale backups locking the VMs during a backup snapshot run and then not releasing them properly.

This can cause a whole host to hang in some cases and require a reboot (we slide the VMs to another host in the pool first, of course). But it's a problem when the VM manager does not agree with where the VMs are actually running. In some cases the VMs turn themselves off when there is a conflict, and then we are not able to restart them correctly from the manager. Needless to say, this is not acceptable and we are working on fixing it.
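
The disagreement between the VM manager and the hosts can be pictured as a comparison of two inventories. The sketch below is illustrative only: the function, the host names, and the VM names are hypothetical, and it assumes both views have already been exported to plain Python dictionaries rather than queried from any real management API.

```python
def placement_conflicts(manager_view, host_view):
    """Find VMs whose actual location differs from what the manager believes.

    manager_view: dict mapping VM name -> host the manager thinks it runs on.
    host_view: dict mapping host name -> set of VM names actually running there.
    Returns a dict mapping VM name -> (expected host, sorted list of actual hosts).
    """
    # Invert the per-host view into VM -> set of hosts it is really running on.
    actual = {}
    for host, vms in host_view.items():
        for vm in vms:
            actual.setdefault(vm, set()).add(host)

    conflicts = {}
    for vm in set(manager_view) | set(actual):
        expected = manager_view.get(vm)
        running_on = actual.get(vm, set())
        expected_set = {expected} if expected is not None else set()
        if running_on != expected_set:
            conflicts[vm] = (expected, sorted(running_on))
    return conflicts


# Hypothetical inventory, for illustration only.
manager_view = {"web1": "hostA", "mail1": "hostB"}
host_view = {"hostA": {"web1"}, "hostB": set(), "hostC": {"mail1"}}

conflicts = placement_conflicts(manager_view, host_view)
print(conflicts)
```

A VM that is not running anywhere shows up with an empty list of actual hosts, which matches the case above where VMs turn themselves off during a conflict.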

Actions taken/in progress:
We have disabled whole-VM snapshots+backups to test this theory. We are still running the backups of the data inside the VMs, and that is working fine. It's just the whole-VM snapshot-level backup that seems to be causing a conflict.

Sunday morning at 6 AM the storage array that handles some of the web and email hosting (san9) went down for about 10 minutes due to conflicts and contention. The same issue happened again on Monday at 8:20 AM on a different set of VMs and storage arrays.

san9 has been replaced (it was scheduled for upgrade anyway), but some of the details captured from the Monday event make it clear that the likely culprit is the backup software, not anything in our storage or Hyper-V cluster.

Since disabling the backups we have seen no storage conflicts (fights over who can read and write to a disk), whereas before we would see them every night.

Richweb has 4 storage arrays for our cloud hosting:

san8 (new as of summer 2016)
san5 (brand new Nfina array with enhanced caching - just purchased)
san11 (new as of fall 2016, rebuilt caching config in Mar 2017)
san10 (original array, unchanged)

We have been making a large set of changes to the storage configuration over the last 2 months: 4 older arrays have been retired and 2 new arrays brought into service. In addition, the performance of san11 as originally configured was not what it should have been, so the hardware and software configs were rebuilt and it's now working properly.

The pace of the changes should settle down now that upgrades are mostly completed.
« Last Edit: March 23, 2017, 12:02:34 am by Doug Hazard »
Doug "Bear" Hazard
@BearlyDoug  |  @GridironHistory  |  @Hogville | http://gridironhistory.com

Co-Host of Two Cents Radio, powered by the SportsManCave.com Radio Network, Wednesday Nights from 8 PM to 10 PM Central, covering SEC and Sun Belt Sports.

Listen at SMCRadio.com and follow @TwoCentsRadio on Twitter!

 
