CLOUDSTACK-9114: Reduce VR downtime during network restart by yadvr · Pull Request #2508 · apache/cloudstack

yadvr · 2018-03-23T13:13:35Z

Description

Every time there is a major CloudStack version that requires a new systemvmtemplate, admins need to restart VRs which can take few (sometimes several) minutes. The aim is to make it easier to perform a rolling restart (or blue-green deployment) of the network with acceptable downtime. In this strategy, a new VR is provisioned before the old VR is destroyed/cleaned for non-redundant isolated VRs. For redundant networks, a rolling replacement of routers is performed.

The network downtime is usually the short-time (less than 10s) after the new VR starts (nics are up) and old VR is destroyed. The downtime may be affected by several factors as the new VR has a different mac address, stale ARP cache may take upto 30seconds (default gc_timeout on Linux) to renew in guests making egress traffic downtime slightly higher than ingress traffic (ingress, i.e. public network requesting access to VR's public ip). As an illustration, ingress traffic such as a http request or ssh request to the VR's public IP that port-forwards to guest VM may face lower downtime than egress traffic (say a ping or curl from guest VM to public Internet). Among several environment factors, the following interval/timeouts affect how the old VR's ip-mac address is cached:

/proc/sys/net/ipv4/neigh/default/gc_interval
/proc/sys/net/ipv4/neigh/default/gc_stale_time
/proc/sys/net/ipv4/route/gc_interval
/proc/sys/net/ipv4/route/gc_timeout

To make things faster, a gratituous ARP notification is sent by the new VR (again) after the old VR is successfully destroyed.

Project Goals:

Reduce systemvmtemplate size, see CLOUDSTACK-10341: Reduce systemvmtemplate size, install nftables #2506. Smaller the image size, faster it will be to copy template to primary storage to provision new VR.
Rolling reboot/deployment of VRs for isolated network with non-rVRs
Rolling reboot/deployment of VRs for isolated network with rVRs
Rolling reboot/deployment of VRs for VPCs with non-rVRs and rVRs

Future:

Speed up iptables rules application
Move away from iptables, move to nftables
Move away from Python2 codebase to Python3 or something else like Go? TBD

Types of changes

Breaking change (fix or feature that would cause existing functionality to change)
New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Enhancement (improves an existing feature and functionality)
Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

Non-Redundant Isolated Network VRs

Rolling restart in action, a new VR is deployed lifing the VR as a redundant VR:

The old VR is stopped and destroyed:

The new VR's redundancy is removed and is reprogrammed:

Meanwhile, the following screenshot shows ping drops from a guest VM, in a non-redundant isolated network: (usually 2-10 ping drops observed)

Redundant Isolated Network VR

Rolling restart starts by stopping old backup VR:

New backup VR deployed:

Old master stopped, new backup becomes new master:

Old master destroyed and another backup VR deployed:

Redundant VRs after rollling reboot:

In the meanwhile, usually 0-4 ping drops observed in guest VM:

How Has This Been Tested?

Tested with following, benchmarks show rounded off avg/max downtime seconds for restarting non-redundant isolated VRs when cleanup=true:

Env	This PR	4.11.0.0	4.9.3.0
KVM CentOS7	8s	50s	70s
KVM CentOS6	10s	60s	80s
XenServer 71	8s	70s	NA
XenServer 65sp1	20s	70s	100s
VMware 65	10s	120s	NA
VMware 55u3	20s	160s	250s

** The downtime seconds may be higher in real-world environment under heavy load. Redundant routers show a downtime of 0-4 seconds.
*** All used a CentOS7 based management server, tested in environments deployed by Trillian

Ingress traffic was tested using ssh and http services hosted on the guest VM. Example, sshping like utility I wrote for calculating ingress traffic downtime: (number of fail count = downtime seconds)

sshping() {                                                                                                                                              
  ip=$1                                                                            
  total_count=0                                                                    
  fail_count=0                                                                     
  while true; do                                                                   
    start_time=$(date +%s)                                                         
    timeout 1.5 sshpass -p 'password' ssh -o StrictHostKeyChecking=no root@$1 "uptime" > /dev/null 2>&1
    if [[ $? -gt 0 ]]; then                                                        
        fail_count=$((fail_count+1))                                               
    fi                                                                             
    end_time=$(date +%s)                                                           
    diff=$(($end_time - $start_time))                                              
    if [[ $diff -eq 0 ]]; then                                                     
        sleep 1                                                                    
    fi                                                                             
                                                                                   
    total_count=$((total_count+1))                                                 
    echo "Fail/Total = $fail_count/$total_count"                                   
  done                                                                             
}

Checklist:

I have read the CONTRIBUTING document.
My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have added tests to cover my changes.
All new and existing tests passed.

@blueorangutan package

ustcweizhou · 2018-03-23T13:23:52Z

goooood @rhtyd

resmo · 2018-03-24T12:53:15Z

looks promising! Can not wait to test it, 👍

DaanHoogland · 2018-03-26T07:46:11Z

looks good @rhtyd

DaanHoogland

good so far. Are we going to allow for more then 2 routers in redundancy sometime soon?

DaanHoogland · 2018-03-27T11:02:06Z


        int routerCounts = 1;
-        if (offering.getRedundantRouter()) {
+        if (offering.getRedundantRouter() || network.isRollingRestart()) {


neat solution

Unlikely we'll have more than 2 routers for redundancy. However, the code has an internal limit of 5 (after which it logs warning/errors that a network has more than 5 routers, which can happen in some edge cases).

rafaelweingartner · 2018-05-08T01:41:23Z

Never mind. It was a problem with my hosts. Everything was working just fine with this PR and the new system vms 👍

yadvr · 2018-05-08T04:16:09Z

@rafaelweingartner great, looking forward to your review and test results.

blueorangutan · 2018-05-08T11:13:04Z

Trillian test result (tid-2636)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 49833 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2508-t2636-xenserver-71.zip
Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
Smoke tests completed. 66 look OK, 1 have error(s)
Only failed tests results shown below:

Test	Result	Time (s)	Test File
test_02_vpc_privategw_static_routes	`Error`	588.83	test_privategw_acl.py
test_03_vpc_privategw_restart_vpc_cleanup	`Failure`	639.51	test_privategw_acl.py

yadvr · 2018-05-08T11:36:22Z

@blueorangutan test

blueorangutan · 2018-05-08T11:36:50Z

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

yadvr · 2018-05-08T11:42:05Z

The last test results (xenserver) shows no new errors caused by this PR, and other tests are incoming soon. Given we've two LGTMs on it from @DaanHoogland and @GabrielBrascher , we may merge this but this PR as soon as test pass BUT since this adds a new operational enhancement I would like to request - @rafaelweingartner @resmo @nvazquez @wido @ustcweizhou @mike-tutkowski @fmaximus @jayapalu @marcaurele @PaulAngus @sureshanaparti @terbolous @NuxRo for their feedback, and I'm happy to move this to next 4.11.2.0 milestone if we've consensus aorund this. Please respond by the end of this week, next week will focus only on blockers and start work towards pre-RC1 testing. Thanks.

(Please follow links for packages and systemvmtemplate if you're not looking for building it yourself:
https://lab.yadav.cloud/systemvmtemplates/4.11/
https://lab.yadav.cloud/testing/vr-downtime/ )

rafaelweingartner · 2018-05-08T11:32:50Z

    }

    public Boolean getMakeredundant() {
        if (makeredundant != null) {


Is there is a default value, why not use a primitive value with a default of false? Then, you do not need these logics in a POJO get method.

@rafaelweingartner when the option is not passed, the value should be false which is what the fix it.

I see what you want, to set a default 'false' value and simply do a return.

That is it :)
It is not great change, but it achieves the same result with less logic (conditionals) in the code.

rafaelweingartner · 2018-05-08T11:33:04Z

            return cleanup;
        }
-        return true;
+        return false;


The same is valid here

@rafaelweingartner when the option is not passed, the value should be false which is what the fix it.

This means that the default is false, right?
Can't you declare the variable as private boolean cleanup = false?

Doing the fix now, thanks.

rafaelweingartner · 2018-05-08T11:40:10Z

+        VirtualRouter backupRouter = null;
+        for (final VirtualRouter router : remainingRouters) {
+            if (router.getRedundantState() == VirtualRouter.RedundantState.BACKUP) {
+                backupRouter = router;


Are these routers ordered some how? I mean, reading your next discussion with @DaanHoogland, it seems that the router we do not need will always be the last one (the third or fourth one) when this code is executed. Therefore, there must be some way to order this list before you reach here, right?

rafaelweingartner · 2018-05-08T11:42:14Z

+            backupRouter = routers.get(routers.size() - 1);
+        }
+        if (backupRouter != null) {
+            _routerService.destroyRouter(backupRouter.getId(), context.getAccount(), context.getCaller().getId());


Javadocs are useful to keep the history behind these things. Keep in mind that the method is only public (java delimiter access); this means accessible to programmers to re-use its code. It is not something that end-users will have access to.

rafaelweingartner · 2018-05-08T11:43:49Z

+    }
+
+    @Override
+    public boolean validateNewRouters(final List<? extends VirtualRouter> routers) {


Instead of validate, what about areNewRoutersrunning?
It is not a validation method per se. It does not throw exceptions. This method is only checking if all routers of the giving list are in running state.

This method is used to validate the restart operation, if false is returned you should see error in API/UI. Will change the name.

rafaelweingartner · 2018-05-08T11:47:36Z

@@ -227,7 +227,7 @@ public boolean implement(final Network network, final NetworkOffering offering,
        final List<DomainRouterVO> routers = routerDeploymentDefinition.deployVirtualRouter();

        int routerCounts = 1;


Instead of routerCounts, what about expectedNumberRouter or something similar. This is the number of VRs you expect to have in a network, right?

@rafaelweingartner good idea, maybe for a next/different refactoring PR.

rafaelweingartner · 2018-05-08T11:49:22Z

+            vpc.setRollingRestart(true);
+        }
+        startVpc(vpc, dest, context);
+        if (oldRouters.size() > 0) {


Can you explain this logic here? I mean, it is the same as the one at line 2456, but when it solves to true there, we set vpc.setRollingRestart(true); and then we set vpc.setRollingRestart(false); here. Is this something that was done only to satisfy some conditional at startVpc?

@rafaelweingartner if there are any existing old routers, we do rolling restart which is to deploy a new VR and then destroy old VRs. The setRollingRestart is a way to tell lower facade to treat deployment like for redundant routers, i.e. expect upto 2 VRs at a given point which is seen in changes in VirtualRouterElement.java. The startVpc will call the facade to deploy suitable number of routers. If there is only one old VPC VR, and if rolling restart is true this will cause the lower layers to deploy a new VPC VR. After that, we set the rolling restart to false which works great both in case of rVR (expected VRs is 2) and non-rVRs (expected VRs becomes 1), and finally we destroy old VR. If the VPC has rVRs, a new backup router is created when startVpc is called otherwise this will simply reconfigure existing VRs.

Thanks! This is a good piece of documentation.

rafaelweingartner · 2018-05-08T11:58:15Z

@rhtyd I have just reviewed the code.

I tested it yesterday and everything was fine. I had only one problem, but that was an environment one. One of my hosts was not working properly.

yadvr · 2018-05-08T12:35:33Z

Thanks @rafaelweingartner and @DaanHoogland I've tried to address all the review comments, let me know if I've missed anything. Wrt to testing, existing lifecycle tests in VPC, network and routers related smoketests validate the restart operation.

yadvr · 2018-05-08T12:36:24Z

@blueorangutan package

blueorangutan · 2018-05-08T12:36:49Z

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

blueorangutan · 2018-05-08T13:04:43Z

Trillian test result (tid-2635)
Environment: vmware-65 (x2), Advanced Networking with Mgmt server 7
Total time taken: 56954 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2508-t2635-vmware-65.zip
Intermitten failure detected: /marvin/tests/smoke/test_deploy_vm_iso.py
Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
Smoke tests completed. 66 look OK, 1 have error(s)
Only failed tests results shown below:

Test	Result	Time (s)	Test File
test_03_vpc_privategw_restart_vpc_cleanup	`Error`	607.27	test_privategw_acl.py

This introduces a rolling restart of VRs when networks are restarted with cleanup option. Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

yadvr · 2018-05-09T10:14:48Z

XenServer and Vmware tests are LGTM, 1-2 failure seen in intermittent/env related. I've rebase and fixed conflicts against latest 4.11, will kick test against KVM after packaging.
@blueorangutan package

blueorangutan · 2018-05-09T10:15:51Z

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

blueorangutan · 2018-05-09T10:42:56Z

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2036

yadvr · 2018-05-09T11:08:22Z

@blueorangutan test

blueorangutan · 2018-05-09T11:08:50Z

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

blueorangutan · 2018-05-09T20:24:55Z

Trillian test result (tid-2651)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 31613 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2508-t2651-kvm-centos7.zip
Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermitten failure detected: /marvin/tests/smoke/test_ssvm.py
Smoke tests completed. 66 look OK, 0 have error(s)
Only failed tests results shown below:

Test	Result	Time (s)	Test File

yadvr · 2018-05-10T07:48:53Z

Tests LGTM, we've enough reviews to merge this. If there are no objections, this will be merged by end of the day tomorrow.

blueorangutan · 2018-05-11T07:18:51Z

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

blueorangutan · 2018-05-11T08:23:09Z

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2043

yadvr added component:virtual-router status:work-in-progress component:systemvm-template component:upgrade labels Mar 23, 2018

yadvr added this to the 4.11.1.0 milestone Mar 23, 2018

yadvr changed the title ~~CLOUDSTACK-9114: Reduce VR downtime during network restart~~ WIP - CLOUDSTACK-9114: Reduce VR downtime during network restart Mar 23, 2018

yadvr requested review from DaanHoogland, PaulAngus, nvazquez, rafaelweingartner, resmo and wido March 23, 2018 13:16

yadvr force-pushed the vr-downtime branch from 21942bd to 17ef3dd Compare March 26, 2018 18:18

apache deleted a comment from blueorangutan Mar 26, 2018

DaanHoogland reviewed Mar 27, 2018

View reviewed changes

yadvr changed the title ~~WIP - CLOUDSTACK-9114: Reduce VR downtime during network restart~~ CLOUDSTACK-9114: Reduce VR downtime during network restart Mar 27, 2018

yadvr self-assigned this Mar 27, 2018

apache deleted a comment from blueorangutan Mar 27, 2018

rafaelweingartner reviewed May 8, 2018

View reviewed changes

rafaelweingartner approved these changes May 8, 2018

View reviewed changes

yadvr added 7 commits May 9, 2018 15:39

CLOUDSTACK-9114: Rolling restart for VRs on restartNetwork

59acbc5

This introduces a rolling restart of VRs when networks are restarted with cleanup option. Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

server: fix NPE

85ed83e

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

review fixes

10ee97f

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

tests based fixes

045a24a

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

remove overrides

f8c5ee1

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

address review comments from Gabriel

eca8edf

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

address review comments from Rafael and Daan.

435eea2

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>

fmaximus mentioned this pull request May 24, 2018

CLOUDSTACK-10377: Fix Network restart for Nuage #2672

Merged

12 tasks

		@@ -227,7 +227,7 @@ public boolean implement(final Network network, final NetworkOffering offering,
		final List<DomainRouterVO> routers = routerDeploymentDefinition.deployVirtualRouter();

		int routerCounts = 1;

Conversation

yadvr commented Mar 23, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Types of changes

Screenshots (if appropriate):

Non-Redundant Isolated Network VRs

Redundant Isolated Network VR

How Has This Been Tested?

Checklist:

Uh oh!

ustcweizhou commented Mar 23, 2018

Uh oh!

resmo commented Mar 24, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DaanHoogland commented Mar 26, 2018

Uh oh!

DaanHoogland left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rafaelweingartner commented May 8, 2018

Uh oh!

yadvr commented May 8, 2018

Uh oh!

blueorangutan commented May 8, 2018

Uh oh!

yadvr commented May 8, 2018

Uh oh!

blueorangutan commented May 8, 2018

Uh oh!

yadvr commented May 8, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rafaelweingartner commented May 8, 2018

Uh oh!

yadvr commented May 8, 2018

Uh oh!

yadvr commented May 8, 2018

Uh oh!

yadvr commented Mar 23, 2018 •

edited

Loading

resmo commented Mar 24, 2018 •

edited

Loading

yadvr commented May 8, 2018 •

edited

Loading