Skip to content

CLOUDSTACK-9114: Reduce VR downtime during network restart#2508

Merged
yadvr merged 7 commits into
apache:4.11from
shapeblue:vr-downtime
May 11, 2018
Merged

CLOUDSTACK-9114: Reduce VR downtime during network restart#2508
yadvr merged 7 commits into
apache:4.11from
shapeblue:vr-downtime

Conversation

@yadvr
Copy link
Copy Markdown
Member

@yadvr yadvr commented Mar 23, 2018

Description

Every time there is a major CloudStack version that requires a new systemvmtemplate, admins need to restart VRs which can take few (sometimes several) minutes. The aim is to make it easier to perform a rolling restart (or blue-green deployment) of the network with acceptable downtime. In this strategy, a new VR is provisioned before the old VR is destroyed/cleaned for non-redundant isolated VRs. For redundant networks, a rolling replacement of routers is performed.

The network downtime is usually the short-time (less than 10s) after the new VR starts (nics are up) and old VR is destroyed. The downtime may be affected by several factors as the new VR has a different mac address, stale ARP cache may take upto 30seconds (default gc_timeout on Linux) to renew in guests making egress traffic downtime slightly higher than ingress traffic (ingress, i.e. public network requesting access to VR's public ip). As an illustration, ingress traffic such as a http request or ssh request to the VR's public IP that port-forwards to guest VM may face lower downtime than egress traffic (say a ping or curl from guest VM to public Internet). Among several environment factors, the following interval/timeouts affect how the old VR's ip-mac address is cached:

/proc/sys/net/ipv4/neigh/default/gc_interval
/proc/sys/net/ipv4/neigh/default/gc_stale_time
/proc/sys/net/ipv4/route/gc_interval
/proc/sys/net/ipv4/route/gc_timeout

To make things faster, a gratituous ARP notification is sent by the new VR (again) after the old VR is successfully destroyed.

Project Goals:

  • Reduce systemvmtemplate size, see CLOUDSTACK-10341: Reduce systemvmtemplate size, install nftables #2506. Smaller the image size, faster it will be to copy template to primary storage to provision new VR.
  • Rolling reboot/deployment of VRs for isolated network with non-rVRs
  • Rolling reboot/deployment of VRs for isolated network with rVRs
  • Rolling reboot/deployment of VRs for VPCs with non-rVRs and rVRs

Future:

  • Speed up iptables rules application
  • Move away from iptables, move to nftables
  • Move away from Python2 codebase to Python3 or something else like Go? TBD

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

Non-Redundant Isolated Network VRs

Rolling restart in action, a new VR is deployed lifing the VR as a redundant VR:
screenshot from 2018-03-23 22-36-27

The old VR is stopped and destroyed:
screenshot from 2018-03-23 22-36-53

The new VR's redundancy is removed and is reprogrammed:
screenshot from 2018-03-23 22-37-02

Meanwhile, the following screenshot shows ping drops from a guest VM, in a non-redundant isolated network: (usually 2-10 ping drops observed)
screenshot from 2018-03-23 17-53-46

Redundant Isolated Network VR

Rolling restart starts by stopping old backup VR:
screenshot from 2018-03-23 22-54-56

New backup VR deployed:
screenshot from 2018-03-23 22-55-10

Old master stopped, new backup becomes new master:
screenshot from 2018-03-23 22-55-56

Old master destroyed and another backup VR deployed:
screenshot from 2018-03-23 22-56-07

Redundant VRs after rollling reboot:
screenshot from 2018-03-23 22-57-11

In the meanwhile, usually 0-4 ping drops observed in guest VM:
screenshot from 2018-03-23 22-57-40

How Has This Been Tested?

Tested with following, benchmarks show rounded off avg/max downtime seconds for restarting non-redundant isolated VRs when cleanup=true:

Env This PR 4.11.0.0 4.9.3.0
KVM CentOS7 8s 50s 70s
KVM CentOS6 10s 60s 80s
XenServer 71 8s 70s NA
XenServer 65sp1 20s 70s 100s
VMware 65 10s 120s NA
VMware 55u3 20s 160s 250s

** The downtime seconds may be higher in real-world environment under heavy load. Redundant routers show a downtime of 0-4 seconds.
*** All used a CentOS7 based management server, tested in environments deployed by Trillian

Ingress traffic was tested using ssh and http services hosted on the guest VM. Example, sshping like utility I wrote for calculating ingress traffic downtime: (number of fail count = downtime seconds)

sshping() {                                                                                                                                              
  ip=$1                                                                            
  total_count=0                                                                    
  fail_count=0                                                                     
  while true; do                                                                   
    start_time=$(date +%s)                                                         
    timeout 1.5 sshpass -p 'password' ssh -o StrictHostKeyChecking=no root@$1 "uptime" > /dev/null 2>&1
    if [[ $? -gt 0 ]]; then                                                        
        fail_count=$((fail_count+1))                                               
    fi                                                                             
    end_time=$(date +%s)                                                           
    diff=$(($end_time - $start_time))                                              
    if [[ $diff -eq 0 ]]; then                                                     
        sleep 1                                                                    
    fi                                                                             
                                                                                   
    total_count=$((total_count+1))                                                 
    echo "Fail/Total = $fail_count/$total_count"                                   
  done                                                                             
}  

Checklist:

  • I have read the CONTRIBUTING document.
  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@blueorangutan package

@yadvr yadvr added this to the 4.11.1.0 milestone Mar 23, 2018
@yadvr yadvr changed the title CLOUDSTACK-9114: Reduce VR downtime during network restart WIP - CLOUDSTACK-9114: Reduce VR downtime during network restart Mar 23, 2018
@ustcweizhou
Copy link
Copy Markdown
Contributor

goooood @rhtyd

@resmo
Copy link
Copy Markdown
Member

resmo commented Mar 24, 2018

looks promising! Can not wait to test it, 👍

@DaanHoogland
Copy link
Copy Markdown
Contributor

looks good @rhtyd

@apache apache deleted a comment from blueorangutan Mar 26, 2018
@apache apache deleted a comment from blueorangutan Mar 26, 2018
@apache apache deleted a comment from blueorangutan Mar 26, 2018
@apache apache deleted a comment from blueorangutan Mar 26, 2018
Copy link
Copy Markdown
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good so far. Are we going to allow for more then 2 routers in redundancy sometime soon?


int routerCounts = 1;
if (offering.getRedundantRouter()) {
if (offering.getRedundantRouter() || network.isRollingRestart()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neat solution

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlikely we'll have more than 2 routers for redundancy. However, the code has an internal limit of 5 (after which it logs warning/errors that a network has more than 5 routers, which can happen in some edge cases).

@yadvr yadvr changed the title WIP - CLOUDSTACK-9114: Reduce VR downtime during network restart CLOUDSTACK-9114: Reduce VR downtime during network restart Mar 27, 2018
@yadvr yadvr self-assigned this Mar 27, 2018
@apache apache deleted a comment from blueorangutan Mar 27, 2018
@apache apache deleted a comment from blueorangutan Mar 27, 2018
@apache apache deleted a comment from blueorangutan Mar 27, 2018
@apache apache deleted a comment from blueorangutan Mar 27, 2018
@apache apache deleted a comment from blueorangutan Mar 27, 2018
@apache apache deleted a comment from blueorangutan Mar 27, 2018
@apache apache deleted a comment from blueorangutan Mar 27, 2018
@rafaelweingartner
Copy link
Copy Markdown
Member

Never mind. It was a problem with my hosts. Everything was working just fine with this PR and the new system vms 👍

@yadvr
Copy link
Copy Markdown
Member Author

yadvr commented May 8, 2018

@rafaelweingartner great, looking forward to your review and test results.

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-2636)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 49833 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2508-t2636-xenserver-71.zip
Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
Smoke tests completed. 66 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_02_vpc_privategw_static_routes Error 588.83 test_privategw_acl.py
test_03_vpc_privategw_restart_vpc_cleanup Failure 639.51 test_privategw_acl.py

@yadvr
Copy link
Copy Markdown
Member Author

yadvr commented May 8, 2018

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@yadvr
Copy link
Copy Markdown
Member Author

yadvr commented May 8, 2018

The last test results (xenserver) shows no new errors caused by this PR, and other tests are incoming soon. Given we've two LGTMs on it from @DaanHoogland and @GabrielBrascher , we may merge this but this PR as soon as test pass BUT since this adds a new operational enhancement I would like to request - @rafaelweingartner @resmo @nvazquez @wido @ustcweizhou @mike-tutkowski @fmaximus @jayapalu @marcaurele @PaulAngus @sureshanaparti @terbolous @NuxRo for their feedback, and I'm happy to move this to next 4.11.2.0 milestone if we've consensus aorund this. Please respond by the end of this week, next week will focus only on blockers and start work towards pre-RC1 testing. Thanks.

(Please follow links for packages and systemvmtemplate if you're not looking for building it yourself:
https://lab.yadav.cloud/systemvmtemplates/4.11/
https://lab.yadav.cloud/testing/vr-downtime/ )

}

public Boolean getMakeredundant() {
if (makeredundant != null) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there is a default value, why not use a primitive value with a default of false? Then, you do not need these logics in a POJO get method.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rafaelweingartner when the option is not passed, the value should be false which is what the fix it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you want, to set a default 'false' value and simply do a return.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is it :)
It is not great change, but it achieves the same result with less logic (conditionals) in the code.

return cleanup;
}
return true;
return false;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same is valid here

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rafaelweingartner when the option is not passed, the value should be false which is what the fix it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that the default is false, right?
Can't you declare the variable as private boolean cleanup = false?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing the fix now, thanks.

VirtualRouter backupRouter = null;
for (final VirtualRouter router : remainingRouters) {
if (router.getRedundantState() == VirtualRouter.RedundantState.BACKUP) {
backupRouter = router;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these routers ordered some how? I mean, reading your next discussion with @DaanHoogland, it seems that the router we do not need will always be the last one (the third or fourth one) when this code is executed. Therefore, there must be some way to order this list before you reach here, right?

backupRouter = routers.get(routers.size() - 1);
}
if (backupRouter != null) {
_routerService.destroyRouter(backupRouter.getId(), context.getAccount(), context.getCaller().getId());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Javadocs are useful to keep the history behind these things. Keep in mind that the method is only public (java delimiter access); this means accessible to programmers to re-use its code. It is not something that end-users will have access to.

}

@Override
public boolean validateNewRouters(final List<? extends VirtualRouter> routers) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of validate, what about areNewRoutersrunning?
It is not a validation method per se. It does not throw exceptions. This method is only checking if all routers of the giving list are in running state.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is used to validate the restart operation, if false is returned you should see error in API/UI. Will change the name.

@@ -227,7 +227,7 @@ public boolean implement(final Network network, final NetworkOffering offering,
final List<DomainRouterVO> routers = routerDeploymentDefinition.deployVirtualRouter();

int routerCounts = 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of routerCounts, what about expectedNumberRouter or something similar. This is the number of VRs you expect to have in a network, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rafaelweingartner good idea, maybe for a next/different refactoring PR.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

vpc.setRollingRestart(true);
}
startVpc(vpc, dest, context);
if (oldRouters.size() > 0) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this logic here? I mean, it is the same as the one at line 2456, but when it solves to true there, we set vpc.setRollingRestart(true); and then we set vpc.setRollingRestart(false); here. Is this something that was done only to satisfy some conditional at startVpc?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rafaelweingartner if there are any existing old routers, we do rolling restart which is to deploy a new VR and then destroy old VRs. The setRollingRestart is a way to tell lower facade to treat deployment like for redundant routers, i.e. expect upto 2 VRs at a given point which is seen in changes in VirtualRouterElement.java. The startVpc will call the facade to deploy suitable number of routers. If there is only one old VPC VR, and if rolling restart is true this will cause the lower layers to deploy a new VPC VR. After that, we set the rolling restart to false which works great both in case of rVR (expected VRs is 2) and non-rVRs (expected VRs becomes 1), and finally we destroy old VR. If the VPC has rVRs, a new backup router is created when startVpc is called otherwise this will simply reconfigure existing VRs.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This is a good piece of documentation.

@rafaelweingartner
Copy link
Copy Markdown
Member

@rhtyd I have just reviewed the code.

I tested it yesterday and everything was fine. I had only one problem, but that was an environment one. One of my hosts was not working properly.

@yadvr
Copy link
Copy Markdown
Member Author

yadvr commented May 8, 2018

Thanks @rafaelweingartner and @DaanHoogland I've tried to address all the review comments, let me know if I've missed anything. Wrt to testing, existing lifecycle tests in VPC, network and routers related smoketests validate the restart operation.

@yadvr
Copy link
Copy Markdown
Member Author

yadvr commented May 8, 2018

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-2635)
Environment: vmware-65 (x2), Advanced Networking with Mgmt server 7
Total time taken: 56954 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2508-t2635-vmware-65.zip
Intermitten failure detected: /marvin/tests/smoke/test_deploy_vm_iso.py
Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
Smoke tests completed. 66 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_03_vpc_privategw_restart_vpc_cleanup Error 607.27 test_privategw_acl.py

yadvr added 7 commits May 9, 2018 15:39
This introduces a rolling restart of VRs when networks are restarted
with cleanup option.

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
@yadvr
Copy link
Copy Markdown
Member Author

yadvr commented May 9, 2018

XenServer and Vmware tests are LGTM, 1-2 failure seen in intermittent/env related. I've rebase and fixed conflicts against latest 4.11, will kick test against KVM after packaging.
@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2036

@yadvr
Copy link
Copy Markdown
Member Author

yadvr commented May 9, 2018

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-2651)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 31613 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2508-t2651-kvm-centos7.zip
Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermitten failure detected: /marvin/tests/smoke/test_ssvm.py
Smoke tests completed. 66 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@yadvr
Copy link
Copy Markdown
Member Author

yadvr commented May 10, 2018

Tests LGTM, we've enough reviews to merge this. If there are no objections, this will be merged by end of the day tomorrow.

@blueorangutan
Copy link
Copy Markdown

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2043

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants