Skip to content

engine, server, services: fix for respecting secondary storage threshold limit#3480

Merged
yadvr merged 7 commits into
apache:masterfrom
shapeblue:secondary-storage-limit-3287
Jul 31, 2019
Merged

engine, server, services: fix for respecting secondary storage threshold limit#3480
yadvr merged 7 commits into
apache:masterfrom
shapeblue:secondary-storage-limit-3287

Conversation

@shwstppr

@shwstppr shwstppr commented Jul 10, 2019

Copy link
Copy Markdown
Contributor

Retrieval of an image store using ImageStoreProviderManager has been refactored by introducing three different methods,
DataStore getRandomImageStore(List<DataStore> imageStores);
To get an image store for reading purpose. Threshold capacity check will not be used here.
DataStore getImageStoreWithFreeCapacity(List<DataStore> imageStores);
To get an image store for reading purpose. Threshold capacity check will be used here and the store with max free space will be returned. If no store with filled storage less than the threshold is found, the NULL value will be returned.
List<DataStore> listImageStoresWithFreeCapacity(List<DataStore> imageStores);
To get a list of image stores for writing purpose which fulfills threshold capacity check.

Correspondingly DataStoreManager methods have been refactored to return similar values for a given zone,

DataStore getRandomImageStore(long zoneId);
DataStore getImageStoreWithFreeCapacity(long zoneId);
List<DataStore> listImageStoresWithFreeCapacity(long zoneId);

Consumers have been refactored to accept NULL value and throw exception or log error if needed.

Fixes -
#3287 - NULL value will be returned when secondary storage is needed for writing but there is not store with free space.
#3041 - Rather than returning random secondary storage for writing, storage with max. free space will be returned.
#3478 - For migration on VMware, all writable secondary storage will be mounted while preparation.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

How Has This Been Tested?

…old limit

ImageStoreProviderManager will return a null value when there is no image store available with free space and satisfies threshold limit.
Consumers have been refactored to accept null value and throw exception or log error if needed.

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
DataStore sourceTemplateDataStore = dataStoreManagerImpl.getImageStore(srcVolumeInfo.getDataCenterId());
TemplateInfo sourceTemplateInfo = templateDataFactory.getTemplate(srcVolumeInfo.getTemplateId(), sourceTemplateDataStore);
TemplateObjectTO sourceTemplate = new TemplateObjectTO(sourceTemplateInfo);
if (sourceTemplateDataStore != null) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhtyd Throwing an exception or returning null for Image store can fail this. This might not be needing a sec store for writing. Maybe need to handle this separately?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be the reason you are using (sourceTemplateDataStore != null) ? If so, you might need an else at the end to manage the nulls

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kioie It should never be getting null value for image store as it needs the store only for reading and threshold limit shouldn't be affecting this. Will have to make changes for it.

@yadvr yadvr added this to the 4.13.0.0 milestone Jul 10, 2019
Comment thread server/src/main/java/com/cloud/storage/VolumeApiServiceImpl.java
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
s_logger.error(String.format("Can't find an image storage in zone with less than %d usage",
Math.round(_statsCollector.getImageStoreCapacityThreshold()*100)));
return null;
}

@kioie kioie Jul 11, 2019

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides checking for whether the store is below the threshold, are you also considering doing a check if the image will fit in the imagestore and what capacity remains afterwards....for example, if the current used capacity is at 89% and the image might move it to even upto 99%....or is that check happening on a separate method?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kioie I'm not sure that is possible to implement for scenarios at this point with the current design. For example, while registering a template through URL we don't know actual size after download and install.
But such a check can be considered for copy template, snapshots, etc. I've to check the code for that. Anyway, the requirement of this method is to return an image store with capacity within the threshold and I don't think we can have a check for resultant size here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's okay, just wanted to know the possibility of it happening

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
… cmd

Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
@shwstppr

Copy link
Copy Markdown
Contributor Author

@blueorangutan package

@shwstppr shwstppr closed this Jul 17, 2019
@shwstppr shwstppr reopened this Jul 17, 2019
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
@shwstppr

Copy link
Copy Markdown
Contributor Author

@blueorangutan package

@blueorangutan

Copy link
Copy Markdown

@shwstppr a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Copy link
Copy Markdown

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-148

@shwstppr

Copy link
Copy Markdown
Contributor Author

@blueorangutan test

1 similar comment
@shwstppr

Copy link
Copy Markdown
Contributor Author

@blueorangutan test

@yadvr

yadvr commented Jul 22, 2019

Copy link
Copy Markdown
Member

@shwstppr were the vmware specific fixes also done? You can kick tests now.
@blueorangutan package

@blueorangutan

Copy link
Copy Markdown

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Copy link
Copy Markdown

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-167

@shwstppr

Copy link
Copy Markdown
Contributor Author

@blueorangutan test centos7 vmware-65u2

@blueorangutan

Copy link
Copy Markdown

@shwstppr a Trillian-Jenkins test job (centos7 mgmt + vmware-65u2) has been kicked to run smoke tests

@shwstppr

Copy link
Copy Markdown
Contributor Author

@rhtyd yes I made the changes for vmware migration issue. Would help if you can review last 2 commits


DataStore getImageStoreForWrite(long zoneId);

List<DataStore> getImageStoresForWrite(long zoneId);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename it something more meaningful like listImagesStoresWithFreeCapacity?

DataStore getImageStore(long zoneId);
DataStore getImageStoreForRead(long zoneId);

DataStore getImageStoreForWrite(long zoneId);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about getImagesStoresWithFreeCapacity?

List<DataStore> getImageStoresByScope(ZoneScope scope);

DataStore getImageStore(long zoneId);
DataStore getImageStoreForRead(long zoneId);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getRandomImageStore or something that conveys meaning? You may also put javadocs on them to convey which method does what?

@blueorangutan

Copy link
Copy Markdown

Trillian test result (tid-215)
Environment: vmware-65u2 (x2), Advanced Networking with Mgmt server 7
Total time taken: 37689 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3480-t215-vmware-65u2.zip
Intermittent failure detected: /marvin/tests/smoke/test_deploy_vm_root_resize.py
Intermittent failure detected: /marvin/tests/smoke/test_network.py
Intermittent failure detected: /marvin/tests/smoke/test_routers.py
Intermittent failure detected: /marvin/tests/smoke/test_snapshots.py
Intermittent failure detected: /marvin/tests/smoke/test_usage.py
Smoke tests completed. 73 look OK, 4 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_reboot_router Error 322.47 test_network.py
test_08_start_router Error 69.58 test_routers.py
test_09_reboot_router Error 1.09 test_routers.py
test_02_list_snapshots_with_removed_data_store Error 1.16 test_snapshots.py
test_01_volume_usage Error 9.32 test_usage.py

@yadvr yadvr changed the title [WIP DO NOT MERGE] engine, server, services: fix for respecting secondary storage threshold limit engine, server, services: fix for respecting secondary storage threshold limit Jul 25, 2019
@yadvr

yadvr commented Jul 25, 2019

Copy link
Copy Markdown
Member

I've rekicked Travis jobs, please check again in case they fail @shwstppr
@blueorangutan test matrix

@blueorangutan

Copy link
Copy Markdown

@rhtyd a Trillian-Jenkins matrix job (centos6 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@blueorangutan

Copy link
Copy Markdown

Trillian test result (tid-220)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33134 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3480-t220-kvm-centos7.zip
Smoke tests completed. 77 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@yadvr

yadvr commented Jul 25, 2019

Copy link
Copy Markdown
Member

@blueorangutan test centos7 vmware-55u3

@blueorangutan

Copy link
Copy Markdown

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + vmware-55u3) has been kicked to run smoke tests

@blueorangutan

Copy link
Copy Markdown

Trillian test result (tid-219)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 37437 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3480-t219-xenserver-71.zip
Intermittent failure detected: /marvin/tests/smoke/test_list_ids_parameter.py
Intermittent failure detected: /marvin/tests/smoke/test_scale_vm.py
Smoke tests completed. 75 look OK, 2 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_03_list_snapshots Error 0.05 test_list_ids_parameter.py
test_01_scale_vm Failure 28.99 test_scale_vm.py

@blueorangutan

Copy link
Copy Markdown

Trillian test result (tid-222)
Environment: vmware-65u2 (x2), Advanced Networking with Mgmt server 7
Total time taken: 43815 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3480-t222-vmware-65u2.zip
Intermittent failure detected: /marvin/tests/smoke/test_deploy_vm_root_resize.py
Intermittent failure detected: /marvin/tests/smoke/test_snapshots.py
Intermittent failure detected: /marvin/tests/smoke/test_usage.py
Smoke tests completed. 75 look OK, 2 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_02_list_snapshots_with_removed_data_store Error 1.20 test_snapshots.py
test_01_volume_usage Error 10.48 test_usage.py

@borisstoyanov borisstoyanov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm hitting an issue with this one. I cannot sync the templates across a newly added image store. Here's my steps:

Originally created environement with 4 secondary storages
deleted 3 of them so I just have 1 running.
On the deleted storages I've clean all files.
Registered a few templates in cloudstack and they ended up on the running storage
Added back one of the storages and destroyed the SSVM in a while
After a day I went back to the machine and the newly added templates were not on the new image store, just the default ones were added. Also the new secondary storage is not being mounted on any of the hosts

@shwstppr

Copy link
Copy Markdown
Contributor Author

@borisstoyanov will look into it

I think I'm hitting an issue with this one. I cannot sync the templates across a newly added image store. Here's my steps:

Originally created environement with 4 secondary storages
deleted 3 of them so I just have 1 running.
On the deleted storages I've clean all files.
Registered a few templates in cloudstack and they ended up on the running storage
Added back one of the storages and destroyed the SSVM in a while
After a day I went back to the machine and the newly added templates were not on the new image store, just the default ones were added. Also the new secondary storage is not being mounted on any of the hosts

@apache apache deleted a comment from blueorangutan Jul 29, 2019
@yadvr

yadvr commented Jul 29, 2019

Copy link
Copy Markdown
Member

Tests LGTM, any comments @shwstppr @borisstoyanov ?
@shwstppr can you test/fix Bobby's issue?

@shwstppr

Copy link
Copy Markdown
Contributor Author

@rhtyd looking into the issue

@blueorangutan package

@blueorangutan

Copy link
Copy Markdown

@shwstppr a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Copy link
Copy Markdown

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-193

@shwstppr

shwstppr commented Jul 30, 2019

Copy link
Copy Markdown
Contributor Author

I think I'm hitting an issue with this one. I cannot sync the templates across a newly added image store. Here's my steps:

Originally created environement with 4 secondary storages
deleted 3 of them so I just have 1 running.
On the deleted storages I've clean all files.
Registered a few templates in cloudstack and they ended up on the running storage
Added back one of the storages and destroyed the SSVM in a while
After a day I went back to the machine and the newly added templates were not on the new image store, just the default ones were added. Also the new secondary storage is not being mounted on any of the hosts

@borisstoyanov I've verified, the issue you are facing is unrelated to this PR. Currently, code skips downloading a template to new secondary store if is not public and not featured and not a system template.
https://github.com/apache/cloudstack/blob/master/engine/storage/image/src/main/java/org/apache/cloudstack/storage/image/TemplateServiceImpl.java#L517-L520
If you can repeat your steps while making new template public/fetured(or remove your new stores, edit template to make it public and re-add store) it will be downloaded instantly after storage addition. There is no issue with template sync and it should start as soon as a new storage is added. I fixed that with #3302 and have verified with env running this PR.
Not sure if it is by design that a normal template(not featured, not public) will remain on a single secondary store for a zone. @rhtyd @nvazquez any thoughts?

@borisstoyanov

Copy link
Copy Markdown
Contributor

Sure, thanks on the investigation @shwstppr, will check

@borisstoyanov borisstoyanov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that solved the template sync, but the storage pool is not mounted across the hosts still, which I believe is an issue @shwstppr

@yadvr

yadvr commented Jul 31, 2019

Copy link
Copy Markdown
Member

@borisstoyanov @shwstppr have you tested this simple case:
(a) deploy a VM on a host
(b) add a new secondary storage to a zone, then check that the secondary storage is not mounted on any of the hosts
(c) then migrate that VM to another host, check that the secondary storage is mounted on that host.

@borisstoyanov borisstoyanov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, after migrating a VM to the host all image stores were mounted

@yadvr yadvr merged commit b2db897 into apache:master Jul 31, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants