
CLOUDSTACK-9572 Snapshot on primary storage not cleaned up after Stor…#1740

Merged
yadvr merged 1 commit into
apache:masterfrom
yvsubhash:CLOUDSTACK-9572
Jan 5, 2018

Conversation

@yvsubhash

@yvsubhash yvsubhash commented Oct 27, 2016

Snapshot on primary storage is not cleaned up after storage migration. This happens in the following scenario:

Steps To Reproduce

  1. Create an instance on local storage on any host.
  2. Create a scheduled snapshot of the volume.
  3. Wait until ACS has created the snapshot. ACS creates the snapshot on local storage and transfers it to secondary storage, but the latest snapshot stays on local storage. This is as expected.
  4. Migrate the instance to another XenServer host with the ACS UI and storage live migration.
  5. The snapshot on the old host's local storage is not cleaned up and stays there, so local storage fills up with unneeded snapshots.

@syed
Contributor

syed commented Nov 2, 2016

Wait until ACS has created the snapshot. ACS creates the snapshot on local storage and transfers it to secondary storage, but the latest snapshot stays on local storage. This is as expected.

Why is it expected that latest snapshot be present on the primary storage after it has been copied? Shouldn't it be deleted?

@yadvr
Member

yadvr commented Nov 2, 2016

@yvsubhash can you fix the conflicts and rebase against latest 4.8/4.9. Thanks.

@yvsubhash yvsubhash changed the base branch from master to 4.8 November 14, 2016 11:15
@yvsubhash yvsubhash closed this Nov 14, 2016
@yvsubhash yvsubhash reopened this Nov 14, 2016
Contributor

@jburwell jburwell left a comment


@yvsubhash in addition to my review comments, please add Marvin test cases to verify the behavior of this fix.

SnapshotInfo info = snapshotFactory.getSnapshot(snapshot.getId(), DataStoreRole.Primary);
try {
    snapshotSrv.deleteSnapshot(info);
} catch (CloudRuntimeException e) {
Contributor


This local variable is only used once. Please consider collapsing it into line 1122. Also, please include the exception's message in the log message to provide greater detail for debugging efforts.

@Override
public void cleanupSnapshotsByVolume(Long volumeId) {
    List<SnapshotVO> volSnapShots = _snapshotDao.listByVolumeId(volumeId);
    for (SnapshotVO snapshot : volSnapShots) {
Contributor


This appears to be an application-side join. Please consider creating a new query to retrieve all snapshot info instances associated with volumeId, to reduce load on the database and simplify this method.
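The reviewer's suggestion can be sketched as a toy model (all class and method names here are hypothetical, not CloudStack's actual DAO API): rather than iterating over the volume's snapshots and resolving each one with a separate lookup, retrieve everything for the volume in a single query.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical in-memory stand-in for the snapshots table; in CloudStack the
// single-query version would be a new DAO method instead of a per-row loop.
class SnapRow {
    final long id;
    final long volumeId;
    SnapRow(long id, long volumeId) { this.id = id; this.volumeId = volumeId; }
}

class SnapshotQueries {
    // Application-side join (what the review flags): one lookup per id,
    // which in the real DAO means one DB round trip per snapshot.
    static List<SnapRow> perRow(Map<Long, SnapRow> byId, List<Long> ids) {
        List<SnapRow> out = new ArrayList<>();
        for (Long id : ids) {
            out.add(byId.get(id)); // each get models a separate query
        }
        return out;
    }

    // Suggested shape: one query keyed by volumeId.
    static List<SnapRow> byVolume(Collection<SnapRow> table, long volumeId) {
        return table.stream()
                    .filter(r -> r.volumeId == volumeId)
                    .collect(Collectors.toList());
    }
}
```

Both return the same rows; the difference is purely how many database round trips the real DAO would make.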

@yadvr
Member

yadvr commented Nov 20, 2016

@blueorangutan package

@blueorangutan

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-183

@yadvr
Member

yadvr commented Nov 24, 2016

@yvsubhash can you reply to the raised review comments?

@yvsubhash yvsubhash changed the base branch from 4.8 to master March 2, 2017 10:25
@cloudmonger

ACS CI BVT Run

Summary:
Build Number: 461
Hypervisor: xenserver
NetworkType: Advanced
Passed: 105
Failed: 0
Skipped: 7

Link to logs Folder (search by build_no): https://www.dropbox.com/sh/yj3wnzbceo9uef2/AAB6u-Iap-xztdm6jHX9SjPja?dl=0

Failed tests:

Skipped tests:
test_01_test_vm_volume_snapshot
test_vm_nic_adapter_vmxnet3
test_static_role_account_acls
test_11_ss_nfs_version_on_ssvm
test_nested_virtualization_vmware
test_3d_gpu_support
test_deploy_vgpu_enabled_vm

Passed test suits:
test_deploy_vm_with_userdata.py
test_affinity_groups_projects.py
test_portable_publicip.py
test_over_provisioning.py
test_global_settings.py
test_scale_vm.py
test_service_offerings.py
test_routers_iptables_default_policy.py
test_loadbalance.py
test_routers.py
test_reset_vm_on_reboot.py
test_deploy_vms_with_varied_deploymentplanners.py
test_network.py
test_router_dns.py
test_non_contigiousvlan.py
test_login.py
test_deploy_vm_iso.py
test_list_ids_parameter.py
test_public_ip_range.py
test_multipleips_per_nic.py
test_regions.py
test_affinity_groups.py
test_network_acl.py
test_pvlan.py
test_volumes.py
test_nic.py
test_deploy_vm_root_resize.py
test_resource_detail.py
test_secondary_storage.py
test_vm_life_cycle.py
test_routers_network_ops.py
test_disk_offerings.py

@yvsubhash
Author

@syed
On XenServer we always keep the latest snapshot on primary storage, as XenServer needs it to create delta snapshots. If the snapshot is not present on XenServer, the next snapshot would be a full snapshot rather than a delta. Please refer to the related bug:
https://issues.apache.org/jira/browse/CLOUDSTACK-2630
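The behavior described above can be reduced to a minimal sketch (illustrative names only, not CloudStack's actual classes): a delta snapshot is only possible while the parent snapshot is still present on primary storage.

```java
import java.util.Optional;

// Illustrative model: XenServer computes the next snapshot as a delta against
// the previous snapshot still held on primary storage. If that parent is gone
// (e.g. after a storage migration), only a full snapshot is possible.
class SnapshotTypeChooser {
    enum Type { FULL, DELTA }

    // parentUuid: UUID of the previous snapshot still on primary storage,
    // empty if no snapshot of this volume remains there.
    static Type nextSnapshotType(Optional<String> parentUuid) {
        return parentUuid.isPresent() ? Type.DELTA : Type.FULL;
    }
}
```

This is why deleting the last primary-side snapshot is safe but costly: nothing breaks, yet the next scheduled snapshot degrades to a full copy.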

@syed
Contributor

syed commented Mar 23, 2017

Thanks @yvsubhash. Did the code review and it LGTM.

@mrunalinikankariya
Contributor

In ACS, all snapshots are deleted from primary storage as soon as they are successfully backed up and copied to secondary storage, hence this issue is not valid for ACS.

However, a ticket should be raised to analyze and understand how incremental snapshots work if not a single copy is kept on primary storage.

@yvsubhash
Author

Closing this PR as it is no longer needed

@yvsubhash yvsubhash closed this Apr 12, 2017
@yvsubhash
Author

The following ticket has been opened to fix this issue:
https://issues.apache.org/jira/browse/CLOUDSTACK-9870

@yvsubhash yvsubhash reopened this Nov 6, 2017
@yvsubhash yvsubhash force-pushed the CLOUDSTACK-9572 branch 2 times, most recently from 64448c3 to 1e319da Compare November 20, 2017 08:43
@jahnaviboddu

LGTM for Testing

  1. Create a zone set up with 2 hosts in one cluster.
  2. Set the system.vm.use.local.storage parameter to "true" in Global Settings.
  3. Create a compute offering with the storage type option set to local.
  4. Create a disk offering with the storage type option set to local.
  5. Enable the "local storage for user VMs" option on the zone.
  6. Take recurring snapshots of the root disk.
  7. Migrate the instance to another host.
  8. After migration, snapshots on the source host's local storage are cleaned up.

@yadvr
Member

yadvr commented Nov 28, 2017

@blueorangutan package

@rafaelweingartner
Member

@rhtyd I tested this fix in my dev environment and it solved the described problem.
I think this PR is in a good state to get merged. It could benefit from some unit tests though.

@yadvr
Member

yadvr commented Jan 4, 2018

@blueorangutan test centos7 xenserver-65sp1

@blueorangutan

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + xenserver-65sp1) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-1995)
Environment: xenserver-65sp1 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42502 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr1740-t1995-xenserver-65sp1.zip
Smoke tests completed. 67 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@yadvr
Member

yadvr commented Jan 5, 2018

Merging this based on test results and code reviews.

@yadvr yadvr merged commit 8eca04e into apache:master Jan 5, 2018
@mike-tutkowski
Member

I have noticed that not only is the snapshot on primary storage deleted when cleanupSnapshotsByVolume is invoked, but so is the copy of it that's on secondary storage. Can anyone else reproduce this?

@mike-tutkowski
Member

mike-tutkowski commented Aug 9, 2018

In fact, one of my tests fails unless I comment out the cleanupSnapshotsByVolume method. The test checks to make sure the volume that is migrated from one primary storage to another is still associated with its snapshot on secondary storage. As it turns out, when cleanupSnapshotsByVolume is invoked, the snapshot is completely deleted.

@mike-tutkowski
Member

OK, I have been able to reproduce this in a simple environment using only NFS.

I create a VM whose root disk resides on NFS primary storage. I take a snapshot of this root disk. I then migrate the root disk from one NFS primary storage to another. The volume is no longer associated with the snapshot, and the removed column for the snapshot's row in the cloud.snapshots table has a date in it.

@rafaelweingartner
Member

@mike-tutkowski are you sure this problem is caused by this PR?
I mean, can you check the code here? The method snapshotMgr.cleanupSnapshotsByVolume is only called when there is a failure. Moreover, can you check the method cleanupSnapshotsByVolume? If you look there we only list the snapshots in the primary storage (snapshotFactory.getSnapshots(volumeId, DataStoreRole.Primary)), which means, the one that failed in the context of this PR.

@mike-tutkowski
Member

mike-tutkowski commented Aug 9, 2018

From what I can see, the method is called upon success in at least two places:

            future.complete(res);
        } else {
            srcVolume.processEvent(Event.OperationSuccessed);
            snapshotMgr.cleanupSnapshotsByVolume(srcVolume.getId());   // <--
            future.complete(res);
        }
    } catch (Exception e) {

        } else {
            for (Map.Entry<VolumeInfo, DataStore> entry : volumeToPool.entrySet()) {
                VolumeInfo volume = entry.getKey();
                snapshotMgr.cleanupSnapshotsByVolume(volume.getId());   // <--
                volume.processEvent(Event.OperationSuccessed);
            }
            future.complete(res);

@mike-tutkowski
Member

mike-tutkowski commented Aug 9, 2018

The code to retrieve only primary-storage snapshots is OK. The problem is when the snapshot is given to the Snapshot Service (snapshotSrv.deleteSnapshot(info);). At some point, not only does the snapshot on the primary storage get deleted, but so does the snapshot on secondary storage.

When I comment out the code in snapshotMgr.cleanupSnapshotsByVolume, everything works the way it used to (meaning the snapshot on the original primary storage is stranded, but the volume is still associated with the correct snapshot on secondary storage).

@rafaelweingartner
Member

You are right. That code seems to be the other way around. I mean, we should clean when there is a failure, right? At least that is what I understood by the author description.

@mike-tutkowski
Member

The intention of the code is OK. A volume is being moved from one primary storage to another. If it had any snapshots on the source primary storage, we do not bring them along to the target primary storage. Those (primary-side) snapshots only helped in computing the delta of changes that needed to be copied to secondary storage when another snapshot was taken. Once we move the volume to its new primary storage, we lose knowledge of this delta information (which is OK). If the user asks for a new snapshot of the volume in question, we must take a full snapshot of that volume (but we should leave a new delta snapshot on this primary storage in case yet another snapshot is requested later).

The intent of the code is to remove the unused (delta) snapshot that resided on the source primary storage because it is no longer needed. Its mistake is that it not only removes that delta snapshot, but also the entirety of the snapshot (both what might have been on primary storage and what is on secondary storage). This leads to data loss for the customer.

@mike-tutkowski
Member

But I should also point out that you are correct in saying that in the event of a failed attempt to create a new snapshot, we should remove the (delta) snapshot from primary storage (as it might act as a false delta if we need to take a future snapshot).

@rafaelweingartner
Member

Thanks for your details @mike-tutkowski. Can you help me understand where the problem is?

Is the problem in snapshotSrv.deleteSnapshot(info) that is deleting the snapshot from the secondary storage as well, and not only in the primary storage? Or is the code that the PR is introducing that is using the method snapshotSrv.deleteSnapshot in an incorrect manner?

@mike-tutkowski
Member

mike-tutkowski commented Aug 11, 2018 via email

@rafaelweingartner
Member

OK, so what this means is that, when creating the PR, the author used snapshotSrv.deleteSnapshot thinking it would only remove the snapshot on the hypervisor (primary storage). However, that method destroys the snapshot in ACS entirely, which means removing it from the DB and then from both storage systems (primary and secondary).

Having said that, what we need is a method that removes snapshots from primary storage only, right? Then, instead of using snapshotSrv.deleteSnapshot, we use this new method.
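Such a primary-only delete could look roughly like this toy sketch (hypothetical names, modeling snapshot_store_ref as a plain in-memory list): remove only the Primary-role reference for the snapshot and leave the Image (secondary) copy intact.

```java
import java.util.*;

// Toy snapshot_store_ref row: (snapshotId, storeRole). A primary-only delete
// removes Primary rows for the snapshot but keeps Image rows -- unlike the
// full deleteSnapshot path described above, which destroys both copies.
class SnapStoreRef {
    final long snapshotId;
    final String storeRole;
    SnapStoreRef(long snapshotId, String storeRole) {
        this.snapshotId = snapshotId;
        this.storeRole = storeRole;
    }
}

class PrimaryOnlyDelete {
    static void deleteOnPrimaryOnly(List<SnapStoreRef> storeRefs, long snapshotId) {
        // drop only the primary-storage reference; the secondary copy survives
        storeRefs.removeIf(r -> r.snapshotId == snapshotId
                                && "Primary".equals(r.storeRole));
    }
}
```

In the real codebase the deletion would of course also issue the hypervisor-side delete; the sketch only captures which rows should and should not disappear.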

@mike-tutkowski
Member

mike-tutkowski commented Aug 11, 2018 via email

@rafaelweingartner
Member

rafaelweingartner commented Aug 13, 2018

Hello @mike-tutkowski, I am not able to reproduce the problem you described. I do think though that the code can be improved (I will explain it).

Let's see an example. If I create a VM (vm-1) and then create two snapshots of this VM's root volume, I will have the following entries in the snapshot_store_ref table:


| id  | store_id | snapshot_id | store_role | install_path                                              | state | physical_size | size        |
|-----|----------|-------------|------------|-----------------------------------------------------------|-------|---------------|-------------|
| 422 | 2        | 253         | Image      | snapshots/2/7345/134a2524-72b6-4cfd-a802-89933fe63407.vhd | Ready | 1758786048    | 21474836480 |
| 423 | 5        | 254         | Primary    | 3037a7ed-5c93-45da-bc73-454ad21df793                      | Ready | 21474836480   | 21474836480 |
| 424 | 2        | 254         | Image      | snapshots/2/7345/5ddf9f65-16e7-4edd-9878-4f18ad26137b.vhd | Ready | 113508864     | 21474836480 |

Then, when I migrate the volume from shared storage to local storage, the code we are discussing is triggered. The code is quite odd because it returns a list of SnapshotObject; each snapshot will have a reference there, and each object is created with SnapshotObject.getSnapshotObject(snapshot, store). This means there will be some SnapshotObject instances that reference a snapshot that is not on the primary storage. Then, when the delete method snapshotSrv.deleteSnapshot(info) is called, I get the following entries in the snapshot_store_ref table:


| id  | store_id | snapshot_id | store_role | install_path                                              | state | physical_size | size        |
|-----|----------|-------------|------------|-----------------------------------------------------------|-------|---------------|-------------|
| 422 | 2        | 253         | Image      | snapshots/2/7345/134a2524-72b6-4cfd-a802-89933fe63407.vhd | Ready | 1758786048    | 21474836480 |
| 424 | 2        | 254         | Image      | snapshots/2/7345/5ddf9f65-16e7-4edd-9878-4f18ad26137b.vhd | Ready | 113508864     | 21474836480 |

This is the expected behavior. I am not sure if this behavior can be different for other hypervisors. I am using XenServer and testing with the latest commit in master.

There is one thing we can do though. We can change the method snapshotFactory.getSnapshots to return only SnapshotObject objects that reference the snapshots in the current primary storage from which the volume is being migrated out of. What do you think?
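The proposed change to snapshotFactory.getSnapshots can be sketched as a toy model (hypothetical names, not the real SnapshotDataFactoryImpl): filter snapshot_store_ref rows by the requested store role, so a snapshot that only exists on secondary storage is never returned as if it were on primary.

```java
import java.util.*;
import java.util.stream.*;

// Toy model of getSnapshots(volumeId, DataStoreRole.Primary): only rows
// whose store_role matches the requested role are returned.
class StoreRefRow {
    final long snapshotId;
    final long volumeId;
    final String storeRole;
    StoreRefRow(long snapshotId, long volumeId, String storeRole) {
        this.snapshotId = snapshotId;
        this.volumeId = volumeId;
        this.storeRole = storeRole;
    }
}

class GetSnapshotsSketch {
    static List<StoreRefRow> getSnapshots(List<StoreRefRow> refs, long volumeId, String role) {
        return refs.stream()
                   .filter(r -> r.volumeId == volumeId && role.equals(r.storeRole))
                   .collect(Collectors.toList());
    }
}
```

With the role filter in place, the cleanup loop can only ever see (and delete) primary-side references.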

@mike-tutkowski
Member

I should have placed these comments here (in GitHub) and let them propagate to JIRA. I'm going to copy/paste the relevant comments I made in JIRA here:

Comment:

After updating master on my local repo, I can no longer reproduce the issue I described above. With 46c56ea, the workflow behaves as expected. It would seem there was an issue in a previous version of master, but that it no longer exists.

Comment:

With regards to your snapshot-delete question:

As far as I know, if you migrate a volume from one primary storage to another, there should never exist any primary storage-based snapshots of this volume after the migration has taken place. That being the case, I don't think you should get back primary storage-based snapshots from snapshotFactory.getSnapshots that are from different primary storages.

@mike-tutkowski
Member

This PR has been merged, but the corresponding ticket in JIRA is still marked as Open/Unresolved. I'm going to change that ticket to Closed/Fixed.

@rafaelweingartner
Member

As far as I know, if you migrate a volume from one primary storage to another, there should never exist any primary storage-based snapshots of this volume after the migration has taken place. That being the case, I don't think you should get back primary storage-based snapshots from snapshotFactory.getSnapshots that are from different primary storages

I do agree with you, but that is not how it is working right now. Should we fix that method then? I can open a PR tomorrow.

@mike-tutkowski
Member

mike-tutkowski commented Aug 14, 2018 via email

rafaelweingartner added a commit to rafaelweingartner/cloudstack that referenced this pull request Aug 15, 2018
…mary storage

While checking a problem with @mike-tutkowski at PR apache#1740, we noticed that the method `org.apache.cloudstack.storage.snapshot.SnapshotDataFactoryImpl.getSnapshots(long, DataStoreRole)` was not loading the snapshots in `snapshot_store_ref` table according to their storage role. Instead, it would return a list of all snapshots (even if they are not in the storage role sent as a parameter) saying that they are in the storage that was sent as parameter.
rafaelweingartner added a commit that referenced this pull request Aug 17, 2018
…torage (#2808)

* Fix the problem at #1740 when it loads all snapshots in the primary storage

While checking a problem with @mike-tutkowski at PR #1740, we noticed that the method `org.apache.cloudstack.storage.snapshot.SnapshotDataFactoryImpl.getSnapshots(long, DataStoreRole)` was not loading the snapshots in `snapshot_store_ref` table according to their storage role. Instead, it would return a list of all snapshots (even if they are not in the storage role sent as a parameter) saying that they are in the storage that was sent as parameter.

* Add unit test


10 participants