
CLOUDSTACK-9572 Snapshot on primary storage not cleaned up after Stor…#1740

Merged
yadvr merged 1 commit into
apache:masterfrom
yvsubhash:CLOUDSTACK-9572
Jan 5, 2018

Conversation

@yvsubhash

@yvsubhash yvsubhash commented Oct 27, 2016

Snapshot on primary storage is not cleaned up after storage migration. This happens in the following scenario:

Steps To Reproduce

  1. Create an instance on local storage on any host.
  2. Create a scheduled snapshot of the volume.
  3. Wait until ACS has created the snapshot. ACS creates the snapshot on local storage and transfers it to secondary storage, but the latest snapshot stays on local storage. This is as expected.
  4. Migrate the instance to another XenServer host with the ACS UI and storage live migration.
  5. The snapshot on the old host's local storage is not cleaned up and stays there, so local storage fills up with unneeded snapshots.

@syed
Contributor

syed commented Nov 2, 2016

Wait until ACS has created the snapshot. ACS creates the snapshot on local storage and transfers it to secondary storage, but the latest snapshot stays on local storage. This is as expected.

Why is it expected that latest snapshot be present on the primary storage after it has been copied? Shouldn't it be deleted?

@yadvr
Member

yadvr commented Nov 2, 2016

@yvsubhash can you fix the conflicts and rebase against latest 4.8/4.9. Thanks.

@yvsubhash yvsubhash changed the base branch from master to 4.8 November 14, 2016 11:15
@yvsubhash yvsubhash closed this Nov 14, 2016
@yvsubhash yvsubhash reopened this Nov 14, 2016
Contributor

@jburwell jburwell left a comment


@yvsubhash in addition to my review comments, please add Marvin test cases to verify the behavior of this fix.

SnapshotInfo info = snapshotFactory.getSnapshot(snapshot.getId(), DataStoreRole.Primary);
try {
    snapshotSrv.deleteSnapshot(info);
} catch (CloudRuntimeException e) {
Contributor


This local variable is only used once. Please consider collapsing it into line 1122. Also, please include the exception's message in the log message to provide greater detail for debugging efforts.

@Override
public void cleanupSnapshotsByVolume(Long volumeId) {
    List<SnapshotVO> volSnapShots = _snapshotDao.listByVolumeId(volumeId);
    for (SnapshotVO snapshot : volSnapShots) {
Contributor


This appears to be an application-side join. Please consider creating a new query to retrieve all snapshot info instances associated with volumeId, to reduce load on the database and simplify this method.
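The reviewer's suggestion can be sketched as a toy model (all class and method names here are hypothetical, not CloudStack's actual DAO API): rather than iterating over the volume's snapshots and resolving each one with a separate lookup, retrieve everything for the volume in a single query.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical in-memory stand-in for the snapshots table; in CloudStack the
// single-query version would be a new DAO method instead of a per-row loop.
class SnapRow {
    final long id;
    final long volumeId;
    SnapRow(long id, long volumeId) { this.id = id; this.volumeId = volumeId; }
}

class SnapshotQueries {
    // Application-side join (what the review flags): one lookup per id,
    // which in the real DAO means one DB round trip per snapshot.
    static List<SnapRow> perRow(Map<Long, SnapRow> byId, List<Long> ids) {
        List<SnapRow> out = new ArrayList<>();
        for (Long id : ids) {
            out.add(byId.get(id)); // each get models a separate query
        }
        return out;
    }

    // Suggested shape: one query keyed by volumeId.
    static List<SnapRow> byVolume(Collection<SnapRow> table, long volumeId) {
        return table.stream()
                    .filter(r -> r.volumeId == volumeId)
                    .collect(Collectors.toList());
    }
}
```

Both return the same rows; the difference is purely how many database round trips the real DAO would make.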

@yadvr
Member

yadvr commented Nov 20, 2016

@blueorangutan package

@blueorangutan

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-183

@yadvr
Member

yadvr commented Nov 24, 2016

@yvsubhash can you reply to the raised review comments?

@yvsubhash yvsubhash changed the base branch from 4.8 to master March 2, 2017 10:25
@cloudmonger

ACS CI BVT Run

Summary:
Build Number: 461
Hypervisor: xenserver
NetworkType: Advanced
Passed: 105
Failed: 0
Skipped: 7

Link to logs Folder (search by build_no): https://www.dropbox.com/sh/yj3wnzbceo9uef2/AAB6u-Iap-xztdm6jHX9SjPja?dl=0

Failed tests:

Skipped tests:
test_01_test_vm_volume_snapshot
test_vm_nic_adapter_vmxnet3
test_static_role_account_acls
test_11_ss_nfs_version_on_ssvm
test_nested_virtualization_vmware
test_3d_gpu_support
test_deploy_vgpu_enabled_vm

Passed test suits:
test_deploy_vm_with_userdata.py
test_affinity_groups_projects.py
test_portable_publicip.py
test_over_provisioning.py
test_global_settings.py
test_scale_vm.py
test_service_offerings.py
test_routers_iptables_default_policy.py
test_loadbalance.py
test_routers.py
test_reset_vm_on_reboot.py
test_deploy_vms_with_varied_deploymentplanners.py
test_network.py
test_router_dns.py
test_non_contigiousvlan.py
test_login.py
test_deploy_vm_iso.py
test_list_ids_parameter.py
test_public_ip_range.py
test_multipleips_per_nic.py
test_regions.py
test_affinity_groups.py
test_network_acl.py
test_pvlan.py
test_volumes.py
test_nic.py
test_deploy_vm_root_resize.py
test_resource_detail.py
test_secondary_storage.py
test_vm_life_cycle.py
test_routers_network_ops.py
test_disk_offerings.py

@yvsubhash
Author

@syed
On XenServer we always keep the latest snapshot on primary storage, as XenServer needs it to create delta snapshots. If the snapshot is not present on XenServer, the next snapshot would be a full snapshot rather than a delta. Please refer to the related bug:
https://issues.apache.org/jira/browse/CLOUDSTACK-2630
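The behavior described above can be reduced to a minimal sketch (illustrative names only, not CloudStack's actual classes): a delta snapshot is only possible while the parent snapshot is still present on primary storage.

```java
import java.util.Optional;

// Illustrative model: XenServer computes the next snapshot as a delta against
// the previous snapshot still held on primary storage. If that parent is gone
// (e.g. after a storage migration), only a full snapshot is possible.
class SnapshotTypeChooser {
    enum Type { FULL, DELTA }

    // parentUuid: UUID of the previous snapshot still on primary storage,
    // empty if no snapshot of this volume remains there.
    static Type nextSnapshotType(Optional<String> parentUuid) {
        return parentUuid.isPresent() ? Type.DELTA : Type.FULL;
    }
}
```

This is why deleting the last primary-side snapshot is safe but costly: nothing breaks, yet the next scheduled snapshot degrades to a full copy.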

@syed
Contributor

syed commented Mar 23, 2017

Thanks @yvsubhash. Did the code review and it LGTM.

@mrunalinikankariya
Contributor

In ACS, all snapshots are deleted from primary storage as soon as they are successfully backed up and copied to secondary storage, hence this issue is not valid for ACS.

However, a ticket should be raised to analyze and understand how incremental snapshots work if not a single copy is kept on primary storage.

@yvsubhash
Author

Closing this PR as it is no longer needed

@yvsubhash yvsubhash closed this Apr 12, 2017
@yvsubhash
Author

The following ticket has been opened to fix this issue:
https://issues.apache.org/jira/browse/CLOUDSTACK-9870

@yvsubhash yvsubhash reopened this Nov 6, 2017
@yvsubhash yvsubhash force-pushed the CLOUDSTACK-9572 branch 2 times, most recently from 64448c3 to 1e319da Compare November 20, 2017 08:43
@jahnaviboddu

LGTM for Testing

  1. Create a zone set up with 2 hosts in one cluster.
  2. Set the system.vm.use.local.storage parameter to "true" in Global Settings.
  3. Create a compute offering with the storage type option set to local.
  4. Create a disk offering with the storage type option set to local.
  5. Enable the "local storage for user VMs" option on the zone.
  6. Take recurring snapshots of the root disk.
  7. Migrate the instance to another host.
  8. After migration, snapshots on the source host's local storage are cleaned up.

@yadvr
Member

yadvr commented Nov 28, 2017

@blueorangutan package

@rafaelweingartner
Member

@rhtyd I tested this fix in my dev environment and it solved the described problem.
I think this PR is in a good state to get merged. It could benefit from some unit tests though.

@yadvr
Member

yadvr commented Jan 4, 2018

@blueorangutan test centos7 xenserver-65sp1

@blueorangutan

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + xenserver-65sp1) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-1995)
Environment: xenserver-65sp1 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42502 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr1740-t1995-xenserver-65sp1.zip
Smoke tests completed. 67 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@yadvr
Member

yadvr commented Jan 5, 2018

Merging this based on test results and code reviews.

@yadvr yadvr merged commit 8eca04e into apache:master Jan 5, 2018
@mike-tutkowski
Member

I have noticed that not only is the snapshot on primary storage deleted when cleanupSnapshotsByVolume is invoked, but so is the copy of it that's on secondary storage. Can anyone else reproduce this?

@mike-tutkowski
Member

mike-tutkowski commented Aug 9, 2018

In fact, one of my tests fails unless I comment out the cleanupSnapshotsByVolume method. The test checks to make sure the volume that is migrated from one primary storage to another is still associated with its snapshot on secondary storage. As it turns out, when cleanupSnapshotsByVolume is invoked, the snapshot is completely deleted.

@mike-tutkowski
Member

OK, I have been able to reproduce this in a simple environment using only NFS.

I create a VM whose root disk resides on NFS primary storage. I take a snapshot of this root disk. I then migrate the root disk from one NFS primary storage to another. The volume is no longer associated with the snapshot, and the removed column for the snapshot's row in the cloud.snapshots table has a date in it.

@rafaelweingartner
Member

@mike-tutkowski are you sure this problem is caused by this PR?
I mean, can you check the code here? The method snapshotMgr.cleanupSnapshotsByVolume is only called when there is a failure. Moreover, can you check the method cleanupSnapshotsByVolume? If you look there we only list the snapshots in the primary storage (snapshotFactory.getSnapshots(volumeId, DataStoreRole.Primary)), which means, the one that failed in the context of this PR.

@mike-tutkowski
Member

mike-tutkowski commented Aug 9, 2018

From what I can see, the method is called upon success in at least two places:

            future.complete(res);
        } else {
            srcVolume.processEvent(Event.OperationSuccessed);
            snapshotMgr.cleanupSnapshotsByVolume(srcVolume.getId());   // <--
            future.complete(res);
        }
    } catch (Exception e) {

        } else {
            for (Map.Entry<VolumeInfo, DataStore> entry : volumeToPool.entrySet()) {
                VolumeInfo volume = entry.getKey();
                snapshotMgr.cleanupSnapshotsByVolume(volume.getId());   // <--
                volume.processEvent(Event.OperationSuccessed);
            }
            future.complete(res);

@mike-tutkowski
Member

mike-tutkowski commented Aug 9, 2018

The code to retrieve only primary-storage snapshots is OK. The problem is when the snapshot is given to the Snapshot Service (snapshotSrv.deleteSnapshot(info);). At some point, not only does the snapshot on the primary storage get deleted, but so does the snapshot on secondary storage.

When I comment out the code in snapshotMgr.cleanupSnapshotsByVolume, everything works the way it used to (meaning the snapshot on the original primary storage is stranded, but the volume is still associated with the correct snapshot on secondary storage).

@rafaelweingartner
Member

You are right. That code seems to be the other way around. I mean, we should clean when there is a failure, right? At least that is what I understood by the author description.

@mike-tutkowski
Member

The intention of the code is OK. A volume is being moved from one primary storage to another. If it had any snapshots on the source primary storage, we do not bring them along to the target primary storage. Those (primary-side) snapshots only helped in computing the delta of changes that needed to be copied to secondary storage when another snapshot was taken. Once we move the volume to its new primary storage, we lose knowledge of this delta information (which is OK). If the user asks for a new snapshot of the volume in question, we must take a full snapshot of that volume (but we should leave a new delta snapshot on this primary storage in case yet another snapshot is requested later).

The intent of the code is to remove the unused (delta) snapshot that resided on the source primary storage because it is no longer needed. Its mistake is that it not only removes that delta snapshot, but also the entirety of the snapshot (both what might have been on primary storage and what is on secondary storage). This leads to data loss for the customer.

@mike-tutkowski
Member

But I should also point out that you are correct in saying that in the event of a failed attempt to create a new snapshot, we should remove the (delta) snapshot from primary storage (as it might act as a false delta if we need to take a future snapshot).

@rafaelweingartner
Member

Thanks for your details @mike-tutkowski. Can you help me understand where the problem is?

Is the problem in snapshotSrv.deleteSnapshot(info) that is deleting the snapshot from the secondary storage as well, and not only in the primary storage? Or is the code that the PR is introducing that is using the method snapshotSrv.deleteSnapshot in an incorrect manner?

@mike-tutkowski
Member

mike-tutkowski commented Aug 11, 2018 via email

@rafaelweingartner
Member

OK, so what this means is that, when creating the PR, the author used snapshotSrv.deleteSnapshot thinking it would only remove the snapshot on the hypervisor (primary storage). However, that method destroys the snapshot in ACS entirely, which means removing it from the DB and then from both storage systems (primary and secondary).

Having said that, what we need is a method that removes snapshots from primary storage only, right? Then, instead of using snapshotSrv.deleteSnapshot, we use this new method.
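Such a primary-only delete could look roughly like this toy sketch (hypothetical names, modeling snapshot_store_ref as a plain in-memory list): remove only the Primary-role reference for the snapshot and leave the Image (secondary) copy intact.

```java
import java.util.*;

// Toy snapshot_store_ref row: (snapshotId, storeRole). A primary-only delete
// removes Primary rows for the snapshot but keeps Image rows -- unlike the
// full deleteSnapshot path described above, which destroys both copies.
class SnapStoreRef {
    final long snapshotId;
    final String storeRole;
    SnapStoreRef(long snapshotId, String storeRole) {
        this.snapshotId = snapshotId;
        this.storeRole = storeRole;
    }
}

class PrimaryOnlyDelete {
    static void deleteOnPrimaryOnly(List<SnapStoreRef> storeRefs, long snapshotId) {
        // drop only the primary-storage reference; the secondary copy survives
        storeRefs.removeIf(r -> r.snapshotId == snapshotId
                                && "Primary".equals(r.storeRole));
    }
}
```

In the real codebase the deletion would of course also issue the hypervisor-side delete; the sketch only captures which rows should and should not disappear.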

@mike-tutkowski
Member

mike-tutkowski commented Aug 11, 2018 via email

@rafaelweingartner
Member

rafaelweingartner commented Aug 13, 2018

Hello @mike-tutkowski, I am not able to reproduce the problem you described. I do think though that the code can be improved (I will explain it).

Let's see an example. If I create a VM (vm-1) and then create two snapshots of this VM's root volume, I will have the following entries in the snapshot_store_ref table:


| id  | store_id | snapshot_id | store_role | install_path                                              | state | physical_size | size        |
|-----|----------|-------------|------------|-----------------------------------------------------------|-------|---------------|-------------|
| 422 | 2        | 253         | Image      | snapshots/2/7345/134a2524-72b6-4cfd-a802-89933fe63407.vhd | Ready | 1758786048    | 21474836480 |
| 423 | 5        | 254         | Primary    | 3037a7ed-5c93-45da-bc73-454ad21df793                      | Ready | 21474836480   | 21474836480 |
| 424 | 2        | 254         | Image      | snapshots/2/7345/5ddf9f65-16e7-4edd-9878-4f18ad26137b.vhd | Ready | 113508864     | 21474836480 |

Then, when I migrate the volume from shared storage to local storage, the code we are discussing is triggered. The code is quite odd because it returns a list of SnapshotObject; each snapshot will have a reference there, and each object is created with SnapshotObject.getSnapshotObject(snapshot, store). This means there will be some SnapshotObject instances that reference a snapshot that is not on the primary storage. Then, when the delete method snapshotSrv.deleteSnapshot(info) is called, I get the following entries in the snapshot_store_ref table:


| id  | store_id | snapshot_id | store_role | install_path                                              | state | physical_size | size        |
|-----|----------|-------------|------------|-----------------------------------------------------------|-------|---------------|-------------|
| 422 | 2        | 253         | Image      | snapshots/2/7345/134a2524-72b6-4cfd-a802-89933fe63407.vhd | Ready | 1758786048    | 21474836480 |
| 424 | 2        | 254         | Image      | snapshots/2/7345/5ddf9f65-16e7-4edd-9878-4f18ad26137b.vhd | Ready | 113508864     | 21474836480 |

This is the expected behavior. I am not sure if this behavior can be different for other hypervisors. I am using XenServer and testing with the latest commit in master.

There is one thing we can do though. We can change the method snapshotFactory.getSnapshots to return only SnapshotObject objects that reference the snapshots in the current primary storage from which the volume is being migrated out of. What do you think?
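The proposed change to snapshotFactory.getSnapshots can be sketched as a toy model (hypothetical names, not the real SnapshotDataFactoryImpl): filter snapshot_store_ref rows by the requested store role, so a snapshot that only exists on secondary storage is never returned as if it were on primary.

```java
import java.util.*;
import java.util.stream.*;

// Toy model of getSnapshots(volumeId, DataStoreRole.Primary): only rows
// whose store_role matches the requested role are returned.
class StoreRefRow {
    final long snapshotId;
    final long volumeId;
    final String storeRole;
    StoreRefRow(long snapshotId, long volumeId, String storeRole) {
        this.snapshotId = snapshotId;
        this.volumeId = volumeId;
        this.storeRole = storeRole;
    }
}

class GetSnapshotsSketch {
    static List<StoreRefRow> getSnapshots(List<StoreRefRow> refs, long volumeId, String role) {
        return refs.stream()
                   .filter(r -> r.volumeId == volumeId && role.equals(r.storeRole))
                   .collect(Collectors.toList());
    }
}
```

With the role filter in place, the cleanup loop can only ever see (and delete) primary-side references.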

@mike-tutkowski
Member

I should have placed these comments here (in GitHub) and let them propagate to JIRA. I'm going to copy/paste the relevant comments I made in JIRA here:

Comment:

After updating master on my local repo, I can no longer reproduce the issue I described above. With 46c56ea, the workflow behaves as expected. It would seem there was an issue in a previous version of master, but that it no longer exists.

Comment:

With regards to your snapshot-delete question:

As far as I know, if you migrate a volume from one primary storage to another, there should never exist any primary storage-based snapshots of this volume after the migration has taken place. That being the case, I don't think you should get back primary storage-based snapshots from snapshotFactory.getSnapshots that are from different primary storages.

@mike-tutkowski
Member

This PR has been merged, but the corresponding ticket in JIRA is still marked as Open/Unresolved. I'm going to change that ticket to Closed/Fixed.

@rafaelweingartner
Member

As far as I know, if you migrate a volume from one primary storage to another, there should never exist any primary storage-based snapshots of this volume after the migration has taken place. That being the case, I don't think you should get back primary storage-based snapshots from snapshotFactory.getSnapshots that are from different primary storages

I do agree with you, but that is not how it is working right now. Should we fix that method then? I can open a PR tomorrow.

@mike-tutkowski
Member

mike-tutkowski commented Aug 14, 2018 via email

rafaelweingartner added a commit to rafaelweingartner/cloudstack that referenced this pull request Aug 15, 2018
…mary storage

While checking a problem with @mike-tutkowski at PR apache#1740, we noticed that the method `org.apache.cloudstack.storage.snapshot.SnapshotDataFactoryImpl.getSnapshots(long, DataStoreRole)` was not loading the snapshots in `snapshot_store_ref` table according to their storage role. Instead, it would return a list of all snapshots (even if they are not in the storage role sent as a parameter) saying that they are in the storage that was sent as parameter.
rafaelweingartner added a commit that referenced this pull request Aug 17, 2018
…torage (#2808)

* Fix the problem at #1740 when it loads all snapshots in the primary storage

While checking a problem with @mike-tutkowski at PR #1740, we noticed that the method `org.apache.cloudstack.storage.snapshot.SnapshotDataFactoryImpl.getSnapshots(long, DataStoreRole)` was not loading the snapshots in `snapshot_store_ref` table according to their storage role. Instead, it would return a list of all snapshots (even if they are not in the storage role sent as a parameter) saying that they are in the storage that was sent as parameter.

* Add unit test


10 participants