CLOUDSTACK-9572 Snapshot on primary storage not cleaned up after Stor…#1740
Conversation
Why is it expected that latest snapshot be present on the primary storage after it has been copied? Shouldn't it be deleted? |
|
@yvsubhash can you fix the conflicts and rebase against latest 4.8/4.9. Thanks. |
6639161 to
b704cef
Compare
jburwell
left a comment
@yvsubhash in addition to my review comments, please add Marvin test cases to verify the behavior of this fix.
SnapshotInfo info = snapshotFactory.getSnapshot(snapshot.getId(), DataStoreRole.Primary);
try {
    snapshotSrv.deleteSnapshot(info);
} catch (CloudRuntimeException e) {
This local variable is only used once. Please consider collapsing it into line 1122. Also, please include the exception's message in the log message to provide greater detail for debugging efforts.
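A minimal, self-contained sketch of the reviewer's suggestion: inline the single-use local and surface the exception's message in the log. All class and method names here are illustrative stand-ins, not CloudStack's actual API.

```java
// Hypothetical sketch: collapse a single-use local variable and include
// the exception message in the logged message for easier debugging.
public class DeleteSnapshotSketch {
    static String lastLog; // stand-in for a logger

    // Stand-in for snapshotFactory.getSnapshot(id, DataStoreRole.Primary)
    static String fetchSnapshotInfo(long id) { return "snapshot-" + id; }

    // Stand-in for snapshotSrv.deleteSnapshot(info); fails for demonstration
    static void deleteSnapshot(String info) {
        throw new RuntimeException("backing store unreachable");
    }

    static void cleanup(long snapshotId) {
        try {
            // Before: SnapshotInfo info = fetch(...); delete(info);
            // After: the call is inlined since the local was used only once.
            deleteSnapshot(fetchSnapshotInfo(snapshotId));
        } catch (RuntimeException e) {
            // Appending e.getMessage() tells operators *why* the delete failed.
            lastLog = "Failed to delete snapshot " + snapshotId
                    + " on primary storage: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        cleanup(42L);
        // → Failed to delete snapshot 42 on primary storage: backing store unreachable
        System.out.println(lastLog);
    }
}
```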
@Override
public void cleanupSnapshotsByVolume(Long volumeId) {
    List<SnapshotVO> volSnapShots = _snapshotDao.listByVolumeId(volumeId);
    for (SnapshotVO snapshot : volSnapShots) {
This appears to be an application-side join. Please consider creating a new query that retrieves all snapshot info instances associated with volumeId, to reduce load on the database and simplify this method.
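To illustrate the reviewer's point, here is a sketch using an in-memory table as a stand-in for the DAO layer: instead of listing snapshots by volume and then issuing one lookup per snapshot (an application-side join), the filtering is pushed into a single query that returns everything in one round trip. Names and data are illustrative, not CloudStack's actual schema.

```java
import java.util.List;
import java.util.stream.Collectors;

public class JoinSketch {
    // Stand-in for a row joining cloud.snapshots and snapshot_store_ref
    record SnapshotRow(long id, long volumeId, String role) {}

    static final List<SnapshotRow> TABLE = List.of(
            new SnapshotRow(1, 10, "Primary"),
            new SnapshotRow(2, 10, "Image"),     // Image = secondary storage
            new SnapshotRow(3, 11, "Primary"));

    // Single-query equivalent of:
    //   SELECT * FROM snapshots s JOIN snapshot_store_ref r ON ...
    //   WHERE s.volume_id = ? AND r.role = 'Primary'
    // rather than one listByVolumeId() call plus one getSnapshot() per row.
    static List<SnapshotRow> primarySnapshotsForVolume(long volumeId) {
        return TABLE.stream()
                .filter(r -> r.volumeId() == volumeId && r.role().equals("Primary"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(primarySnapshotsForVolume(10).size()); // → 1
    }
}
```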
|
@blueorangutan package |
|
@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. |
|
Packaging result: ✔centos6 ✔centos7 ✔debian. JID-183 |
|
@yvsubhash can you reply on raised review comments? |
13820fd to
45d65ca
Compare
45d65ca to
98ba338
Compare
98ba338 to
9f3c11d
Compare
9f3c11d to
41a9eca
Compare
ACS CI BVT Run Summary: Link to logs folder (search by build_no): https://www.dropbox.com/sh/yj3wnzbceo9uef2/AAB6u-Iap-xztdm6jHX9SjPja?dl=0 Failed tests: Skipped tests: Passed test suites: |
|
@syed |
|
Thanks @yvsubhash Did the code review and it LGTM |
|
In ACS, all snapshots are deleted from primary storage as soon as they are successfully backed up and copied to secondary storage, hence this issue is not valid for ACS. However, a ticket should be raised to analyze and understand how incremental snapshots work when not a single copy is kept on primary. |
|
Closing this PR as it is no longer needed |
|
The following ticket is opened for fixing this issue |
41a9eca to
93ac40a
Compare
64448c3 to
1e319da
Compare
|
LGTM for Testing
|
1e319da to
f86fd9f
Compare
|
@blueorangutan package |
|
@rhtyd I tested this fix in my dev environment and it solved the described problem. |
|
@blueorangutan test centos7 xenserver-65sp1 |
|
@rhtyd a Trillian-Jenkins test job (centos7 mgmt + xenserver-65sp1) has been kicked to run smoke tests |
|
Trillian test result (tid-1995)
|
|
Merging this based on test results and code reviews. |
|
I have noticed that not only is the snapshot on primary storage deleted when cleanupSnapshotsByVolume is invoked, but so is the copy of it that's on secondary storage. Can anyone else reproduce this? |
|
In fact, one of my tests fails unless I comment out the cleanupSnapshotsByVolume method. The test checks that a volume migrated from one primary storage to another is still associated with its snapshot on secondary storage. As it turns out, when cleanupSnapshotsByVolume is invoked, the snapshot is completely deleted. |
|
OK, I have been able to reproduce this in a simple environment using only NFS. I create a VM whose root disk resides on an NFS primary storage. I take a snapshot of this root disk. I then migrate the root disk from one NFS primary storage to another. The volume is no longer associated with the snapshot, and the removed column of the snapshot's row in the cloud.snapshots table has a date in it. |
|
@mike-tutkowski are you sure this problem is caused by this PR? |
|
From what it looks like, the method is called upon success in at least two places: |
|
The code to retrieve only primary-storage snapshots is OK. The problem is when the snapshot is given to the Snapshot Service (snapshotSrv.deleteSnapshot(info);). At some point, not only does the snapshot on the primary storage get deleted, but so does the snapshot on secondary storage. When I comment out the code in snapshotMgr.cleanupSnapshotsByVolume, everything works the way it used to (meaning the snapshot on the original primary storage is stranded, but the volume is still associated with the correct snapshot on secondary storage). |
|
You are right. That code seems to be the other way around. I mean, we should clean up when there is a failure, right? At least that is what I understood from the author's description. |
|
The intention of the code is OK. A volume is being moved from one primary storage to another. If it had any snapshots on the source primary storage, we do not bring them along to the target primary storage. Those (primary-side) snapshots only helped in understanding the delta of changes that needed to be copied to secondary storage in the event that another snapshot was taken. Once we move the volume to its new primary storage, we lose knowledge of this delta information (which is OK). If the user asks for a new snapshot of the volume in question, we must take a full snapshot of that volume (but we should leave a new delta snapshot on this primary storage in case yet another snapshot is requested later). The intent of the code is to remove the unused (delta) snapshot that resided on the source primary storage because it is no longer needed. Its mistake is that it not only removes that delta snapshot, but also removes the entirety of the snapshot (both what might have been on primary storage and what is on secondary storage). This leads to data loss for the customer. |
|
But I should also point out that you are correct in saying that in the event of a failed attempt to create a new snapshot, we should remove the (delta) snapshot from primary storage (as it might act as a false delta if we need to take a future snapshot). |
|
Thanks for your details @mike-tutkowski. Can you help me understand where the problem is? Is the problem in snapshotSrv.deleteSnapshot(info), which deletes the snapshot from secondary storage as well, and not only from primary storage? Or is it that the code the PR introduces uses the method snapshotSrv.deleteSnapshot in an incorrect manner? |
|
I am not in front of the code right now, but I think the root problem is that we do not currently have a mechanism which allows us to delete the (delta) snapshot on primary storage without removing what it corresponds to on secondary storage.
On Aug 11, 2018, at 9:49 AM, Rafael Weingärtner wrote:
Thanks for your details @mike-tutkowski. Can you help me understand where the problem is?
Is the problem in snapshotSrv.deleteSnapshot(info) that is deleting the snapshot from the secondary storage as well, and not only in the primary storage? Or is it the code that the PR is introducing that is using the method snapshotSrv.deleteSnapshot in an incorrect manner?
|
|
Ok, so what this means is that the author, when creating the PR, used snapshotSrv.deleteSnapshot thinking that it would only remove the snapshot in the hypervisor (primary storage). However, that method destroys the snapshot in ACS, which means removing it in the DB and then from both storage systems (primary and secondary). Having said that, what we need is to create a method that removes snapshots from the primary storage only, right? Then, instead of using snapshotSrv.deleteSnapshot, we use this new method. |
|
Yeah, we basically need to send the DeleteSnapshot command to the hypervisor and then remove the corresponding row from cloud.snapshot_store_ref. We then want to skip the remaining logic that deletes the snapshot from secondary storage and performs the corresponding DB updates.
On Aug 11, 2018, at 10:11 AM, Rafael Weingärtner wrote:
Ok, so what this means is that the author, when creating the PR, used snapshotSrv.deleteSnapshot thinking that it would only remove the snapshot in the hypervisor (primary storage). However, that method destroys the snapshot in ACS, which means removing it in the DB and then from both storage systems (primary and secondary).
Having said that, what we need is to create a method that removes snapshots from the primary storage only, right? Then, instead of using snapshotSrv.deleteSnapshot, we use this new method.
|
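A rough, self-contained sketch of the mechanism Mike describes: issue the delete on the hypervisor (primary storage) and drop only the snapshot's Primary-role row in snapshot_store_ref, leaving the secondary-storage (Image) row and the snapshots row untouched. All names are hypothetical stand-ins, not CloudStack's actual API.

```java
import java.util.ArrayList;
import java.util.List;

public class PrimaryOnlyDeleteSketch {
    // Stand-in for a row in cloud.snapshot_store_ref
    record StoreRef(long snapshotId, String role) {}

    static final List<StoreRef> SNAPSHOT_STORE_REF = new ArrayList<>(List.of(
            new StoreRef(7, "Primary"),
            new StoreRef(7, "Image")));  // Image = the secondary-storage copy

    // Stand-in for sending the DeleteSnapshot command to the hypervisor
    static boolean sendDeleteToHypervisor(long snapshotId) {
        return true;
    }

    static void deleteSnapshotOnPrimaryOnly(long snapshotId) {
        if (sendDeleteToHypervisor(snapshotId)) {
            // Remove only the Primary-role row; the Image row must survive
            // so the volume stays associated with its secondary snapshot.
            SNAPSHOT_STORE_REF.removeIf(
                    r -> r.snapshotId() == snapshotId && r.role().equals("Primary"));
        }
    }

    public static void main(String[] args) {
        deleteSnapshotOnPrimaryOnly(7L);
        System.out.println(SNAPSHOT_STORE_REF); // only the Image row remains
    }
}
```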
|
Hello @mike-tutkowski, I am not able to reproduce the problem you described. I do think, though, that the code can be improved (I will explain). Let's see an example: if I create a VM (vm-1) and then create two snapshots of this VM's root volume, I will have the following entries in |
Then, when I migrate the volume from shared storage to local storage, the code we are discussing is triggered. The code is quite odd because it returns a list of
This is the expected behavior. I am not sure if this behavior can be different for other hypervisors. I am using XenServer and testing with the latest commit in master. There is one thing we can do though. We can change the method |
|
I should have placed these comments here (in GitHub) and let them propagate to JIRA. I'm going to copy/paste the relevant comments I made in JIRA here:

Comment: After updating master on my local repo, I can no longer reproduce the issue I described above. With 46c56ea, the workflow behaves as expected. It would seem there was an issue in a previous version of master, but that it no longer exists.

Comment: With regards to your snapshot-delete question: As far as I know, if you migrate a volume from one primary storage to another, there should never exist any primary storage-based snapshots of this volume after the migration has taken place. That being the case, I don't think you should get back primary storage-based snapshots from snapshotFactory.getSnapshots that are from different primary storages. |
|
This PR has been merged, but the corresponding ticket in JIRA is still marked as Open/Unresolved. I'm going to change that ticket to Closed/Fixed. |
I do agree with you, but that is not how it is working right now. Should we fix that method then? I can open a PR tomorrow then. |
|
Sure, open a PR and feel free to tag me when you would like a review. Thanks!
On Aug 13, 2018, at 6:41 PM, Rafael Weingärtner wrote:
As far as I know, if you migrate a volume from one primary storage to another, there should never exist any primary storage-based snapshots of this volume after the migration has taken place. That being the case, I don't think you should get back primary storage-based snapshots from snapshotFactory.getSnapshots that are from different primary storages
I do agree with you, but that is not how it is working right now. Should we fix that method then? I can open a PR tomorrow then.
|
…torage (#2808)

* Fix the problem at #1740 when it loads all snapshots in the primary storage

While checking a problem with @mike-tutkowski at PR #1740, we noticed that the method `org.apache.cloudstack.storage.snapshot.SnapshotDataFactoryImpl.getSnapshots(long, DataStoreRole)` was not loading the snapshots in the `snapshot_store_ref` table according to their storage role. Instead, it would return a list of all snapshots (even those not in the storage role sent as a parameter), reporting them as being in the storage role that was sent as the parameter.

* Add unit test
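The commit message above can be illustrated with a small self-contained sketch: getSnapshots(volumeId, role) should return only the snapshot_store_ref entries whose store role matches the requested role, rather than returning every entry relabeled with the requested role. Names and data here are illustrative stand-ins, not CloudStack's actual implementation.

```java
import java.util.List;
import java.util.stream.Collectors;

public class RoleFilterSketch {
    // Stand-in for a row in cloud.snapshot_store_ref
    record StoreRef(long snapshotId, long volumeId, String role) {}

    static final List<StoreRef> SNAPSHOT_STORE_REF = List.of(
            new StoreRef(1, 10, "Primary"),
            new StoreRef(1, 10, "Image"),   // Image = secondary storage
            new StoreRef(2, 10, "Image"));

    // Fixed behavior: filter by both the volume and the requested storage
    // role, instead of returning all rows as if they had that role.
    static List<StoreRef> getSnapshots(long volumeId, String role) {
        return SNAPSHOT_STORE_REF.stream()
                .filter(r -> r.volumeId() == volumeId && r.role().equals(role))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(getSnapshots(10, "Primary").size()); // → 1
        System.out.println(getSnapshots(10, "Image").size());   // → 2
    }
}
```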
Snapshot on primary storage not cleaned up after Storage migration. This happens in the following scenario
Steps To Reproduce