Orphaned disks in OVM and what to do with them

Some time ago I was doing maintenance on an OVM cluster and noticed that it had a significant number of disks without a mapping to any virtual machine (I should mention that the cluster was home to more than 400 VMs). With about 1800 virtual disks, it was easy to miss a few lost disks that were not mapped to any VM. Some of them had been created on purpose and possibly forgotten, but most looked like leftovers from automated deployments. I attached several of the disks to a test VM and checked the contents:
[root@vm129-132 ~]# fdisk -l /dev/xvdd
Disk /dev/xvdd: 3117 MB, 3117416448 bytes, 6088704 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[root@vm129-132 ~]# dd if=/dev/xvdd bs=512 count=100 | strings
100+0 records in
100+0 records out
51200 bytes (51 kB) copied, 0.00220462 s, 23.2 MB/s
[root@vm129-132 ~]#
I also checked other attributes of the disks from the OVM CLI:
OVM> show VirtualDisk id=0004fb0000120000d5e0235900f63355.img
Command: show VirtualDisk id=0004fb0000120000d5e0235900f63355.img
Status: Success
Time: 2017-05-19 09:11:26,664 PDT
Data:
Absolute Path = nfsserv:/data1/vms/ovs/crepo1/VirtualDisks/0004fb0000120000d5e0235900f63355.img
Mounted Path = /OVS/Repositories/0004fb0000030000998d2e73e5ec136a/VirtualDisks/0004fb0000120000d5e0235900f63355.img
Max (GiB) = 2.9
Used (GiB) = 0.0
Shareable = Yes
Repository Id = 0004fb0000030000998d2e73e5ec136a [crepo1]
Id = 0004fb0000120000d5e0235900f63355.img [6F4dKi9hT0cYW_db_asm_disk_0 (21)]
Name = 6F4dKi9hT0cYW_db_asm_disk_0 (21)
Locked = false
DeprecatedAttrs = [Assembly Virtual Disk]
OVM>
The disk was completely empty and, judging by the name and one of the deprecated attributes, it was clearly a leftover from a deployed assembly. I remembered an issue from the past where shared disks were not deleted if you used one of the assemblies for Oracle RAC deployed and deleted through the Oracle Enterprise Manager Self Service Portal (OEM SS). It was seen on OVM 3.2.x with OEM 12c: if two or more VMs worked with the same shared disks, those shared disks were not deleted when all the VMs and their local disks had been destroyed. The issue itself has been gone for a long time, but the lost disks were left behind.

I created a script to find all the disks without a mapping to any existing VM. The script is written in the expect language on top of the OVM SSH CLI. To run it, you need a connection to the OVM Manager over ssh on port 10000 and expect installed on your machine. I used one of the Oracle sample scripts as a starting point. Here is the script body:
#!/usr/bin/expect
set username [lindex $argv 0];
set password [lindex $argv 1];
set prompt "OVM> "
set timeout 3
log_user 0

# Connect to the OVM Manager CLI (adjust the address to your OVM Manager)
spawn ssh -l $username 10.177.0.101 -p 10000
expect_after eof {exit 0}

## interact with SSH
expect "yes/no" {send "yes\r"}
expect "password:" {send "$password\r"}

#################### Get the full list of virtual disks ##################
expect "OVM> "
set timeout 20
match_max 100000
log_user 0
send "list virtualdisk\r"
expect "OVM> "
set resultdata $expect_out(buffer)
set resultlength [string length $resultdata]
set idindex 0
set id ""
set done 0

# Walk through the "list virtualdisk" output and check every disk for a VM mapping
while {$done != 1} {
    set idindex [string first "id:" $resultdata]
    set nameindex [string first "name:" $resultdata]
    if {$idindex != -1 && $nameindex != -1 && $idindex < $nameindex} {
        set id [string range $resultdata [expr {$idindex+3}] [expr {$nameindex-3}]]
        send "show VirtualDisk id='$id'\r"
        expect "OVM> "
        set getVirtualDiskInfo $expect_out(buffer)
        set getVirtualDiskInfoLength [string length $getVirtualDiskInfo]
        set getVirtualDiskInfoIndex 0
        set getVirtualDiskInfoMapping ""
        set doneProcessingVirtualDisk 0
        while {$doneProcessingVirtualDisk != 1} {
            # A mapped disk has at least one VmDiskMapping entry in its "show" output
            set getVirtualDiskInfoIndex [string first "VmDiskMapping" $getVirtualDiskInfo]
            if {$getVirtualDiskInfoIndex != -1} {
                puts "Disk with mapping:'$id'\r"
                set doneProcessingVirtualDisk 1
            } else {
                puts "Disk without mapping:'$id'\r"
                set doneProcessingVirtualDisk 1
            }
        }
        # Move past the current "name:" token and continue with the next disk
        set resultdata [string range $resultdata [expr {$nameindex+1}] $resultlength]
        set resultlength [string length $resultdata]
    } else {
        set done 1
    }
}
log_user 1
expect "OVM> "
send "exit\r"
You can see the script is simple enough and doesn't take long to write. I redirected the output of the script to a file in order to analyze it.
[oracle@vm129-132 ~]$ ./dsk_inventory admin password >dsk_inventory.out
[oracle@vm129-132 ~]$ wc -l dsk_inventory.out
1836 dsk_inventory.out
[oracle@vm129-132 ~]$ grep "Disk without mapping" dsk_inventory.out | wc -l
482
[oracle@vm129-132 ~]$
As you can see, I had 482 orphaned disks out of 1836. That was more than 25% of all disks, and it was not only wasting space but also having a significant impact on interface performance: every time you tried to add, modify or delete a disk through OEM SS, there was a long pause while the disk information was retrieved. I decided to delete all those disks using the same script, with just a couple of lines added to delete a disk if it didn't have a mapping. Here is the modified section of the script:
while {$doneProcessingVirtualDisk != 1} {
    set getVirtualDiskInfoIndex [string first "VmDiskMapping" $getVirtualDiskInfo]
    if {$getVirtualDiskInfoIndex != -1} {
        puts "Disk with mapping:'$id'\r"
        set doneProcessingVirtualDisk 1
    } else {
        puts "Disk without mapping:'$id'\r"
        # No mapping to any VM: ask OVM Manager to delete the disk
        send "delete VirtualDisk id='$id'\r"
        expect "OVM> "
        set doneProcessingVirtualDisk 1
    }
}
The changes were minimal: the script now sends a "delete" command to OVM if a disk doesn't have any mapping. Of course, if you want to exclude certain disks, you should add more conditions with "if" using the disk ids to prevent them from being deleted (a sketch of such a check follows the error output below). And it is safe, since you are using an approved standard interface that will not let you delete a disk with an active mapping to a VM. If you try to delete a disk with an active mapping, you get an error:
OVM> delete VirtualDisk id=0004fb0000120000493379bb12928c33.img
Command: delete VirtualDisk id=0004fb0000120000493379bb12928c33.img
Status: Failure
Time: 2017-05-19 09:28:13,046 PDT
JobId: 1495211292856
Error Msg: Job failed on Core: OVMRU_002018E crepo1 - Cannot delete virtual device 6F4dKi9hT0cYW_crs_asm_disk_1 (23), it is still in use by [DLTEST0:vm129-132 ]. [Fri May 19 09:28:12 PDT 2017]
OVM>
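For reference, the exclusion check mentioned above could look roughly like the sketch below. It is only a sketch: the excludeList variable and the example disk ids are hypothetical placeholders, and the list itself would normally be defined near the top of the script rather than inside the loop.
# Hypothetical list of disk ids that must never be deleted, even if unmapped
set excludeList [list "0004fb0000120000aaaaaaaaaaaaaaaa.img" \
                      "0004fb0000120000bbbbbbbbbbbbbbbb.img"]

while {$doneProcessingVirtualDisk != 1} {
    set getVirtualDiskInfoIndex [string first "VmDiskMapping" $getVirtualDiskInfo]
    if {$getVirtualDiskInfoIndex != -1} {
        puts "Disk with mapping:'$id'\r"
        set doneProcessingVirtualDisk 1
    } else {
        if {[lsearch -exact $excludeList [string trim $id]] != -1} {
            # Unmapped but on the exclusion list: report it and leave it alone
            puts "Disk without mapping (kept):'$id'\r"
        } else {
            puts "Disk without mapping:'$id'\r"
            send "delete VirtualDisk id='$id'\r"
            expect "OVM> "
        }
        set doneProcessingVirtualDisk 1
    }
}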
I ran my script, deleted all the unmapped disks, and repeated the inventory script to verify the results. I found a couple of disks that had not been deleted.
[oracle@vm129-132 ~]$ ./del_orph_dsk admin password > del_dsk_log.out
[oracle@vm129-132 ~]$ ./dsk_inventory admin password >dsk_inventory_after.out
[oracle@vm129-132 ~]$ wc -l dsk_inventory_after.out
1356 dsk_inventory_after.out
[oracle@vm129-132 ~]$ grep "Disk without mapping" dsk_inventory_after.out | wc -l
2
[oracle@vm129-132 ~]$ grep "Disk without mapping" dsk_inventory_after.out
Disk without mapping:0004fb0000120000a2d31cc7ef0c2d86.img
Disk without mapping:0004fb0000120000da746f417f5a0481.img
[oracle@vm129-132 ~]$
It appeared that these disks no longer had any backing files on the repository filesystem. It looked like the files had been lost some time ago, due to a bug or perhaps some past issue on the file system.
OVM> show VirtualDisk id=0004fb0000120000a2d31cc7ef0c2d86.img
Command: show VirtualDisk id=0004fb0000120000a2d31cc7ef0c2d86.img
Status: Success
Time: 2017-05-19 12:35:13,383 PDT
Data:
Absolute Path = nfsserv:/data1/vms/ovs/crepo1/VirtualDisks/0004fb0000120000a2d31cc7ef0c2d86.img
Mounted Path = /OVS/Repositories/0004fb0000030000998d2e73e5ec136a/VirtualDisks/0004fb0000120000a2d31cc7ef0c2d86.img
Max (GiB) = 40.0
Used (GiB) = 22.19
Shareable = No
Repository Id = 0004fb0000030000998d2e73e5ec136a [crepo1]
Id = 0004fb0000120000a2d31cc7ef0c2d86.img [ovmcloudomsoh (3)]
Name = ovmcloudomsoh (3)
Locked = false
DeprecatedAttrs = [Assembly Virtual Disk]
OVM> delete VirtualDisk id=0004fb0000120000a2d31cc7ef0c2d86.img
Command: delete VirtualDisk id=0004fb0000120000a2d31cc7ef0c2d86.img
Status: Failure
Time: 2017-05-19 12:36:39,479 PDT
JobId: 1495222598733
Error Msg: Job failed on Core: OVMAPI_6000E Internal Error: OVMAPI_5001E Job: 1495222598733/Delete Virtual Disk: ovmcloudomsoh (3) from Repository: crepo1/Delete Virtual Disk: ovmcloudomsoh (3) from Repository: crepo1, failed. Job Failure Event: 1495222599299/Server Async Command Failed/OVMEVT_00C014D_001 Async command failed on server: vms01.dlab.pythian.com. Object: ovmcloudomsoh (3), PID: 27092,
Server error: [Errno 2] No such file or directory: '/OVS/Repositories/0004fb0000030000998d2e73e5ec136a/VirtualDisks/0004fb0000120000a2d31cc7ef0c2d86.img'
, on server: vms01.dlab.pythian.com, associated with object: 0004fb0000120000a2d31cc7ef0c2d86.img [Fri May 19 12:36:39 PDT 2017]
OVM>
So, we had information about the disks in the repository database but not the disk files themselves. To make the repository consistent, I created empty files with the same names as the nonexistent virtual disks and then deleted the disks through the OVM CLI.
root@nfsserv:~# ll /data1/vms/ovs/crepo1/VirtualDisks/0004fb0000120000da746f417f5a0481.img
/data1/vms/ovs/crepo1/VirtualDisks/0004fb0000120000da746f417f5a0481.img: No such file or directory
root@nfsserv:~# touch /data1/vms/ovs/crepo1/VirtualDisks/0004fb0000120000da746f417f5a0481.img
root@nfsserv:~#
OVM> delete VirtualDisk id=0004fb0000120000da746f417f5a0481.img
Command: delete VirtualDisk id=0004fb0000120000da746f417f5a0481.img
Status: Success
Time: 2017-05-23 07:41:43,195 PDT
JobId: 1495550499971
OVM>
I think it is worth checking from time to time whether you have any disks without a mapping to any VM, especially if your environment has a considerable number of disks and a long history of upgrades, updates and heavy user activity. And now a couple of words about the OVM CLI and using the "expect" language for scripting... As you can see, the combination provides a good way to automate your daily routine maintenance on OVM. It would take ages to find and clear all those disks manually through the GUI.