VM cloning, udev and network interfaces on KVM

In my previous post on setting up Oracle RAC on KVM, I mentioned problems with network interfaces and udev after cloning a Scientific Linux (SL in the following) machine. This is just a short follow-up describing how to successfully clone an SL 6.3 machine on KVM.

Firstly, cloning a VM on KVM can be done using virt-manager or virt-clone (or even manually, see http://rwmj.wordpress.com/2010/09/24/tip-my-procedure-for-cloning-a-fedora-vm). Choosing the virt-manager method for an SL 6.3 VM that also acts as an iSCSI client, however, I got the following error message:

Error setting clone parameters: Could not use path '/dev/disk/by-path/ip-192.168.100.2:3260-iscsi-iqn.2012-09.com.skyrac:t1-lun-15-clone' for cloning: iSCSI volume creation is not supported.

Fortunately, the error can be circumvented by temporarily stopping the iSCSI pool. Virt-manager complains about the respective disks not existing, but will perform the clone correctly.
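
In virsh terms (the pool name iscsipool is just an example – use whatever your iSCSI pool is called), this amounts to:

virsh pool-destroy iscsipool    # deactivates the pool, without deleting its definition
# ... perform the clone in virt-manager ...
virsh pool-start iscsipool      # reactivate the pool afterwards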

Just as an aside regarding our topic of network interfaces: by default, virt-manager will automatically choose new MAC addresses for the clone’s interfaces.

Now after cloning, restart the iSCSI pool and boot the VM in runlevel 1. Prior to any changes,

ip link show

will show a list of interfaces eth<n+1>, eth<n+2> etc., where eth<n> is the highest-numbered interface from the clone source.

Checking for udev rules, it turns out

/etc/udev/rules.d/70-persistent-net.rules

still has the clone source’s MAC addresses (which don’t exist on the clone target), but there is a file

/dev/.udev/tmp-rules--70-persistent-net.rules

that creates the aforementioned eth<n+1> etc. interfaces.

Now it is sufficient to edit /etc/udev/rules.d/70-persistent-net.rules, entering the new MAC addresses, and to remove the temporary file /dev/.udev/tmp-rules--70-persistent-net.rules. On reboot, the interfaces will have the expected eth0 to eth<n> names.
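
For illustration, an entry in 70-persistent-net.rules on SL 6 looks like this (the MAC address shown is a placeholder; put the clone’s new one here):

# /etc/udev/rules.d/70-persistent-net.rules (one line per interface)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="52:54:00:xx:xx:xx", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"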

Of course, on Scientific Linux or other Red Hat-like systems, it is also necessary to edit the /etc/sysconfig/network-scripts/ifcfg-eth<n> configuration files for the hardware and IP addresses, and /etc/sysconfig/network for the hostname. And that’s it!


RAC on a laptop, using KVM and libvirt

There are already some guides on the net on how to run RAC on a laptop, but as far as I know they all use either VirtualBox or VMware for virtualization – at least there’s none using KVM that I know of. So I thought I’d add my experiences using KVM and the libvirt library. I don’t intend this to be a howto – there are still open questions, quirks and workarounds in my setup. Notwithstanding that, perhaps, by simply giving an account of what I’ve done, I can spare others a little time and frustration… So, let’s start.

Preparing the host

The host OS, in my case, is Fedora 17, running on an Intel I7 quadcore with hyperthreading and hardware virtualization, and 16G of memory.
In addition to the basic libvirt and KVM packages (plus the graphical interface, virt-manager), it makes sense to install the libguestfs tools. Libguestfs provides commands like virt-filesystems and virt-df to display information on the guests’ physical and logical volumes, and the extremely useful virt-resize, which you can use to enlarge a guest’s disk on the host while, at the same time, enlarging the physical volume inside the guest.

yum install qemu-kvm libvirt python-virtinst virt-manager
yum install *guestf*
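
Just to illustrate typical usage (the guest name sl58.1 is the one used below; the image file names and the 10G increase are made up), it looks roughly like this:

virt-df -d sl58.1 -h                          # free/used space per guest filesystem
virt-filesystems -d sl58.1 --all --long -h    # partitions, PVs, VGs and LVs of the guest
truncate -r sl58.1.img sl58.1-new.img         # create an output image of the same size ...
truncate -s +10G sl58.1-new.img               # ... then grow it by 10G
virt-resize --expand /dev/sda2 sl58.1.img sl58.1-new.img   # copy, expanding /dev/sda2 into the new space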

Guests overview

In my setup, there are 3 guests: 2 cluster nodes and a third machine combining the functions of DNS server and iSCSI target, providing the shared storage.
The cluster nodes are running Scientific Linux (SL) 5.8, the storage/DNS server SL 6.3. (I plan to set up a 6.3 cluster next, but for starters it seemed to make sense to be able to use asmlib…).
The easiest way to create a guest is using virt-manager – which is what I did. However, to check on and later modify the configurations, I’ve used virsh dumpxml and virsh edit, and I’ll refer to extracts of the XML dumps in the following.
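
For reference, that is all of:

virsh dumpxml sl58.1 > sl58.1.xml   # dump the complete guest definition to a file
virsh edit sl58.1                   # open the definition in $EDITOR; libvirt validates it on save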

Defining the networks

In virt-manager, prior to installing the guests, you need to define the networks. By default, every guest will be part of the default network, which uses NAT and DHCP to provide the guest IPs. Through this network, the guests will be able to access the internet and the host, but not each other.
So, two additional host-only networks will be created (called isolated networks in virt-manager). Each gets its own host bridge but no NAT, so guests on these networks can communicate with each other and with the host, but not with the outside world.

This is how one of these isolated networks, priv0 (the one used for the private interconnect, 172.16.0.x), looks in the resulting configuration:

[root@host ~]# virsh net-dumpxml priv0
<network>
<name>priv0</name>
<uuid>ee190ff5-174e-450d-16e1-65372a309dfc</uuid>
<bridge name='virbr1' stp='on' delay='0' />
<mac address='52:54:00:A2:FA:79'/>
<ip address='172.16.0.1' netmask='255.255.255.0'>
</ip>
</network>

The second isolated network, which carries the cluster’s public traffic (192.168.100.x), looks basically the same (of course, the MAC address and IP differ) – from the libvirt point of view, both are of the same kind: isolated, that is bridged but non-NATted, networks.
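
For completeness, the definition of that second network looks roughly like this (the name pub0 and the bridge name are made up, and the UUID/MAC elements are omitted; 192.168.100.1 matches the gateway the guests use below):

<network>
<name>pub0</name>
<bridge name='virbr2' stp='on' delay='0' />
<ip address='192.168.100.1' netmask='255.255.255.0'>
</ip>
</network>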

Installing the guests

When creating the cluster nodes (yes, the plural is unfortunately not a typo – I didn’t manage to clone the first node successfully, so I had to install the second one the same way; see the Open questions section below), you need to add two additional interfaces in virt-manager, one for each of the isolated networks.
This is how one of them, priv0, will look in the guest configuration file:

[root@host ~]# virsh dumpxml sl58.1 | xpath /domain/devices/interface[3]
Found 1 nodes:
-- NODE --
<interface type="network">
<mac address="52:54:00:44:02:a3" />
<source network="priv0" />
<target dev="vnet4" />
<model type="virtio" />
<alias name="net2" />
<address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x0" />
</interface>

The interfaces will be recognized by SL during the installation process. Once installation is complete, one should disable NetworkManager in the guest so it cannot overwrite the network configuration (see the commands after the listings below). For reference, this is how my ifcfg-eth* scripts look for the three interfaces on the first cluster node:

# Virtio Network Device
DEVICE=eth0
BOOTPROTO=dhcp
HWADDR=52:54:00:7F:F7:94
ONBOOT=yes
DHCP_HOSTNAME=node1.skyrac.com
TYPE=Ethernet

# Virtio Network Device
DEVICE=eth1
BOOTPROTO=none
HWADDR=52:54:00:83:53:b8
ONBOOT=yes
NETMASK=255.255.255.0
IPADDR=192.168.100.10
GATEWAY=192.168.100.1
TYPE=Ethernet
USERCTL=no
IPV6INIT=no
PEERDNS=yes

# Virtio Network Device
DEVICE=eth2
BOOTPROTO=none
HWADDR=52:54:00:44:02:a3
ONBOOT=yes
NETMASK=255.255.255.0
IPADDR=172.16.0.10
GATEWAY=172.16.0.1
TYPE=Ethernet
USERCTL=no
IPV6INIT=no
PEERDNS=yes
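
Disabling NetworkManager, as mentioned above, is just the usual pair of commands:

chkconfig NetworkManager off    # don't start it at boot
service NetworkManager stop     # and stop it now; the classic network service handles the interfaces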

For the iSCSI target & DNS server, just one additional interface has to be added, the one for the public network. So there will be just eth0 and eth1 on that VM.

Setting up the DNS

For DNS, I’ve installed bind on the combined DNS & storage node, stor1.skyrac.com.
In /var/named/, four zone files need to be added, one each for the cluster domain, localhost, cluster domain reverse lookup, and localhost reverse lookup.
For this, I basically followed the instructions in Steve Shaw’s and Martin Bach’s excellent “Pro Oracle RAC on Linux”. For quick reference, this is how my /var/named/master.skyrac.com and /var/named/192.168.100.rev look – between them they contain the nodes’ public IPs, the VIPs, and the SCAN IPs:

; /var/named/master.skyrac.com
$TTL    86400
@               IN SOA  stor1.skyrac.com. root.localhost. (
42              ; serial
3H              ; refresh
15M             ; retry
1W              ; expiry
1D )            ; minimum
@               IN NS   stor1.skyrac.com.
localhost       IN A    127.0.0.1
stor1           IN A    192.168.100.2
node1           IN A    192.168.100.10
node2           IN A    192.168.100.11
node1-vip       IN A    192.168.100.20
node2-vip       IN A    192.168.100.21
cluster1-scan   IN A    192.168.100.30
                IN A    192.168.100.31
                IN A    192.168.100.32

; /var/named/192.168.100.rev
$TTL    86400
@               IN SOA  stor1.skyrac.com. root.localhost. (
42              ; serial
3H              ; refresh
15M             ; retry
1W              ; expiry
1D )            ; minimum
@               IN NS   stor1.skyrac.com.
2               IN PTR  stor1.skyrac.com.
10              IN PTR  node1.skyrac.com.
11              IN PTR  node2.skyrac.com.
20              IN PTR  node1-vip.skyrac.com.
21              IN PTR  node2-vip.skyrac.com.

It’s important to check both hostname lookup and reverse lookup (using nslookup or dig). Errors in the reverse lookup zone file can lead to OUI complaining with “INS-40904 ORACLE_HOSTNAME does not resolve to a valid host name”, which is in fact a misleading error message, as you don’t have to set this environment variable at all once reverse lookup works.
One thing to pay attention to, for those new to bind like me: don’t forget the dot after the hostname. It signifies to bind that an FQDN has been defined (see, e.g., http://www.zytrax.com/books/dns/apa/dot.html).
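
A quick check against the DNS server could look like this (dig comes with bind-utils; the expected answers follow from the zone files above):

dig +short node1.skyrac.com @192.168.100.2            # expect 192.168.100.10
dig +short -x 192.168.100.20 @192.168.100.2           # expect node1-vip.skyrac.com.
dig +short cluster1-scan.skyrac.com @192.168.100.2    # expect the three SCAN addresses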

In /etc/named.conf, I’ve added zone blocks for these four zones, and defined stor1.skyrac.com to be the master DNS server for the skyrac.com domain and the corresponding 192.168.100.0/24 reverse zone. For records not in these zones, the server forwards to the host on the default network.

options
{
directory               "/var/named";
dump-file               "data/cache_dump.db";
statistics-file         "data/named_stats.txt";
memstatistics-file      "data/named_mem_stats.txt";
listen-on port 53       { 127.0.0.1; 192.168.100.2; };
allow-query { localhost; 192.168.100.0/24; };
recursion yes;
forwarders { 192.168.122.1; };
};

zone "skyrac.com" {
type master;
file "master.skyrac.com";
};
zone "localhost" {
type master;
file "master.localhost";
};
zone "100.168.192.in-addr.arpa" {
type master;
file "192.168.100.rev";
};
zone "0.0.127.in-addr.arpa" {
type master;
file "localhost.rev";
};
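
Before (re)starting named, the configuration and zone files can be validated with the checkers that ship with bind:

named-checkconf /etc/named.conf
named-checkzone skyrac.com /var/named/master.skyrac.com
named-checkzone 100.168.192.in-addr.arpa /var/named/192.168.100.rev
service named restart
chkconfig named on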

Finally, to make the nodes actually use it, stor1.skyrac.com is configured as the nameserver on the cluster nodes, in /etc/resolv.conf:

nameserver 192.168.100.2
search skyrac.com

Preparing the storage

On stor1.skyrac.com, scsi-target-utils (which provides tgtd and the tgt admin tools) is installed and the service is started:

[root@stor1 ~]# yum install scsi-target-utils
[root@stor1 ~]# service tgtd start
[root@stor1 ~]# chkconfig tgtd on

In order to define the LUNs, I need to provide logical devices on the storage server. For convenience, what I did was add a second disk in virt-manager, use it as a second physical volume in the VM, and create a second volume group from it. From that volume group, I then created one logical volume for every LUN I want to export.
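
A minimal sketch of that part, assuming the second disk shows up as /dev/vdb in the storage VM (the volume group and volume names match the targets.conf dump below; the size is arbitrary):

pvcreate /dev/vdb                 # initialize the second disk as a physical volume
vgcreate vg_stor /dev/vdb         # dedicated volume group for the LUN backing stores
lvcreate -L 5G -n vol1 vg_stor    # one logical volume per LUN (vol1 ... vol8)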

Prior to the LUNs themselves, I create a target:

[root@stor1 ~]# tgtadm --lld iscsi --mode target --op new --tid 1 --targetname iqn.2012-09.com.skyrac:t1

Then, the LUNs are bound to the target:

[root@stor1 ~]# tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 --backing-store <path of logical volume>
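
To check what has been defined so far, the target and its LUNs can be listed; depending on the setup, initiators may also have to be allowed explicitly (binding to ALL is fine for a lab):

[root@stor1 ~]# tgtadm --lld iscsi --mode target --op show                                  # targets, LUNs, ACLs
[root@stor1 ~]# tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL  # allow all initiators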

All LUNs defined, the configuration is written to /etc/tgt/targets.conf:

[root@stor1 ~]# tgt-admin --dump > /etc/tgt/targets.conf

This is how it looks in my case:

[root@stor1 ~]# less /etc/tgt/targets.conf
default-driver iscsi
<target iqn.2012-09.com.skyrac:t1>
backing-store /dev/vg_stor/vol1
backing-store /dev/vg_stor/vol2
backing-store /dev/vg_stor/vol3
backing-store /dev/vg_stor/vol4
backing-store /dev/vg_stor/vol5
backing-store /dev/vg_stor/vol6
backing-store /dev/vg_stor/vol7
backing-store /dev/vg_stor/vol8
</target>

On the clients, having installed iscsi-initiator-utils, I can check whether discovery of the target works:

iscsiadm --mode discovery --type sendtargets --portal 192.168.100.2
192.168.100.2:3260,1 iqn.2012-09.com.skyrac:t1

Finally, I’ve added the new storage pool in virt-manager, and then added the LUNs as additional storage to the client VMs. Actually, the latter step proved tedious and didn’t quite work in virt-manager, so the easier way was to add one LUN in virt-manager and then, using virsh edit, define the others based on that example (a sketch of the pool definition itself follows after the disk extracts below). This is, for example, how the first two LUNs look on one client:

virsh dumpxml sl58.1 | xpath /domain/devices/disk
<disk type="block" device="disk">
<driver name="qemu" type="raw" />
<source dev="/dev/disk/by-path/ip-192.168.100.2:3260-iscsi-iqn.2012-09.com.skyrac:t1-lun-1" />
<target dev="vdb" bus="virtio" />
<alias name="virtio-disk1" />
<address type="pci" domain="0x0000" bus="0x00" slot="0x0b" function="0x0" />
</disk>
-- NODE --
<disk type="block" device="disk">
<driver name="qemu" type="raw" />
<source dev="/dev/disk/by-path/ip-192.168.100.2:3260-iscsi-iqn.2012-09.com.skyrac:t1-lun-2" />
<target dev="vdc" bus="virtio" />
<alias name="virtio-disk2" />
<address type="pci" domain="0x0000" bus="0x00" slot="0x0c" function="0x0" />
</disk>
-- NODE --
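
For reference, the iSCSI storage pool on the host corresponds to a definition roughly like the following (the pool name iscsipool is made up; portal and target IQN are the ones used above). It can be created with virsh pool-define and activated with virsh pool-start:

<pool type='iscsi'>
<name>iscsipool</name>
<source>
<host name='192.168.100.2'/>
<device path='iqn.2012-09.com.skyrac:t1'/>
</source>
<target>
<path>/dev/disk/by-path</path>
</target>
</pool>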

Now, inside the client VMs, the new devices appear as virtio disks (/dev/vdb, /dev/vdc and so on, matching the target devices above) and can be partitioned as usual.

And the rest is – Oracle

On the partitioned devices, I’ve used asmlib to create the ASM disks, and proceeded to install GI. Here, I ran into a problem: the OUI “flashing-and-blinking bug” (see also http://www.masterschema.com/2009/09/flashing-and-blinking-bug-in-oracle-universal-installer-database-11gr2/). This is actually related to screen resolution, which, according to the GI installation guide, has to be at least 1024x768. With the default video driver chosen for the VM during install, qxl, it was not possible to select any resolution higher than 800x600, but switching to vga and editing xorg.conf in the VM did the trick – for this, see the very helpful instructions at http://blog.bodhizazen.net/linux/how-to-improve-resolution-in-kvm/.
With the screen resolution problem out of the way, the GI install went as usual, and so did the database home install… but with the clusterware running, performance was horrible. With iowaits of 50-60% on the OS disks (not the ASM disks!), installing iotop

yum install iotop

revealed the culprit: ologgerd. Reducing the log level

oclumon debug log ologgerd allcomp:1

leaves iowait at around 10%, and performance becomes very acceptable. And that’s it so far – I have a working cluster, and first experience with KVM, DNS, and iSCSI…
Of course, there is also a lot left to do more economically and in a more automated way, another time.

Open questions

The most annoying problem, and certainly one to be investigated further, was the inability to successfully clone a VM.
After cloning (done from virt-manager), starting in single-user mode and editing the network configuration (changing IPs and MAC addresses), the bridged interfaces did not come up, failing with “RTNetlink answers: No such device”. This looks like a udev problem – but I did not find any MAC addresses hard-coded in the udev configuration files. Of course, installing every RAC node from scratch is a no-go, so there’s certainly a problem left to be solved…