question about ‘hot’ or ‘live’ backups

Post by **raffiki** » Jun 04, 2009 5:06 am

G’day

Just have a question about ‘hot’ or ‘live’ backups.

First, some background: we have 3 ESX hosts that were originally meant to be connected to a SAN. Then our budget got cut for the next 6-12 months (the hosts were bought before the SAN), so now we have 3 hosts, each with local storage (SAS drives on RAID 5), on which to place our virtual environment *cry*

... someone please kill me.

We now have a directive to only use ESX (not Hyper-V or VMware Server), so I am wondering how I go about backing up said hosts on local storage to another location.
VCB cannot be used as there is no SAN, and the powers that be want a backup of the VM image. That has left me with Veeam Backup and vRanger as the candidates (naturally my question is about Veeam).

I have been testing the image-level backup and restore, which works a treat. However, I am wondering how in the hell I take an image-level backup of a running VM (Server 2003 and 2008) without the VM thinking that it was a dirty power off?
Currently, if I do an image-level backup of a running VM and then restore it to another location, the VM boots and Windows asks whether you would like to start normally, use safe mode, or use the last known good configuration, just as it would if I had powered it off without shutting it down cleanly.

Snapshots can restore to a running state if the VM is powered off, so why can't Veeam Backup? Or can it, and I'm just stupid? Can I set it to somehow take a snapshot and then take an image-level backup, so that both can be restored to a location?

My understanding of hot backups is that you can take an image-level backup of a running VM and then restore that VM in its running state to another location. Is that correct, or am I way off the mark?

Anyway thanks in advance for any help and/or advice.

Cheers

Post by **Gostev** » Jun 04, 2009 11:02 am

Hello, no: running state is not preserved during hot backups. This is the "industry standard" at the moment. Besides, I would expect problems when restoring a VM including its running state (restoring to different host hardware or a different network segment may result in system failure).

However, we leverage Windows VSS integration to commit all application transactions, including registry and file system operations, so your system will be in a consistent state after you restore it. Yes, the OS will still report a "dirty power off", because the detection of improper shutdown was not designed with virtualization and image-level backups in mind.

Does this answer your question?

Post by **raffiki** » Jun 04, 2009 10:51 pm

indeed it does, thanks mate.

Post by **cicero** » Jul 23, 2009 3:57 pm

Hi Gostev,

I just booted a replicated guest without VSS on the backup ESX host
(Linux guests, mostly Red Hat Enterprise, some Debian 4 and 5, all on ext3 filesystems).

At boot time Red Hat complains about a dirty power off and filesystem problems (lost inodes, ...).
I had to run a filesystem check (fsck) for about 2 hours, and finally, after a lot of corrected problems,
I got the guest up.

But I wonder if there is a better way to take an image of a Linux guest without these problems when recovering.
(And I also wonder whether there is a risk that the guest will someday be unable to boot
because the filesystem errors are too bad?)

David

Post by **Gostev** » Jul 23, 2009 4:15 pm

David, the dirty power-off message is expected, but file system problems are not... Do you have VMware Tools installed on the guest? It should properly quiesce the file system before snapshot creation...

Post by **cicero** » Jul 23, 2009 8:24 pm

Yes, vmware-guestd is up and running.
It's a Linux 2.6.9 kernel with Oracle 10g.
The pre-freeze script even puts Oracle into suspend mode before snapshotting.

I'll do some more testing tomorrow, since you say that there shouldn't be filesystem errors.

David

Post by **Gostev** » Jul 23, 2009 10:12 pm

Frankly, I don't know for sure whether this is normal... I just did not expect them to appear. Please try testing with regular VMware snapshots (using VMware Infrastructure Client, snapshot the running VM, shut it down, and revert to the snapshot) and see if you get these errors. Veeam Backup does not affect snapshot creation (unless Veeam VSS is used), so you do not need to do a full cycle of backups/restores for this testing.

Post by **tsightler** » Jul 24, 2009 2:58 am

We back up a lot of Linux VMs, including several running Oracle databases, and have not experienced this issue at all. The ext3 filesystem should generally be robust enough to handle this without requiring an fsck of the filesystem on reboot, just a replay of the journal. You might try adding a "sync; sync; sync" line to the pre-freeze script to flush all currently uncommitted I/O, probably after the Oracle instance is suspended; this might help. Is your database pretty quiet at the time the snapshot is taken?
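
The flush step suggested here could be sketched roughly as follows. This is only an illustration in Python, assuming Python 3.3+ is available on the guest (for `os.sync`), which a stock RHEL 4 guest would not have; there, the same effect comes from running `sync` in the shell-based pre-freeze script. The function name `flush_dirty_buffers` is hypothetical.

```python
import os


def flush_dirty_buffers(passes: int = 3) -> int:
    """Ask the kernel to flush dirty buffers to disk, `passes` times.

    Mirrors the "sync; sync; sync" habit: repeating the call is harmless.
    Returns the number of sync calls that were issued.
    """
    issued = 0
    for _ in range(passes):
        os.sync()  # wraps POSIX sync(2); Unix only, Python 3.3+
        issued += 1
    return issued


if __name__ == "__main__":
    # Would be called after the database has been suspended and
    # before the snapshot is taken.
    print("sync calls issued:", flush_dirty_buffers())
```

Note that flushing only helps with data still sitting in the guest's page cache; it does not stop new writes from arriving between the flush and the snapshot.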

Post by **cicero** » Jul 25, 2009 5:19 pm

So, some testing with the following results:

- The guest (RHEL 4.3 with Oracle 10g on ext3), replicated to the backup host while Oracle is suspended,
tells me at boot time that there was an unclean power off (as usual) and asks if I want to check the filesystem.

-> If I say No, nothing seems unusual, and Oracle starts fine (without deeper testing).
-> If I say Yes, there are really quite a lot of filesystem errors (> 100).
fsck.ext3 tries to correct the filesystem but warns that data loss is possible.
-> Same with offline correction of /dev/sda from a boot CD (SystemRescueCd).
-> After a reboot, the system is slightly or even heavily corrupted (deleted /var/libs etc.) or doesn't even boot.

- Fearing that the source system might be corrupted too,
I checked its filesystem, but found no problems.

So the ext3 filesystem seems to get corrupted while taking a snapshot (?).

I'll give tsightler's suggestion of "sync; sync; sync" in the pre-freeze script a try
and keep you informed afterwards.

Maybe somebody has another idea?
Oracle is idle during the snapshot, and I notice similar filesystem corrections
on other Red Hat guests on the backup host.

David

Post by **cicero** » Jul 27, 2009 9:01 am

- "sync;sync;sync;" doesn't help.

- The RHEL 4.3 guest with Oracle 10g has a lot (50-100) of filesystem errors like:

Code:

"extended attribute block 22343 has reference count 3, should be 2. Fix<y>?"
"Invalid inode number for '.' in a directory inode 44454. Fix<y>?"
"Entry 'filename' in /directory/.../???/.../ (99303) has an incorrect filetype (was 1, should be 2)"
"Unconnected directory inode 762765 (/usr/bin/???)
Connect to /lost+found<y>?"
"Block bitmap differences: (24571--24572)..."
"Free blocks count wrong for group #0 (201, counted=199)"

(afterwards, the root filesystem is a little bit f**ked)

- The other RHEL guest (Lotus Domino) has errors too, but not that many ... only a handful.

- All other guests (Debian) are without errors.
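
To compare guests more systematically than by eyeballing the fsck output, the messages could be tallied by type. The following is a minimal Python sketch; the category names and regular expressions are hypothetical and are based only on the messages quoted in this post, not on the exact wording of any particular e2fsprogs version.

```python
import re
from collections import Counter

# Hypothetical categories; each regex matches one kind of message quoted above.
PATTERNS = {
    "refcount": re.compile(r"reference count \d+, should be \d+"),
    "bad_dot_inode": re.compile(r"Invalid inode number for '\.'"),
    "bad_filetype": re.compile(r"incorrect filetype"),
    "unconnected_dir": re.compile(r"Unconnected directory inode \d+"),
    "bitmap_diff": re.compile(r"Block bitmap differences"),
    "free_count": re.compile(r"Free blocks count wrong"),
}


def tally_fsck_messages(log_text: str) -> Counter:
    """Count fsck messages per category; unmatched lines are ignored."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break  # one category per line
    return counts


# Sample lines taken from the messages quoted in this post.
SAMPLE = """\
extended attribute block 22343 has reference count 3, should be 2. Fix<y>?
Invalid inode number for '.' in a directory inode 44454. Fix<y>?
Unconnected directory inode 762765 (/usr/bin/???)
Connect to /lost+found<y>?
Block bitmap differences: (24571--24572)
Free blocks count wrong for group #0 (201, counted=199)
"""

if __name__ == "__main__":
    for name, n in tally_fsck_messages(SAMPLE).most_common():
        print(name, n)
```

Running the same tally on a saved fsck log from each replicated guest would show whether the RHEL 4.3 guest differs from the others in the kind of errors or only in their number.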

Does anybody have any idea why,
or how to debug this more deeply (maybe VMware Tools sucks?)

I assume that the FS errors are not carried over from the last full replication
into the latest incremental replication (?)

David

Post by **tsightler** » Jul 27, 2009 2:34 pm

Are your filesystems on top of LVM or just basic partitions? We don't see anything like this at all on our systems, but I'll admit that we're not running anything nearly as old as RHEL4.3.

Post by **cicero** » Jul 27, 2009 4:50 pm

No, just basic partitions.

RHEL 4.3 is about 2 or 3 years old (5.x is current now); that's not uncommon ... (?).

David

Post by **tsightler** » Jul 27, 2009 5:51 pm

RHEL 4.3 was released around March of 2006. Current release of RHEL4 is 4.8, in May of 2009. I understand that running RHEL4 is not that uncommon, but I would hope that running 4.3 is at least somewhat uncommon since there have been five point releases since then. That's basically like running Windows 2003 with only SP1, not SP2. My oldest system is running 4.7 with plans to update to 4.8 or 5.x within the next few weeks.

Still, I'm not trying to imply that this is your problem. I was only pointing out that I simply don't experience this issue at all, and I'm not running anything that old. This may imply that a newer 4.x release has ext3 changes that correct the problem, or it might not mean anything at all. I do know that there were some major ext3 regressions with one of the early 4.x updates, and I think it might have been around the 4.3 release, but my memory has faded.

I'll try to do some additional testing with some of my VM's and see if I can get anything at all like your problem. Any unusual flags on your ext3 mounts?

Post by **tsightler** » Jul 27, 2009 8:34 pm

OK, this won't help you much, but I just tested two systems that are reasonably close to your configuration (RHEL 4.7, Oracle 10g database), and the replicated copies boot with nothing more than a minor warning about "replaying journal" and a few orphaned inodes. That is pretty normal recovery after a "hard power down", which is basically the equivalent of what a snapshot is.

I wish I had an idea for you, but nothing jumps out at the moment.

Good Luck!

Post by **cicero** » Jul 28, 2009 8:52 am

Hi Tom,

thanks for your suggestions and all your testing!

I'm also not sure that an upgrade will resolve the issue,
because there are no snapshot problems on a much
older Linux guest (SUSE 8 Enterprise with Oracle 9.2).

Anyway, I'll test an upgrade to RHEL 4.8 with the boot DVD on the test system
and take some snapshots afterwards.

However, we are not allowed to upgrade the live system,
because the vendor of the ERP system believes that there
may be problems with Oracle afterwards.

David

Post by **cicero** » Jul 28, 2009 10:13 am

Something nasty ...:

Kudzu has a SCSI controller on its ignore list:

vim /etc/sysconfig/hwconf

Code:

class: SCSI
bus: PCI
detached: 0
driver: mptscsih
desc: "LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI"
vendorId: 1000
deviceId: 0030
subVendorId: 0000
subDeviceId: 0000
pciType: 1
pcidom:    0
pcibus:  0
pcidev: 10
pcifn:  0

After removing this entry on the replicated machine,
kudzu finds it:

Code:

"LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI"
Configure -> YES -> blabla [ok]

But this one already seems to be installed:
lspci

Code:

00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 01)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 01)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 08)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:0f.0 VGA compatible controller: VMware Inc [VMware SVGA II] PCI Display Adapter
00:10.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 01)
00:11.0 Ethernet controller: VMware Inc VMware High-Speed Virtual NIC [vmxnet] (rev 10)

after configuring the 'new' LSI Logic, there is no change in lspci.
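
To check which driver kudzu has recorded for a device without paging through the file by hand, the hwconf entries could be parsed into records. This is a minimal Python sketch under the assumption that entries in `/etc/sysconfig/hwconf` are `key: value` lines separated by a line containing only `-`; the function names are hypothetical, and the second sample entry is made up for illustration.

```python
def parse_hwconf(text):
    """Parse hwconf-style text into a list of dicts, one per device entry."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if line == "-":  # assumed record separator
            if current:
                records.append(current)
                current = {}
            continue
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip().strip('"')
    if current:
        records.append(current)
    return records


def find_by_class(records, device_class):
    """Return all entries whose 'class' field equals device_class."""
    return [r for r in records if r.get("class") == device_class]


# First entry is taken from the post; the second is a made-up placeholder.
SAMPLE = """\
-
class: SCSI
bus: PCI
detached: 0
driver: mptscsih
desc: "LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI"
vendorId: 1000
deviceId: 0030
-
class: NETWORK
bus: PCI
detached: 0
driver: pcnet32
desc: "hypothetical second entry"
"""

if __name__ == "__main__":
    for entry in find_by_class(parse_hwconf(SAMPLE), "SCSI"):
        print(entry["driver"], "-", entry["desc"])
```

Comparing the recorded `driver` value on the source guest and on the replica would show whether kudzu sees the controller differently on the two hosts.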

-

On the other hand, I did a lot of hard shutdowns on the
replicated problem guest (with a previously cleaned filesystem),
and there is no problem after snapshotting or powering off the guest.

So I think there may be something wrong with the SCSI driver (?)
or with the replication to the other host ...

David
