
Monday, 8 December 2014

BTRFS - One Year Later

It was a year ago that I decided to give BTRFS a try.  Prior to that, I was primarily using EXT4 partitions, with a few XFS partitions on really slow HDs (4200rpm).

It was also about a year ago that I started using LXC (Linux Containers).  I needed a way to maintain multiple LXCs while reducing disk storage overhead.  Initially, this is what caught my attention for BTRFS.  BTRFS provided a typical journaled Linux filesystem akin to EXT4, but also provided "snapshot" and "cloning" functionality akin to QCOW2 with KVM.  I could maintain multiple LXC containers and let them share common files, reducing disk consumption.  With BTRFS, I could clone or snapshot LXC instances and avoid duplication -- only worrying about storing the changes between instances.  There are not many filesystems that provide this ability while still maintaining the properties of a typical EXT filesystem and avoiding a file container like QCOW2.  I immediately started using a BTRFS filesystem for storing all my LXC instances.

Since my LXC usage was recreational and non-critical, I experimented with subvolumes, snapshots (including read-only), compression and a few other features of BTRFS.  Having encountered no stability or reliability issues, in the summer of 2014 I decided to give BTRFS a trial run as the primary filesystem on a new Linux Thinkpad.   BTRFS had proved beneficial for me with LXC, and the snapshot functionality could equally let me effortlessly back up and roll back Linux upgrades in the future.  Previously, whenever I moved to new update packs (distro upgrades) in LMDE (Linux Mint Debian Edition), I would create a Clonezilla backup -- obviously an offline process that required me to boot a Clonezilla image to perform the backup.  I've had to roll back upgrades before, for all kinds of reasons.

Like I typically do, I cloned a working EXT4 system image (using Clonezilla) to the new system.  I then used the BTRFS tools to convert the EXT4 image to a BTRFS filesystem.  Since the conversion process doesn't modify the data itself (only the inodes and metadata), it was a relatively quick process (mere minutes), and it provides an equally quick way to roll the filesystem back to EXT4.  I even (successfully) tested the rollback.
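The tool behind this is btrfs-convert.  A sketch of the process, with a hypothetical /dev/sda2 partition, run from a live/rescue environment with the partition unmounted (the destructive commands are left commented out here):

```shell
# Hypothetical device name -- adjust to your own partition.
DEV=/dev/sda2

# The filesystem must be clean before conversion, so fsck it first:
#   fsck.ext4 -f "$DEV"
# Convert in place; the original EXT4 metadata is preserved in a
# read-only 'ext2_saved' image, which is what makes rollback possible:
#   btrfs-convert "$DEV"
# Roll back to EXT4, discarding all changes made since the conversion:
#   btrfs-convert -r "$DEV"
echo "conversion sketch for $DEV"
```

Note that the rollback only works as long as the saved EXT4 image has not been deleted to reclaim space.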

It was around this time I had my first BTRFS data disaster.  Ironically, it had nothing to do with the new system but on an existing system where I was using BTRFS only for LXC storage.  I had previously encountered situations where I would find myself with my LXC storage mounted in "read only".  A reboot would cure the issue, but I wasn't sure what was triggering the situation.  I was convinced that hibernating the system while LXC were running may have been causing the issue, so I modified my hibernation scripts to explicitly freeze any running LXC instances prior to hibernating.

One day when the problem had surfaced, I decided to resolve the issue by remounting the BTRFS subvolumes as read-write.  This proved disastrous.

When simply unmounting and remounting in rw failed, I decided to try the BTRFS fsck equivalent to examine the filesystem and fix any errors.  It reported errors in the BTRFS structure, which I later determined to be false, but I attempted to fix them.  That was my first mistake.  I was not aware that the BTRFS repair utilities are wildly experimental and can themselves corrupt the filesystem and cause further problems.  Warnings throughout the net reiterate not to use btrfsck to try to repair filesystems.  Why the tool exists, I cannot be certain, but one day it may provide some kind of meaningful function.  My use of the tool left me with a corrupt BTRFS root filesystem which could no longer be mounted.  And because the root filesystem was corrupt, all subvolumes were inaccessible and considered lost.

I later determined that my BTRFS filesystem was never damaged or corrupt to begin with (at least, prior to running btrfsck).  I discovered that the occasions when my BTRFS had been mounted or remounted read-only had only occurred after shutting down an LXC.  I quickly realized that the mount scripts installed in the init.d routines of a number of my LXCs were "successfully" causing my BTRFS subvolumes (or the root filesystem itself) to unmount and remount in ro.  I had not seen this behaviour with EXT4.  And because I don't usually shut down my LXCs without rebooting the host itself, the issue was rarely triggered.  So the issue was never caused or contributed to by hibernating the host.  By simply not shutting down the LXCs (use halt, or leave them running and let them stop when the host shuts down), or by removing/disabling these mount scripts, the problem was resolved.  I have not encountered the issue since.

I was very lucky in that disastrous scenario.  I had lost the entire filesystem hosting all my LXCs on one system, but because I had used the send/receive functionality in BTRFS to "clone" or back up my LXCs from the system to a central hard drive, I was able to simply "receive" backups of the filesystems back onto the system.  My LXCs don't contain data themselves, so the instances are fairly static.  The disaster at least let me validate recovery of the filesystems from backups.

But from the disaster and my other day-to-day experience with BTRFS -- on various systems for LXC storage, and on the newly set-up laptop using BTRFS as the primary drive -- here is my list of cautions when using BTRFS:

1) BTRFS is considered "stable" but is still being highly developed.
For this reason, use the most current "stable" kernel you can.  I would not consider using anything less than 3.13 for day-to-day use, and definitely nothing less than 3.10.  If you don't have other issues or reasons not to use 3.15, I would highly recommend 3.15 as your day-to-day kernel release.  The most current BTRFS changes are in 3.15, and there has been a considerable amount of enhancements and fixes that might prevent unnecessary issues.
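A quick way to check where your system stands against a minimum version (3.13 used here as the floor; sort -V does the version-aware comparison):

```shell
# Compare the running kernel against a minimum version.
required="3.13"
current="$(uname -r | cut -d- -f1)"
oldest="$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n1)"
if [ "$oldest" = "$required" ]; then
    echo "kernel $current meets the $required minimum"
else
    echo "kernel $current is older than $required"
fi
```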

2) "Stable" but not necessarily "production" ready.
Considering the repair tools themselves are destructive, I would not use this filesystem for "data" storage.  For example, I would not use it to store primary or sole backups of data.  I continue to use EXT3 and EXT4, with their journaling and widely reliable, stable filesystem tools (such as fsck).  Generally, my "data drives" are mirrored across different hardware to account for hardware failure.  When proper processes are followed (performing clean dismounts and syncs), I haven't encountered a filesystem corruption issue that journaling could not resolve.  When that day occurs, the fact that I maintain mirror devices will hopefully assist with recovery.  Because the repair tools for BTRFS cannot be considered production ready, you have no recourse to address filesystem corruption caused by software error, human error or hardware failure.  Even with mirroring, if your only recourse is to restore your data, there is a higher probability of a double failure, where both mirrors fail, when using a filesystem that doesn't provide for reliable recovery.   Thus, on systems where there is constant churn of data, I would also not recommend BTRFS.

On systems such as laptops, for the primary operating system filesystem, as long as you have backups of the filesystem on other devices to restore from, I would recommend you include BTRFS in your consideration.  However, if these systems are critical, where downtime is not tolerable, I would approach with caution.  If you are using the filesystem on a primary laptop while traveling, make sure a copy of the filesystem, or an alternative filesystem, is accessible on a memory key or flash device.  If you are traveling (where obviously you may not have your backup drives with you) and unfortunately encounter an issue that leaves you unable to boot or mount your system, you'll need that alternative boot device.

3) Don't treat subvolumes within the same device as your sole-backups.
If you create subvolume snapshots and store them on the same root filesystem, then don't consider these backups.  They'll prove worthwhile for rolling back changes on your primary filesystem, but if you encounter a hardware failure or corrupt your root filesystem, your snapshots will be lost along with it.

4) Mount subvolumes only, wherever possible.
Instead of mounting your root filesystem to access your subvolumes, consider mounting only the subvolumes.  Human error such as an "rm -rf *" on your root filesystem will take out all your subvolumes, but the same mistake performed on a subvolume will only affect that subvolume.  Likewise, if you encounter a bug or other filesystem corruption while only a subvolume is mounted, the root filesystem (and thus, other independent subvolumes) should be unaffected.
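One way to arrange this is in /etc/fstab: mount each subvolume explicitly with the subvol= option, and leave the top-level filesystem (subvolid 5) unmounted except during maintenance.  The UUID and subvolume names below are hypothetical:

```
# /etc/fstab -- mount subvolumes directly; the btrfs top level stays unmounted
UUID=<your-uuid>  /var/lib/lxc/web  btrfs  defaults,subvol=lxc-web  0  0
UUID=<your-uuid>  /var/lib/lxc/db   btrfs  defaults,subvol=lxc-db   0  0
```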

5) Use send/receive to store mirror copies of your subvolumes to other devices.
When backing up, generate a read-only snapshot (snapshots are essentially free in terms of disk storage), and send these snapshots to a suitable BTRFS filesystem on a separate disk.  Space-saving properties are maintained when transmitting your backups.  Use these read-only snapshots as a reference point, and create your read-write filesystems as snapshots of these.
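A sketch of that workflow as a shell function (the paths and subvolume names are hypothetical, and the function is only defined here, not run, since it needs real BTRFS filesystems and root):

```shell
# Take a read-only, date-stamped snapshot of a subvolume and mirror it
# to a BTRFS filesystem on another disk with send/receive.
backup_subvol() {
    pool="$1"; name="$2"; dest="$3"
    snap="${name}-$(date +%Y%m%d)"
    # send requires a read-only snapshot, hence -r
    btrfs subvolume snapshot -r "$pool/$name" "$pool/$snap"
    # receive recreates the snapshot under $dest, sharing no disk with it
    btrfs send "$pool/$snap" | btrfs receive "$dest"
}
# Example (not run): backup_subvol /mnt/pool lxc-web /mnt/backup
```

Once an initial copy exists on the target, incremental sends (btrfs send -p with a common parent snapshot) transmit only the differences.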

6) Track your snapshots and subvolumes by using logical naming and track hierarchies in a text file.
If you are creating a lot of subvolumes, performing a lot of snapshots and sending/receiving (cloning) your filesystems across multiple systems, it is smart to ensure the subvolume and snapshot names include references such as source and date.  I also recommend recording this in a text file, to track which subvolume originated from which snapshot and which systems it is shared with.  As you start having branches distributed among multiple targets, it becomes harder to trace the origin of a particular subvolume.  I keep this information in a text file in the root filesystem so that I don't rely on memory, or on tracing through tool output, to validate the origin of any given subvolume.
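For example, one workable scheme (the names and the "sent-to" target below are hypothetical; a temp path stands in for the tracking file at the top of the filesystem):

```shell
# Build a snapshot name that records origin and date, and append its
# lineage to a tracking file so the origin survives send/receive hops.
track="${TMPDIR:-/tmp}/subvolumes.txt"
parent="lxc-web"                            # subvolume being snapshotted
snap="${parent}@$(hostname)-$(date +%Y%m%d)" # e.g. lxc-web@thinkpad-20141208
printf '%s  %s  parent=%s  sent-to=%s\n' \
    "$(date +%F)" "$snap" "$parent" "backup-hd" >> "$track"
tail -n1 "$track"
```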

7) Use compress with lzo whenever possible.
When comparing performance with EXT4, BTRFS can sometimes perform on par, but in some real-world situations it can be up to 3x slower.  With compression inherent in the filesystem, coupled with the "lighter" lzo compression algorithm, a BTRFS drive can perform better than EXT4.  Your individual situation needs to be evaluated to determine whether the benefits of BTRFS outweigh the performance risks.  Understanding under what situations BTRFS will perform well is critical.  Generally, in random read-write scenarios (most predominantly on filesystems housing databases), BTRFS will underperform EXT4 by up to 3x.  Compression in this case provides little or no benefit, because the data tends not to be in a compressible state.  In cases where data is primarily read, BTRFS with lzo compression can outperform EXT4.  In situations such as the primary filesystem for a Linux distro on a laptop, you are generally reading and loading programs, so the benefits of BTRFS will outweigh the costs.
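Enabling it is just a mount option; in /etc/fstab it looks like this (UUID hypothetical).  Note that compress=lzo only affects data written after the option is enabled; existing files can be recompressed with btrfs filesystem defragment.

```
UUID=<your-uuid>  /  btrfs  defaults,compress=lzo  0  0
```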

8) Understand how the benefits will actually help you.
For example, one of the benefits of BTRFS that I haven't touched on is built-in filesystem checksum tracking.  This means the filesystem has a means of detecting corruption such as bit rot.  On an EXT4 filesystem, bits that have flipped, or files that have become corrupt for reasons other than journal errors or outright hardware failures, go undetected, because EXT4 does not track checksums of file data; BTRFS does.  Although bit rot is not new, it is increasingly becoming more common as society moves to cloud storage (where bits are stored elsewhere and can be lost or distorted in transmission) and to SSDs and other flash devices, where data corruption is not caused by traditional hardware failures such as heads smashing into platters, or loss of magnetic polarity / magnetic interference, but by memory chips failing, electrical interference, wear and tear on the chips themselves, or software errors/bugs in the controllers managing the data on the SSD devices.

But you have to understand what the benefit really is.  Bit rot on an EXT4 filesystem will most likely not be noticed unless you access the data file and realize it is corrupt (if you even realize it).  It could be as subtle as an incorrect number in a spreadsheet or an "artifact" in a media file.  You may not realize the data is corrupt (and if you back up the corrupt data to another device, your backup becomes corrupt too).  You would have to exert ongoing effort to track checksum values on files and determine which changes in checksum values are due to valid file changes and which were introduced by bit rot.  However, identifying bit rot is only part of the solution.  On an EXT4 filesystem, you would replace the corruption with a "known good" version of the file stored elsewhere.  The checksum benefit in BTRFS is the same: it does not provide a means to fix corruption, only a means of detecting it.
Throwing away your backups because your filesystem can detect bit rot corruption does not mean it will provide you a means to automatically repair the bit rot.  There is a difference between perceived benefit and actual benefit.
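BTRFS verifies checksums on every read, and btrfs scrub will walk the whole filesystem verifying everything; on a single-device filesystem it can only report corruption, not repair it.  A sketch (commands commented out, since they need a mounted BTRFS filesystem and root; the mount point is hypothetical):

```shell
# Hypothetical mount point of a BTRFS filesystem.
MNT=/mnt/pool
# Read every block and verify its checksum (runs in the background):
#   btrfs scrub start "$MNT"
# Check progress and any checksum errors found so far:
#   btrfs scrub status "$MNT"
echo "scrub sketch for $MNT"
```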

My year-long experiment with BTRFS has been successful, and for reasons already explored in previous posts, or ones to be explored in future posts, I'll continue to expand my use of BTRFS in the years to come.