Organizing and Backing up your data for the Digital Photographer

When you start out in photography, you’ll find it fairly easy to find images: you’ll remember that the landscape came from your trip last summer, or that the penguin was from the zoo visit in August. Because of that, in the early days, it’s easy to get into the habit of sticking photos onto the computer without much thought or organization.

As you shoot more images and spend more time with the camera, it’s going to be harder and harder to keep all of this in your head and you’ll run out of space on your computer’s hard disk as well, especially if you are running a system with an SSD instead of a traditional spinning drive. If you’re smart, you’ll create some systems for organizing your images before that happens, but few of us are (I certainly wasn’t).

This article will show you the options you have for building out your computing system as your needs and your photo library grow, and suggest some organizational ideas and post-processing workflows to help you keep track of everything.

Also, since the value of these images and data will continue to grow as your collection expands, we’ll discuss various options for protecting your data with backups, along with some of the issues and complications involved in keeping your data safe no matter what might happen.

This is a diagram of my existing computer setup’s data storage systems.

The main drive in the computer is a 500 Gigabyte SSD (Solid State Drive). It is purely electronic and has no moving parts, which makes it a lot faster than traditional spinning drives, but because SSDs cost a lot more, especially the bigger capacity models, you have to make some compromises between size and cost. It should also be noted that when an SSD fails, the failure tends to be catastrophic, and recovering data from one is difficult even for experts (and therefore expensive), so if you are using an SSD, it’s extremely important that you have backups and that you keep them updated.

When I’m in my home office, I have two external Thunderbolt drives attached to the computer. One is a 2 Terabyte Mirrored RAID drive that is used for storing data that I want to keep local to the computer. The other is a 5 Terabyte drive that I’ll explain in detail in a minute.

If you’re not familiar with RAID, a Mirrored RAID contains two physical drives (SSD or spinning) that it treats as one drive. Any data written to the drive is actually written to both drives, and data read from the drive can come from either of the physical drives. This means that any data stored on a Mirrored RAID is actually kept on two physical drives, so if one drive fails the data is still okay and accessible from the other drive. This reduces the severity of a drive failure (since there’s a redundant copy), but Mirrored RAID does not replace the need for backups, since there are still many failures that can destroy both copies of the data or destroy the unit completely. I can’t emphasize this enough: Mirrored RAID is not a backup and you can’t trust it to fully protect your data. What it does is reduce the chances of needing to recover data from backup, but the chance of needing the backup still exists.
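To make the point concrete, here is a toy Python model of a mirrored pair. It is only an illustration under simplifying assumptions: the class and method names are made up, and a real RAID works on disk blocks, not dictionaries. It does show both halves of the argument, though: a single drive failure is survivable, while a deletion is faithfully applied to both drives.

```python
class MirroredPair:
    """Toy model of a two-drive mirror; each 'drive' is just a dict."""

    def __init__(self):
        self.drives = [{}, {}]
        self.healthy = [True, True]

    def write(self, name, data):
        # Every write goes to every healthy drive.
        for drive, ok in zip(self.drives, self.healthy):
            if ok:
                drive[name] = data

    def delete(self, name):
        # A delete (or an accidental overwrite) is mirrored as well.
        for drive, ok in zip(self.drives, self.healthy):
            if ok:
                drive.pop(name, None)

    def fail_drive(self, index):
        # Simulate a dead drive: its contents are gone.
        self.healthy[index] = False
        self.drives[index] = {}

    def read(self, name):
        # A read can be served by any healthy drive that holds the data.
        for drive, ok in zip(self.drives, self.healthy):
            if ok and name in drive:
                return drive[name]
        raise FileNotFoundError(name)


mirror = MirroredPair()
mirror.write("penguin.dng", b"raw image data")
mirror.fail_drive(0)
print(mirror.read("penguin.dng"))  # still readable from the surviving drive
mirror.delete("penguin.dng")       # a mistaken delete removes every copy
```

After that last line, calling mirror.read("penguin.dng") raises FileNotFoundError, and that is exactly the situation only a separate backup can rescue.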

Also on my home network is a Synology DS414 NAS (Network Attached Storage) unit, and inside that are four drives set up in a variation of RAID 5 called SHR. That makes sure all data is stored redundantly across the drives, so it can survive a single drive failure without loss of data and without the NAS going offline. With a NAS, you can then replace the failed drive on the fly, configure it into the system, and the NAS will automatically rebuild the lost drive’s data onto it.

My NAS has a total of 10 Terabytes of usable space, which I use both to store my data on and also to back up all of the computers on the network via Time Machine.

This may seem complicated, and honestly, it kinda is. But each of these serves a specific purpose.

The boot drive: This drive is small and, being an SSD attached to the motherboard, fast. It stores the operating system and my applications, and since this computer is a laptop, I also keep on it the files that I have to have even when I’m on the road or in a cafe. I am currently using about 250 Gigabytes of its 500 Gigabytes.

The external 2TB drive: This drive is slower than the internal drive but faster than the NAS. It’s useful for really large data sets that I will only use when I’m sitting at my desk. This drive has about 250 Gigabytes on it, of which 180 Gigabytes are the movies and videos I have in my iTunes library (the library files are on the SSD), plus my Windows VM files. The primary use of this disk is rendering video, where the source files can be larger than my SSD can handle but I need them local for performance while I’m in the middle of a Final Cut project.

The NAS disk: This stores all of the files I’m not actively working on and almost all of my photography images, plus archives, rendered video, archived music and videos, historical documents, and all of that other stuff you collect over the years. I have about 4.5 Terabytes of data that lives on the NAS today. The NAS also stores about 2 Terabytes of backup data from the three Time Machine partitions I have configured on it.

This is where that other Thunderbolt drive comes in. It’s split into two partitions: a 500 Gigabyte partition that I use to create a boot drive backup via SuperDuper! that gets updated every night, and a large HFS partition that is filled by a script I’ve written, which mounts every data partition on the NAS and rsyncs the data off the NAS onto that drive. Since that drive lives on my computer, it’s then backed up offsite by CrashPlan into the cloud. I’ll discuss that in more detail later.

Organizing your images

This is a diagram of how I organize my images on the disks. My collection is too large to store on one disk, especially an SSD.

Once you outgrow a single disk you need to think about your organization a bit. What I have chosen to do is that the SSD is used only to store the Lightroom catalog and images that are actively being worked on or in queue to be worked on. All other photography files are stored on the NAS.

When I come back from a day’s shoot, I will import all images into a subfolder inside the Incoming folder named ‘YYYYMMDD location name’; in the few cases where I do a multi-day trip, I’ll compile them into a single folder with just the year and month on it. On days where I visit multiple field locations, I’ll import everything into a folder for one location and then move the appropriate images into a second folder for the other location.
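If you’d rather not type those folder names by hand, the naming convention is easy to generate. This is a minimal Python sketch, not part of my actual workflow: the function name is made up, and the locations in the example are placeholders.

```python
from datetime import date


def incoming_folder_name(shoot_date: date, location: str,
                         multi_day: bool = False) -> str:
    """Build 'YYYYMMDD location name', or 'YYYYMM location name' for a trip."""
    stamp = shoot_date.strftime("%Y%m" if multi_day else "%Y%m%d")
    return f"{stamp} {location.strip()}"


# Hypothetical shoots, used only to show the two naming forms.
print(incoming_folder_name(date(2015, 3, 14), "Point Reyes"))
print(incoming_folder_name(date(2015, 3, 14), "Yosemite", multi_day=True))
```

The first call produces "20150314 Point Reyes" and the second "201503 Yosemite". Because the date comes first and is zero-padded, the folders sort chronologically in any file listing.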

I do all my post-processing on the folder in that location. When I’m done working on the images in a folder, I will take the images that I’m choosing to retire (not the technically flawed ones, which get deleted, but the ones not worth adding to my collection) and move them into a folder within the Retired folder. The folder with the rest of the images gets moved into the Images folder on the NAS, where I’ve created a hierarchy of organizing folders for each year, and within each year, folders for each quarter (Q1 through Q4). That usually keeps the number of folders listed within each quarter’s folder small enough to be easily usable within the Lightroom Folder listings without huge amounts of scrolling (and cussing).
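The year and quarter hierarchy can be computed from the shoot date. Here is a short Python sketch of that mapping; the function name and the mount path in the example are assumptions for illustration, so substitute wherever your own NAS share appears.

```python
from datetime import date
from pathlib import PurePosixPath


def archive_folder(images_root: str, shoot_date: date) -> PurePosixPath:
    """Return <images_root>/<year>/Q<n> for the quarter containing the date."""
    quarter = (shoot_date.month - 1) // 3 + 1  # Jan-Mar is Q1, Oct-Dec is Q4
    return PurePosixPath(images_root) / str(shoot_date.year) / f"Q{quarter}"


# Hypothetical mount point for the NAS Images folder.
print(archive_folder("/Volumes/photos/Images", date(2015, 8, 2)))
```

For an August 2015 shoot this prints /Volumes/photos/Images/2015/Q3, which is where that day’s folder would be moved once processing is finished.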

The images in the Retired folder are left for 3-6 months, then I re-evaluate them to see if there are any that I want to keep; those get added to the main folders, and anything that doesn’t make the cut on that second look gets deleted.
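If you want a reminder of which retired folders have rested long enough, the check is a simple date comparison. This sketch assumes a 90-day rest period (roughly the low end of the 3-6 months above) and assumes you know the date each folder was retired; both the constant and the function name are hypothetical.

```python
from datetime import date, timedelta

# Assumption: about three months, the short end of the 3-6 month range.
REST_PERIOD = timedelta(days=90)


def due_for_second_look(retired_on: date, today: date) -> bool:
    """True once a retired folder has rested for at least REST_PERIOD."""
    return today - retired_on >= REST_PERIOD


print(due_for_second_look(date(2015, 1, 1), date(2015, 2, 1)))   # too soon
print(due_for_second_look(date(2015, 1, 1), date(2015, 4, 15)))  # ready
```

The first call returns False (31 days) and the second True (104 days).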

I know there are people who suggest you keep everything, and if you want to, that’s fine. But in early 2015 I decided it was time to do a fresh evaluation of all of my retired images, and out of about 35,000 images, I found about 400 that I chose to add to my library. That makes me comfortable that if I give the images some time to rest and then come back to them for a second fresh look, I’m not going to miss any significant images, and I see no reason to keep them.

I do think it’s important you don’t do that kind of deletion until the second look; if you process the images when you return from a shoot you can still find your impression of images biased by the fresh memories of the shoot. Allowing some time for that to fade and then doing a second look will help you see the images for what they are and not what you remember from having taken them.

All of my images are stored in this folder structure, organized by date and tagged in the folder with the location. Everything else I do with the images — everything — is done by use of keywords, collections and smart collections.

The Work in Progress folder is there for me to keep files when I’m doing experiments, reworks or dealing with some extended post-processing project — especially timelapses, panorama stitching and HDR processing where I might spend some time evaluating and tweaking an image. That allows me to carry those projects with me around the house or on the road without needing access to the local network or the NAS. Once I’m done those files get merged back into the folder structure on the NAS.

When does it make sense to use a NAS?

A NAS (Network Attached Storage) device hooks up to your home network and acts as a virtual disk for your computers; it has a number of features you can take advantage of, such as:

  • Wiring up multiple disks into a single fault-tolerant virtual disk where any single drive failing won’t cause data loss.
  • Allowing you to replace a disk with a different drive (including larger ones) on the fly without having to copy data, because the NAS handles it for you.
  • Creating virtual drives and, if you want, attaching quotas to limit how much data you can put on each one. That lets you keep your backups to a reasonable size while making all of the free disk space available to any computer that needs to use it; no more having to upgrade a disk on one computer that’s full while a drive on another computer is half empty.
  • Storing all your backups on the NAS, including Apple’s Time Machine.
  • Simplifying your data backups and making it easier to ensure your offsite backups happen on a regular basis.

The disadvantages of a NAS? Cost is the big one, although if you start adding up the cost of all of the drives and housings you’ve picked up to store and back up your growing collection of data, it’ll probably be a lot closer to the cost of a NAS than you realize. Once you switch to a NAS, you stop paying the extra money for those external disk housings and can buy bare drives instead, and you can use that disk space very efficiently, meaning you’ll have to buy more or bigger drives much less often. Making sure you keep the NAS well backed up, especially with offsite backups, can be more complicated, and it takes a bit of geeking to get it running and keep it upgraded, but it’s actually more user friendly than you might think.

When I did the cost analysis, I found that from a dollar view, it seemed to make sense to switch to a NAS when my data set reached about 4 terabytes. Since I need to back up multiple computers that live in different rooms, the NAS starts making sense sooner than that, because each of those computers was using its own backup drive (or drives), meaning more money, plus cables and the hassle of making sure everything is plugged in right and working properly. I don’t know about you, but I was always chasing some backup that had stopped working and needed me to figure out how to fix it. Unreliable backups are a terrible thing because if you don’t realize they’re broken, you don’t know you’re at risk of losing data until it’s too late. Since moving backups to the NAS, they’ve been almost 100% reliable and painless.

Another advantage of the NAS: if you are someone who moves around the house working on things, all the files on your NAS are available anywhere you are, assuming the NAS is hooked up to the network and the network has Wi-Fi on it. Files you keep on an external drive are only available when plugged in at your desk. While accessing data over the network is slower, the only time I’ve ever noticed or run into challenges using files on the NAS is when I’m deep in a video project and copying hundred-gigabyte video files around. For my photography, it simply isn’t a problem.

So, what did this cost?

It takes a bit of money to get started with a NAS: the Synology DS414 is about $450, and filling the four bays with 4TB drives will run you another $500, so the cost to get the box ready to go is about $950. There are costs you might not think about, too. I realized, for instance, that my ethernet hubs were 100 megabit hubs, not gigabit hubs. Replacing the one driving the key areas of the network cost me $25, and I ended up needing to buy a few more cables. I also realized that the Airport driving my Wi-Fi was the generation before 802.11ac. If you’re going to move your data to the network, it makes sense to be wired in at your main work desk for maximum performance, and to have good, modern high-speed Wi-Fi around the house.

In my case I was able to repurpose existing drives into the unit to reduce cost, and you may well be able to do the same. Four 4TB drives will give you 12 Terabytes of usable space, with the other 4 Terabytes being used for data redundancy. If you did this with external drives, 12 Terabytes would cost you about $400, but you’d lose all of the RAID and redundancy, and you’d lose the ability to set up your data partitions to match your needs or share them across multiple machines, so there’s a good chance you’ll end up buying an extra drive or two and wasting a lot of space because it’s not where you need it. If you were to set up that 12TB of data as mirrored RAID drives for redundancy, it’d cost you anywhere from $700 to $1400 depending on how you decide to configure the drives.
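The capacity arithmetic behind those numbers can be written down in a few lines. This is a simplified sketch, assuming identical drives and exactly one drive’s worth of redundancy (the RAID 5 style case); SHR with mixed drive sizes behaves differently, and formatted capacity always comes out a little lower than the raw figure. The function names are made up for the example.

```python
def single_redundancy_usable_tb(drive_count: int, drive_tb: float) -> float:
    """Usable space when one drive's worth of capacity holds redundancy."""
    if drive_count < 2:
        raise ValueError("need at least two drives for redundancy")
    return (drive_count - 1) * drive_tb


def mirrored_usable_tb(drive_count: int, drive_tb: float) -> float:
    """Usable space when drives are paired up as two-drive mirrors."""
    if drive_count < 2 or drive_count % 2:
        raise ValueError("mirrors need an even number of drives")
    return (drive_count // 2) * drive_tb


print(single_redundancy_usable_tb(4, 4))  # four 4TB drives in the NAS
print(mirrored_usable_tb(6, 4))           # six 4TB drives as three mirrors
```

Both calls come out at 12 Terabytes, but the mirrored setup needs six drives to get there instead of four, which is a large part of why the mirrored route costs more.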

In other words, once you start protecting your data by using RAID, the costs of using a NAS catch up with buying drives quickly. And your data deserves to be protected both by some kind of RAID mirroring or redundancy and by good backups.

How to back up your data

I think computer users can be broken down into three camps:

  • Computer users who haven’t had a hard disk fail and haven’t yet figured out they need to back up their systems.
  • Computer users who have had a disk fail but still don’t back up their systems reliably (or at all), even though they know they should.
  • Grouchy old computer geeks who yell at the first two groups because we’re the ones who get that call at 10PM because a disk failed and they need a file back because they’re on deadline and oh my god please help me I don’t have a backup what do I do?

I warn you up front, I am one of that last group. My goal is to convince you to start backing up your computer before it’s too late, because I want those late night on deadline oh my god I’m doomed please help me phone calls to stop. Even though I know it’ll never happen in my lifetime.

Here’s the reality: whatever you store your data on is going to fail some day. If you don’t plan for that, bad things will happen. And when bad things happen, you call your geek friend late at night blubbering and crying and asking for help. Neither of us wants that.

You can’t prevent the failure, but you can reduce the chances of it happening, and you can back up your data so that if a disk fails, it’s not a big deal, because that data also exists on another hard disk. Or two. Or three. The more the merrier.

This article will help you understand how to reduce the chance of that failure and to limit the pain and damage when it happens.

The Best Backup is Never Needing your Backup

The best and most reliable backup is never needing to recover data from your backup. You can never guarantee that a drive will never fail — but you can reduce the chances of it happening.

How? Simple: replace your drives before they fail. Backblaze is a company that will back up your data over the internet to their servers. They have lots of data on lots (and lots) of hard drives, and it’s their job for that data to never be missing. They’ve got lots of experience with failing hard drives and how long it takes for one to fail, and they’ve been nice enough to provide the data. If you’re interested in the details, read their study. The executive summary is that after a hard drive is three years old, the failure rate starts to rise rapidly. So the first thing you can do to reduce the chance of a hard drive failing on you is retiring it and replacing it with a new one before it gets to be four years old.

I take this one step further: if you have a laptop that you carry around, that laptop tends to get bounced and jostled. Inside that laptop is a hard drive, which is also getting jostled and bounced around. My experience is that laptop hard drives have a tendency to die younger than hard drives in machines that don’t move around, so if you have a laptop, you really want to replace that hard drive earlier.

My hard drive policy is simple:

  • Any hard drive I use as a working drive (attached to a computer and powered up for use on a daily basis) is replaced when it is between two and three years old.
  • Any spinning hard drive installed inside a laptop is replaced earlier: between 18 months and two years. With an SSD, I don’t use this standard any more. I’m still deciding what my replacement policy will be for laptop SSDs, because many of the things that cause spinning disks to fail early, like small drops and bumps, aren’t relevant for SSDs. But I’m thinking it’ll be around 3 years for SSDs to be on the safe side.

That doesn’t mean their useful life is over: the drives I used as my day to day drives get turned into backup drives (unless they’re too small). They’re used as backups until they’re around four years old, and then they’re retired.

Backup drives tend to be powered off a lot more, their usage is much lower, and you don’t put them under stress. That reduced stress means they’re less likely to fail. You use a drive hard when it’s new, give it a reduced role as it ages, and retire it before it hits that point in time where failure becomes likely.
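For anyone who tracks drive purchase dates in a spreadsheet or script, the policy above can be expressed as a small rule. This Python sketch is an approximation: it uses the early end of each replacement window, treats the laptop SSD figure of about three years as settled even though I’m still deciding on it, and ignores the "too small to be a useful backup drive" exception. The function and role names are made up.

```python
def drive_role(age_years: float, in_laptop: bool = False,
               is_ssd: bool = False) -> str:
    """Suggest a role for a drive of the given age under the policy above."""
    if age_years >= 4.0:
        return "retire"
    if in_laptop and not is_ssd:
        working_limit = 1.5   # spinning laptop drives: 18 months to 2 years
    elif in_laptop and is_ssd:
        working_limit = 3.0   # tentative figure for laptop SSDs
    else:
        working_limit = 2.0   # desktop and external working drives: 2-3 years
    return "working drive" if age_years < working_limit else "backup drive"


print(drive_role(1.0))                   # working drive
print(drive_role(1.75, in_laptop=True))  # backup drive
print(drive_role(4.5))                   # retire
```

A drive moves from working duty to backup duty and is then retired at around four years, before the failure rate climbs.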

If you do that, you will rarely have a drive fail on you. It costs a little money, but the cost of a new laptop drive these days is under $100, so it’s not that expensive. It’s a lot less expensive than the time and stress of recovering from a failure, that’s for sure.

Setting up backups

Even if you have never had a hard drive fail, you still need backups, because sooner or later, one will. Beyond that, there are many ways for your data to disappear: your house or office could burn down. Your computer could fail and scribble Shakespeare’s Sonnets all over your disks and data. You could be sitting in Starbucks and watch as someone grabs your laptop and runs out the door. You could drop your laptop (yes, I know, that never happens, right?). The ways your data could disappear are many; the ways to recover it if you don’t have good backups are expensive and maybe impossible.

The only way to protect yourself from these bad things is to keep multiple copies of your data. At least one of those copies needs to be offsite, because if your house burns down, having a backup on the desk next to the computer that’s just turned into a pile of slag won’t help very much.

Here’s how I back up my data today:

My setup: There is no native or supported backup client for the NAS, so I’ve written a script that runs on my computer, mounts each NAS partition, and rsyncs it to a disk on the local computer. It backs up data only, not the Time Machine backups.
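My own script isn’t reproduced here, but the general shape of an rsync-based copy looks something like the following Python sketch. Everything specific in it is an assumption for illustration: the share names, the /Volumes mount points, and the backup volume name are placeholders, and it expects the NAS shares to be mounted already instead of mounting them itself. It relies only on the standard rsync options -a (archive mode), --delete, and --dry-run.

```python
import subprocess
from pathlib import Path, PurePosixPath

# Hypothetical names: replace with your own NAS shares and backup volume.
NAS_SHARES = ["photos", "archive"]
MOUNT_ROOT = "/Volumes"
BACKUP_ROOT = "/Volumes/NASBackup"


def rsync_command(share: str, mount_root: str = MOUNT_ROOT,
                  backup_root: str = BACKUP_ROOT) -> list:
    """Build the rsync command that copies one mounted share to the backup."""
    # The trailing slash makes rsync copy the share's contents, not the
    # share folder itself, into the destination folder.
    source = f"{PurePosixPath(mount_root) / share}/"
    destination = str(PurePosixPath(backup_root) / share)
    return ["rsync", "-a", "--delete", source, destination]


def back_up_shares(shares=NAS_SHARES, dry_run: bool = True) -> None:
    """Run rsync for every share that is currently mounted."""
    for share in shares:
        if not (Path(MOUNT_ROOT) / share).is_dir():
            print(f"skipping {share}: not mounted")
            continue
        command = rsync_command(share)
        if dry_run:
            command.insert(1, "--dry-run")  # report changes, copy nothing
        subprocess.run(command, check=True)
```

Calling back_up_shares() with the default dry_run=True only reports what would change. Be careful with --delete: it removes files from the backup that no longer exist on the NAS, which keeps the copy exact but also means a deletion on the NAS carries over on the next run.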

My SSD/boot drive and the external data drive are backed up to the NAS via Time Machine.

My boot drive is also backed up to a small partition on that external backup drive (the same one the NAS backup goes to) via SuperDuper!, which creates a bootable clone of the drive. That way, if my boot drive fails, I can literally boot this clone and get back to work, losing only the data since my last backup, which happens around midnight daily.

All three drives are set up to back up offsite over the network. I currently use CrashPlan, but I may switch to Backblaze because its client is native and the CrashPlan setup uses a set of Java apps that takes up more memory than I like.

I also have a backup drive that I carry when I’m travelling so I can do backups every night on the road, and I occasionally haul it out and do a backup even when I’m home, giving me a spare copy that isn’t updated regularly in case I run into a problem where things get corrupted and I don’t notice for a while. Just in case, it’s one more copy.

If you count that out, the boot drive data exists in four copies, plus a fifth that’s updated irregularly. The external data drive, which is a Mirrored RAID, is backed up to two places, one of which is also a redundant RAID, so there are three copies of the data but the data exists on five separate disks.

The NAS system is redundant with two copies of every file across the disks, it’s backed up (third copy) to the drive on the computer, and that drive is backed up (fourth copy) to the cloud. I also occasionally make a backup of the NAS data via the Synology native backups and store those offsite at work, although I’m considering shifting that to a safe deposit box.

So, all of the data exists in four or more places and on four or five disk drives at any time. Everything is updated daily, although when I have large data sets the offsite backup can take some time to sync up. Since it exists only for catastrophic backups, I’m comfortable with that.

A note on backing up over the network to a cloud service: some ISPs put data caps on your internet connection. If yours does, doing an online backup could cause you to use more data than the cap allows, and you can find your network throttled to a really slow speed or turned off completely. Before you go online, you need to understand how big the data set you want to back up is, how long it will take to upload, how long it might take to recover if you need to, and whether you have a data cap to worry about. I generally recommend that people consider using these online services to back up the important data, but not everything.
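A rough estimate of upload time and data cap impact takes only a little arithmetic. The sketch below is a back-of-the-envelope calculation, assuming the connection sustains its rated upload speed around the clock and ignoring compression, deduplication, and protocol overhead; the 500 GB, 5 Mbps, and 250 GB figures in the example are hypothetical.

```python
import math


def upload_days(data_gb: float, upload_mbps: float) -> float:
    """Days needed to upload data_gb at a sustained upload_mbps."""
    megabits = data_gb * 8 * 1000   # 1 GB is 8,000 megabits (decimal units)
    seconds = megabits / upload_mbps
    return seconds / 86_400         # seconds in a day


def months_to_stay_under_cap(data_gb: float, monthly_cap_gb: float) -> int:
    """Billing months needed if the whole cap went to the backup."""
    return math.ceil(data_gb / monthly_cap_gb)


print(f"{upload_days(500, 5):.1f} days")
print(months_to_stay_under_cap(500, 250), "months")
```

With those example numbers, the initial upload takes a little over nine days of continuous transfer, and spreading it out to stay under the cap takes at least two billing months. The same arithmetic applies to a full restore, using your download speed.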

How to back up your data

This section assumes you’re using a Macintosh. If you aren’t, there are other equivalent tools you can use to back up your computer, but I’m not the person to tell you which one to use.

Backing up a Macintosh can actually be very simple: use Time Machine. For a lot of people, this will work quite well, and it’s free with all copies of Mac OS X. I use Time Machine for part of my backup system because I like its incremental backups, which let you go back and find a file and its contents as of a given time.

Time Machine’s big weakness is large data sets. Because it’s doing incremental backups, it is going to want a backup drive larger than the amount of data you have created. I’ve found that it works best when the backup drive is at least 2X the data being backed up, and I prefer 3X. This means if you have, say, a 500GB boot drive in a laptop and a FireWire drive with 1.2 Terabytes on it, your total data set is 1.7 Terabytes. Time Machine is going to struggle keeping that backed up on a 2 Terabyte drive, so you really need 3TB for your backup at a minimum. If you update large parts of your stored data, you can really give it indigestion. For instance: take 1000 photos in Adobe Lightroom, assign a new keyword to each, and make sure the updated metadata is written into the DNG files as embedded XMP metadata. You just created a 60-70 gigabyte backup. The larger the data set, the larger the disk Time Machine needs to back it up and work efficiently, and as your data set continues to grow, this is going to be a challenge.
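The sizing rule of thumb and the Lightroom example are both simple multiplication, sketched below in Python. The function names are made up, and the 65 MB average file size is an assumption back-computed from the 60-70 gigabyte figure above, not a measured value.

```python
def time_machine_drive_tb(data_tb: float, factor: float = 2.0) -> float:
    """Suggested backup drive size as a multiple of the data backed up."""
    return data_tb * factor


def changed_data_gb(file_count: int, avg_file_mb: float) -> float:
    """Data that must be copied again when every file is rewritten."""
    return file_count * avg_file_mb / 1000


data_tb = 0.5 + 1.2  # the 500GB boot drive plus 1.2TB on the external drive
print(f"2X guideline: {time_machine_drive_tb(data_tb):.1f} TB")
print(f"3X guideline: {time_machine_drive_tb(data_tb, 3.0):.1f} TB")
print(f"Bulk keyword edit: {changed_data_gb(1000, 65):.0f} GB re-copied")
```

Applied to the 1.7 Terabyte example, the 2X guideline works out to 3.4 TB and the 3X preference to 5.1 TB, so a 3TB drive is best read as a bare minimum. The last line shows why a bulk metadata change hurts: every touched file is copied again in full.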

To create the cloned copy of the boot drive, I use SuperDuper!. This tool makes an exact clone of a disk, one that you can plug into a computer and use without any work, or even boot the computer from. I use it to make bootable copies of my computer’s main drives, so if I lose one, I can clone a copy quickly, or just boot the backup drive and get back to work. And it creates another copy of my data for me (never a bad thing).

Do you need this? How badly do you want to protect your data? How quickly do you want to recover from a drive failure and get back to work? How many hard drives are you willing to buy and manage? If your data is really worth the effort, it’s a good way to create a reliable and quick-to-recover copy of it — but it does entail more time, energy and money. Whether it’s worth it to you is a decision you’ll have to make. It’s worth it to me.

Are Your Backups Good Enough?

Consider this problem: you’re sitting in a cafe working on some images and someone walks by, grabs your computer, and sprints out of the shop before anyone can stop him.

How comfortable are you that you’re going to be able to get your data back onto your new computer and how long is it going to take to get that new computer back into operation?

If you’re currently thinking that maybe some (or all) of your data left the building with that thief, you have some work to do. Since you can’t plan the timing of an event like this, you’d better get started on upgrading your backups right away, right?

One final thought on this: your backups need to be both easy and automated. If backing up or restoring the data is difficult, even if the difficulty is something as simple as needing to remember to plug in a backup disk, you will end up not doing it reliably, especially when you’re busy and on deadline, which is when you need backups most. Some complexity in restoring files or a disk is okay, but the harder it is, the more you’re going to stress out when it happens. Assume you’ll be on deadline, because you probably will be.

So set up your backups so they happen whenever you’re in your primary work situation — plug a backup disk into the USB hub and set up Time Machine, for instance. The less you have to think about the backup happening, the more reliable the backup will be.

At the same time, check your backup every so often to make sure it is, in fact, backing up. The only thing worse than hitting a disk failure and finding out your backup stopped working two weeks ago and you didn’t notice is finding it stopped working a month ago and you didn’t notice. So remember to check them every so often. It also is a good idea to once in a while recover a file from the backups to make sure you can and that the data in them is good.
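One way to make that check less dependent on memory is to have a script complain when a backup looks stale. The sketch below assumes your backup job writes or touches a marker file after each successful run; that marker file, its path, and the two-day threshold are all hypothetical and would need to match your own setup. It does not inspect Time Machine’s own records.

```python
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional


def marker_time(marker: Path) -> Optional[datetime]:
    """Return when the marker file was last touched, or None if missing."""
    try:
        return datetime.fromtimestamp(marker.stat().st_mtime)
    except FileNotFoundError:
        return None


def backup_is_stale(last_backup: Optional[datetime], now: datetime,
                    max_age: timedelta = timedelta(days=2)) -> bool:
    """True if there is no recorded backup or it is older than max_age."""
    if last_backup is None:
        return True
    return now - last_backup > max_age


# Hypothetical marker path written by the backup job after a good run.
marker = Path("/Volumes/NASBackup/last_backup_ok")
if backup_is_stale(marker_time(marker), datetime.now()):
    print("Warning: no successful backup recorded in the last two days")
```

A marker file is used instead of the newest file in the backup folder because tools like rsync preserve the original modification times, so file dates in the backup say nothing about when the backup actually ran.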

And once you know your backups are good? Enjoy that latte knowing you have one less thing you need to worry about, because your data is safe.

Well, safer. In my view, backups are never perfect, but you can make them good enough and sleep better at night.