Data Storage and Management

What is the price of your data? That is, what would you pay to make sure your laboratory data was safely stored and easily accessible? For me as an experimentalist, data is everything. And yet I was continually losing data. Hard drives would fail, or my undergraduate students would forget to back up their data or analysis. So after the seventh hard drive failure—yes, it took seven failures for me to get the message—I got serious about data storage.

At first, I asked other professors at meetings how they stored their data. They had to write the ominous “Data Management Plan” for the NSF just like I did, right? But instead of getting answers, I mostly got horror stories. Apparently other professors are just as incompetent with data as I am! Still, from these conversations, and from others with the information technology (IT) professionals at my small liberal arts college, I began to form a plan.

It turns out that data management is not as easy as it first appears. So here I give you the gist of the issues I faced and the most common solutions, none of which quite works for me. Then I explain the solution I settled on, which is still not ideal.

THE ISSUES

Issue #1: I need to store large amounts of data. I have about 5 terabytes (TB) of data that I am actively working on. I want to be able to analyze this data by pointing my analysis program (MATLAB, IGOR, LabVIEW, etc.) to an appropriate folder on a mapped (local or network) drive. That is, I don’t want to have to download the data and then upload the analysis every time.

Issue #2: I don’t want to have to rely on users in my laboratory to store their own data. I employ undergraduates exclusively in my laboratory, and they may only work for a short period of time over the summer or the winter holidays. I need a system that doesn’t require them to physically do anything to back up their data. Even uploading their data to a network drive can create problems. Data backup should occur seamlessly and in the background.


Issue #3: I want the data to be accessible to multiple users on multiple devices anywhere there is an internet connection. I often work from home and frequently travel. I also tend to collect data at several user facilities at other institutions. Having data folders accessible offsite would be a great benefit to my research productivity.


The above three issues do not appear to be insurmountable, but each of the current solutions has frequent glitches. Below are the three most common solutions laboratories use to back up and store data.


THE CURRENT SOLUTIONS

Solution #1: External Hard Drives. In this solution, everyone in the lab is given two external hard drives and is required to back up their own data. Pros: External hard drives are cheap, and 5 TB models are readily available. Cons: These drives fail often (lifetime ≈ 2 years), and you have to rely on users to back them up. In addition, the drives must be carried from workstation to workstation to make the data accessible to multiple users in multiple places, further limiting their lifetime. This is not going to work!

Solution #2: Network-Attached Storage (NAS). Another solution is to use a network-attached storage (NAS) device. Typically, a NAS setup involves saving data to a networked computer with a series of hard drives. The hard drives are usually configured as a redundant array of independent disks (RAID) to guard against data loss. Pros: A 5 TB version is available, and any user can access the NAS drive from any networked laboratory computer. The NAS can also be set up so that users can use a Virtual Private Network (VPN) connection to access the storage drive from offsite locations, or a website can be set up to allow uploading and downloading of content. Cons: Since the NAS is made up of a series of hard drives, it has parts that will eventually fail and need replacement. (I have been in a lab where the NAS died and the data was gone, even though the RAID was set up properly. Gornisht!) A NAS setup is also more expensive and requires some effort on the part of the lab to maintain. The primary problem, however, is that writing data directly to a NAS drive is slower than writing to the root (e.g., C:) drive. This slow write speed means the data acquisition software will still save data to the root drive, and the user will have to upload their files periodically. Because I want the backup to be seamless, this is not going to work either!

Solution #3: Cloud Storage. Finally, the last solution is to store data in the cloud using a cloud storage provider (Google Drive, Dropbox, Microsoft OneDrive, SugarSync, JustCloud, Box, etc.). Pros: The data can be made available to multiple users on multiple platforms and devices, and the laboratory does not have to maintain the system. In addition, since most options sync to the cloud in the background, the laboratory’s data acquisition software can write data directly to a folder on the root drive that is then automatically saved to the cloud account. Also, cloud storage is incredibly inexpensive; my Google Drive account through my institution has unlimited storage that is free to the institution. Cons: Many storage providers are not private enough for patient data or other sensitive information. In addition, each user of a computer must mount their own copy of the cloud account on the root drive. This means the only way a 5 TB cloud account can be fully mounted is if the computer has 5 TB of free space on the root drive; if another user wanted to log in to the computer and mount the same cloud account, the root drive would need 10 TB of storage. You can of course selectively sync particular folders (easier to do in Dropbox than in Google Drive), but then not all of the data is accessible at once, and you are back to downloading and uploading files. This is not going to work either!

MY CURRENT SOLUTION

The solution that is most appealing to me combines the NAS drive with a cloud storage account. This is no easy feat. Currently, all of the cloud storage providers require you to mount the cloud account to the root drive. If you move the folder linked to the cloud account to a network drive, such as a NAS drive, and the network then loses connectivity, the cloud service assumes the user deleted all the data—and deletes it all in the cloud! Yikes! To work around this unfortunate feature, there are third-party applications (PortableDropboxAHK, Boxifier, etc.) that act as a patch, allowing the cloud storage account to be linked to the mounted NAS drive and preserving that link even when the drive isn’t present. You can even write your own program using symbolic links.
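The do-it-yourself route in that last sentence is simpler than it sounds: replace the folder the sync client watches with a symbolic link that points at the NAS share. Here is a minimal sketch in Python. All paths are hypothetical stand-ins (on a real Windows workstation you would link something like C:\Users\lab\Dropbox to the mapped NAS share, and creating directory symlinks on Windows may require administrator rights):

```python
import os
import pathlib
import tempfile

# Hypothetical stand-ins for the real paths, e.g. on Windows:
#   dropbox_folder = r"C:\Users\lab\Dropbox"      (what the sync client watches)
#   nas_share      = r"\\nas.example.edu\labdata" (the mounted NAS share)
base = pathlib.Path(tempfile.mkdtemp())
nas_share = base / "nas_share"
dropbox_folder = base / "Dropbox"

# The NAS already holds the lab's data.
nas_share.mkdir()
(nas_share / "run001.csv").write_text("t,V\n0,1.2\n")

# Replace the sync folder with a symbolic link to the NAS share, so
# anything the instruments write to the NAS is seen by the sync client.
os.symlink(nas_share, dropbox_folder, target_is_directory=True)

# Reading through the link is transparent.
print((dropbox_folder / "run001.csv").read_text())
```

The third-party tools mentioned above essentially automate this link plus the bookkeeping needed to pause the sync client when the network drive disappears, which is exactly the part a bare symlink does not handle.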


Given this ability, my current solution is to use a Dropbox Business account and a NAS drive with 5 TB of storage. (I chose Dropbox because of its strong customer support, the ability to selectively sync folders, the ability to mount both personal and business Dropbox accounts simultaneously, and the availability and compatibility of third-party software, including smartphone and tablet apps.) To set up this system, I commandeered one computer in the lab and mounted my Dropbox Business account to the NAS using the third-party software (PortableDropboxAHK). This computer’s only job is to keep the data on the NAS drive and in the cloud in sync. On the data acquisition computers, I mount the Dropbox Business account to the C: drive and selectively sync just the data folder for that instrument. This part of the workflow lets the data acquisition software write data to the C: drive and have it synced automatically to the cloud (which then pushes it to the NAS drive). Each of these workstations has a single user logged on permanently. The rest of the computers in the lab have access to all the data through the NAS drive; these can be maintained as multiple-user or single-user stations. On these computers, users are told to write files only to the NAS drive, which then syncs to the cloud.

I like this setup so far, but I am wary. The good news is that the IT professionals at my institution manage the NAS, which is an allotment of hard-drive space on a server they already maintain, and the cloud storage account manages the syncing of data to and from the NAS for the lab. In addition, the cloud storage account can be accessed from virtually anywhere, using a web login, desktop, laptop, smartphone, or tablet. A laboratory fire or monsoon would not destroy the data! But I have to set up two data management systems, and I am worried about relying on third-party software that is not officially supported. I know my data is worth the time to get the right solution. I hope this is it. Stay tuned!
