Yesterday I met with execs from a company called Gluster, which is developing an open-source, software-only, scale out NAS system for unstructured data. As we discussed their market, products and competitors, we got into the nitty gritty of their technical differentiation as well – pasted below is an extended Q&A with CTO and co-founder Anand Babu Periasamy about Gluster’s way of handling metadata, most often a bugaboo when it comes to file system scalability.
Beth Pariseau: So as I’m sure you’re aware, there are many scale-out file system products out there addressing unstructured data growth. What’s Gluster’s differentiation in this increasingly crowded market?
ABP: What we figured out was that centralized and distributed metadata both have their own problems, so you have to get rid of them both. That’s the most important advantage when it comes to the Gluster file system. The reason why we got to a production-ready stage very quickly – we wrote the file system in six months and took it to production, because a customer had already paid for it, and they had a desperate need to scale with seismic data that was very critical, and they could no longer reason with that data because it was all sitting on tapes. I looked around, there was no file system around – the file systems they had used before were for scratch data, they had found a scalability advantage [to scale-out], but the problem was metadata.
The problem with the metadata is if you have centralize metadata it becomes a choke point, and distributed gets extremely complicated, and the problem with both is if your metadata server is lost, your data is gone, it’s finished…became very clear we had to get rid of the metadata server. The moment you separate data and metadata you are introducing cache coherency issues that are incredibly complicated to solve. By eliminating the need to separate data and metadata we made the file system resilient. On the underlying disks, you can format them with any standard file system – we don’t need any of its features. We just want the disk to be accessible by a standard interface, so even tomorrow if you don’t like the Gluster file system or there is a serious crash in your data center, you can just pull the drives out, put them in a different machine and have your data with you – you’re not tied to Gluster at all. Because we didn’t have any metadata the data can be kept, as files and folders, the way users copied it onto the global namespace.
Within the file system the scalability problem became seamless because we didn’t have to put a lock on metadata and slow down the whole thing, we can pretty easily scale because every machine in the system is self-contained and intelligent, equally, as all other machines. So if you want more capacity, more performance, you just throw more machines at it, and the file system pretty much linearly scales, because there’s nothing centrally holding the scalability.
BP: So it’s an aggregation of multiple file systems rather than one coherent file system that has to maintain consistency?
ABP: No. The disk file system is just a matter of formatting the drives. The Gluster file system is a complete storage operating system stack. We did not rely on the underlying operating system at all, because we figured out very quickly [things like] memory manager, volume manager , software RAID, we even already support RDMA over 10 Gigabit Ethernet or InfiniBand, you pretty much have the entire storage operating system stack that’s a lot more scalable than a Unix or Linux kernel. We treat the underlying kernel more like a hypervisor, or a microkernel and don’t rely on any of its features. By pushing everything to user space we were able to very quickly innovate new complicated things that were not possible before and pretty much scale very nicely across multiple machines.
Gluster VP of marketing Jack O’Brien: The three big architectural decisions we made early on…one is that we were in user space rather than kernel space and the second is that rather than having a centralized or distributed metadata store, there’s this concept called elastic cacheing where essentially you algorithmically determine where the data lies and the metadata is with the data rather than being separated. And the third is open source.
BP: Did you see the EMC announcement about VPlex or are you familiar with YottaYotta and what they did with cache coherency, having a pointer rather than having to make sure all data is replicated across all nodes? Is it similar to that?
ABP: What it sounds like they’re describing is basically asynchronous replication with locking, that’s how they bring you the cache coherency issue. But what I explained was, the file system is completely self-contained and distributed so we don’t have to handle the cache coherency issue. The cache coherency issue comes when you separate the data and metadata so when you’re modifying a file you have to hold the lock until a change appears…because we don’t have to hold metadata separately, we don’t have to hold the lock in the first place because we don’t have the cache coherency issue.
JO: Another way to think of it is, every node in the cluster runs the same algorithm to identify where a file’s located and every file has its own unique filename. The hash translates that into a unique ID—
BP: Oh, so it’s object based.
ABP: It is a hashing algorithm inside the file system, but for the end user it’s still files and folder.
BP: But this is how Panasas is, too, right? Underneath that file system interface to the user they have an object-based system with unique IDs..
ABP: But those IDs are stored in a distributed metadata server. We don’t have to do that.
JO: Our ID is part of the extended attributes of the file itself.
ABP: The back end file name is already unique enough, you don’t really need to store it in a separate index in a separate metadata server, we figured out we can come up with a smarter approach to do this. The reason [competitors] all had complications is because they parallelized it at the block layer, basically they took a disk file system and parallelized it, it’s a very complicated problem…you should parallelize at a much higher layer, at the VFS layer, and have a much simpler, more elegant approach.
JO: So a node doesn’t have to look up something centrally and it doesn’t have to ask anybody else in the cluster. It knows where the file’s located algorithmically.
BP: I think that’s what’s giving me the ice-cream headache here. So each node has a database within it? The thing I’m sticking on is ‘algorithmically knows where to look for it?
ABP: At the highest level …given a file name and path that’s already unique, if you hash it it comes out to a number. If you have 10 machines, the number has to be between 1 and 10. No matter how many times from wherever you calculate it, you get the same number. So if the number is, for example seven, then the file has to be on the seventh node on the seventh particular data tree. The problem in hashing is when you add the 11th and 12th node you have to rehash everything. Hashing is a static logic, as you copy more and more data you can easily get hot spots and you can’t solve that problem. The others parallelize at the block layer and put the blocks across. Because we solve the problem at the file level, if you want to find a file…internally what happens is the operating system sends a lookup call…to verify whether the home directory exists and [the user] has the necessary permissions…and then it sends an open call on the file. Internally what happens is by the time the directory calls come, the call on the directory…has all the information about the file properties. We also send information about a bit map.
Instead of taking a simple plain hash logic which cannot scale…you don’t have to physically think that you only have 10 machines. You can think logically, mathematically, you can think you have a thousand machines, there is nothing stopping you from doing that, it’s the idea of a virtual storage solution. It’s like with virtual machines, you may have only 10 machines but you think you have a thousand virtual machines, so we mathematically think we have a thousand disks. It can be any bigger number, and the actual number is really big. Then we present each logical disk as a bit, so the entire information is basically just a bit array, and the bit array is stored as a standard attribute on the data tree itself. By the time the OS or application tries to open the file, a stat call comes and the stat call already has this bitmap, and the hash logic will index into a virtual disk which really doesn’t exist, it could be some 33,000th cluster disk. And whichever directory wants that bit, you know that the file is in that machine, and don’t need to ask the metadata server, ‘tell me where my block is, hold the lock on the metadata because I need to change this bit.”
BP: But then if two people want to write at the same file at the same time…
ABP: We have a distributed locking mechanism. Because the knowledge of files is there across the stack, we only had to write a locking module that knows how to handle one file.