Many of us have had the experience of preparing a file – say, an important presentation – on one computer, where it looks and performs beautifully, and then loading it on a different computer only to have it glitch, look strange, or not function at all.
Now imagine that on the scale of a major big data project for a large corporation.
The problem is real: a slightly different version of the application or operating system between the development environment and the one the software eventually runs in can cause big problems, leading to expensive delays and fixes.
And this is where data containers come in.
What are data containers?
A container is an application, including all its dependencies, libraries and other binaries, and the configuration files needed to run it, bundled into a single package that can be moved, in total, from one computing environment to another.
A container might be used when moving from a developer’s laptop to a testing environment, from that testing environment to live production, or even from a physical machine to a virtual machine in the cloud. It can be used to get around differences in operating systems, software versions, infrastructure, security protocols, and storage.
In fact, their flexible and portable nature often makes them very well suited to cloud-based applications, which has certainly contributed to their rise in popularity among IT systems architects. Many think that as computing and storage move further into the cloud, containerisation will become an increasingly important tool.
Data containers are a separate technology from virtualisation, though they are based on some of the same principles. With virtualisation, an entire machine is replicated, up to and including the operating system, and the resulting image can be several gigabytes in size. By contrast, a data container shares an operating system with any other container on the same machine, so it may be only tens of megabytes in size and is therefore much lighter and more resource-friendly.
Data containers do not need to be provided with virtual memory and system resources in the same way as virtual machines, so they consume less processing power when running. They also boot and load faster. While a typical server at a web-scale enterprise might be expected to support 10 or 15 virtual machine environments, the same server might run hundreds of containerised applications. Crucially, containers are also far easier to transfer from one environment to another.
Another important distinction is that virtual machines must be provided with dedicated memory and storage resources, while data containers can share them. Many containers can run on a single operating system, yet to a user each container looks and behaves as if it owns the entire operating system. Because containers must also be able to interact with the outside world, they can be networked and can share data with one another.
Why should you use data containers?
A data container can be created that allows multiple application containers to access the same data. These application containers can be created, moved, or destroyed without affecting the original data. This lets the application containers be treated as “stateless”: the data itself persists and will be identical no matter how many times it is used across different operating systems and applications. This is an important development for organisations wanting to run multiple tests or analyses with persistent data. It also eliminates the problems that arise when an entire application is set up in one environment and moved to another.
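The relationship between a shared data container and the application containers that use it can be sketched in a few lines of Python. This is a toy in-memory model, not the API of any real container runtime: the names `DataVolume`, `AppContainer` and `demo` are hypothetical and chosen only for illustration. It shows one application container writing data, being destroyed, and a second one still seeing the same data.

```python
from dataclasses import dataclass, field


@dataclass
class DataVolume:
    """Hypothetical stand-in for a data container: it only holds records."""
    name: str
    records: dict = field(default_factory=dict)


class AppContainer:
    """Hypothetical application container that mounts a shared volume."""

    def __init__(self, name: str, volume: DataVolume):
        self.name = name
        self.volume = volume  # a reference to the shared data, not a copy

    def write(self, key: str, value: str) -> None:
        self.volume.records[key] = value

    def read(self, key: str) -> str:
        return self.volume.records[key]


def demo() -> dict:
    volume = DataVolume("shared-data")

    # The first application container writes into the shared volume.
    loader = AppContainer("loader", volume)
    loader.write("report", "q3-figures")

    # Destroy the first application container; the volume is untouched.
    del loader

    # A second application container mounts the same volume and sees the data.
    analyser = AppContainer("analyser", volume)
    return {
        "seen_by_analyser": analyser.read("report"),
        "records": dict(volume.records),
    }
```

The design point is that the application objects hold only a reference to the data, so they can be discarded and recreated freely while the data lives on independently.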
It’s also this facet of their nature that makes them particularly suited to deploying microservices, where large-scale applications are built from a number of components, each one being a separate and distinct application in itself. This approach to software engineering allows applications to be scaled quickly, by updating existing components or adding new ones, while ensuring that the parent application as a whole remains stable.
A notable example of the large-scale adoption of containers in a cloud service is provided by Spotify. It recognised the advantages of this technology in late 2013, when it deployed the open-source container management platform Docker in order to reduce coding workload and CPU overheads. Google is another large-scale user of containers, reportedly launching around two billion every week.
Another advantage of a containerised approach to data is the potential it offers for more comprehensive governance. Laws pertaining to data rights and privacy are in a state of flux. Containerised data can be packaged with information about who does or does not have the right to access the data, and which purposes it can be used for.
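As a rough illustration of data that travels with its own access rules, the following Python sketch bundles a payload with a policy naming the permitted users and purposes. It is a minimal sketch under stated assumptions, not a description of how any particular container platform implements governance: `AccessPolicy`, `GovernedData`, `can_read` and the example user and purpose names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPolicy:
    """Hypothetical policy: who may access the data, and for what purposes."""
    allowed_users: frozenset
    allowed_purposes: frozenset


@dataclass(frozen=True)
class GovernedData:
    """Hypothetical bundle of a payload and the policy that travels with it."""
    payload: bytes
    policy: AccessPolicy

    def read(self, user: str, purpose: str) -> bytes:
        # Both the requesting user and the stated purpose must be permitted.
        if user not in self.policy.allowed_users:
            raise PermissionError(f"{user} may not access this data")
        if purpose not in self.policy.allowed_purposes:
            raise PermissionError(f"purpose '{purpose}' is not permitted")
        return self.payload


def can_read(data: GovernedData, user: str, purpose: str) -> bool:
    """Return True if the policy allows this user and purpose."""
    try:
        data.read(user, purpose)
        return True
    except PermissionError:
        return False


# Example bundle: only "alice" may read it, and only for "analytics".
bundle = GovernedData(
    payload=b"customer-records",
    policy=AccessPolicy(frozenset({"alice"}), frozenset({"analytics"})),
)
```

Because the policy is part of the same immutable object as the payload, moving the bundle to another environment moves the rules with it. In this toy model the check lives inside the bundle itself; a real system would need to enforce it somewhere the application cannot bypass.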
Why might you choose not to use data containers?
Of course, because they share an operating system, data containers are somewhat less secure than virtual machines, and so are unlikely to replace virtualisation entirely any time soon. For certain data, containers may therefore not be suitable; great care would have to be taken when storing personal medical or financial information.
An inherent flaw in the concept of containers is that data could possibly be leaked through security flaws in the shared operating system. There is also the possibility that a malicious or poorly coded application sharing the same OS could give rise to a security threat.
Due to these inherent disadvantages, many see virtualisation and containerisation as complementary, not competing, technologies. Nor are they mutually exclusive: a virtual machine can itself run containerised applications. The tools required for containerisation are not yet as advanced as those for running VMs, but they are certainly gaining ground, and containerisation is quickly becoming an efficient and reliable option for forward-thinking application architects.