When it comes to coding Big Data and analytical applications, a select group of programming languages have become the default choices.
This is because their feature sets make them well suited to handling large and complicated datasets. Some were designed with statistical or numerical computing in mind from the start, and a broad developer ecosystem has evolved around all of them. This means there are extensions, libraries and tools available for performing just about any analytics function you might need.
R, Python and the relative newcomer Julia are currently three of the most popular programming languages chosen for Big Data projects in industry today. They have a lot in common, but there are important differences which have to be considered when deciding which will get the job done for you. Here’s a brief introduction to each of them, as well as some ideas about applications where one may be more suitable than the others.
R, which has been around since 1993, has long been considered the go-to programming language for data science and statistical computing. It was designed first and foremost to carry out matrix calculations – standard arithmetic functions applied to numerical data which is arranged in rows and columns.
R can be used to automate huge numbers of these calculations, even when the row and column data is constantly changing or growing. It also makes it very easy to produce visualisations based on these calculations. The combination of these features has made it an extremely popular choice for crafting data science tools.
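To make the idea of a matrix calculation concrete, here is a minimal sketch. It is written in plain Python rather than R, and the data values and the helper names column_means and scale are made up for illustration. It applies the same arithmetic to every column, then reruns the calculation unchanged after a new row arrives.

```python
from statistics import mean

# Hypothetical data: each row is one observation, each column one measurement.
rows = [
    [2.0, 10.0],
    [4.0, 20.0],
    [6.0, 30.0],
]


def column_means(table):
    """Return the mean of each column of a row-major table."""
    # zip(*table) regroups the values column by column.
    return [mean(column) for column in zip(*table)]


def scale(table, factor):
    """Multiply every cell by the same factor (element-wise arithmetic)."""
    return [[value * factor for value in row] for row in table]


before = column_means(rows)

# New data arrives: append a row and rerun the same calculation as-is.
rows.append([8.0, 40.0])
after = column_means(rows)

doubled = scale(rows, 2)

print(before, after, doubled[-1])
```

In R, operations like these are built into the language and apply directly to whole vectors and matrices, which is a large part of its appeal for this kind of work.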
Because R has been around for a while it has a large and active community of users and enthusiasts. They’ve spent the last couple of decades building extensions and libraries which increase the scope of what the language can do, make it simpler for the user to access its functions, or automate monotonous jobs.
Popular examples include SparkR, which provides access to Apache Spark, and ggplot2, which provides visualisations. An extension has also recently been announced that will allow IBM’s Watson cognitive computing engine to be accessed through R.
The fact is, though, that in becoming the ultimate programming language for statistical applications, R can sometimes fall flat in other areas. Other languages competing for developers’ affections – including those mentioned below – are often more generalised. Because of this, a common approach is to build the framework of an analytical application in R, taking advantage of its modular nature and support infrastructure. Once a solution – such as a working analytics engine – has been devised, the code might be recreated in another, more general-purpose programming language to complete the application’s production.
Python is far more general-purpose than R, and will be more immediately familiar to anyone who has used object-oriented programming languages before.
Python’s sheer popularity has helped cement its place as the second most popular tool for data science – and although it may not be quite as widely used as R, its user base has been growing at a greater rate. It’s certainly easier to get to grips with than R if you haven’t already got a solid background in statistical computing.
This user base has devoted itself to producing extensions and libraries that help Python match the usefulness of R when it comes to data wrangling. One of the first was the NumPy extension, which gives it many of the same matrix-based capabilities as R. This attracted coders interested in analytics and statistics to the language, and over the years has led to more and more complex functions and methodologies being developed.
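As a rough illustration of the matrix-style work NumPy makes possible, the sketch below uses a small made-up dataset (the values and variable names are assumptions for the example only). It shows column-wise statistics, element-wise arithmetic across a whole array, and a matrix product, each in a single expression.

```python
import numpy as np

# Hypothetical 3x2 dataset: rows are observations, columns are variables.
data = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])

# Mean of each column (axis=0 collapses the rows).
col_means = data.mean(axis=0)

# Broadcasting subtracts the column means from every row at once.
centred = data - col_means

# The @ operator performs matrix multiplication: (2x3) @ (3x2) gives 2x2.
gram = data.T @ data

print(col_means)
print(centred)
print(gram)
```

None of these lines needs an explicit loop, which is the style of calculation R users are accustomed to.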
Because of this, Python has become a popular choice for applications using the most cutting-edge techniques, such as machine learning and natural language processing. Open source libraries such as scikit-learn and the Natural Language Toolkit (NLTK) make it relatively simple for coders to put these technologies to work. PySpark gives it access to the Apache Spark framework. However, if you’re only interested in more traditional analytical and statistical computing, then you may find that R presents a more complete and integrated development environment than Python.
R and Python are still the reigning champions when it comes to data and analytics-oriented programming languages, but there are several other languages which are attracting attention for their suitability in this field.
One that is certainly worth a mention is Julia. It has only been in development for a few years but is already proving itself to be a popular choice. Like Python and R, Julia is built for scalability and speed of operation when handling large data sets. It was designed with a “best of all worlds” ethos: the idea was that it would combine the strengths of other popular analytics-oriented programming languages. One key influence was the widely used numerical computing language MATLAB, with which it shares much of its syntax.
Julia has specific features built into the core language, such as parallelisation and in-database analytics, that make it particularly suitable for working with the real-time streams of Big Data that industry increasingly wants to process. The fact that code written in Julia executes very quickly adds to its suitability here.
In a head-to-head comparison with R or Python, Julia’s youth is its Achilles’ heel. The ecosystem of extensions and libraries is not as mature or developed as it is for the more established languages. However, it is getting there, and most of the popular functions are available, with more emerging at a steady rate.
The Right Tool for the Job
It may seem that R would be the natural choice for running large numbers of calculations against high-volume datasets, Python the go-to for advanced analytics involving AI or ML, and Julia a natural fit for projects involving in-database analytics on real-time streams.
In reality, the nuanced differences between the languages, and the environments they provide to the programmer, mean there’s rarely a one-size-fits-all solution. It’s also worth remembering that their open nature (they are all open source projects) means that they don’t pretend to live in isolation. The active communities behind each language frequently cooperate to port functionality between them, and extensions can be used to run code written in one language from within another.
All of the languages here are living projects which are constantly evolving and being updated to be capable of new things. Each has its strengths and weaknesses, but they are all robust choices for enterprise initiatives involving Big Data and analytics.
Bernard Marr is a bestselling author, keynote speaker, and advisor to companies and governments. He has worked with and advised many of the world's best-known organisations. LinkedIn has recently ranked Bernard as one of the top 10 Business Influencers in the world (in fact, No 5 - just behind Bill Gates and Richard Branson). He writes on the topic of intelligent business performance for various publications including Forbes, HuffPost, and LinkedIn Pulse. His blogs and SlideShare presentations have millions of readers.