WHEN IT COMES TO ANALYTICS, GO OPEN-SOURCE

Updated: Feb 3


Should you go open-source, really? Let's dive into it.


For the past few months, I have been listening to the Data Engineering Podcast, with Tobias Macey. It is astounding how many startups and open-source solutions there are. The people interviewed are brilliant and are solving real problems.


I have found that the worlds of data management and analytics are starting to blend. I believe the trends driving this are:

  1. We want to aim for a single source of truth (because we will never have a single version of the truth)

  2. Massive data sets (IoT, voice, documents, etc.) are now the norm

  3. We would rather not move the data around, and instead leverage it "in place"

  4. We have multiple use cases for the same data (reporting, ML, AI)

So we are moving to a distributed data platform ecosystem (multi-cloud and on-premise), where the responsibilities for creating data pipelines and leveraging the results are also spread across the organization. We see a lot of different tools that do relatively the same thing, each better suited to a particular use case. So, going commercial can start getting expensive.


A TDWI survey states that 45% of organizations are using both open-source AND commercial software in this ecosystem. TDWI further states in its Q1 2020 Pulse Report on the subject that (this will not come as a surprise) R and Python are the de facto open-source standards once we want to begin leveraging data science and machine learning. Next, Spark ML libraries and TensorFlow are also at the top. Then there is PostgreSQL.


The benefits of open-source are indisputable:

  • Low cost

  • Latest algorithms

  • Coders prefer them

  • Pre-built templates for fast development

  • No vendor lock-in

However, commercial software also has its strengths:

  • Best practices are built-in

  • Support

  • Compliance

Moreover, commercial products often differ from their open-source alternatives by focusing on enterprise-desirable features such as:

  • End-to-end functionality

  • Ease of use

  • Putting models in production

The report goes on to outline the various factors involved in using open-source and commercial software.

Image courtesy of TDWI


So while the cost is a definite consideration, the cost of ownership is not equal to the cost of acquisition or annual licensing. We need to be extremely careful about dismissing all commercial solutions; some of the better ones often propose a compelling vision and a bring-it-to-market focus to which only the greed of capitalism can give birth.
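To make the distinction between acquisition cost and cost of ownership concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `CostProfile` fields, the cost categories, and all of the dollar figures are made up and do not come from the TDWI report. The outcome depends entirely on the inputs you plug in for your own organization.

```python
from dataclasses import dataclass


@dataclass
class CostProfile:
    """Hypothetical yearly cost inputs for one tool; all figures are illustrative."""
    name: str
    acquisition: float        # one-time purchase or setup cost
    annual_license: float     # recurring license or subscription fee
    annual_support: float     # vendor support contract or in-house support time
    annual_operations: float  # hosting, maintenance, integration work


def total_cost_of_ownership(profile: CostProfile, years: int) -> float:
    """Add the one-time cost to the recurring costs over the given number of years."""
    if years < 0:
        raise ValueError("years must be non-negative")
    recurring = (
        profile.annual_license
        + profile.annual_support
        + profile.annual_operations
    )
    return profile.acquisition + recurring * years


# Made-up numbers, chosen only to show that a zero license fee
# does not by itself guarantee the lower total.
open_source = CostProfile(
    "open-source tool",
    acquisition=0,
    annual_license=0,
    annual_support=40_000,
    annual_operations=60_000,
)
commercial = CostProfile(
    "commercial tool",
    acquisition=20_000,
    annual_license=50_000,
    annual_support=0,
    annual_operations=30_000,
)

for profile in (open_source, commercial):
    total = total_cost_of_ownership(profile, years=3)
    print(f"{profile.name}: {total:,.0f}")
```

With different assumptions about support and operations effort, the comparison can just as easily tip the other way, which is the point: the license line alone does not settle the question.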

©2020 by Modern Data Analytics