Everyone wants to future-proof their data platform choices. For analytics, there are three significant infrastructure trends to consider:

  • In the last five years, the move from Hadoop to cloud services to Kubernetes

  • Data governance, cataloging, and lineage: managing data is top of mind

  • The advance of AI-specific infrastructure stacks

Data infrastructure is changing rapidly, and there seems to have been a three-phase transition from Hadoop to cloud services to hybrid/Kubernetes environments. Hadoop stood alone a few years ago, but today industry watchers have pronounced it dead. When Hadoop came about, the cloud was not a real option, data was on-premise, and network latency was a problem; the world has since changed. The trouble has shown up in the market, with the major vendors in that space fading into the background. But Hadoop has been deployed so widely that we will still see it around for a while, just not in new data platform strategies. The cloud is part of Fortune 1000 plans, and we see Microsoft data shops moving to Azure and AWS growing like crazy for analytics deployments. AWS's revenue grew by 46% from 2017 to 2018, and Azure's by 73%.

But cloud costs are beginning to worry CIOs and CDOs. We read stories like Capital One's cloud bill growing 60% from 2017 to 2018 ($200M+ by the way, wow). So the cloud offers a lot of agility, but it does come at a cost. And what about lock-in? It is not surprising, then, that projects like Kubernetes, which support hybrid architectures, are so popular. By containerizing workloads and services, organizations can combine public cloud, private cloud, and on-prem infrastructure, choosing the best place and tools for each workload and, by the same token, optimizing performance and costs.
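To make the "best place for each workload" idea concrete, here is a minimal sketch in Python. It assumes each environment can be summarized by an hourly cost and a latency to the data, which is a big simplification. The `Environment` class, the `cheapest_placement` function, and all the numbers are hypothetical and only there for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    name: str
    cost_per_hour: float  # hypothetical price in USD
    latency_ms: float     # hypothetical latency to the data


# Made-up figures; real prices and latencies vary widely.
ENVIRONMENTS = [
    Environment("public-cloud", cost_per_hour=1.0, latency_ms=80.0),
    Environment("private-cloud", cost_per_hour=2.0, latency_ms=20.0),
    Environment("on-prem", cost_per_hour=3.5, latency_ms=5.0),
]


def cheapest_placement(environments, hours, max_latency_ms):
    """Return (environment, total_cost) for the cheapest environment
    whose latency does not exceed max_latency_ms."""
    eligible = [e for e in environments if e.latency_ms <= max_latency_ms]
    if not eligible:
        raise ValueError("no environment meets the latency limit")
    best = min(eligible, key=lambda e: e.cost_per_hour * hours)
    return best, best.cost_per_hour * hours


if __name__ == "__main__":
    env, cost = cheapest_placement(ENVIRONMENTS, hours=10, max_latency_ms=25.0)
    print(env.name, cost)
```

A relaxed latency limit lets the cheapest environment win, while a tight one pushes the workload closer to the data at a higher price. That is the cost/performance trade-off a hybrid setup is meant to leave open.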

Kubernetes is also becoming an exciting choice for machine learning: data scientists can choose their preferred language and libraries, and they do not have to be infrastructure experts to run their workloads where they need to, whether that is small data on their own PC or five years of data in the cloud. That is flexibility. Kubeflow is now picking up steam for deep learning, too.

So where will the machine learning market go? Toward cloud ML platforms, or toward a world where Kubernetes is pervasive and we have large cloud data warehouses like Snowflake? That may work for the data science crowd, but if we want to bring machine learning to the masses, this will still be too complex. That creates an opportunity for some vendors to provide complete platforms (SaaS in the cloud, but supporting hybrid environments) to enable citizen data scientists.

The Data & AI landscape is not impervious to the serverless trend either, which is an attempt to simplify all this infrastructure complexity. I read about companies like Nuclio and Algorithmia, which are relatively new but offer exciting options.

Now that data placement can be optimized across on-prem, private cloud, and multi-cloud environments, organizations have a more pressing need to find, take control of, curate, and trace all their data. So data cataloging, data lineage, and data governance tools are getting a lot more attention. Another type of tool we see emerging is the query-anything tool, such as Starburst and PartiQL. This is not virtualization, but then again, it is similar, no?
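To give a feel for what "tracing" data means, here is a toy lineage sketch in Python. It assumes lineage can be recorded as a simple mapping from each dataset to the datasets it is derived from. The dataset names and the `upstream_sources` helper are hypothetical and do not reflect how any particular catalog or lineage product works.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets it is derived from.
LINEAGE = {
    "sales_dashboard": ["sales_mart"],
    "sales_mart": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["crm_export"],
}


def upstream_sources(dataset, lineage):
    """Return every dataset that `dataset` depends on,
    directly or indirectly, as a sorted list."""
    seen = set()
    queue = deque(lineage.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        queue.extend(lineage.get(current, []))
    return sorted(seen)


if __name__ == "__main__":
    print(upstream_sources("sales_dashboard", LINEAGE))
```

Walking the graph upstream from a dashboard answers the governance question "where did this number come from?", wherever the underlying datasets happen to live.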

Finally, ML engineers need to be able to run experiments and rapidly iterate, accessing resources such as GPUs when required. So another category that is emerging is MLOps/AIOps, which Algorithmia is part of. Others include Spell, Weights & Biases, Pachyderm, Seldon, Snorkel, and MLeap. We even see the rise of GPU databases (Brytlyt, OmniSci) and the birth of AI chips (Graphcore, Cerebras, etc.).

And what will happen when quantum computing is usable? One thing is for sure: to future-proof is to keep things flexible and diversify, a bit like managing risk in your retirement portfolio.

©2020 by Modern Data Analytics