Databricks is no longer playing David and Goliath
Imitation being the sincerest form of flattery pretty well summarizes the challenges of running an open source software business. Over the past four to five years, Apache Spark has taken the big data analytics world by storm (for fans of streaming, no pun intended). As the company whose founders created and continue to lead the Apache Spark project, Databricks has differentiated itself as the company that can give you the most performant, up-to-date, Spark-based cloud platform service.
In the interim, Spark has continued to be the most active Apache open source project, based on the size of its community (over a thousand contributors from 250 organizations) and the volume of contributions. Its claim to fame has been a simplified compute model (compared to MapReduce or other parallel computing frameworks), heavy leverage of in-memory computing, and the availability of hundreds of third-party packages and libraries.
Spark has become the de facto standard embedded compute engine for tools performing anything related to data transformation. IBM has given the project a bear hug as it rebooted its analytic suite with Spark.
But as a measure of its maturity, there is now real competition. Most of the competition has been over libraries and packages, where R and Python programmers had their own preferences. There has also been competition in streaming, where a mix of open source and proprietary alternatives supported true streaming while Spark Streaming itself was based on microbatching (that's now changing). More recently, Spark is seeing renewed competition on the compute front, as emerging alternatives like Apache Beam (which powers Google Cloud Dataflow) are positioning themselves as the onramp to streaming and high-performance compute.