If you want the story of how a group of seven PhD students and professors at Berkeley came to co-found a fast-growing data and AI company, then this article is for you.
If you just want the list of co-founders of Databricks, then here you are (in alphabetical order):
- Ali Ghodsi
- Andy Konwinski
- Scott Shenker
- Ion Stoica
- Arsalan Tavakoli
- Patrick Wendell
- Reynold Xin
Read on to learn more about the fascinating origins and founders of Databricks.
Databricks was founded by a group of academics within a lab at the University of California, Berkeley, that was dedicated to big data, machine learning, and analytics.
The lab was called the AMPLab and was designed to foster collaboration amongst professors and PhD students on innovative projects.
Three professors at AMPLab were co-founders of Databricks. Let’s look at them in turn.
Ion Stoica grew up in Romania, where he completed his undergraduate degree at the University of Bucharest. He came to the United States to pursue a PhD in computer science at Carnegie Mellon, which he finished in 2000.
Stoica joined the faculty at Berkeley that year and did extensive research in cloud computing and distributed systems. He was appointed as co-director of the AMPLab.
Aside from his academic interests, Stoica had an entrepreneurial bent. He co-founded Conviva in 2006 with his PhD supervisor at Carnegie Mellon. The company provided analytics for real-time video distribution.
Ali Ghodsi’s family fled from Iran to Sweden as refugees when he was five years old. Ali grew up in Sweden where he completed a PhD in distributed computing in 2006 at Stockholm’s Royal Institute of Technology. He joined the faculty at the Institute as Assistant Professor in 2008.
The following year, Ghodsi was invited to join Berkeley’s AMPLab to collaborate with Ion Stoica for a year. He was so enthused by the computing research that he ultimately stayed on.
Scott Shenker is the odd man out amongst his co-founders in that he didn’t get a PhD in computer science. He followed in his father’s footsteps and earned his doctorate in physics in 1983 from the University of Chicago. His research was in chaos theory.
Scott’s brother Stephen also studied physics but stayed in the field. Stephen Shenker’s research in string theory is renowned.
After Scott completed his PhD, he joined the Xerox PARC research center. This is where he got interested in computer networking. He spent the nineties doing extensive research in processor scheduling and distributed computing.
Shenker joined the faculty at Berkeley in 2002. A prodigious researcher, he is one of the most cited authors in computer science.
Like Stoica, Shenker was also a successful entrepreneur. He co-founded Nicira in 2007, a company that specialized in network virtualization.
Ben Horowitz, a leading venture capitalist, was an early investor in his company. This would be important to the Databricks story.
Four of the co-founders were postgraduate students who met while pursuing their PhDs at the AMPLab.
If you asked me to pick one of the seven co-founders who “sparked” everything off, it would have to be Matei Zaharia. So, let’s start there.
Matei Zaharia is of Romanian heritage and attended high school in Toronto. After graduating from the University of Waterloo, he joined the AMPLab in 2009 to start a PhD in distributed computing. His advisor was Ion Stoica.
Zaharia had already worked on several projects using Apache Hadoop and MapReduce. At AMPLab, he started collaborating with two other students on a class project that they called Mesos.
Mesos was a cluster manager designed to run different analytics engines. Benjamin Hindman, Andy Konwinski, and Matei were the three students who started the project.
We won’t meet Ben Hindman again in this narrative. The other two students moved away from Mesos, but Hindman stayed with it. He eventually co-founded a company called Mesosphere to commercialize the code base.
After Andy Konwinski graduated from the University of Wisconsin-Madison, he enrolled at Berkeley to do a master’s in computer science. He then joined AMPLab in 2007 to pursue a PhD.
Konwinski, Matei Zaharia, and Hindman were the original collaborators on the Mesos project.
Reynold Xin did his undergraduate degree at the University of Toronto. He joined AMPLab as a PhD student in 2010, the year after Zaharia enrolled.
Xin also worked for data infrastructure teams at Google that year.
Patrick Wendell graduated from Princeton in 2011 and joined AMPLab to do a PhD in scheduling optimization. Ion Stoica was his PhD supervisor.
Zaharia and Konwinski had already moved on from Mesos to focus on another project (yes, that was Spark, and we’re coming to it).
Wendell spent a few months working on Mesos as it was winding up as a mature project. But he was already casting glances at the cool new project that the others were working on.
Arsalan Tavakoli joined the AMPLab several years earlier than the others. He started his PhD under Scott Shenker in 2005 and completed it in 2009. That was the year that Zaharia started at the lab.
Tavakoli joined McKinsey in 2010 and spent four years with them as a consultant. However, he also kept working on open-source research projects, including a notable one at AMPLab.
Okay, I’ve kept having to hint at the advent of Spark. Let’s get into it now.
Spark As A Staging Area To Databricks
We have a separate article that goes in depth into the origins and creators of Spark. I’ll repeat some of the events here.
While working on the Mesos project, Matei Zaharia wanted to show that the cluster could support a machine learning engine. He built Spark as that engine.
The AMPLab was all about collaboration. Lester Mackey was a PhD student specializing in AI. He was part of a team that entered a machine learning competition sponsored by Netflix.
Mackey wanted a faster platform to process the ML models than Hadoop was offering.
He got talking with Zaharia, who helped set him up on the new Spark engine that ran on top of Hadoop. This powered Lester’s team to second place.
Zaharia quickly realized that his engine could combine distributed batch processing, streaming, and interactive queries within the same model. This brought a whole new level of excitement to the Spark project in the lab.
Gravitating to Spark
Andy Konwinski moved away from the Mesos project to work with Zaharia on developing Spark.
Patrick Wendell started contributing to Spark halfway through his first year at the lab. His code went into the second public release of the code base.
Reynold Xin built the graph processing library in Spark, known as GraphX. You may not be familiar with it if you don’t work with networks and graphs.
But if you’ve ever typed a few lines of Spark code, then you’ll be familiar with DataFrames. Xin co-designed this crucial piece of the data processing technology.
Arsalan Tavakoli was working with McKinsey but he also contributed to the Spark code base.
Zaharia and his fellow PhD students had no intention of monetizing Spark directly. Zaharia originally released it under the open-source BSD (Berkeley Software Distribution) license and then donated it to Apache.
As the lab students continued to develop Spark in those early years from 2010, they found that large enterprises weren’t yet interested in switching from Hadoop to their new project.
But they also noticed that the same enterprises were finding it increasingly challenging to build Hadoop clusters while supporting a myriad of AI and ML tools.
There was a shared vision of the need for abstraction and simplification amongst a core set of professors and newly minted PhDs at the AMPLab.
The seven academics increasingly discussed this vision through 2012 over meals at inexpensive restaurants near the Berkeley campus.
This could have remained as just talk. But professors Scott Shenker and Ion Stoica had already founded successful companies. Tavakoli had his experience as a McKinsey consultant, while Ghodsi had started a small company in Sweden.
This brought some business nous to the group.
Enter Ben Horowitz
Now comes a key part of the Databricks origin story. I’ve already mentioned that Scott Shenker’s company had received investment from Ben Horowitz, an investor with deep pockets.
Shenker invited Horowitz to meet the other six academics at Berkeley. Horowitz was late to the meeting (he blamed the traffic around the campus). But the investor was impressed.
Horowitz put $14 million on the table on condition that the academic group formed a company with a proper commercial structure. Some of the group had been lukewarm about taking on an outside investor. But this was an offer they couldn’t refuse.
And so…Databricks was born.
The company was formed in 2013 and released its first product in 2014.