What is Data Science?
The best starting point is to clarify what a data science platform is and what it can be used for. There is so much buzz around data science and its close analogs machine learning, deep learning, and AI that what these terms actually mean continues to perplex many.
Data science is a technical discipline that uses code and data to build models that can make predictions about the real world. Machine learning and deep learning are techniques for building data science models.
What kinds of things can data science predict?
Models built by data scientists are used to predict actual events in the real world. Examples of the kinds of questions that data science models can answer include:
- How can I reduce customer churn?
- How can I optimize my supply chain to ensure on-time delivery?
- How can I ensure that the right site visitors are seeing the right ad?
- How many units of product x will I sell next year?
- How can I spot errors on health insurance claims?
- How can I detect a data breach at the moment it happens?
Some questions, however, are harder to answer than others. The customer churn question is relatively common and not difficult to model, because the inputs to a churn model are well understood and limited in number. Predicting earthquakes, by contrast, is a far harder problem, because the precursor events that might signal a major earthquake are not well understood. Scientists have looked at many potential inputs such as radon gas concentrations, electromagnetic activity, and even animal behavior, but the causal relationship between these potential precursors and a major earthquake is too weak to support a robust predictive model.
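To make the churn point concrete, here is a minimal sketch of why the problem is tractable. The features and data below are entirely synthetic and hypothetical (tenure, monthly spend, support tickets are stand-ins for the "well understood, limited in number" inputs), and scikit-learn's logistic regression is just one reasonable baseline:

```python
# Minimal churn-model sketch on synthetic, hypothetical data. The point:
# with a handful of well-understood inputs, a simple classifier is a
# plausible starting point for churn prediction.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Synthetic features: tenure (months), monthly spend, support tickets.
X = np.column_stack([
    rng.integers(1, 73, n),      # tenure in months
    rng.uniform(10, 120, n),     # monthly spend
    rng.poisson(1.5, n),         # support tickets filed
])
# Synthetic label: short tenure and many tickets raise churn probability.
logits = -0.05 * X[:, 0] + 0.6 * X[:, 2] - 0.5
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```

A real churn model would of course use real customer data and more careful feature engineering, but the shape of the problem is exactly this simple.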
When do I need a platform?
Since there are many open-source data science tools like Python, R, MLlib, and Jupyter, most commercial data science platforms are not trying to replicate their capabilities. Rather, they aim to integrate these tools into a single platform. Examples of these integrated platforms are Domino Data Lab, IBM Watson Studio, and Cloudera Data Science Workbench. Users of these platforms are clear about the value over open-source tools:
“One single IDE (browser-based application) that makes Scala, R, Python integrated under one tool” — Brad C., in a review of Cloudera Data Science Workbench
I asked Southard Jones, Vice President of Marketing at Domino Data Lab (a data science platform vendor), how buyers should go about selecting a platform. In his view, there are two main vectors to consider: the commonality of the problem, and the scale of the data science program in the enterprise.
The first thing to determine is whether a data science platform is necessary at all, or whether open-source tools are sufficient:
- Common problems like customer churn do not require a data science platform. Off-the-shelf models exist and can be adapted to a specific context, and many open-source data science tools are designed to build models for relatively common or simple problems.
- The second vector is the level of maturity of the data science program within an enterprise. If there is a single data scientist, it may be sufficient to build models in open-source tools like Python, Jupyter, or Scala; again, a platform is not really required.
In fact, the majority of data scientists use open-source products like the three aforementioned, in addition to R, Apache Spark, H2O, Apache Hive, and Apache Pig. Many also use predictive analytics tools like SPSS.
However, if there is a data science team and data science is used to solve a variety of different kinds of problems, a platform approach makes sense, according to Jones. Using a commercial platform, data scientists can choose any language and package that they wish to use.
Stephen Smith, Research Leader at the Eckerson Group, has a similar view. He argues that for a data science team to be truly effective, a platform is needed to automate as many repetitive operational tasks as possible. He agrees that it’s about scale, and provides a list of typical operational issues that usually indicate the need for a platform:
- You don’t know how many models you have
- You’re not sure how much you spent on AWS last month
- Feature creation is taking a long time
- There is a disconnect between data engineers and data scientists
- Your data scientists are spending a lot of time babysitting their models after they are deployed
Smith and Jones both emphasize the importance of collaboration capabilities once a program begins to scale.
If you are evaluating data science platforms, check out our TrustMap of products in the category, and read reviews. Reviews from end users of these products can be particularly helpful in understanding the kinds of problems they are being used to solve, and how well they perform in different scenarios.
Here are three of the big trends that have begun to emerge as a result of enterprise adoption of these platforms:
With the introduction of cloud-based platforms, data scientists can also share components of their work with colleagues or collaborate with them securely on specific tasks. For data science to be successful, models must be iterated constantly with feedback from the business user. Platform tools typically include the ability to retrain models based on direct feedback from the business person with a real problem to solve. Buyers should ask vendors about their collaboration capabilities. Domino Data Lab, RapidMiner, DataRobot, and others all contain collaboration tools to ensure that models are informed by the insight of the ultimate owner of the problem to be solved, but buyers should compare capabilities in this area.
Model deployment is also an important platform capability. Until recently, much of the focus has been on the technology and how to accelerate model development. But there is little point in accelerating model development if deployment is a bottleneck.
Most data scientists are very proficient at things like data ingestion, cleaning, manipulation, visualization, and modeling. However, deployment is typically not a strength because the skill set is quite different. Model deployment is a software engineering discipline. Several vendors are building simplified deployment capabilities into their tools to try to bridge this knowledge gap. Again, buyers should compare these deployment capabilities with available internal skill-sets in mind.
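The handoff that platforms simplify can be sketched in a few lines. This is an illustrative sketch only, not any vendor's mechanism: the "training side" serializes a fitted model, and the "serving side" loads it and wraps it in a plain prediction entry point (the packaging, monitoring, and infrastructure around that entry point are the software-engineering work the text describes). The `predict` function and `model.pkl` filename are hypothetical names for illustration:

```python
# Illustrative deployment handoff: a data scientist serializes a trained
# model; the serving side loads it and exposes a simple predict function.
# Platforms automate the packaging and infrastructure around this step.
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Training side: fit and serialize the model artifact.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Serving side: load the artifact once, expose a prediction entry point.
with open("model.pkl", "rb") as f:
    served_model = pickle.load(f)

def predict(features):
    """Return a predicted class label for a single observation."""
    return int(served_model.predict([features])[0])

print(predict([5.1, 3.5, 1.4, 0.2]))  # prints 0 (setosa, the first iris sample)
```

Even this toy version hints at the engineering concerns involved: artifact versioning, input validation, and the security caveats of `pickle` all become the serving side's problem.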
Enterprises with models in production are starting to think about compliance risk, financial risk, and even bias risk following the Cambridge Analytica debacle.
Many organizations now explicitly have a governance role in the data science team. This person’s primary function is to build confidence in the underlying data sets, build confidence in the model, and minimize risk. They do this by questioning assumptions and identifying discrepancies.
Another key factor in minimizing risk is helping to provide transparency into how a model works. Most models today are black boxes to everyone except the person who built them. It will become crucial for model building to include things like automated documentation and explainer functionality (like Local Interpretable Model-Agnostic Explanations, or LIME) that will provide an inside view of how a specific model actually works. Buyers should not neglect this important capability when evaluating platforms.
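LIME itself lives in the third-party `lime` package; as a dependency-light sketch of the same idea, the example below uses scikit-learn's permutation importance instead. This is a stand-in, not LIME: permutation importance gives a global (whole-model) rather than local (per-prediction) explanation, but it illustrates what explainer functionality reveals about a black box:

```python
# Explainability sketch: shuffle each feature in turn and measure the
# drop in held-out accuracy. Large drops mark the features a "black box"
# model actually relies on. (A global stand-in for LIME-style explainers.)
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0
)
# Report the three features the model leans on most.
top = result.importances_mean.argsort()[::-1][:3]
for i in top:
    print(f"{data.feature_names[i]}: {result.importances_mean[i]:.3f}")
```

A governance role, as described above, would use output like this to question whether the model's most influential inputs make sense for the problem, and to spot inputs that might encode bias.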