It's impossible to discount the extraordinary success of data discovery and visualization tools, and the emancipation of business analysts from old-school ETL processes and data modeling. This revolution has democratized data and greatly accelerated the speed of data analysis to help companies make data-driven decisions in fast-moving, highly competitive environments. The structured world of data warehouses and ETL processes as the single source of truth within the enterprise has been permanently challenged.
The hard distinctions between the various types of BI tools described in the section above are becoming less significant, as all vendors orient their products away from the IT-centric model towards a more agile, self-service approach. For example, full-stack vendors have all now built data discovery and visualization tools as a component of their platforms, and these new capabilities are improving all the time:
- SAP Lumira started life as SAP Visual Intelligence, was renamed SAP Lumira in 2013, and more recently became SAP BusinessObjects Lumira. The product has gone through several enhancement cycles and has become a solid discovery and visualization tool for SAP BusinessObjects users who use the two products together. Additionally, SAP has built an entirely new cloud-based BI platform called Cloud for Analytics, incorporating not just standard BI components, but also business planning, predictive analytics, and Governance, Risk, and Compliance (GRC).
- MicroStrategy 10 and its subsequent point releases have significantly improved the discovery and visualization tool, Analytics Desktop, since its initial release in 2013. The company has also updated the licensing model, usability, and governed data discovery capabilities.
- IBM released Watson Analytics at the end of 2014. This is an entirely new cloud-based analytics platform designed for natural language queries, pattern detection, and data discovery with advanced analytics. The product has already amassed a significant user base through the freemium sales model and is starting to acquire paying customers. IBM is now applying Watson user experience and design principles to the Cognos platform, and the latest release of Cognos has been given a new name: Cognos Analytics.
- SAS released Visual Analytics in 2012, and this has now become the flagship product. The Enterprise BI Server product is still available for large deployments but is no longer a focus for the company.
- Microsoft released the second major version of Power BI in 2015, and continues to enhance its capabilities. The product is a cloud-based data discovery and visualization framework with both desktop and browser-based authoring. It has pre-built connectors to 60+ different data sources.
Conversely, data discovery and visualization vendors are being pushed in the opposite direction. Products like QlikView/Qlik Sense, Tableau and TIBCO Spotfire comprised the first entirely successful attempt to wrest analytics away from the control of the IT department and make these capabilities available to business users, who no longer have to rely on the IT department for data analysis. The IT department typically has sophisticated data management expertise, and for this reason has functioned as the de facto BI service bureau for business departments that do not necessarily have those skills. But as business has sped up and the urgency of understanding data has increased, the IT department has become a bottleneck hindering the ability of business units to make data-driven decisions.
Ironically, the very success of this data discovery and visualization movement has sown the seeds of reaction. A common scenario is that multiple groups within an organization purchase more and more seats for a data discovery and visualization tool that began as a small departmental purchase. This inevitably leads to calls for enterprise licenses, and once a product becomes a standard offering deployed across the enterprise, the IT organization is inevitably involved again. Data discovery and visualization tool vendors have responded by building enterprise features like security, data governance, data preparation, and even report generation into newer versions of their products in order to satisfy the requirements of the IT department.
“Qlik and Tableau started by selling into the business side directly, many times by-passing IT. It became clear that multiple instances could cause fractured environments with inconsistent results produced. Each department could develop its own numbers which often did not mesh. There was no single source of data. Eventually customers wanted enterprise licenses, which means the vendor had to deal with IT. The IT department gets involved and brings in questions about governance, metadata, administrative functions. That's when Qlik and Tableau had to add more sophisticated data management functions and features. Qlik bought Expressor and integrated it into their Sense product for easier data integration / metadata management and Tableau now has its own data preparation and metadata management capabilities. Both companies have enhanced their governance and audit capabilities as well. They are now able to satisfy most of the IT requirements for enterprise implementations.”
“Tableau and Qlik are now being forced down the same road as traditional vendors, with increasing demands for enterprise level data governance, and even report generation. Ultimately, the question of whether the data is clean and reliable becomes central.”
The ascendance of big data and Hadoop is another major thread in the development of the business intelligence landscape.
Hadoop Adoption Rate
2015 saw increased adoption of Hadoop and Hadoop-related tools. This is not exactly a new technology: Hadoop has been around for 10 years, and is still only being used by a relatively small number of early adopters. However, a Hadoop adoption survey based on 2,200 responses conducted by AtScale indicates that of those who already use Hadoop, 76% plan on doing more within the next 3 months. Of those who have not yet deployed Hadoop, almost half say that they plan to do so within the next 12 months. In addition, 94% of respondents are bullish about their ability to achieve value from Hadoop. Hadoop does look as if it's poised for significant growth and adoption.
How Does Hadoop Differ from a Traditional Data Warehouse?
Hadoop is often erroneously thought of as a database. It is, in fact, an ecosystem of open-source components including MapReduce, the Hadoop Distributed File System (HDFS), the HBase NoSQL database, along with other databases, and many other packages facilitating import and export of data into and out of the HDFS. All of this software is deployed and run on inexpensive commodity hardware—usually many different servers—to cope with the massive volumes of data.
One of the most significant differences between Hadoop and a relational data warehouse is the way in which the data is stored. In a data warehouse, the data is carefully structured and organized before it is stored, so the data is highly structured and easily accessible through well-constructed queries. A Hadoop data lake by contrast contains large volumes of raw, unstructured data, which can be analyzed by business analysts and data scientists without the constraints of any preconceived structure being imposed on the data.
The terminology often used to describe this difference is “Schema-on-Write” versus “Schema-on-Read”. In Schema-on-Write, the data is mapped and parsed before being written into predefined columns and rows in the warehouse. Conversely, in Schema-on-Read, analysts can use tools like Hive, Spark and other similar tools to analyze the data in its native format. Another way of putting this is that ETL is performed on the fly. There are advantages and disadvantages to both approaches, but one of the big advantages of Schema-on-Read is the ability to analyze raw, unstructured data without being slowed down by an existing structure or schema that may inhibit creativity and flexibility.
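The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample events, the `purchases` table, and both function names are hypothetical, and an in-memory SQLite database stands in for a warehouse while a plain list of JSON lines stands in for files in a data lake. It does not show how Hive or Spark work internally.

```python
import json
import sqlite3

# Hypothetical raw events, one JSON document per line, as they might land in a data lake.
RAW_EVENTS = [
    '{"user": "ann", "amount": "19.99", "device": {"os": "ios"}}',
    '{"user": "bob", "amount": "5.00"}',
    '{"user": "cy", "clicks": 3, "device": {"os": "android"}}',
]


def load_schema_on_write(lines):
    """Schema-on-Write: parse and map each record into a fixed table before storing it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (user TEXT NOT NULL, amount REAL NOT NULL)")
    for line in lines:
        record = json.loads(line)
        if "amount" not in record:
            continue  # records that do not fit the predefined schema are dropped
        conn.execute(
            "INSERT INTO purchases VALUES (?, ?)",
            (record["user"], float(record["amount"])),
        )
    return conn


def query_schema_on_read(lines, field_path):
    """Schema-on-Read: leave the raw lines untouched and impose structure at query time."""
    values = []
    for line in lines:
        node = json.loads(line)
        for key in field_path:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break  # this record does not contain the requested field
        if node is not None:
            values.append(node)
    return values
```

In this sketch the Schema-on-Write path answers its one planned question (purchase totals) quickly but has discarded the click event, while the Schema-on-Read path can still answer an unplanned question such as `query_schema_on_read(RAW_EVENTS, ["device", "os"])` at the cost of parsing the raw data on every query.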
Why Hadoop Matters
In the first edition of this guide, we described the problem that big data vendors are trying to solve: how to harness the terabytes of unstructured data like streaming data, video data, machine data, etc. to improve business decision-making and business outcomes. Business data is no longer only collected from internal operational and transactional systems, but also from physical devices like sensors and machines, and from human sources like social media, image designers, etc. The relational data warehouse was designed for highly structured data stored in tables, and cannot comprehend this kind of unstructured data, or this volume of data. Hence the rapid ascension of Hadoop and of so-called “data lakes,” or vast repositories of raw data stored in its native format until needed.
BI tools must all now be capable of ingesting and analyzing this data, often in conjunction with more organized, structured data. Virtually all BI vendors now integrate with Hadoop in some fashion, and many legacy BI vendors have formed partnerships or acquired vendors in the big data space to be able to tap into this new data universe. Notable big data acquisitions include:
- Teradata acquired four Hadoop-related companies in 2014: Think Big, RainStor, Hadapt, and Revelytix
- IBM acquired two healthcare big data companies, Explorys and Phytel, in 2015 as it builds out healthcare big data analytics capabilities on Watson. It also acquired Cleversafe, a big data storage product.
- Microsoft acquired Metanautix, a startup designed to help people crunch big data, in December of 2015.
More deals of this kind are likely in the second half of 2016 and in 2017.
The two trends just described, a shift away from IT-managed BI deployments towards agile data discovery and visualization tools and an increasing emphasis on schema-less Hadoop data lakes, have both led to a third major trend: data preparation and machine learning.
The task of data preparation used to be performed by IT departments running Extract, Transform and Load (ETL) processes. ETL extracts the data from the various data sources and loads it into a data warehouse, where it is normalized by organizing it into tables, while cleaning the data and removing redundancy and inconsistencies. Once it has been appropriately structured, it is then available for querying and analysis. In the new self-service, agile world, this paradigm no longer holds. As data becomes more democratized, one of the biggest challenges for business users trying to make sense of data for analysis is that the data must first be prepared. Data from multiple different sources has to be integrated and cleaned before any analysis can occur. How successful less technical business users (rather than data analysts and data scientists) can be at this task is a matter of debate. But given the scarcity of technical data scientists and analysts, the goal of many vendors is to create a kind of “ETL light” that requires the minimum amount of expertise in order to be successful.
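To make the ETL steps described above concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the two source extracts, the `COUNTRY_CODES` lookup, the `customers` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the warehouse. Real ETL tools handle far larger volumes and many more edge cases.

```python
import sqlite3

# Hypothetical extracts from two source systems with inconsistent formatting.
CRM_ROWS = [
    {"email": "Ann@Example.com ", "country": "usa"},
    {"email": "bob@example.com", "country": "US"},
]
BILLING_ROWS = [
    {"email": "ann@example.com", "country": "United States"},
    {"email": "cy@example.com", "country": "DE"},
]

# Hypothetical lookup used to resolve inconsistent country spellings.
COUNTRY_CODES = {"usa": "US", "us": "US", "united states": "US", "de": "DE"}


def extract():
    """Extract: gather rows from every source system."""
    return CRM_ROWS + BILLING_ROWS


def transform(rows):
    """Transform: clean values, resolve inconsistencies, and remove duplicates."""
    cleaned = {}
    for row in rows:
        email = row["email"].strip().lower()
        country = COUNTRY_CODES.get(row["country"].strip().lower(), "UNKNOWN")
        cleaned[email] = country  # a later row for the same email replaces the earlier one
    return sorted(cleaned.items())


def load(records):
    """Load: write the structured records into a warehouse table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (email TEXT PRIMARY KEY, country TEXT NOT NULL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", records)
    conn.commit()
    return conn


def run_etl():
    return load(transform(extract()))
```

Even in this toy version, most of the effort sits in the transform step, deciding that "usa" and "United States" mean the same thing and that two differently cased emails are one customer. That judgment is what an “ETL light” tool has to make easy for a non-specialist.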
Data Preparation for Data Discovery and Visualization Products
Data discovery and visualization tools like Tableau and Qlik have typically relied on third-party tools like Alteryx to clean and prepare data for analysis. Indeed these product types are quite complementary, and the vendors have lead sharing agreements in place. However, data discovery software vendors are increasingly developing their own data integration and data cleaning capabilities. For example, Tableau 9 introduced some Excel-based data preparation capabilities that are a first step in that direction. Qlik has always had the ability to perform data loads and basic data preparation tasks through scripting, but non-technical users have been forced to rely on third-party tools. TIBCO Spotfire 7.6 has what it calls visual “in-line data wrangling” functions that let users perform data preparation functions while performing their analysis, an approach the company believes is more useful than workflow-style preparation tools. Ultimately though, all of these vendors are likely to build robust data preparation capabilities into their products so that users are not forced to purchase separate products to perform this mechanical but crucial process.
But the jury is still out on the ability of business users to use these tools successfully.
“Tableau and Qlik are terrific additions to the marketplace; they expanded the ability of people to be able to analyze data. But while these tools have made a great leap forward, I don't see a comparable leap on the data side – if you look at wrangling tools and data prep tools, they are still aimed at techies, or IT people, or consultants like me. We're not really quite there yet.”
A more likely scenario is that data analysts or even data scientists will still be the primary users of these tools, at least in the medium term. The increasing prevalence of machine learning technology may eventually bridge this gap, but this is still an issue in the current environment.
Data Preparation for Big Data
Big data represents an even bigger challenge. The explosion of interest in big data technology, as organizations begin to understand the potential competitive advantage to be gained from analyzing massive quantities of unstructured data, has triggered a burst of innovation across the entire analytics landscape. As pointed out, data warehouses and big data stores like Hadoop are vastly different, but the need to prepare the data for analysis is equally crucial for both scenarios. This is and has always been an arduous task and frequently takes longer than the time required to actually analyze the data.1
One direct result of this is the emergence of a new class of data preparation tools like the aforementioned Alteryx, specifically designed for big data preparation. Trifacta, Paxata, and Tamr are some of the newer entrants, in addition to more established vendors like Informatica, Datawatch, and IBM.
When IT was the steward of enterprise data, specialists handled data preparation and integration as part of the ETL process. The difference today is that data preparation tools are no longer being designed for the IT specialist, but rather for the self-service data analyst or business user.
Machine learning is a relatively new data analysis method that has become a hot topic in analytics generally, but is also getting a lot of attention in the context of data preparation. As data analysts and even business users with limited data management expertise are now frequently performing data preparation on the fly, it becomes critically important to build software that can intuit and understand large volumes of data automatically. Machine learning technology uses algorithms that are capable of learning iteratively. Given that business users and data analysts now have to perform their own data preparation without assistance from IT, modern data preparation tools have started to build machine learning under the hood, in order to make things as easy as possible for non-experts to perform things like data integration and format conversions on their own. For example, the software can suggest tactics to users for blending data or other data preparation scenarios based on what others have done before. By drawing on a library of past actions, the software is capable of guiding the user to accomplishing tasks that might otherwise have been too complex or require the assistance of a data expert.
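The idea of suggesting preparation steps from a library of past actions can be illustrated with a deliberately simple sketch. It uses plain frequency counting as a stand-in for the learning component, and it is not how any particular vendor implements the feature. The action log, the column-type labels, the step names, and the function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical log of past user actions: (inferred column type, preparation step applied).
PAST_ACTIONS = [
    ("date_string", "parse_to_date"),
    ("date_string", "parse_to_date"),
    ("date_string", "split_on_delimiter"),
    ("currency_string", "strip_symbol_and_cast"),
    ("currency_string", "strip_symbol_and_cast"),
    ("free_text", "trim_whitespace"),
]


def build_suggestion_index(actions):
    """Count how often each preparation step was applied to each column type."""
    index = defaultdict(Counter)
    for column_type, step in actions:
        index[column_type][step] += 1
    return index


def suggest_steps(index, column_type, top_n=2):
    """Return the steps most often applied to this column type, most frequent first."""
    counts = index.get(column_type)
    if not counts:
        return []  # nothing in the library of past actions for this kind of column
    return [step for step, _ in counts.most_common(top_n)]
```

A tool built along these lines would call something like `suggest_steps(build_suggestion_index(PAST_ACTIONS), "date_string")` when a user selects a column and offer the results as one-click options. A production system would presumably replace the counter with a trained model and richer features, but the principle of guiding the user from prior behavior is the same.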
Many large enterprises are making big bets on Hadoop as a critically important data framework for the future. This technology will be indispensable for managing huge volumes of unstructured and semi-structured data in a cost-effective and highly flexible environment. However, it should not be understood as a replacement for the traditional data warehouse. This is not a rip-and-replace technology. According to data warehouse pioneers like Barry Devlin and Ralph Kimball, the two technologies will exist side-by-side for the foreseeable future. Hadoop will be used for mass storage of unstructured data for predictive and exploratory data discovery. Data warehouses, particularly those running against very fast, massively parallel processing (MPP) relational databases like Redshift, Vertica, Netezza, and Google BigQuery, will remain the best infrastructure for structured reporting, which is still the lifeblood of most organizations. As mentioned above, a new breed of data warehouse automation tools has mitigated some of the drawbacks of setting up and managing these data stores.
Ralph Kimball described this coexistence succinctly in a webinar in 2014: “Everyone has now realized that there's a huge legacy value in relational databases for the purposes they are used for. Not only transaction processing, but for all the very focused, index-oriented queries on that kind of data, and that will continue in a very robust way forever. Hadoop, therefore, will present this alternative kind of environment for different types of analysis for different kinds of data, and the two of them will coexist.”2
“The data warehouse is where you create production analytics. These use trusted, high quality data, the data you count on for KPIs, regulatory and compliance reports, precise financial analyses, and so on. But there is also a need for an experimental environment. This environment uses (big) data without formally vetting it, without rigorous data quality processing or even data integration processing. Data analysts and data scientists just want to experiment with the data, try different analytical techniques, perform general, unplanned queries and analyses. This environment is what I call the Investigative Computing Platform. It is not as rigorously controlled as the data warehouse and has more flexible governance and schema support. These two environments (the production data warehouse and the more experimental investigative platform) are diametrically opposed to each other but serve important purposes in the world of analytics. Will there be a single technology that can handle them both? The technologies are certainly moving in that direction, but perhaps not today; I believe it will be a while before we see the data warehouse and the experimental environment fully supported in a single technological environment.”
The data infrastructure of the future will include a variety of data repositories including relational database servers, Apache Hadoop, and other NoSQL platforms, interlinked by a metadata catalog defining the characteristics and context of all the data in each store.
Data warehouses and standard BI reporting are not going to disappear any time soon.
2 Building a Hadoop Data Warehouse, Dr. Ralph Kimball, 2014
- Traditional, data warehouse-based BI systems, sometimes referred to as “legacy BI,” are unlikely to be at the top of the modern BI buyer's list, at least to begin with. Although these systems are remarkably effective within their domain, they have traditionally not been able to supply the agility and exploratory freedom that is required by today's business environment. However, many of these vendors are remaking their product suites to provide more agile capabilities and reduce dependence on the IT department. These products are likely to remain good options for larger enterprises, as the vendors continue to redesign them to meet the needs of the modern agile enterprise.
- Data discovery and visualization systems provide the agility and freedom missing from traditional tools, but have their own shortcomings. These very powerful tools tend to lack data preparation and data governance capabilities which are required if they are to be used as the primary BI system across an enterprise. The addition of a data preparation tool will almost certainly be needed to blend and structure data from multiple sources, although eventually these capabilities are likely to be built-in.
- It is almost certainly unwise to swap out data warehouses and their associated BI systems for the new world of Hadoop and data lakes. The big data Hadoop infrastructure is designed to house high-velocity and high-volume unstructured data generated by machines. However, Hadoop makes little sense, at least today, as a repository for business-critical, highly structured, core business data, which is best stored in a data warehouse or other structured data store.
A number of BI products are not discussed in this guide because TrustRadius does not have adequate data. Some of the omissions include:
- Actuate OpenText Analytics
- IBM Watson Analytics
- Information Builders
- Microsoft Power BI
- Oracle BI Foundation Suite
- Qlik Sense
- SAS Enterprise BI Server
- SAS Visual Analytics