Integrating Analytic Data Using Data Virtualization
April 29, 2013
More Data = Better Analysis
The analytic data domain includes all kinds of data, including:
- traditional enterprise data from sources like the Enterprise Data Warehouse;
- big data from sources like Hadoop;
- cloud data from SaaS applications and public providers;
- social media data from sites like Facebook and Twitter; and
- personal/desktop data from spreadsheets and flat files.
Data virtualization is well suited to integrating all of these sources, thereby improving analysis and business insight.
Enterprise Data
The effective analyst needs access to all enterprise data, not just the data in the warehouse. While most of this data is relational, it is dispersed across a collection of silos, each with its own data model. The diversity of connectivity options, authentication protocols, SQL dialects, and data models makes enterprise data harder to leverage than it should be.
Data virtualization simplifies access to enterprise data by providing built-in connectivity to most enterprise data management platforms and a standard SQL interface for querying them. It also provides tools to discover relationships among data entities in different silos.
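The idea can be sketched in miniature. Here two in-memory SQLite schemas stand in for separately modeled silos (a hypothetical CRM and ERP), and one federating connection plays the role of the virtualization layer, letting a single standard SQL statement span both; a real data virtualization server would federate live, heterogeneous sources instead.

```python
import sqlite3

# Attach two in-memory schemas as stand-ins for two data silos.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS erp")

# Each silo has its own data model (tables and values are illustrative).
conn.execute("CREATE TABLE crm.customers (cust_id INTEGER, name TEXT, region TEXT)")
conn.executemany("INSERT INTO crm.customers VALUES (?, ?, ?)",
                 [(1, "Acme", "West"), (2, "Globex", "East")])

conn.execute("CREATE TABLE erp.orders (order_id INTEGER, cust_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO erp.orders VALUES (?, ?, ?)",
                 [(10, 1, 250.0), (11, 1, 75.0), (12, 2, 90.0)])

# One standard SQL query spans both silos, as the analyst would see it
# through the virtual layer.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM crm.customers c JOIN erp.orders o ON o.cust_id = c.cust_id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 325.0), ('Globex', 90.0)]
```

The analyst writes ordinary SQL against one endpoint; the federation across sources is the layer's job, not theirs.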
Big Data
Hadoop is fast emerging as a leading repository for big data analytics. However, the map-reduce paradigm used to interact with Hadoop data sources is not well understood in typical enterprise IT organizations. This may not be a problem when performing specialized analytics, but it can be a big barrier when trying to combine Hadoop and enterprise data using enterprise IT standard languages such as SQL.
Data virtualization overcomes the query language challenge by integrating and extending Hive, thus providing a unified SQL-based approach for querying both enterprise and Hadoop data sources. Complex data digestion and reduction is still done by map-reduce, but the reduced results can then easily be combined with other data through data virtualization.
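The division of labor looks roughly like this. The snippet below is an illustrative sketch, not Hive itself: step 1 stands in for a map-reduce reduction over raw events, and step 3 approximates in Python the SQL join an analyst would express through the virtualization layer. The event records and product table are invented for the example.

```python
from collections import defaultdict

# 1) Map-reduce stand-in: reduce raw clickstream events (key, 1 pairs)
#    to a count per product, as a Hadoop job would.
events = [("p1", 1), ("p2", 1), ("p1", 1), ("p3", 1), ("p1", 1)]
clicks = defaultdict(int)
for product_id, n in events:
    clicks[product_id] += n

# 2) Enterprise reference data, e.g. a dimension table from the warehouse.
products = {"p1": "Widget", "p2": "Gadget", "p3": "Sprocket"}

# 3) The combination step: join the reduced Hadoop output with the
#    enterprise data, as one SQL query through the DV layer would.
report = sorted((products[p], c) for p, c in clicks.items())
print(report)  # [('Gadget', 1), ('Sprocket', 1), ('Widget', 3)]
```

The heavy lifting stays in Hadoop; only the already-reduced result crosses over to be joined with enterprise data.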
Cloud Data
Many organizations leverage SaaS platforms like Salesforce.com, which results in a growing body of valuable data stored in the cloud. In addition, more and more data from third-party providers is becoming available to companies looking to leverage specialized data sets. Both types of data must be accessed across the Internet through service protocols.
Data virtualization provides access to most cloud-based data sources through standard SOAP and REST protocols, and leverages other web service standards to complete the picture. Data virtualization also facilitates the querying, transformation, and caching of this data to make it suitable for analytics.
Social Media Data
Facebook, Twitter, and other social media sites hold a tremendous amount of data that can be useful for customer analytics. Unfortunately, this data is difficult to access through standard protocols, and acquiring the appropriate authorizations is also difficult.
Data virtualization can access and integrate this social media data through third-party data providers like Gnip. The combination of Gnip, which is officially authorized by the social media sites to distribute their data, and data virtualization, which can access and transform the data, brings social media data to the analyst’s desktop.
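The transformation step typically means flattening nested social activity records into columns. The record below is a simplified, invented activity for illustration only; real feeds from providers such as Gnip have richer, different structures.

```python
import json

# Simplified, made-up social activity record (not a real feed format).
activity = json.loads("""
{"actor": {"displayName": "jdoe", "followersCount": 120},
 "body": "Loving the new widget!",
 "postedTime": "2013-04-29T10:15:00Z"}
""")

# Flatten the nested structure into one tabular row for the analyst.
row = (activity["actor"]["displayName"],
       activity["actor"]["followersCount"],
       activity["postedTime"],
       activity["body"])
print(row)  # ('jdoe', 120, '2013-04-29T10:15:00Z', 'Loving the new widget!')
```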
Personal/Desktop Data
Although the bulk of the data an analyst works with comes from elsewhere, there are often local spreadsheets and flat files that an analyst would like to use in conjunction with the rest of the data. These files are an important tool in the analyst's arsenal for creating specialized data sets that augment the analysis being done.
With data virtualization, analysts can easily access and integrate Excel and flat-file data. Because this data is often untyped or text-only, data virtualization lets the analyst transform and cast it into an appropriate form.
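The casting problem is easy to picture with a small flat file. This sketch reads an in-memory CSV (the file contents and column names are invented) and converts each text field to a working type before it would be combined with other sources.

```python
import csv
import io

# Stand-in for a local flat file: everything arrives as untyped text.
raw = io.StringIO("region,quarter,target\nWest,Q1,125000\nEast,Q1,98000\n")

# Cast each column to an appropriate type for analysis.
typed = [(rec["region"], rec["quarter"], int(rec["target"]))
         for rec in csv.DictReader(raw)]
print(typed)  # [('West', 'Q1', 125000), ('East', 'Q1', 98000)]
```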
If you want to learn more about how data virtualization can help with analytics, check out these white papers: