Architecting Analytic Data Hubs

I Love Solution Architecture

I have been doing enterprise architecture for a long time. From my early days as a consultant at Accenture, to my role as CTO of webMethods, to the past several years working with Composite Software’s largest enterprise customers, I have found enterprise architecture a wonderful challenge.

A High Standard

Along this journey, I have developed a high standard for elegance and clarity. Unfortunately, far too often I find architectures lacking in both.

The biggest problem is a lack of precision with respect to design principles. Fuzzy design principles lead to even fuzzier architectures.

So I was pleased to read Rick Sherman’s latest white paper on Analytic Data Hub design entitled Analytics Best Practices: The Analytical Hub. He does a fine job providing the elegance and clarity I appreciate.

Read Rick’s guidance for yourself:

Analytic Data Hub Design Principles

“When creating analytical hubs, follow these design principles to provide the right enterprise environment:

  • Data from everywhere needs to be accessible and integrated in a timely fashion

Expanding beyond traditional internal BI sources is necessary as data scientists examine such areas as the behavior of a company’s customers and prospects; exchange data with partners, suppliers and governments; gather machine data; acquire attitudinal survey data; and examine econometric data. Unlike internal systems, where IT can manage data quality, many of these new data sources are incomplete and inconsistent, forcing data scientists to leverage the analytical hub to clean the data or synthesize it for analysis.

Advanced analytics has been inhibited by the difficulty in accessing data and by the length of time it takes for traditional IT approaches to physically integrate it. The analytical hub needs to enable data scientists to get the data they need in a timely fashion, either by physically integrating it or by accessing virtually integrated data. Data virtualization speeds time-to-analysis and avoids the productivity drain and error-prone trap of physically integrating data.

  • Building solutions must be fast, iterative and repeatable

Today’s competitive business environment and fluctuating economy are putting the pressure on businesses to make fast, smart decisions. Predictive modeling and advanced analytics enable those decisions to be informed. Data scientists need to get data and create tentative models fast, change variables and data to refine the models, and do it all over again as behavior, attitudes, products, competition and the economy change. The analytical hub needs to be architected to ensure that solutions can be built to be fast, iterative and repeatable.

  • The advanced analytics elite needs to “run the show”

IT has traditionally managed the data and application environments. In this custodial role, IT has controlled access and has gone through a rigorous process to ensure that data is managed and integrated as an enterprise asset. The enterprise, and IT, needs to entrust data scientists with the responsibility to understand and appropriately use data of varying quality in creating their analytical solutions. Data is often imperfect, but data scientists are the business’s trusted advisors who have the knowledge required to be the decision-makers.

  • Solutions’ models must be integrated back into business processes

When predictive models are built, they often need to be integrated into business processes to enable more informed decision-making. After the data scientists build the models, there is a hand-off to IT to perform the necessary integration and support their ongoing operation.

  • Sufficient infrastructure must be available for conducting advanced analytics

This infrastructure must be scalable and expandable as the data volumes, integration needs and analytical complexities naturally increase. Insufficient infrastructure has historically limited the depth, breadth and timeliness of advanced analytics as data scientists often used makeshift environments.”

What Do You Think?

Are you as impressed with Rick’s thoughts as I am?

If you like these, perhaps you’ll want to read the entire paper Analytics Best Practices: The Analytical Hub and Rick’s companion piece, Analytics Best Practices: The Analytical Sandbox.

2 Responses to Architecting Analytic Data Hubs

  1. Padmashree says:

    Marc, I have a question for you. We use Composite as middleware between our database and Business Objects XiR3.
    When we go to Add/Remove Resources, what happens when we delete an existing resource and add it back? Will it get detached from the existing object in the Business Objects universe?
    Can you please explain how it works?

    • Owen Taylor says:

      Hi Padmashree,

      Quick answer: Add/Remove will not break a published resource unless a necessary resource is removed and not added back, which would create a problem for query resolution.

      As an analogy: if you offer a published view called ‘PURPLE_DATA’ that relied on data from a BLUE resource and a RED resource, and you used Add/Remove to remove the RED resource, the query could never offer the expected PURPLE_DATA (you need RED mixed with BLUE to make PURPLE). If, however, you later add the RED resource back, the query engine will be able to make PURPLE_DATA again. (In this scenario, you may be best served to reconnect the client to Composite, or flush the generated query plan, after the final Add/Remove step to ensure the query plan generated uses the latest snapshot of mapped resources.)
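      In SQL terms, the analogy might look something like this (purely illustrative; the names come from the analogy, not a real deployment):

      -- PURPLE_DATA depends on both the BLUE and RED resources:
      CREATE VIEW PURPLE_DATA AS
      SELECT b.SHARED_KEY,
             b.BLUE_VALUE,
             r.RED_VALUE
      FROM BLUE b
      INNER JOIN RED r
          ON b.SHARED_KEY = r.SHARED_KEY;  -- remove RED and this join can no longer be resolved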

      DETAILED RESPONSE:
      The resources seen by Business Objects are ‘published’, meaning visible outside of the Composite Server to clients with proper permissions that connect through our JDBC, ODBC, or ADO.NET drivers. This published layer of resources is one of several layers of abstraction facilitated by the Composite Server.

      There are two roles at work here: the client role that connects to Composite and queries the data, and the developer role that builds the abstraction in Composite and makes a published resource available to the clients.

      When a client application connects to a published resource using our driver, it does not know how many collaborating data sources are involved in providing the data to that published abstraction. It does not know the original database/file/service types, locations, or the format and logic involved in actually providing the data.

      Before the developer can publish the resource and expose it to Business Objects or any other Composite client, they have to design and build the necessary model in Composite.

      Let’s say you (the developer) create a view called HIGH_VALUE_ORDER_FULFILLMENT that combines SalesForce.com hosted customer data with orders and inventory data stored in an Oracle database.

      You will need to connect to SF.COM and to Oracle and introspect the interesting entities you plan to use to fetch data. Let’s say you bring in a HIGH_VALUE_ORDERS table from Oracle that includes the columns CUST_ID and ORDER_ID, and a FULFILLMENT ‘table’ from SF.com that includes an ORDER_ID column and an ORDER_FULFILLED column that shows whether or not an order is complete.

      You then create a view in Composite that joins the HIGH_VALUE_ORDERS table and the FULFILLMENT table using an equi-join on the ORDER_ID column, and call that view HIGH_VALUE_ORDER_FULFILLMENT.
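      For illustration, the underlying view logic is equivalent to a SQL definition along these lines (a sketch only, assuming standard SQL view syntax; in Composite you can also model the view graphically):

      CREATE VIEW HIGH_VALUE_ORDER_FULFILLMENT AS
      SELECT o.CUST_ID,
             o.ORDER_ID,
             f.ORDER_FULFILLED
      FROM HIGH_VALUE_ORDERS o          -- orders data introspected from Oracle
      INNER JOIN FULFILLMENT f          -- fulfillment data introspected from SF.com
          ON o.ORDER_ID = f.ORDER_ID;   -- equi-join on the shared ORDER_ID column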

      You then publish that view so that Business Objects and other clients can connect and query the data. The final view looks like this:

      cust_id, order_id, order_fulfilled
      100, 20001, 'YES'
      100, 20002, 'NO'
      200, 30006, 'NO'
      Now, if you were to use the Add/Remove operation to remove the FULFILLMENT table (the SF.com table that exposes the order_fulfilled information), there would be no way for Composite to execute the necessary join and populate the final view. Composite would in fact flag the final view as being in error, and it would respond to any queries against that view with an error; essentially, it would be a broken resource to anyone trying to interact with it.

      If you then used Add/Remove to add the FULFILLMENT table back in, Composite would again be able to execute the query and the error condition would go away.

      Interestingly, if the client application is within the context of a transaction when the developer removes the resource using Add/Remove, the next query against the same view will still succeed: Composite will not alter the affected resources during the lifetime of a transaction. A new transaction attempting to query the view, however, will result in an error like this one:

      FAILURE: An exception occurred when executing the following query: "select * from HIGH_VALUE_ORDER_FULFILLMENT". Cause: Unable to expand composite view '/services/databases/example/test/HIGH_VALUE_ORDER_FULFILLMENT'. On line 1, column 16. Cause: Invalid table '/shared/sup/forum_qa/ADD_REMOVE/sf_ds/FULFILLMENT'. On line 7, column 9.
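      To make the sequence concrete, a hypothetical client session might run like this (illustrative SQL only; transaction syntax varies by client and driver):

      -- Client is already inside a transaction when the developer removes FULFILLMENT:
      BEGIN;                                       -- assumed transaction-start syntax
      SELECT * FROM HIGH_VALUE_ORDER_FULFILLMENT;  -- succeeds
      -- ... the developer removes FULFILLMENT via Add/Remove ...
      SELECT * FROM HIGH_VALUE_ORDER_FULFILLMENT;  -- still succeeds within this transaction
      COMMIT;

      -- A new transaction now fails until FULFILLMENT is added back:
      SELECT * FROM HIGH_VALUE_ORDER_FULFILLMENT;  -- FAILURE: Unable to expand composite view ...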

      After the Add/Remove operation is repeated so that the missing table is restored, a new transaction can be started (I simply disconnected and reconnected the client) and the next query against the view will succeed:

      admin@compositesw:localhost> select * from HIGH_VALUE_ORDER_FULFILLMENT;
      ---------+----------+-----------------+
       cust_id | order_id | ORDER_FULFILLED |
      ---------+----------+-----------------+
       100     | 20001    | YES             |
       100     | 20002    | NO              |
       200     | 30006    | NO              |
      ---------+----------+-----------------+
      3 rows in result (first row: 16 msec; total: 31 msec)

      I hope this helps!

      Owen.
