How to best design Datasets

Posted on August 8th 2018 How to best design Datasets

I had a post recently on the Informer Community web page asking us how to best design Datasets. There are a couple of alternatives worth discussing.

Datasets bring the concept of data modeling to the forefront. Designing your Dataset requires much broader thinking than designing a single report. When designing Datasets, think of data in categorical terms. For example, if I am a manufacturer, I might have an Inventory Dataset and a Sales Dataset. Each of those Datasets would include all the data points I might need when reporting on or analyzing the data associated with that specific part of the business.

You also want to think about how fields that are derived from Data Flows fit into your design. With Data Flows, data doesn’t have to only come from your source database. You can also create new data with values derived from existing data residing in your Dataset.

There are a few questions to consider when designing Datasets:

  • Should I incorporate everything into one big Dataset or should I break down the Datasets into smaller ones
  • Which approach would be the best for this particular situation? There are pros and cons to each.

The question raised on the Informer Community site was from a college that keeps a record of their alumni and all the alumni gifts/donations the institution receives. They wanted to know if they should:

  1. build just an alumni Dataset that combines alumni, alumni gifts/donations, and demographic information pulled from their database
  2. or, take a two-step approach whereby they first build a broader Dataset that includes alumni, alumni gifts/donations, demographic information pulled from their database, students, staff, faculty, etc., and then build a separate alumni Dataset which would pull the necessary data from the broader Dataset.

There are times when it might be useful to create separate Datasets in a two-step approach. Having a separate Dataset containing broader information creates a single source of truth. It also provides for consistent reporting and analysis since any data manipulation or scrubbing would need to be done only once. And, having a broader Dataset in which to pull data from enables analysis and reporting to be conducted on just the demographic data contained within. There would be no need for Dataset designers to remember to add the logic to cleanse the data in each Dataset if you design your Dataset structure to pull data from this broader Datasource.

The drawback to employing a separate Dataset approach is you have multiple Datasets to manage. How do you schedule those to be refreshed? How often should they be refreshed? If the Dataset is rather large, it can be somewhat challenging if you’re trying to refresh the Dataset frequently. Also, the size of the Dataset could impede how often you want to refresh it. For example, if it’s taking 30 – 45 minutes to refresh, you wouldn’t want to do a refresh every 30- 45 minutes. You’d need to give it some time to finish before doing it again. And then, if you have one Dataset refreshing at a different rate than the Datasets on which it pulls data from, it’s possible they could get out of sync. For example, if the alumni Dataset gets refreshed more frequently than the broader Dataset, you could potentially have an instance where alumni data exists but their corresponding record in the broader data does not exist (or vice versa).

By using Informer Jobs, you can schedule all 3 Datasets to refresh in a single Job, so you don’t have to worry about the timing for when each Dataset is refreshed.

A good situation for using separate Datasets would be where you have multiple one-to-many relationships in your data. For example, let’s say you have a CRM (Customer Relationship Management) database that includes all of your accounts and each account has more than one contact person. That creates a one-to-many relationship between the account and the contacts. Accounts can also be tied to multiple sales opportunities. If we combine data from accounts, contacts, and opportunities into a single Dataset, we create a Cartesian effect. For example, if we have an account with 3 contacts and 2 opportunities, the resulting Dataset would have 6 records (3 contacts x 2 opportunities) listed for the one account. This can make manipulating and displaying the data rather tricky because you end up with duplicates.

To prevent the Cartesian effect, you could create separate Datasets for accounts, contacts, and opportunities. You would then use a Flow Step to pull the records from the contacts and opportunities Datasets into the accounts Dataset. This makes for a much cleaner Dataset with less to manipulate.

When it comes to the frequency of Dataset updates (or refreshes), there are a couple of things to consider:

  • How often is the data in the database actually changing? Frequently changing data may need to be refreshed more often than data that isn’t changing very often, if at all (historical data for example).
  • How big is the resulting Dataset? This will impact the time it takes to run the underlying query. If it takes 30 minutes for the query to run, the Dataset should not be refreshed every 20 minutes.
  • Does the Dataset query require a lot of resources from the database server? A query that is taxing on resources (such as processor or memory) might need to be run after business hours to prevent performance impacts on other applications that use the database.

Ultimately the goal of a Dataset is to:

  • provide the required freshness of data to users who need to do comprehensive reporting and analysis without the requirement to constantly query the database for the latest data
  • provide a single source of truth for all reporting and analysis to users.

For more please contact sales@entrinsik.com or call 888-703-0016.

 

This article was written by Tim Nicholson
VP, Informer Customer Services

Back to posts