Integrating Multiple Data Streams to Achieve Your Research Goals

Posted in Blog on July 9, 2019

When presented with a research question, it’s natural to ask whether data already exists that could provide an answer. There may be data available within your organization that could be useful, or external data from myriad government agencies or private sector suppliers may hold promise. However, simply repurposing existing data that was not collected with the current research question in mind may leave blind spots that prevent deeper understanding. Similarly, a dedicated research study may not be able to answer all aspects of the question due to the scarcity of time and resources. Combining existing data with the primary data collected from a dedicated research study can provide a more holistic answer to the research question and yield much more powerful and nuanced insights.

Although we are living in a data-rich world, the best data can feel elusive. In the healthcare space, government data from clinicaltrials.gov, cms.gov, cdc.gov, and even healthcare.gov are easily recognizable, but data from the Census, the Bureau of Labor Statistics, or the Bureau of Economic Analysis can provide context from the larger population and economy that surround and inform the research question. Databases of insurance claims can paint a picture of individuals’ health journeys and provide extremely granular cost data. Academic data projects, such as the Panel Study of Income Dynamics from the University of Michigan, can provide an otherwise rare intergenerational view of individuals and their families across a wide range of socioeconomic and health-related topics. With the right planning, these external data can integrate with internal data on sales, marketing effectiveness, conference attendance, social media interactions, and even customer service requests to better inform primary market research.

Successful integration can be a tricky endeavor. Ideal data integration requires that data from all sources be complete (no missing data), use the same unit of observation (e.g., each row represents a single person), and share a common unique identifier (e.g., the same ID variable identifies the same person in all data sets). Descriptive analyses can handle outliers and missing data by selectively omitting them, but dealing with them in a statistical model requires more complex techniques that come with no guarantees. Mixing data on individuals with data that is more aggregated (say, characteristics at the zip code level) can inadvertently dilute the variation in the data and cause a statistical model to lose predictive power. Lacking a common unique identifier can prevent integration altogether. In general, these problems can be avoided with enough forethought.
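To make these requirements concrete, here is a minimal Python sketch of the checks one might run before combining two person-level data sets. Everything in it is an assumption made for illustration: the field names (`person_id`, `zip`, `satisfaction`, `annual_cost`, `median_income`), the helper functions, and the sample values are hypothetical and do not come from any real data source.

```python
from collections import Counter


def check_integration_readiness(rows, id_field):
    """Report duplicated IDs and count missing values per field."""
    id_counts = Counter(row.get(id_field) for row in rows)
    duplicate_ids = sorted(
        i for i, n in id_counts.items() if i is not None and n > 1
    )
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None or value == "":
                missing[field] += 1
    return {"duplicate_ids": duplicate_ids, "missing_by_field": dict(missing)}


def merge_on_id(left, right, id_field):
    """Join two person-level record lists on a shared unique identifier.

    Returns the merged rows plus the IDs in `left` that had no match.
    """
    right_by_id = {row[id_field]: row for row in right}
    merged, unmatched = [], []
    for row in left:
        match = right_by_id.get(row[id_field])
        if match is None:
            unmatched.append(row[id_field])
        else:
            merged.append({**row, **match})
    return merged, unmatched


# Hypothetical primary research data: one row per person.
survey = [
    {"person_id": 1, "zip": "19422", "satisfaction": 4},
    {"person_id": 2, "zip": "19422", "satisfaction": None},  # missing answer
    {"person_id": 3, "zip": "14472", "satisfaction": 5},
]

# Hypothetical existing data keyed by the same identifier.
claims = [
    {"person_id": 1, "annual_cost": 1200.0},
    {"person_id": 3, "annual_cost": 310.5},
]

# Hypothetical aggregated data: one value per zip code, not per person.
zip_level = {
    "19422": {"median_income": 98000},
    "14472": {"median_income": 61000},
}

report = check_integration_readiness(survey, "person_id")
merged, unmatched = merge_on_id(survey, claims, "person_id")

# Everyone in the same zip code receives the identical aggregated value,
# which is how mixing levels of aggregation can dilute variation.
with_context = [{**row, **zip_level.get(row["zip"], {})} for row in merged]

print(report)
print(unmatched)
```

In this sketch, the person with no matching claims record is reported rather than silently dropped, so the decision about how to handle the gap stays with the analyst.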

The real trick to getting insight from multiple data sources is integrating them to work in harmony from the beginning, which requires identifying existing data that may be useful, vetting and validating that data, and designing a research plan for the new primary research that explicitly builds upon the existing data. Partnering with KJT Group to investigate your research question means having the guidance and expertise to apply these best practices to ensure you achieve your research goals.