One component of the American Recovery and Reinvestment Act (a.k.a. the stimulus package) still making its way through the pipeline is $250 million for the development of statewide longitudinal data systems, a program managed by the Department of Education's Institute of Education Sciences.

Another new early childhood initiative, the Early Learning Challenge Fund, also seems set to provide a substantial investment in what is described as a "coordinated zero to five data infrastructure to collect essential information on where young children spend their time and the effectiveness of programs that serve them."

As a researcher and evaluator currently working in the early childhood realm, I have a vested interest in the outcomes of these pending government grants.  So much of our capacity to evaluate program effectiveness hinges on the availability of quality data.  Currently it is hard even to estimate the number of children being served in the early childhood system, because so many overlapping programs (private, non-profit, public, local, state, and federal) lack compatible data systems, or any system at all.

As I sift through information and sit in on inter-agency work groups, I have begun to wonder where this is all headed.  The general pitch is a system that allows researchers and evaluators to follow any kid through the multitude of programs they will participate in, from birth until their eventual exit from the education system, which could be as late as graduate school.  But beyond this general vision, what these systems actually look like has been left open for the states to decide.

The most recent attempts at a definition begin by outlining key components that any comprehensive longitudinal data system should have.  These include things like student identifiers, enrollment data, and test scores.  The Data Quality Campaign has a list of ten essential elements on its website.  The America COMPETES Act, which offers the base definition for the stimulus grant, takes a similar approach.

While these definitions do a good job of providing guidance on the characteristics a system should have, they do not paint a picture of what the system would actually look like.  The biggest weakness, though, is that a state can hit all of the essential elements and still be far from having a comprehensive system capable of even coming close to the vision.

If a state truly wants to create a comprehensive longitudinal data system, it should start by envisioning exactly what that would look like.  Here are a few options that I've heard being thrown around, along with some of their positives and negatives.

Single Data Warehouse

Description:  This is probably the most commonly pictured data system approach.  Basically, there is one huge data warehouse that all organizations contribute to.  Building a large data warehouse can be very expensive and requires a hefty commitment from all involved parties.  The system also requires dedicated oversight and administration.




Positives

-When data is stored in the same warehouse, organizations tend to work under the same rules for formatting, structure, and identifiers, which increases their technical ability to connect with one another.

-Individuals can be given access to the whole system or just part of it, depending on specific privacy restrictions.  When the systems are separate, it is often much harder to get access to more than one of them.

Negatives

-It is tough, and probably impossible, to get full participation from all organizations.  Even with a large budget, there is a limit to how many individual agencies can be involved in a single data warehouse.  Any agency not involved will be disconnected.

-Many agencies have made huge investments in proprietary systems.  Why spend a lot of money on a new system when an agency already collects the data it needs with its current one?

-The larger the system, the higher the administrative costs.  There is usually some limit to how large it can grow.

Separate but Connected Systems

Description: With this approach, the focus is on creating connectivity through the development of some type of sharing mechanism, most often a unique common identifier.  This requires a lot of coordination to be successful, but individual agencies still work primarily with their independent systems.
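
To make the shared-identifier idea concrete, here is a minimal Python sketch.  The agency names, field names, and IDs are all hypothetical; it only illustrates how two independent systems could be linked once both carry the same child identifier, not how any particular state does it.

```python
# Hypothetical records held in two independent agency systems.
# Both systems store the same common identifier ("child_id").
preschool_records = [
    {"child_id": "C001", "program": "State Pre-K", "year": 2008},
    {"child_id": "C002", "program": "Head Start", "year": 2008},
]
k12_records = [
    {"child_id": "C001", "grade": "K", "reading_score": 41},
    {"child_id": "C003", "grade": "K", "reading_score": 37},
]


def link_by_common_id(left, right, key="child_id"):
    """Return one merged record per identifier found in both systems."""
    right_by_id = {row[key]: row for row in right}
    linked = []
    for row in left:
        match = right_by_id.get(row[key])
        if match is not None:
            # Combine the fields each agency holds for the same child.
            linked.append({**row, **match})
    return linked


linked = link_by_common_id(preschool_records, k12_records)
for row in linked:
    print(row)
```

Only the child present in both systems (C001 in this made-up data) comes out linked.  Neither agency needs to hand over a name or birth date, because the identifier does the matching.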


Positives
-Agencies are not losing control of their own data or the investment in proprietary systems, a worry of some.

-It is expandable: adding a participant does not increase system-wide administrative costs, and joining can be a single agency's decision.  Agencies that might have been locked out of the single data warehouse approach can participate.

Negatives
-Privacy issues and differing individual agency access rules can hold back the ability of researchers to access longitudinal data.  To make this type of system work, some form of overall governance is required to develop, or even just document, system-wide access rules.

-A common identifier allows personally identifying information to be removed from shared data.  Agencies can attempt to match data without one, but doing so relies on that personal information and so inflates the privacy concerns.

-When systems are separate, it is tough to know what data exists where and how to get it.
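
The privacy point about matching without a common identifier can be sketched as follows.  This is a simplified illustration with invented records and an assumed matching rule (normalized name plus date of birth); real record linkage is considerably more involved.

```python
import hashlib


def match_key(first_name, last_name, birth_date):
    """Build a match key from personal details (hypothetical matching rule)."""
    normalized = "|".join(
        part.strip().lower() for part in (first_name, last_name, birth_date)
    )
    # Hashing only avoids passing raw names around; it is not a privacy guarantee.
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Invented records: neither agency has a shared child identifier.
agency_a = [
    {"first": "Maria", "last": "Lopez", "dob": "2004-03-15", "program": "Head Start"},
]
agency_b = [
    {"first": " maria", "last": "LOPEZ ", "dob": "2004-03-15", "grade": "K"},
]

index_b = {match_key(r["first"], r["last"], r["dob"]): r for r in agency_b}

matches = []
for r in agency_a:
    other = index_b.get(match_key(r["first"], r["last"], r["dob"]))
    if other is not None:
        matches.append((r["program"], other["grade"]))

print(matches)
```

Note that both agencies have to keep and process names and birth dates for this to work, which is exactly the information a common identifier would let them strip out.  A misspelled name would also fail to match under this simple rule.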

Variant 1: "Confederated" Data Warehouse

Description:  This is a variation on the single data warehouse approach.  I heard the term "confederated" from Connecticut, where they are developing this type of system for early childhood.  Basically, individual agencies keep their own systems, but a larger system lifts out specific data elements.  It is a good fit for states that feel they already have a lot invested in existing data collection and storage capacity but want a single warehouse that connects these agencies.
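
As a rough illustration of "lifting out specific data elements," here is a short Python sketch.  The agencies, fields, and the list of shared elements are all assumptions made up for the example; it is not a description of Connecticut's actual design.

```python
# Hypothetical agency systems; each keeps its own full records.
AGENCY_SYSTEMS = {
    "child_care": [
        {"child_id": "C001", "provider": "Little Stars",
         "subsidy": True, "notes": "internal"},
    ],
    "public_preschool": [
        {"child_id": "C001", "site": "Elm Street School",
         "attendance_days": 160, "teacher": "internal"},
    ],
}

# Elements the central layer is allowed to lift from each agency (assumed).
SHARED_ELEMENTS = {
    "child_care": ["child_id", "provider"],
    "public_preschool": ["child_id", "attendance_days"],
}


def lift_shared_elements(systems, shared):
    """Copy only the agreed-upon fields from each agency into a central view."""
    central = []
    for agency, rows in systems.items():
        for row in rows:
            lifted = {field: row[field] for field in shared[agency]}
            lifted["source_agency"] = agency
            central.append(lifted)
    return central


central_view = lift_shared_elements(AGENCY_SYSTEMS, SHARED_ELEMENTS)
for row in central_view:
    print(row)
```

Everything outside the agreed list (the "notes" and "teacher" fields here) stays in the agency's own system, which is the appeal for agencies worried about losing control of their data.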

Variant 2: Mix and Match Approach

Description: This probably describes the "final" product for most systems.  It is doubtful that a single data warehouse will account for all education and health agencies.  Instead, certain large agencies (department of education, department of health and human services, etc.) will probably continue to grow and expand their systems to encompass related individual agencies.  Developing a common identifier that is used across all of these agencies (or at least a mechanism that matches their individual identifiers to one another) could connect the overall system.