Report to the Administrative Computing Council
Technology Review of the Data Warehouse Project
Executive Summary
The data warehouse project is a proposed decision support system for student information. The pilot project demonstrates the usefulness of the warehouse and the data modeling processes proposed for the production phase of the project. The objective for the data warehouse is the integration of multiple campus data sources into one data architecture to support decisions related to students and potential students. For the pilot phase of the project, the Data Administration Steering Committee used only a single source, data repository.
This review focuses on the technology issues of this pilot project and recommends future actions for the project whether the campus decides to extend the pilot project or approve a production project. The primary issues on which the reviewers focused were technology and processes related to data quality, dimensional analysis processes associated with the proposed data cube architecture, data policies, peer review, and cost benefit.
There is a consensus among the reviewers that the Data Administration Steering Committee should prioritize its efforts. This priority should address the following issues:
- Extraction, Translation, and Loading (ETL)
Extraction, Translation, and Loading represent major efforts for Data Administration Steering Committee. ETL are the critical processes for maintaining data quality in the data warehouse. The Data Administration Steering Committee should devote considerable effort to optimizing ETL processes with the source data processes, ensuring data validation processes, and integrating multiple source data repositories into the data warehouse.
- Data Cubes
The limitation for the data cubes is the number of dimensions that can be maintained in a cube. Currently this limit is approximately 10. The Data Administration Steering Committee should evaluate software strategies that permit the linkage of cubes for analysis. Such a strategy may reduce or avoid redundant cube design.
- Policy Factors
Significant policy issues will arise from construction of a data warehouse. The Data Administration Steering Committee invested significant resources to address access and other issues for the data warehouse. Other access policy issues will arise as use of the data warehouse increases and additional development occurs. Addressing these issues can wait. Data warehouse users, data custodians, and data administrators will need more time and study to resolve these issues.
To move the data warehouse project forward, this technology review recommends a series of next steps. These are:
- The Data Administration Steering Committee needs to prioritize its efforts for the next year. Committee efforts should focus on the issues identified for the ETL process, the data cubes, and a review of other institutional data warehouse projects.
- Integration of multiple data source repositories into the data warehouse is an essential achievement to accomplish.
- The Data Administration Steering Committee should evaluate metadata processes of commercial off-the-shelf ETL software for its ability to use and maintain data quality. Acceptance criteria for the software tool should require that the metadata processes are sufficient to achieve high levels of data validation during the extraction, translation, and loading.
- The ETL software should support change-based replication for the extraction of data from the source data repositories.
- The Data Administration Steering Committee should give significant consideration to change-based replication of source data repositories.
- Dimensional analysis tools for the data cubes should provide for linkage of multiple cubes in a data analysis.
- Continued application of the data naming standards, metadata standards, and the data access policies are essential for maintaining data quality.
- The Data Administration Steering Committee should commence another review of other institutional data warehouse projects. Such review should build on previous reviews done at the start of the project and help identify critical success factors and avoidable pitfalls.
- The Data Administration Steering Committee should assess the cost-benefit of the project using a quantitative approach so measurable benefit to cost ratios can be determined.
Data Warehouse Review
The data warehouse project object is to provide a data system that combines multiple campus data repositories into a single structure for analysis of student information. This system will provide student information users with significant tools for analysis of data collected by the Davis campus. For the pilot phase of the project, the project used a single data repository, the Student Information System (BANNER).
This analysis examines the data warehouse project and supporting documentation. It follows a detailed assessment that occurred between reviewers from the Office of the Vice-Provost, Information and Educational Technology and the Data Administration Steering Committee. This assessment presents the results of this analysis.
Primary issues for the Data Administration Steering Committee focus should be on maintaining data quality, data analysis processes, data warehouse performance, and policy issues. There is a consensus that the Steering Committee should prioritize its efforts. This priority should address the following issues:
Data Quality
Maintaining data quality remains one of the major obstacles to successful data warehouse projects. Steps required to ensure a successful data warehouse project require significant investments of time and effort devoted to metadata, naming standards, and data integrity. Addressing data quality is the single most difficult issue facing data warehouse developers. When the quality of data is compromised, incorrect interpretation and use of information from the data warehouse will destroy users' confidence level. More importantly, a flawed decision support system results in poor quality decisions.
The Data Administration Steering Committee spent considerable time and effort understanding and applying standards to the project. The Steering Committee adopted the UC Davis metadata standard and the naming convention standards for the project. These documents were reviewed during this analysis of the project. The reference documents are:
Data Naming Standards
The Data Warehouse project uses the naming standards adopted for database projects by the Planning and Budget Office, Data Administration. These standards are significant and provide strong naming convention standards for the data warehouse project. The standards provide conventions and naming terminology for all Tier 1, computer systems. Adoption and implementation of these standards will ensure naming quality of the data within the warehouse.
Metadata Standards
The proposed data warehouse project conforms to the UC Davis Metadata Standards. These standards provide for high levels of metadata documentation for data entities (tables), data attributes (columns or field), and instances (individual occurrences of data). This level of metadata provides a high degree of information about the data structures in the warehouse and access to these data. Application of the UC Davis Metadata Standards during extraction and loading of the data improves the chances for higher quality data in the data warehouse.
Extraction, Translation, and Loading (ETL) software
The data warehouse pilot project currently relies on custom scripts to extract, translate, and load data from the source data repository into the data warehouse. The Data Administration Steering Committee currently has an RFP for ETL software. This software tool will provide a critical process for extracting, translating, and loading data repository information into an integrated data warehouse architecture. Repositories include Student Information System (BANNER), Financial Information System (DaFIS), the Employee Information System (PPS), and other data systems. Maintenance of data quality in the data warehouse requires persistent processes, rules, and procedures to account for and resolve inconsistencies encountered in the data obtained from the source repository data structures. This includes measurement of data validity of the data instance and establishment of the metadata for the entities and attributes of the data warehouse relies extensively on metadata. Integration of information from the source data repository systems is a daunting task and the ETL software must be capable of applying rules and technology to ensure data quality by performing validation on data for the data warehouse.
Currently the criteria for the ETL software include ones for metadata. However, these criteria are not requirements and software that failed adequately to maintain metadata standards could be acquired. The functional requirements for the ETL software tool should assure a high level of data validation. The on-site evaluation for the functional criteria should include an assessment of the ability of the software tool to construct and maintain metadata and to perform data validation.
Commercial off-the-shelf ETL software is not a panacea for maintaining data quality in the warehouse. The Data Administration Steering Committee and programmers could apply considerable effort to tune the ETL software for each data source repository. Data validation will remain a major effort for the Committee. In many cases, custom scripts may be required to work with the ETL software tool to adequately translate and load the data warehouse.
Data Integration
A goal of the data warehouse pilot is establishment of a common repository of integrated student information to support planning and decision support. The data warehouse pilot proposal calls for integration of student-related data from disparate information systems. These systems include Student Information System (BANNER), Financial Information System (DaFIS), Employee Information System (PPS), and other source data systems. The Data Administration Steering Committee has not accomplished this goal during the pilot phase. The pilot uses only data from Student Information System (BANNER) to populate the warehouse.
An accurately assessment of the pilot project requires evaluation of the integration of disparate data source repositories. Many data integration issues remain that can only be resolved by integrating different source data repositories.. Previous discussion addressed naming standard variance and levels of metadata documentation. Temporal inconsistencies of the data "snapshots" taken when data is extracted from the source data repositories exist within the source data repositories. Data in the source data repositories may not reflect the same level of update. This status could affect the quality of the information in the warehouse. Data updated in Student Information System (BANNER), for example, and not in the Employee Information System (PPS) would introduce an inconsistency in the data warehouse. The ETL software should measure and identify this inconsistency where complete metadata is available. Additionally, use of a change-based replication schema should result in less temporal inconsistency. Discussion of change-based replication follows.
Burden of ETL on Data Warehouse and Production Systems Processes
The proposed design for ETL for the data warehouse is to take periodic snapshots of data source repositories. Snapshots can place a tremendous burden on the system resources of the source data systems and the data warehouse. During the time for these snapshots, the source data systems and data warehouse could be unavailable.
Another method of update is change-based replication. Change-based replication captures each incremental change in the data source repository and stores it for distribution to the data warehouse. Changed data can apply, in near real-time, updates to the data warehouse. Changed data can also be stored for application on a periodic basis. Sophisticated data replication tools have transactional integrity, meaning that the target tables in the data warehouse are updated in the same order as the tables on the operational system were during the original transaction. This can be an important consideration, especially where users require maintenance of data dependencies from the source data repository.
Beyond the benefits of more current data in the data warehouse, use of change-based replication provides another valuable benefit: regained computing time. Large batch processes that update information for the data warehouse can impair both the functioning of the source data systems and the OLAP capability of the data warehouse. The time required for large batch jobs could be enormous during which time the data warehouse will be unavailable to users. Batch jobs currently take 5 to 9 hours to update the warehouse from the Student Information System, during which time the data warehouse is inaccessible. Growth of organization-related data and the increased utilization of the Decision Support System (DSS) may increase the time necessary for batch jobs into the normal usage hours for the data warehouse.
The Data Administration Steering Committee should strongly consider moving from a time slice (snapshot) approach to a change-data-based ETL process. Current batch processing times for the pilot are excessive and likely to grow in a production operation.
Data Access
The proposed data warehouse will include significant personnel and financial information currently considered confidential or sensitive. The Data Administration Steering Committee is thoroughly aware of the access constraints to data in the data warehouse. The Campus Data Administrator, currently Hebert Diaz-Flores, chairs an access policy committee charged with developing a data access policy. A draft policy, dated April 4, 2000, was referenced for this review of the data warehouse project. The policy location is: http://www.dataadmin.ucdavis.edu/documents/data_access_policy_26/data_access_policy_26.html.
The data access policy provides for access to institutional data such as Student Information System (BANNER), Financial Information System (DaFIS), and the Employee Information System (PPS). Access control is the responsibility of the data stewards for these data systems. Currently, there are three data stewards for the Student Information System, no clear steward for the Employee Information System and the Office of Administration as steward for the Financial Information System. There are other stewards for the Graduate Students Information and the Alumni Information System. Application of the metadata standard to the database provides for confidentiality and disclosure rules at the entity and attribute level. Access control lists using these metadata items can easily prevent unauthorized users from seeing the data. The Data Administration Steering Committee developed a procedure for obtaining access privileges for data warehouse users. This system requires the data steward to give written authorization for user privilege to access specific data in the data warehouse. The architecture of the system allows for data access control throughout the table structure of the data warehouse.
Other access issues will arise in the future as users create new data cubes. Users can create data cubes that organize information that might be subject to misinterpretation. Misinterpretation of information may occur due to poorly constructed cubes or analyses. Users may also build data cubes that essentially reveal information about a very small group of students or even individuals.
The Data Administration Steering Committee will need to develop policies over time to address these new issues. Such policies need to reflect the sense of the community. Devoting resources to develop these policies now would be inefficient.
Data Cube Design and Scalability
The current design limits cube dimensions to approximately10. Consequently, the Data Administration Steering Committee built multiple cubes that duplicated dimensions. Scalability of the cube design was limited in the pilot by the hardware platform. However, usefulness of the cubes diminishes as the cube dimensions grow. The current dimension of approximately 10 will likely remain for the near future.
The issue is that a single data cube will not satisfy every user's requirements. Data cube designs reflect the needs of a specific user group. Consequently, many cubes are developed and cube dimensions become duplicated in other cubes. In the pilot project for example, 2 cubes had all the same dimensions except one.
The data warehouse project needs to provide tools that allow linkage of cubes for an analysis. Hyperion Software provides a cube linkage tool for the Essbase software according to the Data Administration Steering Committee. The Committee did not evaluate the linkage during the pilot project. Linkage of cubes reduces the effort for data analysis. Multi-dimensional cube generation is a complex process. Tools that reduce this effort offer increase usability of the data warehouse. The Data Administration Steering Committee should explore the cube linkage tools offered by Hyperion Software to assess whether the tool will increase ease of usability.
Cube design and development needs to be a central function, at least initially. Cube design is complex and trained personnel exist in Data Administration to perform the task. Centralizing cube design will reduce redundancy that is likely to occur if this function is initially distributed. Centralized cube design will leverage the knowledge gained in the pilot and prevent similar cube construction that results in inconsistent analyses.
Cost Benefit
Benefit analysis for data warehouse projects should demonstrate high ratios of benefits to costs. Data warehouse projects in other organizations demonstrate a 6:1 up to 30:1 benefit to cost rations. The analysis required to assess the benefit level is complex and lengthy. The analysis assesses functions and functional processes to determine where data and method use is similar or redundant. The Data Administration Steering Committee should undertake such an analysis. The analysis will help the Committee develop focus for the data warehouse project so that it benefits the largest community of users.
Methodology for such an assessment requires an examination of data usage, business functions, and workflow processes. The methodology developed by Dr. Roger Thomlinson examines data and workflow processes to identify where these processes and data overlap. This matrix approach allows the investigator to identify means to leverage workflow processes and data to achieve significant benefits.
The pilot project has not had sufficient time to address the costs and benefits of the data warehouse. The February 16, 2000 communication to Provost Grey from Celeste M. Hunziker expressed the need for an additional year to assess these cost benefits. The initial plan for development of the data warehouse qualitatively examined benefits to some programs.
Review of Peer Projects
The Data Administration Steering Committee assessed data warehouse projects at other universities when the project commenced in 1995. This was a time when data warehouse projects were new and participants had attained limited experience. Data warehouse efforts are more mature today and the Data Administration Steering Committee needs to reassess the status of these projects at other institutions. An updated assessment of other projects is appropriate at this stage of the data warehouse project. The opportunities to learn what worked and what did not work should not be lost.
Recommendations
This review presents the following recommendations for the Data Administration Steering Committee to consider for future phases of this project. The recommendations focus on factors for improving the likelihood for success of the data warehouse. The recommendations are:
- The Committee needs to prioritize its efforts for the next year. Steering Committee efforts should focus on the issues identified for the ETL processes and data cubes. The project should also reassess other data warehouse projects. Such reassessment can help the Committee increase its knowledge of the success factors from other more mature projects.
- The Data Administration Steering Committee should evaluate metadata processes of commercial off-the-shelf ETL software for their ability to use and maintain data quality. Acceptance criteria for the ETL software tool should require that the metadata processes are sufficient to achieve high levels of data validation.
- The Data Administration Steering Committee should strongly consider change-based replication of source data repositories.
- The ETL software should support change-based replication for the extraction of data from the source data repositories.
- Dimensional analysis tools for the data cubes should provide for linkage of multiple cubes in a data analysis.
- Continued application of data naming standards, metadata standards, and data access policies are essential for maintaining data quality.
- The Data Administration Steering Committee should demonstrate integration of multiple data source repositories as an essential achievement to accomplish.
- Assessment of cost-benefit of the project should use a quantitative approach so those measurable benefit ratios can be determined.
- The Steering Committee should perform a reassessment and review of other institutional data warehouses. Such reassessment should focus on critical success factors and avoidable pitfalls.
Glossary
Change-based replication
The process of capturing changes made to a production data source. Change-based data replication is typically performed by reading the source DBMS log. It consolidates units of work, ensures data synchronization with the original source, and reduces data volume and load time in a data warehouse.
Data cube
A data cube is a term of art associated with multidimensional online analytical processes (MOLAP). MOLAP tools use a pre-calculated data set that contains all the possible answers to a given range of questions. MOLAP tools feature very fast response, and the ability to write data quickly into the data set. Primary downsides of MOLAP tools are limited scalability (the cubes get very big, very fast when you start to add dimensions and more detailed data), inability to contain detailed data (you are forced to use summary data unless your data set is very small), and load time of the cubes.
Data mart
A database, or collection of databases, designed to help managers make strategic decisions about their business. Whereas a data warehouse combines databases across an entire enterprise, data marts are usually smaller and focus on a particular subject or department. Some data marts, called dependent data marts, are subsets of larger data warehouses.
Data warehouse
A collection of data designed to support management decision making. Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time. Development of a data warehouse includes development of systems to extract data from operating systems plus installation of a warehouse database system that provides managers flexible access to the data. The term data warehousing generally refers to combine many different databases across an entire enterprise. Contrast with data mart.
Decision support system (DSS)
Software and processes that support exception reporting, standard repository, data analysis, and rule-based analysis. A database created for end-user, ad hoc query processing.
Extraction
Processes that copy information from the source data system to the data warehouse. Extraction is controlled by the extraction specification and extraction frequency.
Extraction frequency
The latency of data extracts, such as daily versus weekly, monthly, quarterly, etc. The frequency that data extracts are needed for the data warehouse is determined by the shortest frequency requested through an order, or by the frequency required to maintain consistency of the other associated data types from the source data.
Extraction specification
The standard expectations of a particular source data warehouse for data extracts from the operational database system-of-record. System-of-record uses an extract specification to retrieve a snapshot of shared data, and formats the data in the way specified for updating the data in the source data warehouse. An extract specification also contains extract frequency rules for use by the data access environment.
Loading
Processes associated with putting source data into the data warehouse tables. Processes may include staging areas where denormalized tables exist to translate data for the target data structures of the data warehouse.
Metadata
Metadata is information, documented in IT tools, that improves both business and technical understanding, of data and data-related processes. Metadata describes how and when and by whom a particular set of data was collected, and how the data is formatted. Metadata is essential for understanding information stored in data warehouses
OLAP
Short for Online Analytical Processing, a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data. For example, it provides time series and trend analysis views.
Translation
Application of business rules to change source data before loading into the data warehouse target data structures.
|