DM Review Magazine - March 2005 Issue
Maximizing Your ETL Tool Investment
Is your organization struggling to justify an ETL (extract, transform and load) tool purchase? If the tool is an enterprise-level solution, the entry price is likely well into six figures. At this price, the justification process may appear daunting. Because these tools serve a specific purpose and are critical to serious data warehouse endeavors, the justification is well worth the effort.
Once the decision to proceed with an ETL tool is made and budget dollars are released, the difficult part is complete. From this point, the data warehouse team initiates the tool selection process, investigating options and leveraging past experience to identify a short list of product vendors. Fortunately, in the ETL space, only a handful of vendors are capable of addressing enterprise needs. Once engaged, the vendors work diligently to understand requirements, mold those requirements to fit their products and shorten the tool selection time frame.
The "short list" vendors create a small working sample, often referred to as a proof of concept (POC), to demonstrate tool capabilities. Once functional fit is established, the vendor moves aggressively toward closing the sale. Many in this situation understand the dynamics associated with procuring enterprise software. Less understood are the consequences for poor decision making. Was the tool purchased without a complete evaluation? Will this ETL tool implementation be successful?
Why Use ETL Tools?
Since Oracle (substitute any database) already exists in house, writing stored procedures in PL/SQL is often the first thought for ETL development. With no additional licensing cost and existing in-house talent, the decision appears obvious. Less obvious are the lack of integration, the lack of usable meta data and the additional maintenance cost.
ETL tools serve two very specific purposes. First, they provide a development environment that is easy to manage. The scripting associated with ETL tools is embedded in the graphical interface, which allows for code visualization. The visualization achieved with ETL tools is priceless when attempting to reverse-engineer existing code. It also reduces the effort of knowledge transfer to production support.
Second, ETL tools provide increased throughput over scripting and database coding. The increased throughput is achieved by separating data management from data access. Data management is the coordination of data from source to target systems. Data management integrates meta data for the accurate cataloging and control of source code. Data access is the delivery of information from the database to the end user. Data access is most often achieved directly or semantically via SQL queries (relational databases) or proprietary access (cube technology).
Data Access vs. Data Management
What is the value of separating data access from data management? To answer that question, one must first understand the origin of data warehousing. Fifteen years ago, organizations were asking, "Why build a redundant database dedicated to reporting?" The answer is priority and performance. Transaction systems have historically held higher priority than reporting applications. Heavy query access impacts performance of transactional systems to unacceptable levels. In order to maintain high performance, a second repository is necessary.
The reason to separate data management from data access is also priority and performance. To meet current demands, access to analytical data must be near immediate. When users drill into the data, results must be returned immediately. Put simply, fast-performing ETL does not justify slow-running queries. In the practical world, both ETL performance and query performance must be fast.
To achieve adequate performance on the movement and query of data, the enterprise solution must optimize and separate each function. While this separation increases cost, the investment is essential to establishing large-scale solutions.
Successful Implementation
Successful ETL tool implementation is defined by several project team characteristics: 1) speed to market, 2) effective use of functional capabilities, 3) maintenance complexity and 4) financial ROI. Each organization weighs these measures according to internal drivers.
Speed to Market. Speed to market refers to a project team's ability to quickly deliver new applications or enhancements to existing ones. From an ETL perspective, these new projects equate to the addition of new sources and/or new attributes from existing sources. Project teams delivering rapid solutions demonstrate the ability to accurately and precisely scope and plan new projects. Successful ETL tool implementations are directly correlated with the project team's ability to deliver according to schedule.
Use of Functional Capabilities. Functional capabilities are grouped into two categories: a) performance gains and b) processing flexibility. The leading enterprise ETL tools all claim to increase processing throughput over the majority of custom-coded solutions, and most deliver on that claim. The majority of custom code resides in database-specific language. Within a specific database, certain processing may perform comparably to a tool (e.g., aggregations), but broadly speaking, ETL tools outperform custom code on most tasks. Each of the leading tools offers its own approach to optimization. The optimization techniques built into a tool represent many hours of R&D and implementation experience. The tools also provide options for performance tuning that work in concert with database-specific performance gains. Specifically, the parallel reading and writing of database data coordinates with the ETL tool's processing to execute multiple instances of a single piece of code.
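As a rough illustration of this coordination, the following Python sketch runs one piece of transformation code as several concurrent instances, one per source partition. The partition names, row counts and stub extract/transform/load functions are hypothetical stand-ins for what an enterprise tool manages alongside database-side parallel readers and writers.

from multiprocessing import Pool

# Hypothetical quarterly partitions of a source sales table.
PARTITIONS = ["sales_2004_q1", "sales_2004_q2", "sales_2004_q3", "sales_2004_q4"]

def extract(partition):
    # Stand-in for a partitioned database read: one reader per partition.
    return [{"partition": partition, "amount": i} for i in range(1000)]

def transform(row):
    # Stand-in for business logic applied identically by every instance.
    row["amount_usd"] = round(row["amount"] * 1.0, 2)
    return row

def load(rows):
    # Stand-in for a parallel writer feeding the shared target table.
    pass

def process_partition(partition):
    # One instance of the same code, run against one partition.
    rows = [transform(r) for r in extract(partition)]
    load(rows)
    return len(rows)

if __name__ == "__main__":
    # Four instances of a single piece of code execute concurrently.
    with Pool(processes=len(PARTITIONS)) as pool:
        counts = pool.map(process_partition, PARTITIONS)
    print(f"Loaded {sum(counts)} rows from {len(PARTITIONS)} partitions")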
Processing flexibility refers to the simplification of coding business logic. Most third-generation coding languages accommodate data warehouse processing (integrity checks, domain validation, data partitioning, process partitioning, etc.). Additionally, many languages support parallel processing. The dilemma is the cost associated with designing, developing and testing custom complex and parallel code. Is the custom code parameterized and flexible enough to accommodate multiple similar requirements? Successful implementations understand tool capabilities and leverage them effectively.
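To make the parameterization question concrete, here is a minimal Python sketch in which a single generic validation routine serves multiple similar feeds. The feed specifications, column names and status codes are hypothetical.

# Hypothetical feed specifications: the same generic code serves both feeds.
FEEDS = {
    "customer": {"required": ["customer_id", "name"], "valid_status": {"A", "I"}},
    "product":  {"required": ["product_id", "name"],  "valid_status": {"A", "D"}},
}

def validate(row, spec):
    # Integrity check: required columns must be populated.
    if any(row.get(col) in (None, "") for col in spec["required"]):
        return False
    # Domain validation: status must fall within the allowed set.
    return row.get("status") in spec["valid_status"]

def run_feed(feed, rows):
    spec = FEEDS[feed]
    return [r for r in rows if validate(r, spec)]

if __name__ == "__main__":
    sample = [{"customer_id": 1, "name": "Acme", "status": "A"},
              {"customer_id": 2, "name": "",     "status": "A"}]
    print(run_feed("customer", sample))  # second row fails the integrity check

The shape is the point: a new feed is added as another specification (data), not as another program.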
Maintenance Complexity. Maintenance complexity refers to the ability of existing and new staff to interpret and maintain existing code. Poorly utilized tools contain confusing and often extensive code. Successful projects have efficiently coded applications that are easily maintained.
Case in point: While working on a high-volume data warehouse effort for a national retailer, a rogue developer produced a sophisticated scheduling application to manage the ETL tool processing. The developer likely invested 250 hours of effort in a custom-coded Perl solution. The client lacked significant Perl experience, and the source code resided in a private source code repository. Without consideration for the architecture, this developer implemented a significant component. The result was a production system dependent on a sophisticated scheduling application that was difficult and expensive to maintain.
Positive ROI. As with any project, successful implementations of enterprise ETL tools must produce a positive return on investment (ROI). Many creative methods exist for deriving an accurate ROI.
Ineffective Implementations
Ineffective ETL tool implementations are characterized by lingering development efforts and unpredictable project plans. Also, if the software sits uninstalled for more than two months, a red flag is in order. An installed product should reside on the designated development server, not on a developer's laptop.
Many ETL tool implementations perform slowly. A throughput of 10 records per second is never acceptable. Often, the design and the hardware load combine to produce poor performance. Adjustments to the design, the hardware configuration and/or the system load may produce orders-of-magnitude improvements.
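As a simple demonstration of how a design change alone can produce that kind of improvement, the following Python sketch contrasts row-at-a-time commits with a single batched load. It uses an in-memory SQLite table as a stand-in for the target database, and the timings it prints are illustrative only.

import sqlite3
import time

ROWS = [(i, i * 2.0) for i in range(20000)]

def load_row_at_a_time(conn):
    # Commit per row: the design pattern behind "10 records per second".
    for row in ROWS:
        conn.execute("INSERT INTO fact VALUES (?, ?)", row)
        conn.commit()

def load_batched(conn):
    # One bulk insert and a single commit for the whole batch.
    conn.executemany("INSERT INTO fact VALUES (?, ?)", ROWS)
    conn.commit()

for loader in (load_row_at_a_time, load_batched):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fact (id INTEGER, amount REAL)")
    start = time.perf_counter()
    loader(conn)
    elapsed = time.perf_counter() - start
    print(f"{loader.__name__}: {len(ROWS) / elapsed:,.0f} rows/sec")
    conn.close()

On a real target database, where each commit forces a disk write, the gap between the two approaches is typically far larger than this in-memory comparison shows.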
Guide to Success
Complete Requirements. Each organization has unique project requirements, and they must be clearly defined. For example, automated integration with SAP may provide significant cost savings. If the solution is real time or the volumes are massive, maximum performance is critical. Each organization must document all requirements and properly weight their importance.
These requirements provide a foundation for selecting an enterprise ETL tool. Far too often, vendors successfully redefine selection criteria based on the strengths of their product. If tool vendors are engaged prior to a clear understanding of requirements, the definition process is subject to biased input that affects the perceived importance weight for each requirement. For example, a single vendor may convince the project team that performance is the most important requirement. Is it more important than speed to market or maintainability? Each company is unique.
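One way to keep the weighting honest is to fix a simple scoring matrix before any vendor is engaged. The weights, vendors and proof-of-concept scores in the following Python sketch are hypothetical; note how the vendor with the best raw performance does not win once all criteria carry their agreed weights.

# Hypothetical requirement weights, fixed before any vendor is engaged.
WEIGHTS = {"performance": 0.25, "speed_to_market": 0.30,
           "maintainability": 0.30, "sap_integration": 0.15}

# Hypothetical 1-5 scores taken from each vendor's proof of concept.
SCORES = {
    "Vendor A": {"performance": 5, "speed_to_market": 3,
                 "maintainability": 3, "sap_integration": 2},
    "Vendor B": {"performance": 4, "speed_to_market": 4,
                 "maintainability": 4, "sap_integration": 4},
}

for vendor, scores in SCORES.items():
    total = sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)
    print(f"{vendor}: {total:.2f}")
# Vendor B (4.00) beats Vendor A (3.35) despite A's stronger raw performance.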
Start Small. Starting with a manageable implementation allows for adjustments and tweaks to the tool configuration. It also provides a good assessment of existing teams' abilities to manage and implement the new technology. Additionally, the first implementation allows for solidification of the source code control process.
Architecture. Successful ETL tool implementations are achieved through structured development and careful planning. Most critical to achieving success is the architecture of the environment and the tool configuration.
Particularly with the high-end solutions, the architecture defines the flexibility of an application. The architecture supports or complicates the sharing of common modules and parameterization of code. The architecture should support source code control that manages code through the development, testing and production life cycle. A poorly planned implementation fosters poorly coded applications and can cause development efforts to double in duration and cost.
Another notable result of poorly architected enterprise ETL tool implementations is performance degradation. Often, the fix to poorly architected applications is band-aid patchwork, which provides short-term benefits at the cost of increased long-term complexity.
A sound architecture facilitates centralized development and works toward reducing pocket development. The main concern with pocket development is design/coding inconsistencies and one-off failures. For every failed implementation, several successful implementations are required to offset the negative internal branding. Positive internal marketing promotes broader use of the tool and strengthens realized ROI.
Adequate Hardware. While they read from and write to databases, the leading ETL tools operate largely outside the database environment. The decision to separate and isolate the ETL tool from the database can have a significant impact on the success of the implementation. In a shared environment, ETL processing directly impacts database performance. In a separated environment, the heavy lifting associated with complex ETL processing does not directly impact the query performance of the database.
Additionally, development must occur in an environment separate from production, and the migration of code from development to production must be planned. Far too often, development begins in the production environment. When this occurs, the production environment becomes contaminated with snippets of code, scripts and directories not pertinent to production. Excess code in production tends to increase maintenance complexity.
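A lightweight way to enforce that separation is to parameterize every job by an environment profile, so identical code migrates from development to production unchanged. The profile names, database names and directories in this Python sketch are hypothetical.

import os

# Hypothetical connection and staging settings for each environment.
PROFILES = {
    "dev":  {"db": "dw_dev",  "stage_dir": "/etl/dev/stage"},
    "prod": {"db": "dw_prod", "stage_dir": "/etl/prod/stage"},
}

def get_profile():
    # Jobs default to development; production is an explicit, deliberate choice.
    env = os.environ.get("ETL_ENV", "dev")
    return PROFILES[env]

if __name__ == "__main__":
    profile = get_profile()
    print(f"Connecting to {profile['db']}, staging in {profile['stage_dir']}")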
Intellectual Capital. Intellectual capital is the knowledge of processes, design techniques and performance tuning techniques required to make an ETL implementation operate efficiently. Develop the intellectual capital with internal staff and share this capital across multiple project teams. Ideally, a center of excellence or competency center exists and the knowledge of the selected ETL tool is developed in this group.
A standard set of design principles, including both the recommended techniques and known pitfalls, should also be developed. A good technique for communicating these lessons is via "Developer News" releases. Developer News should be a numbered publication that is distributed internally to the development team. Developer News improves communication with existing teams and quickly educates new team members.
The End Result
The addition of an ETL tool increases the efficiency and throughput capabilities of a data warehouse environment. By following a commonsense approach to selecting and implementing a new tool, an organization greatly increases its probability of success.