The database diet: Nearline and archiving in the data warehouse
01-Oct-08
The database diet: Nearline and archiving in the data warehouse

If your database is stuffed to bursting point, nearline and archiving solutions could provide the diet it needs. But how do you know which is right for you? Arthur Ritchie, chairman and CEO at SAND, chews over the basics.
Data warehouses are increasing radically in size. To maintain acceptable performance, several techniques have been introduced, such as pre-building aggregates and KPIs, and indexing as many columns as possible in order to speed up query processing. But as they continue to grow, these tasks can no longer be performed in the available ‘batch windows’ when the warehouse is not being accessed by users and trade-offs need to be made.
Doing less pre-processing work reduces the required time but also means queries that depend on aggregates, KPIs or additional indexes may take an inordinately long time to run. It may also severely degrade performance for other users as the system attempts to do the processing on the fly.
This impasse leads to two possible choices: either stop providing the analytic functionality, making the system less valuable and users more frustrated, or put the database on a diet by moving some of the data it contains to another location.
Trimming down
Both nearline and archiving solutions can help trim down an over-expanded database. The database can be made much smaller by implementing an information lifecycle management (ILM) approach, removing unused or infrequently used, detailed transactional data from the online database and storing it elsewhere.
When the database is smaller, it will perform better and be capable of supporting a wider variety of user needs. Aggregates and KPI’s will be built from a much smaller amount of detailed transaction data. Additionally, column indexing will be more practicable as there will be fewer rows per column to be indexed.
However, it is important to stress the differences between archiving warehouse data and storing it nearline. Since both are used to hold data that has been moved out of the main online system, it is unclear to some why one would need to be implemented if the other is in place. To clarify this question and make it easier to discuss why one or the other type of system (or both) might be required in a given situation, here are the major differences between nearline and archived data:
Archive data
Normally, the concept of electronic archiving focuses on the preservation of documents or data in a form that has some sort of certifiable integrity (for example, conformity to legal requirements). It is also immune to unauthorised access and tampering, and is easily subject to certain record management operations within a defined process (for example, automatic deletion after a certain period or retrieval when requested by an auditor). The archive is in fact a kind of operational system for processing documents/data that are no longer in active use.
The notion of archiving has traditionally focused on unstructured data in the form of documents but similar concepts can be applied to structured data in the warehouse. An archive for SAP Business Intelligence, for example, would preserve warehouse data that is no longer needed for analytical use but which needs to be kept around because it may be required by auditors – as would be the case if SAP BI data were used as the basis for financial statements.
The archive data does not need to be directly accessible to the user community, just locatable and retrievable in case it is required for inspection or verification – not for analysis in the usual sense. In fact, because much of the data that needs to be preserved in the archive is fairly sensitive (for example, detailed financial data), the ability to access it may need to be strictly regulated.
While many vendors of archiving solutions stress the performance benefits of reducing the amount of data in the online database, accessing the archived data is a complicated and relatively slow process since it will need to be located and then restored into the online database. For this reason, it is unrealistic to expect archived data to be usable for analysis/reporting purposes.
Nearline
In the ILM approach, the nearline repository holds data that is used less frequently than the ‘hottest, most current data, but is still potentially useful for analysis or for constructing new or revised analytic objects for the warehouse.
While the exact proportion of nearline to online data will vary, the amount of less frequently used data that needs to be kept available is normally quite large. Moving this out of the main database greatly reduces the pressure on the online database and enables continued performance of standard operations within available time windows, even in the face of the explosive data growth that many organisations are currently facing.
Thus, the archiving requirements described above do not apply to a nearline product which is designed to reduce the size of the online warehouse database, while at the same time keeping the data more or less transparently accessible to end users who may need to use it for analysis and rebuilding KPIs.
Why a nearline product is not an archive
Makes older data easily accessible to end users for enhanced analysis/reporting. Offers very good performance in delivering data to end-users (typically not more than 1 times slower than online), with little or no impact on online users. Allows greater amounts of relatively recent data to be moved out of the online system.
Offers methods for ensuring the compliance of data with regulations. Features any special built-in security regime beyond the read-only status of the data. Takes care of operational processes on data, such as enforcement of retention periods, automatic deletion etc.
Why an archiving product is not nearline
Provides controlled storage of older data that will probably not be accessed except in special circumstances. Enforces organisational policies with regard to data retention. Ensures compliance. Limits access to sensitive data.
Makes data easily accessible to users for analysis or reporting. Offers fast performance in restoring data. Stores relatively recent data that may be required for analytics/reporting.
Details
- Author:
- louise druce
- Publisher:
- KnowledgeBoard
- Date:
- 01-Oct-08
- Categories:
- Business Processes, IT and Infrastructure, Processes, Technology
- Sections:
- Home , News
This article has been read 18124 times.
Tools
Member comments (1)
Share your views with other users: add your own comments to this item.

not fit for any particular purpose
The author typically has no clear understanding about the nature and purpose for the data warehouse.
Data warehouses, only become effective tools after years of data storage. The greatest business value comes from having the most online data possible. Any form of archiving is not wanted or needed. (should not be part of any solution or proposal) Also, the cost of storage is so lowthat its not considered part ofthe big picture. (much cheaper than the cost to cook up some convoluted archiving scheme) The aurthor may be confused by refreshment, refreshing or loading, or simply writing a sales pitch.