Academy TechNotes

ATN Volume 6, Number 2, 2015
Multi Cluster Architecture and Staged Upgrade Process for Zero Downtime Upgrades of Large Data Warehouses
Sanjeev Kumar
Gandhi Sivakumar
A central data warehouse has been built in a multi-clustered configuration for
one of the largest telecom providers in India. The warehouse provides 220 TB of
storage, supports a data loading rate of 2 TB/day, and generates more than 500
reports with a latency of 9 hours from the last data load of the day. This is one
of the largest installations of IBM's Big Data analytics platform, using ISAS (IBM
Smart Analytics System) and PDOA (IBM PureData for Operational Analytics System)
appliances and InfoSphere Streams/DataStage.
This technical note describes the system architecture and the first-of-a-kind
upgrade process of the data warehouse, which provides zero downtime for capacity
addition and full system upgrades. In the last system upgrade, 245 TB of data was
migrated with no disruption to the service. The previous system upgrade, in
comparison, took 4 weeks of complete downtime.
clusters at physical level [Reference 1].
While this solution resulted in
performance improvement due to optimal use of ISAS appliances in an ISAS-BCU
mixed cluster, and cost saving by increasing the shelf life of existing BCU
appliances, it did not address the issue of application downtime during hardware
Figure 3 shows the process modifications made while upgrading ISAS to the next
generation PDOA appliances. First, a backup of approximately 8000 tables
(132 TB) was taken from the data loading sub-cluster (7 ISAS nodes) and
restored onto an offline PDOA of 7 nodes. This took approximately 12 hours.
Thereafter, 11 new PDOA nodes were added to this sub-cluster and the tables
were re-striped across 18 nodes, which took nearly 24 hours. Subsequently, the
data loading sub-cluster was taken offline, and the new PDOA sub-cluster was
brought online and connected to the existing business layer sub-cluster.
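The figures above imply some simple rates, sketched below in Python. This is only back-of-the-envelope arithmetic on the numbers quoted in the text. It assumes that the 12 hours cover both backup and restore, and that data is spread evenly across nodes, neither of which the note states explicitly. All names are illustrative.

```python
# Figures quoted in the text for the data loading sub-cluster upgrade.
BACKUP_TB = 132        # approximate size of the ~8000 tables
RESTORE_HOURS = 12     # assumed to cover both backup and restore
RESTRIPE_HOURS = 24    # re-striping across the expanded sub-cluster
NODES_BEFORE = 7       # offline PDOA nodes that received the restore
NODES_AFTER = 18       # 7 original + 11 added PDOA nodes


def tb_per_hour(terabytes: float, hours: float) -> float:
    """Average throughput for moving `terabytes` in `hours`."""
    return terabytes / hours


restore_rate = tb_per_hour(BACKUP_TB, RESTORE_HOURS)

# Per-node data volume, assuming an even spread (an assumption, not a
# figure from the note).
tb_per_node_before = BACKUP_TB / NODES_BEFORE
tb_per_node_after = BACKUP_TB / NODES_AFTER

# Time the replacement spends being prepared offline, during which the
# existing ISAS sub-cluster keeps serving.
offline_prep_hours = RESTORE_HOURS + RESTRIPE_HOURS

print(f"restore throughput: {restore_rate:.1f} TB/h")
print(f"per node: {tb_per_node_before:.1f} TB -> {tb_per_node_after:.1f} TB")
print(f"offline preparation: {offline_prep_hours} h")
```

Under these assumptions, the whole preparation fits in roughly a day and a half of offline work on the new appliances.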
Figures 4 and 5 show the steps required to upgrade the business layer sub-cluster.
If the number of available ISAS nodes is equal to or greater than the number of
nodes in the existing sub-cluster, data restoration followed by redistribution on an
expanded ISAS sub-cluster can be done, as in the case of the data loading sub-cluster.
However, in our case, the original configuration consisted of 18 BCU nodes storing
110 TB of data, whereas the new sub-cluster consisted of only 12 ISAS nodes,
including the 7 nodes released by the upgrade of the data loading sub-cluster. Hence,
quick cluster-to-cluster data restoration was not possible. Instead, the data on the
18 BCU nodes was copied to the 12-node ISAS configuration by taking sub-cluster
level table dumps and loading them into the corresponding sub-clusters in the new
configuration. It took approximately 14 days for the online and offline business
layer sub-clusters to be brought in sync. Note that this would have been completed
in 1-3 days had the number of nodes in the ISAS sub-cluster been the same as or
greater than that of the BCU sub-cluster. Finally, cutover to the new business
cluster occurred: it was brought online and the previous cluster was shut down,
giving a full system upgrade with no application downtime.
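The node-count rule described above can be written as a small decision function. The following Python sketch is illustrative only: the class, the function and the two path labels are hypothetical names and do not correspond to any tool used in the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubCluster:
    name: str   # hypothetical label, for readability only
    nodes: int  # number of appliance nodes in the sub-cluster


def choose_migration_path(source: SubCluster, target: SubCluster) -> str:
    """Pick a migration path from the node counts, as described in the text.

    A target with at least as many nodes as the source allows backup and
    restore followed by redistribution. A smaller target falls back to
    table-level dumps and loads, which the text reports as much slower.
    """
    if source.nodes <= 0 or target.nodes <= 0:
        raise ValueError("node counts must be positive")
    if target.nodes >= source.nodes:
        return "restore_then_redistribute"
    return "table_dump_and_load"


# Business layer: 18 BCU nodes replaced by 12 ISAS nodes.
business_path = choose_migration_path(SubCluster("BCU", 18), SubCluster("ISAS", 12))

# Data loading layer: 7 ISAS nodes restored onto 7 offline PDOA nodes.
loading_path = choose_migration_path(SubCluster("ISAS", 7), SubCluster("PDOA", 7))

print(business_path)
print(loading_path)
```

For the two upgrades reported here, the function selects the slower dump-and-load path for the business layer and the restore path for the data loading layer, matching the 14-day and 12-hour figures respectively.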
In summary, a large data warehouse can be configured as two sub-clusters
connected over a LAN (or any other high speed network). The data loading
sub-cluster should be upgraded first, followed by the business layer sub-cluster.
This provides a no-downtime upgrade of the system. For each sub-cluster, the
optimal migration path is from a configuration of N nodes to N+m nodes in the
upgraded configuration.
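The staged order summarised above can be outlined as a short planning routine. This Python sketch only models the sequence of steps and checks that each layer always has a serving sub-cluster. The layer names, the sub-cluster labels and the function are hypothetical and are not taken from the actual implementation.

```python
# Order of upgrade stated in the text: data loading first, then business layer.
UPGRADE_ORDER = ("data_loading", "business_layer")


def plan_staged_upgrade(online, replacements):
    """Return (steps, final_state) for a sub-cluster-by-sub-cluster upgrade.

    `online` maps each layer to the sub-cluster currently serving it and
    `replacements` maps each layer to the sub-cluster that will replace it.
    """
    state = dict(online)  # work on a copy; the caller's mapping is untouched
    steps = []
    for layer in UPGRADE_ORDER:
        old, new = state[layer], replacements[layer]
        steps.append(f"build {new} offline while {old} keeps serving {layer}")
        steps.append(f"bring {new} in sync with {old}")
        state[layer] = new  # cutover: the replacement takes over the layer
        steps.append(f"cut over {layer} from {old} to {new}, then retire {old}")
        # Invariant: every layer has a serving sub-cluster after each cutover.
        assert all(state[name] for name in UPGRADE_ORDER)
    return steps, state


steps, final_state = plan_staged_upgrade(
    online={"data_loading": "ISAS-7", "business_layer": "BCU-18"},
    replacements={"data_loading": "PDOA-18", "business_layer": "ISAS-12"},
)

for number, step in enumerate(steps, start=1):
    print(number, step)
```

Each layer is only switched after its replacement has been built and synchronised offline, which is the property that keeps the application available throughout.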
The original design of the warehouse comprised BCUs (Balanced Configuration
Units) in a single-cluster configuration (Figure 1), to which capacity was added at
scheduled intervals (12 months). Each capacity addition took 2-3 weeks of
downtime, as the system had to be taken offline to re-stripe the tables across the
expanded cluster. Data loading had to be put on hold for this period, which
required a catch-up time once the new setup was brought online.
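To give a feel for the catch-up cost, the following Python sketch estimates the loading backlog left by such a downtime window. It borrows the 2 TB/day loading rate quoted for the current warehouse, which may not match the original BCU setup. The peak loading rate of 3 TB/day is a purely hypothetical value chosen for illustration.

```python
def backlog_tb(downtime_days: float, load_tb_per_day: float) -> float:
    """Data that accumulates while loading is on hold."""
    return downtime_days * load_tb_per_day


def catch_up_days(backlog: float, load_tb_per_day: float,
                  peak_tb_per_day: float) -> float:
    """Days needed to clear the backlog while new data keeps arriving."""
    spare = peak_tb_per_day - load_tb_per_day
    if spare <= 0:
        raise ValueError("peak rate must exceed the daily load to catch up")
    return backlog / spare


LOAD_RATE = 2.0   # TB/day, quoted for the current warehouse
PEAK_RATE = 3.0   # TB/day, hypothetical sustained maximum

for weeks in (2, 3):
    pending = backlog_tb(weeks * 7, LOAD_RATE)
    days = catch_up_days(pending, LOAD_RATE, PEAK_RATE)
    print(f"{weeks} weeks offline: {pending:.0f} TB backlog, {days:.0f} days to catch up")
```

Even with spare loading capacity, the catch-up period under these assumptions is longer than the downtime itself.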
Additional complexity arose when the next generation of BCU, ISAS, was
released to the market. The usual method of capacity addition would have resulted
in an overall loss of efficiency, as a mixed BCU + ISAS cluster would have been
restricted to the performance level of the lowest common denominator (BCU).
Replacing all BCUs with ISAS would have meant discarding expensive
appliances (BCUs) that were still usable. This limitation was addressed by a mixed
deployment solution (Figure 2), whereby the appliances were deployed as two
sub-clusters connected by a high speed private LAN. The application design was
modified into self-contained processes that could be deployed on the two
connected sub-clusters. Staging, ODS (Operational Data Store) and FACT
tables (central tables of the warehouse) were kept in the ISAS sub-cluster, while
aggregate tables and cubes were deployed on the BCU sub-cluster. This
resulted in a single logical view of the application, deployed on two physical
The following team members played a critical role in the implementation:
Pratim Mukherjee, Sandhya, Bharat Sharma, Mukesh, Rajesh Pilwari, Pratik
[Reference 1] Appliance Interconnection Architecture and Method: United States
Patent Application IN9-2013-0028. Inventors: Jay Praturi and Sanjeev Kumar
About the Authors: Sanjeev Kumar is a certified
consultant in BI and Analytics from Pune, India. Gandhi
Sivakumar is an IBM Master Inventor.
© Copyright IBM Corporation 2012