Academy TechNotes ATN Volume 6, Number 2, 2015

Multi-Cluster Architecture and Staged Upgrade Process for Zero-Downtime Upgrades of Large Data Warehouses
Sanjeev Kumar, Gandhi Sivakumar

A central data warehouse has been built in a multi-clustered configuration for one of the largest telecom providers in India. The warehouse provides 220 TB of storage, sustains a data loading rate of 2 TB/day, and generates more than 500 reports with a latency of 9 hours from the last data load of the day. This is one of the largest installations of IBM's Big Data analytics platform, using ISAS (IBM Smart Analytics System) and PDOA (IBM PureData System for Operational Analytics) appliances together with InfoSphere Streams/DataStage. This technical note describes the system architecture and a first-of-a-kind upgrade process for the data warehouse, which provides zero downtime for capacity additions and full system upgrades. In the last system upgrade, 245 TB of data was migrated with no disruption to service and zero downtime. The previous system upgrade, in comparison, took 4 weeks of complete downtime.

The original design of the warehouse comprised BCUs (Balanced Configuration Units) in a single-cluster configuration (Figure 1), to which capacity was added at scheduled intervals (12 months). Each capacity addition took 2-3 weeks of downtime, as the system had to be taken offline to re-stripe the tables across the expanded cluster. Data loading had to be put on hold for this period, which in turn required a catch-up period once the new setup was brought online.

Additional complexity arose when the next generation of BCU, ISAS, was released in the market. The usual method of capacity addition would have resulted in an overall loss of efficiency, as a mixed BCU + ISAS cluster would have been restricted to the performance level of the lowest common denominator (BCU). Replacing all BCUs with ISAS, on the other hand, would have meant discarding expensive appliances (BCUs) that were still usable.

This limitation was addressed by a mixed deployment solution (Figure 2), whereby the appliances were deployed as two sub-clusters connected by a high-speed private LAN. The application design was modified into self-contained processes that could be deployed on the two connected sub-clusters. Staging, ODS (Operational Data Store) and FACT tables (the central tables of the warehouse) were kept in the ISAS sub-cluster, while aggregate tables and cubes were deployed on the BCU sub-cluster. This resulted in a single logical view of the application, deployed on two physical clusters at the physical level [Reference 1]. While this solution improved performance through optimal use of the ISAS appliances in a mixed ISAS-BCU cluster, and saved cost by extending the shelf life of the existing BCU appliances, it did not address the issue of application downtime during hardware upgrades.

Figure 3 shows the process modifications made while upgrading ISAS to the next-generation PDOA appliances. First, a backup of approximately 8,000 tables (132 TB) was taken from the data loading sub-cluster (7 ISAS nodes) and restored onto an offline PDOA cluster of 7 nodes; this took approximately 12 hours. Thereafter, 11 new PDOA nodes were added to this sub-cluster and the tables were re-striped across the 18 nodes, which took nearly 24 hours. Subsequently, the data loading sub-cluster was taken offline, and the new PDOA sub-cluster was brought online and connected to the existing business layer sub-cluster.

Figures 4 & 5 show the steps required to upgrade the business layer sub-cluster. If the number of available ISAS nodes is equal to or greater than the number of nodes in the existing sub-cluster, data restoration followed by redistribution on an expanded ISAS sub-cluster can be done, as in the case of the data loading cluster. In our case, however, the original configuration consisted of 18 BCU nodes storing 110 TB of data, whereas the new sub-cluster consisted of only 12 ISAS nodes, including the 7 nodes released by the upgrade of the data loading sub-cluster. This did not allow a quick cluster-to-cluster data restoration. Instead, data from the 18 BCU nodes was copied to the 12-node ISAS configuration by taking sub-cluster-level table dumps and loading them into the corresponding sub-clusters in the new configuration. It took approximately 14 days for the online and offline business layer sub-clusters to be brought into sync. It should be noted that this would have completed in 1-3 days had the number of nodes in the ISAS sub-cluster been the same as or greater than in the BCU sub-cluster. Finally, cutover to the new business cluster occurred: it was brought online and the previous cluster was shut down, completing a full system upgrade with no application downtime.

In summary, a large data warehouse can be configured as two sub-clusters connected over a LAN (or any other high-speed network). The data loading sub-cluster should be upgraded first, followed by the business layer. This provides a zero-downtime upgrade of the system. For each sub-cluster, the optimal migration path is from a configuration of N nodes to N+m nodes in the upgraded configuration.

Acknowledgements: The following team members played a critical role in the implementation: Pratim Mukherjee, Sandhya, Bharat Sharma, Mukesh, Rajesh Pilwari, Pratik Paingankar.

[Reference 1] Appliance Interconnection Architecture and Method: United States Patent Application IN9-2013-0028. Inventors: Jay Praturi & Sanjeev Kumar.

About the Author: Sanjeev Kumar is a certified consultant in BI and Analytics from Pune, India. Gandhi Sivakumar is an IBM Master Inventor.

Please consider following @IBMAoT on Twitter and using the hashtag #IBMAoT when mentioning IBMAoT in social media. © Copyright IBM Corporation 2012. For more information please visit the IBMAoT website.
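Appendix: the staged backup-restore-expand-redistribute sequence described for the data loading sub-cluster maps onto standard DB2 partitioned-database (DPF) administration commands. The sketch below is illustrative only: the database name (WHDB), backup path (/backup) and partition group name (WHPG) are hypothetical placeholders, and the db2 command is stubbed out as a dry-run echo so the script can be read and run anywhere without a DB2 installation.

```shell
# Dry-run stub: print each DB2 command instead of executing it.
# Remove this function to run against a real DB2 DPF instance.
db2() { echo "db2 $*"; }

# Step 1: back up ~8,000 tables (132 TB) from the 7-node sub-cluster
# as a single system view backup across all partitions.
db2 "BACKUP DATABASE WHDB ON ALL DBPARTITIONNUMS TO /backup COMPRESS"

# Step 2: restore onto the offline 7-node PDOA cluster (~12 hours).
db2 "RESTORE DATABASE WHDB FROM /backup"

# Step 3: after the 11 new PDOA nodes (partitions 7-17) have been added
# to the instance, extend the database partition group onto them.
db2 "ALTER DATABASE PARTITION GROUP WHPG ADD DBPARTITIONNUMS (7 TO 17)"

# Step 4: re-stripe the tables uniformly across all 18 nodes (~24 hours).
db2 "REDISTRIBUTE DATABASE PARTITION GROUP WHPG UNIFORM"
```

Only after step 4 completes would the cutover happen: the old data loading sub-cluster goes offline and the new PDOA sub-cluster is connected to the business layer sub-cluster, which is what makes the upgrade invisible to running applications.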