The Laboratory for Information Analysis and Knowledge Mobilization

 

Hybrid Cloud Computing for Data Analytics

      

Researchers: Marin Litoiu and Nick Cercone (York University); Igor Jurisica (University of Toronto)
Partners: IBM
Affiliated Lab: Adaptive Systems Research Lab (ceraslabs.com)

Cloud computing is a computation model in which hardware and software are offered on demand as services over the web. Enabled by new Internet technologies and standards, and driven by new business models, the cloud targets three main areas of computing: the infrastructure (Infrastructure as a Service, IaaS), the programming and runtime environments (Platform as a Service, PaaS), and the end-user software (Software as a Service, SaaS). While the IaaS/PaaS/SaaS categorization is useful in identifying and partitioning many distinct cloud research challenges, an orthogonal set of issues arises from the requirements of the different classes of applications that we aim to run in the cloud. For example, a cloud that runs real-time and safety-critical applications arguably must offer different services, and different qualities of service, than a cloud for e-commerce applications.

This project focuses on addressing some of the challenges in creating a cloud for data analytics. Analytics, in general terms, refers to extracting information from raw data, often in the context of business decision-making. In response to the growth of this raw data, cloud analytics marshals significant computing power and robust storage to conduct analytics on on-demand, scalable infrastructure. Gartner describes six key elements that may exist in a cloud analytics solution: data sources, data models, processing applications, computing power, analytic models, and sharing or storing of results [1]. This project focuses on the platform and infrastructure for analytics: an infrastructure that is analytics-aware and driven by the requirements of large-scale data analytics, optionally including a platform to support analytics applications.

There is growing academic interest in large-scale analytics using cloud computing. Position papers have argued for substantial work in this area (e.g., [2, 3]). Individual efforts have made piecemeal advances, for example migrating analytic applications for time-series data to the cloud [4] and building data warehouses for the cloud [5]. There is interest in using the MapReduce paradigm for analytics (e.g., [6, 7]); IBM Research reported efforts, primarily focused on extracting and analyzing large-scale unstructured data, to do search-driven analytics on the Hadoop platform [8]. To our knowledge there is little or no research at the infrastructure or platform level; the focus has been on applications that run on stock infrastructure and platforms.

Commercial vendors have focused on providing analytics as a service, that is, analytics software as a service (for example, SAP [9] and Opera Solutions [10]). Google's BigQuery service queries terabytes of data in seconds, runs on Google's cloud infrastructure, scales as needed, and is accessed through RESTful APIs or a web interface. HP, through Vertica, offers a data warehouse that can be run on Amazon EC2. Microsoft SQL Azure Reporting is a business intelligence system that runs only on Windows Azure. These commercial solutions require structured data.

Research Goals

1. Hybrid Clouds for Analytics: To design, implement, and evaluate an architecture for analytics on combinations of private and public clouds that is elastic, scalable, and tailored for the analytics application class.

Hybrid clouds combine private and public sub-clouds, balancing privacy and security concerns against the need for large computation and storage capacity. Academic research into hybrid clouds has focused on the middleware and abstraction layers for creating, managing, and using them (e.g., [11-13]). For example, Zhang et al. used the MapReduce paradigm to split a data-intensive workload into mapping tasks sorted by the sensitivity of the data, with the most sensitive data processed locally and the least sensitive processed in a public cloud [14]. Abraham et al. provision private clouds from multiple collaborating entities, as well as public clouds from Amazon, automatically and semi-transparently [15].
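
To make the splitting idea concrete, the following Python sketch partitions map tasks by a sensitivity label, in the spirit of Zhang et al. [14]; the task structure, labels, and threshold policy are illustrative assumptions, not the actual Sedic design.

    # Sensitivity-aware task placement sketch: tasks at or above the
    # threshold stay on the private cloud; the rest may burst to a
    # public cloud. Labels and policy are illustrative.
    from dataclasses import dataclass

    @dataclass
    class MapTask:
        task_id: str
        sensitivity: int  # 0 = public data; higher = more sensitive

    def place_tasks(tasks, threshold=1):
        private, public = [], []
        for task in tasks:
            (private if task.sensitivity >= threshold else public).append(task)
        return private, public

    tasks = [MapTask("t1", 0), MapTask("t2", 2), MapTask("t3", 1)]
    on_private, on_public = place_tasks(tasks)
    print([t.task_id for t in on_private])  # ['t2', 't3']
    print([t.task_id for t in on_public])   # ['t1']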

Commercial support for hybrid clouds is growing in response to the business case for cloud federation.  For example, HP offers software to manage a private cloud, but also provides infrastructure as a service and the ability to have the two communicate (interoperability with other vendors is not discussed) [16]. Fujitsu and Microsoft partnered to produce a hybrid of the Windows Azure public cloud with Windows Server running in a private cloud or on a private server [17].  IBM offers both SmartCloud Provisioning (to manage a private cloud) [18] and SmartCloud Enterprise (a public cloud offering) and software to bridge the two at the Software-as-a-Service (SaaS) level [19].  These examples do not enable heterogeneous clouds; that is, the private and public cloud must run specific software to enable linking the two.

In the open source realm, support for cloud federation and hybrid clouds is emerging. Apache Deltacloud is an abstraction layer that allows developers to work with various public IaaS providers (e.g., Amazon EC2, Rackspace) and internal private clouds (supporting various commercial and open-source solutions) using a unified RESTful API. This open-source project recently transitioned from incubation to a top-level Apache project [20]. Apache Libcloud [21] and jClouds [22] are Python and Java libraries (respectively) with similar abstraction goals. All three offer stable support for working with compute instances and cloud storage, and beta or experimental support for other abstractions.
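
As a concrete illustration of the unified-API approach, the sketch below uses Apache Libcloud to list compute nodes on two different providers through the same driver interface. The credentials are placeholders, and exact constructor arguments vary by provider and library version.

    # List nodes on two providers through Libcloud's common driver API.
    from libcloud.compute.types import Provider
    from libcloud.compute.providers import get_driver

    EC2Driver = get_driver(Provider.EC2)
    RackspaceDriver = get_driver(Provider.RACKSPACE)

    # Placeholder credentials; argument conventions differ per provider.
    ec2 = EC2Driver('ACCESS_KEY_ID', 'SECRET_KEY')
    rackspace = RackspaceDriver('USERNAME', 'API_KEY')

    # The same calls work against either cloud, which is the point of
    # the abstraction layer.
    for driver in (ec2, rackspace):
        for node in driver.list_nodes():
            print(node.name, node.state, node.public_ips)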

We propose an architecture that optimizes analytics on a combination of private and public clouds. Performing an analytics task across heterogeneous, hybrid clouds can combine the scalable, on-demand computation resources of the public cloud with the privacy and control provided by a private cloud.  Supporting these tasks at the infrastructure and platform level allows out-of-the-box analytics operations to leverage these capabilities.  In the typical case, the large data stores will be local to the enterprise; the research challenge is enabling the use of public cloud infrastructure while supporting data locality, privacy, and bandwidth limitations.  An analytics cloud may include highly-optimized data warehouse hardware in addition to the commodity machines commonly used. The architecture of the hybrid cloud for analytics must be scalable and elastic, and must systematize support for non-functional properties such as privacy, data security, and quality of service.  Contributions could include data partitioning to minimize data transfer among clouds, computation task modeling to identify candidate modules for migration to a public cloud, and migrating tasks among heterogeneous cloud infrastructures at run-time.
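
As one illustration of the data-partitioning contribution, the sketch below greedily places each computation module on the cloud that already holds most of its input, so only the smaller remainder must be transferred. It assumes per-cloud input sizes are known and ignores capacity and privacy constraints; it is a starting point, not the proposed algorithm.

    # Greedy placement: run each module where most of its input lives,
    # paying transfer cost only for bytes stored elsewhere.
    def place_module(input_gb_by_cloud):
        """input_gb_by_cloud maps cloud name -> GB of input stored there."""
        home = max(input_gb_by_cloud, key=input_gb_by_cloud.get)
        transfer_gb = sum(v for c, v in input_gb_by_cloud.items() if c != home)
        return home, transfer_gb

    modules = {
        "aggregate": {"private": 400, "public": 20},
        "train":     {"private": 5,   "public": 150},
    }
    for name, inputs in modules.items():
        cloud, moved = place_module(inputs)
        print(f"{name}: run on {cloud}, transferring {moved} GB")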

2. Scaling for Analytics in a Hybrid Cloud: To investigate, design, and evaluate novel approaches to adaptively scaling a business analytics application in response to a changing workload, leveraging the key properties of the required architecture: large data, hybrid public/private clouds, and analytics applications.

Cloud architectures are well suited to adaptive management: the virtualization and abstraction layers typically employed allow fine-tuned, responsive resource management not possible with traditional infrastructure. We focus here on novel contributions to scaling the application and the infrastructure in response to a dynamic workload so as to meet service level objectives, which may be explicit or implicit. The usual approach of adding or removing resources (typically computation power, namely the number of cores) is not sufficient here; the key characteristics of our architecture both enable and require extending the state of the art, as follows (a sketch of such an adaptive manager appears after the list):

  • Large Data: in this problem domain, adding computation power may not resolve performance issues. Research exploiting this property will consider the cost-benefit trade-offs of a variety of actions beyond adding computation resources: allocating more bandwidth, moving a running instance closer to a remote data source, moving an instance to a cloud with better I/O performance, and other remedial actions not typically considered in existing approaches.
  • Hybrid: when a workload analyzing data across possibly several private and public clouds requires more resources, the unresolved question is where to add them. This may be determined by constraints (capacity limits, privacy and security policies, data locality) or by optimization (available capacity, financial cost). Workload may also have to be migrated from a private cloud to a public cloud with a different API and a different set of governing policies.
  • Analytics: in addition to resource management, adaptive management can change the functionality of the managed system. This under-explored research area offers high-potential opportunities in this domain. By adjusting the data model, the analytic model, or even the analytics application itself, we can meet the demands of a changing workload without adding resources. Adjustments could include introducing sampling, reducing data granularity (e.g., from seconds to days), or switching from an exact algorithm to a more efficient heuristic. This approach is especially valuable for private clouds, where there is typically a fixed cap on the resources available. Novel contributions include analyzing the relationship between data quality and cost, and identifying and evaluating candidate adaptations.
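
The following sketch illustrates the kind of decision step such an adaptive manager might take: each candidate remedial action carries an estimated cost and an expected benefit, and the manager picks the cheapest action predicted to restore the response-time service level objective. The actions, costs, and speedup factors are invented for illustration.

    # Pick the cheapest remedial action whose expected speedup restores
    # the response-time SLO. Costs and speedups would come from models
    # or profiling in practice; here they are made-up numbers.
    ACTIONS = [
        # (name, cost in $/hour, expected speedup factor)
        ("add_cores",          2.0, 1.2),  # little help if I/O-bound
        ("add_bandwidth",      1.5, 1.8),
        ("move_near_data",     3.0, 2.5),
        ("reduce_granularity", 0.0, 2.0),  # functional adaptation
    ]

    def choose_action(current_rt, slo_rt):
        candidates = [(cost, name) for name, cost, speedup in ACTIONS
                      if current_rt / speedup <= slo_rt]
        return min(candidates)[1] if candidates else "migrate_workload"

    print(choose_action(current_rt=9.0, slo_rt=5.0))  # reduce_granularity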

3. Brokerage of Resources: To design, implement, and evaluate an automated solution (a broker) capable of matching the requirements of an analytics application with the available resources in a heterogeneous, hybrid cloud.

We propose the Resource Broker as an additional layer in the cloud architecture, serving as an intermediary between an application or task and a pool of resources. Typically, an application developer must statically decide on a deployment (cloud provider, number and type of instances, etc.). The Broker accumulates knowledge about the services offered by providers (public and private) and presents a unified interface to the developer, who instead specifies requirements. The heterogeneous, hybrid cloud is abstracted through a set of APIs, and the application may never be aware of the exact underlying infrastructure. The broker can be used to acquire infrastructure, platform, or software as a service. Contributions include the design and architecture of this broker, algorithms for matching requirements with resources, and an understanding of what level and quality of information the application should provide about its requirements to achieve optimal matching.
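
A minimal sketch of the broker's matching step, under assumed requirement and offering schemas: hard constraints filter the candidate offerings, and the survivors are ranked by price. The real broker would weigh many more dimensions, such as reliability, performance history, and data locality.

    # Filter offerings by hard constraints, then rank by hourly price.
    # The offering and requirement fields are assumed schemas.
    OFFERINGS = [
        {"provider": "private",  "cores": 16, "private_ok": True,  "price": 0.00},
        {"provider": "public-a", "cores": 64, "private_ok": False, "price": 0.40},
        {"provider": "public-b", "cores": 32, "private_ok": False, "price": 0.25},
    ]

    def match(req):
        feasible = [o for o in OFFERINGS
                    if o["cores"] >= req["min_cores"]
                    and (o["private_ok"] or not req["needs_privacy"])]
        return sorted(feasible, key=lambda o: o["price"])

    best = match({"min_cores": 32, "needs_privacy": False})[0]
    print(best["provider"])  # public-b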

Methodology

The project will involve analytical modeling, simulation, and experimental work. The first phase of the project will involve building an experimental infrastructure: a distributed cloud with a private sub-cloud at York University and a public sub-cloud (Amazon).

For the IaaS layer, we plan to use Xen as our main virtualization platform. Xen supports efficient virtualization of Linux/Intel x86 platforms, and it is open source, which allows us to modify it as needed to support our research tasks. In addition to Xen, we may also explore other virtualization platforms.

For the PaaS layer, we plan to use a variety of products available from our industrial partners, including middleware for analytics. We will also experiment with open-source tools such as Deltacloud and Libcloud.

The SaaS layer will consist of several benchmarks and applications that capture typical analytics workloads. To take into account more sporadic workloads with different demand profiles, other applications will be considered following consultations with our industrial partner.
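
To capture the sporadic demand profiles mentioned above, one simple option is a workload generator that layers occasional bursts on a noisy baseline, as in the sketch below; the rates and burst shape are illustrative assumptions.

    # Hourly request rates: a noisy baseline with occasional bursts,
    # mimicking sporadic analytics demand. Parameters are illustrative.
    import random

    def workload(hours, base_rate=100, burst_rate=1000, burst_prob=0.05):
        for _ in range(hours):
            rate = burst_rate if random.random() < burst_prob else base_rate
            # Gaussian noise approximates Poisson variation at these rates.
            yield max(0, int(random.gauss(rate, rate ** 0.5)))

    random.seed(42)
    print(list(workload(8)))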

Milestones

Year 1

  • Review and evaluate existing approaches for federating clouds, focusing on academic and open source approaches.  Perform gap analysis between their capabilities and our projected requirements.  Select a platform to build on.

  • Identify and model canonical analytics tasks and applications to produce simulated workloads for testing and development.

  • Collect data on the costs of in-house computing infrastructure and on-demand computing infrastructure.  Automate the updating of this data where possible.

  • Examine analytics applications to identify feasible functionality adaptations.

  • Produce a set of metrics that can be used to measure and evaluate a cloud service provider (including external providers and in-house IT teams), particularly dimensions of cost and service quality.

Year 2

  • Finalize an architecture for a hybrid, heterogeneous cloud meeting the requirements established.  Evaluate this architecture analytically and in simulation.
  • Identify remedial actions available to data-intensive processes.  Using the workload generator, simulate the behavior of the application in response to remedial actions.  Generalize the workloads into classes of workloads based on their responses.
  • Produce and validate an economic model for measuring the cost of the public cloud and the private cloud.
  • Develop and test an in-house API for acquiring real-time updates on non-functional properties of cloud services: cost, performance, reliability, privacy.
  • Design a metric for measuring data quality in the context of business analytics, capable of measuring both data input and data output.

Year 3

  • Develop and deploy a testbed hybrid cloud, including at least two public cloud providers and a set of private resources.  Demonstrate the ability to manage the resources and deploy and execute analytics tasks using a simple broker.
  • Design and test feedback loops that employ advanced remedial actions including functionality changes, data migration, workload migration.  Identify key attributes for these actions: cost, settle time, reliability.  Evaluate these loops analytically.
  • Produce a set of requirements for the test applications that the broker may use to assign resources.  Using a simple rule-based approach, assign tasks to resources.

Year 4

  • Test the scalability of the architecture using real-world workloads and large-scale analytics tasks.
  • Implement an adaptive manager that takes remedial actions beyond adding CPU cores in response to changing workloads. Deploy it to the testbed. Measure the cost-benefit trade-offs of the remedial actions in the face of real workloads.
  • Finalize and implement a decision algorithm for matching capabilities to requirements.
  • Begin integrating the architecture, the adaptive manager, and the broker.

Year 5

  • A proof-of-concept, commercialization-ready implementation of the proposed business analytics architecture running on a hybrid, heterogeneous cloud.  Extensions to existing open-source libraries or platforms will be contributed back to the open source communities.
  • This implementation will include adaptive scaling that leverages remedial actions beyond simple CPU power, moves workloads dynamically among the federated clouds based on a set of policies and optimizations, and adjusts data quality dynamically to provide usable results within SLOs with minimized cost.
  • The implementation will execute analytics tasks on a hybrid cloud topology determined automatically by a resource broker based on defined requirements, which will include the necessary and sufficient requirements to produce cost-optimized topologies.

References

[1] B. Gassman and R. Knox, "‘Cloud analytics’ means many different kinds of opportunity," Gartner Research, Stamford, CT, Tech. Rep. 1386527, 2010.
[2] A. Cuzzocrea, I. Song and K. C. Davis, "Analytics over large-scale multidimensional data: The big data revolution!" in Proceedings of the ACM 14th International Workshop on Data Warehousing and OLAP, Glasgow, Scotland, UK, 2011, pp. 101-104.
[3] S. Frischbier and I. Petrov, "Aspects of data-intensive cloud computing," in From Active Data Management to Event-Based Systems and More, LNCS vol. 6462, Springer, 2010, pp. 57-77.
[4] H. Vashishtha, M. Smit and E. Stroulia, "Moving text analysis tools to the cloud," in IEEE Congress on Services, 2010, pp. 107-114.
[5] K. Doka, D. Tsoumakos and N. Koziris, "Efficient updates for a shared nothing analytics platform," in Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud, Raleigh, North Carolina, 2010, pp. 7:1-7:6.
[6] I. Konstantinou, E. Angelou, D. Tsoumakos and N. Koziris, "Distributed indexing of web scale datasets for the cloud," in Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud, Raleigh, North Carolina, 2010, pp. 1:1-1:6.
[7] M. Shmueli-Scheuer, H. Roitman, D. Carmel, Y. Mass and D. Konopnicki, "Extracting user profiles from large scale data," in Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud, Raleigh, North Carolina, 2010, pp. 4:1-4:6.
[8] K. S. Beyer, V. Ercegovac, R. Krishnamurthy, S. Raghavan, J. Rao, F. Reiss, E. J. Shekita, D. E. Simmen, S. Tata, S. Vaithyanathan and H. Zhu, "Towards a scalable enterprise content analytics platform," IEEE Data Eng. Bull., vol. 32, pp. 28-35, 2009.
[9] SAP BusinessObjects. SAP BusinessObjects BI OnDemand. Accessed Nov. 16, 2011. Available: http://www.ondemand.com/businessintelligence.
[10] Opera Solutions LLC. Opera Solutions. Accessed Nov. 16, 2011. Available: http://operasolutions.com/.
[11] H. Li and J. Jeng, "CCMarketplace: A marketplace model for a hybrid cloud," in Proceedings of the 2010 Conference of the Center for Advanced Studies on Collaborative Research, Toronto, Ontario, Canada, 2010, pp. 174-183.
[12] M. Hajjat, X. Sun, Y. E. Sung, D. Maltz, S. Rao, K. Sripanidkulchai and M. Tawarmalani, "Cloudward bound: Planning for beneficial migration of enterprise applications to the cloud," SIGCOMM Comput. Commun. Rev., vol. 41, pp. 243-254, Aug. 2010.
[13] C. Baun and M. Kunze, "The KOALA cloud management service: A modern approach for cloud infrastructure management," in Proceedings of the First International Workshop on Cloud Computing Platforms, Salzburg, Austria, 2011, pp. 1:1-1:6.
[14] K. Zhang, X. Zhou, Y. Chen, X. Wang and Y. Ruan, "Sedic: Privacy-aware data intensive computing on hybrid clouds," in Proceedings of the 18th ACM Conference on Computer and Communications Security, Chicago, Illinois, USA, 2011, pp. 515-526.
[15] L. Abraham, M. A. Murphy, M. Fenn and S. Goasguen, "Self-provisioned hybrid clouds," in Proceedings of the 7th International Conference on Autonomic Computing, Washington, DC, USA, 2010, pp. 161-168.
[16] Hewlett-Packard Company. Enterprise Cloud Services - Compute. Accessed Nov. 15, 2011. Available: http://www.hp.com/enterprise/cloud.
[17] Fujitsu. Fact sheet: Fujitsu hybrid cloud services for Windows Azure. Accessed Nov. 16, 2011. Available: http://solutions.us.fujitsu.com/pdf/services/Services-Cloud-Hybrid-Windows-Azure-factsheet.pdf.
[18] IBM. IBM SmartCloud Provisioning. Accessed Nov. 15, 2011. Available: http://www-01.ibm.com/software/tivoli/products/smartcloud-provisioning/.
[19] IBM. IBM Infrastructure as a Service. Accessed Nov. 15, 2011. Available: http://www-935.ibm.com/services/us/en/cloud-enterprise/index.html.
[20] Apache Software Foundation. Apache Deltacloud. Accessed Nov. 14, 2011. Available: http://incubator.apache.org/deltacloud/.
[21] Apache Software Foundation. Apache Libcloud: A unified interface to the cloud. Accessed Nov. 14, 2011. Available: http://libcloud.apache.org/.
[22] jClouds Inc. jClouds. Accessed Nov. 14, 2011. Available: http://www.jclouds.org/.


 
