The Laboratory for Information Analysis and Knowledge Mobilization
Hybrid Cloud Computing for Data Analytics
Marin Litoiu, Nick Cercone, York University
Cloud computing is a new computation model in which hardware and software are offered on demand as services over the web. Enabled by new Internet technologies and standards and driven by new business models, the cloud targets three main areas of computing: the infrastructure (Infrastructure as a Service, IaaS), the programming and runtime environments (Platform as a Service, PaaS), and the end-user software (Software as a Service, SaaS). While the IaaS, PaaS, and SaaS categorization is useful in identifying and partitioning many distinct cloud research challenges, an orthogonal set of issues arises when looking at the requirements of the different classes of applications that we aim to run in the cloud. For example, it can be argued that a cloud that runs real-time and safety-critical applications has different services and offers different qualities of service than a cloud for e-commerce applications.
This project focuses on addressing some of the challenges in creating a cloud for data analytics. Analytics, in general terms, refers to extracting information from raw data, often in the context of business decision-making. In response to the growth of this raw data, cloud analytics marshals significant amounts of computing power and robust storage to conduct analytics using on-demand, scalable infrastructure. Gartner describes six key elements that may exist in a cloud analytics solution: data sources, data models, processing applications, computing power, analytic models, and sharing or storing of results. This project focuses on platform and infrastructure for analytics: an infrastructure that is analytics-aware and is driven by the requirements of large-scale data analytics, optionally including a platform to support analytics applications.
There is growing academic interest in large-scale analytics using cloud computing. Position papers have suggested a need for substantial work in this area (e.g. [2, 3]). Individual efforts have made piecemeal advances – for example, migrating analytic applications to the cloud for time series data, and data warehouses for the cloud. There is interest in using the MapReduce paradigm for analytics (e.g., [6, 7]); IBM Research reported efforts – primarily focused on extracting and analyzing large-scale unstructured data – to do search-driven analytics using the Hadoop platform. To our knowledge, there is little or no research at the infrastructure or platform level; the focus is on applications that run on stock infrastructure and platforms.
Commercial vendors have focused on providing analytics-as-a-service – that is, analytics software as a service (for example, SAP and Opera Solutions). Google’s BigQuery service queries terabytes of data in seconds, runs on Google’s cloud infrastructure, scales as needed, and is accessed through RESTful APIs or a web interface. HP, through Vertica, offers a data warehouse that can be run on Amazon EC2. Microsoft SQL Azure Reporting is a business intelligence system that runs only on Windows Azure. These commercial solutions require structured data.
1. Hybrid Clouds for Analytics: To design, implement, and evaluate an architecture for analytics on combinations of private and public clouds that is elastic, scalable, and tailored for the analytics application class.
Hybrid clouds consist of private and public sub-clouds working together to address privacy and security concerns while meeting the need for large computation and storage capacity. Academic research into hybrid clouds has focused on the middleware and abstraction layers for creating, managing, and using hybrid clouds (e.g. [11-13]). For example, Zhang et al. used the MapReduce paradigm to split a data-intensive workload into mapping tasks sorted by the sensitivity of the data, with the most sensitive data being processed locally and the least sensitive processed in a public cloud. Abraham et al. provision private clouds from multiple collaborating entities as well as public clouds from Amazon, automatically and semi-transparently.
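The sensitivity-based split described above can be illustrated with a minimal sketch. This is not the implementation of Zhang et al.; the `Task` type, the integer sensitivity scale, and the `threshold` parameter are assumptions introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A hypothetical map task tagged with the sensitivity of its input."""
    name: str
    sensitivity: int  # assumed scale: 0 = public data, higher = more sensitive


def split_by_sensitivity(tasks, threshold):
    """Return (private, public) task lists.

    Tasks are sorted from most to least sensitive; those whose sensitivity
    is at or above the threshold stay in the private cloud.
    """
    ordered = sorted(tasks, key=lambda t: t.sensitivity, reverse=True)
    private = [t for t in ordered if t.sensitivity >= threshold]
    public = [t for t in ordered if t.sensitivity < threshold]
    return private, public


tasks = [
    Task("clickstream", 0),
    Task("patient-records", 3),
    Task("sales-totals", 1),
]
private, public = split_by_sensitivity(tasks, threshold=2)
print([t.name for t in private])
print([t.name for t in public])
```

In a real deployment the threshold would presumably come from an organizational policy rather than a constant, and the public share could be further limited by available bandwidth.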
Commercial support for hybrid clouds is growing in response to the business case for cloud federation. For example, HP offers software to manage a private cloud, but also provides infrastructure as a service and the ability to have the two communicate (interoperability with other vendors is not discussed). Fujitsu and Microsoft partnered to produce a hybrid of the Windows Azure public cloud with Windows Server running in a private cloud or on a private server. IBM offers both SmartCloud Provisioning (to manage a private cloud) and SmartCloud Enterprise (a public cloud offering), and software to bridge the two at the Software-as-a-Service (SaaS) level. These examples do not enable heterogeneous clouds; that is, the private and public cloud must run specific software to enable linking the two.
In the open source realm, support for cloud federation and hybrid clouds is emerging. Apache Deltacloud is an abstraction layer that allows developers to work with various public IaaS providers (e.g. Amazon EC2, Rackspace) and internal private clouds (supporting various commercial and open-source solutions) using a unified RESTful API. This open-source project recently transitioned from incubation to a top-level Apache project. Apache Libcloud and jClouds are Python and Java libraries (respectively) with similar abstraction goals. All three offer stable support for working with compute instances and cloud storage, but also have beta or experimental support for other abstractions.
We propose an architecture that optimizes analytics on a combination of private and public clouds. Performing an analytics task across heterogeneous, hybrid clouds can combine the scalable, on-demand computation resources of the public cloud with the privacy and control provided by a private cloud. Supporting these tasks at the infrastructure and platform level allows out-of-the-box analytics operations to leverage these capabilities. In the typical case, the large data stores will be local to the enterprise; the research challenge is enabling the use of public cloud infrastructure while supporting data locality, privacy, and bandwidth limitations. An analytics cloud may include highly-optimized data warehouse hardware in addition to the commodity machines commonly used. The architecture of the hybrid cloud for analytics must be scalable and elastic, and must systematize support for non-functional properties such as privacy, data security, and quality of service. Contributions could include data partitioning to minimize data transfer among clouds, computation task modeling to identify candidate modules for migration to a public cloud, and migrating tasks among heterogeneous cloud infrastructures at run-time.
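One way to picture the task-modeling contribution is a simple cost comparison: a module is a candidate for migration only if shipping its input over the limited link and running it publicly is faster than running it locally, and only if its data is allowed to leave the enterprise. The sketch below assumes this model; the `Module` fields, the bandwidth figure, and all numbers are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """A hypothetical analytics module considered for migration."""
    name: str
    input_gb: float        # data that would have to leave the private cloud
    private_hours: float   # estimated run time on private resources
    public_hours: float    # estimated run time on public resources
    private_only: bool = False  # True if the data must not leave the enterprise


def transfer_hours(input_gb, bandwidth_gb_per_hour):
    """Time to ship the module's input to the public cloud."""
    return input_gb / bandwidth_gb_per_hour


def migration_candidates(modules, bandwidth_gb_per_hour):
    """Select modules whose transfer plus public run time beats local run time."""
    selected = []
    for m in modules:
        if m.private_only:
            continue  # privacy constraint overrides any speed-up
        total_public = transfer_hours(m.input_gb, bandwidth_gb_per_hour) + m.public_hours
        if total_public < m.private_hours:
            selected.append(m.name)
    return selected


modules = [
    Module("etl", input_gb=500.0, private_hours=4.0, public_hours=1.0),
    Module("model-training", input_gb=20.0, private_hours=10.0, public_hours=2.0),
    Module("customer-scoring", input_gb=5.0, private_hours=3.0, public_hours=0.5,
           private_only=True),
]
candidates = migration_candidates(modules, bandwidth_gb_per_hour=40.0)
print(candidates)
```

Here the data-heavy `etl` module stays local because the transfer alone takes longer than running it in place, which is the data-locality effect the paragraph describes.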
2. Scaling for Analytics in a Hybrid Cloud: To investigate, design, and evaluate novel approaches to adaptively scaling a business analytics application in response to a changing workload, leveraging the key properties of the required architecture: large data, hybrid public/private clouds, and analytics applications.
Cloud architectures are well-suited for adaptive management: the layer of virtualization and abstraction typically employed allows for fine-tuned and responsive resource management not possible with traditional infrastructure. We focus here on novel contributions to scaling the application and the infrastructure in response to a dynamic workload to meet service level objectives, which may be explicit or implicit. The usual approach of adding or removing resources (typically focused on computation power, namely number of cores) is not sufficient here; the key characteristics of our architecture both enable and require extending the state-of-the-art, as follows:
3. Brokerage of Resources: To design, evaluate and implement an automated solution (a broker) capable of matching the requirements of an analytics application with the available resources in a heterogeneous, hybrid cloud.
We propose the Resource Broker as an additional layer in the cloud architecture that serves as an intermediary between an application or a task and a pool of resources. Typically an application developer must statically decide on a deployment (identify cloud provider, number and type of instances, etc.). The Broker accumulates knowledge about the services offered by providers (public and private) and provides a unified interface to the developer, allowing them to instead specify requirements. The heterogeneous, hybrid cloud is abstracted through a set of APIs and the application may never be aware of the exact underlying infrastructure. The broker can be used to acquire infrastructure-, platform-, or software-as-a-service. Contributions include the design and architecture of this broker, algorithms for matching requirements with resources, and an understanding of what level and quality of information the application should provide about its requirements to achieve the optimal matching.
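The matching step can be sketched as follows. This is an illustrative assumption rather than the proposed broker: the `Offer` and `Requirements` types, the cheapest-feasible selection rule, and the sample offers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """A hypothetical resource offer known to the broker."""
    provider: str
    private: bool
    cores: int
    memory_gb: int
    hourly_cost: float


@dataclass(frozen=True)
class Requirements:
    """What the application asks for, instead of naming a deployment."""
    min_cores: int
    min_memory_gb: int
    private_only: bool = False


def match(requirements, offers):
    """Return the cheapest offer that satisfies the requirements, or None."""
    feasible = [
        o for o in offers
        if o.cores >= requirements.min_cores
        and o.memory_gb >= requirements.min_memory_gb
        and (o.private or not requirements.private_only)
    ]
    return min(feasible, key=lambda o: o.hourly_cost, default=None)


offers = [
    Offer("private-cluster", private=True, cores=8, memory_gb=32, hourly_cost=0.90),
    Offer("public-small", private=False, cores=4, memory_gb=16, hourly_cost=0.20),
    Offer("public-large", private=False, cores=16, memory_gb=64, hourly_cost=0.80),
]

open_job = match(Requirements(min_cores=8, min_memory_gb=32), offers)
sensitive_job = match(Requirements(min_cores=8, min_memory_gb=32, private_only=True), offers)
too_big = match(Requirements(min_cores=64, min_memory_gb=256), offers)
print(open_job.provider, sensitive_job.provider, too_big)
```

The application states only what it needs; which provider serves the request is decided by the broker, and a request that no offer can satisfy yields `None` rather than a partial match.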
The project will involve analytical modeling, simulation and experimental work. The first phases of the project will involve building an experimental infrastructure, a distributed cloud with a private sub-cloud at York University and a public sub-cloud (Amazon).
For the IaaS layer, we plan to use Xen as our main virtualization platform. Xen supports efficient virtualization of Linux/Intel x86 platforms. It is also open source, which allows us to modify it as needed to support our research tasks. In addition to Xen, we may also explore the use of other virtualization platforms.
For the PaaS layer, we plan to use a variety of products available from our industrial partners, including middleware for analytics. We will also experiment with open source applications such as Deltacloud and Libcloud.
The SaaS layer will consist of several benchmarks and applications that capture typical analytics workloads. To take into account more sporadic workloads with different demand profiles, other applications will be considered, following consultations with our industrial partner.
B. Gassman and R. Knox, "‘Cloud analytics’ means many different kinds of opportunity," Gartner Research, Stamford, CT, Tech. Rep. 1386527, 2010.
Copyright ©2012 LIAKM, Toronto, Canada