Service-Oriented, Distributed, High-Performance Computing

 

Savas Parastatidis and Jim Webber

July 2005

Applies to:
   Enterprise architecture
   High-performance computing (HPC)

Summary: In this article we explore one service-oriented approach for enabling Internet-scale, high-performance computing (HPC) applications. (16 printed pages)

Contents

Abstract
Introduction
Grid Computing
From The Enterprise...
...To The Internet
Conclusions
Related Reading
Footnotes

Abstract

High-performance computing (HPC) has evolved from a discipline solely concerned with efficient execution of code on parallel architectures to be more closely aligned with the field of distributed systems. Modern HPC is as much concerned with access to data and specialized devices in wide-area networks as it is with crunching numbers as quickly as possible. The focus of HPC has shifted towards enabling the transparent and most efficient utilization of a wide range of capabilities made available over networks, in as seamless a way as the electrical grid delivers electricity.

Such a vision requires significant intellectual and architectural investment. In this article we explore one service-oriented approach for enabling Internet-scale, high-performance applications based on work completed as part of the £250 million UK e-Science program.

Introduction

From networks of workstations through to the Internet [1], the high-performance computing community has long advocated composing individual computing resources in an attempt to provide a higher quality of service (in terms of processing time, size of data store, bandwidth/latency, remote instrument access, specialized algorithm integration, and so on). In recent years this progression has been driven by the vision of "Grid computing," where the computational power, storage capacity, and specialist functionality of arbitrary networked devices are to be made available on demand to any other connected device that is allowed to access them.

Concurrently, the distributed systems community has been working on design principles and technologies for Internet-scale integration (the Web and Web Services, for example). Recently the term 'Service-Oriented Architecture' (SOA) has emerged as a popular piece of terminology, in part due to the hype surrounding the introduction of Web Services. While Web Services are perceived as an enabling technology for building service-oriented applications, they should be treated as an implementation technology for the set of principles that constitute service-orientation. The promise of SOA and Web Services is loose coupling, robustness, scalability, extensibility, and interoperability. These are precisely the features required of a global fabric for "Grid computing," a popular buzzword used to refer to distributed, high-performance computing or Internet-scale computing.

In the following sections we describe grid computing and service-orientation. We then discuss how high-performance applications can be designed, deployed, and maintained using message-orientation and protocol-based integration. Following this, we present our approach to how large-scale HPC applications can deal with a multitude of resources and state in a manner that is consistent with SOA principles.

Grid Computing

The term "Grid computing" is overloaded and has different meanings to different communities (and vendors). Here are some of the common interpretations:

  • On-demand computing
  • Utility computing
  • Seamless computing
  • Supercomputer interconnectivity
  • Virtual world-wide computer
  • SETI@home and ClimatePrediction.net (BOINC-style projects [2])
  • Virtual organizations

We adopt the view that grid computing is synonymous with Internet-scale computing, with a focus on the dynamic exploitation of distributed resources for high-performance computing. When building grid infrastructure and applications we promote the application of the same principles, techniques, and technologies that are typical of modern distributed systems practice, with service-orientation as the architectural paradigm of choice and Web Services as the implementation technology.

Services

While service-orientation is not a new architectural paradigm, the advent of Web Services has reinvigorated interest in the approach. It is, however, a misconception that Web Services are a form of software magic that somehow automatically corrals the architect towards a loosely coupled solution that is scalable, robust, and dependable. Certainly it is possible (and generally highly desirable) to build service-oriented applications using Web Services protocols and toolkits; however, it is equally possible to build applications that violate every architectural principle and tenet of SOA.

As researchers and developers have re-branded their work to be in vogue with the latest buzzwords, the term Service-Oriented Architecture (SOA) has become diluted and imprecise. Due to the lack of a widely accepted definition of a service, we propose the following:

A service is the logical manifestation of some physical or logical resources (for example, databases, programs, devices, humans, and so on) and/or some application logic that is exposed to the network;

and

Services interact by exchanging messages.

Services consist of some resources (data, programs, or devices, for example), service logic, and a message-processing layer that deals with message exchanges (Figure 1). Messages arrive at the service and are acted on by the service logic, utilizing the service's resources (if any) as required. Service implementations may be of any scale: from a single operating system process to enterprise-wide business processes.


Figure 1. The archetypal structure of a service
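To make this structure concrete, here is a minimal sketch (in Python, with entirely illustrative names) of a service whose message-processing layer hands incoming messages to service logic backed by a simple resource.

# A minimal sketch of the archetypal service structure from Figure 1.
# All names here are illustrative, not part of any real toolkit.

class Resource:
    """A backing resource (for example, a data store) used by the service logic."""
    def __init__(self):
        self._store = {}

    def save(self, key, value):
        self._store[key] = value

    def load(self, key):
        return self._store.get(key)


class ServiceLogic:
    """The application logic, which uses resources but knows nothing about transports."""
    def __init__(self, resource):
        self._resource = resource

    def handle(self, body):
        # Act on the message content and use the resource as required.
        self._resource.save(body["job-id"], body)
        return {"job-id": body["job-id"], "status": "accepted"}


class MessageProcessingLayer:
    """Receives messages from the network and hands their content to the logic."""
    def __init__(self, logic):
        self._logic = logic

    def on_message(self, message):
        # In a real deployment this would parse a SOAP envelope; here it is a dict.
        reply_body = self._logic.handle(message["body"])
        return {"headers": {"relates-to": message["headers"]["message-id"]},
                "body": reply_body}


# Wiring the three layers together:
service = MessageProcessingLayer(ServiceLogic(Resource()))
print(service.on_message({"headers": {"message-id": "msg-1"},
                          "body": {"job-id": "job-42", "task": "render"}}))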

Services may be hosted on devices of arbitrary capability (for example, workstations, databases, printers, phones, personal digital assistants, and so on) providing different types of functionality to a network-based application. This promotes the concept of a connected world in which no single device and/or service is isolated. Interesting applications are built through the composition of services and the exchange of messages (Figure 2).


Figure 2. Networked applications are built through the exchange of messages between services hosted in devices. In this example, an application running on a mobile device makes use of the distributed resources through services running on a workstation (job execution, for example), a database, and a printer

Messages

A message is the unit of communication between services. Service-oriented systems do not expose abstractions like classes, objects, methods, or remote procedures, but are instead based around the concept of message transfer. Of course, single message transfers have limited utility, so a number of message transfers tend to be logically grouped to form message exchange patterns (MEPs) that support richer interactions (for example, related incoming and outgoing messages can form a "request-response" MEP). MEPs are grouped to form protocols that capture the messaging behavior of a service (sometimes known as a conversation) for a specific interaction. Such protocols may subsequently be described in contracts and published to aid integration with the service (in WSDL or SSDL [3], for example).
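As an illustrative sketch (the message shapes and header names are invented), the following shows how two one-way message transfers can be correlated into a request-response MEP.

# A sketch of how two one-way message transfers can be correlated into a
# "request-response" MEP. Header and action names are illustrative only.
import uuid

def make_message(action, body, relates_to=None):
    headers = {"message-id": str(uuid.uuid4()), "action": action}
    if relates_to is not None:
        headers["relates-to"] = relates_to   # correlation with an earlier message
    return {"headers": headers, "body": body}

# First transfer: a message travels to the service.
request = make_message("submit-job", {"executable": "blast", "priority": 5})

# Second transfer: the service sends a message back, correlated via relates-to.
response = make_message("job-accepted", {"job-id": "job-42"},
                        relates_to=request["headers"]["message-id"])

# The two transfers together form a request-response MEP.
assert response["headers"]["relates-to"] == request["headers"]["message-id"]
print(response)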

Protocols and Contracts

The behavior of a service in a distributed application is captured through the set of protocols which it supports. The notion of protocol is a departure from the traditional object-oriented world where behavioral semantics are associated with types, exposed through methods, and coupled with particular endpoints (the point of access for particular instances). Instead a protocol describes the externally visible behavior of a service only in terms of the messages, message exchange patterns, and ordering of those MEPs which are supported by the service.

Protocols are usually described through contracts to which services adhere. A contract is a description of the policy (security requirements or encryption capabilities, for example) and quality of service characteristics (support for transactions, for example) which a service supports and/or requires, in addition to the set of messages and MEPs that convey functional information to and from the service.
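The sketch below, which uses invented names and a deliberately informal structure rather than WSDL or SSDL, illustrates the kind of information a contract captures: supported messages, MEPs, their ordering, and the required policies.

# A sketch of a contract as data: the messages and MEPs a service supports,
# the ordering constraints between them, and the policies it requires.
# This is an informal stand-in for what WSDL or SSDL would express in XML.

contract = {
    "service": "JobSubmissionService",        # illustrative name
    "messages": ["SubmitJob", "JobAccepted", "JobStatusQuery", "JobStatusReport"],
    "meps": [
        {"name": "submission", "in": "SubmitJob",      "out": "JobAccepted"},
        {"name": "monitoring", "in": "JobStatusQuery", "out": "JobStatusReport"},
    ],
    # Ordering: monitoring is only meaningful after a successful submission.
    "ordering": [("submission", "monitoring")],
    "policy": {
        "security": {"authentication": "kerberos", "encryption": True},
        "reliable-messaging": True,
    },
}

def allowed(contract, completed_meps, requested_mep):
    """Check whether a MEP may be used, given which MEPs have already completed."""
    for before, after in contract["ordering"]:
        if requested_mep == after and before not in completed_meps:
            return False
    return True

print(allowed(contract, set(), "monitoring"))             # False: nothing submitted yet
print(allowed(contract, {"submission"}, "monitoring"))     # True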

From The Enterprise...

Many organizations are realizing the cost benefits of using clusters of workstations as alternative platforms to specialized supercomputer facilities for their high-performance computing needs. Until recently such cluster-based solutions have been treated as dedicated computational and/or storage resources. Enterprises are now seeking to gain in terms of both cost and performance by using the idle processing power, distributed storage capacity, and other capabilities made available by their deployed workstation-based infrastructure, an approach commonly referred to as "intra-enterprise grid computing."

Here we explore dedicated clusters and how service-orientation can be used for building such solutions before proposing an approach to building intra-enterprise, distributed, high-performance architectures.

Dedicated Clusters

Purpose-built, commodity-hardware-based solutions for high-performance computing are not uncommon within the administrative domains of organizations that require high-performance computation. Such solutions are usually implemented by one or more clusters of workstations with high-speed interconnects (for example, Myrinet, SCI, Gigabit Ethernet, and so on). Some such solutions attempt to provide a single-computer image to applications through the implementation, in hardware or software, of techniques that hide the distribution of CPUs, memory, and storage. Developers are presented with a familiar programming abstraction: shared-memory symmetric multiprocessing. However, such approaches have a tendency to limit the scalability of computational nodes, which may become an issue for certain types of parallel applications. Specialized message-oriented middleware (MPI, for example) is usually employed to deal with the problem of scalability, but at the cost of requiring explicit management of the degree of parallelism by the application. There is a great deal of work in the parallel computing literature discussing the advantages and disadvantages of the shared-memory versus message-passing paradigms for parallel applications.
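As a small illustration of the message-passing style (assuming an MPI installation and the mpi4py bindings), the following sketch has the application explicitly manage the parallelism by scattering work across ranks and reducing the partial results.

# A minimal message-passing sketch using mpi4py (assumes MPI and mpi4py are
# installed; run with, for example, "mpiexec -n 4 python this_script.py").
# The application manages the parallelism explicitly: rank 0 scatters work
# and reduces the partial results, instead of relying on a shared-memory image.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Rank 0 splits the problem; every rank works on its own chunk.
numbers = list(range(1000)) if rank == 0 else None
chunk = comm.scatter(
    [numbers[i::size] for i in range(size)] if rank == 0 else None, root=0)

partial_sum = sum(x * x for x in chunk)          # the "compute" step
total = comm.reduce(partial_sum, op=MPI.SUM, root=0)

if rank == 0:
    print("sum of squares:", total)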

Dedicated clusters for high-performance computing are usually considered and managed as single resources. To enable better utilization of such resources, a service-based approach is preferable. For example, access to resources is usually controlled by a job submission/queuing/scheduling service that ensures optimal system utilization. Web Services technologies can be used to implement such services, and indeed benefit from the significant investment in tooling, efficient runtime support, documentation, and user education in the Web Services area. The composable nature of Web Services technologies makes it easier for the non-functional, quality-of-service needs of the implementation (reliable messaging, security, transactions, and so on) to be incorporated into a heterogeneous environment.
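A minimal sketch of such a job submission/queuing service is shown below; the operation names and the FIFO scheduling policy are illustrative, and in a real deployment each operation would be exposed as a Web Services message exchange.

# A sketch of the kind of job submission/queuing service that might front a
# dedicated cluster. Names and the scheduling policy are illustrative only.
import itertools
from collections import deque

class JobQueueService:
    def __init__(self):
        self._queue = deque()
        self._status = {}
        self._ids = itertools.count(1)

    def submit(self, job_description):
        """Accept a job description and place it in the queue."""
        job_id = f"job-{next(self._ids)}"
        self._queue.append((job_id, job_description))
        self._status[job_id] = "queued"
        return job_id

    def status(self, job_id):
        return self._status.get(job_id, "unknown")

    def dispatch_next(self, node):
        """Hand the next queued job to a free cluster node (FIFO scheduling)."""
        if not self._queue:
            return None
        job_id, description = self._queue.popleft()
        self._status[job_id] = f"running on {node}"
        return job_id, description

service = JobQueueService()
jid = service.submit({"executable": "blast", "arguments": ["-db", "nr"]})
print(service.status(jid))               # queued
service.dispatch_next("node-07")
print(service.status(jid))               # running on node-07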

Stealing Cycles from Staff Workstations

An approach that has recently gained significant momentum is the deployment of cycle-stealing technologies implemented by specialized middleware, like Condor [4]. Such middleware enables the distribution and management of computational jobs on idle workstations, while allowing a workstation to be reclaimed almost instantly by its console user: when the user starts working on the computer again, the guest job is suspended, killed, or migrated to another workstation. However, most current implementations of such middleware do not yet leverage interoperable and composable quality-of-service protocols. As a result, it becomes difficult to create interoperable and seamless solutions for high-performance computing within the enterprise.

In intra-enterprise high-performance computing each workstation, database management system, device, and so on, exposes some functionality as a service. Building such middleware using the principles of service-orientation can result in deployments that scale to thousands of workstations. A service-oriented approach may increase the flexibility, manageability, and value of such solutions, since a large set of widely accepted units of functionality/behavior, made available as protocols, can be leveraged (for example, security, transactions, reliable messaging, orchestration, and so on), and indeed there are efforts to do just that with existing systems (for example, Condor BirdBath [5]).

Future intra-enterprise grid installations will be built around standard services provided by the underlying operating systems. Application protocols like WS-Eventing and WS-Management will be implemented and provided as standard, so that grid-like solutions can be easily implemented and deployed.

An example of a conceptual approach to intra-enterprise HPC computing using Web Services is described in the sidebar and illustrated in Figure 3.

Conceptual Approach for Windows-Based Intra-Enterprise HPC Computing

A set of Windows-based workstations used by staff are part of the "underutilized-CPUs" Active Directory domain. The enterprise wishes to leverage the computational capabilities of these workstations during their idle periods (overnight, for example). The domain administrator pushes the .NET implementation of a set of Web Services that provide submission, monitoring, and management of jobs out to the workstations through Active Directory. Those services leverage the underlying Web Services middleware platform (Indigo, for example) for their security and quality-of-service requirements (for example, notification, reliable messaging, transactions, and so on). Only users that belong to the "underutilized-CPUs" domain are allowed to submit jobs. WS-Security is used for authentication and message encryption, with Kerberos tickets retrieved from Active Directory. In addition to the workstations, there are also a dedicated cluster for the enterprise's high-performance requirements and a data center. Web Services installed on these resources expose computational and data-storage functionality to the network. The enterprise's applications are written so that they can dynamically discover and utilize any distributed computational resources within the enterprise. Therefore, as soon as the functionality is enabled, the computationally intensive applications can automatically start to take advantage of the distributed infrastructure. The users of such applications are unaware of the resources actually used.


Figure 3. Integrating enterprise resources in order to meet the high-performance requirements of applications
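The sketch below illustrates, with an invented registry and placeholder credentials, the application-side behavior described in the sidebar: discover idle job-execution services within the permitted domain and submit work to them.

# A sketch of the application-side behavior described in the sidebar.
# The registry, the service descriptions, and the credential check are all
# hypothetical placeholders for what Active Directory, WS-Security, and the
# Web Services middleware would provide.

workstation_registry = [
    {"endpoint": "http://ws-0142/jobs", "domain": "underutilized-CPUs", "idle": True},
    {"endpoint": "http://ws-0387/jobs", "domain": "underutilized-CPUs", "idle": False},
    {"endpoint": "http://cluster/jobs", "domain": "hpc-cluster",        "idle": True},
]

def discover_idle_services(registry, domain):
    """Return endpoints of idle services the caller is entitled to use."""
    return [entry["endpoint"] for entry in registry
            if entry["idle"] and entry["domain"] == domain]

def submit(endpoint, job, credentials):
    # In the scenario above this would be a WS-Security-protected SOAP exchange;
    # here we just simulate acceptance.
    print(f"submitting {job['name']} to {endpoint} as {credentials['user']}")
    return {"endpoint": endpoint, "status": "accepted"}

credentials = {"user": "analyst1", "ticket": "kerberos-ticket-placeholder"}
jobs = [{"name": f"simulation-{i}"} for i in range(2)]

endpoints = discover_idle_services(workstation_registry, "underutilized-CPUs")
for job, endpoint in zip(jobs, endpoints):
    submit(endpoint, job, credentials)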

We identify a non-exhaustive set of generic services, functionalities, and features that may be offered and/or supported by each device on the network (Table 1). Of course, application-domain-specific functionality will also have to be supported (for example, a BLAST service installed on a powerful server to perform bioinformatics analysis, or a service implementing an estimation algorithm for petroleum usage).


Table 1. Characteristic services/functionalities/features devices may support

Clusters of Clusters in the Enterprise

Larger enterprises may be interested in more than deploying single-cluster solutions or simply reclaiming the idle processing power of parts of their organizational infrastructure. They may also wish to focus on the encapsulation of entire sets of computational resources behind high-level services that, when composed together, can enable a level of integration that was previously difficult and time-consuming due to the number of different deployed technologies.

The approach to architecting intra-enterprise high-performance solutions is similar to that used for the cluster-based solutions discussed earlier (the issues of scalability, loose coupling, composability, and so on, apply equally). Quality-of-service protocols like security (for authentication, authorization, and accounting), transactions, reliable messaging, notifications, and so on, are all part of the underlying infrastructure and can be used unmodified no matter what type of solution is implemented. Moreover, the set of services used for intra-enterprise solutions can also be used unmodified (for example, user credential management, systems management, application and service deployment, workflow support, data storage and archiving, messaging, and so on).

As previously, there is still a need for services to offer access to computational and data resources, and to implement scheduling, visualization, specialized-algorithm functionality, and so forth, depending on the type of application being implemented. This time, however, the services are at a higher level of abstraction because entire collections of resources are encapsulated (Figure 4).


Figure 4. An example of an intra-enterprise service-oriented application with different parts of the enterprise being represented as services

We note that even though the complexity of the services has increased from those that we used when building a cluster solution, the complexity of the architecture has not, and the principles and guidelines remain the same. Our distributed application still binds to the messages being exchanged, and no assumptions are made about an enterprise-wide understanding of the interfaces and behaviors of the various components. Architects design applications through the description of messages and the definition of protocols that capture service behavior.

We observe that as the granularity of the services increases, so must the granularity of the message exchanges. The network is expensive, so architects need to design their protocols and messages appropriately. As the degree of distribution of an application increases, the need for loose coupling also increases. While complete control of the infrastructure and the set of deployed technologies is possible within a single cluster or a smaller enterprise environment, in an enterprise-wide (or larger) solution it is imperative that integration happens through standard protocols.

Also, it is clear that as the scale and complexity of an application increase, the functionality of its services becomes even more abstract. From services that expose specific functionality to the network (remote process execution and workstation management, for example) or provide access to a resource (a database system or filesystem, for example), we move to services that support the aggregation of functionality (job queues, message queues, and cluster management, for example) or the aggregation of resources (a storage area network or database federation, for example).

Since the interactions between the services become coarser in order to minimize the effect of the communication costs between the different parts of the applications, and the services become more abstract and coarse-grained with respect to the resources they encapsulate, we must design applications using larger building blocks. The larger the scale of a distributed application, the more important it is to devise declarative, protocol-based, and coarse-grained mechanisms for describing behavior. It is at this stage that workflows, contracts, and policies become even more significant. Service orchestration and abstract business processes become necessary, and so relevant technologies like WS-BPEL become an important part of the architect's toolset.
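As a sketch of this declarative, coarse-grained style (a stand-in for, not an example of, WS-BPEL), the following describes an application as an ordered list of coarse interactions with invented services and runs it through a tiny interpreter.

# A sketch of describing an application as a declarative, coarse-grained
# workflow over services, in the spirit of (but much simpler than) WS-BPEL.
# Service names and the tiny "engine" are illustrative only.

workflow = [
    {"service": "DataFederation", "action": "extract-dataset", "into": "dataset"},
    {"service": "ComputeCluster", "action": "run-simulation",  "uses": "dataset", "into": "results"},
    {"service": "Visualization",  "action": "render",          "uses": "results", "into": "images"},
]

def invoke(service, action, payload):
    # Stand-in for a coarse-grained message exchange with the named service.
    print(f"-> {service}: {action} ({payload!r})")
    return f"output-of-{action}"

def run(workflow):
    context = {}
    for step in workflow:
        payload = context.get(step.get("uses"))
        context[step["into"]] = invoke(step["service"], step["action"], payload)
    return context

run(workflow)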

...To The Internet

Just as no single device is an inaccessible island within an administrative domain, neither are enterprises and organizations isolated. As in the physical world, enterprises do business with one another, organizations interact, and government institutions collaborate. Services and interactions are fundamental to our day-to-day activities (the banking service, the post office service, a travel agent service, for example). It is only natural that when we model these activities in a computerized world, we follow a similar architectural approach to the one adopted in the real world.

As one would expect, when it comes to very large scale applications with a focus on delivering high-performance, the nature of the application may lead us to different designs with different strategies in mind compared to an intra-enterprise situation. As is the case in the real world, contracts and service-level agreements are put in place to govern the interactions between enterprises. Virtual organizations may be established—in the same way alliances are formed between enterprises in the physical world—to meet the high-performance needs of the participating entities' applications. Indeed, a distributed, high-performance application may reflect a real-world alliance between enterprises as illustrated in Figure 5 (a number of research institutes joining forces in order to solve a large scientific problem, for example).


Figure 5. An illustration of how enterprises can be joined

For the virtual organizations to be viable, issues such as digital representations of agreements, contract negotiations, non-repudiation of interactions, federation of user credentials, policies, and agreed quality-of-service provisioning have to be addressed. High-level workflow-based descriptions may be put in place to represent the behavior of the virtual organization or to choreograph cross-enterprise business interactions. Applications in this space, when appropriately designed and implemented, may become truly Internet-scale.

While the types of services found inside the enterprise, like job queuing, scheduling, resource brokering, data access and integration, visualization, and so on, may still be necessary, when we move to Internet scale, care must be taken over how such services are implemented and deployed. Centralized solutions (a single data store service, for example) or tightly coupled behaviors (long-lived transactions across organizations or the direct exposure of state, for example) should be avoided. Of course, one may argue that Google and Amazon are spectacular examples of centralized repositories. This is true, but it is also the case that these are logical repositories that already use scalable solutions for their implementations; they are built on top of distributed, replicated, loosely synchronized data centers. Google and Amazon can be seen as good examples of virtualized data access services that have been designed and implemented with scalability and performance in mind.

Declarative Integration

With Internet-scale applications, alliances and collaborations between organizations are formed using digital contracts. Such contracts represent the set of service-level agreements that must be put in place in order for computational jobs to travel from one organization to another, for data sources to become visible, for the functionality of special equipment to become available to the partners, for agreed levels of trust in user roles, security, and policy requirements, and so on. Virtual organizations need contracts to govern their operations, as is the case with any collaboration between enterprises in the real world. The contracts are read, validated, and executed by specialized supporting middleware.

Furthermore, in a digital world where every kind of resource is accessible, application requirements and service offerings are described using declarative languages. The supporting middleware is responsible for dynamically matching an application's requirements with a service offering. If necessary, dynamic negotiation of payment and service-level agreements may have to take place.

For example, an application may advertise that it needs a service offering computational resources with specific hardware and software requirements, a data storage facility of a certain size, a visualization engine with a specific response time, and an equation-solving service with a guaranteed up-time. These requirements are expressed using an XML vocabulary. The resulting document is sent to a (distributed) registry and a set of available services is discovered. The underlying middleware negotiates the service-level agreement and payment details with the resulting services according to the application's requirements, within the limits that the end user has set. Once agreement has been reached, a digital contract, which can be used to resolve future disputes, is signed by all parties involved.
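A sketch of this matching step is shown below; the XML vocabulary, offering records, and matching rules are all invented for illustration, and the registry lookup and SLA negotiation are elided.

# A sketch of matching a declaratively stated set of requirements against
# advertised service offerings. The vocabulary and offerings are invented.
import xml.etree.ElementTree as ET

requirements_xml = """
<requirements>
  <compute cpus="64" memory-gb="128"/>
  <storage size-tb="2"/>
  <visualization max-response-ms="200"/>
</requirements>
"""

offerings = [
    {"service": "http://hpc.example.org/compute", "kind": "compute",
     "cpus": 128, "memory-gb": 256},
    {"service": "http://store.example.org/archive", "kind": "storage", "size-tb": 10},
    {"service": "http://viz.example.org/engine", "kind": "visualization",
     "max-response-ms": 150},
]

def satisfies(offer, requirement):
    """Return True if an offering meets every attribute of the requirement."""
    if offer["kind"] != requirement.tag:
        return False
    for name, value in requirement.attrib.items():
        if name.startswith("max-"):
            # "max-" attributes are upper bounds (for example, response time).
            if float(offer.get(name, float("inf"))) > float(value):
                return False
        else:
            # Everything else is a lower bound (for example, CPUs, storage size).
            if float(offer.get(name, 0)) < float(value):
                return False
    return True

def match(requirements_xml, offerings):
    matched = {}
    for requirement in ET.fromstring(requirements_xml):
        for offer in offerings:
            if satisfies(offer, requirement):
                matched[requirement.tag] = offer["service"]
                break
    return matched

print(match(requirements_xml, offerings))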

Community Networks

The SETI@home, ClimatePrediction.net, and other similar projects have demonstrated that through community networks it is possible to bring together resources to solve large problems. If we ignore the controversy surrounding the use of P2P technologies for file-sharing, the grid computing promise of collaboration on scientific and business problems is a good match for the capabilities of P2P technologies or community networks. Future HPC Internet applications should consider P2P technologies as enabling technologies for file transfers and sharing, resource discovery, computation distribution, and so on. For example, we can imagine a P2P network that allows jobs to be submitted, and suitable resources for execution to be automatically discovered.
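The sketch below illustrates, over an invented peer graph, the kind of hop-limited discovery such a P2P network might perform to find a peer able to run a job.

# A sketch of peer-to-peer resource discovery: a job requirement is flooded
# to neighboring peers until a peer that can satisfy it is found. The peer
# graph and capability records are invented for illustration.

peers = {
    "peer-a": {"neighbors": ["peer-b", "peer-c"], "free-cpus": 0},
    "peer-b": {"neighbors": ["peer-a", "peer-d"], "free-cpus": 2},
    "peer-c": {"neighbors": ["peer-a"],           "free-cpus": 0},
    "peer-d": {"neighbors": ["peer-b"],           "free-cpus": 16},
}

def discover(origin, cpus_needed, max_hops=3):
    """Breadth-first flood of a discovery query, hop-limited to bound traffic."""
    visited, frontier = {origin}, [(origin, 0)]
    while frontier:
        peer, hops = frontier.pop(0)
        if peers[peer]["free-cpus"] >= cpus_needed:
            return peer
        if hops < max_hops:
            for neighbor in peers[peer]["neighbors"]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    frontier.append((neighbor, hops + 1))
    return None

print(discover("peer-a", cpus_needed=8))   # peer-d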

P2P networks may be implemented using the same set of Web Services technologies in order to leverage the huge investment in the underlying quality-of-service protocols.

The Business Value

Although the concept of grid computing emerged from the supercomputer community, businesses are now also realizing its commercial value. Pay-per-use or subscription-based access to resources (especially high-performance computing resources) is starting to emerge as a viable business model, with large companies already deploying the enabling technologies and services. Of course, the integration of such deployments into applications has to become ubiquitous and completely transparent to the end users for the vision of "utility computing" or "computing as a service" to become a reality.

This yields a number of valuable opportunities. Obviously, there will be companies that reap the benefits of hosting cost-effective compute resources for others to integrate into their environments on an ad hoc basis. The reciprocal is that there will be opportunities for companies to plan their spending on IT infrastructure more effectively and to decide whether up-front capital investments should be superseded by a pay-as-you-compute model, in addition to the general business agility that moving to a service-oriented architecture will yield.

Common Services and Functionality

The set of typical services and functionalities presented earlier (Table 1) is also needed in Internet-scale HPC applications. However, these services are more abstract and must make different assumptions about the environment in which they are deployed. For example, who is allowed to send jobs for execution in an organization's compute center will be defined through digital contracts, while the quality of service that each interaction receives (CPU and data storage allocation, for example) will be controlled by the service-level agreements defined in the same contract. In addition to Table 1, however, we also observe the set of typical services/behaviors for Internet-scale HPC presented in Table 2.


Table 2. Characteristic services/functionalities/features for Internet-scale HPC computing

Design and Implementation Guidelines

Having discussed at a very high and abstract level the architecture of service-oriented, high-performance, distributed applications that can scale across the Internet, we now touch on some important design and implementation considerations.

Communication is Expensive

Despite ever-increasing improvements in network latency and bandwidth, communication over commodity network infrastructures is orders of magnitude less efficient than over specialized interconnects or memory-bus architectures. Consequently, care must be taken when architecting, designing, and building distributed HPC applications to minimize the costs associated with message exchanges between the components of an application.

In addition to the network costs, the HPC community is also concerned with the computational costs incurred due to the processing of XML. However, this is an aspect that is being addressed by the SOAP community. In fact, good SOAP implementations already approach the performance of binary mechanisms for short messages [6], which implies that eventually the limiting factor for message transmission either in binary or SOAP format will be the latency and bandwidth of the network. While we do not wish to denigrate the importance of low-latency and high throughput for HPC applications, it is clear that the laggard label that SOAP has attracted is somewhat undeserved.

Principles and Guidelines for Service-Oriented Applications

Loose-coupling and scalability are the results of principled design and sensible software architecture. We adopt the following tenets for building service-oriented applications:

  • The collection of protocols supported by a service determines its behavioral semantics.
  • Services bind to messages and the information conveyed through them, and not to particular endpoints and state.
  • Messages exchanged between services are self-descriptive (that is, as in REST [7]) insofar as they carry sufficient information to enable a recipient to establish a processing context, as well as the information needed to execute the desired action.
  • Services are implemented and evolve independently of one another.
  • Integration of services takes place through contract-based agreements.

In addition to the above principles, we also promote the following list of guidelines when building service-oriented systems:

  • Statelessness: This property relates to the self-descriptive messages principle above. Services should aim to exchange messages that convey all the information needed for receiving services to re-establish the context of an interaction in a multi-message conversation. Stateless services are easy to scale and make fail-over fault tolerance very simple (see the sketch after this list).
  • Rich messages: Communication costs are usually high. Hence, we should aim for protocols that involve rich messages and result in coarse-grained interactions, effectively minimizing the number of times a service has to reach across the network.
  • State management: As per traditional N-tier application design, service implementations should delegate all aspects of state management to dedicated and specialized data stores (as suggested in Figure 1).
  • Message dispatching: There should be no assumptions about the dispatching mechanisms employed by services. As a result, no dispatching-specific information should leak from the service implementations, cross the service boundaries, or be conveyed through the message contents (for example, RPC-style SOAP, document/wrapped-style WSDL, or method names conveyed as soap:action or wsa:action values).
  • Role-specific coupling: Architects should keep in mind that in a service-oriented architecture there are no actors like "consumer," "provider," "client," or "server." These are roles that exist at the application level and not at the level of the architecture's building blocks, the services. In SOAs there are only services that exchange messages. Treating a pair of services as client/server introduces a form of coupling that may ultimately be difficult to break.
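The sketch below illustrates the statelessness and state-management guidelines under simple assumptions: each message carries a conversation identifier, and all conversation state lives in a dedicated store rather than in the service process.

# A sketch of the "statelessness" and "state management" guidelines above:
# each incoming message carries enough information (here, a conversation
# identifier) for the service to re-establish its processing context from a
# dedicated store, so the service process itself holds no conversation state.
# The store and message shapes are illustrative only.

conversation_store = {}    # stand-in for a dedicated, specialized data store

def handle(message):
    conversation_id = message["conversation-id"]
    # Re-establish context from the store using information in the message.
    context = conversation_store.get(conversation_id, {"messages-seen": 0})
    context["messages-seen"] += 1
    context["last-action"] = message["action"]
    # Persist the updated context; the service instance can now be discarded,
    # restarted, or replaced without losing the conversation.
    conversation_store[conversation_id] = context
    return {"conversation-id": conversation_id,
            "acknowledged": message["action"],
            "messages-so-far": context["messages-seen"]}

print(handle({"conversation-id": "c-1", "action": "submit-job"}))
print(handle({"conversation-id": "c-1", "action": "query-status"}))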

Conclusions

Services are a sensible abstraction for encapsulating and managing the increasing level of complexity in distributed applications. The beauty of service-orientation is that the architectural principles and the guidelines are consistent from an operating system process right through to a service that encapsulates an entire business process or even an entire organization. The architectural requirements of high-performance Internet-scale or "Grid" computing are not different from those of enterprise business-focused computing, and so identical principles and guidelines should be used.

Related Reading

  1. Open Grid Services Architecture
  2. Globus Toolkit
  3. OGSA Data Access and Integration
  4. Web Services Grid Application Framework
  5. Semantic Web
  6. W3C Web Services
  7. OASIS Web Services

Acknowledgments

The authors would like to thank Prof. Paul Watson (School of Computing Science, University of Newcastle upon Tyne) for his useful feedback during the preparation of this article.

Footnotes

  2. https://boinc.berkeley.edu
  3. https://ssdl.org
  4. https://www.cs.wisc.edu/condor/
  5. https://www.cs.wisc.edu/condor/birdbath/
  6. https://mercury.it.swin.edu.au/ctg/AWSA04/Papers/ng.pdf
  7. https://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm

 

About the authors

Savas Parastatidis

School of Computing Science

University of Newcastle upon Tyne

Jim Webber

ThoughtWorks Australia

This article was published in the Architecture Journal, a print and online publication produced by Microsoft. For more articles from this publication, please visit the Architecture Journal website.

© Microsoft Corporation. All rights reserved.