TL;DR - There is more than one way to deliver applications, but there’s a whole bunch of reasons to do it right.
Application delivery : You’re doing it wrong
I had a discussion with a colleague recently, which started with a few requests to EGI from “third-parties” (researchers or infrastructure projects outside of the direct European sphere of influence) as to how they could use EGI services. Since there have been significant changes in the nature of computing resources, there is also renewed interest in such collaboration, often in the hope that what has been developed in Europe or elsewhere1 might address some deficit or solve some problem.
Leaving aside the mere numerical shortages - the lack resources on which to execute scientific workflows, and deal with scientific data - there are, in my mind, essentially two major issues to address :
- Availability of relevant scientific applications
- Access to the infrastructure
These are the literal barriers to entry, and decisions on how to address these two issues shape and inform other aspects of whatever infrastructure one builds. There are many schools of thought regarding the means of getting applications into a place where they can be executed by the users that need them, commonly an approach which recognises nothing but the lowest common denominator is taken: give the user access to the OS, and have them solve their dependencies and configurations themselves. Some projects provide dedicated case-by-case support to deploying or installing applications, in which case the user or group in question is not alone in their efforts, but of course this means that some other group is alone and may have to wait.
Yet other approaches may include simply locking down the infrastructure endpoints so that only predetermined applications can be executed; or perhaps some intermediate position whereby new applications can be introduced into the execution environment, but only laboriously and in way which is overly-reliant on manual tasks and particular expertise (often locked in a particular person).
I’ve always found these approaches distasteful and inelegant. I’ve often, upon reflection and in interactions with those who use and rely on them, found them the result of logic or circumstances which may not continue to hold. In my opinion, there is a better way to do this2, to deliver applications in their correct, desired configuration, to arbitrary resources - adopting continuous delivery. This is the heart of the CODE-RADE.
CODE-RADE solves a problem
I’ve written about CODE-RADE here and there. The main thrust of this article is to describe a problem and a solution to it. That problem is the delivery of applications to aribtrary execution platforms. This is a general problem and we have come up with a general solution3. I’ll leave more technical descriptions of the CODE-RADE platform for a later date.
Everything is awesome !
We’ve spent significant time and money building all kinds of resources with pretty much one goal : let the scientists do the science. Compute facilities such as HPC centres, grids, clouds, etc are the food of modern science; when consumed, they make science grow. Infrastructure has come a long way since the early pre-grid days in South Africa. We now have better network connections, wider distribution of HPC resources and many more people actively engaged in the ecosystem in some way. Applications themselves have also matured, and are easier to install and configure. Leaving aside the issue of data access and movement, we’re in a pretty good place, relatively speaking4. One would be forgiven for assuming that if one simply had access to one of these compute resources, one would within a short amount of time be able to start producing scientific results, through the execution of scientific applications.
This was a false assumption : getting applications onto HPC centres was a non-trivial task. All sorts of issues from, version to source availability, license, compiler, execution model and dependencies got in the way of making the most out of these resources.
Everything was not awesome
It’s a jungle out there
We have seen an increase in the variety of compute platforms, in terms of underlying hardware and configurations in the last 10 years. The “standard” beowulf cluster was tinkered with, adding various kinds of hardware acceleration, data cache, hierarchical networks, etc not to mention new chips and instruction sets, operating systems. To this, add the efforts to share resoruces in federations, along grid and cloud paradigms - the challenge of getting the right application, in the right way, in the right place, reliably and reproducibly emerged as a serious one.
The issue of compiler optimisations may be taken as an example. The very same application may be compiled in slightly different ways, depending on the hardware it will be executing on. Most of these optimisations have the mere effect of speeding up the execution of the code, however some may have side-effects in numerical operations. Applications distributed as precomiled binaries may simply not find the hardware for which they were configured, and not execute at all5. All told, for a researcher or research community6, a significant amount of time is invested in simply getting to the point where they are able to fully execute workflows on single resources or distributed infrastructures.
Researchers and Research Software Engineers : a stunted dialogue
As we will discuss shortly, the researcher does not act alone in using resources. In rare cases, the researcher owns and operates their own resources directly - it is far more likely that they have instead have access to a shared system, which has a series of gatekeepers. More frequently, we have a human interface between the compute environment and the researcher - the “research software engineer” (RSE) or “application expert”7. The RSE is typically a general support role, whilst typically the application expert is someone who is a domain-specific researcher who has specialised further in the support of a specific application, stack or framework8. Whatever the specific case, there is a dialogue between the researcher (who comes from a position of scientific priority), and the RSE who comes from a position of technical implementation. Anecdotally, dialogues between these two roles can follow a pattern9:
- Initial contact
Especially for e-Research Centres which are just starting out and do not have prolonged contact with their research communities, there is a phase of initial contact where the two “sides” attempt to find a common language. Their differing perspectives and priorities mean that often they may speak at cross-purposes. Simple differences in terminology and frames of reference are important to overcome, before it becomes possible to have a more concrete discussion on what actions to take. However, after a few of these engagements, the RSE may see patterns emerging in how researchers expect to undertake their research.
The second phase takes on very specific and concrete aspects. In this phase, the RSE and researcher have converged on a common goal of actually making the application available and have diagnosed the necessary prerequisites. These are often not prescriptive however, but descriptive, ie:
- “you need compiler family xyz”,
- “the application should be compiled with dependency x,y and z”
- “we require dataset x and y, from place a”
Very rarely, these conditions are expressed in some executable format - they are most often written in some form of documentation. Application configuration and dependencies written in a non-machine-readable format leaves open the possibility of various interpretations, and manuals may not prescribe how applications are to be installed, but focus more on their usage. Thus, the second phase leads into the third phase, wherein attempts at clarifying uncertainty or resolving problems are made.
At this point, the researcher and the RSE may be of the opinion that their respective counterpart is well aware of the constraints and needs of the platform and application respectively, however often this optimism is falsely founded. A myriad of minor details which either of them avoided mentioning - perhaps in good faith, considering them unnecessary complications - may factor in to the work to be done. This miscommunication can lead to frustration and can be very time-consuming. In the worst case, it can lead to mutual disgust at the ignorance of respective parties - but usually a plateau of productivity is reached after some time.
Suffering through every application
It is also likely that dialogues such as the one described above between RSE and researcher happen in a similar way repeatedly at various sites. Although knowledge and experience can be shared via various means, unless it is in an executable, machine-readable format, time will still be needed to parse, understand and implement it in a local context.
Scaling support - the number of applications
While this is certainly not the case for all applications, the need for a smoother, clearer, more inclusive delivery process remains. The need is particularly keen when one considers the depdendencies of more and more complex applications, givent that each of those has multi-dimensional attributes, such as version number, compiler optimisation, target architecture, host operating system etc. A single application than thus really decompose into 16 or more, given the desired configurations.
Clearly a systematic way of delivering applications is needed.
Addressing the problem
To summarise the previous sections, the issue of truly enabling researchers to conduct their researcher, as opposed to merely providing them access to a platform, can be ascribed to:
- Complexity of the execution environment
- Complexity of the disalogue between researchers and research software engineers
- Complexity of the application ecoystem
In the next section, we describe a solution to these problems. This solution:
- Lowers the barrier to entry to the grid or cloud infrastructure, or single HPC sites
- Puts the prerogative for addressing technical problems in the hands of the people most able to solve them : the application expert
- Allows The application expert to prove to the resource provider that the application will actually run on the execution environment of the site
- Easily manages the lifecycle of applications across multiple versions, architectures, configurations
- Ensures that once applications are certified, they are actually available on as many sites as possible
- Allows better collaboration between researcher, research software engineer and infrastructure provider.
Solving these problems
We now schematically describe how the problem was solved with CODE-RADE, starting from a few basic hypotheses.
Seven Hypothese of scientific applications
- At the core of the issue are applications.
- No software is an island.
- Applications need an environment
- There is more than one environment
- All solutions decay
- The use of automated agents reduces the cognitive load on humans.
- The solution is attainable.
These hypotheses drive the design and implementation of the CODE-RADE platform for application delivery.
As mentioned above, researchers rarely if ever act alone in the execution of the scientific applications: they almost never own and operate their own resources, but rather use shared resources provided by their institute, collaboration or a distributed federation or national facility. In order to better model the delivery of applications to these, CODE-RADE also recognises several roles, each conducted by a series of actors:
- Research Software Engineer
- Infrastructure Operators
- Automated Agents
Although they have a common goal of enabling the research agenda brought forward by the scientist, only the scientist has the final scientific output as their main priority. The RSE may prioritise the sustainability of a particular suite of software, or the performance of such in various environments. The infrastructure operators may want to prioritise the stability and accessibility of the platform, or enforcing certain site policies related to security, accounting, etc.
Also included here “automated agents”. These are currently conceived as “helpers” of the previous three roles - adjuncts to their functions, so as to reduce cognitive or other workload. Although they are merely extensions of the other roles, they can play a vital role in helping the humans in the other roles to interact and collaborate with each other. Automated agents do not tire, and often have specific protocols or APIs via which they can communicate.
Having this discussed aspects of the problem at hand, hypotheses and actors, we next describe the implementation of the solution : the CODE-RADE platform.
In order to solve the problem of application delivery as we have described it, we postulate that no single tool method will address all aspects, and respect the priorities of all actors. What is needed is the construction of a platform which the actors can use in a collaborative fashion, with well-defined interfaces. The usage of this platform should typically follow the flow we have described above, but removing confusion and delays by adopting methods and tools which make the needs of applications explicit, and with machine-readable configuration and documentation, automating as many actions as possible. This platform can be composed by using a few widely-used tools and services. It is designed, as a consequence of our hypotheses stated above, to have few important aspects. We go into the respective detail of these below.
Although we are adopting a multi-actor platform, with automated agents, we must still recognise the basic workflow of the application delivery, which we described above - however with some improvement and extension:
|Step 1||Initial contact||Initial contact|
|Step 2||Diagnosis||Diagnosis (iterative)|
|Step 3||Feedback (slow)||Feedback (automated)|
|Step 4||Convergence/confusion||Convergence (automated)|
|Step 5||Deployment (manual, target-specific)||Deployment (automated, arbitrary targets )|
The point here is that we are still going through similar motions as before - there is still a dialogue between RSE and researcher. The difference however is that this dialogue is far more transparent, explicit and speedy. The effect of changes are seen immediately, the reasons for failures are explicit and although they may not immediately indicate the remedy, they provide clear symptoms, rather than anecdotes. Furthermore the effort in building and integrating new applications immediately pays off, since applicaitons are delivered automatically to sites.
The generic components of the CODE-RADE platform are :
- machine-readable expression of software
- version control
- continuous integration service
- arbitrary build environment
- fitness tests
- automated delivery
Each of these can be implemented in several ways, however a workflow similar to the one expressed above should be maintained.
For the foundation releases of CODE-RADE, the following were used :
- machine-readable expression of software :
- version control
- continuous integration service
- arbitrary build environment
- fitness tests
- automated delivery
- Cross-platform :
- Build and test artifacts for an arbitrary set of targets.
- Promote diversity in computing platforms.
- Ensure proper optimisations and application portability
- Atomic :
- Fine-grained control over dependencies, versions and targets
- Relevant action taken on each event
- Community-Driven :
- No restriction on the applications that can be integrated.
- Anyone can contribute applications, resources, code review, etc
- Automated :
- Heavy reliance on automated agents to reduce bias, lead time
- Action initiated by users, not operators.
References and footnotes
I write from an African point of view. “Elsewhere” can typically be interpreted as “places which are better resourced and developed than here”. ↩
I find it a great temptation to claim that there is a “right” way, but I should be wary of the assumptions implicit in this argument. I am considering specifically a federated, distibuted environment like a grid or a federated cloud. My arguments will hold less weight in other environments, specifically ones where the control and authority is more centralised. ↩
Note: “a” solution, implying that there are non-unique ways to solve this problem. ↩
I’ll arbitrarily compare whatever we describe here to the start of the national facility (c. 2007) ↩
The obvious case is an application expecting a GPGPU, where none is available. ↩
I’ll use “researcher” and “research community” interchangeably here. This term means “someone, some group, or some thing which wants to execute a scientific application”. ↩
We use “application expert” and “Research Software Engineer” interchangeably here. ↩
An example would be a researcher who maintains a GEANT4 installation for a local user group. ↩
I must admit, I have on evidence to support this, just subjective experience. I’d really appreciate any form of critique or feedback here. ↩