TL;DR :loudspeaker:


We have started work on an Ansible Style Guide. We hope it will encourage re-use and collaboration in our community, increase the velocity of delivery, and ultimately improve the quality of our infrastructure.


The dark art of turning computers into science

“e-Infrastructures” composed of many different ICT services underpin much of modern research. Despite their importance, they are rarely seen or interacted with directly by researchers themselves, which makes them difficult to talk about[1]. Just like any other kind of research infrastructure, however, they have to be built, and the way in which they are built and delivered informs aspects of their usability, fitness for purpose, scalability, sustainability, cost-effectiveness, and more. There are many ways to describe what e-Infrastructures are[2], but I like to think of them as:

e-Infrastructures: the dark art of turning computers into science.

There are of course as many ways to turn computers into science as there are computers (or scientists, for that matter) - but there’s only so far you can get by yourself. At some point, you need to scale - either by accessing more computer resources, or by accessing more people through services which enable collaboration. This is where you start moving from doing research which needs computers, to research which needs infrastructures. This is a qualitatively different undertaking.

The same thing that makes this research (and hence the e-Infrastructures which underpin it) so tricky also makes it powerful: you can’t do it in isolation. The value of these infrastructures lies not in their components, but in the interaction between these components – and the most important component in them has always been people.

Now, if we want to wield the magic that turns computers into science, we’d best do it at a pace which research itself requires, preferably without breaking things. We’ve[3] not been around long enough to see where “Move Fast and Break Things” leads eventually. Spoiler alert, it’s not[4] good[5] (i.e., everything is broken).

Based purely on my own experience, however, infrastructures often lag considerably behind the pace of development of their components. This makes sense - there is always more rapid development at the edges of technology than at its centre; it’s called the “Bleeding Edge”[6] for a reason, after all: that’s where most of the pain is! But why do we have to make a choice between a decaying, if stable, infrastructure, and one which is so brittle that it hurts to use it? There is perhaps a sweet spot in terms of proximity to the bleeding edge - a situation where we can move fast enough for infrastructures to adopt relevant new technologies (before communities get fed up with the slow pace and branch off on their own), but not so fast that we break things.

Keeping pace, together

How can we balance the need for change with the need for collaboration and co-operation?

I want to discuss one specific case here, related to the smooth delivery of e-infrastructure service components[7]. What I’ll describe here is a style guide for developing the code which builds our services, and how that guide can be an expression of our operations methodology and our culture of collaboration, mutual support and desire to build something that our children will still be using. I will be using the case of Ansible roles, but hopefully the points discussed here, as well as the later implementation, will be generic enough to cover other use cases too[8].

Infrastructure as code

The “Infrastructure as Code” pattern has come to some maturity recently[9], but refers mostly to “Managing Servers in the Cloud”. What about “Managing the actual cloud”? Well, this may be well-covered by a similar pattern (or more precisely, job description) - the Site Reliability Engineer, or SRE[10]. To quote Google:

SRE is what you get when you treat operations as if it’s a software problem

While we’re a long way away from that, there is a quiet shift happening in the e-Infrastructure world. There is a recognition that infrastructures are complex, interacting systems, but that these systems can indeed be described by software.

We do have a pretty good framework for managing services in EGI, based on a few industry standards. These standards require certain procedures and documentation to be in place in order to comply with them, and help to ensure that services are delivered in a consistent, reliable way to customers and peers in a federation.

Compliance, however, can be attained in many ways - the standards make few statements about how requirements are met, only that they are. This leaves a lot of room for interpretation, which is a good thing if the standard is to apply in the widest possible set of cases. However, that can also leave open the possibility for confusion and conflicting styles to take hold, with some negative consequences we describe below.

Code as Community

Style is almost always a matter of opinion. As the old saying goes, there’s no accounting for taste. Modern configuration management tools use high-level languages such as YAML to describe what they do and how they do it, allowing developers and operators to communicate their work almost in plain English. The great irony is that while this makes it easier for individuals to read and write code, it can make it quite difficult for communities or even teams to do so, since individuals are prone to expressing their individual style. Differences in style can be a really positive thing, allowing freer thinking, more creativity and ultimately more satisfaction in working together, as long as there is consensus along broad lines as to what constitutes good style, and more importantly what constitutes bad style.

If a community is to truly be a community, the individuals who comprise it must have values and ideas in common, beyond what is required by the language, the framework and the standards adopted.

Part of the security that comes with adopting a tool like Ansible[11] is the huge community that comes along with it. The fact that it is in use in so many different environments, with so many different goals and usage patterns, is in a way a vaccine against bias. This diversity can be channelled into some form of common understanding of what constitutes good style, and into highlighting where the flexibility of the tool or language is being abused (as well as whether that abuse is justified).

Almost all languages have their linters[12], and Ansible is no different. There are in fact two different style checkers for Ansible:

  1. ansible-lint
  2. Ansible Review

The jury is currently out on whether the latter is still alive, but given that ansible-lint is ahead by roughly a factor of ten on almost all the metrics, I think it’s safe to say that at the time of writing ansible-lint wins.

Contributing to the Commons

In a service federation like EGI, there is a strong temptation for individual service providers to develop these roles themselves – a symptom of the “Not Invented Here” syndrome. The barrier to creating these roles is particularly low, especially if we consider the case where the community using these roles is empowered with solid knowledge of how the tool works. Much of the impetus to “rediscover the wheel” derives from the quality and reliability of the other wheels which have already been built. Instead of a robust design for “a wheel”, which can be re-used by anyone who wants to build a car, we end up with many flimsy wheels which just barely work. This is clearly a suboptimal situation – there is little to be gained by having such duplication of work, and the individual effort required to produce high-quality work is high. It is, however, a situation which persists, in part due to the lack of ownership of the products created.

How then can we create useful bits of infrastructure as a community, where these things are owned by the community itself? Ownership need not be restricted to the mere authorship of code - there are other ways to “own”, for example code reviews, bug fixes, contributing to the style guide, and of course ownership through usage, i.e. reporting issues, helping developers produce high-quality work, talking about the work at meetings, etc.

The main point here is that there are many roles to play, beyond the mere authorship of code, and each of them is important.

A guide, not a standard

Finally, a style guide is not a standard. It can be treated as one, but then it mostly ceases to offer the benefits of creativity described above. A guide is most useful when it is the sincere expression of consensus, based on the experience of a community of practice, of a better way of conducting an activity - not the only way, nor indeed the best way[13]. A guide should be more descriptive than prescriptive - describing how one should go about doing something rather than what one should be doing.

We’ve hit this problem so many times that the time has come to address it.

Style guide in action

:loudspeaker: We have started work on an Ansible Style Guide. We hope it will address some of the waste in our community, and ultimately improve the quality of our infrastructure.

Reducing “Not Invented Here” Syndrome

Let’s say you’re starting work on the development of a new role. This could be either an existing service that doesn’t have a configuration management repository, or perhaps you’re working on a whole new service. The chances are that this role already exists - but the only easy way to check that is to see if it’s on Galaxy. Let’s see:

 ansible-galaxy search umd

Found 3 roles matching your search:

 Name            Description
 ----            -----------
 brucellino.UMD3 UMD3 repository for CentOS 6.x
 egi-qc.umd      UMD distribution repository deployment
 AAROC.UMD-role  Configures the Unified Middleware Distribution and Cloud Middleware Distrbution Stacks on your host

Uh-oh…

OK - but middleware components will be there, right? I’ll save you a lot of frustration, dear reader - they are not. This is not to say that a lot of work has not been done in our community in writing roles for “domestic use”. The tragedy, however, is that all this effort usually doesn’t produce a result of sufficient quality and scope to be reusable. Now, this is usually a problem with the role metadata, meaning that either the role is not enabled on Galaxy, or the metadata doesn’t parse properly - but a larger problem is when roles are written to be so specific to a given use case or site that they cannot be re-used elsewhere.

Improving re-usability

For a role to be re-used, it has to be absolutely trustworthy, and this means putting some more effort into developing these infrastructure components, with a greater appreciation of their benefit to the wider community.

All of these problems could be entirely avoided, and transparently to the developer, by slightly changing the environment and making the development process a little more frictionless.

Solving the problem at the source

A better generator

Ansible, like almost any good tool out there, provides a neat way to generate a skeleton for a new project: ansible-galaxy init. It’s clear that many of the roles for EGI infrastructure that have been produced so far have not taken advantage of this, from the missing directories, files, etc., but even those that have been generated with Ansible Galaxy have conflicting or missing metadata[14], resulting in them failing to show up in the Galaxy search.

But why should we be fiddling about with metadata in the first place?

In terms of the middleware, we only have a few options - these should be automatically added to the supported platforms in meta/main.yml. By the same token, if you’re developing a piece of infrastructure, it’s probably a good idea to have your role cover the possible platforms, and not just one specific option.
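
As a rough illustration of what a better generator could fill in, the supported platforms of a role laid out by ansible-galaxy init live under galaxy_info.platforms in meta/main.yml. The sketch below is hypothetical: the author, description, tags and the particular list of platforms are placeholders chosen for illustration, not the actual set of platforms supported by the middleware.

```yaml
---
# meta/main.yml - a minimal sketch; every value here is a placeholder
galaxy_info:
  author: your_name                # hypothetical
  description: Configures the UMD repositories on a host
  license: Apache-2.0
  min_ansible_version: "2.4"
  # The platforms a generator could pre-populate (illustrative list only)
  platforms:
    - name: EL
      versions:
        - 6
        - 7
    - name: Ubuntu
      versions:
        - xenial
  # Tags help the role show up in Galaxy searches
  galaxy_tags:
    - egi
    - umd
dependencies: []
```

Having this block generated, rather than typed by hand, is one way to avoid the missing or mistagged operating systems that keep roles out of the search results.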

Testing by default

Another sore point in re-use is knowing whether the role actually works. Sure, the documentation can express the limits of what the role is designed for, but again we hit the bias implicit in the developer’s mind. The only way to know if a role really does what it says it does is:

  • Apply it to various initial states.
  • Make assertions on the final state.
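
In Ansible itself, the second step can be sketched as a small playbook that is run against each test host once the role has been applied. This is only one possible shape for such a check, and the file path is a hypothetical placeholder for whatever the role is supposed to produce.

```yaml
---
# verify.yml - a minimal sketch; the path below is a hypothetical example
- name: Make assertions on the final state
  hosts: all
  gather_facts: false
  tasks:
    - name: Look up the file the role is expected to have written
      stat:
        path: /etc/yum.repos.d/example.repo  # hypothetical path
      register: repo_file

    - name: Fail if the expected file is missing
      assert:
        that:
          - repo_file.stat.exists
        msg: "The role did not create the expected repository definition"
```

Running the same assertions against hosts that start from different states (a clean image, an image where the role has already been applied, a different distribution) covers the first step.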

This borrows a lot from the Test-Driven Development (TDD) paradigm that Agilists know and love. Which now raises the question:

Where are all the tests?

There are two things we can do to improve both the re-usability of the role and the life of the developer:

  • Write the tests first
  • Generate appropriate test coverage along with the role skeleton

The former needs a whole post on infrastructure spec tests, which is in the pipeline. For the latter, we can easily include at least a default testing scenario with molecule, as well as a .travis.yml so that the role can have continuous integration.
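
A default scenario of this kind could look roughly like the sketch below. It assumes the Docker driver and a Molecule version that accepts the Ansible verifier; the exact keys differ between Molecule releases, and the platform names and images are illustrative choices rather than a recommendation.

```yaml
---
# molecule/default/molecule.yml - a sketch; keys vary between Molecule versions
dependency:
  name: galaxy
driver:
  name: docker            # assumes Docker is available on the test machine
platforms:
  # Each entry is one initial state to which the role is applied
  - name: centos7
    image: centos:7
  - name: ubuntu-xenial
    image: ubuntu:16.04
provisioner:
  name: ansible
verifier:
  name: ansible           # assumption: older releases default to testinfra
```

With a file like this in the skeleton, the continuous integration job only needs to install Molecule and run `molecule test` on every commit.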

Summary

This won’t solve all of our problems and certainly doesn’t guarantee re-use of existing roles, but laying this groundwork and making it easy to write solid, widely applicable roles will help. Infrastructure components should be reliable, do what they say they do, satisfy the needs of the community rather than the individual, and not introduce any vulnerabilities!

Stay tuned…

References and Footnotes

  1. Just ask my mom or every other unsatisfied person who has asked me what I do for a living. 

  2. Let’s just go with the EC description 

  3. I’m going to leave the question of “who are we actually?” as an exercise to the reader. If you identify with what is written here, you’re part of our community. If not, come on in - we probably have a lot to learn from each other! 

  4. See XKCD 1428 

  5. See “Move Fast and Break Things: How Facebook, Google, and Amazon Cornered Culture and Undermined Democracy”, by Jonathan Taplin 

  6. Hayes, Thomas C. (21 March 1983). “Hope at Storage Technology”. The New York Times. Retrieved 10 September 2013. 

  7. See the Service Component description in the EGI wiki 

  8. For Ruby-based tools like Chef or Puppet the situation is actually way better. Chef for example has a powerful food critic. Compared to Ansible Lint’s few rules, Food Critic is way ahead with over 100.

  9. See the book “Infrastructure as Code: Managing Servers in the Cloud”, by Kief Morris, ISBN-13 9781491924358

  10. See the Google Site Reliability Book 

  11. I don’t mean to imply that this tool has been adopted by the entire community, just that if people in our community want to adopt it, then they can do so. 

  12. A brief search threw up @caramelomartins/awesome-linters which more than illustrates the point, I think. 

  13. I don’t go so far as to say the “best” way, because for one thing that sounds like a strong opinion, and has the arrogance to assert that the current moment is special. Instead of instigating divisive opinions or hedging (by adding varied adjectives like “best known” or “current best”), I prefer to accept the position that what we lay out together through consensus will inevitably be incomplete, but will be better than what we have tried and failed at before. 

  14. Typically missing OS support, or using the wrong tags for the operating systems.