Archive for November, 2010


November 3, 2010

I am in California for a week, visiting Berkeley and Los Angeles, and also catching up with friends and family in the area.

In talking to my friends in Silicon Valley, I was repeatedly struck by the dissonance between the “academic” verification research I do and the “real world” software engineering grunge that becomes as important as, if not more important than, the pure functionality of the code.

For example, there is the issue of system configuration. When you deploy a software stack, say on the cloud, you not only have “your” code, but an entire ecosystem of servers, load balancers, database backends, etc. that must all work together. The magic glue that makes this happen is configuration options written in lengthy files, in somewhat ad hoc languages. Getting the configurations right at deployment, and maintaining their correctness through system and component upgrades is an art. And a nightmare.
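To make the coupling concrete, here is a made-up fragment in the INI style many such files use (the file and option names are invented for illustration): two components each carry their own copy of a fact about the other, and nothing in the toolchain checks that the copies agree.

```ini
; load-balancer.conf (hypothetical)
[backend]
host = app-01.internal
port = 8080        ; must match the port the app actually binds

; app.conf (hypothetical)
[server]
port = 8081        ; the load balancer above still points at 8080
```

Each file is perfectly well-formed on its own; the bug only exists in the pair, which is exactly what makes upgrades so treacherous.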

It’s a nightmare because a misconfiguration can bring down your entire system even though each component (your application, the server) is perfect in isolation. In fact, misconfigurations are a significant cause of system downtime.

The problem is old and important, but I am not sure it has ever been a hot area in academic research. I recall there was a nice paper by Nick Feamster and Hari Balakrishnan on BGP misconfigurations some years back, and more recently, I read a paper by Mona Attariyan and Jason Flinn on dynamic dependency tracking for debugging misconfigurations. Are there other projects I am missing?

A somewhat naive view is to treat the contents of the configuration file as symbolic inputs and apply symbolic execution techniques, generating a symbolic description of all configurations that cause a failure (and, by negation, of the space of “safe” configurations). I doubt, though, that symbolic execution can be scaled to the system level. An alternative is to compute the forward dynamic slice of an error trace, projected onto the set of configuration variables. Both techniques are “low level”; it is not clear to me how semantic constraints can be captured this way.
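The naive view can be sketched in miniature. Below, a toy “system” (the option names and failure rules are invented) fails when two copies of the same fact disagree; with small finite domains we can simply enumerate all assignments, which is a brute-force stand-in for what a symbolic executor would derive as path constraints:

```python
from itertools import product

# Hypothetical miniature system: it fails if the app's idea of the
# database port disagrees with the port the database listens on, or
# if the app's connection pool exceeds the database's limit.
def system_ok(cfg):
    if cfg["app_db_port"] != cfg["db_listen_port"]:
        return False  # connection refused at startup
    if cfg["app_pool_size"] > cfg["db_max_connections"]:
        return False  # pool exhaustion under load
    return True

# Treat each option as a symbolic input over a small finite domain
# and enumerate; a real symbolic engine would return the failing
# region as a formula rather than an explicit list.
domains = {
    "app_db_port":        [5432, 5433],
    "db_listen_port":     [5432, 5433],
    "app_pool_size":      [10, 50],
    "db_max_connections": [20, 100],
}

failing = []
for values in product(*domains.values()):
    cfg = dict(zip(domains.keys(), values))
    if not system_ok(cfg):
        failing.append(cfg)

print(f"{len(failing)} of {2 ** len(domains)} configurations fail")
```

The trouble, of course, is that real configuration spaces are huge and the “failure rules” are buried in the interaction of several programs, which is precisely where the scaling worry above comes from.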

Finally, why isn’t there much more academic interest in the problem? Is it that the issues plaguing large-scale deployments simply don’t “show up” in most projects in an academic setting?