More Comment-Preserving Configuration Parsers

For the past few weeks I've been on the hunt for a configuration file format with the following three properties:

  1. You can use a library to parse the configuration. Most configuration formats allow for this, though some (nginx, haproxy, vim) aren't so easy.

  2. You can manipulate the keys and values, using the same library.

  3. When that library writes the file to disk, any comments that were present in the original config file are preserved.

Why bother? First, allowing programs to read/write configuration files allows for automated cleanup/manipulation. Go ships with a first-class parser/AST, and as a consequence there are many programs that can lint/edit/vet your source code. These wouldn't be possible without that ast package and a set of related tools that make parsing and manipulating the source easy.
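As a rough illustration of what that tooling makes possible, here is a minimal sketch that uses the standard library's go/parser, go/ast and go/format packages to change a string constant while keeping the comments around it. The function name setStringConst and the sample source are made up for this example. go/printer places comments using recorded positions, so edits that change the length of the source a lot may need more care than this sketch takes.

```go
package main

import (
	"bytes"
	"fmt"
	"go/ast"
	"go/format"
	"go/parser"
	"go/token"
	"strconv"
)

// src is a made-up file used only for this example.
const src = `package config

// Version is bumped by the release script.
const Version = "1.0.0" // keep in sync with the changelog
`

// setStringConst (a hypothetical helper) rewrites the value of the named
// string constant or variable and returns the re-printed source.
func setStringConst(source, name, value string) (string, error) {
	fset := token.NewFileSet()
	// parser.ParseComments keeps comments in the AST so they can be printed.
	file, err := parser.ParseFile(fset, "config.go", source, parser.ParseComments)
	if err != nil {
		return "", err
	}
	found := false
	ast.Inspect(file, func(n ast.Node) bool {
		spec, ok := n.(*ast.ValueSpec)
		if !ok {
			return true
		}
		for i, ident := range spec.Names {
			if ident.Name != name || i >= len(spec.Values) {
				continue
			}
			lit, ok := spec.Values[i].(*ast.BasicLit)
			if ok && lit.Kind == token.STRING {
				lit.Value = strconv.Quote(value)
				found = true
			}
		}
		return true
	})
	if !found {
		return "", fmt.Errorf("no string constant named %q", name)
	}
	var buf bytes.Buffer
	if err := format.Node(&buf, fset, file); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := setStringConst(src, "Version", "1.0.1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

Both the doc comment and the trailing comment survive the round trip because they are part of the parsed file, not discarded during parsing.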

You can imagine installers that automatically make a change to your configuration; for example, certbot from the Let's Encrypt project tries to automatically edit your Apache or Nginx configuration. This is an incredibly difficult task, both because of the complexity of the configurations that have piled up over the years and because those configuration formats weren't built with automatic editing in mind.

Backwards incompatible changes are never good, but their downsides can be mitigated by effective tools for parsing and updating configuration.

You want comments in your configuration file because configurations tend to accumulate over the years and it can be incredibly unclear where values came from, or why values were set the way they were. At Twilio, the same HAProxy config got copied from service to service to service, even though the defined timeouts led to bad behavior. Comments allow you to provide more information about why a value is set the way it is, and note values where you weren't sure what they should be, but had to pick something besides "infinity" before deploying.

What problems do you run into when you try to implement a comment-preserving configuration parser? First, a lot of config parsers try to turn the file into a simple data type like a dictionary or an array, which immediately loses a lot of the fidelity that was present in the original file. Second, dictionaries in most languages do not preserve ordering, so you might write out the configuration in a different order than you read it, which messes up git diffs and the comment order.

You are going to need to implement something that is a lot closer to an abstract syntax tree than a dictionary; at the very least maps of keys and values should be stored as an array of tuples and not a dictionary type.
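Here is a minimal sketch of that idea, assuming a made-up "key = value" format with '#' comments (and assuming values never contain '#'). The Line and Config types and the Parse and Set names are invented for illustration. Lines are stored in a slice in file order, so comments and key order come back out the way they went in; only the whitespace around '=' is normalized.

```go
package main

import (
	"fmt"
	"strings"
)

// Line is one line of the file. Key is non-empty only for "key = value"
// lines; comment-only, blank, and unrecognized lines are kept verbatim in Raw.
type Line struct {
	Key, Value string
	Comment    string // trailing comment, including the leading '#'
	Raw        string
}

// Config stores lines in file order instead of in a map, so writing it back
// out keeps both the key order and the position of every comment.
type Config struct {
	Lines []Line
}

// Parse reads the hypothetical format described above.
func Parse(src string) *Config {
	c := &Config{}
	for _, text := range strings.Split(strings.TrimSuffix(src, "\n"), "\n") {
		body, comment := text, ""
		if i := strings.Index(text, "#"); i >= 0 {
			body, comment = text[:i], text[i:]
		}
		key, value, ok := strings.Cut(body, "=")
		key = strings.TrimSpace(key)
		if !ok || key == "" {
			c.Lines = append(c.Lines, Line{Raw: text})
			continue
		}
		c.Lines = append(c.Lines, Line{Key: key, Value: strings.TrimSpace(value), Comment: comment})
	}
	return c
}

// Set updates the first line with the given key, or appends a new line.
func (c *Config) Set(key, value string) {
	for i := range c.Lines {
		if c.Lines[i].Key == key {
			c.Lines[i].Value = value
			return
		}
	}
	c.Lines = append(c.Lines, Line{Key: key, Value: value})
}

// String writes the lines back out in their original order.
func (c *Config) String() string {
	var b strings.Builder
	for _, l := range c.Lines {
		switch {
		case l.Key == "":
			b.WriteString(l.Raw)
		case l.Comment != "":
			fmt.Fprintf(&b, "%s = %s %s", l.Key, l.Value, l.Comment)
		default:
			fmt.Fprintf(&b, "%s = %s", l.Key, l.Value)
		}
		b.WriteByte('\n')
	}
	return b.String()
}

func main() {
	c := Parse("# Timeouts copied from the old service.\ntimeout = 30 # seconds\n\nretries = 3\n")
	c.Set("timeout", "10")
	fmt.Print(c.String())
}
```

A real format needs quoting, nesting and escaping on top of this, but the core design choice is the same: the slice is the source of truth, and any map you build for fast lookups is only an index into it.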

The next problem you run into is that syntax trees are great for preserving the fidelity of source code but tend to be unwieldy when all you want to do is index into an array, or get the value for a key, especially when that value may be any of several types: a number, a string, a date, or an array of the above. The good news is configuration files tend to only need a subset of the syntax/fidelity necessary for a programming language (you don't need/want functions, for example) so you can hopefully get away with defining a simpler set of interfaces for manipulating data.
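One possible shape for such an interface, sketched under my own assumptions: a single Value node type that carries its comment along, with small accessors that return an error when the value is not the kind you asked for. All of the names here (Value, Kind, AsNumber and so on) are hypothetical and not taken from any existing library.

```go
package main

import (
	"errors"
	"fmt"
)

// Kind identifies which field of a Value is meaningful.
type Kind int

const (
	KindString Kind = iota
	KindNumber
	KindList
)

// Value is a hypothetical config value node. Comment travels with the value
// so a printer could write it back out next to the value.
type Value struct {
	Kind    Kind
	Str     string
	Num     float64
	Items   []Value
	Comment string
}

// ErrWrongKind is returned when an accessor does not match the value's kind.
var ErrWrongKind = errors.New("value has a different kind")

// AsString returns the string form of a string value.
func (v Value) AsString() (string, error) {
	if v.Kind != KindString {
		return "", ErrWrongKind
	}
	return v.Str, nil
}

// AsNumber returns the numeric form of a number value.
func (v Value) AsNumber() (float64, error) {
	if v.Kind != KindNumber {
		return 0, ErrWrongKind
	}
	return v.Num, nil
}

// Index returns the i-th element of a list value.
func (v Value) Index(i int) (Value, error) {
	if v.Kind != KindList {
		return Value{}, ErrWrongKind
	}
	if i < 0 || i >= len(v.Items) {
		return Value{}, fmt.Errorf("index %d out of range (length %d)", i, len(v.Items))
	}
	return v.Items[i], nil
}

func main() {
	ports := Value{
		Kind:    KindList,
		Comment: "# 8443 is only used by the admin UI",
		Items: []Value{
			{Kind: KindNumber, Num: 8080},
			{Kind: KindNumber, Num: 8443},
		},
	}
	second, err := ports.Index(1)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	n, err := second.AsNumber()
	fmt.Println(n, err)
}
```

The caller never touches tokens or positions; those stay inside the node for the printer to use.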

(Incidentally, I realized in the course of researching this that I have written two libraries to do this: one is a library for manipulating your /etc/hosts file, and the other is a library for bumping versions in Go source code. Of course those are simpler problems than the one I am trying to solve here.)

So let's look at what's out there.

  • JSON is very popular, but it's a non-starter because there's no support for comments, and JSON does not define an ordering for keys and values in a dictionary; they could get written in a different order than they are read. JSON5 is a variant of JSON that allows for code comments. Unfortunately I haven't seen a JSON5 parser that maintains comments in the representation.

  • YAML is another configuration format used by Ansible, Salt, Travis CI, CircleCI and others. As far as I can tell there is exactly one YAML parser that preserves comments, written in Python.

  • XML is not the most popular format for configuration, but the structure makes it pretty easy to preserve comments. For example, the Go standard library parser contains tools for reading and writing comments. XML seems to have the widest set of libraries that preserve comments - I also found libraries in Python and Java and could probably find more if I looked harder.

  • TOML is a newer format that resembles YAML but has a looser grammar. I don't know of any TOML parsers that preserve comments.

  • INI files are used by Windows programs, and the Python configparser module, among others. I have found one parser in Perl that tries to preserve comments.

  • Avro is another configuration tool that is gaining in popularity for things like database schema definitions. Unfortunately it's backed by JSON, so it's out for the same reasons JSON is out.

  • You can use Go source code for your configuration. Unfortunately the tools for working with Go syntax trees are still pretty forbidding, for tasks beyond extremely simple ones, especially if you want to go past the token representation of a file into actually working with e.g. a struct or an array.
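To make the XML entry above concrete, here is a minimal sketch using Go's encoding/xml token API, where comments arrive as xml.Comment tokens and can be written straight back out. The function name setElementText and the sample document are invented for this example. Note that this preserves comments, not every byte of formatting: as far as I can tell the encoder re-escapes some characters and expands self-closing tags.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// setElementText (a hypothetical helper) copies an XML document token by
// token, replacing the character data directly inside every element called
// name with value. Comments pass through unchanged as xml.Comment tokens.
func setElementText(src, name, value string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(src))
	var buf bytes.Buffer
	enc := xml.NewEncoder(&buf)
	inTarget := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			inTarget = t.Name.Local == name
		case xml.EndElement:
			inTarget = false
		case xml.CharData:
			if inTarget {
				tok = xml.CharData(value)
			}
		}
		// Encode immediately: token bytes are only valid until the next
		// call to dec.Token.
		if err := enc.EncodeToken(tok); err != nil {
			return "", err
		}
	}
	if err := enc.Flush(); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	src := `<config><!-- copied from the old service --><timeout>30</timeout></config>`
	out, err := setElementText(src, "timeout", "10")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```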

I decided on a configuration file format called HCL, from HashiCorp. It resembles nginx configuration syntax, but ships with a Go parser and printer. It's still a little rough around the edges to get values out of it, so I wrote a small library for getting and setting keys in a configuration map.

This is difficult - it's much easier to write a parser that just converts to an array or a dictionary, than one that preserves the structure of the underlying file. But I think we've only scratched the surface of the benefits, with tools like Go's auto code rewriter and npm init/npm version patch. Hopefully going forward, new configuration formats will ship with a proper parser from day one.

6 thoughts on “More Comment-Preserving Configuration Parsers”

  2. Paul Tötterman

    Thank you for the excellent article. I actually stumbled on this because of your ssh_config parser. I agree that configuration parsers that preserve comments are very valuable. I myself would like to see a golang parser for yaml that preserves comments, but I guess I could live with a toml or some other parser that does the same. Have there been any developments other than your hosts and ssh_config parsers?

    Cheers,
    Paul

  3. Hitesh

    Hey! Thanks for an article on such an overlooked topic. Could you please elaborate on building a comment-preserving parser for YAML using the Abstract Syntax Tree (AST) that you propose? Particularly, do you propose to use comments as separate nodes?
