A Two Month Debugging Story

This is the story of an error that caused our tests to fail in maybe one out of every hundred builds. We recently figured out why it was happening and deployed code to fix it.

The Problem

I've been fighting slow test times in Javascript and Sails since pretty much the day I started at Shyp. We run about 6000 tests on a popular CI service, and they take about 7 minutes to run end to end. This is pretty slow, so we spend a lot of time thinking about ways to make them faster.

To ensure each test has a clean slate, we clear the database between each test. Normally you clear tables by running TRUNCATE TABLE table1, table2, table3, table4 CASCADE. Unfortunately this takes about 200ms, which is too slow when you want to run it between every test - it would double the runtime of our test suite.

DELETE FROM runs much faster, but if you have records in tableA that reference tableB, DELETE FROM tableB will fail, since records in tableA still point to it. We could draw a dependency graph between our 60 tables and issue the DELETEs in the correct order, but this seems painful to maintain as we add more constraints. Instead, our contractor James Cropcho found a solution: disable the constraints, run the DELETEs, then re-enable the constraints. This ran in about 20ms:

ALTER TABLE tableA DISABLE TRIGGER ALL;
DELETE FROM tableA;
ALTER TABLE tableA ENABLE TRIGGER ALL;
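
To see why the order matters in the first place, here's a minimal sketch with made-up tables (not our real schema): deleting from the referenced table fails while rows still point at it, but with the triggers disabled the DELETEs run in any order. Note that DISABLE TRIGGER ALL needs superuser rights to disable the internally generated constraint triggers, which is fine on a throwaway test database.

-- Hypothetical two-table schema, for illustration only.
CREATE TABLE users (id serial PRIMARY KEY);
CREATE TABLE shipments (
    id      serial PRIMARY KEY,
    user_id integer NOT NULL REFERENCES users (id)
);

INSERT INTO users (id) VALUES (1);
INSERT INTO shipments (user_id) VALUES (1);

-- Fails: rows in shipments still reference users.
DELETE FROM users;

-- With the foreign key triggers disabled, the DELETEs work in any order.
ALTER TABLE users DISABLE TRIGGER ALL;
ALTER TABLE shipments DISABLE TRIGGER ALL;
DELETE FROM users;
DELETE FROM shipments;
ALTER TABLE users ENABLE TRIGGER ALL;
ALTER TABLE shipments ENABLE TRIGGER ALL;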

This worked for a few months. Unfortunately, occasionally a test would time out and fail after 18 seconds - maybe 1 in 100 test runs would fail with a timeout. Normally when the tests fail you get back some kind of error message — a Postgres constraint failure will print a useful message, or a Javascript exception will bubble up, or Javascript will complain about an unhandled rejection.

With this error we didn't get any of those. We also observed it could happen anywhere - it didn't seem to correlate with any individual test or test file.

Where to start looking

For a long time we suspected that our ORM was losing some kind of race and a callback wasn't being hit, deep in the framework code. This type of problem isn't uncommon in Javascript - the most common tool for stubbing the system time causes hangs in many libraries, including Express, Request, async, Bluebird, and Co. These types of issues are incredibly difficult to debug, because when things hang, or fail to hit callbacks, you have virtually no insight into what the stack looks like or which callback failed to get hit. We could have added console.log lines, but you have to have some idea of what you're looking for. Furthermore, annotating every single library and every single query would have caused readability problems elsewhere.

Run the tests forever

Our CI provider lets us SSH into a host, but you have to enable SSH access before the build completes, which isn't practical when 1 in 100 builds fails. My normal approach to debugging intermittent test failures is to run the tests in a continuous loop until they fail, then inspect the test logs or SSH to the host and look for... something, until I figure out what's going on. Unfortunately that wasn't triggering the failure often enough, and even when I did trigger it, our CI provider shuts down SSH access after 30 minutes, and I wasn't fast enough to notice and investigate before access was cut off.

Instead we had to ship whatever logs we needed as artifacts from the test. This required turning on database logging for Postgres, then copying logs to the directory where artifacts get exported. Database logs across multiple containers shared the same filename, which meant only database logs from the first container became available as artifacts, so we also had to add a unique ID to each log file so they'd get exported properly.

This was a lot of steps, and took just enough debugging/setup time that I procrastinated for a while, hoping a better stack trace or error message might reveal itself, or that the problem would go away.

Here are the Postgres settings we enabled:

logging_collector = on
log_statement = 'all'
log_duration = on
# Log if we are waiting on a lock.
deadlock_timeout = 1s
log_lock_waits = on
# Ensure all logs are in one file.
log_rotation_size = 200MB
log_line_prefix = '%m [%p-%c-%l] %e %q%u@%d '
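
If you want to double-check that the settings took effect, something like this against pg_settings works (logging_collector itself only takes effect after a full server restart):

-- Sanity-check that the logging settings are active.
SELECT name, setting
FROM pg_settings
WHERE name IN ('log_statement', 'log_duration', 'log_lock_waits', 'log_line_prefix');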

Finally we found something!

20:38:04.947 LOG: process 16877 still waiting for AccessExclusiveLock on relation 16789 of database 16387 after 1000.113 ms
20:38:04.947 DETAIL: Process holding the lock: 16936. Wait queue: 16877.
20:38:23.141 ERROR: canceling autovacuum task
20:38:23.141 CONTEXT: automatic vacuum of table "circle_test.public.events"
20:38:23.141 LOG: process 16877 acquired AccessExclusiveLock on relation 16789 of database 16387 after 19193.981 ms

It looks like, unbeknownst to us, autovacuum was running in the background and conflicting with the multi-statement ALTER TABLE sequence, in such a way that each was waiting for the other to complete. If you get deep enough into the weeds of Postgres blog posts, you find that stuck VACUUMs can and do happen.

Although it requires a fairly unusual combination of circumstances, a stuck VACUUM can even lead to an undetected deadlock situation, as Tom Lane recently pointed out. If a process that has a cursor open on a table attempts to take an exclusive lock on that table while it's being vacuumed, the exclusive lock request and the vacuum will both block until one of them is manually cancelled.
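
If you can catch the build while it's hung, you can also ask Postgres directly which process holds the conflicting lock. A rough sketch of such a query against the standard pg_locks and pg_stat_activity views (on Postgres 9.2 or later, where the column is named query):

-- Show which queries are waiting on a relation lock and which backend holds it.
SELECT waiting.pid     AS waiting_pid,
       waiting.query   AS waiting_query,
       blocking.pid    AS blocking_pid,
       blocking.query  AS blocking_query
FROM pg_locks blocked_lock
JOIN pg_stat_activity waiting  ON waiting.pid = blocked_lock.pid
JOIN pg_locks blocking_lock
  ON blocking_lock.relation = blocked_lock.relation
 AND blocking_lock.granted
 AND blocking_lock.pid <> blocked_lock.pid
JOIN pg_stat_activity blocking ON blocking.pid = blocking_lock.pid
WHERE NOT blocked_lock.granted;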

Not out of the woods

Unfortunately, we continued to see deadlocks. In some cases, a test would continue to make background database queries after it had reported completion. These would interact poorly with our ALTER TABLE commands and deadlock, leading to a timeout. I posted about this on the Postgres mailing list, and Alexey Bashtanov suggested a different approach: instead of running ALTER TABLE, start a transaction and defer enforcement of foreign key constraints until the COMMIT. That way we could run the DELETEs in any order and they would still work:

BEGIN;
SET CONSTRAINTS ALL DEFERRED;
DELETE FROM tableA;
DELETE FROM tableB;
COMMIT;

This approach also sped up the cleanup query! It now runs in about 10ms, versus 20ms before.
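
One thing to be aware of: SET CONSTRAINTS ALL DEFERRED only affects constraints that are declared DEFERRABLE, so each foreign key has to be converted first. Roughly, with made-up table and constraint names:

-- On Postgres 9.4 and later you can flip an existing constraint in place.
ALTER TABLE shipments
  ALTER CONSTRAINT shipments_user_id_fkey DEFERRABLE INITIALLY IMMEDIATE;

-- On older versions, drop and re-create it as deferrable.
ALTER TABLE shipments
  DROP CONSTRAINT shipments_user_id_fkey,
  ADD CONSTRAINT shipments_user_id_fkey
    FOREIGN KEY (user_id) REFERENCES users (id)
    DEFERRABLE INITIALLY IMMEDIATE;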

Not out of the woods, again

We thought we'd nipped it in the bud, and certainly we were happy with the performance improvement. Unfortunately we've still been seeing timeouts; less frequently, but maybe once or twice a week. They usually happen when INSERTing records into an empty table; Postgres just hangs for 15 seconds, until the statement_timeout kicks in and Postgres kills the query, which fails the build. There are no messages about locks, so we're not sure what's going on. This is pretty uncharacteristic behavior for Postgres, and we're not doing anything particularly crazy with queries. We're open to suggestions; I posted about it again on the Postgres mailing list but did not get any good feedback.
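
At least the statement_timeout turns these hangs into failed builds rather than builds that sit there forever. If you want the same safety net, one way to set it (the 15-second value matches the timeout described above; adjust to taste) is per database:

-- Kill any single statement that runs longer than 15 seconds.
ALTER DATABASE circle_test SET statement_timeout = '15s';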

Conclusion

The amount of work that's gone into tracking this failure down is depressing, though it's surprising that, for once, Javascript wasn't the culprit behind the intermittent test failures. It would be nice if our CI provider had a way to auto-enable SSH on failed builds, enabled Postgres logging by default, or disabled autovacuum by default. I'd also recommend that, if you're adding foreign key constraints, you make them DEFERRABLE INITIALLY IMMEDIATE; this gives you the most flexibility to pursue cleanup strategies like the one we used.
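
For new tables, that means declaring the foreign key deferrable when you create it - roughly (hypothetical names again):

CREATE TABLE shipments (
    id      serial PRIMARY KEY,
    user_id integer NOT NULL
        REFERENCES users (id) DEFERRABLE INITIALLY IMMEDIATE
);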


8 thoughts on “A Two Month Debugging Story”

  1. Chetan Shenoy

    Slightly off-topic but still related: Have you guys thought about moving away from Sails? It seems like you and your team have worked, and continue to work, against the grain. I know it’s not trivial, but after a certain point you guys must be losing so many man hours that could be spent on a better framework.

  2. Jonathan

    “We could draw a dependency graph between our 60 tables and issue the DELETEs in the correct order, but this seems painful to maintain as we add more constraints.”

    If you pulled all the (constrained table, referenced table) pairs for all your foreign keys and piped them to the tsort(1) command line utility, it would figure out a valid ordering for you (assuming you didn’t have any cyclic dependencies).

    I’d still expect the DELETEs with constraints disabled to be much faster, but just thought I’d point out that there is a solution to the problem you stated.
