Posts Tagged With: Code

Cleaning up Parallel Tests in Go 1.7

I have a lot of tests in Go that integrate with Postgres, and test the interactions between Go models and the database.

A lot of these tests can run in parallel. For example, any test that attempts to write a record, but fails with a constraint failure, can run in parallel with all other tests. A test that tries to read a random database ID and expects to not fetch a record can run in parallel with other tests. If you write your tests so they all use random UUID's, or all run inside of transactions, you can run them in parallel. You can use this technique to keep your test suite pretty fast, even if each individual test takes 20-40 milliseconds.

You can mark a test to run in parallel by calling t.Parallel() at the top of the test. Here's an example test from the job queue Rickover:

func TestCreateMissingFields(t *testing.T) {
  t.Parallel()
  test.SetUp(t)
  job := models.Job{
    Name: "email-signup",
  }
  _, err := jobs.Create(job)
  test.AssertError(t, err, "")
  test.AssertEquals(t, err.Error(), "Invalid delivery_strategy: \"\"")
}

This test will run in parallel with other tests marked Parallel and only with other tests marked Parallel; all other tests run sequentially.

The problem comes when you want to clear the database. If you have a t.Parallel() test clean up after it has made its assertions, it might try to clear the database while another Parallel() test is still running! That wouldn't be good at all. Presumably, the sequential tests are expecting the database to be cleared. (They could clear it at the start of the test, but this might lead to unnecessary extra DB writes; it's better for tests that alter the database to clean up after themselves).

(You can also run every test in a transaction, and roll it back at the end. Which is great, and gives you automatic isolation! But you have to pass a *sql.Tx around everywhere, and make two additional roundtrips to the database, which you probably also need to do in your application).

Go 1.7 adds the ability to nest tests. Which means we can run setup once, run every parallel test, then tear down once. Something like this (from the docs):

func TestTeardownParallel(t *testing.T) {
  // This Run will not return until the parallel tests finish.
  t.Run("group", func(t *testing.T) {
    t.Run("Test1", parallelTest1)
    t.Run("Test2", parallelTest2)
    t.Run("Test3", parallelTest3)
  })
  // <tear-down code>
}

Note you have to lowercase the function names for the parallel tests, or they'll run inside of the test block, and then again, individually. I settled on this pattern:

var parallelTests = []func(*testing.T){
  testCreate,
  testCreateEmptyPriority,
  testUniqueFailure,
  testGet,
}
func TestAll(t *testing.T) {
  test.SetUp(t)
  defer test.TearDown(t)
  t.Run("Parallel", func(t *testing.T) {
    for _, parallelTest := range parallelTests {
      test.F(t, parallelTest)
    }
  })
}

The test mentioned there is the set of test helpers from the Let's Encrypt project, plus some of my own. test.F finds the defined function name, capitalizes it, and passes the result to test.Run:

// capitalize the first letter in the string
func capitalize(s string) string {
  r, size := utf8.DecodeRuneInString(s)
  return fmt.Sprintf("%c", unicode.ToTitle(r)) + s[size:]
}
func F(t *testing.T, f func(*testing.T)) {
  longfuncname := runtime.FuncForPC(reflect.ValueOf(f).Pointer()).Name()
  funcnameparts := strings.Split(longfuncname, ".")
  funcname := funcnameparts[len(funcnameparts)-1]
  t.Run(capitalize(funcname), f)
}

The result is a set of parallel tests that run a cleanup action exactly once. The downside is the resulting tests have two levels of nesting; you have to define a second t.Run that waits for the parallel tests to complete.

=== RUN   TestAll
=== RUN   TestAll/Parallel
=== RUN   TestAll/Parallel/TestCreate
=== RUN   TestAll/Parallel/TestCreateEmptyPriority
=== RUN   TestAll/Parallel/TestUniqueFailure
=== RUN   TestAll/Parallel/TestGet
--- PASS: TestAll (0.03s)
    --- PASS: TestAll/Parallel (0.00s)
        --- PASS: TestAll/Parallel/TestCreate (0.01s)
        --- PASS: TestAll/Parallel/TestCreateEmptyPriority (0.01s)
        --- PASS: TestAll/Parallel/TestUniqueFailure (0.01s)
        --- PASS: TestAll/Parallel/TestGet (0.02s)

The other thing that might trip you up: If you add print statements to your tear down lines, they'll appear in the console output before the PASS lines. However, I verified they run after all of your parallel tests are finished running.

Buying stocks without a time limit

Recently I put the maximum amount of cash into my IRA account. Since stock prices jump up and down all the time, I wondered whether the current price would be the best one to buy the stock at. In particular, I'm not withdrawing money from my IRA for the better part of 40 years, so I'm not going to sell it — I want the absolute lowest price possible.

Stock markets offer a type of order that's useful for this - you can put in a limit order, which says essentially "I'll buy N shares at price X", where X is some price below the last traded price of the stock. Anyone can take the other side of that order at any time, which usually happens when the stock price drops below the limit order! So you can put in a limit order at any time and it will execute whenever the stock price drops below that amount. I'm simplifying slightly but that's the gist.

So if the current stock price is X, what price should I put in my limit order for? You should be able to come up with some kind of model of the stock price and how it moves and say, roughly, "If you set your limit order at Z, there will be a ~98% chance of it executing in the next year." The normal criticism of stock price models is they underestimate the probability of extreme events, which is alright for us, since extreme events make it more likely our limit order will trigger.

I figured someone studied this problem before so I asked about it on Quant Stack Exchange and I got a really good answer! The answer pointed to two different papers, which have attempted to model this.

Essentially the model looks at the volatility of the stock - how often it jumps up and down - the stock's current price, the probability you want the order to hit, and the amount of time you want to wait. The formula is a little complex but it looks like this:

I wrote some code to compute these values for the stocks I wanted to buy. The hardest part is figuring out the volatility, because you need access to a lot of stock prices, and you'll want to compute the standard deviation from that. I was able to get historical price data from Quandl, and current stock prices from Yahoo.

Combine the two and you get output that looks like this:

$ go run erf.go main.go --stock=<stock> --total=5500 --percent 0.95
the standard deviation is: 0.006742
annualized: 0.107022
current price: 74.95
Based on this stock's volatility, you should set a limit order for: $74.35.

Compared with buying it at the current price, you'll be able to buy 0.6 extra shares (a value of $43.77)

Pretty neat! I've put the code up on Github, so you can take a look for yourself.

A Two Month Debugging Story

This is the story about an error that caused our tests to fail maybe one out of one hundred builds. Recently we figured out why this happened and deployed code to fix it.

The Problem

I've been fighting slow test times in Javascript and Sails since pretty much the day I started at Shyp. We run about 6000 tests on a popular CI service, and they take about 7 minutes to run end to end. This is pretty slow, so we spend a lot of time thinking about ways to make them faster.

To ensure each test has a clean slate, we clear the database between each test. Normally you clear tables by running TRUNCATE TABLE table1 table2 table3 table4 CASCADE. Unfortunately this takes about 200ms, which is too slow when you want to run it between every test - it would double the runtime of our test suite.

DELETE FROM runs much faster, but if you have records in table A that reference table B, DELETE FROM tableB will fail, since records in A still point to it. We could draw a dependency graph between our 60 tables and issue the DELETEs in the correct order, but this seems painful to maintain as we add more constraints. Instead our contractor James Cropcho found a solution - disable the constraints, run the DELETE, then re-enable the constraints. This ran in about 20ms.

ALTER TABLE tableA DISABLE TRIGGER ALL;
DELETE FROM tableA;
ALTER TABLE tableA ENABLE TRIGGER ALL;

This worked for a few months. Unfortunately, occasionally a test would time out and fail after 18 seconds - maybe 1 in 100 test runs would fail with a timeout. Normally when the tests fail you get back some kind of error message — a Postgres constraint failure will print a useful message, or a Javascript exception will bubble up, or Javascript will complain about an unhandled rejection.

With this error we didn't get any of those. We also observed it could happen anywhere - it didn't seem to correlate with any individual test or test file.

Where to start looking

For a long time we suspected that our ORM was losing some kind of race and a callback wasn't being hit, deep in the framework code. This type of problem isn't uncommon in Javascript - the most common tool for stubbing the system time causes hangs in many libraries, including Express, Request, async, Bluebird, and Co. These types of issues are incredibly difficult to debug, because when things hang, or fail to hit callbacks, you have virtually no insight into what the stack looks like, or what callback failed to get hit. We could add console.log lines but you have to have some idea what we were looking for. Furthermore annotating every single library and every single query would have caused readability problems elsewhere.

Run the tests forever

Our CI provider lets us SSH into a host, but you have to enable SSH access before the build completes, which isn't practical when 1 in 100 builds fails. My normal approach to debugging intermittent test failures is to run the tests in a continuous loop until they fail, and then inspect the test logs/SSH to the host and look for... something, until you figured out what was going on. Unfortunately that wasn't triggering it often enough either, and even when I did trigger it, our CI provider shuts down SSH access after 30 minutes, and I wasn't fast enough to notice/investigate.

Instead we had to ship whatever logs we needed as artifacts from the test. This required turning on database logging for Postgres, then copying logs to the directory where artifacts get exported. Database logs across multiple containers shared the same filename, which meant only database logs from the first container became available as artifacts, so we also had to add a unique ID to each log file so they'd get exported properly.

This was a lot of steps, and took just enough debugging/setup time that I procrastinated for a while, hoping a better stack trace or error message might reveal itself, or that the problem would go away.

Here are the Postgres settings we enabled:

logging_collector = on
log_statement = 'all'
log_duration = on
# Log if we are waiting on a lock.
deadlock_timeout = 1s
log_lock_waits = on
# Ensure all logs are in one file.
log_rotation_size = 200MB
log_line_prefix = '%m [%p-%c-%l] %e %q%u@%d '

Finally we found something!

20:38:04.947 LOG: process 16877 still waiting for AccessExclusiveLock on relation 16789 of database 16387 after 1000.113 ms
20:38:04.947 DETAIL: Process holding the lock: 16936. Wait queue: 16877.
20:38:23.141 LOG: process 16877 acquired AccessExclusiveLock on relation 16789 of database 16387 after 19193.981 ms
20:38:23.141 ERROR: canceling autovacuum task
20:38:23.141 CONTEXT: automatic vacuum of table "circle_test.public.events"
20:38:23.141 LOG: process 16877 acquired AccessExclusiveLock on relation 16789 of database 16387 after 19193.981 ms

It looks like, unbeknownst to us, autovacuum was running in the background, and conflicting with the multi-statement ALTER TABLE query, in a way that each one was waiting for the other to complete. If you get deep enough in the weeds of Postgres blog posts, you find that stuck VACUUMs can and do happen.

Although it requires a fairly unusual combination of circumstances, a stuck VACUUM can even lead to an undetected deadlock situation, as Tom Lane recently pointed out. If a process that has a cursor open on a table attempts to take an exclusive lock on that table while it's being vacuumed, the exclusive lock request and the vacuum will both block until one of them is manually cancelled.

Not out of the woods

Unfortunately, we continued to see deadlocks. In some cases, a test would continue to make background database queries after they had reported completion. These would interact poorly with our ALTER TABLE commands, and deadlock, leading to a timeout. I posted on the Postgresql mailing list about this, and Alexey Bashtanov suggested a different approach. Instead of running ALTER TABLE, start a transaction and defer enforcement of foreign key constraints until the COMMIT. That way we could run the DELETE FROM's in any order and they'd still work.

BEGIN; SET CONSTRAINTS ALL DEFERRED;
DELETE FROM tableA;
DELETE FROM tableB;
COMMIT

This approach also sped up that query! It now runs in about 10ms, versus 20ms before.

Not out of the woods, again

We thought we'd nipped it in the bud, and certainly we were happy with the performance improvement. Unfortunately we've still been seeing timeouts; less frequently, but maybe once or twice a week. They usually happen when INSERTing records into an empty table; Postgres just hangs for 15 seconds, until the statement_timeout kicks in and Postgres kills the query, which fails the build. There are no messages about locks, so we're not sure what's going on. This is pretty uncharacteristic behavior for Postgres, and we're not doing anything particularly crazy with queries. We're open to suggestions; I posted about it again on the Postgres mailing list but did not get any good feedback.

Conclusion

The amount of work that's gone into tracking this failure down is depressing, though it's surprising that Javascript isn't the culprit for intermittent test failures. It would be nice if our CI provider had a way to auto-enable SSH on failed builds, enabled Postgres logging by default, or disabled autovacuum by default. I'd also recommend if you're adding foreign key constraints to make them DEFERRABLE INITIALLY IMMEDIATE; this gives you the most flexibility to pursue strategies like we did.

The Frustration and Loneliness of Server-Side Javascript Development

I was hired in December 2014 as the sixth engineer at Shyp. Shyp runs Node.js on the server. It's been a pretty frustrating journey, and I wanted to share some of the experiences we've had. There are no hot takes about the learning curve, or "why are there so many frameworks" in this post.

  • Initially we were running on Node 0.10.30. Timers would consistently fire 50 milliseconds early - that is, if you called setTimeout with a duration of 200 milliseconds, the timer would fire after 150 milliseconds. I asked about this in the #Node.js Freenode channel, and was told it was my fault for using an "ancient" version of Node (it was 18 months old at that time), and that I should get on a "long term stable" version. This made timeout-based tests difficult to write - every test had to set timeouts longer than 50ms.

  • I wrote a metrics library that published to Librato. I expect a background metric publishing daemon to silently swallow/log errors. One day Librato had an outage and returned a 502 to all requests; unbeknownst to me the underlying librato library we were using was also an EventEmitter, that crashed the process if unhandled. This caused about 30 minutes of downtime in our API.

  • The advice on the Node.js website is to crash the process if there's an uncaught exception. Our application is about 60 models, 30 controllers; it doesn't seem particularly large. It consistently takes 40 seconds to boot our server in production; the majority of this time is spent requiring files. Obviously we try not to crash, and fix crashes when we see them, but we can't take 40 seconds of downtime because someone sent an input we weren't expecting. I asked about this on the Node.js mailing list but got mostly peanuts. Recently the Go core team sped up build times by 2x in a month.

  • We discovered our framework was sleeping for 50ms on POST and PUT requests for no reason. I've previously written about the problems with Sails, so I am going to omit most of that here.

  • We upgraded to Node 4 (after spending a day dealing with a nasty TLS problem) and observed our servers were consuming 20-30% more RAM, with no accompanying speed, CPU utilization, or throughput increase. We were left wondering what benefit we got from upgrading.

  • It took about two months to figure out how to generate a npm shrinkwrap file that produced reliable changes when we tried to update it. Often attempting to modify/update it would change the "from" field in every single dependency.

  • The sinon library appears to be one of the most popular ways to stub the system time. The default method for stubbing (useFakeTimers) leads many other libraries to hang inexplicably. I noticed this after spending 90 minutes wondering why stubbing the system time caused CSV writing to fail. The only ways to debug this are 1) to add console.log statements at deeper and deeper levels, since you can't ever hit ctrl+C and print a stack trace, or 2) take a core dump.

  • The library we used to send messages to Slack crashed our server if/when Slack returned HTML instead of JSON.

  • Our Redis provider changed something - we don't know what, since the version number stayed the same - in their Redis instance, which causes the Redis connection library we use to crash the process with a "Unhandled event" message. We have no idea why this happens, but it's tied to the number of connections we have open - we've had to keep a close eye on the number of concurrent Redis connections, and eventually just phased out Redis altogether.

  • Our database framework doesn't support transactions, so we had to write our own transaction library.

  • The underlying database driver doesn't have a way to time out/cancel connection acquisition, so threads will wait forever for a connection to become available. I wrote a patch for this; it hasn't gotten any closer to merging in 10 weeks, so I published a new NPM package with the patch, and submitted that upstream.

  • I wrote a job queue in Go. I couldn't benchmark it using our existing Node server as the downstream server, the Node server was too slow - a basic Express app would take 50ms on average to process incoming requests and respond with a 202, with 100 concurrent in-flight requests. I had to write a separate downstream server for benchmarking.

  • Last week I noticed our integration servers were frequently running out of memory. Normally they use about 700MB of memory, but we saw they would accumulate 500MB of swap in about 20 seconds. I think the app served about thirty requests total in that time frame. There's nothing unusual about the requests, amount of traffic being sent over HTTP or to the database during this time frame; the amount of data was in kilobytes. We don't have tools like pprof. We can't take a heap dump, because the box is out of memory by that point.

Summary

You could argue - and certainly you could say this about Sails - that we chose bad libraries, and we should have picked better. But part of the problem with the Node ecosystem is it's hard to distinguish good libraries from bad.

I've also listed problems - pound your head, hours or days of frustrating debugging - with at least six different libraries above. At what point do you move beyond "this one library is bad" and start worrying about the bigger picture? Do the libraries have problems because the authors don't know better, or because the language and standard library make it difficult to write good libraries? I don't think it helps.

You could argue that we are a fast growing company and we'd have these problems with any language or framework we chose. Maybe? I really doubt it. The job queue is 6000 lines of Go code written from scratch. There was one really bad problem in about two months of development, and a) the race detector caught it before production, b) I got pointed to the right answer in IRC pretty quickly.

You could also argue we are bad engineers and better engineers wouldn't have these problems. I can't rule this out, but I certainly don't think this is the case.

The Javascript community gets a lot of credit for its work on codes of conduct, on creating awesome conferences, for creating a welcoming, thoughtful community. But the most overwhelming feeling I've gotten over the past year has been loneliness. I constantly feel like I am the only person running into these problems with libraries, these problems in production, or I'm the only person who cares about the consequences of e.g. waiting forever for a connection to become available. Frequently posting on Github or in IRC and getting no response, or getting told "that's the way it is". In the absence of feedback, I am left wondering how everyone else is getting along, or how their users are.

I've learned a lot about debugging Node applications, making builds faster, finding and fixing intermittent test failures and more. I am available to speak or consult at your company or event, about what I have learned. I am also interested in figuring out your solutions to these problems - I will buy you a coffee or a beer and chat about this at any time.

Go Concurrency for Javascript Developers

Often the best way to learn a new language is to implement something you know
with it. Let’s take a look at some common async Javascript patterns, and how
you’d implement them in Go.

Callbacks

You can certainly implement callbacks in Go! Functions are first class
citizens. Here we make a HTTP request and hit the callback with the response
body or an error.

func makeRequest(method string, url string, cb func(error, io.ReadCloser)) {
    req, _ := http.NewRequest(method, url, nil)
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        cb(err, nil)
    } else {
        cb(err, resp.Body)
    }
}
func main() {
    makeRequest("GET", "http://ipinfo.io/json", func(err error, body io.ReadCloser) {
        defer body.Close()
        io.Copy(os.Stdout, body)
    })
}

Go uses callbacks in the standard library in some places – time.AfterFunc,
http.HandlerFunc. Other places callbacks are mostly used to allow you to
specify what function or action you want to take, in interaction with a useful
piece of code – filepath.Walk, strings.LastIndexFunc, etc.

Most of the same problems with callbacks apply to Go as Javascript, namely, you
can forget to call a callback, or you can call it more than once. At least you
can’t try to callback that is undefined in Go, since the compiler will not
let you do this.

But callbacks aren’t idiomatic. Javascript needs callbacks because there’s
only a single thread of execution, processing every request on your server
sequentially. If you "wait"/"block" for something to finish (e.g. by calling
fs.readFileSync or any of the Sync methods, every other request on your
server has to wait until you’re done. I’m simplifying, but you have to have
callbacks in Javascript so everything else (other incoming HTTP requests, for
example) have a chance to run.

Go does not have this same limitation. A Javascript thread can only run on one
CPU; Go can use all of the CPU’s on your computer, so multiple threads can
do work at the same time, even if all of them are making "blocking" function
calls, like HTTP requests, database queries, or reading or writing files.

Furthermore, even if you only had one CPU, multiple Go threads would run
without having IO "block" the scheduler, even though methods like http.Get
have a synchronous interface. How does this work? Deep down in code like
net/fd_unix.go, the socket code calls syscall.Write, which (when you follow
it far enough) calls runtime.entersyscall, which signals to the scheduler
that it won’t have anything to do for a while
, and other work should
proceed.

For that reason you probably don’t want to use callbacks for asynchronous code.

.then / .catch / .finally

This is the other common approach to async operations in Javascript.
Fortunately most API’s in Go are blocking, so you can just do one thing and
then the other thing. Say you want to execute the following async actions:

var changeEmail = function(userId, newEmail) {
    return checkEmailAgainstBlacklist(newEmail).then(function() {
        return getUser(userId);
    }).then(function(user) {
        user.email = newEmail;
        return user.save();
    }).catch(function(err) {
        console.error(err);
        throw err;
    });
}

In Go, you’d just do each action in turn and wait for them to finish:

func changeEmail(userId string, newEmail string) (User, error) {
    err := checkEmailAgainstBlacklist(newEmail)
    if err != nil {
        return nil, err
    }
    user, err := getUser(userId)
    if err != nil {
        return nil, err
    }
    user.Email = newEmail
    return user.Save()
}

.finally is usually implemented with the defer keyword; you can at any time
defer code to run at the point the function terminates.

resp, err := http.Get("http://ipinfo.io/json")
// Note Close() will panic if err is non-nil, you still have to check err
checkError(err)
defer resp.Body.Close()
io.Copy(os.Stdout, resp.Body)

Promise.try

Just ignore the error response, or handle it; there’s no real difference
between sync and async errors, besides calling panic, which your code
shouldn’t be doing.

Promise.join

Doing multiple things at once in Javascript is pretty easy with Promises.
Here we dispatch two database queries at once, and handle the results in one
function.

var checkUserOwnsPickup = function(userId, pickupId) {
    return Promise.join(
        getUser(userId),
        getPickup(pickupId)
    ).spread(function(user, pickup) {
        return pickup.userId === user.id;
    })
}

Unfortunately doing this in Go is a little more complicated. Let’s start with
a slightly easier case – doing something where we don’t care about the result,
whether it errors or not. You can put each in a goroutine and they’ll execute
in the background.

func subscribeUser(user User) {
    go func() {
        sendWelcomeEmail(user.Email)
    }()
    go func() {
        metrics.Increment('user.signup')
    }()
}

The easiest way to pass the results back is to declare the variables in the
outer scope and assign them in a goroutine.

func main() {
    var user models.User
    var userErr error
    var pickup models.Pickup
    var pickupErr error
    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        user, userErr = getUser(userId)
        wg.Done()
    }()
    go func() {
        pickup, pickupErr = getPickup(pickupId)
        wg.Done()
    }()
    wg.Wait()
    // Error checking omitted
    fmt.Println(user.Id)
    fmt.Println(pickup.Id)
}

More verbose than Promise.join, but the lack of generics primitives makes a
shorter solution tough. You can gain more control, and early terminate by using
channels, but it’s more verbose to do so.

Promise.map

Do the same as the Go version of Promise.join above, and then write a for
loop to loop over the resolved values. This might be a little trickier because
you need to specify the types of everything, and you can’t create arrays with
differing types. If you want a given concurrency value, there are some neat
code examples in this post on Pipelines.

Checking for races

One of the strengths of Go is the race detector, which detects whether
two threads can attempt to read/write the same variable in different orders
without synchronizing first. It’s slightly slower than normal, but I highly
encourage you to run your test suite with the race detector on. Here’s our
make target:

test:
    go test -race ./... -timeout 2s

Once it found a defect in our code that turned out to be a standard library
error
! I highly recommend enabling the race detector.

The equivalent in Javascript would be code that varies the event loop each
callback runs in, and checks that you still get the same results.

Thanks

I hope this helps! I wrote this to be the guide I wish I had when I was getting
started figuring out how to program concurrently in Go.

Thanks to Kyle Conroy, Alan Shreve, Chris
Goddard
, @rf, and members of the #go-nuts IRC channel for
answering my silly questions.

Encoding and Decoding JSON, with Go’s net/http package

This is a pretty common task: encode JSON and send it to a server, decode JSON on the server, and vice versa. Amazingly, the existing resources on how to do this aren't very clear. So let's walk through each case, for the following simple User object:

type User struct{
    Id      string
    Balance uint64
}

Sending JSON in the body of a POST/PUT request

Let's start with the trickiest one: the body of a Go's http.Request is an io.Reader, which doesn't fit well if you have a struct - you need to write the struct first and then copy that to a reader.

func main() {
    u := User{Id: "US123", Balance: 8}
    b := new(bytes.Buffer)
    json.NewEncoder(b).Encode(u)
    res, _ := http.Post("https://httpbin.org/post", "application/json; charset=utf-8", b)
    io.Copy(os.Stdout, res.Body)
}

Decoding JSON on the server

Let's say you're expecting the client to send JSON data to the server. Easy, decode it with json.NewDecoder(r.Body).Decode(&u). Here's what that looks like with error handling:

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        var u User
        if r.Body == nil {
            http.Error(w, "Please send a request body", 400)
            return
        }
        err := json.NewDecoder(r.Body).Decode(&u)
        if err != nil {
            http.Error(w, err.Error(), 400)
            return
        }
        fmt.Println(u.Id)
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Encoding JSON in a server response

Just the opposite of the above - call json.NewEncoder(w).Encode(&u) to write JSON to the server.

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        u := User{Id: "US123", Balance: 8}
        json.NewEncoder(w).Encode(u)
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Reading a JSON response from the server.

This time you're going to read the response body in the client, after making the request.

func main() {
    u := User{Id: "US123", Balance: 8}
    b := new(bytes.Buffer)
    json.NewEncoder(b).Encode(u)
    res, _ := http.Post("https://httpbin.org/post", "application/json; charset=utf-8", b)
    var body struct {
        // httpbin.org sends back key/value pairs, no map[string][]string
        Headers map[string]string `json:"headers"`
        Origin  string            `json:"origin"`
    }
    json.NewDecoder(res.Body).Decode(&body)
    fmt.Println(body)
}

That's it! I hope it helps you a lot. Note that we only had to encode/decode the response to a byte array one time - in every other case we passed it directly to the reader/writer, which is one of the really nice things about Go, and interfaces - in most cases it's really easy to pass around streams, instead of having to deal with the intermediate steps.

How to push to multiple Github accounts from the same machine

You might want/need to check out repos locally as several different Github users. For example, my Github account is kevinburke, but I push code at work as kevinburkeshyp. If I could only commit/push as one of those users on each machine, or I had to manually specify the SSH key to use, it would be a big hassle!

Generate some SSH keys

First generate SSH keys for each account, and then upload the public keys to the right Github accounts.

ssh-keygen -t rsa -b 4096 -f "$HOME/.ssh/kevinburke_rsa"
ssh-keygen -t rsa -b 4096 -f "$HOME/.ssh/kevinburkeshyp_rsa"

Set up SSH host aliasing

You're going to use your standard SSH config for the host you use more often, and an alias for the one you use less often. Here's my SSH configuration for the host I use more often - on my work laptop, it's my work account.

Host github.com
    IdentityFile ~/.ssh/kevinburkeshyp_rsa
    HostName github.com
    User kevinburkeshyp

Now, set up an alias for your second account.

Host github.com-kevinburke
    IdentityFile ~/.ssh/kevinburke_rsa
    HostName github.com
    User kevinburke

The Host is set to github.com-kevinburke - this is what you write into your SSH config locally. The HostName is what's used for DNS resolution, which is how this still works. Note you'll want to switch the IdentityFile and the User with the key and usernames you set when you generated the keys.

Now if you want to clone something into the second account, use the special host alias:

git clone [email protected]:kevinburke/hamms.git

That's it! It should just work.

Which Commit Name/Email Address?

Most people configure their commit email address using the following command:

$ git config --global user.email [email protected]

This sets your email address in $HOME/.gitconfig and applies that value to every repository/commit on your system. Obviously this is a problem if you have two different email addresses! I want to use [email protected] with my Shyp account and [email protected] with my personal account.

I fix this by putting an empty email in the global $HOME/.gitconfig:

[user]
    name = Kevin Burke
    email = ""

When I try a commit in a new repository, Git shows the following message:

*** Please tell me who you are.

Run

  git config --global user.email "[email protected]"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.

fatal: unable to auto-detect email address (got [email protected](none)')

Which is good! It prevents me from committing until I've configured my email address. I then have helpers set up to easily configure this.

alias setinburkeemail="git config user.email [email protected]'"

That's it! What tips do you use for managing two different Github accounts?

Just Return One Error

Recently there's been a trend in API's to return more than one error, or to always return an array of errors, to the client, when there's a 400 or a 500 server error. From the JSON API specification:

Error objects MUST be returned as an array keyed by errors in the top level of a JSON API document.

Here's an example.

HTTP/1.1 422 Unprocessable Entity
Content-Type: application/vnd.api+json

{
  "errors": [
    {
      "status": "422",
      "source": { "pointer": "/data/attributes/first-name" },
      "title":  "Invalid Attribute",
      "detail": "First name must contain at least three characters."
    }
  ]
}

I've also seen it in model validation; for example, Waterline returns an array of errors in the invalidAttributes field of the returned error object.

I think this is overkill, and adds significant complexity to your application for little benefit. Let's look at why.

  • Clients are rarely prepared to show multiple errors. At Shyp, we returned an errors key from the API on a 400 or 500 error for about 18 months. All four of the API's clients - the iOS app, the Android app, our error logger, and our internal dashboard - simply processed this by parsing and displaying errors[0], and ignoring the rest. It makes sense from the client's perspective - the errors were being shown on a phone (not much room) or an alert box in the browser (show two, and you get the "Prevent this page from displaying more alerts" dialog). But if clients can't make use of multiple errors, it makes me wonder why we return them from the API.

  • You rarely catch every problem with a request. Say a pickup is 1) missing the to address, 2) missing the city, 3) shipping to a country we don't ship to, and 4) contains a photo of a firearm, which we can't ship. If we return an error, it's going to be ['Missing City', 'Missing Address Line 1']. There's no way we're going to submit your invalid address to our rates service and discover even more problems with the customs form, or try to parse the photo and learn that you're trying to ship a firearm. So even if you fix the first errors, we're still going to return you a 400 from the API.

  • Continuing to process the request is dangerous. Attaching multiple errors to an API response implies that you discovered a problem with an incoming request and continued processing it. This is dangerous! Every part of your system needs to be prepared to handle invalid inputs. This might be a useful defensive measure, but it's also a lot of work for every function in your system to avoid making assumptions about the input that's passed to it. In my experience, passing around bad inputs leads to unnecessary server errors and crashes.

  • Programming languages and libraries return one error. Generally if you call raise or throw, you're limited to throwing a single Error object. If you're using a third party library that can error, usually library functions are limited to returning one error. Returning a single error from your API or your model validation function aligns with 99% of error handling paradigms that currently exist.

A Caveat

There's one place where returning multiple errors may make sense - a signup form with multiple form fields, where it can be frustrating to only return one problem at a time. There are two ways to deal with this:

  • Perform validation on the client, and detect/display multiple errors there. It's a better user experience to do this, since feedback is more or less instantaneous. You'll need the validation in your API anyway, but if someone manages to bypass the client checks, it's probably fine to just return one error at a time.

  • Special case your returned error with a key like attributes, and return the necessary error messages, just for this case. I'd rather special case this error message, than force every other error to be an array just to handle this one place in the API.

Safely Moving a Large Shrinkwrapped Dependency

This past week the team at Shyp decided to fork the framework we use (Sails.js) and the ORM it uses (Waterline). It was an interesting exercise in JS tooling, and I wasted a lot of time on busywork, so I thought I'd share a little bit about the process.

Why Fork?

We've been on an outdated fork of Sails/Waterline for a while. We chose not to update because we were worried that updates would introduce new errors, and many of the updates were new features that we didn't want or need.

The recent ugly discussions between the former CEO of Balderdash and other core contributors to Sails/Waterline made us rethink our dependency on the core tools. We concluded upgrading our framework was going to always be unlikely. We also figured we could remove the many parts of the framework that we don't want or need (sockets, cookies, blueprints, the horribly insecure "action routes", CSRF, Mongo support, Grunt, sails generate, sails www, the multi-query methods that aren't safe against race conditions, &c), and apply bugfixes to parts of the code that we knew were broken. Fortunately the changes already started to pay for themselves at the end of an 8 hour day, as I'll explain later.

Rollout Plan

On its face, the rollout plan was pretty simple.

  1. Fork Sails to Shyp/sails.

  2. Update the HEAD of the master branch to point at the version we have installed in production.

  3. Point the Sails dependency at our fork (no code changes).

  4. Make changes that we want.

  5. Update the dependency.

Breakdowns

This broke down somewhere around (3), for a few reasons.

  • If I tried to npm install github.com/Shyp/sails.git, every single package Sails depended on, and all of their sub-dependencies, would update. We didn't want this, for several reasons - I don't trust package maintainers to stay on top of updates, or predict when changes/updates will break code, and package updates amount to a large amount of unvetted code being deployed into your API without any code review.

    I didn't know how to work around this so I punted on it by hard-coding the update in the npm shrinkwrap file.

  • My next attempt was to try to generate a shrinkwrap file for Sails, and give our Sails fork the same versions of every dependency & sub-dependency we had in our API. This was the right idea, but failed in its implementation. Sails dependencies are currently alphabetized, but they weren't in the version that we were running. I alphabetized the package.json list while generating a shrinkwrap file, which changed the dependency order in the Shyp/sails shrinkwrap file. This made it impossible to diff against the shrinkwrap file in our API, which listed Sails's dependencies in an arbitrary order.

  • Finally I figured out what needed to happen. First, alphabetize the Sails dependency list in the Shyp API repository. Once that's done, I could use that shrinkwrap file as the basis for the shrinkwrap file in the Sails fork, so the dependencies in the Sails project matched up with the Shyp dependencies. Specifically, I accomplished this by sorting the dependencies in the package.json file, placing the sorted file in api/node_modules/sails/package.json, and then re-running clingwrap with the same dependencies installed. I then verified the same version of each dependency was installed by sorting the JSON blob in the old and new shrinkwrap files, and ensuring that the diff was empty.

    Once that was done, I could update dependencies in the (alphabetized) fork, and the diff in Shyp API's (now alphabetized) npm-shrinkwrap.json would be small. It was great to finally be able to generate a small diff in npm-shrinkwrap.json, and verify that it was correct.

Compounding the frustration was that each build took about 2 minutes, to wipe and reinstall all versions of every dependency. It was a slow afternoon that got more frustrating as it wore on.

Notes

  • Before adding or attempting to move any large NPM dependency, ensure that the dependencies are in alphabetical order. You'll run into too many headaches if this is not the case, and you won't be able to regenerate the npm-shrinkwrap file or be able to reason about it.

  • It's surprising that the builtin build tools don't enforce alphabetical order.

  • NPM's shrinkwrap tool does not generate the same file every time you run it. Specifically, if you install a package, occasionally the shrinkwrap file will report the "from" string is a range, for example, "from": "colors@>=0.6.0-1 <0.7.0", and sometimes it will report the "from" string as a npmjs.com URL. For a while, we worked around this by generating the NPM shrinkwrap file twice - the second time we ran it, it would generate a consistent set of strings.

      rm -rf node_modules
      npm cache clear
      npm install --production
      npm shrinkwrap
      npm install --production
      npm shrinkwrap
    

    Eventually we switched to using clingwrap, which solves this problem by avoiding the "from" and "resolved" fields entirely. We've also heard NPM 3 contains updates to the shrinkwrap file, but NPM 3 also believes our dependency tree contains errors, so we've been unable to effectively test it. Clingwrap also has some errors with NPM 3 and github repositories.

  • It was incredibly tempting to just say "fuck it" and deploy the updated version of every dependency and subdependency. Based on the amount of work it took me to figure out how to generate and check in consistent diffs, I came away thinking that there are a lot of Javascript applications out there that are unable to control/stabilize the versions of libraries that they use.

  • These are all problems you can avoid or mitigate by thinking incredibly hard before taking on a dependency, especially if, like Sails, the tool you're leaning on has a lot of its own dependencies. Is there really an advantage to using _.first(arr) instead of arr[0]? Do you really need the pluralize library, or is it sufficient to just append an 's' for your use case? Do you need to take a dependency to build a state machine, or can you use a dictionary and an UPDATE query?

Benefits

This took a lot of time to get right, and felt like busywork for at least three of the four hours I spent working on this, but we finally got some benefits out of this by the end of the day.

  • I committed a change to immediately error if anyone attempts to use the count() method, which attempts to pull the entire table into memory. Deploying this in our API application revealed four places where we were calling count()! Fortunately, they were all in test code. Still, it's delightful to be able to automatically enforce a rule we previously had to rely on code review (and shared team memory) to catch.

  • We removed the Grunt dependency, which we never used. There are about 4500 lines of deletions waiting in pull requests.

  • We understand shrinkwrap a little better, and hopefully this process won't be so painful next time.

  • We removed some incorrect Mongo-specific code from Waterline. This is the type of adaptation that we can only do by giving up compatibility, and we're happy with how easy this is now.

So that was my Wednesday last week. As it stands, we're planning on developing this fork in the open. I don't think you should use it - there are still better alternatives - but if you're stuck on Waterline with Postgres, or using Sails as a REST API, you might want to take a look.

Everything you need to know about state machines

Do you have objects in your system that can be in different states (accounts, invoices, messages, employees)? Do you have code that updates these objects from one state to another? If so, you probably want a state machine.

What is a state machine?

To Everything, There is a Season, by Shawn Clover. CC BY-NC 2.0

At its root, a state machine defines the legal transitions between states in your system, is responsible for transitioning objects between states, and prevents illegal transitions.

You sound like an architecture astronaut, why do I need this?

Let's talk about some bad things that can happen if you don't have a state machine in place.

(Some of these actually happened! Some are invented.)

  • A user submits a pickup. We pick up the item and ship it out. Two weeks later, a defect causes the app to resubmit the same pickup, and reassign a driver, for an item that's already been shipped.

  • Two users submit a pickup within a second of each other. Our routing algorithm fetches available drivers, computes each driver's distance to the pickup, and says the same driver is available for both pickups. We assign the same driver to both pickups.

  • A user submits a pickup. A defect in a proxy causes the submit request to be sent multiple times. We end up assigning four drivers to the pickup, and sending the user four text messages that their pickup's been assigned.

  • An item is misplaced at the warehouse and sent straight to the packing station. Crucial steps in the shipping flow get skipped.

  • Code for updating the state of an object is littered between several different classes and controllers, which handle the object in different parts of its lifecycle. It becomes difficult to figure out how the object moves between various states. Customer support tells you that an item is in a particular state that makes it impossible for them to do their jobs. It takes great effort to figure out how it got there.

These are all really bad positions to be in! A lot of pain for you and a lot of pain for your teams in the field.

You are already managing state (poorly)

Do you have code in your system that looks like this?

def submit(pickup_id):
    pickup = Pickups.find_by_id(pickup_id)
    if pickup.state != 'DRAFT':
        throw new StateError("Can't submit a pickup that's already been submitted")
    pickup.state = 'SUBMITTED'
    pickup.save()
    MessageService.send_message(pickup.user.phone_number, 'Your driver is on the way!')

By checking the state of the pickup before moving to the next state, you're managing the state of your system! You are (at least partially) defining what transitions are allowed between states, and what transitions aren't. To avoid the issues listed above, you'll want to consolidate the state management in one place in your codebase.

Okay, how should I implement the state machine?

There are a lot of libraries that promise to manage this for you. You don't need any of them (too much complexity), and you don't need a DSL. You just need a dictionary and a single database query.

The dictionary is for defining transitions, allowable input states, and the output state. Here's a simplified version of the state machine we use for Pickups.

states = {
    submit: {
        before: ['DRAFT'],
        after:   'SUBMITTED',
    },
    assign: {
        before: ['SUBMITTED'],
        after:   'ASSIGNED',
    },
    cancel: {
        before: ['DRAFT', 'SUBMITTED', 'ASSIGNED'],
        after:   'CANCELED',
    },
    collect: {
        before: ['ASSIGNED'],
        after:   'COLLECTED',
    },
}

Then you need a single function, transition, that takes an object ID, the name of a transition, and (optionally) additional fields you'd like to set on the object if you update its state.

The transition function looks up the transition in the states dictionary, and generates a single SQL query:

UPDATE table SET
    state = 'newstate',
    extraField1 = 'extraValue1'
WHERE
    id = $1 AND
    state IN ('oldstate1', 'oldstate2')
RETURNING *

If your UPDATE query returns a row, you successfully transitioned the item! Return the item from the function. If it returns zero rows, you failed to transition the item. It can be tricky to determine why this happened, since you don't know which (invalid) state the item was in that caused it to not match. We cheat and fetch the record to give a better error message - there's a race there, but we note the race in the error message, and it gives us a good idea of what happened in ~98% of cases.

Note what you don't want to do - you don't want to update the object in memory and then call .save() on it. Not only is .save() dangerous, but fetching the item before you attempt to UPDATE it means you'll be vulnerable to race conditions between two threads attempting to transition the same item to two different states (or, twice to the same state). Ask, don't tell - just try the transition and then handle success or failure appropriately.

Say you send a text message to a user after they submit their pickup - if two threads can successfully call the submit transition, the user will get 2 text messages. The UPDATE query above ensures that exactly one thread will succeed at transitioning the item, which means you can (and want to) pile on whatever only-once actions you like (sending messages, charging customers, assigning drivers, &c) after a successful transition and ensure they'll run once. For more about consistency, see Weird Tricks to Write Faster, More Correct Database Queries.

Being able to issue queries like this is one of the benefits of using a relational database with strong consistency guarantees. Your mileage (and the consistency of your data) may vary when attempting to implement a state transition like this using a new NoSQL database. Note that with the latest version of MongoDB, it's possible to read stale data, meaning that (as far as I can tell) the WHERE clause might read out-of-date data, and you can apply an inconsistent state transition.

Final Warnings

A state machine puts you in a much better position with respect to the consistency of your data, and makes it easy to guarantee that actions performed after a state transition (invoicing, sending messages, expensive operations) will be performed exactly once for each legal transition, and will be rejected for illegal transitions. I can't stress enough how often this has saved our bacon.

You'll still need to be wary of code that makes decisions based on other properties of the object. For example, you might set a driver_id on the pickup when you assign it. If other code (or clients) decide to make a decision based on the presence or absence of the driver_id field, you're making a decision based on the state of the object, but outside of the state machine framework, and you're vulnerable to all of the bullet points mentioned above. You'll need to be vigilant about these, and ensure all calling code is making decisions based on the state property of the object, not any auxiliary properties.

You'll also need to be wary of code that tries a read/check-state/write pattern; it's vulnerable to the races mentioned above. Always always just try the state transition and handle failure if it fails.

Finally, some people might try to sneak in code that just updates the object state, outside of the state machine. Be wary of this in code reviews and try to force all state updates to happen in the StateMachine class.