Posts Tagged With: Code

The Frustration and Loneliness of Server-Side Javascript Development

I was hired in December 2014 as the sixth engineer at Shyp. Shyp runs Node.js on the server. It's been a pretty frustrating journey, and I wanted to share some of the experiences we've had. There are no hot takes about the learning curve, or "why are there so many frameworks" in this post.

  • Initially we were running on Node 0.10.30. Timers would consistently fire 50 milliseconds early - that is, if you called setTimeout with a duration of 200 milliseconds, the timer would fire after 150 milliseconds. I asked about this in the #Node.js Freenode channel, and was told it was my fault for using an "ancient" version of Node (it was 18 months old at that time), and that I should get on a "long term stable" version. This made timeout-based tests difficult to write - every test had to set timeouts longer than 50ms.

  • I wrote a metrics library that published to Librato. I expect a background metric publishing daemon to silently swallow/log errors. One day Librato had an outage and returned a 502 to all requests; unbeknownst to me the underlying librato library we were using was also an EventEmitter, that crashed the process if unhandled. This caused about 30 minutes of downtime in our API.

  • The advice on the Node.js website is to crash the process if there's an uncaught exception. Our application is about 60 models, 30 controllers; it doesn't seem particularly large. It consistently takes 40 seconds to boot our server in production; the majority of this time is spent requiring files. Obviously we try not to crash, and fix crashes when we see them, but we can't take 40 seconds of downtime because someone sent an input we weren't expecting. I asked about this on the Node.js mailing list but got mostly peanuts. Recently the Go core team sped up build times by 2x in a month.

  • We discovered our framework was sleeping for 50ms on POST and PUT requests for no reason. I've previously written about the problems with Sails, so I am going to omit most of that here.

  • We upgraded to Node 4 (after spending a day dealing with a nasty TLS problem) and observed our servers were consuming 20-30% more RAM, with no accompanying speed, CPU utilization, or throughput increase. We were left wondering what benefit we got from upgrading.

  • It took about two months to figure out how to generate a npm shrinkwrap file that produced reliable changes when we tried to update it. Often attempting to modify/update it would change the "from" field in every single dependency.

  • The sinon library appears to be one of the most popular ways to stub the system time. The default method for stubbing (useFakeTimers) leads many other libraries to hang inexplicably. I noticed this after spending 90 minutes wondering why stubbing the system time caused CSV writing to fail. The only ways to debug this are 1) to add console.log statements at deeper and deeper levels, since you can't ever hit ctrl+C and print a stack trace, or 2) take a core dump.

  • The library we used to send messages to Slack crashed our server if/when Slack returned HTML instead of JSON.

  • Our Redis provider changed something - we don't know what, since the version number stayed the same - in their Redis instance, which causes the Redis connection library we use to crash the process with a "Unhandled event" message. We have no idea why this happens, but it's tied to the number of connections we have open - we've had to keep a close eye on the number of concurrent Redis connections, and eventually just phased out Redis altogether.

  • Our database framework doesn't support transactions, so we had to write our own transaction library.

  • The underlying database driver doesn't have a way to time out/cancel connection acquisition, so threads will wait forever for a connection to become available. I wrote a patch for this; it hasn't gotten any closer to merging in 10 weeks, so I published a new NPM package with the patch, and submitted that upstream.

  • I wrote a job queue in Go. I couldn't benchmark it using our existing Node server as the downstream server, the Node server was too slow - a basic Express app would take 50ms on average to process incoming requests and respond with a 202, with 100 concurrent in-flight requests. I had to write a separate downstream server for benchmarking.

  • Last week I noticed our integration servers were frequently running out of memory. Normally they use about 700MB of memory, but we saw they would accumulate 500MB of swap in about 20 seconds. I think the app served about thirty requests total in that time frame. There's nothing unusual about the requests, amount of traffic being sent over HTTP or to the database during this time frame; the amount of data was in kilobytes. We don't have tools like pprof. We can't take a heap dump, because the box is out of memory by that point.


You could argue - and certainly you could say this about Sails - that we chose bad libraries, and we should have picked better. But part of the problem with the Node ecosystem is it's hard to distinguish good libraries from bad.

I've also listed problems - pound your head, hours or days of frustrating debugging - with at least six different libraries above. At what point do you move beyond "this one library is bad" and start worrying about the bigger picture? Do the libraries have problems because the authors don't know better, or because the language and standard library make it difficult to write good libraries? I don't think it helps.

You could argue that we are a fast growing company and we'd have these problems with any language or framework we chose. Maybe? I really doubt it. The job queue is 6000 lines of Go code written from scratch. There was one really bad problem in about two months of development, and a) the race detector caught it before production, b) I got pointed to the right answer in IRC pretty quickly.

You could also argue we are bad engineers and better engineers wouldn't have these problems. I can't rule this out, but I certainly don't think this is the case.

The Javascript community gets a lot of credit for its work on codes of conduct, on creating awesome conferences, for creating a welcoming, thoughtful community. But the most overwhelming feeling I've gotten over the past year has been loneliness. I constantly feel like I am the only person running into these problems with libraries, these problems in production, or I'm the only person who cares about the consequences of e.g. waiting forever for a connection to become available. Frequently posting on Github or in IRC and getting no response, or getting told "that's the way it is". In the absence of feedback, I am left wondering how everyone else is getting along, or how their users are.

I've learned a lot about debugging Node applications, making builds faster, finding and fixing intermittent test failures and more. I am available to speak or consult at your company or event, about what I have learned. I am also interested in figuring out your solutions to these problems - I will buy you a coffee or a beer and chat about this at any time.

Go Concurrency for Javascript Developers

Often the best way to learn a new language is to implement something you know
with it. Let’s take a look at some common async Javascript patterns, and how
you’d implement them in Go.


You can certainly implement callbacks in Go! Functions are first class
citizens. Here we make a HTTP request and hit the callback with the response
body or an error.

func makeRequest(method string, url string, cb func(error, io.ReadCloser)) {
    req, _ := http.NewRequest(method, url, nil)
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        cb(err, nil)
    } else {
        cb(err, resp.Body)
func main() {
    makeRequest("GET", "", func(err error, body io.ReadCloser) {
        defer body.Close()
        io.Copy(os.Stdout, body)

Go uses callbacks in the standard library in some places – time.AfterFunc,
http.HandlerFunc. Other places callbacks are mostly used to allow you to
specify what function or action you want to take, in interaction with a useful
piece of code – filepath.Walk, strings.LastIndexFunc, etc.

Most of the same problems with callbacks apply to Go as Javascript, namely, you
can forget to call a callback, or you can call it more than once. At least you
can’t try to callback that is undefined in Go, since the compiler will not
let you do this.

But callbacks aren’t idiomatic. Javascript needs callbacks because there’s
only a single thread of execution, processing every request on your server
sequentially. If you "wait"/"block" for something to finish (e.g. by calling
fs.readFileSync or any of the Sync methods, every other request on your
server has to wait until you’re done. I’m simplifying, but you have to have
callbacks in Javascript so everything else (other incoming HTTP requests, for
example) have a chance to run.

Go does not have this same limitation. A Javascript thread can only run on one
CPU; Go can use all of the CPU’s on your computer, so multiple threads can
do work at the same time, even if all of them are making "blocking" function
calls, like HTTP requests, database queries, or reading or writing files.

Furthermore, even if you only had one CPU, multiple Go threads would run
without having IO "block" the scheduler, even though methods like http.Get
have a synchronous interface. How does this work? Deep down in code like
net/fd_unix.go, the socket code calls syscall.Write, which (when you follow
it far enough) calls runtime.entersyscall, which signals to the scheduler
that it won’t have anything to do for a while
, and other work should

For that reason you probably don’t want to use callbacks for asynchronous code.

.then / .catch / .finally

This is the other common approach to async operations in Javascript.
Fortunately most API’s in Go are blocking, so you can just do one thing and
then the other thing. Say you want to execute the following async actions:

var changeEmail = function(userId, newEmail) {
    return checkEmailAgainstBlacklist(newEmail).then(function() {
        return getUser(userId);
    }).then(function(user) { = newEmail;
    }).catch(function(err) {
        throw err;

In Go, you’d just do each action in turn and wait for them to finish:

func changeEmail(userId string, newEmail string) (User, error) {
    err := checkEmailAgainstBlacklist(newEmail)
    if err != nil {
        return nil, err
    user, err := getUser(userId)
    if err != nil {
        return nil, err
    user.Email = newEmail
    return user.Save()

.finally is usually implemented with the defer keyword; you can at any time
defer code to run at the point the function terminates.

resp, err := http.Get("")
// Note Close() will panic if err is non-nil, you still have to check err
defer resp.Body.Close()
io.Copy(os.Stdout, resp.Body)


Just ignore the error response, or handle it; there’s no real difference
between sync and async errors, besides calling panic, which your code
shouldn’t be doing.


Doing multiple things at once in Javascript is pretty easy with Promises.
Here we dispatch two database queries at once, and handle the results in one

var checkUserOwnsPickup = function(userId, pickupId) {
    return Promise.join(
    ).spread(function(user, pickup) {
        return pickup.userId ===;

Unfortunately doing this in Go is a little more complicated. Let’s start with
a slightly easier case – doing something where we don’t care about the result,
whether it errors or not. You can put each in a goroutine and they’ll execute
in the background.

func subscribeUser(user User) {
    go func() {
    go func() {

The easiest way to pass the results back is to declare the variables in the
outer scope and assign them in a goroutine.

func main() {
    var user models.User
    var userErr error
    var pickup models.Pickup
    var pickupErr error
    var wg sync.WaitGroup
    go func() {
        user, userErr = getUser(userId)
    go func() {
        pickup, pickupErr = getPickup(pickupId)
    // Error checking omitted

More verbose than Promise.join, but the lack of generics primitives makes a
shorter solution tough. You can gain more control, and early terminate by using
channels, but it’s more verbose to do so.

Do the same as the Go version of Promise.join above, and then write a for
loop to loop over the resolved values. This might be a little trickier because
you need to specify the types of everything, and you can’t create arrays with
differing types. If you want a given concurrency value, there are some neat
code examples in this post on Pipelines.

Checking for races

One of the strengths of Go is the race detector, which detects whether
two threads can attempt to read/write the same variable in different orders
without synchronizing first. It’s slightly slower than normal, but I highly
encourage you to run your test suite with the race detector on. Here’s our
make target:

    go test -race ./... -timeout 2s

Once it found a defect in our code that turned out to be a standard library
! I highly recommend enabling the race detector.

The equivalent in Javascript would be code that varies the event loop each
callback runs in, and checks that you still get the same results.


I hope this helps! I wrote this to be the guide I wish I had when I was getting
started figuring out how to program concurrently in Go.

Thanks to Kyle Conroy, Alan Shreve, Chris
, @rf, and members of the #go-nuts IRC channel for
answering my silly questions.

Encoding and Decoding JSON, with Go’s net/http package

This is a pretty common task: encode JSON and send it to a server, decode JSON on the server, and vice versa. Amazingly, the existing resources on how to do this aren't very clear. So let's walk through each case, for the following simple User object:

type User struct{
    Id      string
    Balance uint64

Sending JSON in the body of a POST/PUT request

Let's start with the trickiest one: the body of a Go's http.Request is an io.Reader, which doesn't fit well if you have a struct - you need to write the struct first and then copy that to a reader.

func main() {
    u := User{Id: "US123", Balance: 8}
    b := new(bytes.Buffer)
    res, _ := http.Post("", "application/json; charset=utf-8", b)
    io.Copy(os.Stdout, res.Body)

Decoding JSON on the server

Let's say you're expecting the client to send JSON data to the server. Easy, decode it with json.NewDecoder(r.Body).Decode(&u). Here's what that looks like with error handling:

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        var u User
        if r.Body == nil {
            http.Error(w, "Please send a request body", 400)
        err := json.NewDecoder(r.Body).Decode(&u)
        if err != nil {
            http.Error(w, err.Error(), 400)
    log.Fatal(http.ListenAndServe(":8080", nil))

Encoding JSON in a server response

Just the opposite of the above - call json.NewEncoder(w).Encode(&u) to write JSON to the server.

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        u := User{Id: "US123", Balance: 8}
    log.Fatal(http.ListenAndServe(":8080", nil))

Reading a JSON response from the server.

This time you're going to read the response body in the client, after making the request.

func main() {
    u := User{Id: "US123", Balance: 8}
    b := new(bytes.Buffer)
    res, _ := http.Post("", "application/json; charset=utf-8", b)
    var body struct {
        // sends back key/value pairs, no map[string][]string
        Headers map[string]string `json:"headers"`
        Origin  string            `json:"origin"`

That's it! I hope it helps you a lot. Note that we only had to encode/decode the response to a byte array one time - in every other case we passed it directly to the reader/writer, which is one of the really nice things about Go, and interfaces - in most cases it's really easy to pass around streams, instead of having to deal with the intermediate steps.

How to push to multiple Github accounts from the same machine

You might want/need to check out repos locally as several different Github users. For example, my Github account is kevinburke, but I push code at work as kevinburkeshyp. If I could only commit/push as one of those users on each machine, or I had to manually specify the SSH key to use, it would be a big hassle!

Generate some SSH keys

First generate SSH keys for each account, and then upload the public keys to the right Github accounts.

ssh-keygen -t rsa -b 4096 -f "$HOME/.ssh/kevinburke_rsa"
ssh-keygen -t rsa -b 4096 -f "$HOME/.ssh/kevinburkeshyp_rsa"

Set up SSH host aliasing

You're going to use your standard SSH config for the host you use more often, and an alias for the one you use less often. Here's my SSH configuration for the host I use more often - on my work laptop, it's my work account.

    IdentityFile ~/.ssh/kevinburkeshyp_rsa
    User kevinburkeshyp

Now, set up an alias for your second account.

    IdentityFile ~/.ssh/kevinburke_rsa
    User kevinburke

The Host is set to - this is what you write into your SSH config locally. The HostName is what's used for DNS resolution, which is how this still works. Note you'll want to switch the IdentityFile and the User with the key and usernames you set when you generated the keys.

Now if you want to clone something into the second account, use the special host alias:

git clone [email protected]:kevinburke/hamms.git

That's it! It should just work.

Which Commit Name/Email Address?

Most people configure their commit email address using the following command:

$ git config --global [email protected]

This sets your email address in $HOME/.gitconfig and applies that value to every repository/commit on your system. Obviously this is a problem if you have two different email addresses! I want to use [email protected] with my Shyp account and [email protected] with my personal account.

I fix this by putting an empty email in the global $HOME/.gitconfig:

    name = Kevin Burke
    email = ""

When I try a commit in a new repository, Git shows the following message:

*** Please tell me who you are.


  git config --global "[email protected]"
  git config --global "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.

fatal: unable to auto-detect email address (got [email protected](none)')

Which is good! It prevents me from committing until I've configured my email address. I then have helpers set up to easily configure this.

alias setinburkeemail="git config [email protected]'"

That's it! What tips do you use for managing two different Github accounts?

Just Return One Error

Recently there's been a trend in API's to return more than one error, or to always return an array of errors, to the client, when there's a 400 or a 500 server error. From the JSON API specification:

Error objects MUST be returned as an array keyed by errors in the top level of a JSON API document.

Here's an example.

HTTP/1.1 422 Unprocessable Entity
Content-Type: application/vnd.api+json

  "errors": [
      "status": "422",
      "source": { "pointer": "/data/attributes/first-name" },
      "title":  "Invalid Attribute",
      "detail": "First name must contain at least three characters."

I've also seen it in model validation; for example, Waterline returns an array of errors in the invalidAttributes field of the returned error object.

I think this is overkill, and adds significant complexity to your application for little benefit. Let's look at why.

  • Clients are rarely prepared to show multiple errors. At Shyp, we returned an errors key from the API on a 400 or 500 error for about 18 months. All four of the API's clients - the iOS app, the Android app, our error logger, and our internal dashboard - simply processed this by parsing and displaying errors[0], and ignoring the rest. It makes sense from the client's perspective - the errors were being shown on a phone (not much room) or an alert box in the browser (show two, and you get the "Prevent this page from displaying more alerts" dialog). But if clients can't make use of multiple errors, it makes me wonder why we return them from the API.

  • You rarely catch every problem with a request. Say a pickup is 1) missing the to address, 2) missing the city, 3) shipping to a country we don't ship to, and 4) contains a photo of a firearm, which we can't ship. If we return an error, it's going to be ['Missing City', 'Missing Address Line 1']. There's no way we're going to submit your invalid address to our rates service and discover even more problems with the customs form, or try to parse the photo and learn that you're trying to ship a firearm. So even if you fix the first errors, we're still going to return you a 400 from the API.

  • Continuing to process the request is dangerous. Attaching multiple errors to an API response implies that you discovered a problem with an incoming request and continued processing it. This is dangerous! Every part of your system needs to be prepared to handle invalid inputs. This might be a useful defensive measure, but it's also a lot of work for every function in your system to avoid making assumptions about the input that's passed to it. In my experience, passing around bad inputs leads to unnecessary server errors and crashes.

  • Programming languages and libraries return one error. Generally if you call raise or throw, you're limited to throwing a single Error object. If you're using a third party library that can error, usually library functions are limited to returning one error. Returning a single error from your API or your model validation function aligns with 99% of error handling paradigms that currently exist.

A Caveat

There's one place where returning multiple errors may make sense - a signup form with multiple form fields, where it can be frustrating to only return one problem at a time. There are two ways to deal with this:

  • Perform validation on the client, and detect/display multiple errors there. It's a better user experience to do this, since feedback is more or less instantaneous. You'll need the validation in your API anyway, but if someone manages to bypass the client checks, it's probably fine to just return one error at a time.

  • Special case your returned error with a key like attributes, and return the necessary error messages, just for this case. I'd rather special case this error message, than force every other error to be an array just to handle this one place in the API.

Safely Moving a Large Shrinkwrapped Dependency

This past week the team at Shyp decided to fork the framework we use (Sails.js) and the ORM it uses (Waterline). It was an interesting exercise in JS tooling, and I wasted a lot of time on busywork, so I thought I'd share a little bit about the process.

Why Fork?

We've been on an outdated fork of Sails/Waterline for a while. We chose not to update because we were worried that updates would introduce new errors, and many of the updates were new features that we didn't want or need.

The recent ugly discussions between the former CEO of Balderdash and other core contributors to Sails/Waterline made us rethink our dependency on the core tools. We concluded upgrading our framework was going to always be unlikely. We also figured we could remove the many parts of the framework that we don't want or need (sockets, cookies, blueprints, the horribly insecure "action routes", CSRF, Mongo support, Grunt, sails generate, sails www, the multi-query methods that aren't safe against race conditions, &c), and apply bugfixes to parts of the code that we knew were broken. Fortunately the changes already started to pay for themselves at the end of an 8 hour day, as I'll explain later.

Rollout Plan

On its face, the rollout plan was pretty simple.

  1. Fork Sails to Shyp/sails.

  2. Update the HEAD of the master branch to point at the version we have installed in production.

  3. Point the Sails dependency at our fork (no code changes).

  4. Make changes that we want.

  5. Update the dependency.


This broke down somewhere around (3), for a few reasons.

  • If I tried to npm install, every single package Sails depended on, and all of their sub-dependencies, would update. We didn't want this, for several reasons - I don't trust package maintainers to stay on top of updates, or predict when changes/updates will break code, and package updates amount to a large amount of unvetted code being deployed into your API without any code review.

    I didn't know how to work around this so I punted on it by hard-coding the update in the npm shrinkwrap file.

  • My next attempt was to try to generate a shrinkwrap file for Sails, and give our Sails fork the same versions of every dependency & sub-dependency we had in our API. This was the right idea, but failed in its implementation. Sails dependencies are currently alphabetized, but they weren't in the version that we were running. I alphabetized the package.json list while generating a shrinkwrap file, which changed the dependency order in the Shyp/sails shrinkwrap file. This made it impossible to diff against the shrinkwrap file in our API, which listed Sails's dependencies in an arbitrary order.

  • Finally I figured out what needed to happen. First, alphabetize the Sails dependency list in the Shyp API repository. Once that's done, I could use that shrinkwrap file as the basis for the shrinkwrap file in the Sails fork, so the dependencies in the Sails project matched up with the Shyp dependencies. Specifically, I accomplished this by sorting the dependencies in the package.json file, placing the sorted file in api/node_modules/sails/package.json, and then re-running clingwrap with the same dependencies installed. I then verified the same version of each dependency was installed by sorting the JSON blob in the old and new shrinkwrap files, and ensuring that the diff was empty.

    Once that was done, I could update dependencies in the (alphabetized) fork, and the diff in Shyp API's (now alphabetized) npm-shrinkwrap.json would be small. It was great to finally be able to generate a small diff in npm-shrinkwrap.json, and verify that it was correct.

Compounding the frustration was that each build took about 2 minutes, to wipe and reinstall all versions of every dependency. It was a slow afternoon that got more frustrating as it wore on.


  • Before adding or attempting to move any large NPM dependency, ensure that the dependencies are in alphabetical order. You'll run into too many headaches if this is not the case, and you won't be able to regenerate the npm-shrinkwrap file or be able to reason about it.

  • It's surprising that the builtin build tools don't enforce alphabetical order.

  • NPM's shrinkwrap tool does not generate the same file every time you run it. Specifically, if you install a package, occasionally the shrinkwrap file will report the "from" string is a range, for example, "from": "colors@>=0.6.0-1 <0.7.0", and sometimes it will report the "from" string as a URL. For a while, we worked around this by generating the NPM shrinkwrap file twice - the second time we ran it, it would generate a consistent set of strings.

      rm -rf node_modules
      npm cache clear
      npm install --production
      npm shrinkwrap
      npm install --production
      npm shrinkwrap

    Eventually we switched to using clingwrap, which solves this problem by avoiding the "from" and "resolved" fields entirely. We've also heard NPM 3 contains updates to the shrinkwrap file, but NPM 3 also believes our dependency tree contains errors, so we've been unable to effectively test it. Clingwrap also has some errors with NPM 3 and github repositories.

  • It was incredibly tempting to just say "fuck it" and deploy the updated version of every dependency and subdependency. Based on the amount of work it took me to figure out how to generate and check in consistent diffs, I came away thinking that there are a lot of Javascript applications out there that are unable to control/stabilize the versions of libraries that they use.

  • These are all problems you can avoid or mitigate by thinking incredibly hard before taking on a dependency, especially if, like Sails, the tool you're leaning on has a lot of its own dependencies. Is there really an advantage to using _.first(arr) instead of arr[0]? Do you really need the pluralize library, or is it sufficient to just append an 's' for your use case? Do you need to take a dependency to build a state machine, or can you use a dictionary and an UPDATE query?


This took a lot of time to get right, and felt like busywork for at least three of the four hours I spent working on this, but we finally got some benefits out of this by the end of the day.

  • I committed a change to immediately error if anyone attempts to use the count() method, which attempts to pull the entire table into memory. Deploying this in our API application revealed four places where we were calling count()! Fortunately, they were all in test code. Still, it's delightful to be able to automatically enforce a rule we previously had to rely on code review (and shared team memory) to catch.

  • We removed the Grunt dependency, which we never used. There are about 4500 lines of deletions waiting in pull requests.

  • We understand shrinkwrap a little better, and hopefully this process won't be so painful next time.

  • We removed some incorrect Mongo-specific code from Waterline. This is the type of adaptation that we can only do by giving up compatibility, and we're happy with how easy this is now.

So that was my Wednesday last week. As it stands, we're planning on developing this fork in the open. I don't think you should use it - there are still better alternatives - but if you're stuck on Waterline with Postgres, or using Sails as a REST API, you might want to take a look.

Everything you need to know about state machines

Do you have objects in your system that can be in different states (accounts, invoices, messages, employees)? Do you have code that updates these objects from one state to another? If so, you probably want a state machine.

What is a state machine?

To Everything, There is a Season, by Shawn Clover. CC BY-NC 2.0

At its root, a state machine defines the legal transitions between states in your system, is responsible for transitioning objects between states, and prevents illegal transitions.

You sound like an architecture astronaut, why do I need this?

Let's talk about some bad things that can happen if you don't have a state machine in place.

(Some of these actually happened! Some are invented.)

  • A user submits a pickup. We pick up the item and ship it out. Two weeks later, a defect causes the app to resubmit the same pickup, and reassign a driver, for an item that's already been shipped.

  • Two users submit a pickup within a second of each other. Our routing algorithm fetches available drivers, computes each driver's distance to the pickup, and says the same driver is available for both pickups. We assign the same driver to both pickups.

  • A user submits a pickup. A defect in a proxy causes the submit request to be sent multiple times. We end up assigning four drivers to the pickup, and sending the user four text messages that their pickup's been assigned.

  • An item is misplaced at the warehouse and sent straight to the packing station. Crucial steps in the shipping flow get skipped.

  • Code for updating the state of an object is littered between several different classes and controllers, which handle the object in different parts of its lifecycle. It becomes difficult to figure out how the object moves between various states. Customer support tells you that an item is in a particular state that makes it impossible for them to do their jobs. It takes great effort to figure out how it got there.

These are all really bad positions to be in! A lot of pain for you and a lot of pain for your teams in the field.

You are already managing state (poorly)

Do you have code in your system that looks like this?

def submit(pickup_id):
    pickup = Pickups.find_by_id(pickup_id)
    if pickup.state != 'DRAFT':
        throw new StateError("Can't submit a pickup that's already been submitted")
    pickup.state = 'SUBMITTED'
    MessageService.send_message(pickup.user.phone_number, 'Your driver is on the way!')

By checking the state of the pickup before moving to the next state, you're managing the state of your system! You are (at least partially) defining what transitions are allowed between states, and what transitions aren't. To avoid the issues listed above, you'll want to consolidate the state management in one place in your codebase.

Okay, how should I implement the state machine?

There are a lot of libraries that promise to manage this for you. You don't need any of them (too much complexity), and you don't need a DSL. You just need a dictionary and a single database query.

The dictionary is for defining transitions, allowable input states, and the output state. Here's a simplified version of the state machine we use for Pickups.

states = {
    submit: {
        before: ['DRAFT'],
        after:   'SUBMITTED',
    assign: {
        before: ['SUBMITTED'],
        after:   'ASSIGNED',
    cancel: {
        before: ['DRAFT', 'SUBMITTED', 'ASSIGNED'],
        after:   'CANCELED',
    collect: {
        before: ['ASSIGNED'],
        after:   'COLLECTED',

Then you need a single function, transition, that takes an object ID, the name of a transition, and (optionally) additional fields you'd like to set on the object if you update its state.

The transition function looks up the transition in the states dictionary, and generates a single SQL query:

    state = 'newstate',
    extraField1 = 'extraValue1'
    id = $1 AND
    state IN ('oldstate1', 'oldstate2')

If your UPDATE query returns a row, you successfully transitioned the item! Return the item from the function. If it returns zero rows, you failed to transition the item. It can be tricky to determine why this happened, since you don't know which (invalid) state the item was in that caused it to not match. We cheat and fetch the record to give a better error message - there's a race there, but we note the race in the error message, and it gives us a good idea of what happened in ~98% of cases.

Note what you don't want to do - you don't want to update the object in memory and then call .save() on it. Not only is .save() dangerous, but fetching the item before you attempt to UPDATE it means you'll be vulnerable to race conditions between two threads attempting to transition the same item to two different states (or, twice to the same state). Ask, don't tell - just try the transition and then handle success or failure appropriately.

Say you send a text message to a user after they submit their pickup - if two threads can successfully call the submit transition, the user will get 2 text messages. The UPDATE query above ensures that exactly one thread will succeed at transitioning the item, which means you can (and want to) pile on whatever only-once actions you like (sending messages, charging customers, assigning drivers, &c) after a successful transition and ensure they'll run once. For more about consistency, see Weird Tricks to Write Faster, More Correct Database Queries.

Being able to issue queries like this is one of the benefits of using a relational database with strong consistency guarantees. Your mileage (and the consistency of your data) may vary when attempting to implement a state transition like this using a new NoSQL database. Note that with the latest version of MongoDB, it's possible to read stale data, meaning that (as far as I can tell) the WHERE clause might read out-of-date data, and you can apply an inconsistent state transition.

Final Warnings

A state machine puts you in a much better position with respect to the consistency of your data, and makes it easy to guarantee that actions performed after a state transition (invoicing, sending messages, expensive operations) will be performed exactly once for each legal transition, and will be rejected for illegal transitions. I can't stress enough how often this has saved our bacon.

You'll still need to be wary of code that makes decisions based on other properties of the object. For example, you might set a driver_id on the pickup when you assign it. If other code (or clients) decide to make a decision based on the presence or absence of the driver_id field, you're making a decision based on the state of the object, but outside of the state machine framework, and you're vulnerable to all of the bullet points mentioned above. You'll need to be vigilant about these, and ensure all calling code is making decisions based on the state property of the object, not any auxiliary properties.

You'll also need to be wary of code that tries a read/check-state/write pattern; it's vulnerable to the races mentioned above. Always always just try the state transition and handle failure if it fails.

Finally, some people might try to sneak in code that just updates the object state, outside of the state machine. Be wary of this in code reviews and try to force all state updates to happen in the StateMachine class.

Weird Tricks to Write Faster, More Correct Database Queries

Adrian Colyer wrote a great summary of a recent paper by Peter Bailis et al. In the paper the database researchers examine open source Rails applications and observe that the applications apply constraints - foreign key references, uniqueness constraints - in a way that's not very performant or correct.

I was pretty surprised to read about this! For the most part we have avoided problems like this at Shyp, and I didn't realize how widespread this problem is; I certainly have written a lot of bad queries in the past.

So! Let's learn some tips for writing better queries. Everything below will help you write an application that is more correct - it will avoid consistency problems in your data - and more performant - you should be able to achieve the same results as Rails, with fewer queries!

ps - The info below may be really obvious to you! Great! There are a lot of people who aren't familiar with these techniques, as the paper above indicates.

Use database constraints to enforce business logic

Say you define an ActiveRecord class that looks like this:

class User < ActiveRecord::Base
  validates :email, uniqueness: true

What actually happens when you try to create a new user? It turns out Rails will make 4 (four!) roundtrips to the database.

  1. BEGIN a transaction.

  2. Perform a SELECT to see if any other users have that email address.

  3. If the SELECT turns up zero rows, perform an INSERT to add the row.

  4. Finally, COMMIT the result.

This is pretty slow! It also increases the load on your application and your database, since you need to make 4 requests for every INSERT. Bailis et al also show that with your database's default transaction isolation level, it's possible to insert two records with the same key. Furthermore, there are some ActiveRecord queries which skip the built-in validations, as Gary Bernhardt discussed in his video, "Where Correctness Is Enforced", way back in 2012. Any query which inserts data and skips the validations can compromise the integrity of your database.

What if I told you you can do the same insert in one query instead of four, and it would be more correct than the Rails version? Instead of Rails's migration, write this:


The UNIQUE is the key bit there; it adds a unique key on the table. Then, instead of wrapping the query in a transaction, just try an INSERT.

> insert into users (email) values ([email protected]');
> insert into users (email) values ([email protected]');
ERROR:  duplicate key value violates unique constraint "users_email_key"
DETAIL:  Key (email)=([email protected]) already exists.

You'll probably need to add better error handling around the failure case - at least we did, for the ORM we use. But at any level of query volume, or if speed counts (and it probably does), it's worth it to investigate this.

Just Try the Write

Say you wanted to read a file. You could write this:

if not os.path.isfile(filename):
    raise ValueError("File does not exist")
with open(filename, 'r') as f:

But that would still be vulnerable to a race! What if the OS or another thread deleted the file between the isfile check and the with open line - the latter would throw an IOError, which won't be handled. Far better to just try to read the file and handle errors appropriately.

    with open(filename, 'r') as f:
except IOError:
    raise ValueError("File does not exist")

Say you have a foreign key reference - phone_numbers.user_id refers to, and you want to validate that the user_id is valid. You could do:

def write_phone_number(number, user_id):
    user = Users.find_by_id(user_id)
    if user is None:
        raise NotFoundError("User not found")
    Number.create(number=number, user_id=user_id)

Just try to write the number! If you have a foreign key constraint in the database, and the user doesn't exist, the database will tell you so. Otherwise you have a race between the time you fetch the user and the time you create the number.

def write_phone_number(number, user_id):
        Number.create(number=number, user_id=user_id)
    except DatabaseError as e:
        if is_foreign_key_error(e):
            raise NotFoundError("Don't know that user id")

Updates Should Compose

Let's say you have the following code to charge a user's account.

def charge_customer(account_id, amount=20):
    account = Accounts.get_by_id(account_id)
    account.balance = account.balance - 20
    if account.balance <= 0:
        throw new ValueError("Negative account balance")

Under the hood, here's what that will generate:

SELECT * FROM accounts WHERE id = ?
UPDATE accounts SET balance = 30 WHERE id = ?;

So far, so good. But what happens if two requests come in to charge the account at the same time? Say the account balance is $100

  1. Thread 1 wants to charge $30. It reads the account balance at $100.

  2. Thread 2 wants to charge $15. It reads the account balance at $100.

  3. Thread 1 subtracts $30 from $100 and gets a new balance of $70.

  4. Thread 2 subtracts $15 from $100 and gets a new balance of $85.

  5. Thread 1 attempts to UPDATE the balance to $70.

  6. Thread 2 attempts to UPDATE the balance to $85.

This is clearly wrong! The balance after $45 of charges should be $55, but it was $70, or $85, depending on which UPDATE went last. There are a few ways to deal with this:

  • create some kind of locking service to lock the row before the read and after you write the balance. The other thread will wait for the lock before it reads/writes the balance. These are hard to get right and will carry a latency penalty.

  • Run the update in a transaction; this will create an implicit lock on the row. If the transaction runs at the SERIALIZABLE or REPEATABLE READ levels, this is safe. Note most databases will set the default transaction level to READ COMMITTED, which won't protect against the issue referenced above.

  • Skip the SELECT and write a single UPDATE query that looks like this:

      UPDATE accounts SET balance = balance - 20 WHERE id = ?;

That last UPDATE is composable. You can run a million balance updates in any order, and the end balance will be exactly the same, every time. Plus you don't need a transaction or a locking service; it's exactly one write (and faster than the .save() version above!)

But if I do just one UPDATE, I can't check whether the balance will go below zero! You can - you just need to enforce the nonnegative constraint in the database, not the application.

CREATE TABLE accounts (
    id integer primary key,
    balance integer CHECK (balance >= 0),

That will throw any time you try to write a negative balance, and you can handle the write failure in the application layer.

Update: Apparently MySQL accepts check constraints as valid syntax, but does not execute them, so you might need to take a different approach. Thanks olivier for pointing this out!

The key point is that your updates should be able to run in any order without breaking the application. Use relative ranges - balance = balance - 20 for example - if you can. Or, only apply the UPDATE if the previous state of the database is acceptable, via a WHERE clause. The latter technique is very useful for state machines:

UPDATE pickups SET status='submitted' WHERE status='draft' AND id=?;

That update will either succeed (if the pickup was in draft), or return zero rows. If you have a million threads try that update at the same time, only one will succeed - an incredibly valuable property!

Beware of save()

The save() function in an ORM is really unfortunate for two reasons. First, to call .save(), you have to retrieve an instance of the object via a SELECT call. If you have an object's ID and some fields to read, you can avoid needing to do the read by just trying the UPDATE. This introduces more latency and the possibility of writing stale data.

Second, some implementations of .save() will issue an UPDATE and update every column.

This can lead to updates getting clobbered. Say two requests come in, one to update a user's phone number, and the other to update a user's email address, and both call .save() on the record.

UPDATE users SET [email protected]', phone_number='newnumber' WHERE id = 1;
UPDATE users SET [email protected]', phone_number='oldnumber' WHERE id = 1;

In this scenario the first UPDATE gets clobbered, and the old email gets persisted. This is really bad! We told the first thread that we updated the email address, and then we overwrote it. Your users and your customer service team will get really mad, and this will be really hard to reproduce. Be wary of .save - if correctness is important (and it should be!), use an UPDATE with only the column that you want.

Partial Indexes

If you thought the previous section was interesting, check this out. Say we have a pickups table. Each pickup has a driver ID and a status (DRAFT, ASSIGNED, QUEUED, etc).

CREATE TABLE pickups (
    id integer,
    driver_id INTEGER REFERENCES drivers(id),
    status TEXT

We want to enforce a rule that a given driver can only have one ASSIGNED pickup at a time. You can do this in the application by using transactions and writing very, very careful code... or you can ask Postgres to do it for you:

CREATE UNIQUE INDEX "only_one_assigned_driver" ON pickups(driver_id) WHERE
    status = 'ASSIGNED';

Now watch what happens if you attempt to violate that constraint:

> INSERT INTO pickups (id, driver_id, status) VALUES (1, 101, 'ASSIGNED');
> INSERT INTO pickups (id, driver_id, status) VALUES (2, 101, 'DRAFT');
INSERT 0 1 -- OK, because it's draft; doesn't hit the index.
> INSERT INTO pickups (id, driver_id, status) VALUES (3, 101, 'ASSIGNED');
ERROR:  duplicate key value violates unique constraint "only_one_assigned_driver"
DETAIL:  Key (driver_id)=(101) already exists.

We got a duplicate key error when we tried to insert a second ASSIGNED record! Because you can trust the database to not ever screw this up, you have more flexibility in your application code. Certainly you don't have to be as careful to preserve the correctness of the system, since it's impossible to put in a bad state!


In many instances your ORM may be generating a query that's both slow, and can lead to concurrency errors. That's the bad news. The good news is you can write database queries that are both faster and more correct!

A good place to start is by reversing the traditional model of ORM development. Instead of starting with the code in the ORM and working backwards to the query, start with the query you want, and figure out how to express than in your ORM. You'll probably end up using the lower-level methods offered by your ORM a lot more, and you'll probably discover defects in the way that your ORM handles database errors. That's okay! You are on a much happier path.

Thanks to Alan Shreve and Kyle Conroy for reading drafts of this post.

Maybe Automatically Updating Dependencies Isn’t a Great Idea

There's a distressing feeling in the Node.js community that apps without up-to-date dependencies are somehow not as good, or stable, as apps that always keep their dependencies up to date. So we see things like and badges that show whether the project's dependencies are up to date (and, implicitly, shame anyone whose dependencies aren't green).

I'm not talking about updating dependencies for a good reason (covered below); I'm talking about the practice of updating dependencies for the sake of keeping the dependencies updated. In the best possible case, a dependency update does nothing. The application keeps running exactly as it was. In the worst case, your servers crash, millions of dollars of business value are affected, state is compromised, or worse.

One day at Twilio we tried to deploy code that had a defect in it. The deployment tool noticed the errors and tried to rollback the deployment by putting the old nodes in the load balancer and taking the new ones out. Except... when it went to take the new nodes out, the worker process crashed. So we ended up with both the new (faulty) nodes and the old nodes in the load balancer, and our most reliable tool for cluster management couldn't pull the bad nodes out.

We did some investigation and it turns out one of our dependencies had updated. Well, it wasn't a direct dependency - we locked down all of those - it was a dependency of a dependency, which upgraded to version 3, and introduced an incompatibility.

Fundamentally, updating dependencies is a dangerous operation. Most people would never deploy changes to production without having them go through code review, but I have observed that many feel comfortable bumping a package.json number without looking at the diff of what changed in the dependency.

New releases of dependencies are usually less tested in the wild than older versions. We know the best predictor of the number of errors in code is the number of lines written. The current version of your dependency (which you know works) has 0 lines of diff when compared with itself, but the newest release has a greater-than-0 number of lines of code changed. Code changes are risky, and so are dependency updates.

Errors and crashes introduced by dependency updates are difficult to debug. Generally, the errors are not in your code; they're deep in a node_modules or site-packages folder or similar. You are probably more familiar with your application code than the intricacies of your third party tools. Tracking down the error usually involves figuring out what version of the code used to be running (not easy!) and staring at a diff between the two.

But my tests will catch any dependency errors, you say. Maybe you have great coverage around your application. But do the dependencies you're pulling in have good test coverage? Are the interactions between your dependencies tested? How about the interactions between the dependency and every possible version of a subdependency? Do your tests cover every external interface?

But the dependencies I'm pulling in use semver, so I'll know if things break. This only saves you if you actually read the CHANGELOG, or the package author correctly realizes a breaking change. Otherwise you get situations like this. Which just seems sad; the reporter must have taken time to update the package, then gotten an error report, then had to figure out what change crashed the servers, then mitigated the issue. A lot of downside there - wasted time and the business fallout of a crashing application, and I'm struggling to figure out what benefit the reporter got from updating to the latest possible version.

When to Update Dependencies

Generally I think you should lock down the exact versions of every dependency and sub-dependency that you use. However, there are a few cases where it makes sense to pull in the latest and greatest thing. In every case, at the very least I read the CHANGELOG and scan the package diff before pulling in the update.

Security Upgrades

An application issues a new release to patch a security vulnerability, and you need to get the latest version of the app to patch the same hole. Even here, you want to ensure that the vulnerability actually affects your application, and that the changed software does not break your application. You may not want to grab the entire upstream changeset, but only port in the patch that fixes the security issue.

Performance Improvement

Recently we noticed our web framework was sleeping for 50ms on every POST and PUT request. Of course you would want to upgrade to avoid this (although we actually fixed it by removing the dependency).

You Need a Hot New Feature

We updated mocha recently because it wouldn't print out stack traces for things that weren't Error objects. We submitted a patch and upgraded mocha to get that feature.

You Need a Bug Fix

Your version of the dependency may have a defect, and upgrading will fix the issue. Ensure that the fix was actually coded correctly.

A Final Warning

Updated dependencies introduce a lot of risk and instability into your project. There are valid reasons to update and you'll need to weigh the benefit against the risk. But updating dependencies just for the sake of updating them is just going to run you into trouble.

You can avoid all of these problems by not adding a dependency in the first place. Think really, really hard before reaching for a package to solve your problem. Actually read the code and figure out if you need all of it or just a subset. Maybe if you need to pluralize your application's model names, it's okay to just add an 's' on the end, instead of adding the pluralize library. Sure, the Volcano model will be Volcanos instead of Volcanoes but maybe that's okay for your system.

Unfortunately my desired solution for adding dependencies - fork a library, rip out the parts you aren't using, and copy the rest directly into your source tree - isn't too popular. But I think it would help a lot with enforcing the idea that you own your dependencies and the code changes inside.

Don’t Use Sails (or Waterline)

The Shyp API currently runs on top of the Sails JS framework. It's an extremely popular framework - the project has over 11,000 stars on Github, and it's one of the top 100 most popular projects on the site. However, we've had a very poor experience with it, and with Waterline, the ORM that runs underneath it. Remember when you learned that does a DNS lookup to check whether a URL is equivalent to another URL? Imagine finding an issue like that every two weeks or so and that's the feeling I get using Sails and Waterline.

The project's maintainers are very nice, and considerate - we have our disagreements about the best way to build software, but they're generally responsive. It's also clear from the error handling in the project (at least the getting started docs) that they've thought a lot about the first run experience, and helping people figure out the answers to the problems they encounter trying to run Sails.

That said, here are some of the things we've struggled with:

  • The website broke all incoming Google links ("sails views", "sails models", etc), as well as about 60,000 of its own autogenerated links, for almost a year. Rachael Shaw has been doing great work to fix them again, but it was pretty frustrating that documentation was so hard to find for that long.

  • POST and PUT requests that upload JSON or form-urlencoded data sleep for 50ms in the request parser. This sleep occupied about 30% of the request time on our servers, and 50-70% of the time in controller tests.

  • The community of people who use Sails doesn't seem to care much about performance or correctness. The above error was present for at least a year and not one person wondered why simple POST requests take 50ms longer than a simple GET. For a lot of the issues above and below it seems like we are the only people who have ran into them, or care.

  • By default Sails generates a route for every function you define in a controller, whether it's meant to be public or not. This is a huge security risk, because you generally don't think to write policies for these implicitly-created routes, so it's really easy to bypass any authentication rules you have set up and hit a controller directly.

  • Blueprints are Sails's solution for a CRUD app and we've observed a lot of unexpected behavior with them. For one example, passing an unknown column name as the key parameter in a GET request (?foo=bar) will cause the server to return a 500.

  • If you want to test the queries in a single model, there's no way to do it besides loading/lifting the entire application, which is dog slow - on our normal sized application, it takes at least 7 seconds to begin running a single test.

  • Usually when I raise an issue on a project, the response is that there's some newer, better thing being worked on, that I should use instead. I appreciate that, but porting an application has lots of downside and little upside. I also worry about the support and correctness of the existing tools that are currently in wide use.

  • Hardcoded typos in command arguments.

  • No documented responsible disclosure policy, or information on how security vulnerabilities are handled.


Waterline is the ORM that powers Sails. The goal of Waterline is to provide the same query interface for any database that you would like to use. Unfortunately, this means that the supported feature set is the least common denominator of every supported database. We use Postgres, and by default this means we can't get a lot of leverage out of it.

These issues are going to be Postgres oriented, because that's the database we use. Some of these have since been fixed, but almost all of them (apart from the data corruption issues) have bit us at one point or another.*

  • No support for transactions. We had to write our own transaction interface completely separate from the ORM.

  • No support for custom Postgres types (bigint, bytea, array). If you set a column to type 'array' in Waterline, it creates a text field in the database and serializes the array by calling JSON.stringify.

  • If you define a column to be type 'integer', Waterline will reject things that don't look like integers... except for Mongo IDs, which look like "4cdfb11e1f3c000000007822". Waterline will pass these through to the database no matter what backend data store you are using.

  • Waterline offers a batch interface for creating items, e.g. Users.create([user1, user2]). Under the hood, however, creating N items issues N insert requests for one record each, instead of one large request. 29 out of 30 times, the results will come back in order, but there used to be a race where sometimes create will return results in a different order than the order you inserted them. This caused a lot of intermittent, hard-to-parse failures in our tests until we figured out what was going on.

  • Waterline queries are case insensitive; that is, Users.find().where(name: 'FOO') will turn into SELECT * FROM users WHERE name = LOWER('FOO');. There's no way to turn this off. If you ask Sails to generate an index for you, it will place the index on the uppercased column name, so your queries will miss it. If you generate the index yourself, you pretty much have to use the lowercased column value & force every other query against the database to use that as well.

  • The .count function used to work by pulling the entire table into memory and checking the length of the resulting array.

  • No way to split out queries to send writes to a primary and reads to a replica. No support for canceling in-flight queries or setting a timeout on them.

  • The test suite is shared by every backend adapter; this makes it impossible for the Waterline team to write tests for database-specific behavior or failure handling (unique indexes, custom types, check constraints, etc). Any behavior specific to your database is poorly tested at best.

  • "Waterline truncated a JOIN table". There are probably more issues in this vein, but we excised all .populate, .associate, and .destroy calls from our codebase soon after this, to reduce the risk of data loss.

  • When Postgres raises a uniqueness or constraint violation, the resulting error handling is very poor. Waterline used to throw an object instead of an Error instance, which means that Mocha would not print anything about the error unless you called console.log(new Error(err)); to turn it into an Error. (It's since been fixed in Waterline, and I submitted a patch to Mocha to fix this behavior, but we stared at empty stack traces for at least six months before that). Waterline attempts to use regex matching to determine whether the error returned by Postgres is a uniqueness constraint violation, but the regex fails to match other types of constraint failures like NOT NULL errors or partial unique indexes.

  • The error messages returned by validation failures are only appropriate to display if the UI can handle newlines and bullet points. Parsing the error message to display any other scenario is very hard; we try really hard to dig the underlying pg error object out and use that instead. Mostly nowadays we've been creating new database access interfaces that wrap the Waterline model instances and handle errors appropriately.


I appreciate the hard work put in by the Sails/Waterline team and contributors, and it seems like they're really interested in fixing a lot of the issues above. I think it's just really hard to be an expert in sixteen different database technologies, and write a good library that works with all of them, especially when you're not using a given database day in and day out.

You can build an app that's reliable and performant on top of Sails and Waterline - we think ours is, at least. You just have to be really careful, avoid the dangerous parts mentioned above, and verify at every step that the ORM and the router are doing what you think they are doing.

The sad part is that in 2015, you have so many options for building a reliable service, that let you write code securely and reliably and can scale to handle large numbers of open connections with low latency. Using a framework and an ORM doesn't mean you need to enter a world of pain. You don't need to constantly battle your framework, or worry whether your ORM is going to delete your data, or it's generating the correct query based on your code. Better options are out there! Here are some of the more reliable options I know about.

  • Instagram used Django well through its $1 billion dollar acquisition. Amazing community and documentation, and the project is incredibly stable.

  • You can use Dropwizard from either Java or Scala, and I know from experience that it can easily handle hundreds/thousands of open connections with incredibly low latency.

  • Hell, the Go standard library has a lot of reliable, multi-threaded, low latency tools for doing database reads/writes and server handling. The third party libraries are generally excellent.

I'm not amazingly familiar with backend Javascript - this is the only server framework I've used - but if I had to use Javascript I would check out whatever the Walmart and the Netflix people are using to write Node, since they need to care a lot about performance and correctness.

*: If you are using a database without traditional support for transactions and constraints, like Mongo, correct behavior is going to be very difficult to verify. I wish you the best of luck.