Posts Tagged With: Code

Go Concurrency for Javascript Developers

Often the best way to learn a new language is to implement something you know
with it. Let’s take a look at some common async Javascript patterns, and how
you’d implement them in Go.

Callbacks

You can certainly implement callbacks in Go! Functions are first class
citizens. Here we make a HTTP request and hit the callback with the response
body or an error.

func makeRequest(method string, url string, cb func(error, io.ReadCloser)) {
    req, _ := http.NewRequest(method, url, nil)
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        cb(err, nil)
    } else {
        cb(err, resp.Body)
    }
}
func main() {
    makeRequest("GET", "http://ipinfo.io/json", func(err error, body io.ReadCloser) {
        if err != nil {
            log.Fatal(err)
        }
        defer body.Close()
        io.Copy(os.Stdout, body)
    })
}

Go uses callbacks in the standard library in some places – time.AfterFunc,
http.HandlerFunc. Elsewhere, callbacks are mostly used to let you specify
which function or action you want applied, in interaction with a useful piece
of code – filepath.Walk, strings.LastIndexFunc, etc.
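
For instance, here’s a minimal example of both styles – a callback scheduled
with time.AfterFunc, and a predicate passed to strings.LastIndexFunc:

package main

import (
    "fmt"
    "strings"
    "time"
    "unicode"
)

func main() {
    // AfterFunc runs the callback after the duration elapses, in its own goroutine.
    timer := time.AfterFunc(100*time.Millisecond, func() {
        fmt.Println("timer fired")
    })
    defer timer.Stop()

    // LastIndexFunc takes a predicate; here we find the last digit in the string.
    fmt.Println(strings.LastIndexFunc("abc123def", unicode.IsDigit)) // prints 5

    time.Sleep(200 * time.Millisecond) // give the timer a chance to fire
}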

Most of the same problems with callbacks apply in Go as in Javascript: namely, you
can forget to call a callback, or you can call it more than once. At least you
can’t try to call a callback that is undefined, since the compiler will not
let you do this.

But callbacks aren’t idiomatic. Javascript needs callbacks because there’s
only a single thread of execution, processing every request on your server
sequentially. If you "wait"/"block" for something to finish (e.g. by calling
fs.readFileSync or any of the Sync methods), every other request on your
server has to wait until you’re done. I’m simplifying, but you have to have
callbacks in Javascript so everything else (other incoming HTTP requests, for
example) has a chance to run.

Go does not have this same limitation. A Javascript thread can only run on one
CPU; Go can use all of the CPUs on your computer, so multiple threads can
do work at the same time, even if all of them are making "blocking" function
calls, like HTTP requests, database queries, or reading or writing files.

Furthermore, even if you only had one CPU, multiple Go threads would run
without having IO "block" the scheduler, even though methods like http.Get
have a synchronous interface. How does this work? Deep down in code like
net/fd_unix.go, the socket code calls syscall.Write, which (when you follow
it far enough) calls runtime.entersyscall, which signals to the scheduler
that it won’t have anything to do for a while, and other work should proceed.

For that reason you probably don’t want to use callbacks for asynchronous code.
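
To see this in action, here’s a small sketch that makes several "blocking" HTTP
requests from goroutines while restricted to a single CPU (reusing the ipinfo.io
URL from above); the total elapsed time is roughly one round trip, not five,
because the runtime schedules other goroutines while each one waits on the
network:

package main

import (
    "fmt"
    "net/http"
    "runtime"
    "sync"
    "time"
)

func main() {
    runtime.GOMAXPROCS(1) // a single CPU, roughly like the Javascript model

    var wg sync.WaitGroup
    start := time.Now()
    for i := 0; i < 5; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            // http.Get has a synchronous interface, but while this goroutine
            // waits on the network the scheduler runs the other goroutines.
            resp, err := http.Get("http://ipinfo.io/json")
            if err != nil {
                fmt.Println(i, err)
                return
            }
            resp.Body.Close()
            fmt.Println(i, resp.Status)
        }(i)
    }
    wg.Wait()
    fmt.Println("elapsed:", time.Since(start))
}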

.then / .catch / .finally

This is the other common approach to async operations in Javascript.
Fortunately most API’s in Go are blocking, so you can just do one thing and
then the other thing. Say you want to execute the following async actions:

var changeEmail = function(userId, newEmail) {
    return checkEmailAgainstBlacklist(newEmail).then(function() {
        return getUser(userId);
    }).then(function(user) {
        user.email = newEmail;
        return user.save();
    }).catch(function(err) {
        console.error(err);
        throw err;
    });
}

In Go, you’d just do each action in turn and wait for them to finish:

func changeEmail(userId string, newEmail string) (User, error) {
    err := checkEmailAgainstBlacklist(newEmail)
    if err != nil {
        return nil, err
    }
    user, err := getUser(userId)
    if err != nil {
        return nil, err
    }
    user.Email = newEmail
    return user.Save()
}

.finally is usually implemented with the defer keyword; you can at any time
defer code to run at the point the function terminates.

resp, err := http.Get("http://ipinfo.io/json")
// Note Close() will panic if err is non-nil, you still have to check err
checkError(err)
defer resp.Body.Close()
io.Copy(os.Stdout, resp.Body)

Promise.try

Just ignore the error response, or handle it; there’s no real difference
between sync and async errors, besides calling panic, which your code
shouldn’t be doing.

Promise.join

Doing multiple things at once in Javascript is pretty easy with Promises.
Here we dispatch two database queries at once, and handle the results in one
function.

var checkUserOwnsPickup = function(userId, pickupId) {
    return Promise.join(
        getUser(userId),
        getPickup(pickupId)
    ).spread(function(user, pickup) {
        return pickup.userId === user.id;
    })
}

Unfortunately doing this in Go is a little more complicated. Let’s start with
a slightly easier case – doing something where we don’t care about the result,
whether it errors or not. You can put each in a goroutine and they’ll execute
in the background.

func subscribeUser(user User) {
    go func() {
        sendWelcomeEmail(user.Email)
    }()
    go func() {
        metrics.Increment("user.signup")
    }()
}

The easiest way to pass the results back is to declare the variables in the
outer scope and assign them in a goroutine.

func main() {
    var user models.User
    var userErr error
    var pickup models.Pickup
    var pickupErr error
    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        user, userErr = getUser(userId)
        wg.Done()
    }()
    go func() {
        pickup, pickupErr = getPickup(pickupId)
        wg.Done()
    }()
    wg.Wait()
    // Error checking omitted
    fmt.Println(user.Id)
    fmt.Println(pickup.Id)
}

More verbose than Promise.join, but the lack of generics primitives makes a
shorter solution tough. You can gain more control, and early terminate by using
channels, but it’s more verbose to do so.
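
For example, here’s a sketch of the channel version, reusing the hypothetical
getUser/getPickup helpers and models package from above; it returns as soon as
either call fails, and the buffered channels let the slower goroutine finish in
the background without leaking:

func getUserAndPickup(userId string, pickupId string) (models.User, models.Pickup, error) {
    userc := make(chan models.User, 1)
    pickupc := make(chan models.Pickup, 1)
    errc := make(chan error, 2)
    go func() {
        user, err := getUser(userId)
        if err != nil {
            errc <- err
            return
        }
        userc <- user
    }()
    go func() {
        pickup, err := getPickup(pickupId)
        if err != nil {
            errc <- err
            return
        }
        pickupc <- pickup
    }()
    var user models.User
    var pickup models.Pickup
    for i := 0; i < 2; i++ {
        select {
        case user = <-userc:
        case pickup = <-pickupc:
        case err := <-errc:
            // Bail out on the first error; the other goroutine's send won't
            // block because its channel is buffered.
            return models.User{}, models.Pickup{}, err
        }
    }
    return user, pickup, nil
}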

Promise.map

Do the same as the Go version of Promise.join above, and then write a for
loop to loop over the resolved values. This might be a little trickier because
you need to specify the types of everything, and you can’t create arrays with
differing types. If you want a given concurrency value, there are some neat
code examples in this post on Pipelines.
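
Without a concurrency limit, a minimal sketch (again using the hypothetical
getUser helper and models package) looks like this – one goroutine per item,
each writing to its own index so no mutex is needed:

func getUsers(userIds []string) ([]models.User, error) {
    users := make([]models.User, len(userIds))
    errs := make([]error, len(userIds))
    var wg sync.WaitGroup
    for i, id := range userIds {
        wg.Add(1)
        go func(i int, id string) {
            defer wg.Done()
            users[i], errs[i] = getUser(id)
        }(i, id)
    }
    wg.Wait()
    for _, err := range errs {
        if err != nil {
            return nil, err
        }
    }
    return users, nil
}

To cap the number of simultaneous calls, have each goroutine acquire a slot
from a buffered channel before calling getUser – that’s essentially the bounded
parallelism pattern the Pipelines post describes.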

Checking for races

One of the strengths of Go is the race detector, which detects whether
two threads can attempt to read/write the same variable in different orders
without synchronizing first. It’s slightly slower than normal, but I highly
encourage you to run your test suite with the race detector on. Here’s our
make target:

test:
    go test -race ./... -timeout 2s

It once found a defect in our code that turned out to be a standard library
error! I highly recommend enabling the race detector.
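
As a contrived example, here’s the kind of code the detector flags – two
goroutines incrementing a shared counter without synchronization. Run it with
go run -race and you’ll get a data race report:

package main

import (
    "fmt"
    "sync"
)

func main() {
    count := 0
    var wg sync.WaitGroup
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < 1000; j++ {
                count++ // unsynchronized read/write shared between goroutines
            }
        }()
    }
    wg.Wait()
    fmt.Println(count)
}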

The closest equivalent in Javascript would be a tool that varies the order in
which your callbacks run and checks that you still get the same results.

Thanks

I hope this helps! I wrote this to be the guide I wish I had when I was getting
started figuring out how to program concurrently in Go.

Thanks to Kyle Conroy, Alan Shreve, Chris Goddard, @rf, and members of the
#go-nuts IRC channel for answering my silly questions.

Encoding and Decoding JSON, with Go’s net/http package

This is a pretty common task: encode JSON and send it to a server, decode JSON
on the server, and vice versa. Amazingly, the existing resources on how to do
this aren’t very clear. So let’s walk through each case, for the following
simple User object:

type User struct {
    Id      string
    Balance uint64
}

Sending JSON in the body of a POST/PUT request

Let’s start with the trickiest one: the body of Go’s http.Request is an
io.Reader, which doesn’t fit well if you have a struct – you need to encode
the struct first and then copy it to a reader.

func main() {
    u := User{Id: "US123", Balance: 8}
    b := new(bytes.Buffer)
    json.NewEncoder(b).Encode(u)
    res, _ := http.Post("https://httpbin.org/post", "application/json; charset=utf-8", b)
    io.Copy(os.Stdout, res.Body)
}

Decoding JSON on the server

Let’s say you’re expecting the client to send JSON data to the server. Easy,
decode it with json.NewDecoder(r.Body).Decode(&u). Here’s what that looks
like with error handling:

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        var u User
        if r.Body == nil {
            http.Error(w, "Please send a request body", 400)
            return
        }
        err := json.NewDecoder(r.Body).Decode(&u)
        if err != nil {
            http.Error(w, err.Error(), 400)
            return
        }
        fmt.Println(u.Id)
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Encoding JSON in a server response

Just the opposite of the above – call json.NewEncoder(w).Encode(u) to write
JSON to the response.

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        u := User{Id: "US123", Balance: 8}
        json.NewEncoder(w).Encode(u)
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Reading a JSON response from the server.

This time you’re going to read the response body in the client, after making
the request.

func main() {
    u := User{Id: "US123", Balance: 8}
    b := new(bytes.Buffer)
    json.NewEncoder(b).Encode(u)
    res, _ := http.Post("https://httpbin.org/post", "application/json; charset=utf-8", b)
    var body struct {
        // httpbin.org sends back key/value pairs, not map[string][]string
        Headers map[string]string `json:"headers"`
        Origin  string            `json:"origin"`
    }
    json.NewDecoder(res.Body).Decode(&body)
    fmt.Println(body)
}

That’s it! I hope it helps you a lot. Note that we only had to encode/decode
the response to a byte array one time – in every other case we passed it
directly to the reader/writer, which is one of the really nice things about Go,
and interfaces – in most cases it’s really easy to pass around streams, instead
of having to deal with the intermediate steps.

How to push to multiple Github accounts from the same machine

You might want/need to check out repos locally as several
different Github users. For example, my Github account is
kevinburke, but I push code at work as
kevinburkeshyp. If I could only
commit/push as one of those users on each machine, or I had to manually specify
the SSH key to use, it would be a big hassle!

Generate some SSH keys

First generate SSH keys for each account, and then upload the public keys to
the right Github accounts.

ssh-keygen -t rsa -b 4096 -f "$HOME/.ssh/kevinburke_rsa"
ssh-keygen -t rsa -b 4096 -f "$HOME/.ssh/kevinburkeshyp_rsa"

Set up SSH host aliasing

You’re going to use your standard SSH config for the host you use more often,
and an alias for the one you use less often. Here’s my SSH configuration for
the host I use more often – on my work laptop, it’s my work account.

Host github.com
    IdentityFile ~/.ssh/kevinburkeshyp_rsa
    HostName github.com
    User kevinburkeshyp

Now, set up an alias for your second account.

Host github.com-kevinburke
    IdentityFile ~/.ssh/kevinburke_rsa
    HostName github.com
    User kevinburke

The Host is set to github.com-kevinburke – this is what you write into your
SSH config locally. The HostName is what’s used for DNS resolution, which is
how this still works. Note you’ll want to switch the IdentityFile and the
User with the key and usernames you set when you generated the keys.

Now if you want to clone something into the second account, use the special
host alias:

git clone git@github.com-kevinburke:kevinburke/hamms.git

That’s it! It should just work.
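
If you’ve already cloned a repository using the plain github.com host, you can
point it at the alias after the fact:

git remote set-url origin git@github.com-kevinburke:kevinburke/hamms.git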

Which Commit Name/Email Address?

Most people configure their commit email address using the following command:

$ git config --global user.email you@example.com

This sets your email address in $HOME/.gitconfig and applies that value to
every repository/commit on your system. Obviously this is a problem if you have
two different email addresses! I want to use my Shyp email address with my Shyp
account and my personal email address with my personal account.

I fix this by putting an empty email in the global $HOME/.gitconfig:

[user]
    name = Kevin Burke
    email = ""

When I try to commit in a new repository, Git shows the following message:

*** Please tell me who you are.

Run

  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.

fatal: unable to auto-detect email address (got [email protected](none)')

Which is good! It prevents me from committing until I’ve configured my email
address. I then have helpers set up to easily configure this.

alias setinburkeemail="git config user.email [email protected]"

That’s it! What tips do you use for managing two different Github accounts?

Just Return One Error

Recently there’s been a trend in APIs to return more than one error, or to
always return an array of errors, to the client, when there’s a 400 or a 500
server error. From the JSON API specification:

Error objects MUST be returned as an array keyed by errors in the top level
of a JSON API document.

Here’s an example.

HTTP/1.1 422 Unprocessable Entity
Content-Type: application/vnd.api+json

{
  "errors": [
    {
      "status": "422",
      "source": { "pointer": "/data/attributes/first-name" },
      "title":  "Invalid Attribute",
      "detail": "First name must contain at least three characters."
    }
  ]
}

I’ve also seen it in model validation; for example, Waterline
returns an array of errors in the invalidAttributes field of the returned
error object.

I think this is overkill, and adds significant complexity to your application
for little benefit. Let’s look at why.

  • Clients are rarely prepared to show multiple errors. At Shyp, we returned
    an errors key from the API on a 400 or 500 error for about 18 months.
    All four of the API’s clients – the iOS app, the Android app, our error
    logger, and our internal dashboard – simply processed this by parsing and
    displaying errors[0], and ignoring the rest. It makes sense from the client’s
    perspective – the errors were being shown on a phone (not much room) or an
    alert box in the browser (show two, and you get the "Prevent this page from
    displaying more alerts" dialog). But if clients can’t make use of multiple
    errors, it makes me wonder why we return them from the API.

  • You rarely catch every problem with a request. Say a pickup is 1) missing
    the to address, 2) missing the city, 3) shipping to a country we don’t ship
    to, and 4) contains a photo of a firearm, which we can’t ship. If we return an
    error, it’s going to be ['Missing City', 'Missing Address Line 1']. There’s
    no way we’re going to submit your invalid address to our rates service and
    discover even more problems with the customs form, or try to parse the photo
    and learn that you’re trying to ship a firearm. So even if you fix the first
    errors, we’re still going to return you a 400 from the API.

  • Continuing to process the request is dangerous. Attaching multiple errors
    to an API response implies that you discovered a problem with an incoming
    request and continued processing it. This is dangerous! Every part of your
    system needs to be prepared to handle invalid inputs. This might be a useful
    defensive measure, but it’s also a lot of work for every function in your
    system to avoid making assumptions about the input that’s passed to it. In my
    experience, passing around bad inputs leads to unnecessary server errors and crashes.

  • Programming languages and libraries return one error. Generally if you
    call raise or throw, you’re limited to throwing a single Error object. If
    you’re using a third party library that can error, usually library functions
    are limited to returning one error. Returning a single error from your API or
    your model validation function aligns with 99% of error handling paradigms that
    currently exist.

A Caveat

There’s one place where returning multiple errors may make sense – a signup
form with multiple form fields, where it can be frustrating to only return one
problem at a time. There are two ways to deal with this:

  • Perform validation on the client, and detect/display multiple errors there.
    It’s a better user experience to do this, since feedback is more or less
    instantaneous. You’ll need the validation in your API anyway, but if someone
    manages to bypass the client checks, it’s probably fine to just return one
    error at a time.

  • Special case your returned error with a key like attributes, and return the
    necessary error messages, just for this case. I’d rather special case this
    error message, than force every other error to be an array just to handle this
    one place in the API.

Safely Moving a Large Shrinkwrapped Dependency

This past week the team at Shyp decided to fork the framework we use (Sails.js)
and the ORM it uses (Waterline). It was an interesting exercise in JS tooling,
and I wasted a lot of time on busywork, so I thought I’d share a little bit
about the process.

Why Fork?

We’ve been on an outdated fork of Sails/Waterline for a while. We chose not to
update because we were worried that updates would introduce new errors, and
many of the updates were new features that we didn’t want or need.

The recent ugly discussions between the former CEO of Balderdash and other core
contributors to Sails/Waterline made us rethink our dependency on the core
tools. We concluded that we were never likely to upgrade our framework. We
also figured we could remove the many parts of the framework that we don’t want
or need (sockets, cookies, blueprints, the horribly insecure "action routes",
CSRF, Mongo support, Grunt, sails generate, sails www, the multi-query
methods that aren’t safe against race conditions, &c), and apply bugfixes to
parts of the code that we knew were broken. Fortunately the changes already
started to pay for themselves at the end of an 8 hour day, as I’ll explain
later.

Rollout Plan

On its face, the rollout plan was pretty simple.

  1. Fork Sails to Shyp/sails.

  2. Update the HEAD of the master branch to point at the version we have
    installed in production.

  3. Point the Sails dependency at our fork (no code changes).

  4. Make changes that we want.

  5. Update the dependency.

Breakdowns

This broke down somewhere around (3), for a few reasons.

  • If I tried to npm install github.com/Shyp/sails.git, every single package
    Sails depended on, and all of their sub-dependencies, would update. We didn’t
    want this, for several reasons – I don’t trust package
    maintainers to stay on top of updates, or predict when changes/updates will
    break code, and package updates amount to a large amount of unvetted code being
    deployed into your API without any code review.

    I didn’t know how to work around this so I punted on it by hard-coding the
    update in the npm shrinkwrap file.

  • My next attempt was to try to generate a shrinkwrap file for Sails, and
    give our Sails fork the same versions of every dependency & sub-dependency we
    had in our API. This was the right idea, but failed in its implementation.
    Sails dependencies are currently alphabetized, but they weren’t in the version
    that we were running. I alphabetized the package.json list while generating
    a shrinkwrap file, which changed the dependency order in the Shyp/sails
    shrinkwrap file. This made it impossible to diff against the shrinkwrap file in
    our API, which listed Sails’s dependencies in an arbitrary order.

  • Finally I figured out what needed to happen. First, alphabetize the
    Sails dependency list in the Shyp API repository. Once that’s done, I
    could use that shrinkwrap file as the basis for the shrinkwrap file in
    the Sails fork, so the dependencies in the Sails project matched up with
    the Shyp dependencies. Specifically, I accomplished this by sorting the
    dependencies in the package.json file, placing the sorted file in
    api/node_modules/sails/package.json, and then re-running clingwrap with the
    same dependencies installed. I then verified the same version of each
    dependency was installed by sorting the JSON blob in the old and new shrinkwrap
    files, and ensuring that the diff was empty.

    Once that was done, I could update dependencies in the (alphabetized) fork,
    and the diff in Shyp API’s (now alphabetized) npm-shrinkwrap.json would
    be small. It was great to finally be able to generate a small diff in
    npm-shrinkwrap.json, and verify that it was correct.

Compounding the frustration was that each build took about 2 minutes, to wipe
and reinstall all versions of every dependency. It was a slow afternoon that
got more frustrating as it wore on.

Notes

  • Before adding or attempting to move any large NPM dependency, ensure
    that the dependencies are in alphabetical order. You’ll run into too many
    headaches if this is not the case, and you won’t be able to regenerate the
    npm-shrinkwrap file or be able to reason about it.

  • It’s surprising that the builtin build tools don’t enforce alphabetical
    order.

  • NPM’s shrinkwrap tool does not generate the same file every time you run it.
    Specifically, if you install a package, occasionally the shrinkwrap file will
    report the "from" string is a range, for example, "from": "colors@>=0.6.0-1 <0.7.0", and sometimes it will report the "from" string as a npmjs.com URL.
    For a while, we worked around this by generating the NPM shrinkwrap file
    twice – the second time we ran it, it would generate a consistent set of
    strings.

      rm -rf node_modules
      npm cache clear
      npm install --production
      npm shrinkwrap
      npm install --production
      npm shrinkwrap
    

    Eventually we switched to using clingwrap, which solves
    this problem by avoiding the "from" and "resolved" fields entirely.
    We’ve also heard NPM 3 contains updates to the shrinkwrap file, but NPM 3
    also believes our dependency tree contains errors, so we’ve been unable to
    effectively test it. Clingwrap also has some errors with NPM 3 and github
    repositories.

  • It was incredibly tempting to just say "fuck it" and deploy the updated
    version of every dependency and subdependency. Based on the amount of work it
    took me to figure out how to generate and check in consistent diffs, I came
    away thinking that there are a lot of Javascript applications out there that
    are unable to control/stabilize the versions of libraries that they use.

  • These are all problems you can avoid or mitigate by thinking incredibly hard
    before taking on a dependency, especially if, like Sails, the tool you’re
    leaning on has a lot of its own dependencies. Is there really an advantage to
    using _.first(arr) instead of arr[0]? Do you really need the pluralize
    library, or is it sufficient to just append an ‘s’ for your use case? Do
    you need to take a dependency to build a state machine, or can you use a
    dictionary and an UPDATE query?

Benefits

This took a lot of time to get right, and felt like busywork for at least three
of the four hours I spent working on this, but we finally got some benefits out
of this by the end of the day.

  • I committed a change to immediately error if anyone attempts to use the
    count() method, which attempts to pull the entire table into memory.
    Deploying this in our API application revealed four places where we were
    calling count()! Fortunately, they were all in test code. Still, it’s
    delightful to be able to automatically enforce a rule we previously had to rely
    on code review (and shared team memory) to catch.

  • We removed the Grunt dependency, which we never used. There are
    about 4500 lines of deletions waiting in pull requests.

  • We understand shrinkwrap a little better, and hopefully this process won’t be
    so painful next time.

  • We removed some incorrect Mongo-specific code from Waterline. This
    is the type of adaptation that we can only do by giving up compatibility, and
    we’re happy with how easy this is now.

So that was my Wednesday last week. As it stands, we’re planning on developing
this fork in the open. I don’t think you should use it – there are still
better alternatives – but if you’re stuck on Waterline with
Postgres, or using Sails as a REST API, you might want to take a look.

Everything you need to know about state machines

Do you have objects in your system that can be in different states (accounts,
invoices, messages, employees)? Do you have code that updates these objects
from one state to another? If so, you probably want a state machine.

What is a state machine?


To Everything, There is a Season, by Shawn Clover. CC BY-NC 2.0

At its root, a state machine defines the legal transitions between states in
your system, is responsible for transitioning objects between states, and
prevents illegal transitions.

You sound like an architecture astronaut, why do I need this?

Let’s talk about some bad things that can happen if you don’t have a state
machine in place.

(Some of these actually happened! Some are invented.)

  • A user submits a pickup. We pick up the item and ship it out. Two weeks
    later, a defect causes the app to resubmit the same pickup, and reassign a
    driver, for an item that’s already been shipped.

  • Two users submit a pickup within a second of each other. Our routing
    algorithm fetches available drivers, computes each driver’s distance to the
    pickup, and says the same driver is available for both pickups. We assign the
    same driver to both pickups.

  • A user submits a pickup. A defect in a proxy causes the submit request to
    be sent multiple times. We end up assigning four drivers to the pickup, and
    sending the user four text messages that their pickup’s been assigned.

  • An item is misplaced at the warehouse and sent straight to the packing
    station. Crucial steps in the shipping flow get skipped.

  • Code for updating the state of an object is littered between several
    different classes and controllers, which handle the object in different parts
    of its lifecycle. It becomes difficult to figure out how the object moves
    between various states. Customer support tells you that an item is in a
    particular state that makes it impossible for them to do their jobs. It takes
    great effort to figure out how it got there.

These are all really bad positions to be in! A lot of pain for you and a lot
of pain for your teams in the field.

You are already managing state (poorly)

Do you have code in your system that looks like this?

def submit(pickup_id):
    pickup = Pickups.find_by_id(pickup_id)
    if pickup.state != 'DRAFT':
        raise StateError("Can't submit a pickup that's already been submitted")
    pickup.state = 'SUBMITTED'
    pickup.save()
    MessageService.send_message(pickup.user.phone_number, 'Your driver is on the way!')

By checking the state of the pickup before moving to the next state, you’re
managing the state of your system! You are (at least partially) defining what
transitions are allowed between states, and what transitions aren’t. To avoid
the issues listed above, you’ll want to consolidate the state management in one
place in your codebase.

Okay, how should I implement the state machine?

There are a lot of libraries that promise to manage this for you. You don’t
need any of them (too much complexity), and you don’t need a DSL. You just need
a dictionary and a single database query.

The dictionary is for defining transitions, allowable input states, and the
output state. Here’s a simplified version of the state machine we use for
Pickups.

states = {
    submit: {
        before: ['DRAFT'],
        after:   'SUBMITTED',
    },
    assign: {
        before: ['SUBMITTED'],
        after:   'ASSIGNED',
    },
    cancel: {
        before: ['DRAFT', 'SUBMITTED', 'ASSIGNED'],
        after:   'CANCELED',
    },
    collect: {
        before: ['ASSIGNED'],
        after:   'COLLECTED',
    },
}

Then you need a single function, transition, that takes an object ID, the
name of a transition, and (optionally) additional fields you’d like to set on
the object if you update its state.

The transition function looks up the transition in the states dictionary,
and generates a single SQL query:

UPDATE table SET
    state = 'newstate',
    extraField1 = 'extraValue1'
WHERE
    id = $1 AND
    state IN ('oldstate1', 'oldstate2')
RETURNING *

If your UPDATE query returns a row, you successfully transitioned the item!
Return the item from the function. If it returns zero rows, you failed to
transition the item. It can be tricky to determine why this happened, since you
don’t know which (invalid) state the item was in that caused it to not match.
We cheat and fetch the record to give a better error message – there’s a race
there, but we note the race in the error message, and it gives us a good idea
of what happened in ~98% of cases.
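
Here’s a rough sketch of that transition function in Go, using database/sql and
a hypothetical pickups table – the Pickup type and the table and column names
are made up for illustration; adapt the idea to your own language and ORM:

type Pickup struct {
    ID    string
    State string
}

type Transition struct {
    Before []string
    After  string
}

var states = map[string]Transition{
    "submit":  {Before: []string{"DRAFT"}, After: "SUBMITTED"},
    "assign":  {Before: []string{"SUBMITTED"}, After: "ASSIGNED"},
    "cancel":  {Before: []string{"DRAFT", "SUBMITTED", "ASSIGNED"}, After: "CANCELED"},
    "collect": {Before: []string{"ASSIGNED"}, After: "COLLECTED"},
}

// transition attempts the named transition on one pickup with a single UPDATE.
// Zero rows back means the pickup was not in one of the allowed "before" states.
func transition(db *sql.DB, pickupID string, name string) (Pickup, error) {
    t, ok := states[name]
    if !ok {
        return Pickup{}, fmt.Errorf("unknown transition %q", name)
    }
    // Build one placeholder per allowed "before" state: $3, $4, ...
    placeholders := make([]string, len(t.Before))
    args := []interface{}{t.After, pickupID}
    for i, s := range t.Before {
        placeholders[i] = fmt.Sprintf("$%d", i+3)
        args = append(args, s)
    }
    query := fmt.Sprintf(
        "UPDATE pickups SET state = $1 WHERE id = $2 AND state IN (%s) RETURNING id, state",
        strings.Join(placeholders, ", "))
    var p Pickup
    err := db.QueryRow(query, args...).Scan(&p.ID, &p.State)
    if err == sql.ErrNoRows {
        return Pickup{}, fmt.Errorf("cannot %s pickup %s: not in a valid prior state", name, pickupID)
    }
    return p, err
}

The optional extra fields mentioned above would just be appended to the SET
clause and the args list in the same way.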

Note what you don’t want to do – you don’t want to update the object
in memory and then call .save() on it. Not only is .save()
dangerous, but fetching the item before you attempt to UPDATE it means
you’ll be vulnerable to race conditions between two threads attempting to
transition the same item to two different states (or, twice to the same state).
Ask, don’t tell – just try the transition and then handle success or failure
appropriately.

Say you send a text message to a user after they submit their pickup – if two
threads can successfully call the submit transition, the user will get 2 text
messages. The UPDATE query above ensures that exactly one thread will succeed
at transitioning the item, which means you can (and want to) pile on whatever
only-once actions you like (sending messages, charging customers, assigning
drivers, &c) after a successful transition and ensure they’ll run once.
For more about consistency, see Weird Tricks to Write Faster, More Correct
Database Queries.

Being able to issue queries like this is one of the benefits of using a
relational database with strong consistency guarantees. Your mileage (and
the consistency of your data) may vary when attempting to implement a state
transition like this using a new NoSQL database. Note that with the latest
version of MongoDB, it’s possible to read stale data, meaning that
(as far as I can tell) the WHERE clause might read out-of-date data, and you
can apply an inconsistent state transition.

Final Warnings

A state machine puts you in a much better position with respect to the
consistency of your data, and makes it easy to guarantee that actions performed
after a state transition (invoicing, sending messages, expensive operations)
will be performed exactly once for each legal transition, and will be
rejected for illegal transitions. I can’t stress enough how often this has
saved our bacon.

You’ll still need to be wary of code that makes decisions based on other
properties of the object. For example, you might set a driver_id on the
pickup when you assign it. If other code (or clients) decide to make a decision
based on the presence or absence of the driver_id field, you’re making a
decision based on the state of the object, but outside of the state machine
framework, and you’re vulnerable to all of the bullet points mentioned above.
You’ll need to be vigilant about these, and ensure all calling code is making
decisions based on the state property of the object, not any auxiliary
properties.

You’ll also need to be wary of code that tries a read/check-state/write
pattern; it’s vulnerable to the races mentioned above. Always always just try
the state transition and handle failure if it fails.

Finally, some people might try to sneak in code that just updates the object
state, outside of the state machine. Be wary of this in code reviews and try to
force all state updates to happen in the StateMachine class.

Weird Tricks to Write Faster, More Correct Database Queries

Adrian Colyer wrote a great summary of a recent paper by Peter
Bailis et al. In the paper the database researchers examine open source Rails
applications and observe that the applications apply constraints – foreign key
references, uniqueness constraints – in a way that’s not very performant or
correct.

I was pretty surprised to read about this! For the most part we have avoided
problems like this at Shyp, and I didn’t realize
how widespread this problem is; I certainly have written a lot of bad queries
in the past.

So! Let’s learn some tips for writing better queries. Everything below
will help you write an application that is more correct – it will avoid
consistency problems in your data – and more performant – you should be able
to achieve the same results as Rails, with fewer queries!

ps – The info below may be really obvious to you! Great! There are
a lot of people who aren’t familiar with these techniques, as the paper above
indicates.

Use database constraints to enforce business logic

Say you define an ActiveRecord class that looks like this:

class User < ActiveRecord::Base
  validates :email, uniqueness: true
end

What actually happens when you try to create a new user? It turns out Rails
will make 4 (four!) roundtrips to the database.

  1. BEGIN a transaction.

  2. Perform a SELECT to see if any other users have that email address.

  3. If the SELECT turns up zero rows, perform an INSERT to add the row.

  4. Finally, COMMIT the result.

This is pretty slow! It also increases the load on your application and your
database, since you need to make 4 requests for every INSERT. Bailis et al also
show that with your database’s default transaction isolation level, it’s
possible to insert two records with the same key. Furthermore, there
are some ActiveRecord queries which skip the built-in validations, as Gary
Bernhardt discussed in his video, "Where Correctness Is Enforced", way back
in 2012. Any query which inserts data and skips the validations can
compromise the integrity of your database.

What if I told you you can do the same insert in one query instead of four,
and it would be more correct than the Rails version? Instead of Rails’s
migration, write this:

CREATE TABLE users (email TEXT UNIQUE);

The UNIQUE is the key bit there; it adds a unique key on the table. Then,
instead of wrapping the query in a transaction, just try an INSERT.

> insert into users (email) values ('foo@example.com');
INSERT 0 1
> insert into users (email) values ('foo@example.com');
ERROR:  duplicate key value violates unique constraint "users_email_key"
DETAIL:  Key (email)=(foo@example.com) already exists.

You’ll probably need to add better error handling around the failure case – at
least we did, for the ORM we use. But at any level of query volume, or
if speed counts (and it probably does), it’s worth it to investigate this.
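
For example, in Go with the lib/pq driver (a sketch – the details depend on
your driver and ORM), you can check the error code and turn the constraint
violation into a friendlier error:

func createUser(db *sql.DB, email string) error {
    _, err := db.Exec("INSERT INTO users (email) VALUES ($1)", email)
    if pqErr, ok := err.(*pq.Error); ok && pqErr.Code.Name() == "unique_violation" {
        // The database enforced uniqueness for us; report it nicely.
        return fmt.Errorf("a user with email %s already exists", email)
    }
    return err
}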

Just Try the Write

Say you wanted to read a file. You could write this:

if not os.path.isfile(filename):
    raise ValueError("File does not exist")
with open(filename, 'r') as f:
    f.read()
    ...

But that would still be vulnerable to a race! What if the OS or another thread
deleted the file between the isfile check and the with open line – the
latter would throw an IOError, which won’t be handled. Far better to just try
to read the file and handle errors appropriately.

try:
    with open(filename, 'r') as f:
        f.read()
        ...
except IOError:
    raise ValueError("File does not exist")

Say you have a foreign key reference – phone_numbers.user_id refers to
users.id, and you want to validate that the user_id is valid. You could do:

def write_phone_number(number, user_id):
    user = Users.find_by_id(user_id)
    if user is None:
        raise NotFoundError("User not found")
    Number.create(number=number, user_id=user_id)

Just try to write the number! If you have a foreign key constraint in the
database, and the user doesn’t exist, the database will tell you so. Otherwise
you have a race between the time you fetch the user and the time you create the
number.

def write_phone_number(number, user_id):
    try:
        Number.create(number=number, user_id=user_id)
    except DatabaseError as e:
        if is_foreign_key_error(e):
            raise NotFoundError("Don't know that user id")
        raise

Updates Should Compose

Let’s say you have the following code to charge a user’s account.

def charge_customer(account_id, amount=20):
    account = Accounts.get_by_id(account_id)
    account.balance = account.balance - 20
    if account.balance <= 0:
        raise ValueError("Negative account balance")
    else:
        account.save()

Under the hood, here’s what that will generate:

SELECT * FROM accounts WHERE id = ?
UPDATE accounts SET balance = 30 WHERE id = ?;

So far, so good. But what happens if two requests come in to charge the account
at the same time? Say the account balance is $100.

  1. Thread 1 wants to charge $30. It reads the account balance at $100.

  2. Thread 2 wants to charge $15. It reads the account balance at $100.

  3. Thread 1 subtracts $30 from $100 and gets a new balance of $70.

  4. Thread 2 subtracts $15 from $100 and gets a new balance of $85.

  5. Thread 1 attempts to UPDATE the balance to $70.

  6. Thread 2 attempts to UPDATE the balance to $85.

This is clearly wrong! The balance after $45 of charges should be $55, but it
was $70, or $85, depending on which UPDATE went last. There are a few ways to
deal with this:

  • Create some kind of locking service to lock the row before the read and
    release it after you write the balance. The other thread will wait for the
    lock before it reads/writes the balance. These are hard to get right and
    will carry a latency penalty.

  • Run the update in a transaction; this will create an implicit lock on
    the row. If the transaction runs at the SERIALIZABLE or REPEATABLE READ
    levels, this is safe. Note most databases will set the default
    transaction level to READ COMMITTED, which won’t protect against the issue
    referenced above.

  • Skip the SELECT and write a single UPDATE query that looks like this:

      UPDATE accounts SET balance = balance - 20 WHERE id = ?;
      

That last UPDATE is composable. You can run a million balance updates in any
order, and the end balance will be exactly the same, every time. Plus you don’t
need a transaction or a locking service; it’s exactly one write (and faster
than the .save() version above!)

But if I do just one UPDATE, I can’t check whether the balance will go below
zero!
You can – you just need to enforce the nonnegative constraint in the
database, not the application.

CREATE TABLE accounts (
    id integer primary key,
    balance integer CHECK (balance >= 0)
);

That will throw any time you try to write a negative balance, and you can
handle the write failure in the application layer.

Update: Apparently MySQL accepts check constraints as valid syntax, but does
not execute them, so you might need to take a different
approach. Thanks olivier for pointing this out!

The key point is that your updates should be able to run in any order without
breaking the application. Use relative ranges – balance = balance - 20 for
example – if you can. Or, only apply the UPDATE if the previous state of the
database is acceptable, via a WHERE clause. The latter technique is very useful
for state machines:

UPDATE pickups SET status='submitted' WHERE status='draft' AND id=?;

That update will either succeed (if the pickup was in draft), or return zero
rows. If you have a million threads try that update at the same time, only one
will succeed – an incredibly valuable property!

Beware of save()

The save() function in an ORM is really unfortunate for two reasons. First,
to call .save(), you have to retrieve an instance of the object via a
SELECT call. If you have an object’s ID and the fields you want to write, you
can skip the read entirely by just trying the UPDATE. The extra read introduces
more latency and the possibility of writing stale data.

Second, some implementations of .save() will issue an UPDATE and update
every column.

This can lead to updates getting clobbered. Say two requests come in, one to
update a user’s phone number, and the other to update a user’s email address,
and both call .save() on the record.

UPDATE users SET email='new@example.com', phone_number='oldnumber' WHERE id = 1;
UPDATE users SET email='old@example.com', phone_number='newnumber' WHERE id = 1;

In this scenario the first UPDATE gets clobbered, and the old email gets
persisted. This is really bad! We told the first thread that we updated the
email address, and then we overwrote it. Your users and your customer service
team will get really mad, and this will be really hard to reproduce. Be wary of
.save – if correctness is important (and it should be!), use an UPDATE with
only the column that you want.
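
For instance, here’s a sketch in Go with database/sql (hypothetical users table
and columns) – update only the email column, so a concurrent phone-number
update can’t be clobbered by this write:

func updateEmail(db *sql.DB, id int, email string) error {
    // Only the email column is written; other columns are left alone.
    res, err := db.Exec("UPDATE users SET email = $1 WHERE id = $2", email, id)
    if err != nil {
        return err
    }
    count, err := res.RowsAffected()
    if err != nil {
        return err
    }
    if count == 0 {
        return fmt.Errorf("no user with id %d", id)
    }
    return nil
}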

Partial Indexes

If you thought the previous section was interesting, check this out. Say
we have a pickups table. Each pickup has a driver ID and a status (DRAFT,
ASSIGNED, QUEUED, etc).

CREATE TABLE pickups (
    id integer,
    driver_id INTEGER REFERENCES drivers(id),
    status TEXT
);

We want to enforce a rule that a given driver can only have one ASSIGNED pickup
at a time. You can do this in the application by using transactions and writing
very, very careful code… or you can ask Postgres to do it for you:

CREATE UNIQUE INDEX "only_one_assigned_driver" ON pickups(driver_id) WHERE
    status = 'ASSIGNED';

Now watch what happens if you attempt to violate that constraint:

> INSERT INTO pickups (id, driver_id, status) VALUES (1, 101, 'ASSIGNED');
INSERT 0 1
> INSERT INTO pickups (id, driver_id, status) VALUES (2, 101, 'DRAFT');
INSERT 0 1 -- OK, because it's draft; doesn't hit the index.
> INSERT INTO pickups (id, driver_id, status) VALUES (3, 101, 'ASSIGNED');
ERROR:  duplicate key value violates unique constraint "only_one_assigned_driver"
DETAIL:  Key (driver_id)=(101) already exists.

We got a duplicate key error when we tried to insert a second ASSIGNED
record! Because you can trust the database to not ever screw this up, you have
more flexibility in your application code. Certainly you don’t have to be as
careful to preserve the correctness of the system, since it’s impossible to put
it in a bad state!

Summary

In many instances your ORM may be generating a query that’s both slow, and can
lead to concurrency errors. That’s the bad news. The good news is you can
write database queries that are both faster and more correct!

A good place to start is by reversing the traditional model of ORM development.
Instead of starting with the code in the ORM and working backwards to the
query, start with the query you want, and figure out how to express that in
your ORM. You’ll probably end up using the lower-level methods offered by your
ORM a lot more, and you’ll probably discover defects in the way that your ORM
handles database errors. That’s okay! You are on a much happier path.

Thanks to Alan Shreve and Kyle Conroy for reading drafts
of this post.

Maybe Automatically Updating Dependencies Isn’t a Great Idea

There’s a distressing feeling in the Node.js community that apps without
up-to-date dependencies are somehow not as good, or stable, as apps
that always keep their dependencies up to date. So we see things like
greenkeeper.io and badges that show whether the
project’s dependencies are up to date (and, implicitly, shame
anyone whose dependencies aren’t green).

I’m not talking about updating dependencies for a good reason (covered below);
I’m talking about the practice of updating dependencies for the sake of keeping
the dependencies updated. In the best possible case, a dependency update does
nothing. The application keeps running exactly as it was. In the worst case,
your servers crash, millions of dollars of business value are affected, state
is compromised, or worse.

One day at Twilio we tried to deploy code that had a defect in it. The
deployment tool noticed the errors and tried to rollback the deployment by
putting the old nodes in the load balancer and taking the new ones out.
Except… when it went to take the new nodes out, the worker process crashed.
So we ended up with both the new (faulty) nodes and the old nodes in the load
balancer, and our most reliable tool for cluster management couldn’t pull the
bad nodes out.

We did some investigation and it turns out one of our dependencies had updated.
Well, it wasn’t a direct dependency – we locked down all of those – it was a
dependency of a dependency, which upgraded to version 3, and introduced an
incompatibility.

Fundamentally, updating dependencies is a dangerous operation. Most people
would never deploy changes to production without having them go through code
review, but I have observed that many feel comfortable bumping a version number
in package.json without looking at the diff of what changed in the dependency.

New releases of dependencies are usually less tested in the wild than older
versions. We know the best predictor of the number of errors in code is the
number of lines written. The current version of your dependency (which you know
works) has 0 lines of diff when compared with itself, but the newest release
has a greater-than-0 number of lines of code changed. Code changes are risky,
and so are dependency updates.

Errors and crashes introduced by dependency updates are difficult to debug.
Generally, the errors are not in your code; they’re deep in a node_modules
or site-packages folder or similar. You are probably more familiar with your
application code than the intricacies of your third party tools. Tracking down
the error usually involves figuring out what version of the code used to be
running (not easy!) and staring at a diff between the two.

But my tests will catch any dependency errors, you say. Maybe you have great
coverage around your application. But do the dependencies you’re pulling in
have good test coverage? Are the interactions between your dependencies tested?
How about the interactions between the dependency and every possible version of
a subdependency? Do your tests cover every external interface?

But the dependencies I’m pulling in use semver, so I’ll know if things
break.
This only saves you if you actually read the CHANGELOG, or the package
author correctly realizes that a change is breaking. Otherwise you get situations
like this, which just seems sad; the reporter must have taken time
to update the package, then gotten an error report, then had to figure out
what change crashed the servers, then mitigated the issue. A lot of downside
there – wasted time and the business fallout of a crashing application, and
I’m struggling to figure out what benefit the reporter got from updating to the
latest possible version.

When to Update Dependencies

Generally I think you should lock down the exact versions of every dependency
and sub-dependency that you use. However, there are a few cases where it makes
sense to pull in the latest and greatest thing. In every case, at the very
least I read the CHANGELOG and scan the package diff before pulling in the
update.

Security Upgrades

An application issues a new release to patch a security vulnerability, and you
need to get the latest version of the app to patch the same hole. Even here,
you want to ensure that the vulnerability actually affects your application,
and that the changed software does not break your application. You may not want
to grab the entire upstream changeset, but only port in the patch that fixes
the security issue.

Performance Improvement

Recently we noticed our web framework was sleeping for 50ms on every
POST and PUT request. Of course you would want to upgrade to avoid this
(although we actually fixed it by removing the dependency).

You Need a Hot New Feature

We updated mocha recently because it wouldn’t print out stack traces for things
that weren’t Error objects. We submitted a patch and upgraded mocha to
get that feature.

You Need a Bug Fix

Your version of the dependency may have a defect, and upgrading will fix the
issue. Ensure that the fix was actually coded correctly.

A Final Warning

Updated dependencies introduce a lot of risk and instability into your project.
There are valid reasons to update and you’ll need to weigh the benefit against
the risk. But updating dependencies just for the sake of updating them is
going to get you into trouble.

You can avoid all of these problems by not adding a dependency in the first
place. Think really, really hard before reaching for a package to solve your
problem. Actually read the code and figure out if you need all of it or just
a subset. Maybe if you need to pluralize your application’s model names, it’s
okay to just add an ‘s’ on the end, instead of adding the pluralize library.
Sure, the Volcano model will be Volcanos instead of Volcanoes, but maybe that’s
okay for your system.

Unfortunately my desired solution for adding dependencies – fork a library, rip
out the parts you aren’t using, and copy the rest directly into your source
tree – isn’t too popular. But I think it would help a lot with enforcing the
idea that you own your dependencies and the code changes inside.

Don’t Use Sails (or Waterline)

The Shyp API currently runs on top of the Sails JS framework. It’s
an extremely popular framework – the project has over 11,000 stars on Github,
and it’s one of the top 100 most popular projects on the site. However, we’ve
had a very poor experience with it, and with Waterline, the ORM
that runs underneath it. Remember when you learned that java.net.URL does a
DNS lookup to check whether a URL is equivalent to another URL? Imagine
finding an issue like that every two weeks or so and that’s the feeling I get
using Sails and Waterline.

The project’s maintainers are very nice, and considerate – we have our
disagreements about the best way to build software, but they’re generally
responsive. It’s also clear from the error handling in the project (at least
the getting started docs) that they’ve thought a lot about the first run
experience, and helping people figure out the answers to the problems they
encounter trying to run Sails.

That said, here are some of the things we’ve struggled with:

  • The sailsjs.org website broke all incoming Google links ("sails
    views", "sails models", etc), as well as about 60,000 of its own autogenerated
    links, for almost a year. Rachael Shaw has been doing great work to
    fix them again, but it was pretty frustrating that documentation was so hard
    to find for that long.

  • POST and PUT requests that upload JSON or form-urlencoded data sleep for
    50ms in the request parser. This sleep occupied about 30% of the
    request time on our servers, and 50-70% of the time in controller tests.

  • The community of people who use Sails doesn’t seem to care much about
    performance or correctness. The above error was present for at least a year and
    not one person wondered why simple POST requests take 50ms longer than a simple
    GET. For a lot of the issues above and below it seems like we are the only
    people who have run into them, or care.

  • By default Sails generates a route for every function you define in a
    controller, whether it’s meant to be public or not. This is a huge
    security risk, because you generally don’t think to write policies for these
    implicitly-created routes, so it’s really easy to bypass any authentication
    rules you have set up and hit a controller directly.

  • Blueprints are Sails’s solution for a CRUD app and we’ve observed a lot of
    unexpected behavior with them. For one example, passing an unknown column name
    as the key parameter in a GET request (?foo=bar) will cause the server to
    return a 500.

  • If you want to test the queries in a single model, there’s no way to do it
    besides loading/lifting the entire application, which is dog slow – on our
    normal sized application, it takes at least 7 seconds to begin running a
    single test.

  • Usually when I raise an issue on a project, the response is that there’s some
    newer, better thing being worked on, that I should use instead. I appreciate
    that, but porting an application has lots of downside and little upside. I
    also worry about the support and correctness of the existing tools that are
    currently in wide use.

  • Hardcoded typos in command arguments.

  • No documented responsible disclosure policy, or information on how
    security vulnerabilities are handled.

Waterline

Waterline is the ORM that powers Sails. The goal of Waterline is to provide
the same query interface for any database that you would like to use.
Unfortunately, this means that the supported feature set is the least common
denominator of every supported database. We use Postgres, and by default this
means we can’t get a lot of leverage out of it.

These issues are going to be Postgres oriented, because that’s the database we
use. Some of these have since been fixed, but almost all of them (apart from
the data corruption issues) have bitten us at one point or another.*

  • No support for transactions. We had to write our own
    transaction interface completely separate from the ORM.

  • No support for custom Postgres types (bigint, bytea, array). If you set
    a column to type 'array' in Waterline, it creates a text field in the
    database and serializes the array by calling JSON.stringify.

  • If you define a column to be type 'integer', Waterline will reject things
    that don’t look like integers… except for Mongo IDs, which look
    like "4cdfb11e1f3c000000007822". Waterline will pass these through to the
    database no matter what backend data store you are using.

  • Waterline offers a batch interface for creating items, e.g.
    Users.create([user1, user2]). Under the hood, however, creating N items
    issues N insert requests for one record each, instead of one large request. 29
    out of 30 times, the results will come back in order, but there used to be a
    race where sometimes create will return results in a different order
    than the order you inserted them. This caused a lot of intermittent,
    hard-to-parse failures in our tests until we figured out what was going on.

  • Waterline queries are case insensitive; that is, Users.find().where(name: 'FOO') will turn into SELECT * FROM users WHERE name = LOWER('FOO');.
    There’s no way to turn this off. If you ask Sails to generate an index for you,
    it will place the index on the uppercased column name, so your queries
    will miss it. If you generate the index yourself, you pretty much have to use
    the lowercased column value & force every other query against the database to
    use that as well.

  • The .count function used to work by pulling the entire table into memory
    and checking the length of the resulting array.

  • No way to split out queries to send writes to a primary and reads to a
    replica. No support for canceling in-flight queries or setting a timeout on
    them.

  • The test suite is shared by every backend adapter; this makes it impossible
    for the Waterline team to write tests for database-specific behavior or failure
    handling (unique indexes, custom types, check constraints, etc). Any behavior
    specific to your database is poorly tested at best.

  • "Waterline truncated a JOIN table". There are probably more
    issues in this vein, but we excised all .populate, .associate, and
    .destroy calls from our codebase soon after this, to reduce the risk of data
    loss.

  • When Postgres raises a uniqueness or constraint violation, the resulting
    error handling is very poor. Waterline used to throw an object instead of
    an Error instance, which means that Mocha would not print anything about
    the error unless you called console.log(new Error(err)); to turn it into
    an Error. (It’s since been fixed in Waterline, and I submitted a patch to
    Mocha to fix this behavior, but we stared at empty stack traces for
    at least six months before that). Waterline attempts to use regex matching to
    determine whether the error returned by Postgres is a uniqueness constraint
    violation, but the regex fails to match other types of constraint failures like
    NOT NULL errors or partial unique indexes.

  • The error messages returned by validation failures are only appropriate to
    display if the UI can handle newlines and bullet points. Parsing the error
    message to display it in any other context is very hard; we try really hard
    to dig the underlying pg error object out and use that instead. Mostly
    nowadays we’ve been creating new database access interfaces that wrap the
    Waterline model instances and handle errors appropriately.

Conclusion

I appreciate the hard work put in by the Sails/Waterline team and contributors,
and it seems like they’re really interested in fixing a lot of the issues
above. I think it’s just really hard to be an expert in sixteen different
database technologies, and write a good library that works with all of them,
especially when you’re not using a given database day in and day out.

You can build an app that’s reliable and performant on top of Sails and
Waterline – we think ours is, at least. You just have to be really careful,
avoid the dangerous parts mentioned above, and verify at every step that the
ORM and the router are doing what you think they are doing.

The sad part is that in 2015 you have so many options for building a reliable
service: frameworks that let you write code securely and reliably, and that
can scale to handle large numbers of open connections with low latency. Using
a framework and an ORM doesn’t mean you need to enter a world of pain. You
don’t need to constantly battle your framework, or worry whether your ORM is
going to delete your data or whether it’s generating the correct query based
on your code. Better options are out there! Here are some of the more reliable
options I know about.

  • Instagram used Django well through its $1 billion
    acquisition. Amazing community and documentation, and the project is incredibly
    stable.

  • You can use Dropwizard from either Java or Scala, and I
    know from experience that it can easily handle hundreds/thousands of open
    connections with incredibly low latency.

  • Hell, the Go standard library has a lot of reliable, multi-threaded,
    low latency tools for doing database reads/writes and server handling. The
    third party libraries are generally excellent.

I’m not amazingly familiar with backend Javascript – this is the only
server framework I’ve used – but if I had to use Javascript I would check out
whatever the Walmart and the Netflix people are using to write Node, since they
need to care a lot about performance and correctness.

*: If you are using a database without traditional support for
transactions and constraints, like Mongo, correct behavior is going to be very
difficult to verify. I wish you the best of luck.

Stepping Up Your Pull Request Game

Okay! You had an idea for how to improve the project, the maintainers indicated
they’d approve it, you checked out a new branch, made some changes, and you are
ready to submit it for review. Here are some tips for submitting a changeset
that’s more likely to pass through code review quickly, and make it easier for
people to understand your code.

A Very Brief Caveat

If you are new to programming, don’t worry about getting these details
right!
There are a lot of other useful things to learn first, like the
details of a programming language, how to test your code, how to retrieve
data you need, then parse it and transform it into a useful format. I started
programming by copy/pasting text into the WordPress "edit file" UI.

Write a Good Commit Message

A big, big part of your job as an engineer is communicating what you’re doing
to other people. When you communicate, you can be more sure that you’re
building the right thing, you increase usage of things that you’ve built, and
you ensure that people don’t assign credit to someone else when you build
things. Part of this job is writing a clear message for the rest of the team
when you change something.

Fortunately you are probably already doing this! If you write a description of
your change in the pull request summary field, you’re already halfway there.
You just need to put that great message in the commit instead of in the pull
request.

The first thing you need to do is stop typing git commit -m "Updated the
widgets" and start typing just git commit. Git will try to open a text editor;
you’ll want to configure this to use the editor of your choice.
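
For example, to have Git open Vim (or substitute whichever editor you prefer)
when it needs a message, something like this should work:

    # use Vim (swap in your preferred editor) whenever Git needs you to edit a message
    git config --global core.editor vim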

A lot of words have been written about writing good commit messages; Tim Pope
wrote my favorite post about it.
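
Roughly, the shape he suggests is a short summary line (50 characters or so),
a blank line, then body text wrapped at about 72 characters; a sketch:

    Summarize the change in 50 characters or less

    After a blank line, explain what changed and why, wrapped at about 72
    characters. The summary is what shows up in git log --oneline and in
    Github’s UI, so it should stand on its own.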

How big should a commit be? Bigger than you think; I rarely use more than one
commit in a pull request, though I try to limit pull requests to 400 lines
removed or added, and sometimes break a multiple-commit change into multiple
pull requests.

Logistics

If you use Vim to write commit messages, the editor will show you if the
summary goes beyond 50 characters.

If you write a great commit message, and the pull request is one commit, Github
will display it straight in the UI! Here’s an example commit in the terminal:

And here’s what that looks like when I open that commit as a pull request in
Github – note Github has autofilled the subject/body of the message.

Note that if your summary is too long, Github will truncate it:

Review Your Pull Request Before Submitting It

Before you hit "Submit", be sure to look over your diff so you don’t submit
an obvious error. In this pass you should be looking for things like typos,
syntax errors, and debugging print statements which are left in the diff. One
last read through can be really useful.

Everyone struggles to get code reviews done, and it can be frustrating for
reviewers to find things like print statements in a diff. They might be more
hesitant to review your code in the future if it has obvious errors in it.

Make Sure The Tests Pass

If the project has tests, try to verify they pass before you submit your
change. It’s annoying that Github doesn’t make the test pass/failure state more
obvious before you submit.

Hopefully the project has a test convention – a make test command in a
Makefile, instructions in a CONTRIBUTING.md, or automatic builds via Travis
CI. If you can’t get the tests set up, or it seems like there’s an unrelated
failure, add a separate issue explaining why you couldn’t get the tests
running.

If the project has clear guidelines on how to run the tests, and they fail on
your change, it can be a sign you weren’t paying close enough attention.

The Code Review Process

I don’t have great advice here, besides a) patience is a virtue, and b)
the faster you are to respond to feedback, the easier it will go for
reviewers. Really you should just read Glen Sanford’s excellent post on code
review.

Ready to Merge

Okay! Someone gave you a LGTM and it’s time to merge. As a part of the code
review process, you may have ended up with a bunch of commits that look like
this:
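
Something along these lines (a made-up feature branch history; the commit
hashes and messages are hypothetical):

    # hypothetical history on a feature branch
    $ git log --oneline master..add-rate-limiting
    9c1b2ef fix tests
    8d27a10 typo
    5b90cc3 address review comments
    1f4e6d2 Add rate limiting to the /users endpoint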

These commits are detritus, the prototypes of the creative process, and they
shouldn’t be part of the permanent record. Why fix them? Because six months
from now, when you’re staring at a piece of confusing code and trying to figure
out why it’s written the way it is, you really want to see a commit message
explaining the change that looks like this, and not one that
says "fix tests".

There are two ways to get rid of these:

Git Amend

If you just did a commit and have a new change that should be part of the same
commit, use git commit --amend to add changes to the current commit. If you
don’t need to change the message, use git commit --amend --no-edit (I map this
to git alter in my git config).
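
One way to define that kind of alias (a sketch; you could also add it under
[alias] in your ~/.gitconfig by hand):

    # define "git alter" as a shortcut for amending without editing the message
    git config --global alias.alter "commit --amend --no-edit"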

Git Rebase

You want to squash all of those typo fix commits into one. Steve Klabnik has a
good guide for how to do this. I use this script, saved as rb:

    #!/usr/bin/env bash
    # Rebase the current branch on top of an up-to-date copy of a target branch
    # (default: master), then force push the rebased branch back to origin.
    branch="$1"
    if [[ -z "$branch" ]]; then
        branch='master'
    fi
    BRANCHREF="$(git symbolic-ref HEAD 2>/dev/null)"
    BRANCHNAME=${BRANCHREF##refs/heads/}
    if [[ "$BRANCHNAME" == "$branch" ]]; then
        echo "Switch to a branch first"
        exit 1
    fi
    git checkout "$branch"
    git pull origin "$branch"
    git checkout "$BRANCHNAME"
    # Any second argument (for example -i) is passed through to git rebase.
    if [[ -n "$2" ]]; then
        git rebase "$branch" "$2"
    else
        git rebase "$branch"
    fi
    git push origin --force "$BRANCHNAME"

If you run that with rb master, it will pull down the latest master from
origin, rebase your branch against it, and force push to your branch on origin.
Run rb master -i and select "squash" to squash your commits down to one.
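
When the interactive rebase starts, git opens a todo list in your editor;
change pick to squash on every commit except the first to fold them all into
one. Continuing the made-up history from above:

    pick 1f4e6d2 Add rate limiting to the /users endpoint
    squash 5b90cc3 address review comments
    squash 8d27a10 typo
    squash 9c1b2ef fix tests
    # oldest commit first; each squash folds that commit into the one above it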

As a side effect of rebasing, you’ll resolve any merge conflicts that have
developed between the time you opened the pull request and the merge time! This
can take the headache away from the person doing the merge, and help prevent
mistakes.