Posts Tagged With: Design

Things to Use Instead of JWT

You might have heard that you shouldn't be using JWT. That advice is correct - you really shouldn't use it. In general, specifications that allow algorithm negotiation - and thus let an attacker influence which algorithm gets used - have more problems than ones that don't (see TLS). N libraries need to implement M different encryption and decryption algorithms, and an attacker only needs to find a vulnerability in one of them, or a vulnerability in their combination. JWT has seen both of these errors; unlike TLS, it hasn't already been deployed onto billions of devices around the world.

This is a controversial opinion, but implementation errors should lower your opinion of a specification. An error in one implementation means other implementations are more likely to contain the same or different errors. It implies that it's more difficult to correctly implement the spec. JWT implementations have been extremely buggy.

But The Bad Implementations Were Written by Bad Authors

In the 1800's rail cars were coupled by an oval link on one end and a socket on the other. A railway worker would drop a pin down through the socket, keeping the link in place.

The train engineer could not see the coupler at the time of coupling, and the operation was fraught. Many workers had their hands mangled. Worse, there was no buffer between the cars, and it was easy to get crushed if the coupling missed. Tens of thousands of people died.

Link-and-pin railway coupler

Still, the railroads stuck with them because link-and-pin couplers were cheap. You could imagine excuses being made about the people who died, or lost their fingers or hands; they were inattentive, they weren't following the right procedure, bad luck happens and we can't do anything about it, etc.

In 1893 Congress outlawed the link-and-pin coupler and deaths fell by one third within a year, and that's despite the high cost of switching to automatic couplers.

Alternatives

Update, January 2018: PAST is an excellent library implementing many of the suggestions here. Use PAST. Follow its development for use in your language.

What should you be using instead of JWT? That depends on your use case.

I want users to authenticate with a username and secret token

Have them make a request to your server over TLS; TLS provides the encryption layer, so you don't need any additional encryption or hashing on top of it.

I want to post a public key and have users send me encrypted messages with it

The technical name for this is asymmetric encryption; only the user with the private key can decrypt the message. This is pretty magical; the magic is that people don't need the private key to send you messages that you can read. It was illegal to ship this technology outside of the US for most of the 90's.

JWT supports public key encryption with RSA, but you don't want to use it for two reasons. One, RSA is notoriously tricky to implement, especially when compared with elliptic curve cryptography. Thomas Ptáček explains:

The weight of correctness/safety in elliptic curve systems falls primarily on cryptographers, who must provide a set of curve parameters optimized for security at a particular performance level; once that happens, there aren't many knobs for implementors to turn that can subvert security. The opposite is true in RSA. Even if you use RSA-OAEP, there are additional parameters to supply and things you have to know to get right.

You don't want the random person implementing your JWT library to be tuning RSA. Two, the algorithm used by JWT doesn't support forward secrecy. With JWT, someone can slurp all of your encrypted messages, and if they get your key later, they can decrypt all of your messages after the fact. With forward secrecy, even if your keys are exposed later, an attacker can't read previous messages.

A better elliptic curve library is NaCl's box, which supports exactly one encryption construction and doesn't require any configuration. Or you can have users send you messages over TLS, which also uses public key encryption.
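
To give a sense of how little ceremony box involves, here's a minimal sketch in Go using golang.org/x/crypto/nacl/box; the message and key handling are illustrative only, not a drop-in design:

package main

import (
	"crypto/rand"
	"fmt"

	"golang.org/x/crypto/nacl/box"
)

func main() {
	// Each party generates a keypair; there are no curve parameters or
	// padding modes for the implementor to choose.
	senderPub, senderPriv, err := box.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	recipientPub, recipientPriv, err := box.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}

	// The nonce must be unique per message; 24 random bytes is the usual choice.
	var nonce [24]byte
	if _, err := rand.Read(nonce[:]); err != nil {
		panic(err)
	}

	// Seal appends the ciphertext to the nonce so both travel together.
	sealed := box.Seal(nonce[:], []byte("hello"), &nonce, recipientPub, senderPriv)

	// Only the recipient's private key (plus the sender's public key) can open it.
	plaintext, ok := box.Open(nil, sealed[24:], &nonce, senderPub, recipientPriv)
	fmt.Println(ok, string(plaintext))
}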

I want to encrypt some data so third parties can't read it, and then be able to decrypt it later

You might use this for browser cookies (if you don't want the user to be able to read or modify the payload), or for API keys / other secrets that need to be stored at rest.

You don't want to use JWT for this because the payload (the middle part) is unencrypted. You can encrypt the entire JWT object, but if you are using a different, better algorithm to encrypt the JWT or the data inside it, there's not much point in using JWT at all.

The best choice for this kind of two-way (symmetric) encryption is NaCl's secretbox. Secretbox is not vulnerable to downgrade or protocol-switching attacks, and the Go secretbox implementation was written by a world-renowned cryptographer who also writes and verifies cryptographic code for Google Chrome.
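
Here's a minimal sketch of secretbox in Go, using golang.org/x/crypto/nacl/secretbox; in a real application the key would come from your secret store rather than being generated inline:

package main

import (
	"crypto/rand"
	"fmt"

	"golang.org/x/crypto/nacl/secretbox"
)

func main() {
	// One 256-bit key; there is no algorithm choice to get wrong.
	var key [32]byte
	if _, err := rand.Read(key[:]); err != nil {
		panic(err)
	}

	// The nonce must be unique per message; prepend it to the ciphertext
	// so it can be recovered at decryption time.
	var nonce [24]byte
	if _, err := rand.Read(nonce[:]); err != nil {
		panic(err)
	}
	sealed := secretbox.Seal(nonce[:], []byte("secret API key"), &nonce, &key)

	var openNonce [24]byte
	copy(openNonce[:], sealed[:24])
	plaintext, ok := secretbox.Open(nil, sealed[24:], &openNonce, &key)
	fmt.Println(ok, string(plaintext))
}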

I want to send some data and have users send it back to me and verify that it hasn't been tampered with

This is the JWT use case. The third part of a JWT is the signature, which is supposed to verify that the header and the payload have not been tampered with since you signed them.

The problem with JWT is the user gets to choose which algorithm to use. In the past, implementations have allowed users to pass "none" as the verification algorithm. Other implementations have allowed access by mixing up RSA and HMAC protocols. In general, implementations are also more complicated than they need to be because of the need to support multiple different algorithms. For example in jwt-go, it's not enough to check err == nil to verify a good token, you also have to check the Valid parameter on a token object. I have seen someone omit the latter check in production.

The one benefit of JWT is a shared standard for specifying a header and a payload. But the server and the client should support only a single algorithm, probably HMAC with SHA-256, and reject all of the others.

If you are rejecting all of the other algorithms, though, you shouldn't leave the code for them lying around in your library. Omitting all of the other algorithms makes it impossible to commit an algorithm confusion error. It also means you can't screw up the implementations of those algorithms.
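
Signing and verifying with a single fixed algorithm needs nothing beyond the standard library. Here's a rough sketch in Go (not the code from my fork; the key handling is illustrative only):

package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// sign computes the HMAC-SHA256 signature of the header.payload string.
func sign(signingString string, key *[32]byte) string {
	mac := hmac.New(sha256.New, key[:])
	mac.Write([]byte(signingString))
	return base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// verify recomputes the signature and compares it in constant time. There
// is no algorithm field to inspect, so "none" and RSA/HMAC confusion
// attacks are impossible by construction.
func verify(signingString, signature string, key *[32]byte) bool {
	return hmac.Equal([]byte(sign(signingString, key)), []byte(signature))
}

func main() {
	var key [32]byte // in production, load 32 random bytes from your secret store
	sig := sign("header.payload", &key)
	fmt.Println(verify("header.payload", sig, &key))
}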

For fun, I forked jwt-go and ripped out all of the code not related to the HMAC-SHA256 algorithm. That library is currently 2,600 lines of Go and supports four distinct verification algorithms; my fork is only 720 lines and has much simpler APIs.

// old
func Parse(tokenString string, keyFunc func(*Token) (interface{}, error)) (*Token, error)
func (m *SigningMethodHMAC) Sign(signingString string, key interface{}) (string, error)
func (m *SigningMethodHMAC) Verify(signingString, signature string, key interface{}) error

// new
func Parse(tokenString string, key *[32]byte) (*Token, error)
func Sign(signingString string, key *[32]byte) string
func Verify(signingString, signature string, key *[32]byte) error

These changes increased the type safety and reduced the number of branches in the code. Reducing the number of branches helps reduce the chance of introducing a defect that compromises security.

It's important to note that my experiment is not JWT. When you reduce JWT to a thing that is secure, you give up the "algorithm agility" that is a proud part of the specification.

We should have more "JWT"-adjacent libraries that only attempt to implement a single algorithm, with a 256-bit random key only, for their own sake and for their users. Or we should give up on the idea of JWT.

An API Client that’s Faster than the API

For the past few weeks I've been working on Logrole, a Twilio log viewer. If you have to browse through your Twilio logs, I think this is the way that you should do it. We were able to do some things around performance and resource conservation that have been difficult to accomplish with today's popular web server technologies.

Picture of Logrole

Fast List Responses

Twilio's API frequently takes over 1 second to return a page of calls or messages. But Logrole usually returns results in under 100ms. How? Every 30 seconds we fetch the most recent page of Calls/Messages/Conferences and store them in a cache. When we download a page of resources, we get the URL to the next page - Twilio's next_page_uri - immediately, but a user might not browse to the next page for another few seconds. We don't wait for you to hit Next: on the server side, we fetch and cache the next page immediately, so it's ready when you finally click the button, and the app feels really snappy.
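
Here's a sketch of that prefetch; client and cache are hypothetical helpers standing in for Logrole's internals, not its actual API:

// prefetchNextPage warms the cache in the background so the next page is
// ready before the user clicks Next.
func prefetchNextPage(nextPageURI string) {
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		page, err := client.GetPage(ctx, nextPageURI)
		if err != nil {
			return // best effort; the user can still fetch it on demand
		}
		cache.Set(nextPageURI, page)
	}()
}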

The cache is a simple LRU cache. We run Go structs through encoding/gob and then gzip before storing them, which takes the size of a cache value from 42KB to about 3KB. At this size, about 8,300 values can fit in 25MB of memory.

var buf bytes.Buffer
// gob-encode the value, gzipping as we go, so the cached copy is a few
// kilobytes instead of tens of kilobytes.
writer := gzip.NewWriter(&buf)
enc := gob.NewEncoder(writer)
if err := enc.Encode(data); err != nil {
	panic(err)
}
// Close flushes the gzip stream; without it the data can't be decoded.
if err := writer.Close(); err != nil {
	panic(err)
}
// Store the compressed bytes in the LRU cache, guarded by a mutex.
c.mu.Lock()
defer c.mu.Unlock()
c.c.Add(key, buf.Bytes())
c.Debug("stored data in cache", "key", key, "size", buf.Len(),
    "cache_size", c.c.Len())

Right now one machine is more than enough to serve the website, but if we ever needed multiple machines, we could use a tool like groupcache to share/lookup cached values across multiple different machines.

Combining Shared Queries

The logic in the previous paragraphs leads to a race. If the user requests the Next page before we've finished retrieving/saving the value in the cache, we'll end up making two requests for the exact same data. This isn't too bad in our case, but it doubles the load on Twilio, and it means the second request could have gotten its results sooner by reusing the response from the first request.

The singleflight package is useful for ensuring that only one request per key is in flight at a time. With singleflight, if a request is already in progress for a given key, other callers get the return value from that first request instead of issuing their own. We use the next page URI as the key.

var g singleflight.Group
g.Do(page.NextPageURI, func() (interface{}, error) {
    // 1) return value from cache, if we've stored it
    // 2) else, retrieve the resource from the API
    // 3) store response in the cache
    // 4) return response
})

This technique is also useful for avoiding thundering herd problems.
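
Filled in, the closure might look something like this; it assumes the groupcache-style singleflight whose Do returns (interface{}, error), plus the same hypothetical cache and client helpers as above:

// getPage returns the cached page when we have it, and otherwise fetches
// it from the API exactly once per URI, no matter how many callers ask
// concurrently.
func getPage(ctx context.Context, g *singleflight.Group, uri string) (*Page, error) {
	v, err := g.Do(uri, func() (interface{}, error) {
		if page, ok := cache.Get(uri); ok {
			return page, nil
		}
		page, err := client.GetPage(ctx, uri)
		if err != nil {
			return nil, err
		}
		cache.Set(uri, page)
		return page, nil
	})
	if err != nil {
		return nil, err
	}
	return v.(*Page), nil
}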

Canceling Unnecessary Queries

You've configured a 30 second request timeout in nginx, and a query is taking too long, exceeding that timeout. nginx returns a 504 Gateway Timeout and moves on, but your server is still happily processing the request, even though no one is listening for the result. It's a hard problem to solve because it's much easier for a thread to just give up than to give up and tell everyone downstream of it that they can give up too. A lot of our tools and libraries don't have the ability to do that kind of out-of-band signaling to a downstream process.

Go's context.Context is designed as an antidote for this. We set a timeout in a request handler very early on in the request lifecycle:

ctx, cancel := context.WithTimeout(req.Context(), timeout)
defer cancel()
req = req.WithContext(ctx)
h.ServeHTTP(w, req)

We pass this context to any method call that does I/O - a database query, an API client request (in our case), or an exec.Command. If the timeout is exceeded, the channel returned by ctx.Done() will be closed, and we can immediately stop work, no matter where we are. Stopping work when a context is canceled is a built-in property of http.Request and os/exec in Go 1.7, and will be in database/sql starting with Go 1.8.
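
For code you write yourself, honoring the deadline is a select statement away. A sketch:

// pollStatus checks an external resource once a second until it's ready,
// but stops as soon as the request's context is canceled or its deadline
// passes, so abandoned requests don't keep doing work.
func pollStatus(ctx context.Context, check func() (bool, error)) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err() // deadline exceeded, or the caller gave up
		case <-ticker.C:
			ready, err := check()
			if err != nil {
				return err
			}
			if ready {
				return nil
			}
		}
	}
}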

This is so nice - as a comparison, one of the most popular npm libraries for "stop after a timeout" is the connect-timeout library, which lets you execute a callback after a particular amount of time, but does nothing to cancel any in-progress work. No popular ORMs for Node support canceling database queries.

It can be really tricky to enforce an absolute deadline on a HTTP request. In most languages you compute a timeout as a duration, but this timeout might reset to 0 every time a byte is received on the socket, making it difficult to enforce that the request doesn't exceed some wall-clock amount of time. Normally you have to do this by starting a 2nd thread that sleeps for a wall-clock amount of time, then checks whether the HTTP request is still in progress and kills the HTTP request thread if so. This 2nd thread also has to cleanup and close any open file descriptors.

Starting threads and finding/closing FDs may not be easy in your language, but contexts make it easy to set a deadline for sending/receiving data and to communicate that deadline to many different callers. The HTTP request code can then clean up the same way it would for any canceled request.

Metrics

I've been obsessed with performance for a long time and one of the first things I like to do in a new codebase is start printing out timestamps. How long did tests take? How long did it take to start the HTTP server? It's impossible to optimize something if you're not measuring it and it's hard to make people aware of a problem if you're not printing durations for common tasks.

Logrole prints three numbers in the footer of every response: the server response time, the template render time, and the Twilio API request time. You can use these numbers to get a picture of where the server was spending its time, and whether template rendering is taking too long. I use custom template functions to implement this - we store the request's start time in its Context, and then print time elapsed on screen. Obviously this is not a perfect measure since we can't see the time after the template footer is rendered - mainly the ResponseWriter.Write call. But it's close enough.

Page Footer
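
Here's a sketch of what such a template helper can look like; the context key and function names are assumptions, not Logrole's actual code:

type ctxKey int

const startTimeKey ctxKey = 0

// elapsed is registered as a template function. It reads the start time
// stashed in the request's Context early in the request lifecycle and
// reports how long the request has taken so far.
func elapsed(ctx context.Context) string {
	start, ok := ctx.Value(startTimeKey).(time.Time)
	if !ok {
		return ""
	}
	return time.Since(start).String()
}

// registered with something like: template.FuncMap{"elapsed": elapsed}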

HTML5

Logrole loads one CSS file and one font. I would have needed a lot more Javascript a few years ago, but HTML5 has some really nice features that eliminate the need for it. For example, there's a built-in date picker in HTML5 that people can use to filter calls/messages by date (it's best supported in Chrome at the moment). Similarly, you don't need Javascript to play recordings anymore; HTML5 has an <audio> element that provides controls for you.

I've needed Javascript in only three places so far:

  • a "click to show images" toggle button where the CSS necessary to implement it would have been too convoluted
  • a "click-to-copy" button
  • to submit a user's timezone change when they change it in the menu bar (instead of having a separate "Submit" button).

About 50 lines in total, implemented in the HTML below the elements where it's needed.

Conclusion

Combining these techniques, we get a server that uses little memory, doesn't waste any time doing unnecessary work, and responds and renders a slow data set extraordinarily quickly. Before starting a new project, evaluate the feature set of the language/frameworks you plan to use - whether the ORM/API clients you are planning to use support fast cancelation, whether you can set wall-clock timeouts and propagate them easily through your stack, and whether your language makes it easy to combine duplicate requests.

If you are a Twilio customer I hope you will give Logrole a try - I think you will like it a lot.

Thanks to Bradley Falzon, Kyle Conroy and Alan Shreve for reading drafts of this post. Thanks to Brad Fitzpatrick for designing and implementing most of the code mentioned here.

The Frustration and Loneliness of Server-Side Javascript Development

I was hired in December 2014 as the sixth engineer at Shyp. Shyp runs Node.js on the server. It's been a pretty frustrating journey, and I wanted to share some of the experiences we've had. There are no hot takes about the learning curve, or "why are there so many frameworks" in this post.

  • Initially we were running on Node 0.10.30. Timers would consistently fire 50 milliseconds early - that is, if you called setTimeout with a duration of 200 milliseconds, the timer would fire after 150 milliseconds. I asked about this in the #Node.js Freenode channel, and was told it was my fault for using an "ancient" version of Node (it was 18 months old at that time), and that I should get on a "long term stable" version. This made timeout-based tests difficult to write - every test had to set timeouts longer than 50ms.

  • I wrote a metrics library that published to Librato. I expect a background metric publishing daemon to silently swallow/log errors. One day Librato had an outage and returned a 502 to all requests; unbeknownst to me, the underlying librato library we were using was also an EventEmitter that emitted an 'error' event on failure, and the unhandled 'error' event crashed the process. This caused about 30 minutes of downtime in our API.

  • The advice on the Node.js website is to crash the process if there's an uncaught exception. Our application is about 60 models, 30 controllers; it doesn't seem particularly large. It consistently takes 40 seconds to boot our server in production; the majority of this time is spent requiring files. Obviously we try not to crash, and fix crashes when we see them, but we can't take 40 seconds of downtime because someone sent an input we weren't expecting. I asked about this on the Node.js mailing list but got mostly peanuts. Recently the Go core team sped up build times by 2x in a month.

  • We discovered our framework was sleeping for 50ms on POST and PUT requests for no reason. I've previously written about the problems with Sails, so I am going to omit most of that here.

  • We upgraded to Node 4 (after spending a day dealing with a nasty TLS problem) and observed our servers were consuming 20-30% more RAM, with no accompanying speed, CPU utilization, or throughput increase. We were left wondering what benefit we got from upgrading.

  • It took about two months to figure out how to generate a npm shrinkwrap file that produced reliable changes when we tried to update it. Often attempting to modify/update it would change the "from" field in every single dependency.

  • The sinon library appears to be one of the most popular ways to stub the system time. The default method for stubbing (useFakeTimers) leads many other libraries to hang inexplicably. I noticed this after spending 90 minutes wondering why stubbing the system time caused CSV writing to fail. The only ways to debug this are 1) to add console.log statements at deeper and deeper levels, since you can't ever hit ctrl+C and print a stack trace, or 2) take a core dump.

  • The library we used to send messages to Slack crashed our server if/when Slack returned HTML instead of JSON.

  • Our Redis provider changed something - we don't know what, since the version number stayed the same - in their Redis instance, which causes the Redis connection library we use to crash the process with an "Unhandled event" message. We have no idea why this happens, but it's tied to the number of connections we have open - we've had to keep a close eye on the number of concurrent Redis connections, and eventually just phased out Redis altogether.

  • Our database framework doesn't support transactions, so we had to write our own transaction library.

  • The underlying database driver doesn't have a way to time out/cancel connection acquisition, so threads will wait forever for a connection to become available. I wrote a patch for this; it hasn't gotten any closer to merging in 10 weeks, so I published a new NPM package with the patch, and submitted that upstream.

  • I wrote a job queue in Go. I couldn't benchmark it using our existing Node server as the downstream server because the Node server was too slow - a basic Express app would take 50ms on average to process incoming requests and respond with a 202, with 100 concurrent in-flight requests. I had to write a separate downstream server for benchmarking.

  • Last week I noticed our integration servers were frequently running out of memory. Normally they use about 700MB of memory, but we saw they would accumulate 500MB of swap in about 20 seconds. I think the app served about thirty requests total in that time frame. There was nothing unusual about the requests or the amount of traffic sent over HTTP or to the database during that window; the amount of data involved was measured in kilobytes. We don't have tools like pprof. We can't take a heap dump, because the box is out of memory by that point.

Summary

You could argue - and certainly you could say this about Sails - that we chose bad libraries, and we should have picked better. But part of the problem with the Node ecosystem is it's hard to distinguish good libraries from bad.

I've also listed problems above - the head-pounding, hours-or-days-of-frustrating-debugging kind - with at least six different libraries. At what point do you move beyond "this one library is bad" and start worrying about the bigger picture? Do the libraries have problems because the authors don't know better, or because the language and standard library make it difficult to write good libraries? I don't think the language and its standard library are helping.

You could argue that we are a fast growing company and we'd have these problems with any language or framework we chose. Maybe? I really doubt it. The job queue is 6000 lines of Go code written from scratch. There was one really bad problem in about two months of development, and a) the race detector caught it before production, b) I got pointed to the right answer in IRC pretty quickly.

You could also argue we are bad engineers and better engineers wouldn't have these problems. I can't rule this out, but I certainly don't think this is the case.

The Javascript community gets a lot of credit for its work on codes of conduct, on creating awesome conferences, for creating a welcoming, thoughtful community. But the most overwhelming feeling I've gotten over the past year has been loneliness. I constantly feel like I am the only person running into these problems with libraries, these problems in production, or the only person who cares about the consequences of e.g. waiting forever for a connection to become available. I frequently post on Github or in IRC and get no response, or get told "that's the way it is". In the absence of feedback, I am left wondering how everyone else is getting along, or how their users are.

I've learned a lot about debugging Node applications, making builds faster, finding and fixing intermittent test failures and more. I am available to speak or consult at your company or event, about what I have learned. I am also interested in figuring out your solutions to these problems - I will buy you a coffee or a beer and chat about this at any time.

Just Return One Error

Recently there's been a trend in APIs to return more than one error, or to always return an array of errors to the client when there's a 400 or a 500 server error. From the JSON API specification:

Error objects MUST be returned as an array keyed by errors in the top level of a JSON API document.

Here's an example.

HTTP/1.1 422 Unprocessable Entity
Content-Type: application/vnd.api+json

{
  "errors": [
    {
      "status": "422",
      "source": { "pointer": "/data/attributes/first-name" },
      "title":  "Invalid Attribute",
      "detail": "First name must contain at least three characters."
    }
  ]
}

I've also seen it in model validation; for example, Waterline returns an array of errors in the invalidAttributes field of the returned error object.

I think this is overkill, and adds significant complexity to your application for little benefit. Let's look at why.

  • Clients are rarely prepared to show multiple errors. At Shyp, we returned an errors key from the API on a 400 or 500 error for about 18 months. All four of the API's clients - the iOS app, the Android app, our error logger, and our internal dashboard - simply processed this by parsing and displaying errors[0], and ignoring the rest. It makes sense from the client's perspective - the errors were being shown on a phone (not much room) or an alert box in the browser (show two, and you get the "Prevent this page from displaying more alerts" dialog). But if clients can't make use of multiple errors, it makes me wonder why we return them from the API.

  • You rarely catch every problem with a request. Say a pickup is 1) missing the to address, 2) missing the city, 3) shipping to a country we don't ship to, and 4) contains a photo of a firearm, which we can't ship. If we return errors, they're going to be ['Missing City', 'Missing Address Line 1']. There's no way we're going to submit your invalid address to our rates service and discover even more problems with the customs form, or try to parse the photo and learn that you're trying to ship a firearm. So even if you fix those first errors, we're still going to return you a 400 from the API.

  • Continuing to process the request is dangerous. Attaching multiple errors to an API response implies that you discovered a problem with an incoming request and continued processing it. This is dangerous! Every part of your system needs to be prepared to handle invalid inputs. This might be a useful defensive measure, but it's also a lot of work for every function in your system to avoid making assumptions about the input that's passed to it. In my experience, passing around bad inputs leads to unnecessary server errors and crashes.

  • Programming languages and libraries return one error. Generally if you call raise or throw, you're limited to throwing a single Error object. If you're using a third party library that can error, usually library functions are limited to returning one error. Returning a single error from your API or your model validation function aligns with 99% of error handling paradigms that currently exist.

A Caveat

There's one place where returning multiple errors may make sense - a signup form with multiple form fields, where it can be frustrating to only return one problem at a time. There are two ways to deal with this:

  • Perform validation on the client, and detect/display multiple errors there. It's a better user experience to do this, since feedback is more or less instantaneous. You'll need the validation in your API anyway, but if someone manages to bypass the client checks, it's probably fine to just return one error at a time.

  • Special case your returned error with a key like attributes, and return the necessary error messages, just for this case. I'd rather special case this error message, than force every other error to be an array just to handle this one place in the API.

Don’t Use Sails (or Waterline)

The Shyp API currently runs on top of the Sails JS framework. It's an extremely popular framework - the project has over 11,000 stars on Github, and it's one of the top 100 most popular projects on the site. However, we've had a very poor experience with it, and with Waterline, the ORM that runs underneath it. Remember when you learned that java.net.URL does a DNS lookup to check whether a URL is equivalent to another URL? Imagine finding an issue like that every two weeks or so and that's the feeling I get using Sails and Waterline.

The project's maintainers are very nice, and considerate - we have our disagreements about the best way to build software, but they're generally responsive. It's also clear from the error handling in the project (at least the getting started docs) that they've thought a lot about the first run experience, and helping people figure out the answers to the problems they encounter trying to run Sails.

That said, here are some of the things we've struggled with:

  • The sailsjs.org website broke all incoming Google links ("sails views", "sails models", etc), as well as about 60,000 of its own autogenerated links, for almost a year. Rachael Shaw has been doing great work to fix them again, but it was pretty frustrating that documentation was so hard to find for that long.

  • POST and PUT requests that upload JSON or form-urlencoded data sleep for 50ms in the request parser. This sleep occupied about 30% of the request time on our servers, and 50-70% of the time in controller tests.

  • The community of people who use Sails doesn't seem to care much about performance or correctness. The above error was present for at least a year and not one person wondered why simple POST requests take 50ms longer than a simple GET. For a lot of the issues above and below it seems like we are the only people who have run into them, or care.

  • By default Sails generates a route for every function you define in a controller, whether it's meant to be public or not. This is a huge security risk, because you generally don't think to write policies for these implicitly-created routes, so it's really easy to bypass any authentication rules you have set up and hit a controller directly.

  • Blueprints are Sails's solution for a CRUD app and we've observed a lot of unexpected behavior with them. For one example, passing an unknown column name as the key parameter in a GET request (?foo=bar) will cause the server to return a 500.

  • If you want to test the queries in a single model, there's no way to do it besides loading/lifting the entire application, which is dog slow - on our normal sized application, it takes at least 7 seconds to begin running a single test.

  • Usually when I raise an issue on a project, the response is that there's some newer, better thing being worked on, that I should use instead. I appreciate that, but porting an application has lots of downside and little upside. I also worry about the support and correctness of the existing tools that are currently in wide use.

  • Hardcoded typos in command arguments.

  • No documented responsible disclosure policy, or information on how security vulnerabilities are handled.

Waterline

Waterline is the ORM that powers Sails. The goal of Waterline is to provide the same query interface for any database that you would like to use. Unfortunately, this means that the supported feature set is the least common denominator of every supported database. We use Postgres, and by default this means we can't get a lot of leverage out of it.

These issues are going to be Postgres oriented, because that's the database we use. Some of these have since been fixed, but almost all of them (apart from the data corruption issues) have bitten us at one point or another.*

  • No support for transactions. We had to write our own transaction interface completely separate from the ORM.

  • No support for custom Postgres types (bigint, bytea, array). If you set a column to type 'array' in Waterline, it creates a text field in the database and serializes the array by calling JSON.stringify.

  • If you define a column to be type 'integer', Waterline will reject things that don't look like integers... except for Mongo IDs, which look like "4cdfb11e1f3c000000007822". Waterline will pass these through to the database no matter what backend data store you are using.

  • Waterline offers a batch interface for creating items, e.g. Users.create([user1, user2]). Under the hood, however, creating N items issues N insert requests for one record each, instead of one large request. 29 out of 30 times, the results will come back in order, but there used to be a race where create would sometimes return results in a different order than the order you inserted them. This caused a lot of intermittent, hard-to-parse failures in our tests until we figured out what was going on.

  • Waterline queries are case insensitive; that is, Users.find().where(name: 'FOO') will turn into SELECT * FROM users WHERE name = LOWER('FOO');. There's no way to turn this off. If you ask Sails to generate an index for you, it will place the index on the uppercased column name, so your queries will miss it. If you generate the index yourself, you pretty much have to use the lowercased column value & force every other query against the database to use that as well.

  • The .count function used to work by pulling the entire table into memory and checking the length of the resulting array.

  • No way to split out queries to send writes to a primary and reads to a replica. No support for canceling in-flight queries or setting a timeout on them.

  • The test suite is shared by every backend adapter; this makes it impossible for the Waterline team to write tests for database-specific behavior or failure handling (unique indexes, custom types, check constraints, etc). Any behavior specific to your database is poorly tested at best.

  • "Waterline truncated a JOIN table". There are probably more issues in this vein, but we excised all .populate, .associate, and .destroy calls from our codebase soon after this, to reduce the risk of data loss.

  • When Postgres raises a uniqueness or constraint violation, the resulting error handling is very poor. Waterline used to throw an object instead of an Error instance, which means that Mocha would not print anything about the error unless you called console.log(new Error(err)); to turn it into an Error. (It's since been fixed in Waterline, and I submitted a patch to Mocha to fix this behavior, but we stared at empty stack traces for at least six months before that). Waterline attempts to use regex matching to determine whether the error returned by Postgres is a uniqueness constraint violation, but the regex fails to match other types of constraint failures like NOT NULL errors or partial unique indexes.

  • The error messages returned by validation failures are only appropriate to display if the UI can handle newlines and bullet points. Parsing the error message to display any other scenario is very hard; we try really hard to dig the underlying pg error object out and use that instead. Mostly nowadays we've been creating new database access interfaces that wrap the Waterline model instances and handle errors appropriately.

Conclusion

I appreciate the hard work put in by the Sails/Waterline team and contributors, and it seems like they're really interested in fixing a lot of the issues above. I think it's just really hard to be an expert in sixteen different database technologies, and write a good library that works with all of them, especially when you're not using a given database day in and day out.

You can build an app that's reliable and performant on top of Sails and Waterline - we think ours is, at least. You just have to be really careful, avoid the dangerous parts mentioned above, and verify at every step that the ORM and the router are doing what you think they are doing.

The sad part is that in 2015 you have so many options for building a reliable service - options that let you write code securely and reliably and that can scale to handle large numbers of open connections with low latency. Using a framework and an ORM doesn't mean you need to enter a world of pain. You don't need to constantly battle your framework, worry whether your ORM is going to delete your data, or wonder whether it's generating the correct query for your code. Better options are out there! Here are some of the more reliable options I know about.

  • Instagram used Django well through its $1 billion acquisition. Amazing community and documentation, and the project is incredibly stable.

  • You can use Dropwizard from either Java or Scala, and I know from experience that it can easily handle hundreds/thousands of open connections with incredibly low latency.

  • Hell, the Go standard library has a lot of reliable, multi-threaded, low latency tools for doing database reads/writes and server handling. The third party libraries are generally excellent.

I'm not amazingly familiar with backend Javascript - this is the only server framework I've used - but if I had to use Javascript I would check out whatever the Walmart and the Netflix people are using to write Node, since they need to care a lot about performance and correctness.

*: If you are using a database without traditional support for transactions and constraints, like Mongo, correct behavior is going to be very difficult to verify. I wish you the best of luck.

Stepping Up Your Pull Request Game

Okay! You had an idea for how to improve the project, the maintainers indicated they'd approve it, you checked out a new branch, made some changes, and you are ready to submit it for review. Here are some tips for submitting a changeset that's more likely to pass through code review quickly, and make it easier for people to understand your code.

A Very Brief Caveat

If you are new to programming, don't worry about getting these details right! There are a lot of other useful things to learn first, like the details of a programming language, how to test your code, how to retrieve data you need, then parse it and transform it into a useful format. I started programming by copy/pasting text into the WordPress "edit file" UI.

Write a Good Commit Message

A big, big part of your job as an engineer is communicating what you're doing to other people. When you communicate, you can be more sure that you're building the right thing, you increase usage of things that you've built, and you ensure that people don't assign credit to someone else when you build things. Part of this job is writing a clear message for the rest of the team when you change something.

Fortunately you are probably already doing this! If you write a description of your change in the pull request summary field, you're already halfway there. You just need to put that great message in the commit instead of in the pull request.

The first thing you need to do is stop typing git commit -m "Updated the widgets" and start typing just git commit. Git will try to open a text editor; you'll want to configure this to use the editor of your choice.

A lot of words have been written about writing good commit messages; Tim Pope wrote my favorite post about it.

How big should a commit be? Bigger than you think; I rarely use more than one commit in a pull request, though I try to limit pull requests to 400 lines removed or added, and sometimes break a multiple-commit change into multiple pull requests.

Logistics

If you use Vim to write commit messages, the editor will show you if the summary goes beyond 50 characters.

If you write a great commit message, and the pull request is one commit, Github will display it straight in the UI! Here's an example commit in the terminal:

And here's what that looks like when I open that commit as a pull request in Github - note Github has autofilled the subject/body of the message.

Note if your summary is too long, Github will truncate it:

Review Your Pull Request Before Submitting It

Before you hit "Submit", be sure to look over your diff so you don't submit an obvious error. In this pass you should be looking for things like typos, syntax errors, and debugging print statements which are left in the diff. One last read through can be really useful.

Everyone struggles to get code reviews done, and it can be frustrating for reviewers to find things like print statements in a diff. They might be more hesitant to review your code in the future if it has obvious errors in it.

Make Sure The Tests Pass

If the project has tests, try to verify they pass before you submit your change. It's annoying that Github doesn't make the test pass/failure state more obvious before you submit.

Hopefully the project has a test convention - a make test command in a Makefile, instructions in a CONTRIBUTING.md, or automatic builds via Travis CI. If you can't get the tests set up, or it seems like there's an unrelated failure, add a separate issue explaining why you couldn't get the tests running.

If the project has clear guidelines on how to run the tests, and they fail on your change, it can be a sign you weren't paying close enough attention.

The Code Review Process

I don't have great advice here, besides a) patience is a virtue, and b) the faster you are to respond to feedback, the easier it will go for reviewers. Really you should just read Glen Sanford's excellent post on code review.

Ready to Merge

Okay! Someone gave you a LGTM and it's time to merge. As a part of the code review process, you may have ended up with a bunch of commits that look like this:

These commits are detritus, the prototypes of the creative process, and they shouldn't be part of the permanent record. Why fix them? Because six months from now, when you're staring at a piece of confusing code and trying to figure out why it's written the way it is, you really want to see a commit message explaining the change that looks like this, and not one that says "fix tests".

There are two ways to get rid of these:

Git Amend

If you just made a commit and have a new change that should be part of the same commit, use git commit --amend to add the changes to the current commit. If you don't need to change the message, use git commit --amend --no-edit (I map this to git alter in my git config).

Git Rebase

You want to squash all of those typo fix commits into one. Steve Klabnik has a good guide for how to do this. I use this script, saved as rb:

    #!/usr/bin/env bash
    branch="$1"
    if [[ -z "$branch" ]]; then
        branch='master'
    fi
    BRANCHREF="$(git symbolic-ref HEAD 2>/dev/null)"
    BRANCHNAME=${BRANCHREF##refs/heads/}
    if [[ "$BRANCHNAME" == "$branch" ]]; then
        echo "Switch to a branch first"
        exit 1
    fi
    git checkout "$branch"
    git pull origin "$branch"
    git checkout "$BRANCHNAME"
    if [[ -n "$2" ]]; then
        git rebase "$branch" "$2"
    else
        git rebase "$branch"
    fi
    git push origin --force "$BRANCHNAME"

If you run that with rb master, it will pull down the latest master from origin, rebase your branch against it, and force push to your branch on origin. Run rb master -i and select "squash" to squash your commits down to one.

As a side effect of rebasing, you'll resolve any merge conflicts that have developed between the time you opened the pull request and the merge time! This can take the headache away from the person doing the merge, and help prevent mistakes.

A look at the new retry behavior in urllib3

"Build software like a tank." I am not sure where I read this, but I think about it a lot, especially when writing HTTP clients. Tanks are incredible machines - they are designed to move rapidly and protect their inhabitants in any kind of terrain, against enemy gunfire, or worse.

HTTP clients often run in unfriendly territory, because they usually involve a network connection between machines. Connections can fail, packets can be dropped, the other party may respond very slowly, or with a new unknown error message, or they might even change the API from under you. All of this means writing an HTTP client like a tank is difficult. Here are some examples of things that a desirable HTTP client would do for you, that are never there by default.

  • If a request fails to reach the remote server, we would like to retry it no matter what. We don't want to wait around for the server forever though, so we want to set a timeout on the connection attempt.

  • If we send the request but the remote server doesn't respond in a timely manner, we want to retry it, but only on requests where it is safe to send the request again - so called idempotent requests.

  • If the server returns an unexpected response, we want to always retry if the server didn't do any processing - a 429, 502 or 503 response usually indicates this - as well as all idempotent requests.

  • Generally we want to sleep between retries to allow the remote connection/server to recover, so to speak. To help prevent thundering herd problems, we usually sleep with an exponential back off.

Here's an example of how you might code this:

def resilient_request(method, uri, retries):
    while True:
        try:
            resp = requests.request(method, uri)
            if resp.status_code < 300:
                break
            if resp.status_code in [429, 502, 503]:
                retries -= 1
                if retries <= 0:
                    raise
                time.sleep(2 ** (3 - retries))
                continue
            if resp.status_code >= 500 and method in ['GET', 'PUT', 'DELETE']:
                retries -= 1
                if retries <= 0:
                    raise
                time.sleep(2 ** (3 - retries))
                continue
        except (ConnectionError, ConnectTimeoutError):
            retries -= 1
            if retries <= 0:
                raise
            time.sleep(2 ** (3 - retries))
        except TimeoutError:
            if method in ['GET', 'PUT', 'DELETE']:
                retries -= 1
                if retries <= 0:
                    raise
                time.sleep(2 ** (3 - retries))
                continue

Holy cyclomatic complexity, Batman! This suddenly got complex, and the control flow here is not simple to follow, reason about, or test. Better hope we caught everything, or we might end up in an infinite loop, or try to access resp when it has not been set. There are some parts of the above code that we can break into sub-methods, but you can't make the code too much more compact than it is there, since most of it is control flow. It's also a pain to write this type of code and verify its correctness; most people just try once, as this comment from the pip library illustrates. This is a shame and the reliability of services on the Internet suffers.

A better way

Andrey Petrov and I have been putting in a lot of work to make it really, really easy for you to write resilient requests in Python. We pushed the complexity of the above code down into the urllib3 library, closer to the request that goes over the wire. Instead of the above, you'll be able to write this:

def retry_callable(method, response):
    """ Determine whether to retry this
    return ((response.status >= 400 and method in IDEMPOTENT_METHODS)
            or response.status in (429, 503))
retry = urllib3.util.Retry(read=3, backoff_factor=2,
                           retry_callable=retry_callable)
http = urllib3.PoolManager()
resp = http.request(method, uri, retries=retry)

You can pass a callable to the retries object to determine the retry behavior you'd like to see. Alternatively you can use the convenience method_whitelist and codes_whitelist helpers to specify which methods to retry.

retry = urllib3.util.Retry(read=3, backoff_factor=2,
                           codes_whitelist=set([429, 500]))
http = urllib3.PoolManager()
resp = http.request(method, uri, retries=retry)

And you will get out the same results as the 30 lines above. urllib3 will do all of the hard work for you to catch the conditions mentioned above, with sane (read: non-intrusive) defaults.

This is coming soon to urllib3 (and with it, to Python Requests and pip); we're looking for a bit more review on the pull request before we merge it. We hope this makes it easier for you to write high performance HTTP clients in Python, and appreciate your feedback!

Thanks to Andrey Petrov for reading a draft of this post.

Designing a better shower faucet

How should you design the controls for a shower? Let's take a quick look.

Affordance

a hammer

A device should make clear by its design how to use it. Take a hammer for example.

No one has ever looked at a hammer and wondered which end you are supposed to grab and which part you're supposed to pound nails with. This is an example of good affordance.

Some things do not have such good affordance, like the shower at my friend's house. It looked like this, except the handles were perfectly horizontal.

shower faucet and handles

The shower handles have one good affordance - you know where you are supposed to grab, and it's clear you are supposed to rotate the handles. However they leave the following questions unanswered.

  • Which one is hot and which one is cold?
  • How far do I have to turn the handles to reach the desired temperature?
  • What combination of hot and cold do I want?
  • Which direction do I turn the handles, up or down?

That's pretty bad for a device which doesn't need to do much. Maybe not as bad as this sink with two faucets, one for hot and one for cold:

sink fail

But it leaves a lot for the user to figure out, especially when there is usually a lag between when you move the handle and when the temperature changes, making it tough to figure out what's going on.

Designing a Better Showerhead

Functionally, a device should have two properties:

  1. Allow you to do the tasks you want
  2. Make it easy for you to do those tasks.

That's it - if the device is pretty on top of this, that's a big bonus. What do we want a shower to do?

  • Turn on hot water
  • Occasionally, make it even hotter
  • Turn off the water

That's it. These tasks don't map terribly well to the current set of faucets, which ask you to perform a juggling act to get water at the right temperature.

So how can we design a tool to do just this? I'll assume for the moment we have to stick with a physical interface - a tablet for a shower control would allow interesting choices like customizing the shower temperature per user, but would put this out of the reach of most homes. A good start would be a simple control to turn the water on and off. It's not necessary that the control shows the state of water, on or off, as you get that feedback from the hot water - it could just be a button that you press.

That's a good start, now how to control the temperature? I wasn't able to find good data, but my guess is that most people want showers in a 15 degree range of hot to very hot. Either way, there should be a sliding control that lets you select temperatures in this range.

The sliding shower handle comes close and this is one of the better designs I've seen:

Sliding shower handle

However it still has three problems. First, it shouldn't have a cold range at all, or let you select a temperature which will burn you.

Second, sliding the handle changes both pressure and temperature. You should get the best pressure available the moment you slide the handle a little bit.

Third, the feedback you get when you turn the shower off could be better. The device could offer a little resistance, and then slide into place when turning it on or off - this way you know that the shower is on or off, similar to the way stoves and iPhone headphones slide into place with a satisfying click.

A shower handle that gave resistance when you turned it on or off, turned on at full pressure straight away, and only let you slide between various hot temperatures - that would be nice.

Why Jeeps Don’t Have Square Headlights & other useful notes

This is one of the most interesting blog posts I have seen in a while: 37signals interviewing Clotaire Rapaille, a cultural consultant, marketing guru, psychologist, anthropologist, call-him-what-you-will. It will be the most interesting thing you read today. 37signals is a great website if you are looking for examples of great design (and not-so-great design, too). As of today it is entering my feed reader.
