Posts Tagged With: Efficiency

Weird Tricks to Write Faster, More Correct Database Queries

Posted on October 12, 2015

Adrian Colyer wrote a great summary of a recent paper by Peter Bailis et al. In the paper the database researchers examine open source Rails applications and observe that the applications apply constraints - foreign key references, uniqueness constraints - in a way that's not very performant or correct.

I was pretty surprised to read about this! For the most part we have avoided problems like this at Shyp, and I didn't realize how widespread this problem is; I certainly have written a lot of bad queries in the past.

So! Let's learn some tips for writing better queries. Everything below will help you write an application that is more correct - it will avoid consistency problems in your data - and more performant - you should be able to achieve the same results as Rails, with fewer queries!

ps - The info below may be really obvious to you! Great! There are a lot of people who aren't familiar with these techniques, as the paper above indicates.

Use database constraints to enforce business logic

Say you define an ActiveRecord class that looks like this:

class User < ActiveRecord::Base
  validates :email, uniqueness: true
end

What actually happens when you try to create a new user? It turns out Rails will make 4 (four!) roundtrips to the database.

BEGIN a transaction.
Perform a SELECT to see if any other users have that email address.
If the SELECT turns up zero rows, perform an INSERT to add the row.
Finally, COMMIT the result.

This is pretty slow! It also increases the load on your application and your database, since you need to make 4 requests for every INSERT. Bailis et al also show that with your database's default transaction isolation level, it's possible to insert two records with the same key. Furthermore, there are some ActiveRecord queries which skip the built-in validations, as Gary Bernhardt discussed in his video, "Where Correctness Is Enforced", way back in 2012. Any query which inserts data and skips the validations can compromise the integrity of your database.

What if I told you you can do the same insert in one query instead of four, and it would be more correct than the Rails version? Instead of Rails's migration, write this:

CREATE TABLE users (email TEXT UNIQUE);

The UNIQUE is the key bit there; it adds a unique key on the table. Then, instead of wrapping the query in a transaction, just try an INSERT.

> insert into users (email) values ('foo@example.com');
INSERT 0 1
> insert into users (email) values ('foo@example.com');
ERROR:  duplicate key value violates unique constraint "users_email_key"
DETAIL:  Key (email)=(foo@example.com) already exists.

You'll probably need to add better error handling around the failure case - at least we did, for the ORM we use. But at any level of query volume, or if speed counts (and it probably does), it's worth it to investigate this.

Just Try the Write

Say you wanted to read a file. You could write this:

if not os.path.isfile(filename):
    raise ValueError("File does not exist")
with open(filename, 'r') as f:
    f.read()
    ...

But that would still be vulnerable to a race! What if the OS or another thread deleted the file between the isfile check and the with open line - the latter would throw an IOError, which won't be handled. Far better to just try to read the file and handle errors appropriately.

try:
    with open(filename, 'r') as f:
        f.read()
        ...
except IOError:
    raise ValueError("File does not exist")

Say you have a foreign key reference - phone_numbers.user_id refers to users.id, and you want to validate that the user_id is valid. You could do:

def write_phone_number(number, user_id):
    user = Users.find_by_id(user_id)
    if user is None:
        raise NotFoundError("User not found")
    Number.create(number=number, user_id=user_id)

Just try to write the number! If you have a foreign key constraint in the database, and the user doesn't exist, the database will tell you so. Otherwise you have a race between the time you fetch the user and the time you create the number.

def write_phone_number(number, user_id):
    try
        Number.create(number=number, user_id=user_id)
    except DatabaseError as e:
        if is_foreign_key_error(e):
            raise NotFoundError("Don't know that user id")

Updates Should Compose

Let's say you have the following code to charge a user's account.

def charge_customer(account_id, amount=20):
    account = Accounts.get_by_id(account_id)
    account.balance = account.balance - 20
    if account.balance <= 0:
        throw new ValueError("Negative account balance")
    else
        account.save()

Under the hood, here's what that will generate:

SELECT * FROM accounts WHERE id = ?
UPDATE accounts SET balance = 30 WHERE id = ?;

So far, so good. But what happens if two requests come in to charge the account at the same time? Say the account balance is $100

Thread 1 wants to charge $30. It reads the account balance at $100.
Thread 2 wants to charge $15. It reads the account balance at $100.
Thread 1 subtracts $30 from $100 and gets a new balance of $70.
Thread 2 subtracts $15 from $100 and gets a new balance of $85.
Thread 1 attempts to UPDATE the balance to $70.
Thread 2 attempts to UPDATE the balance to $85.

This is clearly wrong! The balance after $45 of charges should be $55, but it was $70, or $85, depending on which UPDATE went last. There are a few ways to deal with this:

create some kind of locking service to lock the row before the read and after you write the balance. The other thread will wait for the lock before it reads/writes the balance. These are hard to get right and will carry a latency penalty.
Run the update in a transaction; this will create an implicit lock on the row. If the transaction runs at the SERIALIZABLE or REPEATABLE READ levels, this is safe. Note most databases will set the default transaction level to READ COMMITTED, which won't protect against the issue referenced above.

Skip the SELECT and write a single UPDATE query that looks like this:

    UPDATE accounts SET balance = balance - 20 WHERE id = ?;

That last UPDATE is composable. You can run a million balance updates in any order, and the end balance will be exactly the same, every time. Plus you don't need a transaction or a locking service; it's exactly one write (and faster than the .save() version above!)

But if I do just one UPDATE, I can't check whether the balance will go below zero! You can - you just need to enforce the nonnegative constraint in the database, not the application.

CREATE TABLE accounts (
    id integer primary key,
    balance integer CHECK (balance >= 0),
);

That will throw any time you try to write a negative balance, and you can handle the write failure in the application layer.

Update: Apparently MySQL accepts check constraints as valid syntax, but does not execute them, so you might need to take a different approach. Thanks olivier for pointing this out!

The key point is that your updates should be able to run in any order without breaking the application. Use relative ranges - balance = balance - 20 for example - if you can. Or, only apply the UPDATE if the previous state of the database is acceptable, via a WHERE clause. The latter technique is very useful for state machines:

UPDATE pickups SET status='submitted' WHERE status='draft' AND id=?;

That update will either succeed (if the pickup was in draft), or return zero rows. If you have a million threads try that update at the same time, only one will succeed - an incredibly valuable property!

Beware of save()

The save() function in an ORM is really unfortunate for two reasons. First, to call .save(), you have to retrieve an instance of the object via a SELECT call. If you have an object's ID and some fields to read, you can avoid needing to do the read by just trying the UPDATE. This introduces more latency and the possibility of writing stale data.

Second, some implementations of .save() will issue an UPDATE and update every column.

This can lead to updates getting clobbered. Say two requests come in, one to update a user's phone number, and the other to update a user's email address, and both call .save() on the record.

UPDATE users SET email='oldemail@example.com', phone_number='newnumber' WHERE id = 1;
UPDATE users SET email='newemail@example.com', phone_number='oldnumber' WHERE id = 1;

In this scenario the first UPDATE gets clobbered, and the old email gets persisted. This is really bad! We told the first thread that we updated the email address, and then we overwrote it. Your users and your customer service team will get really mad, and this will be really hard to reproduce. Be wary of .save - if correctness is important (and it should be!), use an UPDATE with only the column that you want.

Partial Indexes

If you thought the previous section was interesting, check this out. Say we have a pickups table. Each pickup has a driver ID and a status (DRAFT, ASSIGNED, QUEUED, etc).

CREATE TABLE pickups (
    id integer,
    driver_id INTEGER REFERENCES drivers(id),
    status TEXT
);

We want to enforce a rule that a given driver can only have one ASSIGNED pickup at a time. You can do this in the application by using transactions and writing very, very careful code... or you can ask Postgres to do it for you:

CREATE UNIQUE INDEX "only_one_assigned_driver" ON pickups(driver_id) WHERE
    status = 'ASSIGNED';

Now watch what happens if you attempt to violate that constraint:

> INSERT INTO pickups (id, driver_id, status) VALUES (1, 101, 'ASSIGNED');
INSERT 0 1
> INSERT INTO pickups (id, driver_id, status) VALUES (2, 101, 'DRAFT');
INSERT 0 1 -- OK, because it's draft; doesn't hit the index.
> INSERT INTO pickups (id, driver_id, status) VALUES (3, 101, 'ASSIGNED');
ERROR:  duplicate key value violates unique constraint "only_one_assigned_driver"
DETAIL:  Key (driver_id)=(101) already exists.

We got a duplicate key error when we tried to insert a second ASSIGNED record! Because you can trust the database to not ever screw this up, you have more flexibility in your application code. Certainly you don't have to be as careful to preserve the correctness of the system, since it's impossible to put in a bad state!

Summary

In many instances your ORM may be generating a query that's both slow, and can lead to concurrency errors. That's the bad news. The good news is you can write database queries that are both faster and more correct!

A good place to start is by reversing the traditional model of ORM development. Instead of starting with the code in the ORM and working backwards to the query, start with the query you want, and figure out how to express than in your ORM. You'll probably end up using the lower-level methods offered by your ORM a lot more, and you'll probably discover defects in the way that your ORM handles database errors. That's okay! You are on a much happier path.

Thanks to Alan Shreve and Kyle Conroy for reading drafts of this post.

Liked what you read? I am available for hire.

Tools I use to get the job done

Posted on June 17, 2011

Every day I look for new tools that'll help me get my work done faster. Here's some of the stuff I use to get the job done.

My [only] machine is a 13' MacBook Pro. I have a 120GB solid state drive, which is the single best upgrade you can make to any machine - it makes your laptop so much faster. I also have 8GB of RAM. Don't buy RAM or an SSD from Apple, buy the parts from Newegg and then install them yourself using the guides on iFixit. With the new unibody laptops you'll need a Phillips #00 and a Torx T5 screwdriver but you can get these from your local hardware store for less than $10.

I also have a 2TB external drive with backups and my music library, and a Samsung scanner. I use AudioTechnica noise-canceling headphones to stay focused in busy environments.

I have a pretty elaborate set of dotfiles that I use to configure the terminal, vim, and other applications. I keep these in a Mercurial repository so I can get my configuration set up really quickly on a new machine.

iTerm 2 and bash are fantastic for a terminal, and MacVim is a great GUI editor for vim - I use it to write blog posts and code. I use Adium for chat and Quicksilver to open and close applications and control iTunes. Skitch is an excellent tool for screenshots, because you can share photos with one click - I use it at least once a day. I use Colloquy for IRC (it's chat for programming people). Jumpcut helps keep track of things I've pasted to the Clipboard and TextExpander expands snippets I use often, like my email address and our home wireless password.

I'm obsessed with maximizing screen space. I keep the Dock minimized, (a) because I really never use it and (b) because it takes up about 50-100 pixels at the bottom of the screen. Divvy resizes/moves windows (to quickly make windows full screen, or left half/right half) and Witch switch between open windows (instead of applications) quickly. At home I have two ASUS 24' monitors; one connects through the DisplayPort and the other one connects with a Diamond BVU adapter.

I use Chrome for casual web browsing and Firefox + Firebug for web development. I use 1Password to generate strong passwords. I use Webfaction for web hosting; it's the best combination of price and features (full SSH, multiple websites, etc) I've been able to find. I might upgrade to Linode soon however, for faster sites and more space on the server. I use rsync to copy my 1Password Anywhere file to the Webfaction server, giving me access to my passwords from any web browser. I also use Shell in a Box to get SSH access to Webfaction from any browser.

I really wish there were better single site browser apps for Mac. Apparently Chrome and Firefox are both working on tools for this. I have single site apps for Google Voice, Calendar, and Gmail, configured with Prism. This helps me pull them up quickly, especially when I have 40+ tabs open in Chrome.

Some stuff I do that's atypical: I type using a DVORAK keyboard, to the endless annoyance of people trying to borrow my laptop. I also use Scroll Reverser to have the scroll direction mirror that of mobile devices; you move your fingers up to scroll down (apparently this is the default in Mac OS Lion).

I have bank accounts with ING Direct and Ally Bank. In my investment account on Sharebuilder I'm invested 50% in VT (a world market fund), 30% in PCRDX (a commodity futures index fund), and 20% in RWO (a global real estate fund).

Liked what you read? I am available for hire.

Free Trade & Sports

Posted on March 13, 2007

The idea of protectionism, when applied to a sport like baseball or basketball, is ridiculous. Say the NBA suddenly decided to limit the number of foreign players that could be drafted to four players in the upcoming draft. In the name of "protecting" American players at home. Arguments are made by American players - "they're taking our jobs," "American talent isn't as developed as foreign talent so we need quotas," or my favorite, "Foreign players are better than we are and they will play for less money than we will." This idea is ridiculous - when Dirk Nowitzki, Yao Ming, or Steve Nash play games, the quality of play is better and tickets are cheaper. More fans (I think) would be willing to go see a better quality game, too.

Liked what you read? I am available for hire.

Once You Go Mac, You Never Go Back

Posted on March 6, 2007

I'm home for Spring Break and working on my PC at home... I keep reaching for the command-space tab to bring up Quicksilver and keep looking to change the volume in the keyboard. As my new, terrible Verizon RAZR and Apple prove, the best features of a phone aren't the camera, or the video text messages, or the new technology. They're the ease of use and the free, simple stuff that makes using a product easy and intuitive.

Liked what you read? I am available for hire.

Verizon Phones Are Terrible

Posted on January 22, 2007

Well, the phones aren't terrible, but the software that comes with them is. I recently broke my two year old Samsung phone, which I was urged to replace but chose not to because it was so easy to use. I was disappointed by the software on new Verizon phones. First off, there is only one software running on every phone - unlike choice of plans or physical products, the user has no options when it comes to the interface. The interface is bereft of opportunities for customization, and a lot of the shortcuts on the phone point straight to opportunities to buy expensive games or ringtones from Verizon. It takes about ten button presses to change the text message alert status, and the display letters are small enough that you have to squint to read them. You can't set the alarm clock as one of the main shortcut buttons either. Verizon has emphasis in the wrong places - all of their phone 'capability' is based on whether the phone has a camera, or whether you can buy games, or send emails. I want to know how easy it is to do stuff with it. Anyways, I might be switching to Sprint. Verizon has had excellent coverage, I'll give them that, but I don't like the software on their phones, and that software is something I'm gonna have to navigate every time I try to do something with the phone. I'm amazed that no one's designed alternative software for phones - the Mac or Linux equivalent. It seems to me there would be a market for good software - software that does T9 Word better, or has easier access to the things you can do.

Liked what you read? I am available for hire.

You Know You Think Like A Libertarian When…

Posted on November 20, 2006

The government decides to put out a new set of one-dollar coins featuring every dead president, and all you can think about is how much of a tremendous waste of taxpayer dollars it represents!

Liked what you read? I am available for hire.

Excellent Mac App: Quicksilver

Posted on September 14, 2006

I have an iBook at college, and recently I started using the application Quicksilver. If you download and open it, odds are you will be initially confused by how simple the app is - just two boxes. I was discouraged from using it. I simply forced myself for an afternoon's browsing session to complete every action - open a new link, compose an email, play a new song, find a file, attach that file to a message to send to a contact, skip to the next song, run a Google search - using Quicksilver. The beauty of Quicksilver is that you can do all of the actions I just named simply and without using the mouse. The two windows in Quicksilver comprise an object and a verb - start typing the object and once you've found what you want (a song, a photo, a document, a Firefox bookmark, an application, the function that lets you run Google Search), press Tab and QS switches to the verb - what you want to do to the object (open, play, compose email, new folder, open URL, copy, etc.) This not only feels really cool but saves the time you'd need to move the mouse to the correct location and open something. Plus, it's nice to be able to do everything on your computer from a single location. Dan Dickinson has a good tutorial on his blog, A Better OS in 10 Minutes. I highly recommend this program to anyone who wants to run their computer faster and/or more simply.

Liked what you read? I am available for hire.

Has anyone else noticed that automatic sinks/flushers don’t work?

Posted on August 18, 2006

I remember when I was about six or seven, automatic sinks and flushers were the coolest things ever. "No hands!" I remember the Beavis & Butthead episode when they stand in front of a urinal, move to the next one, and laugh while the urinals flush in a line. If only automatic flushers worked as well as that. I walked into the bathroom today and stepped up to the urinal when it flushed. I did my deed and it flushed again as I exited the area. That's twice as much water as if I just did it myself. Ostensibly, automatic flushers/sinks were implemented for three reasons: to save you the trouble of pushing the handle, to save yourself from germs on the handle, and to save water. To paraphrase Cal Naughton from "Talladega Nights," these are three pretty good things. Problems arise in implementation, with the sinks being a bigger issue than the toilets. On about a third of sinks the sensors don't work, and on the ones that do, you usually have to keep your hands about an inch from the sensor for the sink to release water, which comes out in a lukewarm dribble. Until we can get automatic sinks that work reliably I'll pass on the benefits and gladly incur the small cost of getting germs for a tap that works.

Liked what you read? I am available for hire.

More Efficient Checkout Times

Posted on August 17, 2006

At supermarkets, I'm one of the people who will walk back and forth searching for the shortest line. Usually I find it and usually I can bypass about half of the people standing in line to check out. Often the shortest line is a regular checkout even if I have few enough items to qualify for the express checkout line. Instead of having twelve separate checkout lines, clogging up room at the front of the store and taking away valuable real estate, stores should follow the European market example: have one queue and multiple counters close together serving the one line. The primary benefit here is that customers will be helped in the order they entered the checkout line, ensuring that everyone waits roughly the same amount of time. This is an effective way to help prevent shoplifting, if everyone is leaving out of only one exit. Furthermore, you can put the counters closer together, opening up more of the floor for products. Go to a Tesco in Britain and you'll see what I'm talking about. This will never happen in the US, but it would be nice.

Liked what you read? I am available for hire.