Posts Tagged With: Code

A look at the new retry behavior in urllib3

"Build software like a tank." I am not sure where I read this, but I think about it a lot, especially when writing HTTP clients. Tanks are incredible machines - they are designed to move rapidly and protect their inhabitants in any kind of terrain, against enemy gunfire, or worse.

HTTP clients often run in unfriendly territory, because they usually involve a network connection between machines. Connections can fail, packets can be dropped, the other party may respond very slowly or with a new, unknown error message, or they might even change the API out from under you. All of this means writing an HTTP client like a tank is difficult. Here are some examples of things a desirable HTTP client would do for you, none of which you get by default.

  • If a request fails to reach the remote server, we would like to retry it no matter what. We don't want to wait around for the server forever though, so we want to set a timeout on the connection attempt.

  • If we send the request but the remote server doesn't respond in a timely manner, we want to retry it, but only on requests where it is safe to send the request again - so-called idempotent requests.

  • If the server returns an unexpected response, we want to retry whenever the server didn't do any processing - a 429, 502, or 503 response usually indicates this - as well as any request that is idempotent.

  • Generally we want to sleep between retries to allow the remote connection/server to recover, so to speak. To help prevent thundering herd problems, we usually sleep with an exponential backoff.

Here's an example of how you might code this:

import time

import requests
from requests.exceptions import ConnectionError, Timeout


def resilient_request(method, uri, retries):
    while True:
        try:
            # Time out so we don't wait on the server forever.
            resp = requests.request(method, uri, timeout=30)
            if resp.status_code < 300:
                return resp
            if resp.status_code in [429, 502, 503]:
                retries -= 1
                if retries <= 0:
                    resp.raise_for_status()
                time.sleep(2 ** (3 - retries))
                continue
            if resp.status_code >= 500 and method in ['GET', 'PUT', 'DELETE']:
                retries -= 1
                if retries <= 0:
                    resp.raise_for_status()
                time.sleep(2 ** (3 - retries))
                continue
            return resp
        except ConnectionError:
            # The request never reached the server, so it is always safe to retry.
            retries -= 1
            if retries <= 0:
                raise
            time.sleep(2 ** (3 - retries))
        except Timeout:
            # The server may have started processing; only retry idempotent methods.
            if method not in ['GET', 'PUT', 'DELETE']:
                raise
            retries -= 1
            if retries <= 0:
                raise
            time.sleep(2 ** (3 - retries))

Holy cyclomatic complexity, Batman! This suddenly got complex, and the control flow here is not simple to follow, reason about, or test. Better hope we caught everything, or we might end up in an infinite loop, or try to access resp when it has not been set. There are some parts of the above code that we can break into sub-methods, but you can't make the code too much more compact than it is there, since most of it is control flow. It's also a pain to write this type of code and verify its correctness; most people just try once, as this comment from the pip library illustrates. This is a shame and the reliability of services on the Internet suffers.

A better way

Andrey Petrov and I have been putting in a lot of work to make it really, really easy for you to write resilient requests in Python. We pushed the complexity of the above code down into the urllib3 library, closer to the request that goes over the wire. Instead of the above, you'll be able to write this:

def retry_callable(method, response):
    """ Determine whether to retry this response. """
    return ((response.status >= 400 and method in IDEMPOTENT_METHODS)
            or response.status in (429, 503))

retry = urllib3.util.Retry(read=3, backoff_factor=2,
                           retry_callable=retry_callable)
http = urllib3.PoolManager()
resp = http.request(method, uri, retries=retry)

You can pass a callable to the retries object to determine the retry behavior you'd like to see. Alternatively you can use the convenience method_whitelist and codes_whitelist helpers to specify which methods and status codes to retry.

retry = urllib3.util.Retry(read=3, backoff_factor=2,
                           codes_whitelist=set([429, 500]))
http = urllib3.PoolManager()
resp = http.request(method, uri, retries=retry)

And you will get out the same results as the 30 lines above. urllib3 will do all of the hard work for you to catch the conditions mentioned above, with sane (read: non-intrusive) defaults.

This is coming soon to urllib3 (and with it, to Python Requests and pip); we're looking for a bit more review on the pull request before we merge it. We hope this makes it easier for you to write high performance HTTP clients in Python, and appreciate your feedback!

Thanks to Andrey Petrov for reading a draft of this post.

How to create rich links in your Sphinx documentation

This will be short, but it seems there's some difficulty doing this, so I thought I'd share.

The gist is, any time you reference a class or method in your own library, in the Python standard library, or in another third-party extension, you can provide a link directly to that project's documentation. This is pretty amazing and only requires a little bit of extra work from you. Here's how.

The Simplest Type of Link

Just create a link using the full import path of the class or attribute or method. Surround it with backticks like this:

Use :meth:`requests.Request.get` to make HTTP Get requests.

That link will show up in text as:

Use requests.Request.get to make HTTP Get requests.

There are a few different types of declarations you can use at the beginning of that phrase:

:attr:
:class:
:meth:
:exc:

The full list is here.

I Don't Want to Link the Whole Thing

To specify just the method/attribute name and not any of the modules or classes that precede it, use a squiggly (a tilde), like this:

Use :meth:`~requests.Request.get` to make HTTP Get requests.

That link will show up in text as:

Use get to make HTTP Get requests.

I Want to Write My Own Text

This gets a little trickier, but still doable:

Use :meth:`the get() method <requests.Request.get>` to make HTTP Get requests.

That link will show up in text as:

Use the get() method to make HTTP Get requests.

I want to link to someone else's docs

In your docs/conf.py file, add 'sphinx.ext.intersphinx' to the end of the extensions list near the top of the file. Then, add the following anywhere in the file:

    # Add the "intersphinx" extension
    extensions = [
        'sphinx.ext.intersphinx',
    ]
    # Add mappings
    intersphinx_mapping = {
        'urllib3': ('http://urllib3.readthedocs.org/en/latest', None),
        'python': ('http://docs.python.org/3', None),
    }

You can then reference other projects' documentation the same way you reference your own, and Sphinx will magically make everything work.
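
For example, with the mapping above in place, this reference links straight to the urllib3 documentation:

Use :class:`urllib3.util.Retry` to configure retry behavior.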

I want to write the documentation inline in my source code and link to it

Great! I love this as well. Add the 'sphinx.ext.autodoc' extension, then write your documentation. There's a full guide to the inline syntax on the Sphinx website; confusingly, it is not listed on the autodoc page.

    # Add the "autodoc" extension
    extensions = [
        'sphinx.ext.autodoc',
    ]
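
For example, a function documented inline with Sphinx field lists might look like this (reusing the resilient_request function from the first post):

    def resilient_request(method, uri, retries):
        """Make an HTTP request, retrying when it fails.

        :param method: The HTTP method to use, e.g. 'GET'.
        :param uri: The URL to request.
        :param retries: How many times to retry before giving up.
        :returns: The final :class:`requests.Response` object.
        """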

Hope that helps! Happy linking.

New blog post about HAProxy

Over on the Twilio Engineering Blog, I have a new post about optimizing your HAProxy configuration. I wrote this mostly because we had some confusion about setting these options in our own configuration, and I figured if we were confused, others would be as well. Here's a sample:

retries 2
option redispatch

When I said a 30 second connect timeout meant HAProxy would try a bad connection for 30 seconds, I lied. It turns out that by default HAProxy will retry the connect attempt 3 times. So our 30 second connect timeout is actually a 120 second connect timeout, blowing through our SLA and meaning we're returning an empty response to the customer.

Read the full post to learn more about HAProxy.

Automating your IPython Notebook Setup (and getting launchctl to work)

Recently I've fallen in love with the IPython Notebook. It's the Python REPL on steroids and I've probably just scratched the surface of what it can actually do. This will be a short post because long posts make me feel pain when I think about blogging more again. This is also really more about setting up launchctl than IPython, but hopefully that's useful too.

Starting it from the command line is kind of a pain (it tries to save .ipynb files in your current directory, and it warns you to save files before closing tabs), so I thought I'd just set it up to run in the background whenever my Mac boots. Here's how you can get that set up.

Create a virtualenv with IPython

First, you need to install the ipython binary, and the other packages you need to run IPython Notebook.

    # Install virtualenvwrapper, then source it
    pip install virtualenvwrapper
    source /path/to/virtualenvwrapper.sh

    mkvirtualenv ipython
    pip install ipython tornado pyzmq

Starting IPython When Your Mac Boots

Open a text editor and add the following:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
    <dict>
      <key>Label</key>
      <string>com.kevinburke.ipython</string>
      <key>ProgramArguments</key>
      <array>
          <string>/Users/kevin/.envs/ipython/bin/ipython</string>
          <string>notebook</string>
      </array>
      <key>RunAtLoad</key>
      <true/>
      <key>StandardOutPath</key>
      <string>/Users/kevin/var/log/ipython.log</string>
      <key>StandardErrorPath</key>
      <string>/Users/kevin/var/log/ipython.err.log</string>
      <key>ServiceDescription</key>
      <string>ipython notebook runner</string>
      <key>WorkingDirectory</key>
      <string>/Users/kevin/.ipython_notebooks</string>
    </dict>
    </plist>

You will need to replace the word kevin with your name and the relevant file locations on your file system. I also save my notebooks in a directory called .ipython_notebooks in my home directory; you may want to create that directory as well.

Save that in /Library/LaunchDaemons/<yourname>.ipython.plist. Then change the owner to root:

sudo chown root:wheel /Library/LaunchDaemons/<yourname>.ipython.plist

Finally load it:

sudo launchctl load -w /Library/LaunchDaemons/<yourname>.ipython.plist

If everything went ok, IPython should open in a tab. If it didn't go okay, check /var/log/system.log for errors, or one of the two logfiles specified in your plist.

Additional Steps

That's it! I've also found it really useful to run an nginx redirector locally, along with a new rule in /etc/hosts, so I can visit http://ipython and get redirected to my notebooks. But that is a topic for a different blog post.

Speeding up test runs by 81% and 13 minutes

Yesterday I sped up our unit/integration test runs from 16 minutes to 3 minutes. I thought I'd share the techniques I used during this process.

  • We had a hunch that an un-mocked network call was taking 3 seconds to time out. I patched this call throughout the test code base. It turns out this did not have a significant effect on the runtime of our tests, but it's good to mock out network calls anyway, even if they fail fast.
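
    The patching looked something like this (the module path and the code under test here are made up for illustration):

    import mock  # the backported unittest.mock module for Python 2

    # 'ourapp.network.get_remote_status' stands in for the real network call
    @mock.patch('ourapp.network.get_remote_status')
    def test_status_sync(mock_get):
        mock_get.return_value = {'status': 'ok'}
        sync_statuses()  # hypothetical code under test that would normally hit the network
        assert mock_get.called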

  • I ran a profiler on the code. Well that's not true, I just timed various parts of the code to see how long they took, using some code like this:

    import datetime
    start = datetime.datetime.now()
    some_expensive_call()
    total = (datetime.datetime.now() - start).total_seconds()
    print "some_expensive_call took {} seconds".format(total)
    

    It took about ten minutes to zero in on the fixture loader, which was doing something like this:

    def load_fixture(fixture):
        model = find_fixture_in_db(fixture['id'])
        if not model:
            create_model(**fixture)
        else:
            update_model(model, fixture)
    

    The call to find_fixture_in_db was doing a "full table scan" of our SQLite database, and taking about half of the run-time of the integration tests. Moreover in our case it was completely unnecessary, as we were deleting and re-inserting everything with every test run.

    I added a flag to the fixture loader to skip the database lookup if we were doing all inserts. This sped up observed test time by about 35%.
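
    The change amounted to something like this (skip_lookup is a stand-in name for the flag):

    def load_fixture(fixture, skip_lookup=False):
        # On a freshly wiped database every fixture is new, so the
        # expensive lookup is wasted work.
        model = None if skip_lookup else find_fixture_in_db(fixture['id'])
        if not model:
            create_model(**fixture)
        else:
            update_model(model, fixture)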

  • I noticed that the local test runner and the Jenkins build runner were running different numbers of tests. This was really confusing. I ended up doing some fancy stuff with the xunit xml output to figure out which extra tests were running locally. Turns out, the same test was running multiple times. The culprit was a stray line in our Makefile:

    nosetests tests/unit tests/unit/* ...
    

    The tests/unit/* change was running all of the tests in compiled .pyc files as well! I felt dumb because I actually added that tests/unit/* change about a month ago, thinking that nosetests wasn't actually running some of the tests in subfolders of our repository. This change cut down on the number of tests run by a factor of 2, which significantly helped the test run time.

  • The Jenkins package install process would remove and re-install the virtualenv before every test run, to ensure we got up-to-date dependencies with every run. Well that was kind of stupid, so instead we switched to running

    pip install --upgrade .
    

on our setup.py file, which should pull in the correct versions of dependencies when they change (most of them are specified with either double-equals (==) or greater-than-or-equal (>=) signs). Needless to say, skipping the full reinstall every time saved about three to four minutes.

  • I noticed that pip would still uninstall and reinstall packages that were already there. This happened for two reasons. One, our Jenkins box is running an older version of pip, which doesn't have this change from pip 1.1:

    Fixed issue #49 - pip install -U no longer reinstalls the same versions of packages. Thanks iguananaut for the pull request.

    I upgraded the pip and virtualenv versions inside of our virtualenv.

    Also, one dependency in our tests/requirements.txt would install the latest version of requests, which would then be overridden in setup.py by a very specific version of requests, every single time the tests ran. I fixed this by explicitly setting the requests version in the tests/requirements.txt file.
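
    The fix was a one-line pin in tests/requirements.txt (the version number below is illustrative):

    requests==2.2.1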

That's it! There was nothing major that was wrong with our process, just fixing the way we did a lot of small things throughout the build. I have a couple of other ideas to speed up the tests, including loading fewer fixtures per test and/or instantiating some objects like Flask's test_client globally instead of once per test. You might not have been as dumb as we were but you'll likely find some speedups if you check your build process as well.

Eliminating more trivial inconveniences

I really enjoyed Sam Saffron's post about eliminating trivial inconveniences in his development process. This resonated with me, as I tend to get really distracted by minor hiccups in the development process (a page reload taking >2 seconds, switching to a new tab, etc). I took a look at my development process and found a few easy wins.

Automatically run the unit tests in the current file

Twilio's PHP test suite is really slow - we're sloppy about keeping unit tests from hitting the disk, which means that the suite takes a while to run. I wrote a short Vim command that will run only the tests in the current file. This tends to make the test iteration loop much, much faster, and I can run the entire suite once the current file is passing. The <leader> key in Vim is excellent and I recommend you become familiar with it.

nnoremap <leader>n :execute "!" . "/usr/local/bin/phpunit " . bufname('%') . ' \| grep -v Configuration \| egrep -v "^$" '<CR>

bufname('%') is the file name of the current Vim buffer, and the last two commands are just grepping away output I don't care about. The result is awesome:

Unit test result in vim

Auto reloading the current tab when you change CSS

Sam has a pretty excellent MessageBus option that listens for changes to CSS files, and auto-refreshes a tab when this happens. We don't have anything that good yet but I added a vim leader command to refresh the current page in the browser. By the time I switch from Vim to Chrome (or no time, if I'm viewing them side by side), the page is reloaded.

function! ReloadChrome()
    execute 'silent !osascript ' .
                \'-e "tell application \"Google Chrome\" " ' .
                \'-e "repeat with i from 1 to (count every window)" ' .
                \'-e "tell active tab of window i" ' .
                \'-e "reload" ' .
                \'-e "end tell" ' .
                \'-e "end repeat" ' .
                \'-e "end tell" >/dev/null'
endfunction
nnoremap <leader>o :call ReloadChrome()<CR>:pwd<cr>

Then I just hit <leader>o and Chrome reloads the current tab. This works even if you have the "Developer Tools" open as a separate window, and focused - it reloads the open tab in every window of Chrome.

Pushing the current git branch to origin

It turns out that the majority of my git pushes are just pushing the current git branch to origin. So instead of typing git push origin <branch-name> 100 times a day I added this to my .zshrc:

    push_branch() {
        branch=$(git rev-parse --symbolic-full-name --abbrev-ref HEAD)
        git push $1 $branch
    }
    autoload push_branch
    alias gpob='push_branch origin'

I use this for git pushes almost exclusively now.

Auto reloading our API locally

The Twilio API is based on the open-source flask-restful project, running behind uWSGI. One problem we had was that changes to the application code required a full uWSGI restart, which made local development a pain. Until recently, it was pretty difficult to get new Python code running in uWSGI without doing a manual reload - you had to implement a file watcher yourself, and then communicate with the running process. But last year uWSGI added the py-auto-reload feature, where uWSGI will poll for changes in your application and automatically reload itself. Enable it in your uWSGI config with

py-auto-reload = 1   # 1 second between polls

Or at the command line with uwsgi --py-auto-reload=1.

Conclusion

These changes have all made me a little bit quicker, and helped me learn more about the tools I use on a day to day basis. Hope they're useful to you as well!

Submit forms using Javascript without breaking the Internet, a short guide

Do you write forms on the Internet? Are you planning to send them to your server with Javascript? You should read this.

The One-Sentence Summary

It's okay to submit forms with Javascript. Just don't break the internet.

What Do You Mean, Break the Internet?

Your browser is an advanced piece of software that functions in a specific way, often for very good reasons. Ignore these reasons and you will annoy your users. User annoyance translates into lower revenue for you.

Here are some of the ways your Javascript form submit can break the Internet.

Submitting to a Different Endpoint Than the Form Action

A portion of your users are browsing the web without Javascript enabled. Some of them, like my friend Andrew, are paranoid. Others are on slow connections and want to save bandwidth. Still others are blind and browse the web with the help of screen readers.

All of these people, when they submit your form, will not hit your fancy Javascript event handler; they will submit the form using the default action and method for the form - which, if unspecified, default to a GET to the current page. Likely, this does not actually submit the form. Which leads to my favorite quote from Mark Pilgrim:

Jakob Nielsen's dog

There is an easy fix: make the form action and method default to the same endpoint that you are POSTing to with Javascript.

You are probably returning some kind of JSON object with an error or success message and then redirecting the user in Javascript. Just change your server endpoint to redirect if the request is not an AJAX request. You can detect this because jQuery and most other Javascript libraries attach an X-Requested-With: XMLHttpRequest HTTP header to asynchronous requests.
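
Here is a rough sketch of that server-side branch using Flask; the route and the create_subscription helper are made up for illustration:

from flask import Flask, jsonify, redirect, request

app = Flask(__name__)

@app.route('/subscribe', methods=['POST'])
def subscribe():
    create_subscription(request.form['email'])  # hypothetical business logic
    if request.headers.get('X-Requested-With') == 'XMLHttpRequest':
        # Asynchronous submit: return JSON and let the Javascript redirect.
        return jsonify(success=True)
    # Plain form submit (Javascript disabled): redirect on the server side.
    return redirect('/thanks')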

Changing Parameter Names

Don't change the names of the submitted parameters in Javascript - just submit the same names that you had in your form. In jQuery this is easy, just call the serialize method on the form.

var form = $("#form-id");
$.post('endpoint', $(form).serialize(), function(response) {
    // do something with the response.
});

Attaching the Handler to a Click Action

Believe it or not, there are other ways of submitting a form besides clicking on the submit button. Screen readers, for example, don't click; they just submit the form. Also there are lots of people like me who use tab to move between form fields and press the spacebar to submit forms. This means if your form submit handler starts with:

$("#submit-button").click(function() {
    // Submit the form.
});

You are doing it wrong and breaking the Internet for people like me. You would not believe how many sites don't get this right. Examples in the past week: WordPress, Mint's login page, JetBrains's entire site.

The correct thing to do is attach the event handler to the form itself.

$("#form-id").submit(function() {
    // Write code to submit the form with Javascript
    return false; // Prevents the default form submission
});

This will attach the event to the form however the user submits it. Note the use of return false to keep the browser's default form submission from also firing.

Validation

It's harder to break the Internet with validation. To give the user a fast feedback loop, you should detect and reject invalid input on the client side.

The annoying thing is you have to do this on both the client side and the server side, in case the user gets past the client side checks. The good news is the browser can help with most of the easy stuff. For example, if you want to check that an email address is valid, use the "email" input type:

<input type="email" />

Then your browser won't actually submit a form that doesn't have a valid email address. Similarly you can mark required fields with the required HTML attribute. This makes validation on the client a little easier for most of the cases you're trying to check for.
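
For example, a required text field looks like this:

<input type="text" required />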

Summary

You can submit forms with Javascript, but most of the time you'll have to put in extra effort to duplicate functionality that already exists in your browser. If you're going to go down that road, please put in the extra effort.

Helping Beginners Get HTML Right

If you've ever tried to teach someone HTML, you know how hard it is to get the syntax right. It's a perfect storm of awfulness.

  • Newbies have to learn all of the syntax, in addition to the names of HTML elements. They don't have the pattern matching skills (yet) to notice when their HTML is not right, or the domain knowledge to know it's spelled "href" and not "herf".

  • The browser doesn't provide feedback when you make mistakes - it will render your mistakes in unexpected and creative ways. Miss a closing tag and watch your whole page suddenly acquire italics, or get pasted inside a textarea. Miss a quotation mark and half the content disappears. Add in layouts with CSS and the problem doubles in complexity.

  • Problems tend to compound. If you make a mistake in one place and don't fix it immediately, you can't determine whether future additions are correct.

This leads to a pretty miserable experience getting started - people should be focused on learning how to make an amazingly cool thing in their browser, but instead they get frustrated trying to figure out why the page doesn't look right.

Let's Make Things A Little Less Awful

What can we do to help? The existing tools to help people catch HTML mistakes aren't great. Syntax highlighting helps a little, but sometimes the errors look as pretty as the actual text. XML validators are okay, but tools like HTML Validator spew out red herrings as often as they do real answers. Plus, you have to do work - open the link, copy your HTML in, read the output - to use it.

We can do better. Most of the failures of the current tools are due to the complexity of HTML - which, if you are using all of the features, is Turing complete. But new users are rarely exercising the full complexity of HTML5 - they are trying to learn the principles. Furthermore the mistakes they are making follow a Pareto distribution - a few problems cause the majority of the mistakes.

Catching Mistakes Right Away

To help with these problems I've written a validator which checks for the most common error types, and displays feedback to the user immediately when they refresh the page - so they can instantly find and correct mistakes. It works in the browser, on the page you're working with, so you don't have to do any extra work to validate your file.

Best of all, you can drop it into your HTML file in one line:

<script type="text/javascript" src="https://raw.github.com/kevinburke/tecate/master/tecate.js"></script>

Then if there's a problem with your HTML, you'll start getting nice error messages, like this:

error message

Read more about it here, and use it in your next tutorial. I hope you like it, and I hope it helps you with debugging HTML!

It's not perfect - there are a lot of improvements to be made, both in the errors we can catch and on removing false positives. But I hope it's a start.

PS: Because the browser will edit the DOM tree to wipe the mistakes users make, I have to use raw regular expressions to check for errors. I have a feeling I will come to regret this. After all, when parsing HTML with regex, it's clear that the <center> cannot hold. I am accepting this tool will give wrong answers on some HTML documents; I am hoping that the scope of documents turned out by beginning HTML users is simple enough that the center can hold.

How to design your API SDK

I've worked with Twilio's client libraries pretty much every day for the last year and I wanted to share some of the things we've learned about helper libraries.

Should you have helper libraries?

You should think about helper libraries as a more accessible interface to your API. Your helper libraries trade the details of your authentication scheme and URL structure for the ease of "do X in N lines of code." If the benefits of a more accessible interface outweigh the costs, do it.

If people are paying to access your API (Twilio, AWS, Sendgrid, Stripe, for example), then you probably should write helper libraries. A more accessible API translates directly into more revenue for your company.

If you're two founders in a garage somewhere, maybe not. The gap between your company's success and failure is probably not a somewhat easier API interface. Writing a helper library is a lot of work, maybe one to four man-weeks depending on the size of your API and your familiarity with the language in question, plus ongoing maintenance.

You might not need a client library if your customers are all highly experienced programmers. For example the other day I wrote my own client for the Recaptcha API. I knew how I wanted to consume it and learning/installing a Recaptcha library would have been unnecessary overhead.

You may also not need a client library if the language's ecosystem already has very good HTTP clients. For example, the Requests library dramatically lowers the barrier for writing a client that uses HTTP basic auth. Developers who are familiar with Requests will have an easier time writing HTTP clients. Implementing HTTP basic auth remains a large pain point in other languages.

How should you design your helper libraries?

Realize that if you are writing a helper library, for many of your customers the helper library will be the API. You should put as much care into its design as you do your HTTP API. Here are a few guiding principles.

  • If you've designed your API in a RESTful way, your API endpoints should map to objects in your system. Translate these objects in a straightforward way into classes in the helper library, making the obvious transformations - translate numbers from strings in the API representation into integers, and translate date strings such as "2012-11-05" into date objects.
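
For example, a sketch of that translation for a hypothetical call resource might look like this:

import datetime

class Call(object):
    """ Wraps one instance resource from the API; the field names are illustrative. """
    def __init__(self, payload):
        self.sid = payload['sid']
        # The API returns numbers and dates as strings; convert them here.
        self.duration = int(payload['duration'])
        self.start_date = datetime.datetime.strptime(
            payload['start_date'], '%Y-%m-%d').date()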

  • Your library should be flexible. I will illustrate this with a short story. After much toil and effort, the Twilio SMS team was ready to ship support for Unicode messages. As part of the change, we changed the API's 'Content-Type' header from

application/json

to

application/json; charset=utf-8

We rolled out Unicode SMS and there was much rejoicing; fifteen minutes later, we found out we'd broken three of our helper libraries, and there was much wailing and gnashing of teeth. It turns out the libraries had hard-coded a check for an application/json content-type, and threw an exception when we changed the Content-Type header.

  • Your library should complain loudly if there are errors. Per the point on flexibility above, your HTTP API should validate inputs, not the client library. For example let's say we had the library raise an error if you tried to send an SMS with more than 160 characters in it. If Twilio ever wanted to ship support for concatenated SMS messages, no one who had this library installed would be able to send multi-message texts. Instead, let your HTTP API do the validation and pass through errors in a transparent way.

  • Your library should use consistent naming schemes. For example, the convention for updating resources should be the same everywhere. Hanging up a call and changing an account's FriendlyName both represent the same concept, updating a resource. You should have methods to update each that look like:

$account->update('FriendlyName', 'blah');
$call->update('Status', 'completed');

It's okay, even good, to have methods that map to readable verbs:

$account->reserveNumber('+14105556789');
$call->hangup();

However, these should always be thin wrappers around the update() methods.

class Call {
    function hangup() {
        return $this->update('Status', 'completed');
    }
}

Having only the readable-verb names is a path that leads to madness. It becomes much tougher to translate from the underlying HTTP request to code, and much trickier to add new methods or optional parameters later.

  • Your library should send a User-Agent header containing the library name and version number, which you can correlate against your own API logs. Custom HTTP clients rarely (read: never) add their own user agent, and standard library maintainers don't like default user agents much.
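
For example, in a Python library the header might be set like this (the library name and version string are placeholders):

import platform

import requests

LIBRARY_VERSION = '0.1.0'  # placeholder version string

session = requests.Session()
# Shows up in API logs as e.g. "acme-python/0.1.0 (Python 2.7.5)"
session.headers['User-Agent'] = 'acme-python/{} (Python {})'.format(
    LIBRARY_VERSION, platform.python_version())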

  • Your library needs to include installation instructions, preferably written at a beginner level. Users have varying degrees of experience with things you might take for granted, like package managers, and will try to run your code in a variety of different environments (VPS, AWS, on old versions of programming languages, behind a firewall without admin rights, etc). Any steps your library can take to make things easier are good. As an example, the Twilio libraries include the SSL cert necessary for connecting to the Twilio API.

How should you test your library?

The Twilio API has over 20 different endpoints, split into list resources and instance resources, which support the HTTP methods GET, POST, and sometimes DELETE. Let's say there are 50 different combinations of endpoints and HTTP methods in total. Add in implementations for each helper library, and the complexity grows very quickly - if you have 5 helper libraries you're talking about 250 possible methods, all of which could have bugs.

One solution to this is to write a lot of unit tests. The problem is these take a lot of time to write, and at some level you are going to have to mock out the API, or stop short of making the actual API request. Instead we've taken the following approach to testing.

  1. Start with a valid HTTP request, and the parameters that go with it.
  2. Parse the HTTP request and turn it into a piece of sample code that exercises an aspect of your helper library.
  3. Run that code sample, and intercept the HTTP request made by the library.
  4. Compare the output with the original HTTP request.

This approach has the advantage of actually checking against the HTTP request that gets made, so you can test things like URL encoding issues. You can reuse the same set of HTTP requests across all of your libraries. The HTTP "integration" tests will also detect actions that should be possible with the API but are not implemented in the client.

You might think it's difficult to do automated code generation, but it actually was not that much work, and it's very easy if you've written your library in a consistent way. Here's a small sample that generates snippets for our Python helper library.

def process_instance_resource(self, resource, sid, method="GET", params=None):
    """ Generate code snippets for an instance resource """
    get_line = '{} = {}.get("{}")'.format(self.instance_name, self.base, sid)
    if method == "GET":
        interesting_line = 'print {}.{}'.format(self.instance_name,
            self.get_interesting_property(resource))
        return "\n".join([get_line, interesting_line])
    elif method == "POST":
        update_line = '{} = {}.update("{}", {})'.format(
            self.instance_name, self.base, sid, self.transform_params(params))
        interesting_line = 'print {}.{}'.format(
            self.instance_name, self.get_interesting_property(resource))
        return "\n".join([update_line, interesting_line])
    elif method == "DELETE":
        return '{}.delete("{}")'.format(self.base, sid)
    else:
        raise ValueError("Method {} not supported".format(method))

Generating code snippets has the added advantage that you can then easily embed these into your customer-facing documentation, as we've done in our documentation.

How do people use helper libraries?

While pretty much every resource gets used in the aggregate, individual accounts tend to only use one or two resources. This suggests that your API is only being referenced from one or two places within a customer's codebase.

How should you document your helper library?

Per the point above, your library is probably being used in only one or two places in a customer's codebase. This suggests your customer is hiring your API to do a specific job. Your documentation hierarchy should be aligned around those jobs. Combine that with the integration test/code snippet generator above, and you should have a working code example for every useful action in your API. You will probably also want to have documentation for the public library interface, such as the types and parameters for each method, but the self-service examples will be OK for 85% of your users.

Virgin Mobile fails web security 101, leaves six million subscriber accounts wide open

Update: Virgin fixed the issue Tuesday night after taking their login page down for four hours. Please see my update at the bottom of this post.

The first sentence of Virgin Mobile USA’s privacy policy announces that “We [Virgin] are strongly committed to protecting the privacy of our customers and visitors to our websites at www.virginmobileusa.com.” Imagine my surprise to find that pretty much anyone can log into your Virgin Mobile account and wreak havoc, as long as they know your phone number.

I reported the issue to Virgin Mobile USA a month ago and they have not taken any action, nor informed me of any concrete steps to fix the problem, so I am disclosing this issue publicly.

The vulnerability

Virgin Mobile forces you to use your phone number as your username, and a 6-digit number as your password. This means that there are only one million possible passwords you can choose.

Screenshot of Virgin Mobile login screen

This is horribly insecure. Compare a 6-digit number with a randomly generated 8-letter password containing uppercase letters, lowercase letters, and digits - the latter has 218,340,105,584,896 possible combinations. It is trivial to write a program that checks all million possible password combinations, easily determining anyone’s PIN inside of one day. I verified this by writing a script to “brute force” the PIN number of my own account.

The scope

Once an attacker has your PIN, they can take the following actions on your behalf:

  • Read your call and SMS logs, to see who’s been calling you and who you’ve been calling

  • Change the handset associated with an account, and start receiving calls/SMS that are meant for you. They don’t even need to know what phone you’re using now. Possible scenarios: $5/minute long distance calls to Bulgaria, texts to or from lovers or rivals, “Mom I lost my wallet on the bus, can you wire me some money?”

  • Purchase a new handset using the credit card you have on file, which may result in $650 or more being charged to your card

  • Change your PIN to lock you out of your account

  • Change the email address associated with your account (which only texts your current phone, instead of sending an email to the old address)

  • Change your mailing address

  • Make your life a living hell

How to protect yourself

There is currently no way to protect yourself from this attack. Changing your PIN doesn’t work, because the new one would be just as guessable as your current PIN. If you are one of the six million Virgin subscribers, you are at the whim of anyone who doesn’t like you. For the moment I suggest vigilance, deleting any credit cards you have stored with Virgin, and considering switching to another carrier.

What Virgin should do to fix the issue

There are a number of steps Virgin could take to resolve the immediate, gaping security issue. Here are a few:

  • Allow people to set more complex passwords, involving letters, digits, and symbols.

  • Freeze accounts after 5 failed password attempts, and require the account holder to verify more personal information before unfreezing the account.

  • Require both your PIN and access to your handset to log in. This is known as two-step verification.

In addition, there are a number of best practices Virgin should implement to protect against bad behavior, even if someone knows your PIN:

  • Provide the same error message when someone tries to authenticate with an invalid phone number, as when they try to authenticate with a good phone number but an invalid PIN. Based on the response to the login, I can determine whether your number is a Virgin number or not, making it easy to find targets for this attack.

  • Any time an email or mailing address is changed, send a mail to the old address informing them of the change, with a message like “If you did not request this change, contact our help team.”

  • Require a user to enter their current ESN, or provide information in addition to their password, before changing the handset associated with an account.

  • Add a page to their website explaining their policy for responsible security disclosure, along with a contact email address for security issues.

History of my communication with Virgin Mobile

I tried to reach out to Virgin and tell them about the issue before disclosing it publicly. Here is a history of my correspondence with them.

  • August 15 – Reach out on Twitter to ask if there is any other way to secure my account. The customer rep does not fully understand the problem.

  • August 16 – Brute force access to my own account, validating the attack vector.

  • August 15-17 – Reach out to various customer support representatives, asking if there is any way to secure accounts besides the 6-digit PIN. Mostly confused support reps tell me there is no other way to secure my account. I am asked to always include my phone number and PIN in replies to Virgin.

    Email screenshot of Virgin asking me to include my PIN

  • August 17 – Support rep Vanessa H escalates the issue to headquarters after I explain I’ve found a large vulnerability in Virgin’s online account security. Steven from Sprint Executive and Regulatory Services gives me his phone number and asks me to call.

  • August 17 – I call Steven and explain the issue, who can see the problem and promises to forward the issue on to the right team, but will not promise any more than that. I ask to be kept in the loop as Virgin makes progress investigating the issue. In a followup email I provide a list of actions Virgin could take to mitigate the issue, mirroring the list above.

  • August 24 – Follow up with Steven, asking if any progress has been made. No response.

  • August 30 – Email Steven again. Steven writes that my feedback “has been shared with the appropriate managerial staff” and “the matter is being looked into”.

  • September 4 – I email Steven again explaining that this response is unacceptable, considering this attack may be in use already in the wild. I tell him I am going to disclose the issue publicly and receive no response.

  • September 13 – I follow up with Steven again, informing him that I am going to publish details of the attack in 24 hours, unless I have more concrete information about Virgin’s plans to resolve the issue in a timely fashion.

  • September 14 – Steven calls back to tell me to expect no further action on Virgin Mobile’s end. Time to go public.

Update, Monday night

  • Sprint PR has been emailing reporters telling them that Sprint/Virgin have fixed the issue by locking people out after 4 failed attempts. However, the fix relies on cookies in the user’s browser. This is like Virgin asking me to tell them how many times I’ve failed to log in before, and using that information to lock me out. They are still vulnerable to an attack from anyone who does not use the same cookies with each request. (ed: This issue has been fixed as of Tuesday night)

  • News coverage:

  • This vulnerability only affects Virgin USA, to my knowledge; their other international organizations appear to only share the brand name, not the same code base.

Update, Tuesday night

Virgin’s login page was down for four hours from around 5:30 PDT to 9:30 PDT. I tried my brute force script again after the page came back up. Where before I was getting 200 OK’s with every request, now about 25% of the authentication requests return 503 Service Unavailable, and 25% return 404 Not Found.

Wednesday morning

Virgin took down their login page for 4 hours Tuesday night to deploy new code. Now, after about 20 incorrect logins from one IP address, every further request to their servers returns 404 Not Found. This fixes the main vulnerability I disclosed Monday.

I just got off the phone with Sprint PR representatives. They apologized and blamed a breakdown in the escalation process. I made the case that this is why they need a dedicated page for reporting security and privacy issues, and an email address where security researchers can report problems like this, and know that they will be heard.

I gave the example of Google, who says “customer service doesn’t scale” for many products, but will respond to any security issue sent to security@google.com in a timely fashion, and in many cases award cash bounties to people who find issues. Sprint said they’d look into adding a page to their site.

Even though they’ve fixed the brute force issue, I raised issues with PIN-based authentication. No matter how many automated fraud checks they have in place, PINs for passwords are a bad idea because:

  • People can’t use their usual password, so they might pick something more obvious and easier to remember, like their birthday.

  • Virgin’s customer service teams ask for it in emails and over the phone, so if an attacker gains access to someone’s email, or is within earshot of someone on a call to customer service, they have the PIN right there.

  • If I get access to your PIN through any means, I can do all of the stuff mentioned above – change your handset, read your call logs, etc. That’s not good and it’s why even though Google etc. allow super complex passwords, they allow users to back it up with another form of verification.

I also said that they should clarify their policy around indemnification. I never actually brute forced an account where I didn’t know the pin, or issue more than one request per second to Virgin’s servers, because I was worried about being arrested or sued for DOSing their website. Fortunately I could prove this particular flaw was a problem by dealing only with my own account. But what if I found an attack where I could change a number in a URL, and access someone else’s account? By definition, to prove the bug exists I’d have to break their terms of service, and there’s no way to know how they would respond.

They said they valued my feedback but couldn’t commit to anything, or tell me about whether they can fix this in the future. At least they listened and will maybe fix it, which is about as good as you can hope for.