Posts Tagged With: Code

Where are Datadog’s US1 and US3 data centers located?

When you sign up for Datadog, you are immediately asked to choose whether you want to have your data stored in US1, US3, or Europe. This is an odd UI decision because Datadog provides no other information about US1 and US3, for example, where they are located or how old the infrastructure is in each place.

Datadog signup form

Fortunately we can use DNS to get some answers to these questions.

Where Datadog's US1 is located

The only info that Datadog provides about each site is the hostname; US1 is available at "app.datadoghq.com" and US3 is available at "us3.datadoghq.com".

Asking for the IP address of the A record for each service gives us a hint where they are located:

$ dig A app.datadoghq.com
; <<>> DiG 9.10.6 <<>> A app.datadoghq.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16096
;; flags: qr rd ra; QUERY: 1, ANSWER: 9, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;app.datadoghq.com.		IN	A
;; ANSWER SECTION:
app.datadoghq.com.	11	IN	CNAME	alb-web-2019-shard1-1762251229.us-east-1.elb.amazonaws.com.
alb-web-2019-shard1-1762251229.us-east-1.elb.amazonaws.com. 14 IN A 52.22.112.220
alb-web-2019-shard1-1762251229.us-east-1.elb.amazonaws.com. 14 IN A 18.205.215.170
alb-web-2019-shard1-1762251229.us-east-1.elb.amazonaws.com. 14 IN A 34.192.68.67
alb-web-2019-shard1-1762251229.us-east-1.elb.amazonaws.com. 14 IN A 54.210.66.118

So it looks like US1 is located in Amazon's us-east-1, a set of five datacenters in northern Virginia.

Where Datadog's US3 is located

Asking for the IP address of the A record for US3:

$ dig A us3.datadoghq.com
; <<>> DiG 9.10.6 <<>> A us3.datadoghq.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47450
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;us3.datadoghq.com.		IN	A
;; ANSWER SECTION:
us3.datadoghq.com.	299	IN	A	20.69.148.133
us3.datadoghq.com.	299	IN	A	20.69.148.69
us3.datadoghq.com.	299	IN	A	20.51.76.5

Doing a reverse IP address lookup reveals that these IP's belong to Microsoft in Washington state, so Datadog is likely using Microsoft Azure for data stored in US3.

Which one to choose

The primary consideration is travel time. It takes a packet about 40ms to travel across the USA, so each roundtrip is about 80ms. If your other data is located on the West Coast, you should consider US3; if your data is located on the East Coast, consider US1.

US3 is newer and may have newer hardware and better configuration; on the flip side, it may not be as well tested because it's the second one.

What about US2 and US4?

US2 and US4 return SOA results for a DNS query. US5 returns a Google data center located in Missouri (maybe that one is next?)

Liked what you read? I am available for hire.

Building a better home network

I finally got my home network in a place where I am happy with it. I wanted to share my setup and what I learned about it. There has never been a better time to set up a great home network; there are several new tools that have made this easier and better than in the past. Hopefully this will help you set up your home network!

My house

My house is two stories on a standard 25 x 100 square foot San Francisco lot. The ground floor looks roughly like this:

 --------------------------------------
|               |                      |
|               |         |   Office   |
|    Garage     | Mudroom |            |
|               |         |-------------
|                           | | | | | |
 ---------------------------------------

Upstairs looks like this:

 --------------------------------------
|    ___________                       |
|               |        Living Room   |
|    Bedroom    | Kitchen              |
|               |         -------------
|               |           | | | | | |
 ---------------------------------------

We have a Roku in the living room. My goals for home internet were:

  • Good wireless connection in every room
  • Ethernet connections in the office
  • Ethernet connection to the Roku
  • Synology network attached storage (NAS) and other external hard drives reachable from anywhere in the house

We are lucky to have Sonic Fiber internet service. Sonic comes in to a box in the garage, and an Ethernet line runs from there to the mudroom. None of the other rooms have Ethernet connections.

Initial setup

Sonic really wants to push Eero routers to everyone.1 Eero is fairly easy to set up, and Sonic collects a small fee from renting the router to you. You can extend your home network by adding more Eero's into a mesh network.

If you have a small apartment, an Eero is probably going to be a good fit. However, the mesh network was not great for achieving any of the goals I had in mind. The repeaters (Eero beacon) do not have any Ethernet out ports. It was difficult to extend the network from the mudroom to the bedroom without renting two extenders, which added about $100 per year, increased latency and lowered speeds. Further, clients on the network kept connecting to an Eero that was further away, instead of the closest one.

Powerline

(NB: please don't stop reading here as I don't recommend this.) My next step was to replace the Eero's with a traditional Netgear wireless router in the mudroom. This also could not reach to the bedroom. So I bought a powerline adapter and plugged one end in near the router and the other end in the bedroom.

Powerline adapters send signal via electric current in your house. They don't offer great speeds. Devices on your network that use a lot of electricity, like laundry machines or the microwave, can render the powerline connection unusable.

There are probably better solutions for you than powerline adapters in 2020.

Extending Ethernet to more rooms

I called a cabling company about the possibility of running Ethernet to more rooms in the house. We decided the bedroom would be very easy since it's directly above the garage. It took a team of two two hours to drill a hole in the garage, run a cable up the side of the house to the bedroom, and install an Ethernet port in the bedroom. This cost about $200.

We looked at running Ethernet to other rooms but the geography of the stairs made this really tricky.

Side note: future proofing cabling

Our house has coax cables - the traditional method of getting e.g. cable TV service - running from the garage to four rooms in the house, but it doesn't have Ethernet set up. This is disappointing since it was built within the last decade.

There are two things you can do to future proof cable runs in your house, and ensure that cables can be replaced/swapped out if mice eat them or whatever. I highly recommend you implement them any time you are running cable. One is to leave a pull cable in the wall next to whatever cable you are installing. If you need to run a new cable, you can attach it to the pull cable, and then pull it all the way through from one end to the other.

Normally cables will be stapled to the wall interior, which makes them impossible to pull through. The other option is to leave cables unstapled. This will let you use the coax/other cable directly as the pull cable. In general though it's better to just leave a second pull line in the wall behind the port.

Without either of these solutions in place, running new cables is going to be messy. You can either try to hide it by running it along the exterior walls or ceiling of your house, or drill holes in the wall every few feet, pass a new cable through, and then patch up the holes.

Side note: cat 5 vs. cat 6

Your internet speed will be bottlenecked by the slowest link in the network. Be careful it isn't your cables!

There are two flavors of Ethernet cable. Category 5 is cheaper, but can only support speeds of 100 Mbps. Category 6 is slightly more expensive but you will need it to get full gigabit speeds.

The Ethernet cables that come with the products you buy may be Cat 5 or 6. Be careful to check which one you are using (it should be written in small print on the outside of the cable).

DHCP

To load google.com, your computer looks up the IP address for Google and sends packets to it. So far so good, but how does Google send packets back? Each client on the network needs a unique local IP address. The router will translate between an open port to Google, say, port 44982, and a local IP address, say, 192.168.0.125, and send packets it receives from the broader Internet on port 44982 to the client with that IP address.

What happens if two clients on your network try to claim the same local IP address? That would be bad. Generally you set up a DHCP server to figure this out. When your phone connects to a wifi network it sends out a packet that says basically "I need an IP address." If a DHCP server is listening anywhere on the network it will find an empty IP address slot and send it back to the phone.2 The phone can then use that IP address.

Generally speaking, a consumer wireless router has three components:

  • wireless radios, that broadcast a network SSID and send packets to and from wireless clients.
  • an Ethernet switch that can split an incoming Internet connection into four or more out ports. Generally this has one WAN port (that connects to your modem/ISP) and four LAN ports (that connect to local devices on your network)
  • a DHCP server.

You can buy products that offer each of these independently - a four way switch without a radio or DHCP server will cost you about $15. But this is a convenient bundle for home networks.

If your network contains multiple switches or multiple routers you need to think about which of these devices will be giving out DHCP.

Two Routers, Too Furious

At this point my network had one router in the bedroom and one router upstairs in the living room, via an ungainly cable up the stairs. So I had good coverage in every room, and the Roku hooked up via Ethernet to the living room router, but this setup still had a few problems. I didn't have the office wired up, and the NAS only worked when you were connected to the living room router.

Furthermore, I kept running into issues where I would walk from the living room to the bedroom or vice versa but my phone/laptop would stay connected to the router in the room I was just in. Because that router was outside its normal "range", I would get more latency and dropped packets than usual, which was frustrating.

How to diagnose and measure this problem

On your laptop, hold down Option when you click the wifi button, and you'll get an extended menu that looks like this.

The key value there is the RSSI parameter, which measures the signal quality from your client to the router. This is currently at -46, a quite good value. Lower than -65 and your connection quality will start to get dicey - you will see lower bandwidth and higher latency and dropped packets.

Apple devices will hang on to the router they are currently connected to until the RSSI gets to -75 or worse, which is a very low value. This is explained in gory detail on this page. Because router coverage areas are supposed to overlap a little bit, this means the connection will have to get very bad before your phone or laptop will start looking for a new radio.

Adjust the power

Generally this means that you don't want the coverage area for the router to reach to the center of the coverage area for the other router, if you can help it. If the coverage areas don't overlap that much, clients will roam to the closest router, which will improve the connection.

You can adjust the coverage area either by physically moving the router or by lowering the power for the radios (which you may be able to do in the admin panel for the router).

If neither of these works, as a last ditch attempt you can give your routers different network names. But this makes it more difficult to keep a connection when you roam from one router to the other.

Ethernet Over... Coax?

I had not managed to get a fixed connection to the office, which would have required snaking a Ethernet cable over at least two doorways and three walls. However, I heard recently about a new technology called MoCA (multimedia over coax), which makes it possible to send an Ethernet signal over the coax line from the garage to the office. I bought a MoCA adapter for each end of the connection - about $160 in total - and wired it up and... it worked like a charm!

Moca ethernet over coax connector in
office

The latency is slightly higher than traditional Ethernet, but only by a few milliseconds, and the bandwidth is not as high as a normal wired connection but it's fine - I am still glad to be able to avoid a wireless connection in that room.

This change let me move my NAS into the office as well, which I'm quite happy about.

Letting Everything Talk to Each Other

At this point I had a $15 unmanaged switch in the garage that received a connection from the Sonic Fiber router, and sent it to three places - the bedroom, the living room and my office. However, the fact that it was unmanaged meant that each location requested a public IP address and DHCP from Sonic. Sonic was not happy with this arrangement - there is a limit of 8 devices per account that are stored in a table mapping a MAC address to an IP address, and after this you need to call in to have the table cleared out. This design also meant that the clients on my network couldn't talk to each other - I couldn't access the NAS unless I was connected to the living room router.

The solution was to upgrade to a "managed" switch in the garage that could give out DHCP. You can buy one that is essentially a wifi router without the radio for about $60. This has the same dashboard interface as your router does and can give out DHCP.

Once this switch was in place, I needed to update the routers to stop giving out DHCP (or put them in "pass through mode") so only a single device on the network was assigning IP addresses. I watched the routers and NAS connect, then assigned static IP's on the local network to each one. It's important to do this before you set them in pass-through mode so you can still access them and tweak their settings.

You should be able to find instructions on pass-through mode or "disable DHCP" for your router online. You may need to change the IP address for the router to match the static IP you gave out in the previous paragraph.

That's it

I finally have a network that supported everything I want to do with it! I can never move now.

Garage router setup

I hope this post was helpful. I think the most important thing to realize is that if you haven't done this in a few years, or your only experience is with consumer grade routers, there are other tools/products you can buy to make your network better.

If you are interested in this space, or interested in improving your office network along these lines, I'm working with a company that is making this drop dead easy to accomplish. Get in touch!

1. I posted on the sonic.net forums to get help several times. Dane Jasper, the Sonic CEO who's active on the forums, responded to most of my questions with "you should just use Eero." I love that he is on the forums but Eero is just not great for what I'm trying to do.

2. I'm simplifying - there are two roundtrips, not one - but the details are really not that important.

Liked what you read? I am available for hire.

Using an AWS Aurora Postgres Database as a Source for Database Manager Service

Say you have a Aurora RDS PostgreSQL database that you want to use as the source database for Amazon Database Manager Service.

The documentation is unclear on this point so here you go: you can't use an Aurora RDS PostgreSQL database as the source database because Aurora doesn't support the concept of replication slots, which is how Amazon DMS migrates data from one database to another.

Better luck with other migration tools!

Liked what you read? I am available for hire.

Going Solo, Successfully

Three years ago I quit my job and started consulting full time. It's worked out really well. I get to spend more time doing things I like to do and I've been able to deliver great products for clients. I wanted to go over some tips for starting a successful consulting business.

  • Charge more - Everyone says it and it's true. I started out charging a monthly rate that was close to my full time salary / 12. This is not a good idea because you have overhead that your employer is no longer covering - health care probably the biggest one, you don't have paid vacations, there may be unpaid downtime between contracts and also companies might not pay you. You need to be charging a lot more to just break even.

    I dread "what's your rate" conversations every time and they haven't gotten easier. Before I quote my rate I reread the details of the High Tech Employee Antitrust case to pump myself up - it reminds me that I'm negotiating with a company that couldn't care less really and I am the only one who's going to stand up for myself. If you think you don't need the extra money - get it anyway, and then donate more to charities at the end of the year/buy CD's/put it in the stock market/give it to the government. Amazon just made $11 billion and paid $0 in taxes; you are going to spend an additional dollar better than Amazon's executives will.

    If you are not sure how much to charge, quote each new client more than the last. Your quote is often a signal of your quality so it's not even really the case that demand slopes downwards as you charge more.

    If you are working with a client and they are very happy with your work and want to extend your contract consider asking for a higher rate. "Now that you know how good I am," etc.

  • Get the money - Signed contracts, work performed, don't matter until the money hits your bank account. I learned this the hard way. If a company is going under your invoices are worthless. You can hold onto the company IP but that's probably also worthless. You can sue but at the end you will win a judgment that you can collect money from a company that doesn't have any to pay you.

    Try to get as much of the contract as you can paid up front - I generally ask for half or more up front. If a company offers Net 30 ask if you can shorten it to Net 5 or 10 or submit your invoices in advance. Submit invoices on time - it's a very costly mistake and you won't learn its importance until it's too late.

    Try as hard as you can to figure out the financial health of the company - if you can do your homework in the press or ask questions to your primary point of contact, like how much cash they are burning, how many months of runway do you have. If a company is not forthcoming with this information it's a red flag that they may not be able to pay you.

    If you see any red flags - the company wants to cut the contract short, people start leaving, company suddenly cuts back on perks - tell your contact that you need to be paid upfront or you are not going to work anymore. If they push back on this they may not have the cash to pay you at all. It's a crappy situation but better to cut your losses than to work for someone that can't actually pay.

  • Don't charge by the hour - I have never actually done this so I can't speak to how bad it is but don't do this. You don't want a client to cut you loose at 3pm and suddenly you lose three hours you were counting on. Charge per week.

  • Get a lawyer - Get a lawyer to review every contract you sign. Read through them, flag concerning things to the lawyer. They will suggest language. Send the language to the company. You are not being difficult when you do this, the company does this all the time. Lawyers are expensive, expect to pay north of $400 per hour and contract review can take 30-60 minutes. This money is worth it.

    A good clause to try to push for is limitation of liability. You don't want to be in a situation where $2 million of damages occurred or a high value client left the company because of an error you pushed and the company is suddenly coming after everything you own. Similarly the company may want to protect against you trying to sue them for a high amount of damages to your reputation, future business etc. Limiting the total liability to the size of the contract, or a multiple of the size of the contract - on both sides - can be a good move.

  • Register as a Company - Consult with the lawyer you hired on what kind of company you want to be. Generally the more "company-like" you are the harder it is for companies to try to take your personal assets. I don't have employees or shareholders so I have a single member LLC that is disregarded for tax purposes — read this description from the IRS. Sometimes businesses are confused what this means when I tell them or try to sign up for things. Still, it is a good fit for me. It may not be for you - I am not a lawyer, you should talk with one, learn the tradeoffs and see what makes sense for your business.

  • Make Sure Contracts Are Signed With the Company - The contracts you sign should be between the client you are working with and your company NOT between the client and you personally. Discuss this with your lawyer.

  • Get an accountant - As a small business a lot of stuff is tax deductible - a home office, client travel, for example, even if it's just across town - and you want to make sure you are getting ~35% off on everything that you can. An accountant will help you with this.

  • Market yourself - Not necessarily ads or sponsorships, but: everyone you've worked with full time should know they can hire you now. If they don't then reach out to people and let them know. Put up a website that engineers can send to their boss. My website isn't fancy but it is effective. Order business cards - VistaPrint is garbage, order from moo.com. If you have a website or open source projects put a note at the bottom advertising that you're available for hire, like the one at the bottom of this post.

  • Set up separate accounts for everything - Open separate accounts for your business. Get a business credit card or just a separate cash back card on your personal account. I don't have a checking account registered for the business but I opened a second checking account that I call my "business account". Clients pay into that account and I pay the business credit card out of that account. I even have a separate Clipper card that I use for business travel.

    There are two reasons for this. It makes tax accounting a lot easier. I know that every transaction on the business Clipper card is work travel and can be expensed; I don't have to try to remember what I was doing when I paid $2 to SamTrans at 5:34pm on July 27.

    Second, if you don't keep good records for the business - if you "commingle" funds between your personal life and the business - it makes it much easier for clients to go after your personal assets, what's called "piercing the veil." Separate accounts (and discipline about transfers!) make it much easier to argue that your business income and spending and personal income and spending are separate even if you don't necessarily have the legal structures to back them up.

    I also set up a new Github account for every company I work with. This avoids any issues with emails going to the wrong place, or the need to grant/revoke permissions to any 3rd party tools a company uses. I use github.com/kevinburke/swish to swap SSH settings between my Github accounts:

    $ cat $(which notion-github)
    #!/usr/bin/env bash
    ${GOPATH}/bin/swish --identity-file ${HOME}/.ssh/github_notion_ed25519 --user kevinburkenotion
  • Balancing multiple clients: If you can do this or do things like charge retainers, great. I find it really hard to switch contexts so I work with one client at a time and treat it as a full time job. Do what works for you.

  • Give back to the tools that make you successful - I give a percentage of my earnings every year to support software tools that help me do my job - iTerm2, Vim, Go, Postgres, Node.js, Python, nginx, various other open source projects. You should consider doing this too. (If you are an open source maintainer reading this - tell me how I can pay you!!)

Liked what you read? I am available for hire.

AWS’s response to ALB internal validation failures

Last week I wrote about how AWS ALB's do not validate TLS certificates from internal services. Colm MacCárthaigh, the lead engineer for Amazon ELB, writes:

I’m the main author of Amazon s2n, our Open Source implementation of TLS/SSL, and a contributor to the TLS/SSL standards. Hopefully I’m qualified to chime in!

You’re right that ALB does not validate the certificates on targets, but it’s important to understand the context that ALBs run in to see why this is still a pending item on our roadmap, rather than something we’ve shipped already as a “must have”.

The role that server certificates play in TLS is to authenticate the server, so that it can’t be impersonated or MITM. ALBs run exclusively on our Amazon VPC network, a Software Defined Network where we encapsulate and authenticate traffic at the packet level. We believe that this protection is far stronger than certificate authentication. Every single packet is being checked for correctness, by both the sender and the recipient, even in Amazon-designed hardware if you’re using an Enhanced Networking interface. We think it’s better than the ecosystem where any CA can issue a certificate at any time, with still limited audit controls (though certificate transparency is promising!).

The short of it is that traffic simply can’t be man-in-the-middled or spoofed on the VPC network, it’s one of our core security guarantees. Instances, containers, lambda functions, and Elastic Network Interfaces can only be given IPs via the secure and audit-able EC2 APIs. In our security threat model, all of this API and packet level security is what plugs in the role performed by server certificates.

This contrasts with the older EC2 classic network, a big shared network, which is why classic load balancers do support backend authentication.

We actually find that many customers actually load their targets and backends with “invalid” certificates that are self-signed or expired, because it’s so operationally hard to stay up-to-date and it’s hard to automate, even with projects like LetsEncrypt, when your instances are inherently unreachable on the internet.

All that said, we’ll be adding support for certificate validation, probably including pinning and private CAs! Used well with good operational controls it can be a measure of defense in depth, and it’s important for cases such as targets hosted on less secure private networks such as on-premesis data-centers.

Liked what you read? I am available for hire.

Amazon’s ALB’s do not validate TLS certificates from internal services

If you are using an Amazon Application Load Balancer, and forwarding traffic to internal services using HTTPS, the ALB will not validate the certificate presented by the internal service before forwarding the traffic.

So we're clear here, let's say you are running a web server on Amazon ECS. The webserver is configured to present TLS certificates to incoming requests, receive encrypted TLS traffic. The web server is part of a ELB v2 Target Group. There are two hops in this flow:

  • Customer iPhone/laptop/whatever connects to Amazon ALB. You can upload a certificate to Amazon to present to the customer. The customer's iPhone/browser/whatever will (hopefully) verify that certificate before sending requests to the ALB.

  • The ALB forwards the request to your webserver. ALB will look up the right Listener for the request, and then forward it to a ELB v2 Target Group. You can configure the Target Group to receive requests over HTTP or HTTPS.

  • If you choose HTTPS, the ALB will establish a connection and request a certificate from a random host in the Target Group. It will not validate that certificate; it will just send the traffic.

Here is the configuration for a Target Group. There's no check box for "validate HTTPS traffic from internal service."

The entire point of HTTPS is to encrypt traffic. Otherwise, a random person snooping on the network could present a weak certificate and send back whatever data it wants. We learned in 2014 that the NSA was doing this, between and inside data centers and at key points in the US.

It's unacceptable for a major Internet service in 2018 to blindly accept certificates presented by an internal service without validating them.

For the moment, I suggest using the Network Load Balancer type, which forwards the raw TCP traffic to your machine. You don't get any of the nice features of an ALB, but at least you will have the ability to reject raw traffic. If you know of other providers that offer load balancers with TLS certificate validation, please send me an email.

Update: Please read the reply from AWS.

Liked what you read? I am available for hire.

Profile Anything in Any Language in Under a Minute

Want to know how to get rough profiling of any tool written in any language, whether you control the source or not? Keep reading.

For example, here's the output you get when you compile Go from source code. Each of these lines prints out a few seconds apart. How would you go about getting timings for each individual step, or how would you notice if one step was suddenly executing much more slowly than it did three days ago?

$ GOROOT_BOOTSTRAP=~/go1.10 ./make.bash
Building Go cmd/dist using /Users/kevin/go1.10.
Building Go toolchain1 using /Users/kevin/go1.10.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for darwin/amd64.
---
Installed Go for darwin/amd64 in /Users/kevin/go
Installed commands in /Users/kevin/go/bin

There are a few different methods you can use to get a sense for this, but I wanted to share my favorite one. Don't reach for a profiler, don't try to set a global variable, or do start/stop timing in code. Don't even start figuring out how to configure a logger to print timestamps, and use a log output format.

Then when you've done all this, rerun the command and pass it through a command that attaches timestamps to every output line. For example, I have a tool called tss (based on ts in the moreutils package) that does this. Pipe your command to tss and you'll get output like this:

$ GOROOT_BOOTSTRAP=~/go1.10 ./make.bash | tss
     11ms         Building Go cmd/dist using /Users/kevin/go1.10.
    690ms   679ms Building Go toolchain1 using /Users/kevin/go1.10.
  11.435s 10.745s Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
  18.483s  7.048s Building Go toolchain2 using go_bootstrap and Go toolchain1.
  40.788s 22.305s Building Go toolchain3 using go_bootstrap and Go toolchain2.
 1m0.112s 19.324s Building packages and commands for darwin/amd64.
1m19.053s 18.942s ---
1m19.053s      0s Installed Go for darwin/amd64 in /Users/kevin/go
1m19.053s      0s Installed commands in /Users/kevin/go/bin

The first column is the amount of time that has elapsed since the tool was invoked. The second column is the amount of time that has elapsed since the previous line was printed.

The good thing about this technique is you can use it in pretty much any situation - you don't have to know anything about the code besides:

  • how to invoke it
  • a rough idea of which code paths get hit
  • how to print things to stdout

Just start adding print lines around interesting bits of code.

For example, you might have this in your test suite in Ruby to empty the database. How much time does this add to every test in your suite? Put print lines before and after to find out.

config.after(:each) do
  puts "database clean start"
  DatabaseCleaner.clean
  puts "database clean end"
end

Similarly, if you wanted to find out why it takes so long for your test suite to start running or your web framework to boot, you can annotate your spec_helper or bootstrap file, the top of the test file, before and after require lines, and the start and end of the first test in the suite. You can even drill down into third party libraries, just find the source file on disk and start adding print statements.

You can also use it with strace! strace will show you a list of the syscalls being opened by your program. Passing the output through tss can show you which syscall activity is taking a lot of time, or when your program is doing a lot of stuff without going to the disk.

Caveats

If you pipe program a to program b, and program a exits with a non-zero return code, Bash will by default report a return code of zero for the entire operation. This means if you are piping output to tss or ts you may accidentally change a failing program to a passing one. Use set -o pipefail in Bash scripts to ensure that Bash will return a non-zero return code if any part of a pipe operation fails. Or add this to a Makefile:

SHELL = /bin/bash -o pipefail

Second, the program being profiled may not flush output to stdout, that is, it may print something but it may not appear on the screen at the same time it was printed. If the timings on screen don't seem to match up with your intuition, check that the program is flushing print statements to the stdout file descriptor after printing them.

Liked what you read? I am available for hire.

You Shouldn’t Use Faker (or other test randomization libraries)

It's a common, and distressing, pattern to have factories in tests that call out to a library like Faker to "randomize" the returned object. So you might see something like this:

User.create({
    id: uuid(),
    email: faker.random.email(),
    first_name: faker.random.firstName(),
    last_name: faker.random.lastName(),
    address: faker.random.address(),
});

package = Package.create({
    width: faker.math.range(100),
    height: faker.math.range(200),
    length: faker.math.range(300),
})

This is a bad practice and it's distressing to see it in such widespread use. I want to explain some reasons why it's bad.

The corpus is not large enough to avoid uniqueness failures

Let's say you write a test that creates two different users, and for whatever reason, the test fails if the users have the same name (due to a database constraint failure, write to same map key, array having 1 value instead of 2, &c). Faker has about three thousand different possible values for a first name. This means that every 3000th run on average, the randomly chosen names will collide, and your test will fail. In a CI environment or on a medium size team, it's easy to run the tests 3000 times in a week. Alternatively, if the same error exists in ten tests, it will emerge on average once every 300 runs.

The onus on catching failures falls on your team

Because the error only appears once in every 3000 test runs, the odds are you haven't fully or partially probed the space of possible test values when you submit your code and tests for review. This means that the other members of your team will be the ones who run into the error caused by your tests. They might lack the context necessary to effectively debug the problem. This is a very frustrating experience for your teammates.

It's hard to reproduce failures

If you don't have enough logging to see the problem on the first failure, it's going to be difficult to track down a failure that appears only once every 3,000 test runs. Run the test ten or even fifty times in a loop and it still won't appear.

An environment where tests randomly fail for unknown reasons is corrosive to build stability. It encourages people to just hit the "Rebuild" button and hope for a pass on the next try. Eventually this attitude will mask real problems with the build. If your test suite takes a long time to run (upwards of three minutes) it can be demoralizing to push the deploy or rebase button and watch your tests fail for an unrelated reason.

It's a large library (and growing)

It's common for me to write Node.js tests and have an entire test file with 50 or 100 tests exit in about 10 milliseconds.

By definition a "fixture" library needs to load a lot of fake data. In 2015 it took about 60 milliseconds to import faker. It now takes about 300 milliseconds:

var a = Date.now()
require('faker');
console.log(Date.now() - a);
$ node faker-testing/index.js
303

You can hack around this by only importing the locale you need, but I've never seen anyone do this in practice.

What you need instead

So faker is a bad idea. What should you use? It depends on your use case.

  • Unique random identifiers. If you need a random unique identifier, just use a uuid generator directly. If you create 103 trillion objects in a given test, the odds of a uuid collision are still only one in a billion, so the odds of a collision are much lower than using the faker library. Alternatively, you can use an incrementing integer each time you call your factory, so you get "Person 0", "Person 1", "Person 2", etc.

  • A fuzzer, or QuickCheck. Faker has tools for randomizing values within some input space, say a package width that varies between 0 and 100. The problem is that the default set up is to only exhaust one of those values each time you run the test, so you're barely covering the input space.

    What you really want is a fuzzer, that will generate a ton of random values within a set of parameters, and test a significant percentage of them each time the test runs. This gives a better guarantee that there are no errors. Alternatively, if there are known edge cases, you can add explicit tests for those edge cases (negative, 0, one, seven, max, or whatever). Tools like Haskell's QuickCheck or afl-fuzz make it easy to add this type of testing to your test suite.

If you want random human-looking data I suggest adding it yourself and using some scheme that makes sense for your company. For example, users named Alice and Bob are always friends that complete actions successfully, and Eve always makes errors. Riders with Android phones always want to get picked up in Golden Gate Park and taken to the Mission; riders with iPhones want to get picked up in the Tenderloin and taken to Russian Hill (or something). This scheme will provide more semantic value, and easily pinpoint errors (a "Bob" error means we should have submitted something successfully but failed to, for example).

That's all; happy testing, and may your tests never be flaky!

Liked what you read? I am available for hire.

Proxying to a subcommand with Go

There are a number of Unix commands that manipulate something about the environment or arguments in some way before starting a subcommand. For example:

  • xargs reads arguments from standard input and appends them to the end of the command.

  • chpst changes the process state before calling the resulting command.

  • envdir manipulates the process environment, loading variables from a directory, before starting a subcommand

Go is a popular language for writing tools that shell out to a subprocess, for example aws-vault or chamber, both of which have exec options for initializing environment variables and then calling your subcommand.

Unfortunately there are a number of ways that you can make mistakes when proxying to a subcommand, especially one where the end user is specifying the subcommand to run, and you can't determine anything about it ahead of time. I would like to list some of them here.

First, create the subcommand.

cmd := exec.Command("mybinary", "arg1", "-flag")
// All of our setup work will go here
if err := cmd.Run(); err != nil {
    log.Fatal(err)
}

Standard Output/Input/Error

You probably want to proxy through the subcommand's standard output and standard error to the parent terminal. Just reassign stdout/stderr to point at the parent command. Obviously, you can redirect these later.

cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
// If you need
cmd.Stdin = os.Stdin

Environment Variables

By default the subcommand gets all of the environment variables that belong to the Go process. But if you reassign cmd.Env it will only get the variables you provide, so you have to be careful with your implementation.

cmd.Env = []string{"TZ=UTC", "MYVAR=myval"}

Signals

If someone sends a signal to the parent process, you probably want to forward it through to the child. This gets tricky because you can receive multiple signals, that all should be sent to the client and handled appropriately.

You also can't send signals until you start the process, so you need to use a goroutine to handle them. I've copied this code from aws-vault, with some modifications.

if err := cmd.Start(); err != nil {
    log.Fatal(err) // Command not found on PATH, not executable, &c.
}
// wait for the command to finish
waitCh := make(chan error, 1)
go func() {
    waitCh <- cmd.Wait()
    close(waitCh)
}()
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan)

// You need a for loop to handle multiple signals
for {
    select {
    case sig := <-sigChan:
        if err := cmd.Process.Signal(sig); err != nil {
            // Not clear how we can hit this, but probably not
            // worth terminating the child.
            log.Print("error sending signal", sig, err)
        }
    case err := <-waitCh:
        // Subprocess exited. Get the return code, if we can
        var waitStatus syscall.WaitStatus
        if exitError, ok := err.(*exec.ExitError); ok {
            waitStatus = exitError.Sys().(syscall.WaitStatus)
            os.Exit(waitStatus.ExitStatus())
        }
        if err != nil {
            log.Fatal(err)
        }
        return
    }
}

Just Use Exec

The above code block is quite complicated. If all you want to do is start a subcommand and then exit the process, you can avoid all of the above logic by calling exec. Unix has a notion of parent and children processes - when you start a process with cmd.Run, it becomes a child process. exec is a special system call that sees the child process become the parent. Consider the following program:

cmd := exec.Command("sleep", "3600")
if err := cmd.Run(); err != nil {
    log.Fatal(err)
}
fmt.Println("terminated")

If I run it, the process tree will look something like this:

 | |   \-+= 06167 kevin -zsh
 | |     \-+= 09471 kevin go-sleep-command-proxy
 | |       \--- 09472 kevin sleep 3600

If I run exec, the sleep command will replace the Go command with the sleep command. In Go, this looks something like this:

 | |   \-+= 06167 kevin -zsh
 | |     \--= 09862 kevin sleep 3600

This is really useful for us, because:

  • We don't need to proxy through signals or stdout/stderr anymore! Those are automatically handled for us.

  • The return code of the exec'd process becomes the return code of the program.

  • We don't have to use memory running the Go program.

Another reason to use exec: even if you get the proxying code perfectly right, proxying commands (as we did above) can have errors. The Go process can't handle SIGKILL, so there's a chance that someone will send SIGKILL to the Go process and then the subprocess will be "orphaned." The subprocess will be reattached to PID 1, which makes it tougher to collect.

I should note: Syscalls can vary from operating system to operating system, so you should be careful that the arguments you pass to syscall.Exec make sense for the operating system you're using. Rob Pike and Ian Lance Taylor also think it's a bad idea.

Liked what you read? I am available for hire.

Running Bazel tests on Travis CI

Bazel is a build system that was recently open sourced by Google. Bazel operates on configuration files - a WORKSPACE file for your entire project, and then per-directory BUILD.bazel files. These declare all of the dependencies for your project, using a language called Skylark. There are a number of Skylark rules for things like downloading and caching HTTP archives, and language-specific rules.

This can be a pain to set up for a new project, but the advantage is you get builds that are perfectly hermetic. In addition, if your project has a lot of artifacts, Bazel has useful tools for caching those artifacts, and builds a dependency graph that makes it easy to see when things change. For example, if you compile Javascript from Typescript, you can cache the compiled Javascript. If the Typescript source file changes, Bazel knows that only the file that changed needs to be regenerated.

Anyway there are some gotchas about running Bazel on Travis CI that I wanted to cover, and when I was looking at instructions they weren't great.

1) Install Bazel

The official instructions ask you to add the Bazel apt repository, and then install from there. You can shave a few seconds by only updating that repository, instead of all of them.

echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
curl --silent https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
sudo apt-get update -o Dir::Etc::sourcelist="sources.list.d/bazel.list" \
    -o Dir::Etc::sourceparts="-" \
    -o APT::Get::List-Cleanup="0"
sudo apt-get install openjdk-8-jdk bazel

You can shave a few more seconds by just getting and installing the deb directly, though this requires you to manually update the version number when Bazel releases a new version. (This is in a Makefile, you'll need to change the syntax for Bash or a travis.yml file).

BAZEL_VERSION := "0.7.0"
BAZEL_DEB := "bazel_$(BAZEL_VERSION)_amd64.deb"

curl --silent --output /tmp/$(BAZEL_DEB) "https://storage.googleapis.com/bazel-apt/pool/jdk1.8/b/bazel/$(BAZEL_DEB)"
sudo dpkg --force-all -i /tmp/$(BAZEL_DEB)

2) Run Tests

Bazel ships with a ton of flags but these are probably the ones you want to know about/use:

  • --batch: Run in batch mode, as opposed to client-server mode.

  • --noshow_progress / --noshow_loading_progress: Travis pretends to be a TTY so it can show colors, but this causes Bazel to dump a whole bunch of progress messages to the screen. It would be nice if you could say "I can handle ANSI escape sequences relating to colors, but not screen redraws" but you only get the binary "I'm a TTY" / "I'm not", which is unfortunate. Anyway these options turn off the "show progress" log spam.

  • --test_output=errors: By default Bazel logs test failures to a log file. This option will print them to the screen instead.

  • --features: Enable features specific to a test runner. In particular for Go you may want the --features=race rule to run tests with the race detector enabled.

3) Cache Results

Bazel is much faster when it can load the intermediate results from a cache; on one library I maintain, tests run in 13 seconds (versus 47 seconds) when they are run with a full cache.

However, the Bazel caches are enormous. Caches for Go tests include the Go binary and source tree, which runs about 95MB alone. Add on other artifacts (including the Java JDK's used to run Bazel) and it's common to see over 200MB. If you are uploading or downloading caches from somewhere like S3, this can be a really slow operation. Also, it's difficult to remove large items from Bazel's cache without removing everything; the file structure makes the cache a little tough to remove parts of.

In particular, Travis's cache does go to S3, which means it skips the local caching proxies Travis sets up for e.g. apt and npm. Travis times out the cache download after 3 minutes, and Bazel quits if the cache is corrupt. In my experience Travis will not complete a 200MB cache download in 3 minutes, so your builds will fail.

That's it

For now it's probably best to eat the cold cache startup time. That's really unfortunate though; I hope we can find a better solution in the future. One way may be for Travis to run a local "remote cache", which responds to the WebDAV protocol. This would allow closer caching of Bazel objects.

It would also be nice if Travis had a "language: bazel" mode, which would come with the right version of Bazel installed for you.

Liked what you read? I am available for hire.