Technology Culture: The Famous (Infamous?) Netflix Culture Deck

Sheryl Sandberg called the Netflix Culture Deck possibly “the most important document ever to come out of the Valley.” For my part, I dig it, and I think it’s a powerful draw to potential employees. So when Adrian Cockcroft was tweeting about #netflixculture with gems from the deck, I sent him a tongue-in-cheek DM saying “Hey! If you keep doing that you’ll make it impossible for the rest of us to hire good people!” His sensible reply was, “If it’s good, why aren’t more people copying?” This is an excellent question, and I’ve been trying to formulate a good concise answer. That’s been really hard, so I’m going to start with what I like about it.

There’s a lot to digest in the deck, so here’s my summary:

  • Field the best team that you can
  • Give the people on that team the freedom (and responsibility) to get shit done

The second bullet really resonates with me. My philosophy on management can be summed up in one word: “empowerment.” The Netflix formula is essentially the same as mine: give people the context they need to understand the problems that need to be solved, give them the resources to solve those problems, and then get out of their way. This is not a new or revolutionary idea. I developed a lot of my ideas from reading the literature on self-directed or high-performing work teams that was popular in the 80s. Netflix is applying a lot of these same essential ideas.

So let’s get back to Adrian’s question. Why don’t more people do this?

Freedom and Responsibility

Here’s my summary of these ideas from their deck:

  • Focus on results
  • Fix errors quickly, rather than preventing them
  • Avoid rules and policies

Focus on Results

I think a lot of people actually do do this. This is the basis of the self-directed work teams I mentioned earlier. In the olden days this was referred to as “management by objectives.” More recently people are referring to this as a Results Only Work Environment (ROWE). I try to set my teams up this way, by using metrics like “deliver projects on time” and “have less than X minutes of downtime per quarter” rather than “generate Y lines of code per month” or “process Z tickets per week.” Of course, not everyone does this. Why not? The first and most obvious reason is that it’s harder to manage. Measuring the quantity of work done (or time spent) is easy. How many hours does someone work? How many tickets do they handle? How many times do they break the rules? It’s very easy to generate reports showing these numbers. Judging by real business-relevant results is much harder, both in definition and measurement. Another reason organizations don’t do this is that it requires competent people who already know how to get results. In this way, it aligns nicely with Netflix’ culture of hiring and keeping only the best performers, but not every organization can hire and keep the best performers (more on this later). Focusing on results also requires effective communication from the top down: about the vision, goals, and strategy – as well as transparent communication about the actual results obtained. This communication introduces overhead and has a cost in efficiency – there’s a reason the Marines don’t operate as a democracy. And finally, Focus on Results requires trust in both directions (up and down) as well.

Back to the question at hand: why aren’t more people copying the ‘Focus on Results’ aspect of Netflix culture? In this specific area, I think the answer is that many people are. I think Netflix has advantages here though, where other parts of their culture (compensation, performance management, investment in communication) reinforce and enable a Focus on Results strategy that’s more holistic and more effective than at many other companies.

Fix Errors Quickly Rather than Preventing Them

Everyone in software development knows that it’s far better to find and fix bugs earlier in the cycle than later. Finding a bug after release to production has very high costs, both in impact to the business and in resources to fix (the person fixing the bug will have to context switch back to code that, in most cases, they wrote a long time ago, and then the fix will have to go through the full lifecycle of integration/test/deployment). Finding the same bug during the QA phase reduces these costs, and finding it during development largely avoids them. Ergo, the more bugs you can prevent from getting out of the development phase, the better.

Like other things that everyone knows, you periodically need to re-evaluate this assumption and make sure it’s still true and applicable. The Netflix philosophy says that this is not true, at least not universally and definitely not in their specific case. I agree. I believe that techniques like continuous deployment and automated testing have reduced both of the costs mentioned above (the context switching, and the cost of the lifecycle) to the point where this assumption doesn’t apply in many cases.

So why aren’t more people doing this? I think this realization (that the landscape around the cost of fixing bugs is changing) is starting to percolate through the industry, but like anything else that everybody knows, it will take a long time for this to happen – and probably a generational turnover that we’re only now beginning. And it will never be universally applicable because bugs in different industries have very different impacts – the cost of a bug that interrupts people’s ability to watch “She-Ra” on their Xbox is not at all comparable to the cost of a bug that causes a laser scalpel to cut the wrong artery during brain surgery.

Avoid Rules and Policies

This one seems to get the most attention. For example, there is no vacation policy at Netflix. The expense policy is 5 words, vs the 25 pages I’ve seen at other companies. The idea here seems to be twofold. The first is that rules and policies can be confining and in some cases get in the way of getting things done, so when you get rid of them you give people more freedom. The second is that it specifically frees up managers’ time. At other companies, they spend part of their days dealing with PTO requests, approving expense reports, and enforcing compliance with policies. Getting rid of those policies frees up managers to spend their time on leadership, which is a far more leveraged way to spend their time. While this has caught the most eyes and generated the most discussion, I think in terms of real impact this is the least important. Symbolically, of course, it is very important.

Why aren’t more people doing this? People are starting to drop their vacation policies, which I think is the easiest piece to adopt and probably a good marker for companies that will be implementing more of Netflix culture over time. I imagine there’s more reluctance on the expense policy – there’s a lot of emphasis these days on financial reporting and there’s been a lot of high-profile expense abuse in the news. I also think there’s a generational thing here though, and it’s related to the “fix errors rather than preventing them” idea above. My favorite quote from _Rework_ is that policies are “organizational scar tissue,” kind of like the old adage that behind every FAA regulation is a plane crash. When something goes wrong and people say “we can’t let this happen again!” the answer is generally some new rule or policy. That’s how it’s been done for a long long time, and getting away from that will be very difficult for a lot of folks. And just like the difference between showing videos and brain surgery, there are industries where having no vacation policy doesn’t make sense – for example in a bank or financial institution where corruption/graft are significant risks, forced vacations are used as a risk mitigation, because such schemes often can’t be sustained when the primary agent is away from their position for a week or two. Finally, for this to really work you have to have what the deck calls “responsible people.” Not every company can hire exclusively responsible people – in this case, as above, other elements of Netflix culture enable them to have the right workforce to support this element of their culture.

Field the Best Team You Can

Complementing the freedom and responsibility side of their culture, the other key element is having a workplace filled with nothing but “stunning colleagues.” They make a very useful analogy to a professional sports team, where the objective is to always have a superstar at every position. This leads to specific strategies and techniques:

  • Pay top of market
  • Don’t settle for adequate performance

Pay Top of Market

Netflix’ philosophy is to pay more than anyone else would. The yardsticks used are how much that person would be paid somewhere else, how much Netflix would pay someone else to come in and do the same job, and how much Netflix would be willing to pay to prevent someone from leaving. It’s the manager’s job at Netflix to make sure that each employee is compensated at the level that meets the above criteria, proactively. This simply short circuits a lot of problems that managers need to deal with at other companies around recruiting, retention, and incentivization. This is largely based on a belief that the best employees (in creative/inventive positions) contribute significantly more value (like 2x-10x) than the average employee – so that paying them more is in reality a bargain.

Why don’t more people do this? Well, obviously not everyone can pay top of market, because then it wouldn’t be top of market anymore. But plenty of companies have explicit policies to pay at mid-market, or at the 75th percentile, etc. I think a big part of the answer here is that there’d be a lot of disagreement that the best employees can be 10x more valuable than the average employee. This is actually something I would love to see more data on – the culture deck simply states this as a fact without supporting citations. In my experience, I think this is a reasonable statement – I’m really curious what other people think (please tell me in the comments!)

Don’t Settle For Adequate Performance

I believe this generates the most controversy from the culture deck. Basically at Netflix, if you do an OK job, you are given a nice severance package and shown the door so that a superstar can be brought in to do a better job. At most companies, you are expected to do an OK job, and you’re only shown the door if you’re not doing an OK job. Think of the sports team analogy – if your first baseman is batting .270 with 50 RBIs, he’s doing OK, but if you have the opportunity to bring in a first baseman who bats .320 with 95 RBIs – wouldn’t you do it?

Why don’t more people do this? You can only do this if you are convinced you can bring in a superstar to replace the person leaving, and that assumption requires some of the other Netflix culture characteristics: namely paying at top of market and having a workplace that’s already filled with “stunning colleagues.” So this is something you can really only do if you’ve already adopted most of the rest of #netflixculture.


What really jumped out at me from writing this post is that a lot of the components of #netflixculture leverage and reinforce each other. For example, paying top of market lets you fill your company with responsible, awesome people, which lets you grow without adding tons of rules and policies. The real question is how much of what Netflix has done is unique to their position in their industry, and how much is generalizable to other companies? I’m looking forward to seeing other companies adopt this culture and seeing how it turns out for them. If your company is doing so, let me know!

Netflix teaches everyone how to host a tech meetup

Netflix has once again set the bar. Not with their technology this time – but with their organizing. I just got back from the first meetup of the NetflixOSS group – and it was spectacular. Let me walk you through it.

The intro was given by Ruslan Meshenberg, and contained a wonderful story about the email exchange that started the ball rolling on Netflix’s open-source efforts: one of their developers had something he wanted to open-source, so he asked about the policy for doing so. He was told “Our policy is we have no policies. Go for it!” To many of the developers in the audience this was a religious experience, because they had experienced very different responses at their companies to the same question. (When Ruslan earlier asked the audience how long it typically took to open-source something at other companies, the loudest response from the audience was “FOREVER!”) At this point the audience was warmed up and already on Netflix’s side, because they’d shown a) that they’re supporting open-source b) that they don’t put bullshit in their developers’ way and c) that they have smart, cool, funny people working for them. Win x3.

Next up was Cloud Architect Extraordinaire Adrian Cockcroft, who gave an overview of all the (really cool) platform pieces that Netflix has open-sourced, along with an explanation of why they are doing so in the first place. Adrian’s presentations at conferences are always packed and for good reason – he’s entertaining, charming, and (like Ruslan) knows how to hit developers in their hearts. His talk had enough anecdotes about quirky naming methods, reinforcement of the “stay out of developers’ way” culture at Netflix, and enough hard-won insights from experiences in production (and amusing digs at other platform solutions) to keep the audience’s interest piqued, and at the end everyone was hungry for more details about the components he had introduced. Adrian’s reinforcement of the “the whole is greater than the sum of the parts” message also reminded everyone that the software they’re talking about all works together to accomplish far more than any individual component could ever hope to – which presented a fantastic analogy for the audience to infer about the team and process at Netflix as well.

Then came the brilliant part. There was a series of several lightning talks where one of the developers of each of those platform services talked briefly about what it did and why they needed it. This really stoked the audience’s interest and I know everyone (like I did) had one or two things they heard about that they were super eager to get more information on. This also made even more evident that there are a lot of smart, experienced, and accomplished people at Netflix who are working on interesting problems and who have been given the freedom (and responsibility) to come up with the right solutions. Again, a tempting picture to paint for the developers in the audience.

The final stage was the demo room, where food and booze were provided and then those same engineers who had given the lightning talks were standing in front of stations where they would answer your questions and demo the software they had described earlier. So you could find the engineer who talked about what you had found so interesting during the lightning talks, and you could drill in with them. I spent a few minutes talking with Ben Christensen about Hystrix, a library that provides mechanisms for isolating failures in distributed systems so problems in one area don’t cascade throughout the system, as well as providing a dashboard view into the metrics generated by said library. This was fascinating to me for many reasons as I’ve long been interested in how we monitor and manage the connections between systems (sooooo much harder than dealing with the systems themselves) and I thought Hystrix did a very good job of presenting the information in a dashboard that was information-dense and yet at the same time very helpful at highlighting the things that were important. The point being that I found someone I could have a great conversation with about something I found very interesting – and I’m willing to bet that so did the majority of other attendees.
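To make the "isolate failures so they don't cascade" idea a little more concrete, here's a minimal circuit-breaker sketch in Python. To be clear, this is not Hystrix's API or implementation (Hystrix is a Java library); the class name, the thresholds, and the fallback argument are all things I made up for illustration.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: stop calling a failing dependency for a while.

    Hypothetical illustration only; this is not how Hystrix is implemented.
    """

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so the sketch is testable
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        # While the circuit is open, skip the dependency and use the fallback.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cool-down elapsed: allow one trial call. The failure count is
            # left as-is, so a single failure re-opens the circuit.
            self.opened_at = None
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0                 # a success resets the failure count
        return result
```

The point of the pattern is that once a dependency is clearly unhealthy, callers stop piling requests onto it and get a fast fallback instead, which is what keeps a problem in one area from spreading.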

Overall, I had a great time and think it was a fantastic experience. It’s definitely one of the best tech meetups I’ve ever been to (and I’ve been to a lot). Kudos to the Netflix team – I think everyone had a good time, and I suspect a majority of the audience went home tonight thinking to themselves “Wow that’d be a cool place to work!” 

Which, I suspect, was the goal all along…..

What Public Cloud SLAs ARE good for

When HP announced their cloud was going GA, lots of the coverage said that the cloud came with a 99.95% SLA. While that is technically accurate, it is also very misleading. People may assume (and even some coverage indicates this) that this means if you launch a virtual server inside HP’s cloud, they are guaranteeing that instance will have 99.95% uptime. This is emphatically not true. What they are really guaranteeing is that that instance will be up or you will be able to launch a replacement instance. That’s a really important “or.” Lydia Leong over at Gartner has a much more detailed analysis of this that you probably want to read. For my part, I’d like to illustrate some theoretical situations that would not be considered violations of your SLA to try to make this a little more real and concrete for those of you with infrastructure running in the public cloud. (AWS’ SLA is similar to HP’s for all these cases.)
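For a sense of scale, here's a quick sketch of what a 99.95% figure would mean if it really were a per-instance uptime guarantee (which, again, it is not). The 30-day period is my assumption for the arithmetic; I'm not claiming that's the measurement window either provider uses.

```python
def downtime_budget_minutes(availability_pct, period_days=30):
    """Minutes of downtime allowed per period at a given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


# Compare a few availability levels over an assumed 30-day period.
for pct in (99.0, 99.9, 99.95):
    minutes = downtime_budget_minutes(pct)
    print(f"{pct}% over 30 days allows {minutes:.1f} minutes of downtime")
```

At 99.95% that works out to roughly 21.6 minutes in a 30-day month, which is a lot tighter than what you should expect from any single ephemeral instance.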

The Examples

Here we go. Each of these situations is not considered to be a breach of HP’s SLA:

  • The only VM you are running in the cloud blows up and is unrecoverable. This does not count against the SLA.
  • The only VM you are running in the cloud loses all network connectivity. This does not count against the SLA.
  • Every VM you have in every availability zone in the region blows up at exactly the same time. They are all unrecoverable. This does not count against the SLA (because you can launch replacement VMs for all of them).

Amazon is no better. The examples above all would be the same for AWS – not a breach of their SLA. Here’s one that would be covered by HP’s SLA, but not AWS’:

  • Let’s say you have 200 VMs in each of 3 different AZes in a region. In each of those AZes, 199 of those instances die and you cannot launch any replacement instances. So you are left with 3 of your 600 instances running (99.5% of your capacity is lost). This does not count against AWS’ SLA.
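If you want to check the arithmetic in that last scenario, or plug in your own fleet sizes, here's a tiny sketch. The function and variable names are just mine for illustration.

```python
def capacity_lost_pct(total_instances, surviving_instances):
    """Percentage of capacity lost, given how many instances survived."""
    return 100 * (total_instances - surviving_instances) / total_instances


zones = 3
instances_per_zone = 200
survivors_per_zone = 1  # 199 of 200 died in each zone

total = zones * instances_per_zone
surviving = zones * survivors_per_zone
print(f"{surviving} of {total} running, "
      f"{capacity_lost_pct(total, surviving)}% of capacity lost")
```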

What to do?

I’m really not ragging on public cloud here. I don’t think SLAs have much value as contracts. What they are good for is helping you do risk management. A provider is going to sum up lots of information about how they’ve designed and architected their services in their SLA. Think about new cars: if one comes with a 12 month limited warranty and another comes with a 10 year bumper to bumper warranty, doesn’t that make you think they are designed and built differently? So both HP’s and AWS’ SLAs tell you a lot about how their systems are architected. Mostly they tell you that individual instances are ephemeral and you better account for that when running your applications on top of them.

This is also why any SLA that says “100% uptime” is completely useless, because that number is obviously divorced from any match to the reality of the design and architecture of the system underneath the SLA. No system can achieve 100% uptime, and by pretending theirs does, service providers actually make it more difficult for their customers to do intelligent risk management.

My advice: make sure you understand your providers’ SLAs. Not because you’re going to use those SLAs to wrangle compensation out of them for outages (maybe I’ll post later about how such compensation is also a joke). You want to understand your provider’s SLAs so that you can do intelligent risk management for yourself. Knowledge is power. Get it and use it!

Incident Management and Problem Management – Web Ops style

I’ve noticed that a lot of people find this blog by searching for “incident management versus problem management.” This puzzled me, because I figured there must be way better places on the internets to find that answer. So I looked at the google results for that search and read the top five or so. And now I get it. Because those all suck. They have too much ITIL. And they’re too focused on service desk setups versus web ops setups. So let me try to put this in the simplest terms I can:

  • Incident Management is about fixing something that’s broken
  • Problem Management is about fixing something that’s broken

Confusing, right? Well, the two really are different. Let’s start with some analogies.

The Analogies

Let’s say your house is on fire. “Incident Management” is the firefighters putting out the fire. “Problem Management” is the investigation that happens afterwards to figure out why the fire started and what to do about it.

Let’s say you’re on a plane, and all the engines fail. “Incident Management” is the flight crew gliding the plane to an emergency landing. “Problem Management” is the investigation that happens afterwards to determine why those engines all failed and what should be changed in the future so that doesn’t happen again.

In a Web Ops world, “Incident Management” is what you do when all the servers in one of your datacenters have gone tango uniform – perhaps you take all the incoming traffic and redirect it over to a different datacenter. “Problem Management” would be figuring out what happened to all those servers – why did they fail together and what do you have to do to make sure that doesn’t happen again.

The Definitions

“Incident Management,” then, is when you are immediately addressing an urgent situation. Timeframes are generally short and pressure is high, and you’ll do whatever it takes to make things better – even things that in normal situations would be too costly, too risky, or against policy. For example, if your engines are out you might land your plane on a freeway – which is something you’d never do under normal circumstances. And importantly, you don’t do investigations into what caused things while you’re doing incident management. You really don’t care at that point why all your engines failed – you just want to get the plane down safely.

“Problem Management,” in contrast, is a more measured activity where you figure out what was really behind the incident and then what to do about it. The situation isn’t as urgent and the techniques are different. Root cause analysis is a key tool for Problem Management. And I don’t mean the mythical quest for a single root cause, I mean real root cause analysis. Which is a topic for another post.

The two are connected, of course. The inputs to Problem Management are The Facts: what happened when, who did what, what results did those actions have, etc. Those Facts came out of the Incident Management process, and ideally are recorded automatically without the people doing Incident Management having to do anything special – think of the “black box” that commercial aircraft have. In a Web Ops world, automatically recording crisis phone bridges and chatrooms can serve a similar purpose.
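Here's a rough sketch of what capturing The Facts might look like in code: an append-only timeline of who did what, and when. The class and field names are hypothetical, and in real life you'd feed this from your chat or bridge recordings rather than asking people to call it by hand.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentLog:
    """Append-only timeline of who did what, and when, during an incident."""

    entries: list = field(default_factory=list)

    def record(self, who, what, when=None):
        # Timestamp in UTC at the moment of recording unless one is supplied.
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, who, what))

    def timeline(self):
        # Return entries in chronological order for the post-mortem.
        return sorted(self.entries, key=lambda entry: entry[0])
```

The important design choice is that recording is cheap and never blocks the people doing Incident Management; the sorting and analysis happen later, during Problem Management.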

Who Cares?

It’s important to recognize that these are two different activities, and treat them as such. Most everyone does Incident Management, out of necessity (I mean, it’s not like you’re going to decide NOT to emergency land your plane when all your engines fail). What’s sometimes lacking, however, is the followup through Problem Management. It’s tempting after an incident to say “We know what happened, and what to do about it.” It’s tempting because it sounds right. And because it’s a lot easier than doing real root cause analysis, and a helluva lot easier than actually prioritizing and executing all the actions that are really required to address the underlying problems that led to the incident. But in the long run you’re way better off investing that time and effort.

Here’s how to tell you need to step up your Problem Management efforts. Do you ever say or hear things like this?

  • “Oh yeah, I remember when this happened last time.”
  • “Call Bobby Ray, he fixed this before.”
  • “Why does this keep happening?”
  • “We’ve gotten really good at dealing with this problem.”

If so – try some Problem Management.

What to do?

  • Make sure you’re recording useful information from your incidents – get The Facts
  • Have an official post-mortem investigation process for your incidents (and use it).
  • Periodically review the actions you’ve identified to fix things and see how you’re doing at getting them accomplished.

You don’t need a separate Problem Management team. (In fact, I’d recommend against it). You don’t need some expensive Problem Management software. All you need is an incident, some data, and a desire to make things better. I know you’ve got that!

Is Ubuntu a good server OS? – firewall edition

Earlier, I posted about upstart and how it illustrated how Ubuntu is not a good operating system for running servers. Today I wanted to provide another example: UFW, the “Uncomplicated FireWall” that is installed by default on Ubuntu servers.

Linux firewalling and UFW

Firewalls in Linux are implemented with iptables. In a nutshell, iptables evaluates packets against lists of rules to determine whether they should be accepted or rejected. iptables is typically used on servers through scripts that set up all these rules the way you want them – these days this process is generally managed through whatever configuration management system you’re using (e.g. chef or puppet). UFW is a tool distributed with Ubuntu that provides a different way of setting up these iptables rules, using command line tools.

How UFW works

UFW keeps its own set of iptables rules in /etc/ufw. When ufw is turned on, it flushes out all the existing iptables rules and replaces them with the rules from /etc/ufw. You can then use the command line tools to add ports to allow or deny, which will update both the running iptables rules as well as the copies stored in /etc/ufw.

Note that all UFW works with is what’s in /etc/ufw – it doesn’t know or care what the running iptables rules are.

A Totally Theoretical Example

Let’s pretend you have a big distributed system where the nodes are all running Ubuntu. It’s a big system, and it’s part of an even larger IT environment, so there’s a large IT support organization that’s staffed with people who were hired because they have a fair bit of experience with Linux systems – but some of those folks have more at-home, desktop Linux experience vs datacenter, server Linux experience. Also these folks don’t know the ins and outs of the design of this particular distributed system because they are responsible for all of the varied IT environments at this large organization. The hero in our story is one of these selfless and courageous sysadmins. A report comes in from the users of this distributed system that they’re having trouble reaching some of the resources it provides. Initial troubleshooting leads our hero to posit that the problem is a misconfigured firewall. His experience is with Ubuntu and he’s always used ufw, so his first step is to disable the firewall and see if that helps:

ufw disable

There’s no change, so he figures that this is unrelated, so he turns the firewall back on:

ufw enable

Now he moves on to further troubleshooting.

Here’s the problem: The ‘ufw disable’ command above didn’t actually do anything, as ufw wasn’t enabled to begin with. No harm, no foul. However, running ‘ufw enable’ turns on ufw, and configures it in default mode – which denies all incoming connections. As the server in question provides many key network services for this distributed system, suddenly all of the other nodes in this system cannot reach these central services, and the entire distributed system starts to fall over!

So what went wrong here?

UFW is not a front-end for iptables

It may seem like UFW is a front-end for iptables – because you run ufw commands and it makes changes to iptables rules. But what it’s really doing is throwing away whatever is currently in iptables, and replacing it with what’s been configured in ufw. A real front-end would allow you to inspect and modify iptables stuff that already exists.

How this plays into this situation is that lots of carefully-crafted iptables rules for access, masquerading, etc got silently nuked when the ‘ufw enable’ command was run. It does not take into account the starting state of the system before it does stuff. That’s bad on a server.
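One way our hero could have protected himself is to check the starting state before toggling anything. Here's a small sketch of that check. It assumes that `ufw status` prints a line beginning with `Status:` followed by `active` or `inactive`; the function name is mine, and you'd capture the command's output yourself and pass it in.

```python
def ufw_is_active(status_output):
    """Parse the text printed by `ufw status` and report whether ufw is on.

    Assumes the output contains a line such as "Status: inactive".
    """
    for line in status_output.splitlines():
        if line.strip().lower().startswith("status:"):
            return line.split(":", 1)[1].strip().lower() == "active"
    raise ValueError("no 'Status:' line found in ufw output")


# If ufw was never on, "disable then re-enable" is not a no-op: the
# re-enable step replaces whatever rules were running before.
if not ufw_is_active("Status: inactive"):
    print("ufw is not managing this firewall; do not run 'ufw enable'")
```

It's also cheap insurance to dump the running rules with `iptables-save` into a file before experimenting, so there's something to restore from if they do get replaced.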

UFW’s defaults are appropriate for desktops, not servers

I love ‘deny all’ as a default for firewalls. I could even concede that ‘deny all’ is the best default for servers, although I think that could be debated. However, let’s look at some of the specifics that ufw’s defaults do allow:

  • ICMP (this is OK)
  • multicast DNS (hmmm….)
  • UPnP (!!)

OK, this should go without saying, but does it seem to you like your server operating system should enable UPnP discovery by default? Makes sense for a desktop (maybe) – but not for a server. No way, no how.

UFW is an Ubuntu-specific thing (and makes iptables complicated)

If you have a ‘firewall administrator’ on staff, and you ask them to look at a server, how likely is it that they’re going to know anything about UFW? IMO, very unlikely. They’re probably familiar with iptables, and the first thing they’ll do is look at iptables directly, and they’re going to see dozens of rules that point at other rules and they’re going to be very confused. At best, they’ll refuse to touch it and give you a big lecture about how if packets get to your server’s TCP/IP stack before hitting your firewall, you’re doing it wrong. At worst, they’ll start making changes which due to the complexity of these rules will likely result in more problems, not less.

Make your Ubuntu servers better

If you are going to run servers on Ubuntu, here’s one simple suggestion for how to make them better:

apt-get remove ufw

After this, at least your beleaguered sysadmins will have one less possible way to screw up your systems while doing their jobs.

Is Ubuntu a good server OS?

My “Openstack IRL” presentation informs the audience that we at Cloudscaling use Ubuntu for the systems we deploy. When I present this live and we get to that slide, I usually say something like this:

We use Ubuntu for our systems. This is somewhat ironic because at least once a month in our office we have a big discussion about how terrible Ubuntu is as a server operating system…

Funny. But is it true? Is Ubuntu a terrible operating system for servers? Let’s look at one data point: Upstart.

Upstart’s raisons d’etre

My distillation of the reasons upstart was created to replace the traditional rcX.d init structure from SysV and BSD is:

  1. the traditional system is serial rather than parallel, meaning reboots take longer than they have to – and people reboot their systems a lot more these days
  2. the traditional system doesn’t deal well with new hardware being plugged in and out on a regular basis – and people are constantly plugging stuff in and out of their systems these days

Do those sound like conditions that affect your servers? Me neither. They are desktop-centric concerns. And there’s nothing wrong with that – unless you’re trying to run a server.

Why does it matter?

From the perspective of a crotchety old-time unix sysadmin (a hypothetical one of course!), upstart is a PITA. Let me try to illustrate why:

Checking what order stuff starts in

In the traditional world, here’s what you have to do to find out what order things start in:

ls /etc/rc2.d

That’s it. The output of that command provides you with everything you need to see at a glance that (for instance) apache is going to start after syslog.

Here’s how you do it in the upstart world:

Well, I wish I could give you a simple way to do that, but I can’t. You have to open up the conf file for the service you’re interested in in /etc/init and look at what events it depends on for starting. If one of those events is the startup of another service, then you know your service will start after it. However, if there is no dependency listed on another service, then you don’t know what order they will start up in. Yours might start before the other one, or after it, or they may both be starting up at the same time. You don’t know, and it isn’t guaranteed that they will start in the same order every time the system boots. This makes crotchety old unix sysadmins nervous, and leads to the second point….

Defining what order stuff starts in

In the traditional world, this is done with 2 digit numbers. You have a 2 digit number (part of the name of the file in /etc/rcX.d) and the scripts are run in the order of the numbers in their filename. So if you want one script to start later than another, just change its number to be larger than that other one. Easy to understand, and all you have to know to do it is how to use mv. And there are no hard dependencies here – if you build one server that doesn’t contain a particular service, that init file won’t be installed, and none of the other init files will be affected and startup will go as you expect.
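Here's a little sketch of why the numbering scheme is so easy to reason about: the start order is just a lexical sort of the filenames. The script names below are made up for the example, and I'm assuming the usual convention that start scripts begin with "S" and kill scripts with "K".

```python
def start_order(rc_dir_listing):
    """Return start scripts in the order a SysV-style init would run them."""
    # Start scripts begin with "S" followed by a two-digit sequence number,
    # so plain lexical sorting of the filenames yields the run order.
    return sorted(name for name in rc_dir_listing if name.startswith("S"))


# Hypothetical contents of an /etc/rc2.d directory.
listing = ["S91apache2", "S10rsyslog", "K20foo", "S20ssh"]
print(start_order(listing))
```

Remove any one of those files and the rest still sort and run exactly as before, which is the "no hard dependencies" property.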

In the upstart world, you do this by specifying startup dependencies between the jobs that start services. Each job emits an event when it completes that you can use in the conf files for other services. So say you have two services, s1 and s2, and you want to be sure s2 starts after s1. You do this by putting a line like this into /etc/init/s2.conf:

start on started s1

So aside from the crotchety old sysadmin spending 45 minutes perusing ‘man upstart’ to figure this out in the first place, the problem you run into here is with distributed systems that can be deployed in varied configurations. For example, sometimes s1 and s2 are on the same node, and sometimes they are on different nodes. If you put the above line into /etc/init/s2.conf by default, guess what happens if you deploy in a config where s1 isn’t on the same node? s2 will never ever ever start.
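To see the failure mode without rebooting anything, here's a toy simulation of that dependency model. It is a deliberately simplified stand-in for upstart's event engine, not a description of how upstart actually works internally: each installed job either starts at boot or waits for one other job to have started.

```python
def jobs_that_start(jobs):
    """Simulate a simplified 'start on started X' dependency model.

    `jobs` maps each installed job name to the job it waits for, or to
    None if it starts at boot. Returns the set of jobs that ever start.
    """
    started = set()
    progressed = True
    while progressed:
        progressed = False
        for name, waits_for in jobs.items():
            if name in started:
                continue
            # A job starts if it has no dependency or its dependency is up.
            if waits_for is None or waits_for in started:
                started.add(name)
                progressed = True
    return started


# Both services on one node: s2 starts once s1 has started.
print(jobs_that_start({"s1": None, "s2": "s1"}))
# s1 lives on a different node, so s2 waits for an event that never fires.
print(jobs_that_start({"s2": "s1"}))
```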


My take on this is that upstart is a great thing for desktop systems. For server systems, it’s adding a bunch of complexity and brittleness without providing any actual benefits. And it’s one check mark in the “ubuntu isn’t a good OS for servers” category.

Openstack LA meetup

Last week I took a little trip down to LA to attend the #OSLAX meetup. I’m a little ashamed to admit that I was pleasantly surprised to find a big group of knowledgeable, friendly, and fun Openstack enthusiasts. (Us NorCal types are kinda snobby that way). I had the opportunity to present my talk on “Openstack IRL,” which seemed to be very well received. The video of my talk is up on vimeo if you’re interested in hearing the actual presentation, or you can see the slides on Slideshare. The message was the same as when I presented to #lspe earlier this year – you can build a production cloud using Openstack (we’ve done it!). This time around I was able to provide a little more detail about some of the specifics of what Cloudscaling is doing in our Open Cloud System, especially around resiliency for internal Openstack components, networking, and AWS compatibility.

At about 9:44 in the presentation, I’m talking about how excited network engineers get when they can build a CLOS network that provides loads of bisectional bandwidth. I have been assured multiple times that what everyone is actually hearing is “bisexual bandwidth” – I’m not sure exactly what that is but it’s probably more interesting than a network diagram. Check it out for yourself and see what you think I’m saying….

OpenStack LA Meetup Jul 26 – Paul Guth from DreamHost on Vimeo.