My Ideal Engineering Organization
Disclaimer
This describes how I, Yi Zhang, prefer an engineering organization to be structured. There are certainly trade-offs, but I believe this strikes a good balance: it optimizes for the areas I care most about without overly sacrificing others.
Of course, these opinions and preferences are heavily biased by my personal experience, education, reading, and the mentorship I have received over the years.
High Level Goals
The end results that we’re trying to achieve with an organizational structure are:
Alignment: Make it easy to communicate what Success means for everyone
Speed: Help everyone to move towards Success as fast as possible
Risk: Minimize the risk of failure to achieve Success
Principles
These are general principles that should help guide how organizations are structured:
Autonomy: Encourage people closest to the work to make as many decisions as possible. They are most familiar with what’s involved. They have the most local context and information. Let them be creative and find the best ways to do their jobs.
Self-Sufficiency: On the other hand, make sure people doing the work have easy access to everything they need. Notably, domain/technical expertise, stakeholder, tooling, higher level decision making rationales, etc., should all be easily available.
Organizational Structure
With these goals and principles in mind, here are the building blocks of the organizational structure I prefer.
Team
A Team is a permanent, cross-functional group of people who share a common customer-impacting goal. The goal could take the form of a Mission Statement for the long term and OKRs for the short term.
The mandate of a Team is to do everything it can to accomplish the goal.
Teams are owners of resources (e.g. code, engineers, cloud infra), and are the context within which all work is prioritized and carried out.
Every engineer has to belong to a single Team, and is primarily dedicated to achieving the Team’s goals. Therefore everything they do should be reasonably justifiable in terms of the Team’s goal. This helps Alignment.
Each Team sets its own goals and priorities, with discussion open to, and feedback welcome from, anyone interested in contributing, especially its stakeholders. This promotes Autonomy.
Teams can be cross-functional, including members from other functional groups (e.g. product managers, customer success, etc.), but each Team has to include a lead, who is ultimately responsible for the Success of the Team. Typically, this would be an “Engineering Manager”.
Guild
A Guild is a semi-permanent group of people (typically only engineers) who share a common interest across multiple Teams.
The primary function of a Guild is to share knowledge across Teams. Typically, engineers who work on similar technologies can form a Guild (e.g. mobile engineers, DBAs, SREs, or Testing). Guild membership is voluntary but generally beneficial for relevant engineers. This systematic spread of knowledge helps manage Risk.
Guilds should hold regular meetings to gather, synthesize, and disseminate Guild-specific knowledge. Guild meetings should also be the arena for organizing cross-team initiatives, including goal-setting, decision-making, progress reporting, coordination, and so on.
However, individual Guild members should advocate for, prioritize, and carry out this work in the context of their own Teams. This preserves the Teams’ Autonomy while promoting Self-Sufficiency by connecting some member(s) of each Team to their peers.
Each Guild should have a lead, who serves as the coordinator of this knowledge sharing and, when necessary, the driver of these cross-team initiatives. This is generally a safe but challenging ground for testing the leadership qualities of Senior Engineers looking to move into a Staff Engineer or Engineering Manager role.
Guilds are not advised to own long-lived resources (such as code). The primary beneficiary Team of such resources should typically be the maintainer. Occasional exceptions can be made, though, given some combination of a permanent Guild, very light-weight code, and/or strong Guild leadership.
Task Force
A Task Force is a temporary, cross-functional group of people who set out to accomplish a very specific task in a pre-defined time frame.
The need for a Task Force arises from non-trivial, cross-team pieces of work that cannot be accomplished within the scope of a single Team, but demand more dedicated effort than is feasible for a Guild. Typical examples include significant testing infra in a shared code base, a light-weight internal framework, or a data pipeline prototype.
Task Forces should be formed thoughtfully, with the timeline and members being very good matches for the specific tasks. Ideally, all members work exclusively for the Task Force on a full-time basis for the duration of their membership. This facilitates the Speed of execution of these tasks and supports the Task Force’s Self-Sufficiency.
A Task Force should be converted to a Team if the scope of work proves to be long-lasting and customer-impacting. Otherwise it should disband upon the completion of the tasks.
Members would typically resume their prior positions in their respective Teams after the Task Force’s dissolution. Long-lived resources (such as code) can be distributed across Teams or passed to the primary beneficiary Team.
Tribes (Optional)
A Tribe is simply a collection of Teams grouped by the proximity of their ownership area.
As organizations get large, it becomes difficult to keep track of all the Teams on an individual basis. Furthermore, Teams may naturally form cliques in terms of the frequency and bandwidth of their communications.
It might be worthwhile to formally recognize the proximity of certain Teams compared to others to help further with Alignment on their shared goals. This also promotes sound Software Architecture per Conway’s Law (see below).
Tribes may have leads who facilitate communication and goal alignment, and possibly drive engineering initiatives at the Tribe level. They are commonly given the title “Director of Engineering”.
Conclusion
Always remember Conway’s Law:
Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.
As soon as people are in different groups, their level of communication will be reduced, and the systems they work on will diverge.
Always have everyone building the same (sub)system stay in the same room.
FAQ
Q: What’s the whole point of this?
A: I really, really hate blocking other teams and getting blocked by other teams. The number 1 thing this structure optimizes for is allowing Teams to be as independent as absolutely possible, so they can move as fast as possible and make the best decisions possible.
Q: What’s the expected size of a Team?
A: Conventional wisdom dictates 5-10 people. The general principle is that a Team should split as soon as it has too many people to share goals that everyone on the Team can truly, meaningfully contribute to.
Q: What happens when a team is formed without a sufficient mix of skills?
A: What else can we do other than hire people with the right skills onto the team? In the short term, perhaps you can borrow these people from, or share them with, other teams, but I can’t imagine that being a pleasant solution for anyone involved.
Q: What’s wrong with a centralized SRE team that behaves as internal consultancy to help product teams?
A: Two things. One, the central SRE team is further removed from the customers. If there’s no customer representative on that team, they’re less likely to empathize with customers and prioritize the work that has the most customer impact, and I really, really believe we should put customers as a high priority. Two, internal consultancy has some of the same drawbacks as external consultancy. From the product team’s point of view, an internal consultant SRE who is not as intimately tied to their goal, and who lacks the same level of context, can’t do as good a job.
Q: Who would build the deployment tooling, log aggregation, monitoring, and all the shared utilities that everyone will need?
A: Great question. First of all, I challenge the assumption that everyone will want/need the same thing, and I challenge the assumption that a central SRE team would build these better than people within the Teams can. Practically, these can be developed by a Task Force and maintained by the SRE Guild. If a Team has capacity, they would be welcome to build and maintain them too.
Q: So EVERY SINGLE team has to be customer facing?
A: That’s the idea, and I’d like to take it as far as conceivably possible. I concede that at some point, when the company is big enough, we might want/need an internal tooling Team. And I want that to happen because some people have voluntarily built some great tooling that we’d like them to maintain, rather than for any other reason.
Q: What if none of the Guild members can convince their Team members to prioritize the initiative of their Guild?
A: The idea is that if the initiative is important enough and will benefit the Team, the Team should recognize that and be willing to spend time on it, but only as much as its own context allows.
Q: What if there’s something really important that all teams really need to do, like for Compliance and InfoSec?
A: Establish a Guild that drives it, and let the Teams know that this is important. If the work needs to be mandated to the top of the backlog, so be it.
Q: What if a Team needs some SRE help, but not full time? Should I hire an SRE for the Team?
A: If you can make do without one and accomplish your goals, then great. There’s also nothing stopping you from consulting SREs on other Teams; just know you’re not their top priority. And honestly, if you hire a great person, they’ll find enough work for themselves.
Q: What if the only SRE on my team goes on vacation?
A: I think it’s pretty reasonable to ask someone else on the Team to learn the basics to cover for them on a temporary basis.
Q: This structure seems too extreme, too rigid. Are there exceptions?
A: Yes there are. There are certain functions that seem to make more sense to be centralized. Security might be one, because it has to be more standardized across the board, and there’s definitely not enough work for every Team to have a security person. On top of that, it’s a domain that’s more specialized and harder to train a generalist engineer to do. But I struggle to think of any other ones.