Technical incompetence at Wayfair -- is there anything worse??
Is there anything worse than Wayfair at this scale in terms of sheer technical incompetence and poor engineering practices?
Wayfair's codebase and infrastructure are stuck in a pathetic, sorry state. The PHP repo, the databases (which are all mirrors in some way), the vertical scaling -- it needs to be blasted into orbit if the company wants to save itself.
- There are directors and managers who have only ever worked at Wayfair, believe they are God, believe they hire top talent, and believe Wayfair has a top engineering culture. But they refuse to give teams access to the cloud to spin up new infrastructure.
- There is so much red tape just to use a new framework, such as an ORM that's been around for 20 years.
- The main DB has 3 TB of RAM and *still* has load trouble. Nobody on the cloud offers a database with 3TB of RAM because they assume companies will just scale properly and horizontally.
- The DBAs advise all developers to read from the database WITH (NOLOCK), which is essentially the READ UNCOMMITTED transaction isolation level, meaning you are reading UNCOMMITTED DATA FROM THE DATABASE!!!!!!!!!!!!! They say it's for "performance reasons", and, sadly, it's the only way Wayfair could "scale" the database so it could handle the reads and writes. Hmm, I wonder why some customers end up with duplicate couches at their doors or why their order never shows up at all.
- SQL tables are being used as queues!!!!!!
- Static methods all over the place that open and close database connections and sessions.
- SQL queries that span hundreds of lines because, according to a senior engineer, it's better to shove everything into one SQL query instead of making a few separate calls to the database.
- Most of the PHP teams have never heard of a simple model with setters and getters. If you add one to the code, they will freak out, ask what it is, and tell you to consult a working group about it. Forget using the builder pattern, too. They will tell you to get rid of it and to instead shove everything into the model. There are models that span 12,000 lines of code.
- Has anybody heard of pagination ever???????? A manager I talked to had never heard of the offset keyword for a database query and didn't know what index-based pagination was. There are queries that don't do any sort of pagination and just grab all items in one go.
- Can't use SQL Server's Change Data Capture (CDC) for event-based architecture due to "performance reasons."
- Speaking of Change Data Capture, Wayfair uses triggers for everything! They use them for audit history and business logic. If the DBAs care so much about performance, why are they letting people use triggers for everything while the competent developers who know what they're doing can't use CDC on their database?
- No foreign keys because "performance reasons"!
- Kafka has been destroyed. I think all of Wayfair is using the same damn Kafka instance and teams just split it up by topics. Also, messages can only live in the queue for a few days due to "performance reasons."
- I read somewhere that Wayfair implemented cookie-based database sharding for customers and ran into problems and had to change it. Whose idea was this, and did anyone ever Google anything on how to shard?????
- Their Kubernetes strategy is completely stupid. You need to get through 3 different internal tools to get a Kubernetes application up. You can't just use the damn cloud.
- Their Kubernetes architecture is on-premise and bastardized. There was one team who accidentally took up the resources of another team's app and brought down that entire app. I can think of one way that could happen.
- Jenkins. They use Jenkins jobs for everything!!!!! For one, PHP doesn't do multithreading, so they start multiple Jenkins nodes to do concurrent work. I heard even the non-PHP teams are using Jenkins for concurrent work.
- Jenkins Jenkins Jenkins!!!!!!!!!!! A great system design question for a Wayfair engineer: "how would you handle sending billions of emails to customers in a concurrent fashion?" I'll bet the engineers, including the directors and tech leads who haven't researched anything, would say "spin up thousands of Jenkins jobs and read from a SQL database!"
- The executives and directors here don't believe in microservices or having separate repositories and think that because Google and Facebook can scale a monorepo, Wayfair can too.
- Testing. Most teams don't test their code and can't tell the difference between an integration test and a unit test!!!!! There was one team that broke a runner because their unit tests were creating real files. They had to hire an entire team to hose down bad tests so code can pass the pipelines and get into production. There is less than 1% code coverage. For the teams that do test, the tests are completely useless and break all the time because they are just testing spaghetti code. There is no point in testing spaghetti code.
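To be concrete about the pagination point in the list above, here is a minimal sketch of both styles in Python. It uses an in-memory SQLite database purely so it runs anywhere. The `orders` table, its columns, and the function names are all made up for illustration and have nothing to do with Wayfair's actual schema. SQL Server spells offset pagination as ORDER BY ... OFFSET ... ROWS FETCH NEXT ... ROWS ONLY rather than LIMIT/OFFSET, but the idea is the same.

```python
import sqlite3


def make_demo_db(n_rows=10):
    """Build an in-memory table named `orders` (a made-up name for this sketch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    conn.executemany(
        "INSERT INTO orders (id, item) VALUES (?, ?)",
        [(i, f"couch-{i}") for i in range(1, n_rows + 1)],
    )
    conn.commit()
    return conn


def page_by_offset(conn, page, page_size):
    """Offset pagination: simple, but the database still walks past the skipped rows."""
    return conn.execute(
        "SELECT id, item FROM orders ORDER BY id LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    ).fetchall()


def page_by_key(conn, last_seen_id, page_size):
    """Index-based (keyset) pagination: seek past the last id the caller saw."""
    return conn.execute(
        "SELECT id, item FROM orders WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, page_size),
    ).fetchall()


def iterate_all(conn, page_size):
    """Walk the whole table one page at a time instead of grabbing every row in one go."""
    last_seen_id = 0  # assumes ids start at 1, as they do in the demo table
    while True:
        rows = page_by_key(conn, last_seen_id, page_size)
        if not rows:
            return
        yield rows
        last_seen_id = rows[-1][0]
```

Offset paging tends to get slower the deeper you go, because the skipped rows still have to be scanned. The keyset version seeks straight to the next id through the primary key index. Either one beats pulling the whole table into memory.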
If you want to work here, then do it. Maybe you have a family to support and care about other things in life, but a lot of the engineering decisions could affect you!!!!
Do you really want to wake up at 4 in the morning for a critical because some incompetent engineer forgot to screw in the hard drive when they upgraded the database?
Do you want to be told by management that you can't use setters and getters? Forget about using the builder pattern, too. Most teams don't know what it is and will freak out and tell you to consult a useless working group about it.
Do you want to see a serious scaling problem that could easily be solved by being on the cloud with managed services (BigQuery, Spanner, Bigtable) and be told by incompetent directors you have to use the monolith, the 3TB RAM databases, Jenkins, and SQL Server tables as queues????
There are honestly happy engineers here who like it, are completely oblivious to the problems, are OK with adding band-aids to gaping wounds, and think the weekly criticals and post-mortems are a normal part of being a software engineer. Surely there are some teams doing better than others, but they can't unlock their true potential with upper management micromanaging them and telling them they can't use the cloud or can't download a library package.
Junior engineers... none of this is their fault. They are well-intentioned and want to learn and have potential, but they have incompetent directors and managers teaching them bad practices. They will need to do research outside of work or reach out to knowledgeable people within Wayfair (can be hard to find) if they want to learn what competent companies do. Some are so far gone, though, and have drunk so much of the purple Kool-Aid that they won't listen even if it's plain common sense.
The only hope the company has is to listen to and empower the competent engineering teams (there are a few, and a lot of the members end up leaving in frustration), but a lot of those employees happen to be in the Berlin office. Unfortunately, the US office isn't listening to their advice and blocks it with red tape and working groups that end up bastardizing a process. There are competent engineers scattered in the US teams, but they seem to have no power to make changes, haven't been at Wayfair for very long, or get blocked by upper management and leave in frustration.
I heard managers went to the Berlin office and told all the competent engineers there that they can't use the Cloud and have to wait until management is done evaluating their strategy! They were pissed!
The strategy is to have a hybrid cloud with on-premise servers as the source of truth and to mirror the data to GCP. This means you can't use the Cloud APIs to spin up a Kubernetes cluster if you want. You can't use App Engine or any other serverless architecture. Everything still has to go through Wayfair's internal tools!! Some things are on the cloud, but they are mirrors or clusters being hit by a reverse proxy (not sure if they have HAProxy or NGINX on the cloud).
With the "move" to the cloud, they are even building their own version of Terraform that will spin up on-premise servers or whatever.
I really thought that one time when their internal emergency SMS system went down was bad. There was a water pipe that broke at the Boston office. The office water was fucked and full of shit. Wayfair attempted to send emergency texts to employees early in the morning to tell them to not come into work. Their new emergency SMS system wasn't tested, and it broke and didn't send messages. They had to use their legacy SMS tool, and it sent the texts one at a time in an O(n) fashion (probably through a Jenkins job lolzzzz!!!!). Employees got emergency texts by the afternoon when the water pipe was already fixed. Imagine if something really bad happened, like a fire or shooting in the mall!!!!!!!
But anyways I'm sure after that team's post-mortem, the execs said "great job everyone! Thanks for getting it fixed!"
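For what it's worth, sending a batch of messages concurrently instead of one at a time is not exotic. Here is a rough sketch in Python, assuming a hypothetical `send_text` function standing in for whatever SMS gateway is actually used. I have no idea what Wayfair's tools look like internally, so every name here is invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def send_text(phone_number, message):
    """Hypothetical stand-in for a real SMS gateway call; it only returns a receipt."""
    return f"sent to {phone_number}: {message}"


def notify_all(phone_numbers, message, max_workers=8):
    """Send to everyone using a bounded pool of worker threads.

    Receipts come back in the same order as phone_numbers.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda number: send_text(number, message), phone_numbers)
        )
```

A thread pool like this is probably enough for I/O-bound fan-out at office scale. For the billions-of-emails question you would more likely want a real message queue with a fleet of workers, and any real provider will impose rate limits that this sketch ignores.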