Technical incompetence at Wayfair -- is there anything worse??

Mar 16, 2019 179 Comments

Is there anything worse than Wayfair at this scale in terms of sheer technical incompetence and poor engineering practices?

Wayfair's codebase and infrastructure are stuck in a pathetic, sorry state. The PHP repo, the databases (which are all mirrors in some way), the vertical scaling -- it needs to be blasted into orbit if the company wants to save itself.

- There are directors and managers who have only ever worked at Wayfair, believe they are God, believe they hire top talent, and believe Wayfair has a top engineering culture. But they refuse to give teams access to the cloud to spin up new infrastructure.

- There is so much red tape just to use a new framework, such as an ORM that's been around for 20 years.

- The main DB has 3 TB of RAM and *still* has load trouble. Nobody on the cloud offers a database with 3TB of RAM because they assume companies will just scale properly and horizontally.

- The DBAs advise all developers to read from the database WITH (NOLOCK), which is essentially the READ UNCOMMITTED transaction isolation level, meaning you are reading UNCOMMITTED DATA FROM THE DATABASE!!!!!!!!!!!!! They say it's for "performance reasons", and, sadly, it's the only way Wayfair could "scale" the database so it could handle the reads and writes. Hmm, I wonder why some customers end up with duplicate couches at their doors or why their order never shows up at all...

- SQL tables are being used as queues!!!!!!

- Static methods all over the place that open and close database connections and sessions.

- SQL queries that span hundreds of lines because, according to a senior engineer, it's better to shove everything into one SQL query instead of making a few separate calls to the database.

- Most of the PHP teams have never heard of a simple model with setters and getters and if you add it to the code, they will freak out and ask what it is and tell you to consult a working group about it. Forget using the builder pattern, too. They will tell you to get rid of it and to instead shove everything into the model. There are models that span 12000 lines of code.

- Has anybody heard of pagination ever???????? A manager I talked to never heard of the offset keyword for a database query and didn't know what index-based pagination was. There are queries that don't do any sort of pagination and just grab all items in one go.

- Can't use SQL Server's Change Data Capture due to "performance reasons" for event-based architecture.

- Speaking of Change Data Capture, Wayfair uses triggers for everything! They use them for audit history and business logic. If the DBAs care so much about performance, why are they letting people use triggers for everything while the competent developers who know what they're doing can't use CDC for their databases?

- No foreign keys because "performance reasons"!

- Kafka has been destroyed. I think all of Wayfair is using the same damn Kafka instance and teams just split it up by topics. Also, messages can only live in the queue for a few days due to "performance reasons."

- I read somewhere that Wayfair implemented cookie-based database sharding for customers and ran into problems and had to change it. Whose idea was this, and did anyone ever Google anything on how to shard?????

- Their Kubernetes strategy is completely stupid. You need to get through 3 different internal tools to get a Kubernetes application up. You can't just use the damn cloud.

- Their Kubernetes architecture is on-premise and bastardized. There was one team who accidentally took up the resources of another team's app and brought down that entire app. I can think of one way that could happen.

- Jenkins. They use Jenkins jobs for everything!!!!! For one, PHP doesn't do multithreading, so they start multiple Jenkins nodes to do concurrent work. I heard even the non-PHP teams are using Jenkins for concurrent work.

- Jenkins Jenkins Jenkins!!!!!!!!!!! A great system design question for a Wayfair engineer: "how would you handle sending billions of emails to customers in a concurrent fashion?" I'll bet the engineers, including the directors and tech leads who haven't researched anything, would say "spin up thousands of Jenkins jobs and read from a SQL database!"

- The executives and directors here don't believe in microservices or having separate repositories and think because Google and Facebook can scale a monorepo, Wayfair can too.

- Testing. Most teams don't test their code and can't tell the difference between an integration test and unit test!!!!! There was one team that broke a runner because their unit tests were creating real files. They had to hire an entire team to hose down bad tests, so code can pass the pipelines and get into production. There is less than 1% code coverage. For the teams that do test, the tests are completely useless and break all the time because they are just testing spaghetti code. There is no point in testing spaghetti code.
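For anyone wondering what the pagination bullet above is getting at, here is a minimal sketch of OFFSET pagination next to index-based (keyset) pagination. It uses Python's built-in sqlite3 purely so it runs anywhere; the `orders` table, its columns, and every function name are made up for illustration and have nothing to do with Wayfair's actual schema. SQL Server spells the offset form differently (`ORDER BY ... OFFSET n ROWS FETCH NEXT m ROWS ONLY`), but the idea is the same.

```python
import sqlite3


def make_db(n_rows):
    """Build an in-memory table of hypothetical orders with ids 1..n_rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    conn.executemany(
        "INSERT INTO orders (id, item) VALUES (?, ?)",
        [(i, f"couch-{i}") for i in range(1, n_rows + 1)],
    )
    return conn


def page_by_offset(conn, page, page_size):
    """OFFSET pagination: simple, but the engine still walks past skipped rows."""
    return conn.execute(
        "SELECT id, item FROM orders ORDER BY id LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    ).fetchall()


def page_by_key(conn, last_seen_id, page_size):
    """Keyset pagination: seek directly past the last id already returned."""
    return conn.execute(
        "SELECT id, item FROM orders WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, page_size),
    ).fetchall()


def iter_all(conn, page_size):
    """Stream every row in bounded pages instead of one giant fetch."""
    last_seen_id = 0
    while True:
        rows = page_by_key(conn, last_seen_id, page_size)
        if not rows:
            return
        yield rows
        last_seen_id = rows[-1][0]
```

Both functions return the same rows for the same page. The keyset version just lets the primary-key index do the seeking, which tends to matter more the deeper you page.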

If you want to work here, then do it. Maybe you have a family to support and care about other things in life, but a lot of the engineering decisions could affect you!!!!

Do you really want to wake up at 4 in the morning for a critical because some incompetent engineer forgot to screw in the hard drive when they upgraded the database?

Do you want to be told by management that you can't use setters and getters? Forget about using the builder pattern, too. Most teams don't know what it is and will freak out and tell you to consult a useless working group about it.
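In case it isn't obvious how small these two ideas are, here is a rough sketch of a model with a getter and a setter plus a builder for it. It's in Python rather than PHP, and `Product`, `ProductBuilder`, and their fields are invented for the example, not taken from any real codebase.

```python
class Product:
    """A small model: state is private, access goes through getters/setters."""

    def __init__(self, sku, price_cents):
        self._sku = sku
        self.price_cents = price_cents  # routed through the setter below

    @property
    def sku(self):
        return self._sku

    @property
    def price_cents(self):
        return self._price_cents

    @price_cents.setter
    def price_cents(self, value):
        # The setter is the single place where the invariant is enforced.
        if value < 0:
            raise ValueError("price cannot be negative")
        self._price_cents = value


class ProductBuilder:
    """Builder: collect fields step by step, validate once in build()."""

    def __init__(self):
        self._sku = None
        self._price_cents = 0

    def with_sku(self, sku):
        self._sku = sku
        return self  # returning self allows chained calls

    def with_price_cents(self, price_cents):
        self._price_cents = price_cents
        return self

    def build(self):
        if not self._sku:
            raise ValueError("sku is required")
        return Product(self._sku, self._price_cents)
```

Usage would look like `ProductBuilder().with_sku("SOFA-1").with_price_cents(49900).build()`. The construction logic lives in the builder, so the model itself stays small.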

Do you want to see a serious scaling problem that could easily be solved by being on the cloud with managed services (BigQuery, Spanner, Bigtable) and be told by incompetent directors you have to use the monolith of the 3TB RAM databases, Jenkins, and SQL Server as queues????

There are honestly happy engineers here who like it and are completely oblivious to the problems and are OK with adding band-aids to gashing wounds and think the weekly criticals and post-mortems are a normal part of being a software engineer. Surely there are some teams doing better than others, but they can't unlock their true potential with upper management micromanaging them and telling them they can't use the cloud or can't download a library package.

Junior engineers... none of this is their fault. They are well-intentioned and want to learn and have potential, but they have incompetent directors and managers teaching them bad practices. They will need to do research outside of work or reach out to knowledgeable people within Wayfair (can be hard to find) if they want to learn what competent companies do. Some are so far gone and have drunk too much of the purple kool-aid though and they won't listen even if it's plain common sense.

The only hope the company has is to listen to and empower the competent engineering teams (there are a few, and a lot of the members end up leaving in frustration), but a lot of those employees happen to be in the Berlin office. Unfortunately, the US office isn't listening to their advice and blocks it with red tape and working groups that end up bastardizing the process. There are competent engineers scattered in the US teams, but they seem to have no power to make changes, haven't been at Wayfair for very long, or get blocked by upper management, and leave in frustration.

I heard managers went to the Berlin office and told all the competent engineers there that they can't use the Cloud and have to wait until management is done evaluating their strategy! They were pissed!

The strategy is to have a hybrid cloud with on-premise servers as the source of truth and to mirror the data to GCP. This means you can't use the Cloud APIs to spin up a Kubernetes cluster if you want. You can't use serverless App Engine and serverless architecture. Everything still has to go through Wayfair's internal tools!! Some things are on the cloud, but they are mirrors or clusters being hit by a reverse proxy (not sure if they have HAProxy or Nginx on the cloud).

With the "move" to the cloud, they are even building their own version of Terraform that will spin up on-premise servers or whatever.

I really thought that one time when their internal emergency SMS system went down was bad. There was a water pipe that broke at the Boston office. The office water was fucked and full of shit. Wayfair attempted to send emergency texts to employees early in the morning to tell them to not come into work. Their new emergency SMS system wasn't tested, and it broke and didn't send messages. They had to use their legacy SMS tool, and it went off in an O(n) fashion (probably through a Jenkins job lolzzzz!!!!). Employees got emergency texts by the afternoon when the water pipe was already fixed. Imagine if something really bad happened, like a fire or shooting in the mall!!!!!!!

But anyways I'm sure after that team's post mortem, the execs said "great job everyone! Thanks for getting it fixed!"
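For contrast with sending texts one at a time, here is a minimal sketch of fanning the sends out over a thread pool with Python's standard `concurrent.futures`. `send_sms` is a hypothetical stand-in for whatever gateway call a real system would make, and `broadcast` and its parameters are made-up names. This is an assumption-laden illustration, not how Wayfair's tool works.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def send_sms(number, message):
    """Hypothetical stand-in for a real SMS gateway call (network I/O)."""
    return f"sent to {number}"


def broadcast(numbers, message, send=send_sms, max_workers=32):
    """Fan sends out over a thread pool; record failures instead of stopping."""
    delivered, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its phone number so failures can be retried.
        futures = {pool.submit(send, n, message): n for n in numbers}
        for future in as_completed(futures):
            number = futures[future]
            try:
                future.result()
                delivered.append(number)
            except Exception:
                failed.append(number)
    return delivered, failed
```

The total work is still proportional to the number of recipients. What changes is that slow network calls overlap instead of queueing behind each other, and one bad number doesn't stall the rest.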



TOP 179 Comments
  • Uber bobaboi
    Did you type this on mobile?
    Mar 16, 2019 22
    • Wayfair / Eng cplusplus1
      I can vouch for almost everything the person mentioned above. I have been asked questions like "What is git?" "Can pandas parse Excels?" from L3 and a Tech Lead respectively. Recently I read an email where they hired a guy for an SDE role who was a "chef" until 2 months ago. I don't know what's happening really. People put on busy masks by going into meetings and spending 1h talking vaguely about things, eventually coming out of the meeting having achieved nothing. At the end of the day, they advertise saying they're the best in the business when it comes to tech. Hahaha, sure sure.
      May 9, 2019
    • Snapchat / Other dudeLA
      I interviewed at Wayfair and it was the stupidest interview of my life.
      Jun 26, 2019
  • Google GodOfAllah
    If you cannot write a blind post with optimal length, how dare you judge anything else.
    Mar 16, 2019 7
    • OP, I think you can ignore the man who calls himself God on blind.
      Mar 18, 2019
    • Amazon WicH21
      What a load of bull! I loved the post. Hopefully you get out op!
      Mar 20, 2019
  • Wayfair 47th
    Man, I’m not even a developer but I hear about this shit on the day to day and it seems this is pretty spot on. The entire engineering department is led by chickens with their heads cut off.
    Mar 17, 2019 10
    • Wayfair CUpR17
      They should have asked employees what the value and benefits of JIRA are compared to PT. PT makes people quit.
      Mar 17, 2019
    • Salesforce / Mgmt GQch66
      PT makes me sad. Of course, I came from Salesforce, which uses a custom version of Salesforce for its issue tracker. It's bad, but not as terrible as PT.

      The best part of being acquired by Salesforce was getting to jettison our GitHub-based robust CI/CD and JIRA integration. :(
      Mar 18, 2019
  • Wayfair olYP50
    Trust me, there are internal wheels in motion to start solving these issues. The problem is at our scale it's hard to turn the ship on a dime. The cloud is not the winner you seem to think it is. We have smart engineers who seem to forget the basics of scaling. We are working on changing the culture one piece at a time.
    Mar 17, 2019 9
    • Wayfair KpJP81
      What slack channel was this discussed in?
      Mar 26, 2019
    • Wayfair WNEm45
      Jun 1, 2019
  • SpaceX Elon'sMusk
    Damn I wasn't planning on reading a novel tonight
    Mar 16, 2019 1
    • Wayfair str(10)
      It's not fiction tho...
      Mar 17, 2019
  • Bloomberg ezJV08
    Tldr: TC or Gtfo
    Mar 16, 2019 1
    • Wayfair CUpR17
      Mar 16, 2019
  • Monster unholy!
    Love love love love this post.
    Mar 16, 2019 0
  • Rubrik PartyStick
    OP please practice what you preach and shard your post properly
    Mar 16, 2019 1
    • And please do it based on client-side cookies
      Mar 20, 2019
  • Amazon / R&D
    ⭐️ Post of the Year ⭐️
    Apr 8, 2019 0
  • Honestly first time hearing someone using Read Uncommitted transaction isolation level.

    3TB of RAM is REALLY a lot. We have done some spiky (very high traffic) events in the world and still our primary database has only 64 GB of RAM (in spite of one of the major applications not having connection pooling).
    Mar 16, 2019 6
    • SAP / R&D
      Mostly human
      This is exactly like my job! With (nolock) is necessary when you have hundreds of people reading the row. Normal reads take a shared read lock, which doesn't block other reads, but DOES block updates. Do you want people reading a company data row to block an admin updating a value there? You don't. With nolock shouldn't be used on data that might have _just_ been inserted. But on the 99.9% of the data that is more than a second old it's fine
      Jan 31
    • SAP / R&D
      Mostly human
      Our server only has 1 TB ram. DB is 8 TB. We have: 1000 tables, 5000 stored procedures, many complex triggers.
      Jan 31

