Technical incompetence at Wayfair -- is there anything worse??

Wayfair CUpR17
Mar 16 169 Comments

Is there anything worse than Wayfair at this scale in terms of sheer technical incompetence and poor engineering practices?

Wayfair's codebase and infrastructure are stuck in a pathetic, sorry state. The PHP repo, the databases (which are all mirrors in some way), the vertical scaling -- it needs to be blasted into orbit if the company wants to save itself.

- There are directors and managers who have only ever worked at Wayfair, believe they are God, believe they hire top talent, and believe Wayfair has a top engineering culture. But they refuse to give teams access to the cloud to spin up new infrastructure.

- There is so much red tape just to use a new framework, such as an ORM that's been around for 20 years.

- The main DB has 3 TB of RAM and *still* has load trouble. Nobody on the cloud offers a database with 3TB of RAM because they assume companies will just scale properly and horizontally.

- The DBAs advise all developers to read from the database WITH (NOLOCK), which is essentially a TRANSACTION READ UNCOMMITTED, meaning you are reading UNCOMMITTED DATA FROM THE DATABASE!!!!!!!!!!!!! They say it's for "performance reasons", and, sadly, it's the only way Wayfair could "scale" the database so it could handle the reads and writes. Hmm, I wonder why some customers end up with duplicate couches on their doors or why they don't get an order coming at all..

- SQL tables are being used as queues!!!!!!

- Static methods all over the place that open and close database connections and sessions.

- SQL queries that span hundreds of lines because, according to a senior engineer, it's better to shove everything into one SQL query instead of making a few separate calls to the database.

- Most of the PHP teams have never heard of a simple model with setters and getters and if you add it to the code, they will freak out and ask what it is and tell you to consult a working group about it. Forget using the builder pattern, too. They will tell you to get rid of it and to instead shove everything into the model. There are models that span 12000 lines of code.

- Has anybody heard of pagination ever???????? A manager I talked to never heard of the offset keyword for a database query and didn't know what index-based pagination was. There are queries that don't do any sort of pagination and just grab all items in one go.

- Can't use SQL Server's Change Data Capture due to "performance reasons" for event-based architecture.

- Speaking of Change Data Capture, Wayfair uses triggers for everything! They use them for audit history and business logic. If the DBAs care so much about performance, why are they letting people use triggers for everything while the competent developers who know what they're doing can't use CDC for their database?

- No foreign keys because "performance reasons"!

- Kafka has been destroyed. I think all of Wayfair is using the same damn Kafka instance and teams just split it up by topics. Also, messages can only live in the queue for a few days due to "performance reasons."

- I read somewhere that Wayfair implemented cookie-based database sharding for customers and ran into problems and had to change it. Whose idea was this, and did anyone ever Google anything on how to shard?????

- Their Kubernetes strategy is completely stupid. You need to get through 3 different internal tools to get a Kubernetes application up. You can't just use the damn cloud.

- Their Kubernetes architecture is on-premise and bastardized. There was one team who accidentally took up the resources of another team's app and brought down that entire app. I can think of one way how that could happen.

- Jenkins. They use Jenkins jobs for everything!!!!! For one, PHP doesn't do multithreading, so they start multiple Jenkins nodes to do concurrent work. I heard even the non-PHP teams are using Jenkins for concurrent work.

- Jenkins Jenkins Jenkins!!!!!!!!!!! A great system design question for a Wayfair engineer: "how would you handle sending billions of emails to customers in a concurrent fashion?" I'll bet the engineers, including the directors and tech leads who haven't researched anything, would say "spin up thousands of Jenkins jobs and read from a SQL database!"

- The executives and directors here don't believe in microservices or having separate repositories and think because Google and Facebook can scale a monorepo, Wayfair can too.

- Testing. Most teams don't test their code and can't tell the difference between an integration test and unit test!!!!! There was one team that broke a runner because their unit tests were creating real files. They had to hire an entire team to hose down bad tests, so code can pass the pipelines and get into production. There is less than 1% code coverage. For the teams that do test, the tests are completely useless and break all the time because they are just testing spaghetti code. There is no point in testing spaghetti code.
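
For anyone who hasn't run into the pagination point from the list above, here is a minimal sketch of the two usual approaches: OFFSET-based paging and index-based (keyset) paging. It uses Python's built-in sqlite3 with an in-memory database purely for illustration. The `items` table and every function name are hypothetical and have nothing to do with Wayfair's actual schema or code.

```python
import sqlite3


def make_demo_db(n_rows=25):
    """Build a throwaway in-memory table (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO items (id, name) VALUES (?, ?)",
        [(i, f"item-{i}") for i in range(1, n_rows + 1)],
    )
    conn.commit()
    return conn


def page_by_offset(conn, page, page_size):
    """OFFSET paging: simple, but the database still walks past every skipped row."""
    offset = page * page_size
    return conn.execute(
        "SELECT id, name FROM items ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()


def page_by_key(conn, last_seen_id, page_size):
    """Keyset paging: seek straight to the rows after the last id already returned."""
    return conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, page_size),
    ).fetchall()


def iterate_all(conn, page_size):
    """Walk the whole table one page at a time instead of grabbing everything in one go."""
    last_id = 0  # assumes ids are positive integers
    while True:
        rows = page_by_key(conn, last_id, page_size)
        if not rows:
            break
        yield rows
        last_id = rows[-1][0]
```

Both functions return the same rows for a given page. The difference is that the keyset version relies on the index over `id`, so later pages cost roughly the same as the first one, while OFFSET gets slower the deeper you go.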

If you want to work here, then do it. Maybe you have a family to support and care about other things in life, but a lot of the engineering decisions could affect you!!!!

Do you really want to wake up at 4 in the morning for a critical because some incompetent engineer forgot to screw in the hard drive when they upgraded the database?

Do you want to be told by management that you can't use setters and getters? Forget about using the builder pattern, too. Most teams don't know what it is and will freak out and tell you to consult a useless working group about it.

Do you want to see a serious scaling problem that could easily be solved by being on the cloud with managed services (BigQuery, Spanner, Bigtable) and be told by incompetent directors you have to use the monolith of the 3TB RAM databases, Jenkins, and SQL Server as queues????

There are honestly happy engineers here who like it and are completely oblivious to the problems and are OK with adding band-aids to gaping wounds and think the weekly criticals and post-mortems are a normal part of being a software engineer. Surely there are some teams doing better than others, but they can't unlock their true potential with upper management micromanaging them and telling them they can't use the cloud or can't download a library package.

Junior engineers... none of this is their fault. They are well-intentioned and want to learn and have potential, but they have incompetent directors and managers teaching them bad practices. They will need to do research outside of work or reach out to knowledgeable people within Wayfair (can be hard to find) if they want to learn what competent companies do. Some are so far gone and have drunk too much of the purple kool-aid though and they won't listen even if it's plain common sense.

The only hope the company has is to listen to and empower the competent engineering teams (there are a few, and a lot of the members end up leaving in frustration), but a lot of those employees happen to be in the Berlin office. Unfortunately, the US office isn't listening to their advice and blocks it with red tape and working groups that end up bastardizing a process. There are competent engineers scattered in the US teams, but they seem to have no power to make changes, haven't been at Wayfair for very long, or get blocked by upper management, and leave in frustration.

I heard managers went to the Berlin office and told all the competent engineers there that they can't use the Cloud and have to wait until management is done evaluating their strategy! They were pissed!

The strategy is to have a hybrid cloud with on-premise servers as the source of truth and to mirror the data to GCP. This means you can't use the Cloud APIs to spin up a Kubernetes cluster if you want. You can't use serverless Application Engine and serverless architecture. Everything still has to go through Wayfair's internal tools!! Some things are on the cloud, but they are mirrors or clusters being hit by a reverse proxy (not sure if they have HAProxy or Nginx on the cloud).

With the "move" to the cloud, they are even building their own version of Terraform that will spin up on-premise servers or whatever.

I really thought that one time when their internal emergency SMS system went down was bad. There was a water pipe that broke at the Boston office. The office water was fucked and full of shit. Wayfair attempted to send emergency texts to employees early in the morning to tell them to not come into work. Their new emergency SMS system wasn't tested, and it broke and didn't send messages. They had to use their legacy SMS tool, and it went off in an O(n) fashion (probably through a Jenkins job lolzzzz!!!!). Employees got emergency texts by the afternoon when the water pipe was already fixed. Imagine if something really bad happened, like a fire or shooting in the mall!!!!!!!

But anyways I'm sure after that team's post mortem, the execs said "great job everyone! Thanks for getting it fixed!"
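
For what it's worth, getting away from the one-at-a-time send described above does not take much. Below is a minimal sketch using a bounded thread pool from Python's standard library. `send_sms` is a hypothetical stand-in for whatever gateway call a real system would make (here it just sleeps to fake network latency), and `broadcast` and `max_workers=32` are made-up names and numbers, not anything Wayfair actually runs.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def send_sms(number, message):
    """Hypothetical gateway call; sleeps briefly to simulate network latency."""
    time.sleep(0.01)
    return number


def broadcast(numbers, message, max_workers=32):
    """Send to all numbers concurrently and report which sends succeeded or failed."""
    delivered, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its phone number so failures can be reported.
        futures = {pool.submit(send_sms, n, message): n for n in numbers}
        for fut in as_completed(futures):
            number = futures[fut]
            try:
                fut.result()
                delivered.append(number)
            except Exception:
                # A real system would log the error and retry instead of just collecting it.
                failed.append(number)
    return delivered, failed
```

Because the work is I/O-bound (waiting on a gateway), threads are enough here. With 32 workers the total wall time is roughly the sequential time divided by 32, and a failed send does not hold up the rest.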

TOP 169 Comments
  • Uber bobaboi
    Did you type this on mobile?
    Mar 16 22
    • Wayfair topUser
      Lol. I have interviews lined up for the rest of the week. Start applying - lots of opportunities out there.
      Mar 19
    • Wayfair / Eng jwxs26
      The new European PM was asking about microservices on SF today. Poor guy, doesn’t understand how fucked we are.
      Mar 19
    • Wayfair CUpR17
      OP
      Ahahahah, I saw there was some new guy in Germany who had an Express app with NodeJS and wanted Kubernetes on GCP. He was gonna be in for some huge surprise. 🤣
      Mar 19
    • Wayfair / Eng cplusplus1
      I can vouch for almost everything the person mentioned above. I have been asked questions like "What is git?" "Can pandas parse Excels?" from L3 and a Tech Lead respectively. Recently I read an email where they hired a guy for an SDE role who was a "chef" until 2 months ago. I don't know what's happening really. People put on busy masks by going into meetings and spending 1h talking vaguely about things, eventually coming out of the meeting having achieved nothing. At the end of the day, they advertise saying they're the best in the business when it comes to tech. Hahaha, sure sure.
      May 9
    • Snapchat / Other dudeLA
      I interviewed at Wayfair and it was the stupidest interview of my life.
      Jun 26
  • Wayfair 47th
    Man, I’m not even a developer but I hear about this shit on the day to day and it seems this is pretty spot on. The entire engineering department is led by chickens with their heads cut off.
    Mar 17 10
    • Wayfair CUpR17
      OP
      Just popped into the php-framework-wg channel. Wow, they are talking about uploading a very large file. They have no internal tool to handle it and are currently building one. If they used Amazon S3 or GCP's equivalent, they would have no problem and can use their API. I'll bet none of them have heard of multi-part uploads either.
      Mar 17
    • Wayfair CUpR17
      OP
      Came back here just to say it pisses me off so much they're hiring a project manager for PT and can't give us JIRA.
      Mar 17
    • Wayfair WNEm45
      Someone did a cost benefit analysis for PT vs JIRA. I think we all understand which side won that battle.
      Mar 17
    • Wayfair CUpR17
      OP
      They should have asked employees what the value and benefits of JIRA are compared to PT. PT makes people quit.
      Mar 17
    • Salesforce / Mgmt GQch66
      PT makes me sad. Of course, I came from Salesforce, which uses a custom version of Salesforce for its issue tracker. It's bad, but not as terrible as PT.

      The best part of being acquired by Salesforce was getting to jettison our GitHub-based robust CI/CD and JIRA integration. :(
      Mar 18
  • Google GodOfAllah
    If you cannot write a blind post with optimal length, how dare you judge anything else.
    Mar 16 7
    • Wayfair str(10)
      He might be delusional, but he's not wrong. The ineptitude at Wayfair really is shocking. It has to be experienced to be believed.
      Mar 17
    • Amazon leetN00b
      I agree with @God, how can you communicate clearly if you lose your audience along the way?
      Mar 17
    • Wayfair CUpR17
      OP
      This is Blind. I am venting. I could make the post 50% less, but it seems a lot of people enjoyed my rant, and now it seems too many people read it and an edit would be weird. Too late. If I have to post another tirade I will keep it in mind.
      Mar 17
    • Yelp tinsantin
      OP, I think you can ignore the man who calls himself God on blind.
      Mar 18
    • Amazon WicH21
      What a load of bull! I loved the post. Hopefully you get out op!
      Mar 20
  • Wayfair olYP50
    Trust me, there are internal wheels in motion to start solving these issues. The problem is at our scale it's hard to turn the ship on a dime. The cloud is not the winner you seem to think it is. We have smart engineers who seem to forget the basics of scaling. We are working on changing the culture one piece at a time.
    Mar 17 9
    • Wayfair CUpR17
      OP
      Git in particular is not built with a 20-30GB repository in mind, with 2000 engineers pushing and pulling. Even Atlassian's blog says it is against Git best practices. But Wayfair wants to ignore experts' recommendations and continue on with the monolith and make it bigger with the "decoupled" many repos in one repo approach.

      Composer is not built for a monorepo either, so they are going against what Composer is for.

      I don't know what it is going to take to make them realize what a disaster they're setting themselves up for. It might take the site going down for 30 days over something, or the main database exploding, or nobody able to push code due to Git servers crashing.
      Mar 17
    • Salesforce / Mgmt GQch66
      As a previous poster mentioned, there are efforts to fix a lot of these things, especially on the infra side. It's slow going and the tight coupling (aka spaghetti) makes it really hard to unwind. There has been a concerted effort to get Engineering leadership with actual experience, but it will take a while to unwind. There is a lot of WTF to navigate through to get there.
      Mar 18
    • Wayfair CUpR17
      OP
      They found the post and began defending Wayfair arch in the Slack channels. But this is a systemic problem, it is not any one person's fault.

      They defended using SQL tables as queues, using bare metal servers, and said the cloud is a fad. Someone said the OP doesn't know what he's talking about because he guessed wrong about what Wayfair used to send legacy SMS messages. It wasn't Jenkins, it was a PowerShell script.

      Where is the GIF with the cartoon character, and the point flies over his head?

      All the new people with experience who spoke up about the post were kindly defending it. A guy who has been here for a few months said the post resonated with him and asked why the codebase and infrastructure are in the state described in this post. Nobody could satisfactorily answer his question.
      Mar 24
    • Wayfair KpJP81
      What slack channel was this discussed in?
      Mar 26
    • Wayfair WNEm45
      Nonwork
      Jun 1
  • SpaceX Elon'sMusk
    Damn I wasn't planning on reading a novel tonight
    Mar 16 1
    • Wayfair str(10)
      It's not fiction tho...
      Mar 17
  • Monster unholy!
    Love love love love this post.
    Mar 16 0
  • Bloomberg ezJV08
    Tldr: TC or Gtfo
    Mar 16 1
  • Rubrik PartyStick
    OP please practice what you preach and shard your post properly
    Mar 16 1
    • And please do it based on client-side cookies
      Mar 20
  • Amazon / R&D ps.ily
    ⭐️ Post of the Year ⭐️
    Apr 8 0
  • Wayfair CUpR17
    OP
    I cannot believe this company is getting $8 billion in revenue. If you asked me 3-4 years ago, I would have never believed it. If there's anything I learned here, it's that you can get away with a lot in software ... but not without severe costs. I am pretty sure Wayfair wouldn't have to pile on tons of engineers to get anything done if the code wasn't full of spaghetti and the infrastructure wasn't completely fucked. This process is just unsustainable and eventually they won't be able to keep up with the current technical direction.

    Oh I forgot... the public engineering blog. I wish I read it before I joined. They need to hire an editor to hide all the embarrassing things they talk about. They built their own version of React because the real React wasn't performant enough. It took management 3 years to realize to use React. Since they're supposedly a "data-driven company," management probably told someone they had to gather data and prove their version of React was bad before they could actually use React. They make people gather data for some of the most common sense things.

    This place ... we are spinning wheels with fucking SQUARES!
    Mar 16 3
    • Uber / Eng AywB51
      But dude, it’s *revenue*. Not profit. Furniture is expensive.
      Mar 20
    • Wayfair CUpR17
      OP
      Yes, I know it's revenue, not profit.
      Mar 21
    • Amazon Northerner
      Good point about revenue, not profit from Uber 🙃
      Jun 23
  • Wayfair bnKI45
    I can confirm that everything in the post is true, unfortunately. Most of the external hires I speak to who have any experience are in the same situation where they are just gritting their teeth and waiting for the 1 year mark to make their escape. It's sad really, I've never worked anywhere as frustrating as this.
    Mar 17 2
    • why do people wait for 1 year mark?
      Mar 18
    • Wayfair bnKI45
      They have to pay back the signing bonus and don't get to keep any of the equity if they leave before the year
      Mar 19
  • Proofpoint / Other oooooool
    OP add the ENG tag on this post so it lives forever in eng tagged posts, because it's a prime example of all the things that can go wrong and that people should be on the lookout for when choosing a new company.
    Mar 17 1
  • Amazon atmw35
    Every company has their own version of tech decisions that look irrational from an outsider's perspective, but have a rational reason to exist. The key to making positive change is understanding the "why" of the current architecture and finding a way to satisfy that along with new cases. Hate Jenkins? What job scheduler would you use instead? Why would it be better? Why can't you make Jenkins better?
    Mar 27 3
    • Wayfair topUser
      Jenkins isn't a job scheduler, it's a build automation tool. Using it to run regular batch processing in production is pure stupidity. If you MUST do scheduling that way, you can use Airflow or something because you can define the logical steps required in code and easily trace failures. Next?

      You've got the right idea but the problem we all face is the extreme bias toward building extremely bad systems because it's "easy" and "fast". It's a culture defined by people who went to business school and learned about MVPs from a book but never got to the chapter that told you that you're supposed to continue iterating on the MVP after you build it. Every team at Wayfair is working on at least one MVP, and almost no team that I've seen is working on a v1 or v2 of the same product. They replace MVPs with other MVPs.
      Mar 29
    • Wayfair CUpR17
      OP
      Easier said than done. I guarantee you every Wayfair employee who complains on Blind has made a suggestion to fix severe problems but we all get shut down with "this isn't how we do things here" or "is there a ticket for talking about the problem?" or "use sqlserver to solve the problem".

      There is a heavy reliance on sql server for everything. They use it for queues. They use it for caching. Teams have asked for real queues like the AWS SQS or the GCP equivalent, but they get shut down. I heard we have RabbitMQ, but the SRE team limits the amount of messages sent, and they always have bottleneck problems.

      Teams spin up multiple Jenkins nodes to do concurrent work. They don't leverage multithreading. The other problem is 90% of company code is on PHP, and it doesn't support multithreading.
      Mar 29
    • SpaceX uevh5
      @topUser - hang on there buddy. Three days ago I discovered my team is using our build automation (Azure DevOps) as a task scheduler without any reasons. The difference is that when I pointed it out they replied "Yeah, that's messed up, would you fix it when you have time?"
      Apr 2
  • Twitter / Eng tremendous
    I love this post OP but I’m concerned you might have a stroke soon please take care of yourself kk
    Mar 18 0
  • New naduruk
    Honestly first time hearing someone using Read Uncommitted transaction isolation level.

    3TB of RAM is REALLY a lot. We have handled some spiky (very high traffic) events in the world and still our primary database has only 64 GB of RAM (even though one of the major applications doesn’t have connection pooling).
    Mar 16 4
    • Wayfair CUpR17
      OP
      Because doing it the proper way and reading committed data locks up the rows. It's the only way they could "scale". 🙄 They say "performance reasons."
      Mar 16
    • SpaceX uevh5
      WITH NOLOCK is a common practice when your teammates know no SQL, your DBAs are even worse, and your DB is a sentient mountain of shit, but you are the one responsible for a query that is performance critical. When used by a competent developer it's basically their way to say "Fcuk this shit, I am going home"

      OP....I... I don't know what to say ....if there is Heaven you'll be on God's right side one day.
      Mar 16
    • Wayfair CUpR17
      OP
      Thank you 🙏
      Mar 16
    • Wayfair str(10)
      The reason we have this monstrosity is that so much business logic is coded into database code. Hundreds of stored procedures, all fighting for resources at once, querying against huge tables and doing tons of expensive aggregations and joins and full table scans. So one of these sprocs might load 500GB of data into memory, and another might load 200GB, etc. Rather than refactoring those sprocs into microservices or some other lighter weight method, they instead just built a bigger server. It's amazing that they've gotten this far.
      Mar 17
  • New Xobpord
    Thank you for taking the time to write this!! I interviewed at Wayfair a year ago. They were hounding me for months. I am a technical director and they told me they would offer me an associate director role. I thought it would be good interview practice. I spoke to three other smug associate directors who were right out of college and thought they were the shit. They said they build everything in house coz of the Wayfair scale. I thought he was joking.. but nope.. he was serious. Everyone else was more interested in finding out how good I am at firing people. They have no process and they are bad role models for junior engineers. I didn’t respond to their email to discuss numbers. It’s bad if you are junior, it’s even worse if you are a senior engineer or manager.
    Mar 17 2
    • Wayfair CUpR17
      OP
      They think Wayfair's scale is so unique ... but Netflix uses the cloud, AirBnB uses the cloud, even Amazon uses their own cloud.....!!
      Mar 17
    • Microsoft staple
      Of course, what really scales is internal tools that weren't designed for widespread use, written by people who left a long time ago, and maintained by whoever broke it last.
      Mar 21
  • Tutanota randodata
    Well u see leetcode doesn’t have this stuff so what they are doing is clearly right
    Mar 16 1
    • Wayfair str(10)
      That's why we get to reject the best candidates - see, we're being Selective!
      Mar 17
  • Liberty Mutual Insurance whatsup99
    What group are you in? Currently interviewing for a developer role at Wayfair. Any groups to avoid and ones with more modern tech?
    Mar 19 5
    • Wayfair / Eng jwxs26
      All that have to do with writing code
      Mar 19
    • Wayfair WNEm45
      We pay better than Liberty Mutual. So I would try to get paid more than you are now. Just don't be on a team that is on call.
      Mar 19
    • Wayfair CUpR17
      OP
      I will tell you that all the teams' code is a dumpster fire in some way. If you care about sound system design, SOA, or microservices ... RUN!!!

      Every team member has to be on call in some way. You usually don't get put on until after the first 6 months. But I was put on call a few months in.
      Mar 19
    • Liberty Mutual Insurance hgyifr55
      Most places probably pay better than Liberty, but I heard they pay OK at the higher levels. There’s gotta be teams with more modern tech or at least insulated more, it’s the same at Liberty, lots of groups in a 3k IT person department, but some groups are really carving out their own and pushing for better stuff. Interesting with the on-call stuff, we now have that too, I see more places drive the same kind of thinking.
      Mar 20
    • Wayfair / Eng deeeznutss
      My team isn’t on call. Sure we have criticals, but no one is required to work outside 9-5 at any point. Just saying it’s not everyone
      Mar 21
  • Microsoft ufhudv
    Great post! I have an interview with them soon. Now I’m having second thoughts. How long did you try making a difference until you realized it wasn’t going to help?
    Mar 17 4
    • Wayfair CUpR17
      OP
      Hello fellow Wayfairian!!! People almost didn't believe me with the amount of technical BS here!!!!!!!!
      Mar 17
    • Wayfair CUpR17
      OP
      The CTO is an ex-consultant who has never written a line of code and doesn't believe in microservices or ORMs.
      Mar 17
    • Wayfair str(10)
      For me it took about 3 months to realize that I wasn't going to be able to make a difference. It dawned on me when I realized that none of the PMs had ever worked in Product anywhere else and most engineers had joined straight out of undergrad so nobody knew any better.
      Mar 17
    • Wayfair CUpR17
      OP
      I don't blame the junior engineers. I think they have potential, but the higher ups are teaching them terrible practices, and they won't listen to anyone else.

      Management is clueless and delusional, has no experience with the cloud or any interest in learning about it, and a lot of them started organically at Wayfair and never googled how to do things properly (so in that case yeah, junior engineers who grow into management positions become a problem).

      And yes the product managers are clueless too. There was one page that took 30 seconds to load due to a SQL query that spanned hundreds of lines. I asked a PM if I should refactor it. He said, "maybe. Put it in the backlog and we can talk about it with our tech lead." It never got done. 🙀

      Still SMFH over their decision 3 years ago to not use React, Node and PHP REST services for "performance reasons". Instead they built Tungsten, Turbine, etc., and now they are stuck with the consequences. Very few people know how to maintain the monstrosities, and no JS developer is gonna wanna join Wayfair so they can work on a legacy, homegrown JS framework.

      I still don't know what Tungsten and Turbine do or how they server-side render JS code .....
      Mar 17
  • Rubrik PartyStick
    OP doesn’t seem happy here
    Mar 16 0