I’m curious how this lasted so long. Reading the postmortem, I notice there’s talk of memory pressure, but this was later determined to be the fleet running out of threads, i.e. hitting an operating-system resource limit. The JVM will throw an OutOfMemoryError with the message "unable to create new native thread" when it can’t create one. Did the Kinesis engineers not know that this particular OOM is an OS limit issue and assume the processes were actually running out of memory? https://aws.amazon.com/message/11201/
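For anyone who hasn’t hit this failure mode before, here’s a minimal sketch (entirely hypothetical, and deliberately abusive, so don’t run it on a shared box) that exhausts the OS thread limit and surfaces as an OutOfMemoryError even though the heap is nearly empty:

    // Spawns parked threads until the OS refuses to create more.
    // The JVM reports this as an OutOfMemoryError, typically with the
    // message "unable to create new native thread".
    import java.util.ArrayList;
    import java.util.List;

    public class ThreadExhaustion {
        public static void main(String[] args) {
            List<Thread> threads = new ArrayList<>();
            try {
                while (true) {
                    Thread t = new Thread(() -> {
                        try {
                            Thread.sleep(Long.MAX_VALUE); // keep the thread alive
                        } catch (InterruptedException ignored) {
                        }
                    });
                    t.start();
                    threads.add(t);
                }
            } catch (OutOfMemoryError e) {
                System.err.println("Created " + threads.size()
                        + " threads before failing with: " + e.getMessage());
            }
        }
    }

The point is that nothing in that error is about heap: raising -Xmx won’t help, only raising the OS limits (or creating fewer threads) will.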
My guess is that they got it wrong and treated it as a plain OOM at first, which is why it took so long. Just guessing; please don’t take this as inside information.
One of my final exams got postponed that day because of this. Now I have two exams on the same day! Blessed be Bezos
Amazon's systems are among the most sophisticated in the world, run by some of the best engineers around. The main issue, as I read the article, was making sure the memory issue really was the root cause; it's the classic correlation-versus-causation problem. When a lot of errors start arriving at once, taking steps in a panic without understanding the root cause could have led to even more downtime, and that's why it took so long to fix.
Dead on ^ And you don't just react when it's burning down, or everything can collapse around you and you lose the evidence of what truly went wrong. You determine the root cause, and you verify it's the real root cause, not just another symptom. Then you make a plan and fix it.
> At 9:39 AM PST, we were able to confirm a root cause, and it turned out this wasn’t driven by memory pressure. Rather, the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration.

This is the key section to me: running out of threads was mistaken for memory pressure. I don’t see how you read this other than that they saw out-of-memory exceptions and didn’t look at the error messages.
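A speculative illustration of what "looking at the error messages" could mean in practice (this is my own sketch, not AWS's code, and the class and method names are made up): triage or alerting code can branch on the OutOfMemoryError message before anyone concludes the heap is exhausted.

    // Hypothetical triage helper: separates the thread-limit variant of
    // OutOfMemoryError from genuine heap/metaspace exhaustion.
    public final class OomClassifier {
        static String classify(OutOfMemoryError e) {
            String msg = e.getMessage();
            if (msg != null && msg.contains("unable to create new native thread")) {
                // The OS refused to create another thread (e.g. a ulimit or
                // threads-max ceiling), so adding memory will not fix it.
                return "os-thread-limit";
            }
            return "heap-or-metaspace";
        }

        public static void main(String[] args) {
            // Usage example with a synthetic error carrying the well-known message.
            OutOfMemoryError sample = new OutOfMemoryError("unable to create new native thread");
            System.out.println(classify(sample)); // prints "os-thread-limit"
        }
    }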