I’m curious how this lasted so long. Reading the postmortem, I notice there’s talk of memory pressure, but this was later determined to be the fleet running out of threads, i.e. hitting an operating-system resource limit. The JVM will throw an OutOfMemoryError with the message "unable to create new native thread" when it can’t create one. Did the Kinesis engineers not know that this particular OOM is an OS limit issue and assume the processes were actually running out of memory? https://aws.amazon.com/message/11201/
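For anyone who hasn’t hit this failure mode before, here’s a minimal sketch (entirely hypothetical, and deliberately abusive, so don’t run it on a shared box) that exhausts the OS thread limit and surfaces as an OutOfMemoryError even though the heap is nearly empty:

    // Spawns parked threads until the OS refuses to create more.
    // The JVM reports this as an OutOfMemoryError, typically with the
    // message "unable to create new native thread".
    import java.util.ArrayList;
    import java.util.List;

    public class ThreadExhaustion {
        public static void main(String[] args) {
            List<Thread> threads = new ArrayList<>();
            try {
                while (true) {
                    Thread t = new Thread(() -> {
                        try {
                            Thread.sleep(Long.MAX_VALUE); // keep the thread alive
                        } catch (InterruptedException ignored) {
                        }
                    });
                    t.start();
                    threads.add(t);
                }
            } catch (OutOfMemoryError e) {
                System.err.println("Created " + threads.size()
                        + " threads before failing with: " + e.getMessage());
            }
        }
    }

The point is that nothing in that error is about heap: raising -Xmx won’t help, only raising the OS limits (or creating fewer threads) will.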
My guess is that they got it wrong and treated it as a plain OOM at first, which is why it took so long. Just guessing; please don’t take this as inside information.
One of my final exams got postponed that day because of this. Now I have two exams on the same day! Blessed be Bezos
Amazon's systems are among the most sophisticated in the world, run by some of the best engineers around. The main issue, as I read the article, was making sure the memory issue really was the root cause; it's the classic correlation-versus-causation problem. When a lot of errors start arriving at once, taking steps in a panic without understanding the root cause could have led to even more downtime, and that's why it took so long to fix.
Dead on ^ And you don't just react when it's burning down, or everything can collapse around you and you lose the evidence of what truly went wrong. You determine the root cause, and you verify it's the real root cause, not just another symptom. Then you make a plan and fix it.
> At 9:39 AM PST, we were able to confirm a root cause, and it turned out this wasn’t driven by memory pressure. Rather, the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration.

This is the key section to me: running out of threads was mistaken for memory pressure. I don’t see how you read this other than that they saw out-of-memory exceptions and didn’t look at the error messages.
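A speculative illustration of what "looking at the error messages" could mean in practice (this is my own sketch, not AWS's code, and the class and method names are made up): triage or alerting code can branch on the OutOfMemoryError message before anyone concludes the heap is exhausted.

    // Hypothetical triage helper: separates the thread-limit variant of
    // OutOfMemoryError from genuine heap/metaspace exhaustion.
    public final class OomClassifier {
        static String classify(OutOfMemoryError e) {
            String msg = e.getMessage();
            if (msg != null && msg.contains("unable to create new native thread")) {
                // The OS refused to create another thread (e.g. a ulimit or
                // threads-max ceiling), so adding memory will not fix it.
                return "os-thread-limit";
            }
            return "heap-or-metaspace";
        }

        public static void main(String[] args) {
            // Usage example with a synthetic error carrying the well-known message.
            OutOfMemoryError sample = new OutOfMemoryError("unable to create new native thread");
            System.out.println(classify(sample)); // prints "os-thread-limit"
        }
    }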