[HN Gopher] AWS us-east-1 outage
___________________________________________________________________
 
AWS us-east-1 outage
 
Author : judge2020
Score  : 1361 points
Date   : 2021-12-07 15:42 UTC (7 hours ago)
 
web link (status.aws.amazon.com)
w3m dump (status.aws.amazon.com)
 
| sbr464 wrote:
| Random areas of booker.com / Mindbody are affected
 
| tonyhb wrote:
| Our services that are in us-east-2 are up, but I'm wondering how
| long that will hold true.
 
| technics256 wrote:
| our EKS/EC2 instances are OK
 
| AH4oFVbPT4f8 wrote:
| Unable to log into the console for us-east-1 here either
 
| dimitar wrote:
| https://status.hashicorp.com/incidents/3qc302y4whqr - seems to
| have affected Hashicorp Cloud too
 
| singlow wrote:
| I am able to access the console for us-west-2 by going to a
| region specific URL: https://us-west-2.console.aws.amazon.com/
| 
| It does take me to a non-region-specific login page which is up,
| and then redirects back to us-west-2, which works.
| 
| If I go to my bookmarked login page it breaks because,
| presumably, it hits something that is broken in us-east-1.
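| 
| A rough probe of that idea (just a sketch; the region list and the
| use of Python requests are my own assumptions, not anything AWS
| documents):
| 
|     # Check which region-specific console landing pages respond.
|     import requests
| 
|     REGIONS = ["us-east-1", "us-east-2", "us-west-2", "eu-west-1"]
| 
|     for region in REGIONS:
|         url = f"https://{region}.console.aws.amazon.com/"
|         try:
|             resp = requests.get(url, timeout=5, allow_redirects=False)
|             print(f"{region}: HTTP {resp.status_code}")
|         except requests.RequestException as exc:
|             print(f"{region}: request failed ({exc})")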
 
| CoachRufus87 wrote:
| Does there exist a resource that tracks outages on a per-region
| basis over time?
 
| soheil wrote:
| Can confirm I can once again login to the Console and everything
| seems to be back to normal in us-east-2.
 
| crescentfresh wrote:
| The majority of our errors stem from:
| 
| - writing to Firehose (S3-backed)
| 
| - publishing to eventbridge
| 
| - terraform commands to ECS' API are stuck/hanging
| 
| Other spurious errors involving Kinesis, but nothing alarming.
| This is all in us-east-1.
 
| hiyer wrote:
| Payments on Amazon India are down - likely because of this.
 
| dalrympm wrote:
| Contrary to what the status page says, CodePipeline is not
| working. Hitting the CLI I can start pipelines but they never
| complete and I get a lot of:
| 
| Connection was closed before we received a valid response from
| endpoint URL: "https://codepipeline.us-east-1.amazonaws.com/".
 
  | alexatalktome wrote:
  | Rumor is that our internal pipelines are the root cause. The
  | CICD pipelines (not tests, the literal pipeline infrastructure)
  | failed to block certain commits and pushed them to production
  | when not ready.
  | 
  | We've been told to manually disable them to ensure the integrity
  | of our services when the region recovers.
 
| AzzieElbab wrote:
| Betting on dynamo again
 
| markus_zhang wrote:
| Just curious: does it still make sense to claim that uptime is
| some number of nines (e.g. 99.999%)?
 
  | throwanem wrote:
  | Yep. In this case, zero nines.
 
| [deleted]
 
| biohax2015 wrote:
| Getting 502 in Parameter Store. Cloudformation isn't returning
| either -- and that's how we deploy code :(
 
| rickreynoldssf wrote:
| EC2 at least seems fine, but the console is definitely busted as
| of 16:00 UTC
 
  | Bedon292 wrote:
  | I am still getting 'Invalid region parameter' for resources in
  | us-east-1, the others are fine.
 
| [deleted]
 
| SubiculumCode wrote:
| Must be why I can't seem to access my amazon account. I thought
| my account had gotten compromised.
 
| imstil3earning wrote:
| Can't scale our Kubernetes cluster due to 500s from ECR :(
 
| jonnylangefeld wrote:
| Does anyone know why Google is showing the same spike on down
| detector as everything else? How does Google depend on AWS?
| https://downdetector.com/status/google/
 
  | NobodyNada wrote:
  | It's because Down Detector works off of user reports rather
  | than automatically detecting outages somehow. So, every time a
  | major service goes down (whether infrastructure like AWS or
  | Cloudflare, or user-facing like YouTube or Facebook), some
  | users will blame Google, ISPs, cellular providers, or some
  | other unrelated service.
 
  | gmm1990 wrote:
  | Some Google Sheets functions aren't updating in a timely manner
  | for me. Maybe people use Google as a backup for AWS and they have
  | to throttle certain services under the higher load.
 
| joelbondurant wrote:
| They should put everything in the cloud so hardware issues can't
| happen.
 
| megakid wrote:
| I live in London and I can't launch my Roomba vacuum to clean up
| after dinner because of this. Hurry up AWS, fix it!
 
| jrochkind1 wrote:
| This is affecting Heroku.
| 
| While my heroku apps are currently up, I am unable to push new
| versions.
| 
| Logging in to heroku dashboard (which does work), there is a
| message pointing to this heroku status incident for "Availability
| issues with upstream provider in the US region":
| https://status.heroku.com/incidents/2390
| 
| How can there be an outage severe enough to be affecting
| middleman customers like Heroku, while the AWS status page is
| still all green?!?!
| 
| If whoever runs the AWS status page isn't embarrassed, they really
| ought to be.
 
  | VWWHFSfQ wrote:
  | AWS management APIs in the us-east-1 region is what is
  | affected. I'm guessing Heroku uses at least the S3 APIs when
  | deploying new versions, and those are failing
  | (intermittently?).
  | 
  | I advise not touching your Heroku setup right now. Even
  | something like trying to restart a dyno might mean it doesn't
  | come back since the slug is probably stored on S3 and that will
  | fail.
 
| valeness wrote:
| This is more than just east. I am seeing the same error on us-
| west-2 resources.
 
  | singlow wrote:
  | I am seeing some issues, but only with services that have global
  | aspects, such as S3. I can't create an S3 bucket even though I
  | am in us-west-2, because I think the names are globally unique
  | and creating them depends on us-east-1.
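  | 
  | A minimal sketch of what that looks like from code (assuming
  | boto3 and configured credentials; the bucket name is a
  | placeholder):
  | 
  |     # Even with an explicit us-west-2 client, bucket creation
  |     # touches the global S3 namespace, which is presumably why
  |     # it can fail during a us-east-1 incident.
  |     import boto3
  |     from botocore.exceptions import ClientError
  | 
  |     s3 = boto3.client("s3", region_name="us-west-2")
  |     try:
  |         s3.create_bucket(
  |             Bucket="example-bucket-name-12345",  # placeholder
  |             CreateBucketConfiguration={
  |                 "LocationConstraint": "us-west-2"
  |             },
  |         )
  |     except ClientError as err:
  |         print("create_bucket failed:", err.response["Error"]["Code"])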
 
| herodoturtle wrote:
| Can't access Lightsail console even though our instances are in a
| totally different Region.
 
| avsteele wrote:
| Getting strange errors trying to manage my Amazon account right
| now. Could this be related?
| 
| 494 ERROR and "We're sorry Something went wrong with our website,
| please try again later."
 
| ricardobayes wrote:
| This was funny at first but now I can't even play Elder Scrolls
| Online :(
 
| numberwhun wrote:
| Amazon is having an outage in us-east-1, and it is bleeding over
| elsewhere, like eu: https://status.aws.amazon.com/
 
  | cyanydeez wrote:
  | Is that fail-failover?
 
    | hvgk wrote:
    | I think it's a factorial fail.
 
      | the-dude wrote:
      | It is failures all the way down.
 
        | hvgk wrote:
        | A disc of failures resting on elephants of failures
        | resting on a giant turtle of failures? Or are you more a
        | turtles of failures all the way down sort of person?
 
  | sharpy wrote:
  | Those 2 services that are being marked as having problems in
  | other regions have a fairly hard dependency on us-east-1. So that
  | would be why.
 
| blueside wrote:
| I was in the process of buying some tickets on Ticketmaster and
| the entire presale event had to be postponed for at least 4 hours
| due to this AWS outage.
| 
| I'm not complaining, I enjoyed the nostalgia - sometimes the web
| still feels like the late 90s
 
  | kingcharles wrote:
  | I'm complaining. I could not get my Happy Meal at McDonald's
  | this morning. Bad start to the day.
 
| _moof wrote:
| This is your regularly scheduled reminder:
| https://www.whoownsmyavailability.com/
 
| iso1210 wrote:
| My raspberry pi is still working just fine
 
  | marginalia_nu wrote:
  | Yeah, strange, my self-hosted server isn't affected either.
 
    | iso1210 wrote:
    | Seems "the cloud" had a major outage less than a month ago,
    | my laptop has a higher uptime.
    | 
    | $ 16:04 up 46 days, 7:02, 9 users, load averages: 3.68 3.56
    | 3.18
    | 
    | US East 1 was down just over a year ago
    | 
    | https://www.theregister.com/2020/11/25/aws_down/
    | 
    | Meanwhile I moved one of my two internal DNS servers to a
    | second site on 11 Nov 2020, and it's been up since then. One
    | of my monitoring machines has been filling, rotating and
    | deleting logs for 1,712 days with a load average in the c. 40
    | range for that whole time, just works.
    | 
    | If only there was a way to run stuff with an uptime of 364
    | days a year without using the cloud /s
 
      | nfriedly wrote:
      | I think the point of the cloud isn't increased uptime - the
      | point is that when it's down, bringing it back up is _someone
      | else's problem_.
      | 
      | (Also, OpEx vs CapEx financial shenanigans...)
      | 
      | All the same, I don't disagree with your point.
 
        | iso1210 wrote:
        | > the point is that when it's down, bringing it back up is
        | someone else's problem.
        | 
        | When it's down, it's my problem, and I can't do anything
        | about it other than explain that I have no idea why the
        | system is broken and can't do anything about it.
        | 
        | "Why is my dohicky down? When will it be back?"
        | 
        | "Because it's raining, no idea"
        | 
        | May be accurate, it's also of no use.
        | 
        | But yes, OpEx vs CapEx, of course that's why you can lease
        | your servers. It's far easier to spend another $500 a month
        | of company money on AWS than $500 a year on a new machine.
 
      | debaserab2 wrote:
      | So does my toaster, oven and microwave. So what? They get
      | used a few times a day, but my production-level equipment
      | serves millions in an hour.
 
        | iso1210 wrote:
        | My lightswitch is used twice a day, yet it works every
        | time. In the old days it would occasionally break (bulb
        | goes), I would be empowered to fix it myself (change the
        | bulb).
        | 
        | In the cloud you're at the mercy of someone who doesn't
        | even know you exist to fix it, without the protections
        | that say an electric company has with supplying domestic
        | users.
        | 
        | This thread has people unable to turn their lights on [0];
        | it's hilarious how people tie their stuff to dependencies
        | that aren't needed and that have a history of constant
        | failure.
        | 
        | If you want to host millions of people, then presumably
        | your infrastructure can cope with the loss of a single AZ
        | (and ideally the loss of Amazon as a whole). The vast
        | majority of people will be far better off without their
        | critical infrastructure going down in the middle of the
        | day in the busiest sales season going.
        | 
        | [0] https://news.ycombinator.com/item?id=29475499
 
  | jaywalk wrote:
  | Cool. Now let's have a race to see who can triple their
  | capacity the fastest. (Note: I don't use AWS, so I can actually
  | do it)
 
    | iso1210 wrote:
    | Why would I want to triple my capacity?
    | 
    | Most people don't need to scale to a billion users overnight.
 
      | jaywalk wrote:
      | Many B2B-type applications have a lot of usage during the
      | workday and minimal usage outside of it. No reason to keep
      | all that capacity running 24/7 when you only need most of
      | it for ~8 hours per weekday. The cloud is perfect for that
      | use case.
 
        | dijit wrote:
        | idk man, idle hardware doesn't use all that much power.
        | 
        | https://www.thomas-krenn.com/en/wiki/Processor_P-
        | states_and_...
        | 
        | Which is an implementation of:
        | 
        | https://web.eecs.umich.edu/~twenisch/papers/asplos09.pdf
 
        | iso1210 wrote:
        | Is it really? How much does that scaling actually cost?
        | 
        | And what's a workday anyway, surely you operate globally?
 
        | jaywalk wrote:
        | Scaling itself costs nothing, but saves money because
        | you're not paying for unused capacity.
        | 
        | The main application I run operates in 7 countries
        | globally, but the US is the only one that has enough
        | usage to require additional capacity during the workday.
        | So out of 720 hours in a 30 day month, cloud scaling
        | allows me to pay for additional capacity for only the
        | (roughly) 160 hours that it's actually needed. It's a
        | _significant_ cost saver.
        | 
        | And because the scaling is based on actual metrics, it
        | won't scale up on a holiday when nobody is using the
        | application. More cost savings.
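        | 
        | In AWS terms (not necessarily my stack), the same pattern
        | is a target-tracking scaling policy; a rough boto3 sketch,
        | with the ECS service name and thresholds as placeholders:
        | 
        |     # Scale an ECS service on average CPU so the extra
        |     # capacity only runs when the metric actually demands it.
        |     import boto3
        | 
        |     aas = boto3.client("application-autoscaling")
        |     resource_id = "service/my-cluster/my-service"  # placeholder
        | 
        |     aas.register_scalable_target(
        |         ServiceNamespace="ecs",
        |         ResourceId=resource_id,
        |         ScalableDimension="ecs:service:DesiredCount",
        |         MinCapacity=1,
        |         MaxCapacity=8,
        |     )
        |     aas.put_scaling_policy(
        |         PolicyName="cpu-target-tracking",
        |         ServiceNamespace="ecs",
        |         ResourceId=resource_id,
        |         ScalableDimension="ecs:service:DesiredCount",
        |         PolicyType="TargetTrackingScaling",
        |         TargetTrackingScalingPolicyConfiguration={
        |             "TargetValue": 50.0,
        |             "PredefinedMetricSpecification": {
        |                 "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        |             },
        |         },
        |     )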
 
        | vp8989 wrote:
        | You are (conveniently or not) incorrectly assuming that
        | the unit price of provisioned vs on-demand capacity is
        | the same. It's not.
 
        | jaywalk wrote:
        | Nice of you to assume that I don't understand the pricing
        | of the services I use. I can assure you that I do, and I
        | can also assure you that there is no such thing as
        | provisioned vs on-demand pricing for Azure App Service
        | until you get into the higher tiers. And even in those
        | higher tiers, it's cheaper _for me_ to use on-demand
        | capacity.
        | 
        | Obviously what I'm saying will not apply to all use
        | cases, but I'm only talking about mine.
 
        | [deleted]
 
| joelhaasnoot wrote:
| Amazon.com is also throwing errors left and right
 
| endisneigh wrote:
| Azure, Google Cloud, AWS and others need to have a "Status
| alliance" where they determine the status of each of their
| services by a quorum using all cloud providers.
| 
| Status pages are virtually useless these days
 
  | LinuxBender wrote:
  | Or just modify sites like DownDetector to show who is hosting
  | each site. When {n} sites hosted on {x} are down, one could draw
  | a conclusion. It won't be as detailed as "xyz services failed",
  | but rather that the overall operational chain is broken. There
  | could be a graph that shows _99% of sites hosted on Amazon US
  | East 1 down_; it would be hard to hide that. This could also
  | paint a picture of which companies are not active-active multi-
  | cloud.
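  | 
  | A toy sketch of the aggregation (the site-to-host mapping is
  | made up; real data would come from DNS/ASN lookups):
  | 
  |     from collections import Counter
  | 
  |     SITE_HOST = {
  |         "example-a.com": "aws-us-east-1",
  |         "example-b.com": "aws-us-east-1",
  |         "example-c.com": "gcp-us-central1",
  |     }
  | 
  |     def providers_likely_down(down_sites, threshold=2):
  |         counts = Counter(SITE_HOST[s] for s in down_sites
  |                          if s in SITE_HOST)
  |         return [host for host, n in counts.items() if n >= threshold]
  | 
  |     # e.g. two AWS-hosted sites reported down at once:
  |     print(providers_likely_down(["example-a.com", "example-b.com"]))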
 
    | avaika wrote:
    | Cloud providers are not just about web hosting. There are
    | dozens of tools hidden from end users. E.g. services used for
    | background operations / maintenance (e.g. aws codecommit /
    | codebuild / or even the AWS web console like today). This kind
    | of outage won't bring down your web site, but it still might
    | break your normal workflow and even cost you some money.
 
  | DarthNebo wrote:
  | https://en.wikipedia.org/wiki/Mexican_standoff
 
    | beamatronic wrote:
    | It's not the prisoner's dilemma?
 
  | smt88 wrote:
  | They can do this without an alliance. They very intentionally
  | choose not to do it.
  | 
  | Every major company has moved away from having accurate status
  | pages.
 
    | arch-ninja wrote:
    | Steam has a great status page; companies like that and
    | Cloudflare will eat Alphabet's lunch in the next 17-18 years.
 
      | ExtraE wrote:
      | That's a tight time frame a long way off. How'd you arrive
      | at 17-18?
 
      | moolcool wrote:
      | Where do you anticipate Steam to compete with Alphabet?
 
    | dentemple wrote:
    | It's because none of these companies are held responsible for
    | missing their actual SLAs, as opposed to their self-reported
    | SLA compliance.
    | 
    | So unless regulation gets implemented that says otherwise,
    | there's zero incentive for any company to maintain an
    | accurate status page.
 
      | soheil wrote:
      | How did you find a way to bring regulations into this?
      | There are monitoring services you can pay for to keep an
      | eye on your SLAs and your vendors'.
      | 
      | If you're not happy with the results, switch.
 
        | ybloviator wrote:
        | Technically, there are already regulations. SLA lies are
        | fraud.
        | 
        | But I'm leery of any business who's so dishonest they
        | fear any outside oversight that brings repercussions for
        | said dishonesty.
        | 
        | "If not happy, switch" is silly - it's not the customer's
        | problem. And if you're a large customer and have invested
        | heavily in getting staff trained on AWS, you can't just
        | move.
 
        | soheil wrote:
        | A) don't build a business that relies solely on the
        | existence of another. B) switch to another vendor if you're
        | not happy with your current one.
        | 
        | Really not that complicated.
 
      | iso1210 wrote:
      | The only uptimes I'm concerned with are those of my own
      | services, which my own monitoring keeps on top of. This
      | varies: if the monitoring page goes down for 10 seconds I'm
      | not worried; if one leg of an SMPTE 2022-7 feed is down for a
      | second that's fine; if it keeps going down for a second at a
      | time, that's a concern, etc.
      | 
      | If something I'm responsible for goes down to the point that
      | my stakeholders are complaining (which means something is
      | seriously wrong), they are not going to be happy with "oh the
      | cloud was down, not my fault"
      | 
      | Whether AWS is down or not is meaningless to me; whether my
      | service running on AWS is down or not is the key metric.
      | 
      | If a service is down and I can't get into it, then chatter
      | on things like outages mailing list, or HN, will let me
      | know if it's yet another cloud failure, or if it's
      | something that's affecting my machine only.
 
      | cwkoss wrote:
      | I wonder if there could be a profitable play where an
      | organization monitors SLA compliance, and then produces a
      | batch of lawsuits or a class action suit on behalf of all of
      | its members when the SLA is violated.
 
        | pm90 wrote:
        | This is a neat idea. Install a simple agent in all
        | customers' environments, select AWS dependencies, then
        | monitor uptime over time. Aggregate across customers, and
        | then go to AWS with this data.
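        | 
        | The agent half could be as dumb as a timed probe of a few
        | AWS APIs with the results shipped somewhere central; a
        | sketch (boto3 assumed; the probed services and interval are
        | arbitrary):
        | 
        |     import datetime, json, time
        |     import boto3
        |     from botocore.exceptions import BotoCoreError, ClientError
        | 
        |     def probe():
        |         checks = {
        |             "s3": lambda: boto3.client(
        |                 "s3", region_name="us-east-1").list_buckets(),
        |             "sts": lambda: boto3.client(
        |                 "sts", region_name="us-east-1").get_caller_identity(),
        |         }
        |         out = {}
        |         for name, call in checks.items():
        |             try:
        |                 call()
        |                 out[name] = "ok"
        |             except (BotoCoreError, ClientError) as err:
        |                 out[name] = f"error: {err}"
        |         return out
        | 
        |     while True:
        |         record = {"ts": datetime.datetime.utcnow().isoformat(),
        |                   **probe()}
        |         print(json.dumps(record))  # in practice, ship to aggregator
        |         time.sleep(60)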
 
      | xtracto wrote:
      | >It's because none of these companies are held responsible
      | for missing their actual SLAs, as opposed to their self-
      | reported SLA compliance.
      | 
      | Right, there should be an "alliance" of customers from
      | different large providers (something like a union, but of
      | customers instead of workers). They are the ones that should
      | measure SLAs and hold the providers accountable.
 
    | beamatronic wrote:
    | In a broader world sense, we live in the post-truth era.
 
| andrew_ wrote:
| This is also affecting Elastic Beanstalk and Elastic Container
| Registry. We're getting 500s via the API for both services.
 
| torbTurret wrote:
| Any leads on the cause? Some services on us-east-1 seem to be
| working just fine. Others not.
 
| sirfz wrote:
| AWS ec2 API requests are failing and ECR also seems to be down
| (error 500 on pull)
 
| mohanmcgeek wrote:
| Same happening right now with Goodreads
 
| zahil wrote:
| Is it just me, or are outages involving large websites becoming
| more and more frequent?
 
| amir734jj wrote:
| My job at Azure (about 50% of my time, anyway) is unit testing and
| monitoring services under different scenarios and flows to detect
| small failures that would be overlooked on the public status page.
| Our tests run multiple times daily and we have people constantly
| monitoring logs. It concerns me when I see all AWS services 100%
| green when I know there is an outage.
 
  | jamesfmilne wrote:
  | Oh, sweet summer child.
  | 
  | The reason you care about your status page being 100% accurate
  | is that your stock price is not directly linked to your status
  | page.
 
  | joshocar wrote:
  | I don't know how accurate this information is, but I'm hearing
  | that the monitor can't be updated because the service is in the
  | region that is down.
 
    | Rapzid wrote:
    | Kinda hard to believe after they were blasted for that very
    | situation during/after the S3 outage way back.
    | 
    | If that's the case, it's 100% a feature. They want as little
    | public proof of an outage as possible after it's over, and to
    | put the burden entirely on customers to prove that SLAs were
    | violated.
 
| [deleted]
 
| fnord77 wrote:
| can't wait for the postmortem on this
 
| [deleted]
 
| andreyazimov wrote:
| My payment processor Paddle also stopped working.
 
| skapadia wrote:
| I'm absolutely terrified by our reliance on cloud providers (and
| modern technology, networks, satellites, electric grids, etc. in
| general), but the cloud hasn't been around that long. It is
| advancing extremely fast, and every problem makes it more
| resilient.
 
  | skapadia wrote:
  | IMO, the cloud is not the overall problem. Our insatiable
  | desire to want things right HERE, right NOW is the core issue.
  | The cloud is just a solution to try to meet that demand.
 
  | SkyPuncher wrote:
  | Would you prefer to have more of something good that you can't
  | occasionally have OR less of something good that you can always
  | have?
  | 
  | ----
  | 
  | The answer likely depends on the specific thing, but I'd argue
  | most people would take the better version of something at the
  | risk of it not working 1 or 2 days per year.
 
| l72 wrote:
| Most of our services in us-east-1 are still responding although
| we cannot log into the console. However, it looks like dynamodb
| is timing out most requests for us.
 
| john37386 wrote:
| It seems a bit long to fix!
| 
| They probably painted themselves into a corner, just like Facebook
| a few weeks ago.
| 
| This makes me think:
| 
| Could it be that one day the internet will have a total global
| outage and it will take a few days to recover?
 
  | ericskiff wrote:
  | This actually happened back in 2012 or so. Major AWS outage
  | that took down big services all over the place, took a few days
  | for some services to come fully back online.
  | https://aws.amazon.com/message/680342/
 
  | devenvdev wrote:
  | The only possible scenario I could come up with is someone
  | crashing the internet on purpose, e.g. some crazy uncatchable
  | trojan that starts DDoSing everything. I doubt such a scenario is
  | feasible though...
 
  | rossdavidh wrote:
  | If we have a total global outage, Stack Overflow will be
  | unavailable, and the internet will never be fixed. :) Mostly
  | joking, I hope...
 
    | jaywalk wrote:
    | Some brave soul at Stack Overflow will have to physically go
    | into the datacenter, roll up a cart with a keyboard, monitor
    | and printer and start printing off a bunch of Networking
    | answers.
 
      | lesam wrote:
      | The StackOverflow datacenter is famously tiny - like, 10
      | Windows servers. So even if the rest of the internet goes
      | down hopefully they stay up. They might have to rebuild the
      | internet out from their server room though.
 
  | iso1210 wrote:
  | I'm not sure how you get a total global outage in a distributed
  | system. Let's say a major transit provider (CenturyLink, for
  | example) advertises "go via me" routes but then drops the
  | traffic; let's also assume it drops the cost of those routes to
  | pretty much zero. That would certainly have a major effect, until
  | their customers/peers shut down those peerings.
  | 
  | That might be tricky if they are remote, not on the same AS as
  | their router access points, and have no completely out-of-band
  | access, but you're still talking hours at most.
 
| eoinboylan wrote:
| Can't SSM into ec2 in us-east-1
 
| johnsimer wrote:
| Everything seems to be functioning normally for me now
 
| ChrisArchitect wrote:
| We have always been at war with us-east-1.
 
| ComputerGuru wrote:
| The AWS status page no longer loads for me. /facepalm
 
| techthumb wrote:
| https://status.aws.amazon.com hasn't been updated to reflect the
| outage yet
 
  | cyanydeez wrote:
  | Probably cached for good measure
 
  | hvgk wrote:
  | There's probably a lambda somewhere supposed to update it that
  | is screaming into the darkness at the moment.
  | 
  | According to an internal message I saw, their monitoring stuff
  | is fucked too.
 
| cbtacy wrote:
| It's funny but when I saw "AWS Outage" breaking, my first thought
| was "I bet it's US-east-1 again."
| 
| I know it's cheap but seriously... not worth it. Many of us have
| the scars to prove this.
 
| wrren wrote:
| Looks like their health check logic also sucks, just like mine.
 
| whoknowswhat11 wrote:
| Anyone understand why these services go down for so long?
| 
| That's the part I find interesting.
 
| pixelmonkey wrote:
| Looks like Kinesis Firehose is either the root cause, or severely
| impacted:
| 
| https://twitter.com/amontalenti/status/1468265799458639877
| 
| Segment is publicly reporting issues delivering to Firehose, and
| one of my company's real-time monitors also triggered for Kinesis
| Firehose an hour ago.
| 
| Update:
| 
| By my sniff of it, some "core" APIs are down for S3 and EC2 (e.g.
| GET/PUT on S3 and node create/delete on EC2). Systems like
| Kinesis Firehose and DynamoDB rely on these APIs under the hood
| ("serverless" is just "a server in someone else's data center").
| 
| Further update:
| 
| There is a workaround available for the AWS Console login issue.
| You can use https://us-west-2.console.aws.amazon.com/ to get in
| -- it's just the landing page that is down (because the landing
| page is in the affected region).
 
| muttantt wrote:
| Running anything on us-east-1 is asking for trouble...
 
| [deleted]
 
| anovikov wrote:
| Haha, my developer called me in a panic telling me that he crashed
| Amazon - he was doing some load tests with Lambda
 
  | imstil3earning wrote:
  | thats cute xD
 
  | Justsignedup wrote:
  | thank you for that big hearty laugh! :)
 
  | xtracto wrote:
  | How can you own a developer? is it expensive to buy one?
 
    | DataGata wrote:
    | Don't get nitty about saying "my X". People say "my plumber"
    | or "my hairstylist" or whatever all the time.
 
  | rossdavidh wrote:
  | If he actually knows how to crash Amazon, you have a new
  | business opportunity, albeit not a very nice one...
 
  | politelemon wrote:
  | It'd be hilarious if you kept that impression going for the
  | duration of the outage.
 
  | tgtweak wrote:
  | Postmortem: unbounded auto-scaling of Lambda combined with an
  | oversight in internal rate limits caused an unforeseen internal
  | DDoS.
 
| kuya11 wrote:
| The blatant status page lies are getting absolutely ridiculous.
| How many hours does a service need to be totally down until it
| gets properly labelled as a "disruption"?
 
  | bearjaws wrote:
  | Yeah, we are seeing SQS, API Gateway (both WebSocket and non-
  | WebSocket) and S3 all completely unavailable. The status page
  | shows nothing despite having gotten several updates.
 
| nemothekid wrote:
| Some sage advice I learned a while ago: "Avoid us-east-1 as much
| as possible".
 
  | dr-detroit wrote:
  | If you need to be up all the time, don't you use more than one
  | region? Or do you need the ultra-low ping for running 911
  | operator phone systems that calculate real-time firing solutions
  | to snipe ICBMs out of low orbit?
 
  | soco wrote:
  | But if you use CloudFront, there you go.
 
| [deleted]
 
| alex_young wrote:
| EDIT: As pointed out below, I missed that this was for the Amazon
| Connect service, and not an update for all of US-EAST-1.
| Preserved for consistency, but obviously just a comprehension
| issue on my side.
| 
| At least the updates are amusing:
| 
| "9:18 AM PST We can confirm degraded Contact handling by agents
| in the US-EAST-1 Region. Agents may experience issues logging in
| or being connected with end-customers."
| 
| WTF is "contact handling", an "agent" or an "end-customer"?
| 
| How about something like "We are confirming that some users are
| not able to connect to AWS services in us-east-1. We're looking
| into it."
 
  | dastbe wrote:
  | that's an update for amazon connect, which is a customer
  | support related service.
 
  | detaro wrote:
  | Amazon Connect is a call-center product, so that report makes
  | sense.
 
| adwww wrote:
| Left a big terraform plan running while I put the kids to bed,
| checked back now and Amazon is on fire.... was it me?!
 
| mohanmcgeek wrote:
| I don't think it's a console outage. Goodreads has been down for
| a while
 
| jrs235 wrote:
| Search on Amazon.com seems to be broken too. This doesn't appear
| to just be affecting their AWS revenue.
 
  | authed wrote:
  | I've had issues logging in at amazon.com too... and IMDb is also
  | down
 
| bamboozled wrote:
| I don't think AWS knows what's going on, judging by their updates.
| Yes, DynamoDB might be having issues, but so is IAM, it seems;
| we're getting errors terminating resources, for example.
 
| jgworks wrote:
| Someone posted this on our company slack:
| https://stop.lying.cloud/
 
| mbordenet wrote:
| I suspect the ex-Amazonian PragmaticPulp cites was let go from
| Amazon for a reason. The COE process works, provided the culture
| is healthy and genuinely interested in fixing systemic problems.
| Engineers who seek to deflect blame are toxic and unhelpful.
| Don't hire them!
 
| woshea901 wrote:
| N. Virginia consistently has more problems than other regions. Is
| it possible this region is also hosting government computers, and
| could it be a more frequent target for this reason?
 
  | longhairedhippy wrote:
  | The real reason is that us-east-1 was the first and is by far the
  | biggest region. That's also why new services always launch there,
  | while other regions are not necessarily required at launch (some
  | services have to launch in every region).
  | 
  | The us-east-1 region is consistently pushing the limits of scale
  | for the AWS services, thus it has way more problems than other
  | regions.
 
| exabrial wrote:
| Just a reminder that Colocation is always an option :)
 
| lgylym wrote:
| So re:Invent is over. Time to deploy.
 
| filip5114 wrote:
| Can confirm, us-east-1
 
  | romanhotsiy wrote:
  | Can confirm Lambda is down in us-east-1. Other services seem to
  | work for us.
 
  | imnoscar wrote:
  | STS or console login not working either.
 
| all_usernames wrote:
| 25 Regions, 85 Availability Zones in this global cloud service
| and I can't login because of a failure in a single region (their
| oldest).
| 
| Can't login to AWS console at signin.aws.amazon.com:
| Unable to execute HTTP request: sts.us-east-1.amazonaws.com.
| Please try again.
 
| ipmb wrote:
| Looks like they've acknowledged it on the status page now.
| https://status.aws.amazon.com/
| 
| > 8:22 AM PST We are investigating increased error rates for the
| AWS Management Console.
| 
| > 8:26 AM PST We are experiencing API and console issues in the
| US-EAST-1 Region. We have identified root cause and we are
| actively working towards recovery. This issue is affecting the
| global console landing page, which is also hosted in US-EAST-1.
| Customers may be able to access region-specific consoles going to
| https://console.aws.amazon.com/. So, to access the US-WEST-2
| console, try https://us-west-2.console.aws.amazon.com/
 
  | jabiko wrote:
  | Yeah, but I still have a different understanding of what
  | "Increased Error Rates" means.
  | 
  | IMHO it should mean that the rate of errors is increased but
  | the service is still able to serve a substantial amount of
  | traffic. If the rate of errors is bigger than, let's say, 90%
  | that's not an increased error rate, that's an outage.
 
    | thallium205 wrote:
    | They say that to try and avoid SLA commitments.
 
      | jiggawatts wrote:
      | Some big customers should get together and make an
      | independent org to monitor cloud providers and force them
      | to meet their SLA guarantees without being able to weasel
      | out of the terms like this...
 
  | guenthert wrote:
  | Uh, four minutes to identify the root cause? Damn, those guys
  | are on fire.
 
    | czbond wrote:
    | :) I imagine it went like this theoretical Slack
    | conversation:
    | 
    | > Dev1: Pushing code for branch "master" to "AWS API".
    | > Your deploy finished in 4 minutes
    | > Dev2: I can't reach the API in east-1
    | > Dev1: Works from my computer
 
    | tonyhb wrote:
    | It was down as of 7:45am (we posted in our engineering
    | channel), so that's a good 40 minutes of public errors before
    | the root cause was figured out.
 
    | Frost1x wrote:
    | Identify or to publicly acknowledge? Chances are technical
    | teams knew about this and noticed it fairly quickly, they've
    | been working on the issue for some time. It probably wasn't
    | until they identified the root cause and had a handful of
    | strategies to mitigate with confidence that they chose to
    | publicly acknowledge the issue to save face.
    | 
    | I've broken things before and been aware of it, but didn't
    | acknowledge them until I was confident I could fix them. It
    | allows you to maintain an image of expertise to those outside
    | who care about the broken things but aren't savvy to what or
    | why it's broken. Meanwhile you spent hours, days, weeks
    | addressing the issue and suddenly pull a magic solution out
    | of your hat to look like someone impossible to replace.
    | Sometimes you can break and fix things without anyone even
    | knowing, which is very valuable if breaking something carried
    | some real risk to you.
 
      | sirmarksalot wrote:
      | This sounds very self-blaming. Are you sure that's what's
      | really going through your head? Personally, when I get
      | avoidant like that, it's because of anticipation of the
      | amount of process-related pain I'm going to have to endure
      | as a result, and it's much easier to focus on a fix when
      | I'm not also trying to coordinate escalation policies that
      | I'm not familiar with.
 
    | flerchin wrote:
    | The outage started at 7:31 PST according to our monitoring.
    | They are on fire, but not in a good way.
 
  | giorgioz wrote:
  | I'm trying to log in to the AWS Console from other regions but
  | I'm getting HTTP 500. Has anyone managed to log in from another
  | region? Which ones?
  | 
  | Our backend is failing; it's on us-east-1 using AWS Lambda, API
  | Gateway, and S3.
 
  | bobviolier wrote:
  | https://status.aws.amazon.com/ still shows all green for me
 
    | banana_giraffe wrote:
    | It's acting odd for me. Shows all green in Firefox, but shows
    | the error in Chrome even after some refreshes. Not sure
    | what's caching where to cause that.
 
  | dang wrote:
  | Ok, we've changed the URL to that from https://us-
  | east-1.console.aws.amazon.com/console/home since the latter is
  | still not responding.
  | 
  | There are also various media articles but I can't tell which
  | ones have significant new information beyond "outage".
 
  | jesboat wrote:
  | > This issue is affecting the global console landing page,
  | which is also hosted in US-EAST-1
  | 
  | Even this little tidbit is a bit of a wtf for me. Why do they
  | consider it ok to have _anything_ hosted in a single region?
  | 
  | At a different (unnamed) FAANG, we considered it unacceptable
  | to have anything depend on a single region. Even the dinky
  | little volunteer-run thing which ran
  | https://internal.site.example/~someEngineer was expected to be
  | multi-region, and was, because there was enough infrastructure
  | for making things multi-region that it was usually pretty easy.
 
    | alfiedotwtf wrote:
    | Maybe has something to do with CloudFront mandating certs to
    | be in us-east-1?
 
      | tekromancr wrote:
      | YES! Why do they do that? It's so weird. I will deploy a
      | whole config into us-west-1 or something; but then I need
      | to create a new cert in us-east-1 JUST to let cloudfront
      | answer an HTTPS call. So frustrating.
 
        | jamesfinlayson wrote:
        | Agreed - in my line of work regulators want everything in
        | the country we operate from but of course CloudFront has
        | to be different.
 
    | sheenobu wrote:
    | I think I know specifically what you are talking about. The
    | actual files an engineer could upload to populate their
    | folder was not multi-region for a long time. The servers
    | were, because they were stateless and that was easy to multi-
    | region, but the actual data wasn't until we replaced the
    | storage service.
 
    | ehsankia wrote:
    | Forget the number of regions. Monitoring for X shouldn't even
    | be hosted on X at all...
 
    | stevehawk wrote:
    | I don't know if that should surprise us. AWS hosted their
    | status page in S3 so it couldn't even reflect its own outage
    | properly ~5 years ago.
    | https://www.theregister.com/2017/03/01/aws_s3_outage/
 
    | tekromancr wrote:
    | I just want to serve 5 terabytes of data
 
      | mrep wrote:
      | Reference for those out of the loop:
      | https://news.ycombinator.com/item?id=29082014
 
      | [deleted]
 
    | all_usernames wrote:
    | Every damn Well-Architected Framework includes multi-AZ if
    | not multi-region redundancy, and yet the single access point
    | for their millions of customers is single-region. Facepalm in
    | the form of $100Ms in service credits.
 
      | cronix wrote:
      | > Facepalm in the form of $100Ms in service credits.
      | 
      | It was also greatly affecting Amazon.com itself. I kept
      | getting sporadic 404 pages and one was during a purchase.
      | Purchase history wasn't showing the product as purchased
      | and I didn't receive an email, so I repurchased. Still no
      | email; this time the purchase didn't end in a 404, but the
      | product still didn't show up in my purchase history. I have
      | no idea if I purchased anything, or not. I have never had
      | an issue purchasing. Normally get a confirmation email
      | within 2 or so minutes and the sale is immediately
      | reflected in purchase history. I was unaware of the greater
      | problem at that moment or I would have steered clear at the
      | first 404.
 
        | jjoonathan wrote:
        | Oh no... I think you may be in for a rough time, because
        | I purchased something this morning and it only popped up
        | in my orders list a few minutes ago.
 
      | vkgfx wrote:
      | >Facepalm in the form of $100Ms in service credits.
      | 
      | Part of me wonders how much they're actually going to pay
      | out, given that their own status page has only indicated
      | _five_ services with moderate ( "Increased API Error
      | Rates") disruptions in service.
 
    | ithkuil wrote:
    | One region? I forgot how to count that low
 
  | stephenr wrote:
  | When I brought up the status page (because we're seeing
  | failures trying to use Amazon Pay) it had EC2 and Mgmt Console
  | with issues.
  | 
  | I opened it again just now (maybe 10 minutes later) and it now
  | shows DynamoDB has issues.
  | 
  | If past incidents are anything to go by, it's going to get
  | worse before it gets better. Rube Goldberg machines aren't
  | known for their resilience to internal faults.
 
  | jeremyjh wrote:
  | They are still lying about it; the issues are affecting not only
  | the console but also AWS operations such as S3 puts. S3 still
  | shows green.
 
    | packetslave wrote:
    | IAM is a "global" service for AWS, where "global" means "it
    | lives in us-east-1".
    | 
    | STS at least has recently started supporting regional
    | endpoints, but most things involving users, groups, roles,
    | and authentication are completely dependent on us-east-1.
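    | 
    | For the STS part, the SDKs can be pointed at a regional
    | endpoint explicitly (a sketch, assuming boto3; the region
    | choice is arbitrary, and setting
    | AWS_STS_REGIONAL_ENDPOINTS=regional in the environment should
    | have a similar effect):
    | 
    |     import boto3
    | 
    |     sts = boto3.client(
    |         "sts",
    |         region_name="us-west-2",
    |         endpoint_url="https://sts.us-west-2.amazonaws.com",
    |     )
    |     print(sts.get_caller_identity()["Arn"])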
 
    | lsaferite wrote:
    | It's certainly affecting a wider range of stuff from what
    | I've seen. I'm personally having issues with API Gateway,
    | CloudFormation, S3, and SQS
 
      | pbalau wrote:
      | > We are experiencing _API_ and console issues in the US-
      | EAST-1 Region
 
        | jeremyjh wrote:
        | I read it as console APIs. Each service API has its own
        | indicator, and they are all green.
 
      | midasuni wrote:
      | Our corporate ForgeRock 2FA service is apparently broken.
      | My services are behind distributed x509 certs so no
      | problems there.
 
    | Rantenki wrote:
    | Yep, I am seeing failures on IAM as well:
    | 
    |     aws iam list-policies
    | 
    |     An error occurred (503) when calling the ListPolicies
    |     operation (reached max retries: 2): Service Unavailable
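    | 
    | While the control plane is flaky, cranking up client-side
    | retries sometimes helps (a sketch with boto3/botocore; the
    | numbers are arbitrary):
    | 
    |     import boto3
    |     from botocore.config import Config
    | 
    |     # "adaptive" adds client-side rate limiting on top of
    |     # exponential backoff
    |     retry_cfg = Config(retries={"max_attempts": 10,
    |                                 "mode": "adaptive"})
    |     iam = boto3.client("iam", config=retry_cfg)
    |     resp = iam.list_policies(Scope="Local", MaxItems=10)
    |     print(len(resp["Policies"]), "policies returned")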
 
      | silverlyra wrote:
      | Same here. Kubernetes pods running in EKS are
      | (intermittently) failing to get IAM credentials via the
      | ServiceAccount integration.
 
| alde wrote:
| S3 bucket creation is failing across all regions for us, so this
| isn't a us-east-1-only issue.
 
| kp195_ wrote:
| The AWS console seems kind of broken for us in us-west-1
| (Northern California), but it seems like the actual services are
| working
 
| jacobkg wrote:
| AWS Connect is down, so our customer support phone system is down
| with it
 
  | nostrebored wrote:
  | Highly recommend talking to your account team about regional
  | failover and DR for Amazon Connect! With enough feedback from
  | customers, stuff like this can get prioritized.
 
    | jacobkg wrote:
    | Thanks, will definitely do that!
 
| misthop wrote:
| Getting 500 errors across the board on systems trying to hit s3
 
  | abaldwin99 wrote:
  | Ditto
 
| booleanbetrayal wrote:
| We're seeing load balancer failures in us-east-1 AZs, so we are
| not exactly sure why this is being characterized as a console
| outage ...
 
| [deleted]
 
| whalesalad wrote:
| I can still hit EC2 boxes and networking is okay. DynamoDB is
| 100% down for the count, every request is an Internal Server
| Error.
 
  | l72 wrote:
  | We are also seeing lots of failures with DynamoDB across all
  | our services in us-east-1.
 
  | snewman wrote:
  | DynamoDB is fine for us. Not contradicting your experience,
  | just adding another data point. There is definitely something
  | hit-or-miss about this incident.
 
| jonnycomputer wrote:
| Ah, this might explain why my AWS requests were so slow, or
| timing out, this afternoon.
 
| sayed2020 wrote:
| Need google voice unlimited,
 
| artembugara wrote:
| I think there should be some third-party status checker alliance.
| 
| It's a joke. Each time AWS/Azure/GCP is down their status page
| says all is fine.
 
  | cphoover wrote:
  | Want to build a startup?
 
    | artembugara wrote:
    | already running one.
 
| _of wrote:
| ...and imdb
 
  | soco wrote:
  | And Netflix.
 
    | MR4D wrote:
    | And Venmo, and McDonald's, and....
    | 
    | This one is pretty epic (pun intended). Bad enough that Down
    | Detector [0] shows " _Reports indicate there may be a
    | widespread outage at Amazon Web Services, which may be
    | impacting your service._ " in a red alert bar at the top.
    | 
    | [0] - https://downdetector.com/
 
    | mrguyorama wrote:
    | It occurs to me that it's very nice that Netflix, Youtube,
    | and other streaming services tend to be on separate
    | infrastructure so they don't all go down at once
 
    | ec109685 wrote:
    | I'm surprised Netflix is down. They are multi-region:
    | https://netflixtechblog.com/active-active-for-multi-
    | regional...
 
| synergy20 wrote:
| No wonder I could not read books from Amazon all of a sudden. What
| about their cloud-based redundancy design?
 
  | doesnotexist wrote:
  | Can also confirm that the kindle app is failing for me and has
  | been for the past few hours.
 
  | tgtweak wrote:
  | The book preview webservice or actual ebooks (kindle, etc)?
 
    | doesnotexist wrote:
    | For me, it's been that I am unable to download books to the
    | kindle app on my computer
 
| hvgk wrote:
| Well I got to bugger off home early so good job Amazon.
| 
| Edit: to be clear this is because I'm utterly helplessly unable
| to do anything at the moment.
 
  | AnIdiotOnTheNet wrote:
  | Yep, that's a consideration of going with cloud tech: if
  | something goes wrong you're often powerless. At least with on-
  | prem you know who to wake up in the middle of the night and
  | you'll get straight-forward answers about what's going on.
 
    | hvgk wrote:
    | Depends which provider you host your crap with. I've had real
    | trouble trying to get a top-tier incident even acknowledged
    | by one of the pre-cloud providers.
    | 
    | To be fair, when it's AWS and something goes snap, it's not my
    | problem, which I'm happy about (until some wise ass at AWS
    | hires me) :)
 
      | AnIdiotOnTheNet wrote:
      | > Depends which provider you host your crap with.
      | 
      | That's what I'm saying: you host it yourself in facilities
      | owned by your company if you're not willing to have
      | everyone twiddle their thumbs during this sort of event.
      | Your DR environment can be co-located or hosted elsewhere.
 
| temuze wrote:
| Friends tell friends to pick us-east-2.
| 
| Virginia is for lovers, Ohio is for availability.
 
  | blahyawnblah wrote:
  | Lots of services are only in us-east-1. The SSO system isn't
  | working 100% right now, so that's where I assume it's hosted.
 
    | skwirl wrote:
    | Yeah, there are "global" services which are actually secretly
    | us-east-1 services as that is the region they use for
    | internal data storage and orchestration. I can't launch
    | instances with OpsWorks (not a very widely used service, I'd
    | imagine) even if those instances are in stacks outside of us-
    | east-1. I suspect Route53 and CloudFront will also have
    | issues.
 
    | johnsimer wrote:
    | Yeah, I can't log in with our external SAML SSO to our AWS
    | dashboard to manage our us-east-2 resources... because our auth
    | is apparently routed through us-east-1 STS.
 
    | jhugo wrote:
    | You can pick the region for SSO -- or even use multiple. Ours
    | is in ap-southeast-1 and working fine -- but then the console
    | that it signs us into is only partially working presumably
    | due to dependencies on us-east-1.
 
  | bithavoc wrote:
  | Sometimes you can't avoid us-east-1; an example is AWS ECR
  | Public. It's a shame. Meanwhile, DockerHub is up and running even
  | though it runs on EC2 itself.
 
  | vrocmod wrote:
  | No one was in the room where it happened
 
  | mountainofdeath wrote:
  | us-east-1 is a cursed region. It's old, full of one-off patches
  | to keep it working, and tends to be the first big region new
  | releases go to.
 
  | more_corn wrote:
  | This is funny, but true. I've been avoiding us-east-1 simply
  | because that's where everyone else is. Spot instances are also
  | less likely to be expensive in less utilized regions.
 
  | politician wrote:
  | Can I get that on a license plate?
 
  | kavok wrote:
  | Didn't us-east-2 have an issue last week?
 
  | stephenr wrote:
  | Friends tell Friends not to use Rube Goldberg machines as their
  | infrastructure layer.
 
  | johnl1479 wrote:
  | This is also a clever play on the Hawthorne Heights song.
 
  | PopeUrbanX wrote:
  | I wonder why AWS has Ohio and Virginia but no region in the
  | northeast where a significant plurality of customers using east
  | regions probably live.
 
  | api wrote:
  | I live in Ohio and can confirm. If the Earth were destroyed by
  | an asteroid Ohio would be left floating out there somehow
  | holding onto an atmosphere for about ten years.
 
  | tgtweak wrote:
  | If you're not multi-cloud in 2021 and are expecting 5-9's, I
  | feel bad for you.
 
    | post-it wrote:
    | I imagine there are very few businesses where the extra cost
    | of going multi-cloud is smaller than the cost of being down
    | during AWS outages.
 
      | gtirloni wrote:
      | Also, going multi-cloud will introduce more complexity, which
      | leads to more errors and more downtime. I'd rather sit this
      | outage out than deal with a daily risk of downtime because my
      | infrastructure is too smart for its own good.
 
        | shampster wrote:
        | Depends on the criticality of the service. I mean, you're
        | right about adding complexity. But sometimes you can just
        | take your really critical services and make sure they can
        | completely withstand any one cloud provider's outage.
 
    | unethical_ban wrote:
    | If you're not multi-region, I feel bad for you.
    | 
    | If your company is shoehorning you into using multiple clouds
    | and learning a dozen products, IAM and CICD dialects
    | simultaneously because "being cloud dependent is bad", I feel
    | bad for you.
    | 
    | Doing _one_ cloud correctly from a current DevSecOps
    | perspective is a multi-year ask. I estimate it takes about 25
    | people working full time on managing and securing
    | infrastructure per cloud, minimum. This does not include
    | certain matrixed people from legacy network /IAM teams. If
    | you have the people, go for it.
 
      | tgtweak wrote:
        | There are so many things that can go wrong with a single
        | provider, regardless of how many availability zones you are
        | leveraging, that you cannot depend on one cloud provider for
        | your uptime if you require that level of availability.
      | 
      | Example: Payment/Administrative issues, rogue employee with
      | access, deprecated service, inter-region routing issues,
      | root certificate compromises... the list goes on and it is
      | certainly not limited to single AZ.
      | 
        | A very good example is that regardless of which of the 85
        | AZs at AWS you are in, you are affected by this issue right
        | now.
      | 
        | Multi-cloud with the right tooling is trivial. Investing in
        | learning cloud-proprietary stacks is a waste of your
        | investment. You're a clown if you think 25 people internally
        | per cloud are required to "do it right".
 
        | unethical_ban wrote:
        | All cloud tech is proprietary.
        | 
        | There is no such thing as trivially setting up a secure,
        | fully automated cloud stack, much less anything like a
        | streamlined cloud agnostic toolset.
        | 
        | Deprecated services are not the discussion here. We're
        | talking tactical availability, not strategic tools etc.
        | 
        | Rogue employees with access? You mean at the cloud
        | provider or at your company? Still doesn't make sense.
        | Cloud IAM is very difficult in large organizations, and
        | each cloud does things differently.
        | 
        | I worked at fortune 100 finance on cloud security. Some
        | things were quite dysfunctional, but the struggles and
        | technical challenges are real and complex at a large
        | organization. Perhaps you're working on a 50 employee
        | greenfield startup. I'll hesitate to call you a clown as
        | you did me, because that would be rude and dismissive of
        | your experience (if any) in the field.
 
      | throwmefar32 wrote:
      | This.
 
      | ricardobayes wrote:
      | Someone start devops as a service please
 
    | ryuta wrote:
    | How do you become multi-cloud if your root domain is in
    | Route53? Have Backup domains on the client side?
 
      | tgtweak wrote:
      | DNS records should be synced to a secondary provider, and
      | that provider added to your domain's secondary/tertiary NS
      | entries.
      | 
      | Multi-provider DNS is a solved problem.
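      | 
      | The Route 53 side of that sync is straightforward to script
      | (a sketch, boto3 assumed; the zone ID is a placeholder and the
      | push to the secondary provider is entirely provider-specific):
      | 
      |     import boto3
      | 
      |     route53 = boto3.client("route53")
      |     paginator = route53.get_paginator("list_resource_record_sets")
      | 
      |     for page in paginator.paginate(HostedZoneId="ZEXAMPLE123"):
      |         for rr in page["ResourceRecordSets"]:
      |             print(rr["Name"], rr["Type"], rr.get("TTL"))
      |             # push_to_secondary(rr)  # provider-specific, omitted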
 
    | temuze wrote:
    | If you're having SLA problems I feel bad for you son
    | 
    | I got two 9 problems cuz of us-east-1
 
      | ShroudedNight wrote:
      | > ~~I got two 9 problems cuz of us-east-1~~
      | 
      | I left my two nines problems in us-east-1
 
  | sbr464 wrote:
  | Ohio's actual motto funnily kind of fits here:
  | With God, all things are possible
 
    | shepherdjerred wrote:
    | Does this imply Virginia is Godless?
 
      | sbr464 wrote:
      | Maybe virginia is for lovers, ohio is for God/gods?
 
      | tssva wrote:
      | Virginia's actual motto is "Sic semper tyrannis". What's
      | more tyrannical than an omnipotent being that will condemn
      | you to eternal torment if you don't worship them and follow
      | their laws.
 
        | sneak wrote:
        | I mean, most people are okay with dogs and Seattle.
 
        | [deleted]
 
        | mey wrote:
        | I think I should add state motto to my data center
        | consideration matrix.
 
        | bee_rider wrote:
        | Virginia and Massachusetts have surprisingly aggressive
        | mottoes (MA is: "By the sword we seek peace, but peace
        | only under liberty", which is really just a fancy way of
        | saying "don't make me stab a tyrant," if you think about
        | it). It probably makes sense, though, given that they
        | came up with them during the revolutionary war.
 
| anonu wrote:
| I think it's just the console - my EC2 instances in us-east-1 are
| still reachable.
 
  | jrs235 wrote:
  | I think it's affecting more than just AWS. Try searching
  | amazon.com. That's broken for me.
 
| dylan604 wrote:
| The fun thing about these types of outages is seeing all of the
| people that depend upon these services with no graceful fallback.
| My roomba app will not even launch because of the AWS outage. I
| understand that the app gets "updates" from the cloud. In this
| case "updates" is usually promotional crap, but whatevs. However,
| for this to prevent the app launching in a manner that I can
| control my local device is total BS. If you can't connect to the
| cloud, fail, move on and load the app so that local things are
| allowed to work.
| 
| I'm guessing other IoT things suffer from this same short-
| sightedness as well.
 
  | codegeek wrote:
  | "If you can't connect to the cloud, fail, move on and load the
  | app so that local things are allowed to work."
  | 
  | Building fallbacks requires work. How much extra effort and
  | overhead is needed to build something like this? Sometimes the
  | cost vs. benefit analysis says that it is OK not to do it. If AWS
  | has an outage like this once a year, maybe we can deal with it
  | (unless you are working on mission-critical apps).
 
    | dylan604 wrote:
    | Yes, it is a lot of work to test whether the response code is
    | OK or not, or whether a timeout limit has been reached. So much
    | so that I pretty much wrote the test in the first sentence.
    | Phew. 10x coder right here!
 
  | lordnacho wrote:
  | If you did that some clever person would set up their PiHole so
  | that their device just always worked, and then you couldn't
  | send them ads and surveil them. They'd tell their friends and
  | then everyone would just use their local devices locally.
  | Totally irresponsible what you're suggesting.
 
    | apexalpha wrote:
    | A little off-topic, but there are people working on it:
    | https://valetudo.cloud/
    | 
    | It's a little harder than blocking the DNS unfortunately. But
    | nonetheless it always brings a smile to my face to see that
    | there's a FOSS frontier for everything.
 
    | beamatronic wrote:
    | An even more clever person would package up this box, and
    | sell it, along with a companion subscription service, to help
    | busy folks like myself.
 
      | dylan604 wrote:
      | But this new little box would then be required to connect
      | to the home server to receive updates. Guess what? No
      | updates, no worky!! It's a vicious circle!!! Outages all
      | the way down
 
    | arksingrad wrote:
    | this is why everyone runs piholes and no one sees ads on the
    | internet anymore, which killed the internet ad industry
 
      | dylan604 wrote:
      | Dear person from the future, can you give me a hint on who
      | wins the upcoming sporting events? I'm asking for a friend
      | of course
 
        | 0des wrote:
        | Also, what's the verdict on John Titor?
 
  | sneak wrote:
  | Now think of how many assets of various governments' militaries
  | are discreetly employed as normal operational staff by FAAMG in
  | the USA and have access to _cause_ such events from scratch. I
  | would imagine that the US IC (CIA /NSA) already does some free
  | consulting for these giant companies to this end, because they
  | are invested in that Not Being Possible (indeed, it's their
  | job).
  | 
  | There is a societal resilience benefit to not having
  | unnecessary cloud dependencies beyond the privacy stuff. It
  | makes your society and economy more robust if you can continue
  | in the face of remote failures/errors.
  | 
  | It is December 7th, after all.
 
    | dylan604 wrote:
    | > I would imagine that the US IC (CIA/NSA) already does some
    | free consulting for these giant companies to this end,
    | 
    | Haha, it would be funny if the IC reaches out to BigTech when
    | failures occur to let them know they need not be worried
    | about data losses. They can just borrow a copy of the data IC
    | is siphoning off them. /s?
 
  | taf2 wrote:
  | I wouldn't jump to say it's short-sightedness (it is shitty),
  | but it could be a matter of being pragmatic... It's easier to
  | maintain the code if it is loaded at run time (think thin
  | client browser style). This way your IoT device can load the
  | latest code and even settings from the cloud... (advantage
  | when the cloud is available)... I think of this less as short-
  | sightedness and more as a reasonable trade-off (with shitty
  | side effects)
 
    | epistasis wrote:
    | I don't think that's ever a reasonable tradeoff! Network
    | access goes down all the time, and should be a fundamental
    | assumption of any software.
    | 
    | Maybe I'm too old, but I can't imagine a seasoned dev, much
    | less a tech lead, omitting planning for that failure mode
 
    | outime wrote:
    | Then you could just keep a local copy available as a fallback
    | in case the latest code cannot be fetched. Not doing the bare
    | minimum and screwing the end user isn't acceptable IMHO. But
    | I also understand that'd take some engineering hours and yield
    | virtually no benefit as these outages are rare (not sure how
    | reliable Roomba is in general, on the other hand), so here
    | we are.
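    | 
    | A rough sketch of that bare minimum (the URL and cache path are
    | made up for illustration): fetch the latest config, and if that
    | fails, fall back to the last copy cached on disk.
    | 
    |     import json
    |     import urllib.request
    |     import urllib.error
    | 
    |     # Hypothetical endpoint and cache location.
    |     CONFIG_URL = "https://example.com/device-config.json"
    |     CACHE_PATH = "/var/lib/device/config-cache.json"
    | 
    |     def load_config():
    |         # Prefer fresh config; never fail hard if the cloud
    |         # is unreachable.
    |         try:
    |             with urllib.request.urlopen(CONFIG_URL,
    |                                         timeout=3) as resp:
    |                 data = resp.read()
    |             with open(CACHE_PATH, "wb") as f:
    |                 f.write(data)      # refresh the local cache
    |             return json.loads(data)
    |         except (urllib.error.URLError, OSError):
    |             try:
    |                 with open(CACHE_PATH, "rb") as f:
    |                     return json.loads(f.read())  # stale but usable
    |             except (OSError, ValueError):
    |                 return {}          # built-in defaults, last resort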
 
  | s_dev wrote:
  | >The fun thing about these types of outages are seeing all of
  | the people that depend upon these services with no graceful
  | fallback.
  | 
  | What's a graceful fallback? Switching to another hosting service
  | when AWS goes down? Wouldn't that present another set of
  | complications for a very small edge case at huge cost?
 
    | rehevkor5 wrote:
    | Usually this refers to falling back to a different region in
    | AWS. It's typical for systems to be deployed in multiple
    | regions due to latency concerns, but it's also important for
    | resiliency. What you call "a very small edge case" is
    | occurring as we speak, and if you're vulnerable to it you
    | could be losing millions of dollars.
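    | 
    | At the client level that can be as blunt as retrying in a second
    | region when the first errors out. A very simplified sketch with
    | boto3, assuming a hypothetical "orders" DynamoDB table that is
    | already replicated to both regions (e.g. a global table); real
    | failover usually happens at the DNS/routing layer and involves
    | much more than this:
    | 
    |     import boto3
    |     from botocore.exceptions import BotoCoreError, ClientError
    | 
    |     REGIONS = ["us-east-1", "us-west-2"]
    | 
    |     def put_order(order_id):
    |         last_error = None
    |         for region in REGIONS:
    |             client = boto3.client("dynamodb", region_name=region)
    |             try:
    |                 client.put_item(
    |                     TableName="orders",
    |                     Item={"order_id": {"S": order_id}},
    |                 )
    |                 return region    # report where the write landed
    |             except (BotoCoreError, ClientError) as exc:
    |                 last_error = exc     # try the next region
    |         raise RuntimeError("all regions failed") from last_error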
 
      | stevehawk wrote:
      | probably not possible for a lot more services than you'd
      | think because AWS Cognito has no decent failover method
 
      | simmanian wrote:
      | AWS itself has a huge single point of failure on us-east-1
      | region. Usually, if us-east-1 goes down, others soon
      | follow. At that point, it doesn't matter how many regions
      | you're deploying to.
 
        | scoopertrooper wrote:
        | My workloads on Sydney and London are unaffected. I can't
        | speak for anywhere else.
 
        | lowbloodsugar wrote:
        | "Usually"? When has that ever happened?
 
        | dylan604 wrote:
        | https://awsmaniac.com/aws-outages/
 
    | winrid wrote:
    | In this case, just connect over LAN.
 
      | politician wrote:
      | Or BlueTooth.
 
      | s_dev wrote:
      | Right -- I think I've misread OP: graceful fallback as in,
      | e.g., working offline.
      | 
      | Rather than implementing a dynamically switching backup in
      | the event of AWS going down, which is not trivial.
 
    | [deleted]
 
    | itisit wrote:
    | > Wouldn't that present another set of complications for a
    | very small edge case at huge cost?
    | 
    | One has to crunch the numbers. What does a service outage
    | cost your business every minute/hour/day/etc in terms of lost
    | revenue, reputational damage, violated SLAs, and other
    | factors? For some enterprises, it's well worth the added
    | expense and trouble of having multi-site active-active setups
    | that span clouds and on-prem.
 
  | thayne wrote:
  | there's a reason it is called the _Internet_ of things, and not
  | the  "local network of things". Even if the latter is probably
  | what most customers would prefer.
 
    | 9wzYQbTYsAIc wrote:
    | There's also no reason for an internet connected app to crash
    | on load when there is no access to the internet services.
 
      | mro_name wrote:
      | indeed.
      | 
      | A constitutional property of a network is its volatility.
      | Nodes may fail. Edges may. You may not. Or you may. But
      | then you're delivering no reliability but crap. Nice
      | sunshine crap, maybe.
 
| birdyrooster wrote:
| Gives me a flashback to the December 24th, 2012 outage. I guess
| not much changes in 9 years' time.
 
| stephenr wrote:
| So, we're getting failures (for customers) trying to use amazon
| pay from our site. AFAIK there is no "status page" for Amazon
| Pay, but the _rest_ of Amazon's services seem to be a giant Rube
| Goldberg machine so it's hard to imagine this isn't too.
 
  | itisit wrote:
  | http://status.mws.amazon.com/
 
    | stephenr wrote:
    | Thanks.. seems to be about as accurate as the regular AWS
    | status board is..
 
      | itisit wrote:
      | They _just_ added a banner. My guess is they don't know
      | enough yet to update the respective service statuses.
 
        | stephenr wrote:
        | I have basically zero faith in Amazon at this point.
        | 
        | We first noticed failures because a tester happened to be
        | testing in an env that uses the Amazon Pay sandbox.
        | 
        | I checked the prod site, and it wouldn't even ask me to
        | login.
        | 
        | When I tried to login to SellerCentral to file a ticket -
        | it told me my password (from a pw manager) was wrong.
        | When I tried to reset, the OTP was ridiculously slow.
        | Clicking "resend OTP" gives a "the OTP is incorrect"
        | error message. When I finally got an OTP and put it in,
        | the resulting page was a generic Amazon "404 page not
        | found".
        | 
        | A while later, my original SellerCentral password, still
        | un-changed because I never got another OTP to reset it,
        | worked.
        | 
        | What the fuck kind of failure mode is that "services are
        | down, so password must be wrong".
 
        | itisit wrote:
        | Sorry to hear. If multi-cloud is the answer, I wouldn't
        | be surprised to see folks go back to owning and operating
        | their own gear.
 
| Tea418 wrote:
| If you have trouble logging in to AWS Console, you can use a
| regional console endpoint such as https://eu-
| central-1.console.aws.amazon.com/
 
  | jedilance wrote:
  | I also see same error here: Internal Error, Please try again
  | later.
 
  | throwanem wrote:
  | Didn't work for me just now. Same error as the regular
  | endpoint.
 
| judge2020 wrote:
| Got alerted to 503 errors for SES, so it's not just the
| management console.
 
| lowwave wrote:
| This is exactly the kind of over-centralisation issue I was
| talking about. I'm one of the first developers who used AWS EC2,
| sure, back when scaling was hard for small dev shops. In this day
| and age anyone who is technically inclined can figure out the new
| technologies. Why even use AWS? Get something like Hetzner or
| Linode, please!
 
  | binaryblitz wrote:
  | How are those better than using AWS/Azure/GCP/etc? I'd say the
  | correct way to handle situations like this is to have things in
  | multiple regions, and potentially multiple clouds if possible.
  | Obviously, things like databases would be harder to keep in
  | sync on multi cloud, but not impossible.
 
  | pojzon wrote:
  | Some manager: But it does not web scale!
 
    | dr-detroit wrote:
    | the average HN poster: I run things on a box under my desk
    | and you cannot teach me why that's bad!!!
 
| saisundar wrote:
| Yikes, Ring, the security system, is also down.
| 
| Wonder if crime rates might eventually spike up if AWS goes down,
| in a utopian world where Amazon gets everyone to use Ring.
 
  | kingcharles wrote:
  | I'm now imagining a team of criminals sitting around in face
  | masks and hoodies refreshing AWS status page all day...
  | 
  | "AWS is down! Christmas came early boys! Roll out..."
 
| john37386 wrote:
| imdb seems down too and returning 503. Is it related? Here is the
| output. Kind of funny.
| 
| D'oh!
| 
| Error 503
| 
| We're sorry, something went wrong.
| 
| Please try again...wait...wait...yep, try reload/refresh now.
| 
| But if you are seeing this again, please report it here.
| 
| Please explain which page you were at and where on it that you
| clicked
| 
| Thank you!
 
  | takeda wrote:
  | IMDB belongs to Amazon, so likely on AWS too.
  | 
  | This also confirms it: https://downdetector.com/status/imdb/
 
    | binaryblitz wrote:
    | TIL Amazon owns IMDB
 
      | takeda wrote:
      | Yeah, I was also surprised when I learned this. Another
      | surprising thing is that they own it since 1998.
 
| whoisjuan wrote:
| us-east-1 is so unreliable that it probably should be nuked. It's
| the region with the worst track record. I guess it doesn't help
| that it is one of the oldest.
 
  | zrail wrote:
  | US-EAST-1 is the oldest, largest, and most heterogeneous
  | region. There are many data centers and availability zones
  | within the region. I believe it's where AWS rolls out changes
  | first, but I'm not confident on that.
 
| duckworth wrote:
| After over 45 minutes https://status.aws.amazon.com/ now shows
| "AWS Management Console - Increased Error Rates"
| 
| I guess 100% is technically an increase.
 
  | Slartie wrote:
  | "Fixed a bug that could cause [adverse behavior affecting 100%
  | of the user base] for some users"
 
    | sophacles wrote:
    | "some" as in "not all". I'm sure there are some tiny tiny
    | sites that were unaffected because no one went to them during
    | the outage.
 
      | Slartie wrote:
      | "some" technically includes "all", doesn't it? It excludes
      | "none", I suppose, but why should it exclude "all" (except
      | if "all" equals "none")?
 
  | brasetvik wrote:
  | I can't remember seeing problems be more strongly worded than
  | "Increased Error Rates" or "high error rates with S3 in us-
  | east-1" during the infamous S3 outage of 2017 - and that was
  | after they struggled to even update their own status page
  | because of S3 being down. :)
 
    | schleck8 wrote:
    | During the Facebook outage FB wrote something along the lines
    | of "We noticed that some users are experiencing issues with
    | our apps" even though nothing worked anymore
 
| boldman wrote:
| I think now is a good time to reiterate the danger of companies
| just throwing all of their operational resilience and
| sustainability over the wall and trusting someone else with their
| entire existence. It's wild to me that so many high performing
| businesses simply don't have a plan for when the cloud goes down.
| Some of my contacts are telling me that these outages have teams
| of thousands of people completely prevented from working, and
| tens of millions of dollars of profit are simply vanishing since
| the start of the outage this morning. And now institutions like
| government and banks are throwing their entire capability into
| the cloud
| with no recourse or recovery plan. It seems bad now but I wonder
| how much worse it might be when no one actually has access to
| money because all financial traffic is going through AWS and it
| goes down.
| 
| We are incredibly blind to trust just 3 cloud providers with
| the operational success of basically everything we do.
| 
| Why hasn't the industry come up with an alternative?
 
  | xwdv wrote:
  | There is an alternative: A _true_ network cloud. This is what
  | Cloudflare will eventually become.
 
  | jacobsenscott wrote:
  | We have or had alternatives - rackspace, linode, digital ocean,
  | in the past there were many others, self hosting is still an
  | option. But the big three just do it better. The alternatives
  | are doomed to fail. If you use anything other than the big
  | three you risk not just more outages, but your whole provider
  | going out of business overnight.
  | 
  | If the companies at the scale you are talking about do not have
  | multi-region and multi service (aws to azure for example)
  | failover that's their fault, and nobody else's.
 
  | [deleted]
 
  | p1necone wrote:
  | Do you think they'd manage their own infra better? Are you
  | suggesting they pay for a fully redundant second implementation
  | on another provider? How much extra cost would that be vs
  | eating an outage very infrequently?
 
  | jessebarton wrote:
  | In my opinion there is a lack of talent in these industries for
  | building out their own resilient systems. IT people and
  | engineers get lazy.
 
    | grumple wrote:
    | We're too busy in endless sprints to focus on things outside
    | of our core business that don't make salespeople and
    | executives excited.
 
    | BarryMilo wrote:
    | No lazier than anyone else, there's just not enough of us, in
    | general and per company.
 
    | rodgerd wrote:
    | > IT people and engineers get lazy.
    | 
    | Companies do not change their whole strategy from a capex-
    | driven traditional self-hosting environment to opex-driven
    | cloud hosting because their IT people are lazy; it is
    | typically an exec-level decision.
 
  | tuldia wrote:
  | > Why hasn't the industry come up with an alternative?
  | 
  | We used to have that, some companies still have the capability
  | and know-how to build and run infrastructure that is reliable,
  | distributed across many hosting providers before "cloud" became
  | the "norm", but it goes along with "use or lose it".
 
  | adflux wrote:
  | Because your own datacenters can't go down?
 
  | uranium wrote:
  | Because the expected value of using AWS is greater than the
  | expected value of self-hosting. It's not that nobody's ever
  | heard of running on their own metal. Look back at what everyone
  | did before AWS, and how fast they ran screaming away from it as
  | soon as they could. Once you didn't have to do that any more,
  | it's just so much better that the rare outages are worth it for
  | the vast majority of startups.
  | 
  | Medical devices, banks, the military, etc. should generally run
  | on their own hardware. The next photo-sharing app? It's just
  | not worth it until they hit tremendous scale.
 
    | chucknthem wrote:
    | Agree with your first point.
    | 
    | On the second though, at some point infrastructure like AWS is
    | going to be more reliable than what many banks, medical device
    | operators, etc. can provide themselves. Asking them to stay on
    | their own hardware is asking for those industries to remain
    | slow, bespoke and expensive.
 
      | uranium wrote:
      | Agreed, and it'll be a gradual switch rather than a single
      | point, smeared across industries. Likely some operations
      | won't ever go over, but it'll be a while before we know.
 
      | hn_throwaway_99 wrote:
      | Hard agree with the second paragraph.
      | 
      | It is _incredibly_ difficult for non-tech companies to hire
      | quality software and infrastructure engineers - they
      | usually pay less and the problems aren't as interesting.
 
  | seeEllArr wrote:
  | They have, it's called hosting on-premises, and it's even less
  | reliable than cloud providers.
 
  | smugglerFlynn wrote:
  | Many of those businesses wouldn't have existed in the first
  | place without simplicity offered by cloud.
 
  | stfp wrote:
  | > tens of millions of dollars of profit are simply vanishing
  | 
  | vanishing or delayed six hours? I mean
 
    | naikrovek wrote:
    | money people think of money very weirdly. when they predict
    | they will get more than they actually get, they call it
    | "loss" for some reason, and when they predict they will get
    | less than they actually get, it's called ... well I don't
    | know what that's called but everyone gets bonuses.
 
    | lp0_on_fire wrote:
    | 6 hours of downtime often means 6 hours of paying employees
    | to stand around which adds up rather quickly.
 
  | nprz wrote:
  | So you're saying companies should start moving their
  | infrastructure to the blockchain?
 
    | tornato7 wrote:
    | Ethereum has gone 5 years without a single minute of
    | downtime, so if it's extreme reliability you're going for I
    | don't think it can be beaten.
 
  | rbetts wrote:
  | We're too busy generating our own electricity and
  | designing our own CPUs.
 
  | commandlinefan wrote:
  | Well, if you're web-based, there's never really been any better
  | alternative. Even before "the cloud", you had to be hosted in a
  | datacenter somewhere if you wanted enough bandwidth to service
  | customers, as well as have somebody who would make sure the
  | power stayed on 24/7. The difference now is that there used to
  | be thousands of ISP's so one outage wouldn't get as much news
  | coverage, but it would also probably last longer because you
  | wouldn't have a team of people who know what to look for like
  | Amazon (probably?) does.
 
  | olingern wrote:
  | People are so quick to forget how things were before behemoths
  | like AWS, Google Cloud, and Azure. Not all things are free and
  | the outage the internet is experiencing is the risk users
  | signed up for.
  | 
  | If you would like to go back to the days of managing your own
  | machines, be my guest. Remember those machines also live
  | somewhere and were/are subject to the same BGP and routing
  | issues we've seen over the past couple of years.
  | 
  | Personally, I'll deal with outages a few times a year for the
  | peace of mind that there's a group of really talented people
  | looking into it for me.
 
  | bowmessage wrote:
  | Because the majority of consumers don't know better / don't
  | care and still buy products from companies with no backup plan.
  | Because, really, how can any of us know better until we're
  | burned many times over?
 
  | 300bps wrote:
  | This appears to be a single region outage - us-east-1. AWS
  | supports as much redundancy as you want. You can be redundant
  | between multiple Availability Zones in a single Region or you
  | can be redundant among 1, 2 or even 25 regions throughout the
  | world.
  | 
  | Multiple-region redundancy costs more both in initial
  | planning/setup as well as monthly fees so a lot of AWS
  | customers choose to just not do it.
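  | 
  | The multi-AZ version is the cheap end of that spectrum. A minimal
  | sketch with boto3, assuming an existing launch template and VPC
  | (all identifiers below are placeholders): an Auto Scaling group
  | spread across subnets in three Availability Zones, so losing one
  | AZ doesn't take the service down.
  | 
  |     import boto3
  | 
  |     autoscaling = boto3.client("autoscaling",
  |                                region_name="us-east-1")
  | 
  |     # Placeholder launch template and subnet IDs.
  |     autoscaling.create_auto_scaling_group(
  |         AutoScalingGroupName="web-tier",
  |         MinSize=2,
  |         MaxSize=6,
  |         LaunchTemplate={
  |             "LaunchTemplateId": "lt-0123456789abcdef0",
  |             "Version": "$Latest",
  |         },
  |         # One subnet per AZ; the group spreads instances across them.
  |         VPCZoneIdentifier="subnet-a1,subnet-b2,subnet-c3",
  |     )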
 
  | george3d6 wrote:
  | This seems like an insane stance to have, it's like saying
  | businesses should ship their own stock, using their own
  | drivers, and their in-house made cars and planes and in-house
  | trained pilots.
  | 
  | Heck, why stop at having servers on-site? Cast your own silicon
  | wafers, after all you don't want Spectre exploits.
  | 
  | Because you are worse at it. If a specialist is this bad, and
  | the market is fully open, then it's because the problem is
  | hard.
  | 
  | AWS has fewer outages in one zone alone than the best self-
  | hosted institutions, your facebooks and pentagons. In-house
  | servers would lead to an insane amount of outages.
  | 
  | And guess what? AWS (and all other IaaS providers) will beg you
  | to use multiple region because of this. The team/person that
  | has millions of dollars a day staked on a single AWS region is
  | an idiot and could not be entrusted to order a gaming PC from
  | newegg, let alone run an in-house datacenter.
  | 
  | edit: I will add that AWS specifically is meh and I wouldn't
  | use it myself, there's better IaaS. But it's insanity to even
  | imagine self-hosted is more reliable than using even the
  | shittiest of IaaS providers.
 
    | jgwil2 wrote:
    | > In-house servers would lead to an insane amount of outage.
    | 
    | That might be true, but the effects of any given outage would
    | be felt much less widely. If Disney has an outage, I can just
    | find a movie on Netflix to watch instead. But now if one
    | provider goes down, it can take down everything. To me, the
    | problem isn't the cloud per se, it's one player's dominance
    | in the space. We've taken the inherently distributed
    | structure of the internet and re-centralized it, losing some
    | robustness along the way.
 
      | dragonwriter wrote:
      | > That might be true, but the effects of any given outage
      | would be felt much less widely.
      | 
      | If my system has an hour of downtime every year and the
      | dozen other systems it interacts with and depends on each
      | have an hour of downtime every year, it can be _better_
      | that those tend to be correlated rather than independent.
 
    | bb88 wrote:
    | Apple created their own silicon. Fedex uses its own pilots.
    | The USPS uses its own cars.
    | 
    | If you're a company relying upon AWS for your business, is it
    | okay if you're down for a day or two while you wait for AWS
    | to resolve its issue?
 
      | jasode wrote:
      | > _Apple created their own silicon._
      | 
      | Apple _designs_ the M1. But TSMC (and possibly Samsung)
      | actually manufacture the chips.
 
      | jcranberry wrote:
      | Most companies using AWS are tiny compared to the companies
      | you mentioned.
 
      | lostlogin wrote:
      | It's bloody annoying when all I want to do is vacuum the
      | floor and Roomba says nope, "active AWS incident".
 
        | Grazester wrote:
        | If all you wanted to do was vacuum the floor you would
        | not have gotten that particular vacuum cleaner. Clearly
        | you wanted to do more than just vacuum the floor and
        | something like this happening should be weighed with the
        | purchase of the vacuum.
 
        | lostlogin wrote:
        | I'll rephrase. I wanted the floor vacuumed and I didn't
        | want to do it.
 
      | teh_klev wrote:
      | > Apple created their own silicon
      | 
      | Apple _designed_ their own silicon, a third party
      | manufactures and packages it for them.
 
    | roody15 wrote:
    | Quick follow up. I once used an IaaS provider (hyperstreet)
    | that was terrible. Long story short, the provider ended up
    | closing shop and the owner of the company now sells real
    | estate in California.
    | 
    | It was a nightmare recovering data. Even when the service was
    | operational it was subpar.
    | 
    | Just saying perhaps the "shittiest" providers may not be more
    | reliable.
 
    | kayson wrote:
    | I think you're missing the point of the comment. It's not
    | "don't use cloud". It's "be prepared for when cloud goes
    | down". Because it will, despite many companies either
    | thinking it won't, or not planning for it.
 
    | midasuni wrote:
    | > AWS has fewer outages in one zone alone than the best self-
    | hosted institutions, your facebooks and petagons. In-house
    | servers would lead to an insane amount of outage.
    | 
    | It's had two in 13 months
 
    | imiric wrote:
    | > This seems like an insane stance to have, it's like saying
    | businesses should ship their own stock, using their own
    | drivers, and their in-house made cars and planes and in-house
    | trained pilots.
    | 
    | > Heck, why stop at having servers on-site? Cast your own
    | silicon waffers, after all you don't want spectrum exploits.
    | 
    | That's an overblown argument. Nobody is saying that, but it's
    | clear that businesses that maintain their own infrastructure
    | would've avoided today's AWS outage. So just avoiding a
    | single level of abstraction would've kept your company
    | running today.
    | 
    | > Because you are worst at it. If a specialist is this bad,
    | and the market is fully open, then it's because the problem
    | is hard.
    | 
    | The problem is hard mostly because of scale. If you're a
    | small business running a few websites with a few million hits
    | per month, it might be cheaper and easier to colocate a few
    | servers and hire a few DevOps or old-school sysadmins to
    | administer the infrastructure. The tooling is there, and is
    | not much more difficult to manage than a hundred different
    | AWS products. I'm actually more worried about the DevOps
    | trend where engineers are trained purely on cloud
    | infrastructure and don't understand low-level tooling these
    | systems are built on.
    | 
    | > AWS has fewer outages in one zone alone than the best self-
    | hosted institutions, your facebooks and petagons. In-house
    | servers would lead to an insane amount of outage.
    | 
    | That's anecdotal and would depend on the capability of your
    | DevOps team and your in-house / colocation facility.
    | 
    | > And guess what? AWS (and all other IAAS providers) will beg
    | you to use multiple region because of this. The team/person
    | that has millions of dollars a day staked on a single AWS
    | region is an idiot and could not be entrusted to order a
    | gaming PC from newegg, let alone run an in-house datacenter.
    | 
    | Oh great, so the solution is to put even more of our eggs in
    | a single provider's basket? The real solution would be having
    | failover to a different cloud provider, and the
    | infrastructure changes needed for that are _far_ from
    | trivial. Even with that, there's only 3 major cloud providers
    | you can pick from. Again, colocation in a trusted datacenter
    | would've avoided all of this.
 
      | p1necone wrote:
      | > it's clear that businesses that maintain their own
      | infrastructure would've avoided today's AWS' outage.
      | 
      | Sure, that's trivially obvious. But how many other outages
      | would they have had instead because they aren't as
      | experienced at running this sort of infrastructure as AWS
      | is?
      | 
      | You seem to be arguing from the a priori assumption that
      | rolling your own is inherently more stable than renting
      | infra from AWS, without actually providing any
      | justification for that assumption.
      | 
      | You also seem to be under the assumption that any amount of
      | downtime is _always_ unacceptable, and worth spending
      | large amounts of time and effort to avoid. For a _lot_ of
      | businesses, systems going down for a few hours every once
      | in a while just isn't a big deal, and is much preferable to
      | spending thousands more on cloud bills, or hiring more
      | full time staff to ensure X 9s of uptime.
 
        | imiric wrote:
        | You and GP are making the same assumption that my DevOps
        | engineers _aren't_ as experienced as AWS' are. There are
        | plenty of engineers capable of maintaining an in-house
        | infrastructure running X 9s because, again, the
        | complexity comes from the scale AWS operates at. So we're
        | both arguing with an a priori assumption that the grass
        | is greener on our side.
        | 
        | To be fair, I'm not saying never use cloud providers. If
        | your systems require the complexity cloud providers
        | simplify, and you operate at a scale where it would be
        | prohibitively expensive to maintain yourself, by all
        | means go with a cloud provider. But it's clear that not
        | many companies are prepared for this type of failure, and
        | protecting against it is not trivial to accomplish. Not
        | to mention the conceptual overhead and knowledge required
        | with dealing with the provider's specific products, APIs,
        | etc. Whereas maintaining these systems yourself is
        | transferable across any datacenter.
 
      | solveit wrote:
      | This feels like a discussion that could sorely use some
      | numbers.
      | 
      | What are good examples of
      | 
      | >a small business running a few websites with a few million
      | hits per month, it might be cheaper and easier to colocate
      | a few servers and hire a few DevOps or old-school sysadmins
      | to administer the infrastructure.
      | 
      | and how often do they go down?
 
        | i_like_waiting wrote:
        | Depends I guess. I am running an on-prem workstation for
        | our DWH. So far in 2 years it went down for minutes at a
        | time, when I decided to take it down for hardware updates.
        | I have no idea where this narrative came from, but usually
        | hardware you have is very reliable and doesn't turn off
        | every 15 minutes.
        | 
        | Heck, I use an old T430 for my home server and it still
        | doesn't go down at completely random times (but that's a
        | very simplified example, I know)
 
      | jasode wrote:
      | _> , but it's clear that businesses that maintain their own
      | infrastructure would've avoided today's AWS' outage._
      | 
      | When Netflix was running its own datacenters in 2008, they
      | had a _3 day outage_ from a database corruption and couldn't
      | ship DVDs to customers. That was the disaster that
      | pushed CEO Reed Hastings to get out of managing his own
      | datacenters and migrate to AWS.
      | 
      | The flaw in the reasoning that running your own hardware
      | would _avoid today's outage_ is that it doesn't also
      | consider the _extra unplanned outages on other days_
      | because your homegrown IT team (especially at non-tech
      | companies) isn't as skilled as the engineers working at
      | AWS/GCP/Azure.
 
        | qaq wrote:
        | The flaw in your reasoning is that the complexity of the
        | problem is even remotely the same. Most AWS outages are
        | control plane related.
 
    | naikrovek wrote:
    | > AWS (and all other IAAS providers) will beg you to use
    | multiple region
    | 
    | will they? because AWS still puts new stuff in us-east-1
    | before anywhere else, and there is often a LONG delay before
    | those things go to other regions. there are many other
    | examples of why people use us-east-1 so often, but it all
    | boils down to this: AWS encourage everyone to use us-east-1
    | and discourage the use of other regions for the same reasons.
    | 
    | if they want to change how and where people deploy, they
    | should change how they encourage their customers to deploy.
    | 
    | my employer uses multi-region deployments where possible, and
    | we can't do that anywhere nearly as much as we'd like because
    | of limitations that AWS has chosen to have.
    | 
    | so if cloud providers want to encourage multi-region
    | adoption, they need to stop discouraging and outright
    | preventing it, first.
 
      | danielheath wrote:
      | It works really well imo. All the people who want to use
      | new stuff at the expense of stability choose us-east-1;
      | those who want stability at the expense of new stuff run
      | multi-region (usually not in us-east-1 )
 
      | mypalmike wrote:
      | This argument seems rather contrived. Which feature
      | available in only one region for a very long time has
      | specifically impacted you? And what was the solution?
 
      | WaxProlix wrote:
      | Most features roll out to IAD second, third, or fourth. PDX
      | and CMH are good candidates for earlier feature rollout,
      | and usually it's tested in a small region first. I use PDX
      | (us-west-2) for almost everything these days.
      | 
      | I also think that they've been making a lot of the default
      | region dropdowns and such point to CMH (us-east-2) to get
      | folks to migrate away from IAD. Your contention that
      | they're encouraging people to use that region just doesn't
      | ring true to me.
 
    | savant_penguin wrote:
    | they usually beg you to use multiple availability zones
    | though
    | 
    | I'm not sure how many aws services are easy to spawn at
    | multiple regions
 
      | mypalmike wrote:
      | Which ones are difficult to deploy in multiple regions?
 
      | dragonwriter wrote:
      | > they usually beg you to use multiple availability zones
      | though
      | 
      | Doesn't help you if what goes down is AWS global
      | services on which you directly, or other AWS services,
      | depend (which tend to be tied to US-east-1).
 
    | optiomal_isgood wrote:
    | This is the right answer, I recall studying for the solutions
    | architect professional certification and reading this
    | countless times: outages will happen and you should plan for
    | them by using multi-region if you care about downtime.
    | 
    | It's not AWS's fault here, it's the companies', which assume
    | that it will never be down. In-house servers also have
    | outages, it's a very naive assumption to think that it'd be
    | all better if all of those services were using their own
    | servers.
    | 
    | Facebook doesn't use AWS and they were down for several hours
    | a couple weeks ago, and that's because they have way better
    | engineers than the average company, working on their
    | infrastructure, exclusively.
 
    | qaq wrote:
    | "AWS has fewer outages in one zone alone than the best self-
    | hosted institutions" sure you just call an outage "increased
    | error rate"
 
    | SkyPuncher wrote:
    | In addition to fewer outages, _many_ products get a free pass
    | on incidents because basically everyone is being impacted by
    | outage.
 
    | johannes1234321 wrote:
    | The benefit of self hosting is that you are up while your
    | competitors are down.
    | 
    | However if you are on AWS, many of your competitors are down
    | while you are down, so they can't take over your business.
 
    | [deleted]
 
  | tw04 wrote:
  | >It seems bad now but I wonder how much worse it might be when
  | no one actually has access to money because all financial
  | traffic is going through AWS and it goes down.
  | 
  | Most financial institutions are implementing their own clouds,
  | I can't think of any major one that is reliant on public cloud
  | to the extent transactions would stop.
  | 
  | >Why hasn't the industry come up with an alternative?
  | 
  | You mean like building datacenters and hosting your own gear?
 
    | baoyu wrote:
    | > Most financial institutions are implementing their own
    | clouds
    | 
    | https://www.nasdaq.com/Nasdaq-AWS-cloud-announcement
 
      | filmgirlcw wrote:
      | That doesn't mean what you think it means.
      | 
      | The agreement is more of a hybrid cloud arrangement with
      | AWS Outposts.
      | 
      | FTA:
      | 
      | >Core to Nasdaq's move to AWS will be AWS Outposts, which
      | extend AWS infrastructure, services, APIs, and tools to
      | virtually any datacenter, co-location space, or on-premises
      | facility. Nasdaq plans to incorporate AWS Outposts directly
      | into its core network to deliver ultra-low-latency edge
      | compute capabilities from its primary data center in
      | Carteret, NJ.
      | 
      | They are also starting small, with Nasdaq MRX
      | 
      | This is much less about moving NASDAQ (or other exchanges)
      | to be fully owned/maintained by Amazon, and more about
      | wanting to take advantage of development tooling and
      | resources and services AWS provides, but within the
      | confines of an owned/maintained data center. I'm sure as
      | this partnership grows, racks and racks will be in Amazon's
      | data centers too, but this is a hybrid approach.
      | 
      | I would also bet a significant amount of money that when
      | NASDAQ does go full "cloud" (or hybrid, as it were), it
      | won't be in the same US-east region co-mingling with the
      | rest of the consumer web, but with its own redundant
      | services and connections and networking stack.
      | 
      | NASDAQ wants to modernize its infrastructure but it
      | absolutely doesn't want to offload it to a cloud provider.
      | That's why it's a hybrid partnership.
 
    | jen20 wrote:
    | Indeed I can think of several outages in the past decade in
    | the UK of banks' own infrastructure which have led to
    | transactions stopping for days at a time, with the
    | predictable outcomes.
 
  | tcgv wrote:
  | > Why hasn't the industry come up with an alternative?
  | 
  | The cloud is the solution to self managed data centers. Their
  | value proposition is appealing: Focus on your core business and
  | let us handle infrastructure for you.
  | 
  | This fits the needs of most small and medium sized businesses,
  | there's no reason not to use the cloud and spend time and money
  | on building and operating private data centers when the
  | (perceived) chances of outages are so small.
  | 
  | Then, companies grow to a certain size where the benefits of
  | having a self managed data center begin to outweigh not
  | having one. But at this point this becomes more of a
  | strategic/political decision than merely a technical one, so
  | it's not an easy shift.
 
| JamesAdir wrote:
| noob question: Aren't companies using several regions for
| availability and redundancy?
 
  | yunwal wrote:
  | I'm seeing outages across several regions for certain services
  | (SNS), so cross-region failover doesn't necessarily help here.
  | 
  | Additionally, for complex apps, automatic cross-region disaster
  | recovery can take tens or even hundreds of dev years, something
  | most small to midsized companies can't afford.
 
  | blahyawnblah wrote:
  | Ideally, yes. In practice, most are hosted in a single region
  | but with multiple availability zones (this is called high
  | availability). What you're talking about is fault tolerance
  | (across multiple regions). That's harder to implement and costs
  | more.
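  | 
  | For the multi-region (fault tolerance) case, one common pattern is
  | DNS failover: a primary record pointing at one region, guarded by
  | a health check, and a secondary record that takes over when the
  | check fails. A compressed sketch with boto3 (the hosted zone ID,
  | health check ID and hostnames are placeholders):
  | 
  |     import boto3
  | 
  |     route53 = boto3.client("route53")
  | 
  |     def failover_record(set_id, role, target, health_check_id=None):
  |         # Build one leg of a PRIMARY/SECONDARY failover pair.
  |         record = {
  |             "Name": "app.example.com",
  |             "Type": "CNAME",
  |             "SetIdentifier": set_id,
  |             "Failover": role,        # "PRIMARY" or "SECONDARY"
  |             "TTL": 60,
  |             "ResourceRecords": [{"Value": target}],
  |         }
  |         if health_check_id:
  |             record["HealthCheckId"] = health_check_id
  |         return {"Action": "UPSERT", "ResourceRecordSet": record}
  | 
  |     route53.change_resource_record_sets(
  |         HostedZoneId="ZPLACEHOLDER123",
  |         ChangeBatch={"Changes": [
  |             failover_record("use1", "PRIMARY",
  |                             "app-use1.example.com",
  |                             health_check_id="hc-placeholder"),
  |             failover_record("usw2", "SECONDARY",
  |                             "app-usw2.example.com"),
  |         ]},
  |     )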
 
| PragmaticPulp wrote:
| I worked at a company that hired an ex-Amazon engineer to work on
| some cloud projects.
| 
| Whenever his projects went down, he fought tooth and nail against
| any suggestion to update the status page. When forced to update
| the status page, he'd follow up with an extremely long "post-
| mortem" document that was really just a long winded explanation
| about why the outage was someone else's fault.
| 
| He later explained that in his department at Amazon, being at
| fault for an outage was one of the worst things that could happen
| to you. He wanted to avoid that mark any way possible.
| 
| YMMV, of course. Amazon is a big company and I've had other
| friends work there in different departments who said this wasn't
| common at all. I will always remember the look of sheer panic he
| had when we insisted that he update the status page to accurately
| reflect an outage, though.
 
  | broknbottle wrote:
  | This gets posted every time there's an AWS outage. It might as
  | well be copypasta at this point.
 
    | Rapzid wrote:
    | It's the "grandma got run over by a reindeer" of AWS outages.
    | Really no outage thread would be complete without this
    | anecdote.
 
    | JoelMcCracken wrote:
    | well, this is the first time I've seen it, so I am glad it
    | was posted this time.
 
      | rconti wrote:
      | Ditto, it's always annoyed me that their status page is
      | useless, but glad someone else mentioned it.
 
      | jjoonathan wrote:
      | First time I've seen it too. Definitely not my first "AWS
      | us-east-1 is down but the status board is green" thread,
      | either.
 
    | [deleted]
 
    | Spivak wrote:
    | I mean it's true at every company I've ever worked at too. If
    | you can lawyer incidents into not being an outage you avoid
    | like 15 meetings with the business stakeholders about all the
    | things we "have to do" to prevent things like this in the
    | future that get canceled the moment they realize how much
    | dev/infra time it will take to implement.
 
    | ignoramous wrote:
    | I had that deja vu feeling reading PragmaticPulp's comment,
    | too.
    | 
    | And sure enough, PragmaticPulp did post a similar comment on
    | a thread about Amazon India's alleged hire-to-fire policy 6
    | months back: https://news.ycombinator.com/item?id=27570411
    | 
    | You and I, we aren't among the 10000, but there are
    | potentially 10000 others who might be: https://xkcd.com/1053/
 
    | PragmaticPulp wrote:
    | Sorry. I'm probably to blame because I've posted this a
    | couple times on HN before.
    | 
    | It strikes a nerve with me because it caused so much trouble
    | for everyone around him. He had other personal issues,
    | though, so I should probably clarify that I'm not entirely
    | blaming Amazon for his habits. Though his time at Amazon
    | clearly did exacerbate his personal issues.
 
  | dang wrote:
  | (This was originally a reply to
  | https://news.ycombinator.com/item?id=29473759 but I've pruned
  | it to make the thread less top-heavy.)
 
  | avalys wrote:
  | I can totally picture this. Poor guy.
 
  | kortex wrote:
  | That sounds like the exact opposite of human-factors
  | engineering. No one _likes_ taking blame. But when things go
  | sideways, people are extra spicy and defensive, which makes
  | them clam up and often withhold useful information, which can
  | extend the outage.
  | 
  | No-blame analysis is a much better pattern. Everyone wins. It's
  | about building the system that builds the system. Stuff broke;
  | fix the stuff that broke, then fix the things that _let stuff
  | break_.
 
    | 88913527 wrote:
    | I don't think engineers can believe in no-blame analysis if
    | they know it'll harm career growth. I can't unilaterally
    | promote John Doe, I have to convince other leaders that John
    | would do well the next level up. And in those discussions,
    | they could bring up "but John has caused 3 incidents this
    | year", and honestly, maybe they'd be right.
 
      | SQueeeeeL wrote:
      | Would they? Having 3 outages in a year sounds like an
      | organization problem. Not enough safeguards to prevent very
      | routine human errors. But instead of worrying about that we
      | just assign a guy to take the fall
 
        | JackFr wrote:
        | Well, if John caused 3 outages and his peers Sally and
        | Mike each caused 0, it's worth taking a deeper look.
        | There's a real possibility he's getting screwed by a
        | messed up org, also he could be doing slapdash work or he
        | seriously might not understand the seriousness of an
        | outage.
 
        | jjav wrote:
        | Worth a look, certainly. Also very possible that this
        | John is upfront about honest postmortems and like a good
        | leader takes the blame, whereas Sally and Mike are out
        | all day playing politics looking for how to shift blame
        | so nothing has their name attached. Most larger companies
        | that's how it goes.
 
        | Kliment wrote:
        | Or John's work is in frontline production use and Sally's
        | and Mike's is not, so there's different exposure.
 
        | crmd wrote:
        | John's team might also be taking more calculated risks
        | and running circles around Sally and Mike's teams with
        | respect to innovation and execution. If your organization
        | categorically punishes failures/outages, you end up with
        | timid managers that are only playing defense, probably
        | the opposite of what the leadership team wants.
 
        | dolni wrote:
        | If you work in a technical role and you _don't_ have the
        | ability to break something, you're unlikely to be
        | contributing in a significant way. Likely that would make
        | you a junior developer whose every line of code is
        | heavily scrutinized.
        | 
        | Engineers should be experts and you should be able to
        | trust them to make reasonable choices about the
        | management of their projects.
        | 
        | That doesn't mean there can't be some checks in place,
        | and it doesn't mean that all engineers should be perfect.
        | 
        | But you also have to acknowledge that adding all of those
        | safeties has a cost. You can be a competent person who
        | requires fewer safeties or less competent with more
        | safeties.
        | 
        | Which one provides more value to an organization?
 
        | pm90 wrote:
        | > Which one provides more value to an organization?
        | 
        | Neither, they both provide the same value in the long
        | term.
        | 
        | Senior engineers cannot execute on everything they commit
        | to without having a team of engineers they work with. If
        | nobody trains junior engineers, the discipline would go
        | extinct.
        | 
        | Senior engineers provide value by building guardrails to
        | enable junior engineers to provide value by delivering
        | with more confidence.
 
        | jaywalk wrote:
        | You're not wrong, but it's possible that the organization
        | is small enough that it's just not feasible to have
        | enough safeguards that would prevent the outages John
        | caused. And in that case, it's probably best that John
        | not be promoted if he can't avoid those errors.
 
        | kortex wrote:
        | Current co is small. We are putting in the safeguards
        | from Day 1. Well, okay technically like day 120, the
        | first few months were a mad dash to MVP. But now that we
        | have some breathing room, yeah, we put a lot of emphasis
        | on preventing outages, detecting and diagnosing outages
        | promptly, documenting them, doing the whole 5-why's
        | thing, and preventing them in the future. We didn't have
        | to, we could have kept mad dashing and growth hacking.
        | But _very_ fortunately, we have a great culture here
        | (founders have lots of hindsight from past startups).
        | 
        | It's like a seed for crystal growth. Small company is
        | exactly the best time to implement these things, because
        | other employees will try to match the cultural norms and
        | habits.
 
        | jaywalk wrote:
        | Well, I started at the small company I'm currently at
        | around day 7300, where "source control" consisted of
        | asking the one person who was in charge of all source
        | code for a copy of the files you needed to work on, and
        | then giving the updated files back. He'd write down the
        | "checked out" files on a whiteboard to ensure that two
        | people couldn't work on the same file at the same time.
        | 
        | The fact that I've gotten it to the point of using git
        | with automated build and deployment is a small miracle in
        | itself. Not everybody gets to start from a clean slate.
 
      | mountainofdeath wrote:
      | There is no such thing as "no-blame" analysis. Even in the
      | best organizations with the best effort to avoid it, there
      | is always a subconscious "this person did it". It doesn't
      | help that these incidents serve as convenient places for
      | others to leverage to climb their own career ladder at your
      | expense.
 
      | [deleted]
 
      | AnIdiotOnTheNet wrote:
      | > I have to convince other leaders that John would do well
      | the next level up.
      | 
      | "Yes, John has made mistakes and he's always copped to them
      | immediately and worked to prevent them from happening again
      | in the future. You know who doesn't make mistakes? People
      | who don't do anything."
 
      | nix23 wrote:
      | You know why SO-teams, firefighters and military pilots are
      | so successful?
      | 
      | -You don't hide anything
      | 
      | -Errors will be made
      | 
      | -After training/mission everyone talks about the errors (or
      | potential ones) and how to prevent them
      | 
      | -You don't make the same error twice
      | 
      | Being afraid to make errors and learn from them creates a
      | culture of hiding, a culture of denial and especially being
      | afraid to take responsibility.
 
        | jacquesm wrote:
        | You can even make the same error twice but you better
        | have _much_ better explanation the second time around
        | than you had the first time around because you already
        | knew that what you did was risky and or failure prone.
        | 
        | But usually it isn't the same person making the same
        | mistake, usually it is someone else making the same
        | mistake and nobody thought of updating
        | processes/documentation to the point that the error would
        | have been caught in time. Maybe they'll fix that after
        | the second time ;)
 
    | maximedupre wrote:
    | Or just take responsibility. People will respect you for
    | doing that and you will demonstrate leadership.
 
      | artificial wrote:
      | Way more fun argument: Outages just, uh... uh... find a
      | way.
 
      | melony wrote:
      | And the guy who doesn't take responsibility gets promoted.
      | Employees are not responsible for failures of management to
      | set a good culture.
 
        | tomrod wrote:
        | Not in healthy organizations, they don't.
 
        | foobiekr wrote:
        | You can work an entire career and maybe enjoy life in one
        | healthy organization in that entire time even if you work
        | in a variety of companies. It just isn't that common,
        | though of course voicing the _ideals_ is very, very
        | common.
 
        | jacquesm wrote:
        | Once you reach a certain size there are surprisingly few
        | healthy organization, most of them turn into
        | externalization engines with 4 beats per year.
 
        | kortex wrote:
        | The Gervais/Peter Principle is alive and well in many
        | orgs. That doesn't mean that when you have the
        | prerogative to change the culture, you just give up.
        | 
        | I realize that isn't an easy thing to do. Often the best
        | bet is to just jump around till you find a company that
        | isn't a cultural superfund site.
 
      | jrootabega wrote:
      | Cynical/realist take: Take responsibility and then hope
      | your bosses already love you, you can immediately both come up
      | with a way to prevent it from happening again, and convince
      | them to give you the resources to implement it. Otherwise
      | your responsibility is, unfortunately, just blood in the
      | water for someone else to do all of that to protect the
      | company against you and springboard their reputation on the
      | descent of yours. There were already senior people scheming
      | to take over your department from your bosses, now they
      | have an excuse.
 
        | mym1990 wrote:
        | This seems like an absolutely horrid way of working or
        | doing 'office politics'.
 
        | geekbird wrote:
        | Yes, and I personally have worked in environments that do
        | just that. They said they didn't, but with management
        | "personalities" plus stack ranking, you know damn well
        | that they did.
 
    | kumarakn wrote:
    | I worked at Walmart Technology. I bravely wrote post mortem
    | documents owning the fault of my team (100+ people), owning
    | both technically and also culturally as their leader. I put
    | together a plan to fix it and executed it. Thought that was
    | the right thing to do. This happened two times in my 10 year
    | career there.
    | 
    | Both times I was called out as a failure in my performance
    | eval. Second time, I resigned and told them to find a better
    | leader.
    | 
    | Happy now I am out of such shitty place.
 
      | gunapologist99 wrote:
      | That's shockingly stupid. I also worked for a major Walmart
      | IT services vendor in another life, and we always had to be
      | careful about how we handled them, because they didn't
      | always show a lot of respect for vendors.
      | 
      | On another note, thanks for building some awesome stuff --
      | walmart.com is awesome. I have both Prime and whatever-
      | they're-currently-calling Walmart's version and I love that
      | Walmart doesn't appear to mix SKU's together in the same
      | bin which seems to cause counterfeiting fraud at Amazon.
 
        | gnat wrote:
        | What's a "bin" in this context?
 
        | AfterAnimator wrote:
        | I believe he means a literal bin. E.g. Amazon takes
        | products from all their sellers and chucks them in the
        | same physical space, so they have no idea who actually
        | sold the product when it's picked. So you could have
        | gotten something from a dodgy 3rd party seller that
        | repackages broken returns, etc, and Amazon doesn't
        | maintain oversight of this.
 
        | notinty wrote:
        | Literally just a bin in a fulfillment warehouse.
        | 
        | An amazon listing doesn't guarantee a particular SKU.
 
        | gnat wrote:
        | Ah, whew. That's what I thought. Thanks! I asked because
        | we make warehouse and retail management systems and every
        | vendor or customer seems to give every word their own
        | meanings (e.g., we use "bin" in our discounts engine to
        | be a collection of products eligible for discounts, and
        | "barcode" has at least three meanings depending on to
        | whom you're speaking).
 
        | throwawayHN378 wrote:
        | Is WalMart.com awesome?
 
        | temp6363t wrote:
        | walmart.com user design sucks. My particular grudge right
        | now is - I'm shopping to go pick up some stuff (and
        | indicate "in store pickup") and each time I search for the
        | next item, it resets that filter making me click on that
        | filter for each item on my list.
 
        | muvb00 wrote:
        | Walmart.com, Am I the only one in the world who can't
        | view their site on my phone? I tried it on a couple
        | devices and couldn't get it to work. Scaling is fubar. I
        | assumed this would be costing them millions/billions
        | since it's impossible to buy something from my phone
        | right now. S21+ in portrait on multiple browsers.
 
        | handrous wrote:
        | Almost every physical-store-chain company's website makes
        | it way too hard to do the thing I _nearly always_ want
        | out of their interface, which is to search the inventory
        | of the X nearest locations. They all want to push online
        | orders or 3rd-party-seller crap, it seems.
 
      | CobrastanJorji wrote:
      | Stories like this are why I'm really glad I stopped talking
      | to that Walmart Technology recruiter a few years ago. I
      | love working for places where senior leadership constantly
      | repeat war stories about "that time I broke the flagship
      | product" to reinforce the importance of blameless
      | postmortems. You can't fix the process if the people who
      | report to you feel the need to lie about why things go
      | wrong.
 
      | jacquesm wrote:
      | Props to you and Walmart will never realize their loss.
      | Unfortunately. But one day there will be a headline (or even
      | a couple of them) and you will know that if you had been
      | there it might not have happened and that in the end it is
      | Walmarts' customers that will pay the price for that, not
      | their shareholders.
 
      | dnautics wrote:
      | that's awful. You should have been promoted for that.
 
      | abledon wrote:
      | is it just 'ceremony' to be called out on those things?
      | (even if it is actually a positive sum total)
 
        | ARandomerDude wrote:
        | > Happy now I am out of such shitty place.
        | 
        | Doesn't sound like it.
 
      | emteycz wrote:
      | But hope you found a better place?
 
    | javajosh wrote:
    | I firmly believe in the dictum "if you ship it you own it".
    | That means you own all outages. It's not just an operator
    | flubbing a command, or a bit of code that passed review when
    | it shouldn't. It's all your dependencies that make your
    | service work. You own ALL of them.
    | 
    | People spend all this time threat modelling their stuff
    | against malefactors, and yet so often people don't spend any
    | time thinking about the threat model of _decay_. They don't
    | do it when adding new dependencies (build- or runtime), and
    | therefore are unprepared to handle an outage.
    | 
    | There's a good reason for this, of course: modern software
    | "best practices" encourage moving fast and breaking things,
    | which includes "add this dependency we know nothing about,
    | and which gives an unknown entity the power to poison our
    | code or take down our service, arbitrarily, at runtime, but
    | hey its a cool thing with lots of github stars and it's only
    | one 'npm install' away".
    | 
    | Just want to end with this PSA: Dependencies bad.
 
      | syngrog66 wrote:
      | if I were a black hat I would absolutely love GitHub and
      | all the various language-specific package systems out
      | there. they give me sooooo many ways to sneak arbitrary
      | tailored malicious code into millions of installs around
      | the world 24x7. sure, some of my attempts might get
      | caught, or might not lead to a valuable outcome for me.
      | but the percentage that does pay off? that can make it
      | worth it. it's about scale and a massive parallelization
      | of infiltration attempts. logic similar to the folks
      | blasting out phishing emails or scam calls.
      | 
      | I _love_ the ubiquity of third-party software from
      | strangers, and the lack of bureaucratic gatekeepers. but I
      | also _hate_ it in ways. and not enough people know about
      | the dangers of this second thing.
 
        | throwawayHN378 wrote:
        | And yet oddly enough the Earth continues to spin and the
        | internet continues to work. I think the system we have
        | now is necessarily the system that must exist (in this
        | particular case, not in all cases). Something more
        | centralized is destined to fail. And, while the open
        | source nature of software introduces vulnerabilities it
        | also fixes them.
 
        | syngrog66 wrote:
        | > And, while the open source nature of software
        | introduces vulnerabilities it also fixes them.
        | 
        | dat gap tho... which was my point. smart black hats will
        | be exploiting this gap, at scale. and the strategy will
        | work because the majority of folks seem to be either
        | lazy, ignorant or simply hurried for time.
        | 
        | and btw your 1st sentence was rude. constructive feedback
        | for the future
 
      | AtlasBarfed wrote:
      | That's a great philosophy.
      | 
      | Ok, let's take an organization, let's call them, say,
      | Ammizzun. Totally not Amazon. Let's say you have a very
      | aggressive hire/fire policy which worked really well in
      | rapidly scaling and growing your company. Now you have a
      | million-odd customers highly dependent on systems that
      | were built by people who are now one? two? three? four?
      | hire/fire, up-or-out, or cashed-out generations ago.
      | 
      | So.... who owns it if the people who wrote it are
      | lllloooooonnnnggg gone? Like, not just gone one or two
      | cycles ago, where some institutional memory still exists.
      | I mean, GONE.
 
        | javajosh wrote:
        | A lot can go wrong as an organization grows, including
        | loss of knowledge. At Amazon, "Ownership" officially
        | rests with the non-technical money that owns voting
        | shares. They control the board, who control the CEO.
        | "Ownership" can be perverted to mean that you, a wage
        | slave, are responsible for the mess that previous ICs
        | left behind. The obvious thing to do in such a
        | circumstance is quit (or not apply). It is unfair and
        | unpleasant to be treated in a way that gives you
        | responsibility but no authority, and to participate in
        | maintaining (and extending) that moral hazard; as long
        | as there are better companies, you're better off working
        | for them.
 
      | Mezzie wrote:
      | It's also a nightmare for software preservation. There's
      | going to be a lot from this era that won't be usable 80
      | years from now because everything is so interdependent and
      | impossible to archive. It's going to be as messy and
      | irretrievable as the Web was before the Internet Archive
      | and the Wayback Machine.
 
      | 88913527 wrote:
      | Should I be penalized if an upstream dependency, owned by
      | another team, fails? Did I lack due diligence in choosing
      | to accept the risk that the other team couldn't deliver?
      | These are real problems in the micro-services world,
      | especially since I own UI and there are dozens of teams
      | pumping out services, and I'm at the mercy of all of them.
      | The best I can do is gracefully fail when services don't
      | function in a healthy state.
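      | 
      | As a rough sketch, that kind of graceful degradation can
      | be as simple as a timeout plus a fallback value (the
      | endpoint, timeout, and fallback below are made up, purely
      | for illustration):
      | 
      |     import requests
      | 
      |     # Degraded-but-usable response served when the
      |     # upstream dependency is slow or down.
      |     FALLBACK_PROFILE = {"name": "unknown", "avatar_url": None}
      | 
      |     def get_profile(user_id, timeout_s=0.5):
      |         """Fetch a profile, degrading gracefully on upstream failure."""
      |         try:
      |             resp = requests.get(
      |                 f"https://profile-service.internal/users/{user_id}",
      |                 timeout=timeout_s,
      |             )
      |             resp.raise_for_status()
      |             return resp.json()
      |         except requests.RequestException:
      |             return FALLBACK_PROFILE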
 
        | NikolaeVarius wrote:
        | > Should I be penalized if an upstream dependency, owned
        | by another team, fails?
        | 
        | Yes
        | 
        | > Did I lack due diligence in choosing to accept the risk
        | that the other team couldn't deliver?
        | 
        | Yes
 
        | bandyaboot wrote:
        | Where does this mindset end? Do I lack due diligence by
        | choosing to accept that the cpu microcode on the system
        | I'm deploying to works correctly?
 
        | unionpivo wrote:
        | If it's a brand new RISC-V CPU that was just released 5
        | minutes ago, and nobody has really tested it, then yes.
        | 
        | If it's a standard CPU that everybody else uses, and it's
        | not known to be bad, then no.
        | 
        | Same for software. Is it OK to have a dependency on AWS
        | services? Their history shows yes. A dependency on a
        | brand new SaaS product? Nothing mission critical.
        | 
        | Or npm/crates/pip packages. Packages that have been
        | around and steadily maintained for a few years, and have
        | active users, are worth checking out. Some random
        | project from a single developer? Consider vendoring it
        | (and owning it if necessary).
 
        | jrockway wrote:
        | Why? Intel has Spectre/Meltdown which erased like half of
        | everyone's capacity overnight.
 
        | treis wrote:
        | You choose the CPU and you choose what happens in a
        | failure scenario. Part of engineering is making choices
        | that meet the availability requirements of your service.
        | And part of that is handling failures from dependencies.
        | 
        | That doesn't extend to ridiculous lengths but as a rule
        | you should engineer around any single point of failure.
 
        | NikolaeVarius wrote:
        | Yes? If you are worried about CPU microcode failing, then
        | you do a NASA and have multiple CPU architectures doing
        | calculations in a voting block. These are not unsolved
        | problems.
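        | 
        | A toy illustration of that kind of redundancy voting
        | (not flight software; the inputs are just the outputs of
        | the same calculation run on independent units):
        | 
        |     from collections import Counter
        | 
        |     def vote(results):
        |         """Return the answer a strict majority of redundant units agree on."""
        |         winner, count = Counter(results).most_common(1)[0]
        |         if count * 2 <= len(results):
        |             raise RuntimeError("no majority among redundant units")
        |         return winner
        | 
        |     # vote([42, 42, 41]) -> 42; vote([1, 2, 3]) raises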
 
        | javajosh wrote:
        | JPL goes further and buys multiple copies of all hardware
        | and software media used for ground systems, and keeps
        | them in storage "just in case". It's a relatively cheap
        | insurance policy against the decay of progress.
 
        | obstacle1 wrote:
        | Say during due diligence two options are uncovered: use
        | an upstream dependency owned by another team, or use that
        | plus a 3P vendor for redundancy. Implementing parallel
        | systems costs 10x more than the former and takes 5x
        | longer. You estimate a 0.01% chance of serious failure
        | for the former, and 0.001% for the latter.
        | 
        | Now say you're a medium sized hyper-growth company in a
        | competitive space. Does spending 10 times more and
        | waiting 5 times longer for redundancy make business
        | sense? You could argue that it'd be irresponsible to
        | over-engineer the system in this case, since you delay
        | getting your product out and potentially lose $ and
        | ground to competitors.
        | 
        | I don't think a black and white "yes, you should be
        | punished" view is productive here.
 
        | bityard wrote:
        | You and many others here may be conflating two concepts
        | which are actually quite separate.
        | 
        | Taking blame is a purely punitive action and solves
        | nothing. Taking responsibility means it's your job to
        | correct the problem.
        | 
        | I find that the more "political" the culture in the
        | organization is, the more likely everyone is to search
        | for a scapegoat to protect their own image when a mistake
        | happens. The higher you go up in the management chain,
        | the more important vanity becomes, and the more you see
        | it happening.
        | 
        | I have made plenty of technical decisions that turned out
        | to be the wrong call in retrospect. I took
        | _responsibility_ for those by learning from the mistake
        | and reversing or fixing whatever was implemented.
        | However, I never willfully took _blame_ for those
        | mistakes because I believed I was doing the best job I
        | could at the time.
        | 
        | Likewise, the systems I manage sometimes fail because
        | something that another team manages failed. Sometimes
        | it's something dumb and could have easily been prevented.
        | In these cases, it's easy to point blame and say, "Not our
        | fault! That team or that person is being a fuckup and
        | causing our stuff to break!" It's harder but much more
        | useful to reach out and say, "hey, I see x system isn't
        | doing what we expect, can we work together to fix it?"
 
        | cyanydeez wrote:
        | Every argument I have on the internet is between
        | prescriptive and descriptive language.
        | 
        | People tend to believe that if you can describe a problem
        | that means you can prescribe a solution. Often times, the
        | only way to survive is to make it clear that the first
        | thing you are doing is describing the problem.
        | 
        | After you do that, and it's clear that's all you are
        | doing, then you follow up with a prescriptive description
        | where you place clearly what could be done to manage a
        | future scenario.
        | 
        | If you don't create this bright line, you create a
        | confused interpretation.
 
        | javajosh wrote:
        | My comment was made from the relatively simpler
        | entrepreneurial perspective, not the corporate one. Corp
        | ownership rests with people in the C-suite who are
        | social/political lawyer types, not technical people. They
        | delegate responsibility but not authority, because they
        | can hire people, even smart people, to work under those
        | conditions. This is an error mode where "blame" flows
        | from those who control the money to those who control the
        | technology. Luckily, not all money is stupid so some
        | corps (and some parts of corps) manage to function even
        | in the presence of risk and innovation failures. I mean
        | the whole industry is effectively a distributed R&D
        | budget that may or may not yield fruit. I suppose this is
        | the market figuring out whether iterated R&D makes sense
        | or not. (Based on history, I'd say it makes a lot of
        | sense.)
 
        | [deleted]
 
        | javajosh wrote:
        | I wish you wouldn't talk about "penalization" as if it
        | was something that comes from a source of authority.
        | _Your customers are depending on you_, and you've let
        | them down, and the reason that's bad has nothing to do
        | with what your boss will do to you in a review.
        | 
        | The injustice that can and does happen is that you're
        | explicitly given a narrow responsibility during
        | development, and then a much broader responsibility
        | during operation. This is patently unfair, and very
        | common. For something like a failed uService you want to
        | blame "the architect" that didn't anticipate these system
        | level failures. What is the solution? Have plan b (and
        | plan c) ready to go. If these services don't exist, then
        | you must build them. It also implies a level of
        | indirection that most systems aren't comfortable with,
        | because we want to consume services directly (and for
        | good reason) but reliability _requires_ that you never,
        | ever consume a service directly, but instead from an in-
        | process location that is failure aware.
        | 
        | This is why reliable software is hard, and engineers are
        | expensive.
        | 
        | Oh, and it's also why you generally do NOT want to defer
        | the last build step to runtime in the browser. If you
        | start combining services on both the client and server,
        | you're in for a world of hurt.
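        | 
        | A minimal sketch of what such a failure-aware in-process
        | wrapper might look like (a toy circuit breaker; the
        | thresholds and the wrapped call are made up):
        | 
        |     import time
        | 
        |     class FailureAwareClient:
        |         """Serve a fallback while a flaky dependency cools down."""
        | 
        |         def __init__(self, call, fallback, max_failures=3, cooldown_s=30):
        |             self.call = call          # talks to the real dependency
        |             self.fallback = fallback  # used while the circuit is open
        |             self.max_failures = max_failures
        |             self.cooldown_s = cooldown_s
        |             self.failures = 0
        |             self.open_until = 0.0
        | 
        |         def __call__(self, *args, **kwargs):
        |             if time.time() < self.open_until:
        |                 return self.fallback(*args, **kwargs)  # circuit open
        |             try:
        |                 result = self.call(*args, **kwargs)
        |                 self.failures = 0
        |                 return result
        |             except Exception:
        |                 self.failures += 1
        |                 if self.failures >= self.max_failures:
        |                     self.open_until = time.time() + self.cooldown_s
        |                 return self.fallback(*args, **kwargs)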
 
        | bostik wrote:
        | Not penalised no, but questioned as to how well your
        | graceful failure worked in the end.
        | 
        | Remember: it may not be your fault, but it still is your
        | problem.
 
        | fragmede wrote:
        | An analogy for illustrating this is:
        | 
        | You get hit by a car and injured. The accident is the
        | other driver's fault, but getting to the ER is your
        | problem. The other driver may help and call an ambulance,
        | but they might not even be able to help you if they also
        | got hurt in the car crash.
 
      | ssimpson wrote:
      | When working on CloudFiles, we often had monitoring for
      | our limited dependencies that was better than their own
      | monitoring. Don't just know what your stuff is doing, but
      | what your whole dependency ecosystem is doing, and know
      | when it all goes south. It also helps you learn where and
      | how you can mitigate some of those dependencies.
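      | 
      | A bare-bones sketch of that kind of independent dependency
      | probe (the dependency names and health URLs are
      | placeholders, not real endpoints):
      | 
      |     import time
      |     import requests
      | 
      |     DEPENDENCIES = {
      |         "auth":    "https://auth.internal/health",
      |         "storage": "https://storage.internal/health",
      |     }
      | 
      |     def check_dependencies(timeout_s=2):
      |         """Probe each dependency and report the unhealthy ones."""
      |         unhealthy = {}
      |         for name, url in DEPENDENCIES.items():
      |             try:
      |                 resp = requests.get(url, timeout=timeout_s)
      |                 if resp.status_code != 200:
      |                     unhealthy[name] = f"HTTP {resp.status_code}"
      |             except requests.RequestException as exc:
      |                 unhealthy[name] = type(exc).__name__
      |         return unhealthy
      | 
      |     if __name__ == "__main__":
      |         while True:
      |             bad = check_dependencies()
      |             if bad:
      |                 print(f"dependency check failed: {bad}")  # alert here
      |             time.sleep(60)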
 
        | foobiekr wrote:
        | This. We found very big, serious issues with our anti-
        | DDOS provider because their monitoring sucked compared to
        | ours. It was a sobering reality check when we realized
        | that.
 
    | insaneirish wrote:
    | > No-blame analysis is a much better pattern. Everyone wins.
    | It's about building the system that builds the system. Stuff
    | broke; fix the stuff that broke, then fix the things that let
    | stuff break.
    | 
    | Yea, except it doesn't work in practice. I work with a lot of
    | people who come from places with "blameless" post-mortem
    | 'culture' and they've evangelized such a thing extensively.
    | 
    | You know what all those people have proven themselves to
    | really excel at? _Blaming people._
 
      | kortex wrote:
      | Ok, and? I don't doubt it fails in places. That doesn't
      | mean that it doesn't work in practice. Our company does it
      | just fine. We have a high trust, high transparency system
      | and it's wonderful.
      | 
      | It's like saying unit tests don't work in practice because
      | bugs got through.
 
        | kortilla wrote:
        | Have you ever considered that the "no-blame" postmortems
        | you are giving credit for everything are just a side
        | effect of living in a high trust, high transparency
        | system?
        | 
        | In other words, "no-blame" should be an emergent property
        | of a culture of trust. It's not something you can
        | prescribe.
 
        | kortex wrote:
        | Yes, exactly. Culture of trust is the root. Many
        | beneficial patterns emerge when you can have that: more
        | critical PRs, blameless post-mortems, etc.
 
  | maximedupre wrote:
  | Damn, he had serious PTSD lol
 
  | jonhohle wrote:
  | On the retail/marketplace side this wasn't my experience, but
  | we also didn't have any public dashboards. On Prime we
  | occasionally had to refund in bulk, and when it was called for
  | (internally or externally) we would write up a detailed post-
  | mortem. This wasn't fun, but it was never about blaming a
  | person and more about finding flaws in process or monitoring.
 
  | mountainofdeath wrote:
  | Former AWSser. I can totally believe that happened and
  | continues to happen in some teams. Officially, it's not
  | supposed to be done that way.
  | 
  | Some AWS managers and engineers bring their corporate cultural
  | baggage with them when they join AWS and it takes a few years
  | to unlearn it.
 
  | hinkley wrote:
  | I am finding that I have a very bimodal response to "He did
  | it". When I write an RCA or just talk about near misses, I may
  | give you enough details to figure out that Tom was the one who
  | broke it, but I'm not going to say Tom on the record anywhere,
  | with one extremely obvious exception.
  | 
  | If I think Tom has a toxic combination of poor judgement,
  | Dunning-Kruger syndrome, and a hint of narcissism (I'm not sure
  | but I may be repeating myself here), such that he won't listen
  | to reason and he actively steers others into bad situations
  | (and especially if he then disappears when shit hits the fan),
  | then I will nail him to a fucking cross every chance I get.
  | Public shaming is only a tool for getting people to discount
  | advice from a bad actor. If it comes down to a vote between my
  | idea and his, then I'm going to make sure everyone knows that
  | his bets keep biting us in the ass. This guy kinda sounds like
  | the Toxic Tom.
  | 
  | What is important when I turned out to be the cause of the
  | issue is a bit like some court cases. Would a reasonable person
  | in this situation have come to the same conclusion I did? If
  | so, then I'm just the person who lost the lottery. Either way,
  | fixing it for me might fix it for other people. Sometimes the
  | answer is, "I was trying to juggle three things at once and a
  | ball got dropped." If the process dictated those three things
  | then the process is wrong, or the tooling is wrong. If someone
  | was asking me questions we should think about being more pro-
  | active about deflecting them to someone else or asking them to
  | come back in a half hour. Or maybe I shouldn't be trying to
  | watch training videos while babysitting a deployment to
  | production.
  | 
  | If you never say "my bad" then your advice starts to sound like
  | a lecture, and people avoid lectures so then you never get the
  | whole story. Also as an engineer you should know that owning a
  | mistake early on lets you get to what most of us consider the
  | interesting bit of _solving the problem_ instead of talking
  | about feelings for an hour and then using whatever is left of
  | your brain afterward to fix the problem. In fact in some cases
  | you can shut down someone who is about to start a rant (which
  | is funny as hell because they look like their head is about to
  | pop like a balloon when you say,  "yep, I broke it, let's move
  | on to how do we fix it?")
 
    | bgribble wrote:
    | To me, the point of "blameless" PM is not to hide the
    | identity of the person who was closest to the failure point.
    | You can't understand what happened unless you know who did
    | what, when.
    | 
    | "Blameless" to me means you acknowledge that the ultimate
    | problem isn't that someone made a mistake that caused an
    | outage. The problem is that you had a system in place where
    | someone could make a single mistake and cause an outage.
    | 
    | If someone fat-fingers a SQL query and drops your database,
    | the problem isn't that they need typing lessons! If you put a
    | DBA in a position where they have to be typing SQL directly
    | at a production DB to do their job, THAT is the cause of the
    | outage, the actual DBA's error is almost irrelevant because
    | it would have happened eventually to someone.
 
      | hinkley wrote:
      | Naming someone is how you discover that not everyone in the
      | organization believes in Blamelessness. Once it's out it's
      | out, you can't put it back in.
      | 
      | It's really easy for another developer to figure out who
      | I'm talking about. Managers can't be arsed to figure it
      | out, or at least pretend like they don't know.
 
  | StreamBright wrote:
  | Yep I can confirm that. The process when the outage is caused
  | by you is called COE (correction of errors). I was oncall once
  | for two teams because I was switching teams and I got 11
  | escalations in 2 hours. 10 of these were caused by an overly
  | sensitive monitoring setting. The 11th was a real one. Guess
  | which one I ignored. :)
 
  | kache_ wrote:
  | Sometimes, these large companies tack on too many "necessary"
  | incident "remediation" actions with Arbitrary Due Date SLAs
  | that completely derail any ongoing work. And ongoing,
  | strategically defined ""muh high impact"" projects are what get
  | you promoted, not doing incident remediations.
  | 
  | When you get to the level you want, you get to not really give
  | a shit and actually do The Right Thing. However, for all of the
  | engineers clamoring to get out of the intermediate brick-laying
  | trenches, opening an incident can create perverse incentives.
 
    | bendbro wrote:
    | In my experience this is the actual reason for fear of the
    | formal error correction process.
 
    | pts_ wrote:
    | Politicized cloud meh.
 
  | ashr wrote:
  | This is the exact opposite of my experience at AWS. Amazon is
  | all about blameless fact finding when it comes to root cause
  | analysis. Your company just hired a not so great engineer or
  | misunderstood him.
 
    | Insanity wrote:
    | Adding my piece of anecdata to this: the process is quite
    | blameless. If a postmortem seems like it points blame, this
    | is pointed out and removed.
 
      | swiftcoder wrote:
      | Blameless, maybe, but not repercussion-less. A bad CoE was
      | liable to upend the team's entire roadmap and put their
      | existing goals at risk. To be fair, management was fairly
      | receptive to "we need to throw out the roadmap and push our
      | launch out to the following re:Invent", but it wasn't an
      | easy position for teams to be in.
 
  | kator wrote:
  | I've worked for Amazon for 4 years, including stints at AWS,
  | and even in my current role my team is involved in LSE's. I've
  | never seen this behavior, the general culture has been find the
  | problem, fix it, and then do root cause analysis to avoid it
  | again.
  | 
  | Jeff himself has said many times in All Hands and in public
  | "Amazon is the best place to fail". Mainly because things will
  | break, it's not that they break that's interesting, it's what
  | you've learned and how you can avoid that problem in the
  | future.
 
    | jsperson wrote:
    | I guess the question is why can't you (AWS) fix the problem
    | of the status page not reflecting an outage? Maybe acceptable
    | if the console has a hiccup, but when www.amazon.com isn't
    | working right, there should be some yellow and red dots out
    | there.
    | 
    | With the size of your customer base there were man years
    | spent confirming the outage after checking the status.
 
      | andrewguenther wrote:
      | Because there's a VP approval step for updating the status
      | page and no repercussions for VPs who don't approve updates
      | in a timely manner. Updating the status page is fully
      | automated on both sides of VP approval. If the status page
      | doesn't update, it's because a VP wouldn't do it.
 
    | Eduard wrote:
    | LSE?
 
      | merciBien wrote:
      | Large Scale Event
 
  | dehrmann wrote:
  | > explanation about why the outage was someone else's fault
  | 
  | In my experience, it's rarely clear who was at fault for any
  | sort of non-trivial outage. The issue tends to be at interfaces
  | and involve multiple owners.
 
  | jimt1234 wrote:
  | Every incident review meeting I've ever been in starts out
  | like, _"This meeting isn't to place blame..."_, then, 5 minutes
  | later, it turns into the Blame Game.
 
  | mijoharas wrote:
  | That's a real shame, one of the leadership principles used to
  | be "be vocally self-critical" which I think was supposed to
  | explicitly counteract this kind of behaviour.
  | 
  | I think they got rid of it at some point though.
 
  | howdydoo wrote:
  | Manually updated status pages are an anti-pattern to begin
  | with. At that point, why not just call it a blog?
 
  | jacquesm wrote:
  | And this is exactly why you can expect these headlines to hit
  | with great regularity. These things are never a problem at the
  | individual level, they are always at the level of culture and
  | organization.
 
  | 300bps wrote:
  | _being at fault for an outage was one of the worst things that
  | could happen to you_
  | 
  | Imagine how stressful life would be thinking that you had to be
  | perfect all the time.
 
    | errcorrectcode wrote:
    | That's been most of my life. Welcome to perfectionism.
 
  | soheil wrote:
  | > I will always remember the look of sheer panic
  | 
  | I don't know if you're exaggerating or not, but even if true
  | why would anyone show that emotion about losing a job in the
  | worst case?
  | 
  | You certainly have had a lot of relevant-to-today's-top-HN-post
  | stories throughout your career. And I'm less and less surprised
  | to continuously find PragmaticPulp as one of the top
  | commenters, if not the top, that resonates with a good chunk of
  | HN.
 
  | sharpy wrote:
  | Haha... This bring back memories. It really depends on the org.
  | 
  | I've had pushback on my postmortems before because of phrasing
  | that could be construed as laying some of the blame on some
  | person/team when it's supposed to be blameless.
  | 
  | And for a long time, it was fairly blameless. You would still
  | be punished with the extra work of writing high quality
  | postmortems, but I have seen people accidentally bring down
  | critical tier-1 services and not be adversely affected in terms
  | of promotion, etc.
  | 
  | But somewhere along the way, it became politicized. Things like
  | the wheel of death, public grilling of teams on why they didn't
  | follow one of the thousands of best practices, etc, etc. Some
  | orgs are still pretty good at keeping it blameless at the
  | individual level, but... being a big company, your mileage may
  | vary.
 
    | hinkley wrote:
    | We're in a situation where the balls of mud made people
    | afraid to touch some things in the system. As experiences and
    | processes have improved we've started to crack back into
    | those things and guess what, when you are being groomed to
    | own a process you're going to fuck it up from time to time.
    | Objectively, we're still breaking production less often per
    | year than other teams, but we are breaking it, and that's
    | novel behavior, so we have to keep reminding people why.
    | 
    | The moment that affects promotions negatively, or your
    | coworkers throw you under the bus, you should 1) be assertive
    | and 2) proof-read your resume as a precursor to job hunting.
 
      | sharpy wrote:
      | Or problems just persisting, because the fix is easy, but
      | explaining it to others who do not work on the system is
      | hard. Especially justifying why it won't cause an issue,
      | and being told that the fixes need to be done via scripts
      | that will only ever be used once, but nevertheless need to
      | be code reviewed and tested...
      | 
      | I wanted to be proactive and fix things before they became
      | an issue, but such things just drained the life out of me,
      | to the point I just left.
 
  | staticassertion wrote:
  | I don't think anecdotes like this are even worth sharing,
  | honestly. There's so much context lost here, so much that can
  | be lost in translation. No one should be drawing any
  | conclusions from this post.
 
  | amzn-throw wrote:
  | It's popular to upvote this during outages, because it fits a
  | narrative.
  | 
  | The truth (as always) is more complex:
  | 
  | * No, this isn't the broad culture. It's not even a blip. These
  | are EXCEPTIONAL circumstances by extremely bad teams that - if
  | and when found out - would see dramatic intervention.
  | 
  | * The broad culture is blameless post-mortems. Not whose fault
  | is it. But what was the problem and how to fix it. And one of
  | the internal "Ten commandments of AWS availability" is you own
  | your dependencies. You don't blame others.
  | 
  | * Depending on the service, one customer's experience is not the
  | broad experience. Someone might be having a really bad day but
  | 99.9% of the region is operating successfully, so there is no
  | reason to update the overall status dashboard.
  | 
  | * Every AWS customer has a PERSONAL health dashboard in the
  | console that should indicate _their_ experience.
  | 
  | * Yes, VP approval is needed to make any updates on the status
  | dashboard. But that's not as hard as it may seem. AWS
  | executives are extremely operation-obsessed, and when there is
  | an outage of any size are engaged with their service teams
  | immediately.
 
    | flerchin wrote:
    | We knew us-east-1 was unusable for our customers for 45
    | minutes before Amazon acknowledged anything was wrong _at
    | all_. We made decisions _in the dark_ to serve our customers,
    | because Amazon dragged their feet communicating with us. Our
    | customers were notified after 2 minutes.
    | 
    | It's not acceptable.
 
    | pokot0 wrote:
    | Hiding behind a throw away account does not help your point.
 
    | miken123 wrote:
    | Well, the narrative is sort of what Amazon is asking for,
    | heh?
    | 
    | The whole us-east-1 management console is gone, what is
    | Amazon posting for the management console on their website?
    | 
    | "Service degradation"
    | 
    | It's not a degradation if it's outright down. Use the red
    | status a little bit more often, this is a "disruption", not a
    | "degradation".
 
      | taurath wrote:
      | Yeah no kidding. Is there a ratio of how many people it has
      | to be working for to be in yellow rather than red? Some
      | internal person going "it works on my machine" while 99% of
      | customers are down.
 
      | whoknowswhat11 wrote:
      | I've always wondered why services are not counted down more
      | often. Is there some sliver of customers who have access to
      | the management console for example?
      | 
      | An increase in error rates - no biggie, any large system
      | is going to have errors. But when 80%+ of customer loads
      | in the region are impacted (across availability zones, for
      | whatever good those do) - that counts as down, doesn't it?
      | Error rates in one AZ - degraded. Multi-AZ failures - down?
 
        | mynameisvlad wrote:
        | SLAs. Officially acknowledging an incident means that
        | they now _have_ to issue the SLA credits.
 
        | res0nat0r wrote:
        | The outage dashboard is normally only updated if a
        | certain $X percent of hosts / service is down. If the EC2
        | section were updated every time a rack in a datacenter
        | went down, it would be red 24x7.
        | 
        | It's only updated when a large percentage of customers
        | are impacted, and most of the time this number is less
        | than what the HN echo chamber makes it appear to be.
 
        | mynameisvlad wrote:
        | I mean, sure, there are technical reasons why you would
        | want to buffer issues so they're only visible if
        | something big went down (although one would argue that's
        | exactly what the "degraded" status means).
        | 
        | But if the official records say everything is green, a
        | customer is going to have to push a lot harder to get the
        | credits. There is a massive incentivization to "stay
        | green".
 
        | bwestpha wrote:
        | Yes, there were. I'm from Central Europe and we were at
        | least able to get some pages of the console in us-east-1
        | - but I assume this was more caching related. Even though
        | the console loaded and worked for listing some entries,
        | we weren't able to post a support case or view SQS
        | messages, etc.
        | 
        | So I agree that "degraded" is not the proper wording -
        | but the console had not completely vanished either.
        | So... hard to tell what a commonly acceptable wording
        | would be here.
 
        | vladvasiliu wrote:
        | From France, when I connect to "my personal health
        | dashboard" in eu-west-3, it says several services are
        | having "issues" in us-east-1.
        | 
        | To your point, for support center (which doesn't show a
        | region) it says:
        | 
        |  _Description
        | 
        | Increased Error Rates
        | 
        | [09:01 AM PST] We are investigating increased error rates
        | for the Support Center console and Support API in the US-
        | EAST-1 Region.
        | 
        | [09:26 AM PST] We can confirm increased error rates for
        | the Support Center console and Support API in the US-
        | EAST-1 Region. We have identified the root cause of the
        | issue and are working towards resolution._
 
      | threecheese wrote:
      | I'm part of a large org with a large AWS footprint, and
      | we've had a few hundred folks on a call nearly all day. We
      | have only a few workloads that are completely down; most
      | are only degraded. This isn't a total outage, we are still
      | doing business in east-1. Is it "red"? Maybe! We're all
      | scrambling to keep the services running well enough for our
      | customers.
 
      | Thaxll wrote:
      | Because the console works just fine in us-east-2, and the
      | console entry on the status page does not display regions.
      | 
      | If the console works 100% in us-east-2 and not in
      | us-east-1, why would they mark the console as completely
      | down for us-east?
 
      | keyle wrote:
      | Well you know, like when a rocket explodes, it's a sudden
      | and "unexpected rapid disassembly" or something...
      | 
      | And a cleaner is called a "floor technician".
      | 
      | Nothing really out of the ordinary for a service to be
      | called degraded while "hey, the cache might still be
      | working right?" ... or "Well you know, it works every other
      | day except today, so it's just degradation" :-)
 
    | lgylym wrote:
    | Come on, we all know managers don't want to claim an outage
    | till the last minute.
 
    | codegeek wrote:
    | "Yes, VP approval is needed to make any updates on the status
    | dashboard."
    | 
    | If services are clearly down, why is this needed? I can
    | understand the oversight required for a company like Amazon,
    | but this sounds strange to me. If services are clearly down,
    | I want that damn status update right away as a customer.
 
    | tekromancr wrote:
    | Oh, yes. Let me go look at the PERSONAL health dashboard
    | and... oh, I need to sign into the console to view it... hmm
 
    | oscribinn wrote:
    | 100 BEZOBUCKS(tm) have been deposited to your account for
    | this post.
 
    | mrsuprawsm wrote:
    | If your statement is true, then why is the AWS status page
    | widely considered useless, and everyone congregates on HN
    | and/or Twitter to actually know what's broken on AWS during
    | an outage?
 
      | andrewguenther wrote:
      | > Yes, VP approval is needed to make any updates on the
      | status dashboard. But that's not as hard as it may seem.
      | AWS executives are extremely operation-obsessed, and when
      | there is an outage of any size are engaged with their
      | service teams immediately.
      | 
      | My experience generally aligns with amzn-throw, but this
      | right here is why. There's a manual step here and there's
      | always drama surrounding it. The process to update the
      | status page is fully automated on both sides of this step,
      | if you removed VP approval, the page would update
      | immediately. So if the page doesn't update, it is always a
      | VP dragging their feet. Even worse is that lags in this
      | step were never discussed in the postmortem reviews that I
      | was a part of.
 
        | Frost1x wrote:
        | It's intentional plausible deniability. By creating the
        | manual step you can shift blame away. It's just like the
        | concept of personal health dashboards, which are designed
        | to keep an asymmetry of reliability information between
        | the host and the client, limiting customers to their own
        | anecdata experiences. On top of all of this, the metrics
        | are pretty arbitrary.
        | 
        | Let's not pretend businesses haven't been intentionally
        | advertising in deceitful ways for decades if not hundreds
        | of years. This just happens to be the current strategy in
        | tech of lying to and deceiving customers to limit
        | liability, responsibility, and recourse actions.
        | 
        | To be fair, it's not just Amazon; they just happen to be
        | the largest and most targeted whipping boy on the block.
        | Few businesses will admit to liability under any
        | circumstances. Liability always has to be assessed
        | externally.
 
    | amichal wrote:
    | I have in the past directed users here on HN who were
    | complaining about https://status.aws.amazon.com to the
    | Personal Health Dashboard at https://phd.aws.amazon.com/ as
    | well. Unfortunately, even though the account I was logged
    | into this time only has a single S3 bucket in the EU, billed
    | through the EU and with zero direct dependencies on the US,
    | the Personal Health Dashboard was ALSO throwing "The request
    | processing has failed because of an unknown error" messages.
    | Whatever the problem was this time, it had global effects for
    | the majority of users of the Console, and the internet had
    | noticed for over 30 minutes before either the status page or
    | the PHD was able to report it. There will be no explanation,
    | and the official status page logs will say there was
    | "increased API failure rates" for an hour.
    | 
    | Now I guess it's possible that the 1000s and 1000s of us who
    | noticed and commented are some tiny fraction of the user
    | base, but if that's so you could at least publish a follow-up
    | like other vendors do that says something like 0.00001% of
    | API requests failed, affecting an estimated 0.001% of our
    | users at the time.
 
    | yaacov wrote:
    | Can't comment on most of your post but I know a lot of Amazon
    | engineers who think of the CoE process (Correction of Error,
    | what other companies would call a postmortem) as punitive
 
      | jrd259 wrote:
      | I don't know any, and I have written or reviewed about 20
 
      | andrewguenther wrote:
      | They aren't _meant_ to be, but shitty teams are shitty. You
      | can also create a COE and assign it to another team. When I
      | was at AWS, I had a few COEs assigned to me by disgruntled
      | teams just trying to make me suffer and I told them to
      | pound sand. For my own team, I wrote COEs quite often and
      | found it to be a really great process for surfacing
      | systemic issues with our management chain and making real
      | improvements, but it needs to be used correctly.
 
    | nanis wrote:
    | > * Depending on the service one customer's experience is not
    | the broad experience. Someone might be having a really bad
    | day but 99.9% of the region is operating successfully, so
    | there is no reason to update the overall status dashboard.
    | 
    | https://rachelbythebay.com/w/2019/07/15/giant/
 
    | marcinzm wrote:
    | >* Every AWS customer has a PERSONAL health dashboard in the
    | console that should indicate their experience.
    | 
    | You mean the one that is down right now?
 
      | CobrastanJorji wrote:
      | Seems like it's doing an exemplary job of indicating their
      | experience, then.
 
    | ultimoo wrote:
    | > ...you own your dependencies. You don't blame others.
    | 
    | Agreed, teams should invest resources in architecting their
    | systems in a way that can withstand broken dependencies. How
    | do AWS teams account for "core" dependencies (e.g. auth)
    | that may not have alternatives?
 
      | gunapologist99 wrote:
      | This is the irony of building a "reliable" system across
      | multiple AZ's.
 
    | AtlasBarfed wrote:
    | Because OTHERWISE people might think AMAZON is a
    | DYSFUNCTIONAL company that is beginning to CRATER under its
    | HORRIBLE work culture and constant HIRE/FIRE cycle.
    | 
    | See, AWS is basically turning into a long standing utility
    | that needs to be reliable.
    | 
    | Hey, do most institutions like that completely turn over
    | their staff every three years? Yeah, no.
    | 
    | Great for building it out and grabbing market share.
    | 
    | Maybe not for being the basis of a reliable substrate of the
    | modern internet.
    | 
    | If there are dozens of bespoke systems that keep AWS afloat
    | (disclosure: I have friends who worked there, and there are,
    | and also Conway's law), but if the people who wrote them are
    | three generations of HIRE/FIRE ago....
    | 
    | Not good.
 
      | ctvo wrote:
      | > Maybe not for being the basis of a reliable substrate of
      | the modern internet.
      | 
      | Maybe THEY will go to a COMPETITOR and THINGS MOVE ON if
      | it's THAT BAD. I wasn't sure what the pattern for all caps
      | was, so just giving it a shot there. Apologies if it's
      | incorrect.
 
        | AtlasBarfed wrote:
        | I was mocking the parent, who was doing that. Yes it's
        | awful. Effective? Sigh, yes. But awful.
 
    | [deleted]
 
    | the-pigeon wrote:
    | What?!
    | 
    | Everybody is very slow to update their outage pages because
    | of SLAs. It's in a company's financial interest to deny
    | outages and when they are undeniable to make them appear as
    | short as possible. Status pages updating slowly is definitely
    | by design.
    | 
    | There's no large dev platform I've used for which this wasn't
    | true of its status page.
 
    | jjoonathan wrote:
    | I haven't asked AWS employees specifically about blameless
    | postmortems, but several of them have personally corroborated
    | that the culture tends towards being adversarial and
    | "performance focused." That's a tough environment for
    | blameless debugging and postmortems. Like if I heard that
    | someone has a rain forest tree-frog living happily in their
    | outdoor Arizona cactus garden, I have doubts.
 
      | azinman2 wrote:
      | When I was at Google I didn't have a lot of exposure to the
      | public infra side. However I do remember back in 2008 when
      | a colleague was working on routing side of YouTube, he made
      | a change that cost millions of dollars in mere hours before
      | noticing and reverting it. He mentioned this to the larger
      | team which gave applause during a tech talk. I cannot
      | possibly generalize the culture differences between Amazon
      | and Google, but at least in that one moment, the Google
      | culture seemed to support that errors happen, they get
      | noticed, and fixed without harming the perceived
      | performance of those responsible.
 
        | wolverine876 wrote:
        | While I support that, how are the people involved
        | evaluated?
 
        | abdabab wrote:
        | Google puts automation or process in place to avoid
        | outages rather than pointing fingers. If an engineer
        | causes an outage by mistake and then works to ensure that
        | would never happen again, he made a positive impact.
 
  | 1-6 wrote:
  | Perhaps reward structure should be changed to incentivize the
  | post-mortems. There could be several flaws that run
  | underreported otherwise.
  | 
  | We may run into the problem of everything being documented,
  | and possibly even deliberate acts, but for a service that
  | relies heavily on uptime, that's a small price to pay for a
  | bulletproof operation.
 
    | A4ET8a8uTh0 wrote:
    | Then we would drown in a sea of meetings and 'lessons
    | learned' emails. There is a reason for post-mortems, but
    | there has to be balance.
 
      | 1-6 wrote:
      | I find post-mortems interesting to read through especially
      | when it's not my fault. Most of them would probably be
      | routine to read through but there are occasional ones that
      | make me cringe or laugh.
      | 
      | Post-mortems can sometimes be thought of like safety
      | training. There is a big imbalance of time dedicated to
      | learning proper safety handling just for those small
      | incidents.
 
        | hinkley wrote:
        | Does Disney still play the "Instructional Videos" series
        | starring Goofy where he's supposed to be teaching you how
        | to do something and instead we learn how NOT to do
        | something? Or did I just date myself badly?
 
  | throwaway82931 wrote:
  | This fits with everything I've heard about terrible code
  | quality at Amazon and engineers working ridiculous hours to
  | close tickets any way they can. Amazon as a corporate entity
  | seems to be remarkably distrustful of and hostile to its labor
  | force.
 
  | mbordenet wrote:
  | When I worked for AMZN (2012-2015, Prime Video & Outbound
  | Fulfillment), attempting to sweep issues under the rug was a
  | clear path to termination. The Correction-Of-Error (COE)
  | process can work wonders in a healthy, data-driven, growth-
  | mindset culture. I wonder if the ex-Amazonian you're referring
  | to did not leave AMZN of their own accord?
  | 
  | Blame deflection is a recipe for repeat outages and unhappy
  | customers.
 
    | PragmaticPulp wrote:
    | > I wonder if the ex-Amazonian you're referring to did not
    | leave AMZN by their own accord?
    | 
    | Entirely possible, and something I've always suspected.
 
  | taf2 wrote:
  | What if they just can't access the console to update the status
  | page...
 
    | Slartie wrote:
    | They could still go into the data center, open up the status
    | page servers' physical...ah wait, what if their keyfobs don't
    | work?
 
  | soheil wrote:
  | This may not actually be that bad of a thing. If you think
  | about it, if they're fighting tooth and nail to keep the status
  | page green, that tells you they were probably doing that at
  | every step of the way before the failure became imminent. Gotta
  | have respect for that.
 
  | mrweasel wrote:
  | That's idiotic, the service is down regardless. If you foster
  | that kind of culture, why have a status page at all?
  | 
  | It makes AWS engineers look stupid, because it looks like they
  | are not monitoring their services.
 
    | mountainofdeath wrote:
    | The status page is as much a political tool as a technical
    | one. Giving your service a non-green state makes your entire
    | management chain responsible. You don't want to be the one
    | that upsets some VP's advancement plans.
 
    | nine_zeros wrote:
    | > It make AWS engineers look stupid, because it looks like
    | they are not monitoring their services.
    | 
    | Management.
 
| thefourthchime wrote:
| If someone needs to get to the console, you can make a url like
| this:
| 
| https://us-west-1.console.aws.amazon.com/
 
  | yabones wrote:
  | Works for a lot of things, but not Route53... Which is great
  | because that's the only thing I need to do in AWS today :)
 
| hulahoop wrote:
| I heard from my cousin who works at an Amazon warehouse that the
| conveyor belts stopped working and items were messed up and
| getting randomly removed off the belts.
 
| joshstrange wrote:
| This seems to be affecting Audible as well. I can't buy a book
| which sucks since I just finished the previous one in the series
| and I'm stuck in bed sick.
 
| griffinkelly wrote:
| Hosting and processing all the photos for the California
| International Marathon on EC2, this doesn't make dealing with
| impatient customers any easier.
 
| bennyp101 wrote:
| Yea, Amazon Music has gone down for me in the UK now :(
| 
| Looks like it might be getting worse
 
| [deleted]
 
| adamtester wrote:
| eu-west-1 is down for us
 
  | PeterBarrett wrote:
  | I hope you remain the only person who has said that; one
  | region being gone is enough for me!
 
    | adamtester wrote:
    | I should have said, only the Console and CLI was down for us,
    | our services remained up!
 
      | ComputerGuru wrote:
      | Console has lots of us-east-1 dependencies.
 
| strictfp wrote:
| "some customers may experience a slight elevation in error rates"
| --> everything is on fire
 
  | kello wrote:
  | ah corporate speak at its finest
 
  | Xenoamorphous wrote:
  | Maybe when the error rate hits 100% they'll say "error rate now
  | stable".
 
    | retbull wrote:
    | "Only direction is up"
 
  | hvgk wrote:
  | ECR and API are fucked so it's impossible to scale anything to
  | the point fire can come out :)
 
  | soco wrote:
  | I'm also experiencing a slight elevation in billing rates - got
  | alarms for 10x consumption and I can't check on them... Edit:
  | also API access is failing, terraform can't take anything down
  | because "connection was forcibly closed"
 
    | gchamonlive wrote:
    | Imagine triggering a big instance for machine learning or a
    | huge EMR cluster that would otherwise be short lived and not
    | being able to scale it down.
    | 
    | I am quite sure the AWS support will be getting many refund
    | requests over the course of the week.
 
| lordnacho wrote:
| This got me thinking, are there any major chat services that
| would go down if a particular AWS/GCP/etc data centre went down?
| 
| You don't want your service to go down, plus your team's comms at
| the same time.
 
  | tyre wrote:
  | Slack going down is a godsend for developer productivity.
 
  | wizwit999 wrote:
  | You should multi region something like that.
 
  | milofeynman wrote:
  | Remember when Facebook went down? Fb, Whatsapp, messenger,
  | Instagram were all down. Don't know what they use internally
 
  | DoctorOW wrote:
  | Slack is pretty much full AWS, I've been switched over to Teams
  | so I can't check.
 
    | rodiger wrote:
    | Slack is working fine for me
 
    | umanwizard wrote:
    | My company's Slack instance is currently fine.
 
  | ChadyWady wrote:
  | I'm impressed but Amazon Chime still appears to be working
  | right now. It's sad because this is the one service that could
  | go down and be a net benefit.
 
  | perydell wrote:
  | We have a SMS text thread with about 12 people that we send one
  | message on the first of every month. To make sure it is tested
  | and ready to be used for communications if all other comms
  | networks are down.
 
  | dahak27 wrote:
  | Especially if enough Amazon internal tools rely on it - would
  | be funny if there were a repeat of the FB debacle where Amazon
  | employees somehow couldn't communicate/get back into their
  | offices because of the problem they were trying to fix
 
    | umanwizard wrote:
    | Last I knew, Amazon used all Microsoft stuff for business
    | communication.
 
      | itsyaboi wrote:
      | Slack, as of last year.
      | 
      | https://slack.com/blog/news/slack-aws-drive-development-
      | agil...
 
        | nostrebored wrote:
        | And before that, Amazon Chime was the messaging and
        | conferencing tool. Now that I'm not using it, I actually
        | miss it a lot!
 
        | shepherdjerred wrote:
        | I cried tears of joy when Amazon finally switched to
        | Slack last year
 
        | manquer wrote:
        | Slack uses Chime for A/V under the hood so I don't think
        | it is all that different for non text.[1]
        | 
        | [1] https://www.theverge.com/2020/6/4/21280829/slack-
        | amazon-aws-...
 
| shaftoe444 wrote:
| Company wide can't log in to console. Many, many SNS and SQS
| errors in us-east-1.
 
  | _wldu wrote:
  | Same here and I use us-east-2.
 
| 2bitlobster wrote:
| I wonder what the cost to the US economy is
 
| taormina wrote:
| Have folks considered a class-action lawsuit against these
| blatantly fraudulent SLAs to recoup costs?
 
  | itsdrewmiller wrote:
  | In my experience, despite whatever is published, companies will
  | privately acknowledge and pay their SLA terms. (Which still
  | only gets you, like, one day's worth of reimbursement if you're
  | lucky.)
 
    | adrr wrote:
    | Retail SLAs are a small risk compared to the enterprise SLAs
    | where an outage like this could cost Amazon tens of millions.
    | I assume these contracts have discount tiers based on
    | availability and anything below 99% would be a 100% discount
    | for that bill cycle.
 
      | jaywalk wrote:
      | But those enterprise SLAs are the ones they'll be paying
      | out. Retail SLAs are the ones that you'll have to fight
      | for.
 
| larrik wrote:
| We are having a number of rolling issues, but the site is sort of
| up? I worry it'll get worse before it gets better.
| 
| Nothing on their status page. But the Console is not working.
 
  | commandlinefan wrote:
  | We're seeing some stuff that's up, some stuff that's down, and
  | some of the stuff that was up a little while ago is down now.
  | It's getting worse as of 9:53 AM CST.
 
| 7six wrote:
| eu-central-1 as well
 
| htrp wrote:
| Looks like all aws internal APIs are down....
 
  | taf2 wrote:
  | status page looks very green even 17 minutes later...
 
| ZebusJesus wrote:
| Me thinks Venmo uses AWS because they are down as well. Status
| gator has AWS as on off on off on off. I can access my servers
| hosted in the west coast but I cannot access the AWS console,
| this is making for an interesting morning.
 
| errcorrectcode wrote:
| Alexa (jokes and flash briefing) is currently partially down for
| me. Skills and routines are working.
| 
| Amazon customer service can't help me handle an order.
| 
| If us-east-1 is out, then half of Amazon is out too.
 
| stoneham_guy wrote:
| AWS Management Console Home page is currently unavailable.
| 
| That's the error I am getting when logging in to the AWS console.
 
| soheil wrote:
| Can't even log in. This is the error I'm getting:
| Internal Error       Please try again later
 
| tmarice wrote:
| AWS OpenSearch is also returning 500 in us-east-1.
 
| WesolyKubeczek wrote:
| Our instances are up (when I poke with SSH, say), but the console
| itself is under the weather.
 
| swasheck wrote:
| they're a bit later than normal with their large annual post-
| thanksgiving us-east-1 outage
 
| chazu wrote:
| ECR borked for us in east-1
 
  | the-rc wrote:
  | You can get new tokens. Image pulling times out after ~30s,
  | which tells me that maybe ECR is actually up, but it can't
  | verify the caller's credentials or access image metadata from
  | some other internal service. It's probably something low level
  | that crashed, taking down anything built above it.
 
    | the-rc wrote:
    | Actually, images that do not exist will return the
    | appropriate error within a few seconds, so it's really timing
    | out when talking to the storage layer or similar.
 
| mancerayder wrote:
| The Personal Health Dashboard is unhealthy. It says unknown
| error, failure or exception.
| 
| They need a monitor for the monitoring.
 
| albatross13 wrote:
| Welp, this is awkward (._. )
 
| echlipse wrote:
| imdb.com is down right now. But https://www.imdb.com/chart/top is
| reachable right now. Strange.
| 
| https://imgur.com/a/apBT86o
 
| sakopov wrote:
| AWS Management console is dead/dying. Numerous errors across
| major services like S3 and EC2 in us-east-1. This looks pretty
| bad.
 
| zedpm wrote:
| Yep, PHD isn't loading, Cloudwatch is reporting SQS errors,
| metrics aren't loading, can't pull logs. This is in US-East-1.
 
| ec109685 wrote:
| Surprised Netflix is down. I thought they were hot/hot multi-
| region: https://netflixtechblog.com/active-active-for-multi-
| regional...
 
  | ManuelKiessling wrote:
  | Just watched two episodes of Better Call Saul on Netflix
  | Germany without issues (while not being able to run my
  | Terraform plan against my eu-central-1 infrastructure...).
 
| PaulHoule wrote:
| Makes me glad I am in us-east-2.
 
  | heyitsguay wrote:
  | I'm having issues with us-east-2 now. Console is down, then
  | when I try to sign into a particular service I just get "please
  | try again later".
 
    | muttantt wrote:
    | It's related to east-1, looks like some single point of
    | failure not letting you access east-2 console URLs
 
  | newhouseb wrote:
  | We're flapping pretty hard in us-east-2 (looks API Gateway
  | related, which is probably because it's an edge deployment
  | which has a bunch of us-east-1 dependencies).
 
  | muttantt wrote:
  | us-east-2 is truly a hidden gem
 
    | kohanz wrote:
    | Except that it had a significant outage just a couple of
    | weeks ago. Source: most of our stuff is on us-east-2.
 
| crad wrote:
| We make heavy usage of Kinesis Firehose in us-east-1.
| 
| Issues started ~1:24am ET and resolved around 7:31am ET.
| 
| Then really kicked in at a much larger scale at 10:32am ET.
| 
| We're now seeing failures with connections to RDS Postgres and
| other services.
| 
| Console is completely unavailable to me.
 
  | m3nu wrote:
  | Route53 is not updating new records. Console is also out.
 
  | dylan604 wrote:
  | >Issues started ~1:24am ET and resolved around 7:31am ET.
  | 
  | First engineer found a clever hack using bubble gum
  | 
  | >Then really kicked in at a much larger scale at 10:32am ET.
  | 
  | Bubble gum dried out, and the connector lost connection again.
  | Now, connector also fouled by the gum making a full replacement
  | required.
 
  | grumple wrote:
  | Kinesis was the cause last Thanksgiving too iirc. It's the
  | backbone of many services.
 
| [deleted]
 
| bilalq wrote:
| Some advice that may help:
| 
| * Visit the console directly from another region's URL (e.g.,
| https://us-east-2.console.aws.amazon.com/console/home?region...).
| You can try this after you've successfully signed in but see the
| console failing to load as well.
| 
| * If your AWS SSO app is hosted in a region other than us-east-1,
| you're probably fine to continue signing in with other
| accounts/roles.
| 
| Of course, if all your stuff is in us-east-1, you're out of luck.
| 
| EDIT: Removed incorrect advice about running AWS SSO in multiple
| regions.
 
  | binaryblitz wrote:
  | I don't think you can run SSO in multiple regions on the same
  | AWS account.
 
    | bilalq wrote:
    | Thanks, corrected.
 
  | staticassertion wrote:
  | > Might also be a good idea to run AWS SSO in multiple regions
  | if you're not already doing so.
  | 
  | Is this possible?
  | 
  | > AWS Organizations only supports one AWS SSO Region at a time.
  | If you want to make AWS SSO available in a different Region,
  | you must first delete your current AWS SSO configuration.
  | Switching to a different Region also changes the URL for the
  | user portal. [0]
  | 
  | This seems to indicate you can only have one region.
  | 
  | [0]
  | https://docs.aws.amazon.com/singlesignon/latest/userguide/re...
 
    | bilalq wrote:
    | Good call. I just assumed you could for some reason. I guess
    | the fallback is to devise your own SSO implementation using
    | STS in another region if needed.
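    | 
    | Something like this rough boto3 sketch, assuming you still have
    | some long-lived credentials configured locally (the role ARN
    | and session name here are made up), and pinning STS to a
    | regional endpoint so nothing has to go through us-east-1:
    | 
    |     import boto3
    | 
    |     # The global sts.amazonaws.com endpoint lives in us-east-1,
    |     # so force a regional one.
    |     sts = boto3.client(
    |         "sts",
    |         region_name="us-west-2",
    |         endpoint_url="https://sts.us-west-2.amazonaws.com",
    |     )
    |     creds = sts.assume_role(
    |         RoleArn="arn:aws:iam::123456789012:role/BreakGlass",
    |         RoleSessionName="outage-fallback",
    |     )["Credentials"]
    | 
    |     # Use the temporary credentials against a healthy region
    |     ec2 = boto3.client(
    |         "ec2",
    |         region_name="us-west-2",
    |         aws_access_key_id=creds["AccessKeyId"],
    |         aws_secret_access_key=creds["SecretAccessKey"],
    |         aws_session_token=creds["SessionToken"],
    |     )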
 
| zackbloom wrote:
| I'm now getting failures searching for products on Amazon.com
| itself. This is somewhat surprising, as the narrative always was
| that Amazon didn't do a great job of dogfooding their own cloud
| platform.
 
  | di4na wrote:
  | They did more of it starting a few years back. It has been
  | interesting to see how some services evolved far faster when
  | retail started to use them. Seems that some customers are far
  | more "centric" than others, if you catch my drift...
 
  | whoknowswhat11 wrote:
  | My Amazon order history showed no orders, but now is showing my
  | orders again - so stuff seems to be getting either fixed or
  | intermittent outages.
 
  | blahyawnblah wrote:
  | Doesn't amazon.com run on us-east-1?
 
  | zackbloom wrote:
  | Update: I'm also getting Internal Errors trying to log into the
  | Amazon.com site now as well.
 
| mabbo wrote:
| Are the actual _services_ down, or is it just the console and/or
| login page?
| 
| For example, the sign-up page appears to be working:
| https://portal.aws.amazon.com/billing/signup#/start
| 
| Are websites that run on AWS us-east up? Are the AWS CLIs
| working?
 
  | pavel_lishin wrote:
  | We're seeing issues with EventBridge, other folks are having
  | trouble reaching S3.
  | 
  | Looks like actual services.
 
  | grumple wrote:
  | I can tell you that some processes are not running, possibly
  | due to SQS or SWF problems. Previous outages of this scale were
  | caused by Kinesis outages. Can't connect via aws login at the
  | cli either since we use SSO and that seems to be down.
 
  | meepmorp wrote:
  | EventBridge, CloudWatch. I've just started getting session
  | errors with the console, too.
 
  | 0xCMP wrote:
  | Using cli to describe instances isn't working. Instances
  | themselves seem fine so far.
 
  | Waterluvian wrote:
  | My ECS, EC2, Lambda, load balancer, and other services on us-
  | east-1 still function. But these outages can sometimes
  | propagate over time rather than instantly.
  | 
  | I cannot access the admin console.
 
  | snewman wrote:
  | Anecdotally, we're seeing a small number of 500s from S3 and
  | SQS, but mostly our service (which is at nontrivial scale, but
  | mostly just uses EC2, S3, DynamoDB, and some basic network
  | facilities including load balancers) seems fine, knock on wood.
  | Either the problem is primarily in more complex services, or it
  | is specific to certain AZs or shards or something.
 
  | 2bitlobster wrote:
  | Interactive Video Service (IVS) is down too
 
  | Guest19023892 wrote:
  | One of my sites went offline an hour ago because the web server
  | stopped responding. I can't SSH into it or get any type of
  | response. The database server in the same region and zone is
  | continuing to run fine though.
 
    | bijoo wrote:
    | Interesting, is the site on a particular type of EC2
    | instance, e.g. bare metal? I see c4.xlarge is doing fine in
    | us-east-1.
 
      | Guest19023892 wrote:
      | It's just a t3a.nano instance since it's a project under
      | development. However, I have a high number of t3a.nano
      | instances in the same region operating as expected. This
      | particular server has been running for years, so although
      | it could be a coincidence it just went offline within
      | minutes of the outage starting, it seems unlikely.
      | Hopefully no hardware failures or corruption, and it'll
      | just need a reboot once I can get access to AWS again.
 
  | dangrossman wrote:
  | My website that runs on US-East-1 is up.
  | 
  | However, my Alexa (Echo) won't control my thermostat right now.
  | 
  | And my Ring app won't bring up my cameras.
  | 
  | Those services are run on AWS.
 
    | kingcharles wrote:
    | Now I'm imagining someone dying because they couldn't turn
    | their heating on because AWS. The 21st Century is fucked up.
 
    | [deleted]
 
  | SEMW wrote:
  | Definitely not just the console. We had hundreds of thousands
  | of websocket connections to us-east-1 drop at 15:40, and new
  | websocket connections to that region are still failing.
  | (Luckily not a huge impact on our service cause we run in 6
  | other regions, but still).
 
    | andrew_ wrote:
    | Side question: How happy are you with API Gateway's WebSocket
    | service?
 
      | SEMW wrote:
      | No idea, we don't use it. These were websocket connections
      | to processes on ec2, via NLB and cloudfront. Not sure
      | exactly what part of that chain was broken yet.
 
        | zedpm wrote:
        | This whole time I've been seeing intermittent timeouts
        | when checking a UDP service via NLB; I've been wondering
        | if it's general networking trouble or something
        | specifically with the NLB. EC2 hosts are all fine, as far
        | as I can tell.
 
  | sophacles wrote:
  | I wasn't able to load my amazon.com wishlist, nor the shopping
  | page through the app. Not an aws service specifically, but an
  | amazon service that I couldn't use.
 
  | heartbreak wrote:
  | I'm getting blank pages from Amazon.com itself.
 
  | nowahe wrote:
  | I can't access anything related to Cloudfront, either through
  | the CLI or console:
  | 
  |     $ aws cloudfront list-distributions
  | 
  |     An error occurred (HttpTimeoutException) when calling the
  |     ListDistributions operation: Could not resolve DNS within
  |     remaining TTL of 4999 ms
  | 
  | However I can still access the distribution fine
 
  | lambic wrote:
  | We've had reports of some intermittent 500 errors from
  | cloudfront, apart from that our sites are up.
 
  | bijoo wrote:
  | I see that running EC2 instances are doing fine. However,
  | starting stopped instances through the AWS SDK fails with an
  | HTTP 500 error, even for the EC2 service itself. The CLI should
  | be getting the same HTTP 500s, since it likely hits the same API
  | as the SDK.
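  | 
  | For what it's worth, here's a minimal sketch of the failing call
  | with more aggressive client-side retries as a stopgap (instance
  | ID and retry numbers are made up; this just retries harder, it
  | doesn't fix anything on AWS's side):
  | 
  |     import boto3
  |     from botocore.config import Config
  | 
  |     # Retry throttling/5xx responses more than the default
  |     ec2 = boto3.client(
  |         "ec2",
  |         region_name="us-east-1",
  |         config=Config(retries={"max_attempts": 10,
  |                                "mode": "adaptive"}),
  |     )
  | 
  |     # The operation that is currently returning HTTP 500s
  |     ec2.start_instances(InstanceIds=["i-0123456789abcdef0"])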
 
| bkirkby wrote:
| fwiw, we are seeing errors when trying to publish to SNS although
| the aws status pages say nothing about SNS.
 
| saggy4 wrote:
| It seems that only the console is having problems; the CLI works
| fine.
 
  | rhines wrote:
  | CLI for EC2 works for me, but not ELB.
 
  | MatthewCampbell wrote:
  | CloudFormation changesets are reporting "InternalFailure" for
  | us in us-east-1.
 
  | albatross13 wrote:
  | Not entirely true- we federate through ADFS and `saml2aws
  | login` is currently failing with:
  | 
  | error logging into aws role using saml assertion: error
  | retrieving STS credentials using SAML: ServiceUnavailable:
  | status code: 503
 
  | zedpm wrote:
  | I'm having CLI issues as well, they're using the same APIs
  | under the hood. For example, I'm getting 503 errors for
  | cloudwatch DescribeLogGroups.
 
    | saggy4 wrote:
    | Tried a few CLI commands; they seem to be working fine for me.
    | Maybe it is not hitting everyone, or maybe it is just the start
    | of something worse. :(
 
      | technics256 wrote:
      | try aws ecr describe-registry and you will get an error
 
        | saggy4 wrote:
        | Yes indeed, getting failures in the CLI as well.
 
| [deleted]
 
| 999900000999 wrote:
| What if it never comes back up ?
 
| zegl wrote:
| I love that every time this happens, 100% of the services on
| https://status.aws.amazon.com are green.
 
  | daniel-s wrote:
  | That page is not loading for me... on which region is it
  | hosted?
 
  | qudat wrote:
  | Status pages are hard
 
    | mrweasel wrote:
    | Not if you're AWS. At this point I'm fairly sure their status
    | page is just static HTML that always shows all green.
 
      | siva7 wrote:
      | Well, it is.
 
    | jtdev wrote:
    | Why? Twitter and HN can tell me that AWS is having an outage,
    | why can't AWS?
 
    | AH4oFVbPT4f8 wrote:
    | They sent their CEO into space, I am sure they have the
    | resources to figure it out.
 
    | cr3ative wrote:
    | When they have too much pride in an all-green dash, sure.
    | Allowing any engineer to declare a problem when first
    | detected? Not so hard, but it doesn't make you look good if
    | you have an ultra-twitchy finger. They have the balance badly
    | wrong at the moment though.
 
      | lukeschlather wrote:
      | A trigger-happy status page gives realtime feedback for
      | anyone doing a DoS attack. Even if you published that
      | information publicly you would probably want it on a
      | significant delay.
 
    | pid-1 wrote:
    | More like admitting failure is hard.
 
    | smt88 wrote:
    | No they're not.
    | 
    | Step 1: deploy status checks to an external cloud.
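    | 
    | Even a dumb external probe, run from a different provider,
    | catches the "status page is green but nothing works" case.
    | Sketch (the URLs are placeholders):
    | 
    |     import urllib.request
    | 
    |     ENDPOINTS = {
    |         "api": "https://api.example.com/healthz",
    |         "console": "https://console.example.com/",
    |     }
    | 
    |     def up(url, timeout=5):
    |         try:
    |             with urllib.request.urlopen(url, timeout=timeout) as r:
    |                 return 200 <= r.status < 300
    |         except Exception:
    |             return False  # timeouts, DNS failures, 5xx, ...
    | 
    |     for name, url in ENDPOINTS.items():
    |         print(name, "ok" if up(url) else "DOWN")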
 
      | kube-system wrote:
      | I agree, but it does come with increased challenges with false
      | positives.
      | 
      | That being said, AWS status pages _are_ up.
 
    | wruza wrote:
    | "Falsehoods Programmers Believe About Status Pages"
 
  | 0xmohit wrote:
  | No wonder IMDB is down (returning 503).
  | Sad that Amazon engineers don't implement what they teach their
  | customers -- designing fault-tolerant and highly available
  | systems.
 
  | barbazoo wrote:
  | It seems they updated it ~30 minutes after your comment.
 
  | judge2020 wrote:
  | I don't see why they couldn't provide an error rate graph like
  | Reddit[0] or simply make services yellow saying "increased
  | error rate detected, investigating..."
  | 
  | 0: https://www.redditstatus.com/#system-metrics
 
    | VWWHFSfQ wrote:
    | because nobody cares when reddit is down. or at least, nobody
    | is paying them to be up 99.999% of the time.
 
    | willcipriano wrote:
    | An executive has an OKR around uptime, and an automated system
    | prevents him or her from having control over the messaging.
    | Therefore any effort to create one is squashed, leaving the
    | people requesting it confused as to why and left without any
    | explanation. Oldest story in the book.
 
    | jkingsman wrote:
    | Because Amazon has $$$$$ riding on their SLAs, and it costs them
    | through the nose every minute they're down in payments made
    | to customers and fees refunded. I trust them and most
    | companies not to be outright fraudulent (although I'm sure
    | some are), but it's totally understandable they'd be reluctant
    | to push the "Downtime Alert/Cost Us a Ton of Money" button
    | until they're sure something serious is happening.
 
      | jolux wrote:
      | It should be costing them trust not to push it when they
      | should though. A trustworthy company will err on the side
      | of pushing it. AWS is a near-monopoly, so their
      | unprofessional business practices have still yet to cost
      | them.
 
        | ethbr0 wrote:
        | > _It should be costing them trust not to push it when
        | they should though._
        | 
        | This is what Amazon, the startup, understood.
        | 
        | Step 1: _Always_ make it right and make the customer
        | happy, even if it hurts in $.
        | 
        | Step 2: If you find you're losing too much money over a
        | particular issue, _fix the issue_.
        | 
        | Amazon, one of the world's largest companies, seems to
        | have forgotten that the risk of not reporting accurately
        | isn't money, but _breaking the feedback chain_. Once you
        | start gaming metrics, no leaders know what's really
        | important to work on internally, because no leaders know
        | what the actual issues are. It's late Soviet Union in a
        | nutshell. If everyone is gaming the system at all levels,
        | then eventually the ability to objectively execute
        | decreases, because effort is misallocated due to
        | misunderstanding.
 
        | Kavelach wrote:
        | > It's late Soviet Union in a nutshell
        | 
        | How come an action of a private company in a capitalist
        | country is like the Soviet Union?
 
        | jolux wrote:
        | Private companies are small centrally-planned economies
        | within larger capitalist systems.
 
      | dhsigweb wrote:
      | I can Google and see how many apps, games, or other
      | services are down. So them not "pushing some buttons" to
      | confirm it isn't fooling anyone.
 
      | btilly wrote:
      | This is an incentive to dishonesty, leading to fraudulent
      | payments and false advertising of uptime to potential
      | customers.
      | 
      | Hopefully it results in a class action lawsuit for enough
      | money that Amazon decides that an automated system is
      | better than trying to supply human judgement.
 
        | jenkinstrigger wrote:
        | Can someone just have a site ping all the GET endpoints
        | on the AWS API? That is very far from "automating [their
        | entire] system" but it's better than what they're doing.
 
        | tynorf wrote:
        | Something like this? https://stop.lying.cloud/
 
      | lozenge wrote:
      | It literally is fraudulent though.
      | 
      | I don't think a region being down is something that you can
      | be unsure about.
 
        | hamburglar wrote:
        | Oh, you can get pretty weaselly about what "down" means.
        | If there is "just" an S3 issue, are all the various
        | services which are still "available" but throwing an
        | elevated number of errors because of their own internal
        | dependency on S3 actually down or just "degraded?" You
        | have to spin up the hair-splitting apparatus early in the
        | incident to try to keep clear of the post-mortem party.
        | :D
 
    | w0m wrote:
    | The more transparency you give, the harder it is to control
    | the narrative. They have a general reputation for
    | reliability, and exposing just how many actual
    | errors/failures there are (which generally don't affect a
    | large swath of users/use cases) would hurt that reputation
    | for minimal gain.
 
  | sakopov wrote:
  | Those five 9s don't come easy. Sometimes you have to prop them
  | up :)
 
    | 1-6 wrote:
    | It's hard to measure what five 9's is because you have to wait
    | around until a 0.00001 occurs. Incentivizing post-mortems is
    | absolutely critical in this case.
 
      | notinty wrote:
      | It's 0.001 as a percentage; the first two 9's make up the
      | whole-number part.
      | 
      |     5N  = 99.999%
      |     3N  = 99.9%
      |     1N5 = 95%
      | 
      | 3N allows <43m12s of downtime per month; 5N allows only about
      | 26 seconds.
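      | 
      | Quick sanity check, assuming a 30-day month:
      | 
      |     MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200
      |     nines = {"3N": 0.999, "4N": 0.9999, "5N": 0.99999}
      | 
      |     for label, avail in nines.items():
      |         down = MINUTES_PER_MONTH * (1 - avail)
      |         print(f"{label}: {down:.2f} min (~{down * 60:.0f} s)/month")
      | 
      |     # 3N: 43.20 min, 4N: 4.32 min (~259 s), 5N: 0.43 min (~26 s)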
 
        | 1-6 wrote:
        | I considered writing it as a percent but then decided
        | against using it and moving the decimal instead. But good
        | info for clarification.
 
    | JoelMcCracken wrote:
    | Every time someone asks to update the status page, managers
    | say "nein"
 
    | jjoonathan wrote:
    | I wonder how often outages really happen. The official page
    | is nonsense, of course, and we only collectively notice when
    | the outage is big enough that lots of us are affected. On
    | AWS, I see about a 3:1 ratio of "bump in the night" outages
    | (quickly resolved, little corroboration) to mega too-big-to-
    | hide outages. Does that mirror others' experiences?
 
      | Spivak wrote:
      | If you count any time AWS is having a problem that impacts
      | our production workloads then I think it's about 5:1.
      | Dealing with "AWS is down" outages are easy because I can
      | just sit back and grab some popcorn, it's the "dammit I
      | know this is AWS's fault" outages that are a PITA because
      | you count yourself lucky to even get a report in your
      | personalized dashboard.
 
        | jjoonathan wrote:
        | Yep.
        | 
        | Random aside: any chance you are related to the Calculus
        | on Manifolds Spivak?
 
        | Spivak wrote:
        | Nope, just a fan. It was the book that pioneered my love
        | of math.
 
        | clh1126 wrote:
        | I had to log in to say that one of my favorite quotes of
        | all time is one I found in Calculus on Manifolds.
        | 
        | He says that any good theorem is worth generalizing, and
        | I've generalized that to any life rule.
 
    | kylemh wrote:
    | https://aws.amazon.com/compute/sla/
    | 
    | looks like only four 9's
 
      | dotancohen wrote:
      | > looks like only four 9's
      | 
      | That's why the Germans are such good engineers.
      | 
      |     Did the drives fail? Nein.
      |     Did the CPU overheat? Nein.
      |     Did the power get cut? Nein.
      |     Did the network go down? Nein.
      | 
      | That's "four neins" right there.
 
  | [deleted]
 
  | [deleted]
 
  | NicoJuicy wrote:
  | Not right now. I think they monitor if it appears on HN too.
 
  | swiftcoder wrote:
  | When I worked there it required the signoff of both your VP-
  | level executive and the comms team to update the status page. I
  | do not believe I ever received said signoff before the issues
  | were resolved.
 
  | queuebert wrote:
  | Are they lying, or just prioritizing their own services?
 
    | _verandaguy wrote:
    | Willing to bet the status page gets updated by logic on us-
    | east-1
 
    | itsyaboi wrote:
    | Status service is probably hosted in us-east-1
 
    | AH4oFVbPT4f8 wrote:
    | amazon.com seems to be having problems too. I get a "something
    | went wrong" page with a new design/layout, which I assume is
    | either new or a fail-safe.
 
      | KineticLensman wrote:
      | Looks okay right now to this UK user of amazon.co.uk
 
        | btilly wrote:
        | It depends on which Amazon region you are being served
        | from.
        | 
        | It is very unlikely that Amazon would deliberately make
        | your messages cross the Atlantic just to find an American
        | region that is unable to serve you.
 
    | bennyp101 wrote:
    | https://music.amazon.co.uk is giving me an error since about
    | 16:30 GMT
    | 
    | "We are experiencing an error. Our apologies - We will be
    | back up soon."
 
  | [deleted]
 
  | gbear0 wrote:
  | I assume each service has its own health check that checks the
  | service is accessible from an internal location, thus most are
  | green. However, when Service A requires Service B to do work,
  | but Service B is down, a simple access check on Service A
  | clearly doesn't give a good representation of uptime.
  | 
  | So what should a good health check actually report these days? Is
  | it just its own status, or should it include a breakdown of the
  | status of external dependencies as part of its rolled-up status?
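  | 
  | One pattern that seems reasonable (just a sketch, the dependency
  | names and checks are made up): report your own liveness and a
  | per-dependency breakdown separately, so a green "self" can't hide
  | a red dependency:
  | 
  |     import json
  | 
  |     def check_db():      # placeholder for a real probe
  |         return True
  | 
  |     def check_queue():   # placeholder for a real probe
  |         return False
  | 
  |     def health():
  |         deps = {"database": check_db(), "queue": check_queue()}
  |         return {
  |             "self": "ok",  # this process is up and answering
  |             "dependencies": {k: "ok" if ok else "degraded"
  |                              for k, ok in deps.items()},
  |             "overall": "ok" if all(deps.values()) else "degraded",
  |         }
  | 
  |     print(json.dumps(health(), indent=2))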
 
  | goshx wrote:
  | I remember the time when S3 went down and took the status page
  | down with it
 
  | notreallyserio wrote:
  | Makes you wonder if they have to manually update the page when
  | outages occur. That'd be a pretty bad way to go, so I'd hope
  | not. Maybe the code to automatically update the page is in us-
  | east-1? :)
 
    | gromann wrote:
    | Word on the street is the status page is just a JPG
 
    | wfleming wrote:
    | Something like that has impacted the status page in the past.
    | There was a severe Kinesis outage last year
    | (https://aws.amazon.com/message/11201/), and they couldn't
    | update the service dashboard for quite a while because their
    | tool to manage the service dashboard lives in us-east-1
    | and depends on Kinesis.
 
  | JohnJamesRambo wrote:
  | > Goodhart's Law is expressed simply as: "When a measure
  | becomes a target, it ceases to be a good measure."
  | 
  | It's very frustrating. Why even have them?
 
    | Spivak wrote:
    | Because "uptime" and "nines" became a marketing term. Simple
    | as that. But the problem is that any public-facing measure of
    | availability becomes a de facto marketing term.
 
      | hinkley wrote:
      | Also 4-5 nines is virtually impossible for complex systems,
      | so the sort of responsible people who could make 3 nines
      | true begin to check out, and now you're getting most of
      | your info from the delusional, and you're lucky if you
      | manage 2 objective nines.
 
      | Enginerrrd wrote:
      | The older I get the more I hate marketers. The whole field
      | stands on the back of war-time propaganda research and it
      | sure feels like it's the cause of so much rot in society.
 
  | jbavari wrote:
  | Well yea, it's eventually consistent ;)
 
  | jrochkind1 wrote:
  | Even better, when I try to go to console, I get:
  | 
  | > AWS Management Console Home page is currently unavailable.
  | 
  | > You can monitor status on the AWS Service Health Dashboard.
  | 
  | "AWS Service Health Dashboard" is a link to
  | status.aws.amazon.com... which is ALL GREEN. So... thanks for
  | the suggestion?
  | 
  | At this point the AWS service health dashboard is kind of
  | famous for always being green, isn't it? It's a joke to its
  | users. Do the folks who work on the relevant AWS internal
  | team(s) know this, and just not have the resources to do
  | anything about it, or what? If it's a harder problem than you'd
  | think for interesting technical reasons, that'd be interesting
  | to hear about.
 
  | kortex wrote:
  | It's like trying to get the truth out of a kid that caused some
  | trouble.
  | 
  | Mom: Alexa, did you break something?
  | 
  | Alexa: No.
  | 
  | M: Really? What's this? _500 Internal server error_
  | 
  | A: ok maybe management console is down
  | 
  | M: Anything else?
  | 
  | A: ...
  | 
  | A: ... ok maybe cloudwatch logs
  | 
  | M: Ah hah. What else?
  | 
  | A: That's it, I swear!
  | 
  | M: _503 ClientError_
  | 
  | A: ...well okay secretsmanager might be busted too...
 
    | hinkley wrote:
    | There was a great response in r/relationship advice the other
    | day where someone said that OP's partner forced a fight
    | because they're planning to cheat on them, reconcile, and
    | then will 'trickle out the truth' over the next 6 months. I'm
    | stealing that phrase.
 
    | mdni007 wrote:
    | Funny I literally just asked my Alexa.
    | 
    | Me: Alexa, is AWS down right now?
    | 
    | Alexa: I'd rather not answer that
 
      | hinkley wrote:
      | Wise robot.
      | 
      | That's a bit like involving your kid in an argument between
      | parents.
 
    | PopeUrbanX wrote:
    | The very expensive EC2 instance I started this morning still
    | works. Of course now I can't shut it down.
 
  | ta20200710 wrote:
  | EC2 or S3 showing red in any region literally requires personal
  | approval of the CEO of AWS.
 
    | dekhn wrote:
    | Uhhhhh... what if the monitoring said it was hard down?
    | They'd still not show red?
 
      | choeger wrote:
      | Probably they cannot. They outsourced this dashboard and it
      | runs on AWS now ;).
 
    | dia80 wrote:
    | Unfortunately, errors don't require his approval...
 
    | notreallyserio wrote:
    | Is this true or a joke? This sort of policy is how you
    | destroy trust.
 
      | marcosdumay wrote:
      | If you trust them at this point, you have not been paying
      | attention, and you will probably continue to trust them after
      | this.
 
      | bsedlm wrote:
      | maybe we gotta consider the publicly facing status pages as
      | something other than a technical tool (e.g. marketing or PR
      | or something like that, dunno)
 
      | jeffrallen wrote:
      | Well, no big deal, there's not really a lot of trust there
      | to destroy...
 
      | jedberg wrote:
      | From what I've heard it's mostly true. Not only the CEO but
      | a few SVPs can approve it, but yes a human must approve the
      | update and it must be a high level exec.
      | 
      | Part of the reason is because their SLAs are based on that
      | dashboard, and that dashboard going red has a financial
      | cost to AWS, so like any financial cost, it needs approval.
 
        | orangepurple wrote:
        | Being dishonest about SLAs seems to bear zero cost in
        | this case?
 
        | solatic wrote:
        | Zero directly-attributable, calculable-at-time-of-
        | decision cost. Of course there's a cost in terms of
        | customers who leave because of the dishonest practice,
        | but, who knows how many people that'll be? Out of the
        | customers who left after the outage, who knows whether
        | they left due to not communicating status promptly and
        | honestly or whether it was for some other reason?
        | 
        | Versus, if a company has X SLA contracts signed, that
        | point to Y reimbursement for being out for Z minutes, so
        | it's easily calculable.
 
        | jedberg wrote:
        | It's not really dishonest though because there is nuance.
        | Most everything in EC2 is still working it seems, just
        | the console is down. So is it really down? It should
        | probably be yellow but not red.
 
        | dekhn wrote:
        | if you cannot access the control plane to create or
        | destroy resources, it is down (partial availability). The
        | jobs that are running are basically zombies.
 
        | w0m wrote:
        | Depending on the workload being run, users may or may not
        | notice. Should be Yellow at a minimum.
 
        | jedberg wrote:
        | Seems like the API is still working and so is auto
        | scaling. So they aren't really zombies.
        | 
        | Partial availability isn't the same as no availability.
 
        | electroly wrote:
        | The API is NOT working -- it may not have been listed on
        | the service health dashboard when you posted that, but it
        | is now. We haven't been able to launch an instance at
        | all, and we are continuously trying. We can't even start
        | existing instances.
 
        | dekhn wrote:
        | I'm right in the middle of an AWS-run training and we
        | literally can't run the exercises because of this.
        | 
        | let me repeat that: my AWS training that is run by AWS
        | that I pay AWS for isn't working, because AWS is having
        | control plane (or other) issues. This is several hours
        | after the initial incident. We're doing training in us-
        | west-2, but the identity service and other components run
        | in us-east-1.
 
        | justrudd wrote:
        | I'm running EKS in us-west-2. My pods use a role ARN and
        | identity token file to get temporary credentials via STS.
        | STS can't return credentials right now. So my EKS cluster
        | is "down" in the sense that I can't bring up new pods. I
        | only noticed because an auto-scaling event failed.
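        | 
        | For the curious, this is roughly the exchange the SDK does
        | under the hood with those env vars (the session name is
        | arbitrary). Depending on SDK version/config it may be
        | hitting the global sts.amazonaws.com endpoint, which lives
        | in us-east-1, so pinning a regional endpoint is the thing
        | worth checking:
        | 
        |     import os, boto3
        | 
        |     sts = boto3.client(
        |         "sts",
        |         region_name="us-west-2",
        |         endpoint_url="https://sts.us-west-2.amazonaws.com",
        |     )
        |     creds = sts.assume_role_with_web_identity(
        |         RoleArn=os.environ["AWS_ROLE_ARN"],
        |         RoleSessionName="eks-pod",
        |         WebIdentityToken=open(
        |             os.environ["AWS_WEB_IDENTITY_TOKEN_FILE"]).read(),
        |     )["Credentials"]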
 
        | dekhn wrote:
        | We ran through the whole 4.5 hour training and the
        | training app didn't work the entire time.
 
        | jjoonathan wrote:
        | "Good at finding excuses" is not the same thing as
        | "honest."
 
        | paulryanrogers wrote:
        | SNS seems to be at least partially down as well
 
        | jtheory wrote:
        | My company relies on DynamoDB, so we're totally down.
        | 
        | edit: partly down; it's sporadically failing
 
        | jrochkind1 wrote:
        | Heroku is currently having major problems. My stuff is
        | still up, but I can't deploy any new versions. Heroku
        | runs their stuff on AWS. I have heard reports of other
        | companies who run on AWS also having degraded service and
        | outages.
        | 
        | I'd say when other companies who run their infrastructure
        | on AWS are going down, it's hard to argue it's not a real
        | outage.
        | 
        | But AWS status _has_ changed to yellow at this point.
        | Probably heroku could be completely down because of an
        | AWS problem, and AWS status would still not show red. But
        | at least yellow tells us there's a problem, the
        | distinction between yellow and red probably only matters
        | at this point to lawyers arguing about the AWS SLA, the
        | rest of us know yellow means "problems", red will never
        | be seen, and green means "maybe problems anyway".
        | 
        | I believe all of us-east-1 could be entirely missing,
        | and they'd still only put a yellow, not a red, on the status
        | page. After all, the other regions are all fine, right?
 
        | dekhn wrote:
        | Sure, but... that just raises more questions :)
        | 
        | Taken literally what you are saying is the service could
        | be down and an executive could override that, preventing
        | them from paying customers for a service outage, even if
        | the service did have an outage and the customer could
        | prove it (screenshots, metrics from other cloud
        | providers, many different folks see it).
        | 
        | I'm sure there is some subtlety to this, but it does mean
        | that large corps with influence should be talking to AWS
        | to ensure that status information corresponds with actual
        | service outages.
 
        | [deleted]
 
        | emodendroket wrote:
        | I have no inside knowledge or anything but it seems like
        | there are a lot of scenarios with degraded performance
        | where people could argue about whether it really
        | constitutes an outage.
 
        | dilyevsky wrote:
        | One time GCP argued that since they did return 404s on
        | GCS for a few hours, that wasn't an uptime/latency SLA
        | violation, so we were not entitled to a refund (though they
        | refunded us anyway)
 
        | Enginerrrd wrote:
        | Man, between costs and shenanigans like this, why don't
        | more companies self-host?
 
        | dilyevsky wrote:
        | 1. Leadership prefers to blame cloud when things break
        | rather than take responsibility.
        | 
        | 2. Cost is not an issue (until it is but you're already
        | locked in so oh well)
        | 
        | 3. FAANG has drained the talent pool of people who know
        | how
 
        | pm90 wrote:
        | Opex > Capex. If companies thought about long term, yes
        | they might consider it. But unless the cloud providers
        | fuck up really badly, they're ok to take the heat
        | occasionally and tolerate a bit of nonsense.
 
        | dilyevsky wrote:
        | You can lease equipment you know...
 
        | dekhn wrote:
        | Yep. I was an SRE who worked at Google and also launched
        | a product on Google Cloud. We had these arguments all the
        | time, and the contract language often provides a way for
        | the provider to weasel out.
 
        | jedberg wrote:
        | Like I said I never worked there and this is all hearsay
        | but there is a lot of nuance here being missed like
        | partial outages.
 
        | dekhn wrote:
        | This is no longer a partial outage. The status page
        | reports elevated API error rates, DynamoDB issues, EC2
        | API error rates, and my company's monitoring is
        | significantly affected (i.e., our IT folks can't tell us
        | what isn't working) and my AWS training class isn't
        | working either.
        | 
        | If this needed a CEO to eventually get around to pressing
        | a button that said "show users the actual information
        | about a problem" that reflects poorly on amazon.
 
        | dhsigweb wrote:
        | My friend works at a telemetry company for monitoring, and
        | they are working on alerting customers of cloud service
        | outages before the cloud providers do, since the providers
        | like to sit on their hands for a while (presumably to try
        | to fix it before anyone notices).
 
        | meetups323 wrote:
        | Large corps with influence get what they want regardless.
        | Status page goes red and the small corps start thinking
        | they can get what they want too.
 
        | scrose wrote:
        | > Status page goes red and the small corps start thinking
        | they can get what they want too.
        | 
        | I think you mean "start thinking they can get what they
        | pay for"
 
        | notreallyserio wrote:
        | I wonder how well known this is. You'd think it would be
        | hard to hire ethical engineers with such a scheme in
        | place and yet they have tens of thousands.
 
  | sneak wrote:
  | It's widespread industry knowledge now that AWS is publicly
  | dishonest about downtime.
  | 
  | When the biggest cloud provider in the world is famous for
  | gaslighting, it sets expectations for our whole industry.
  | 
  | It's fucking disgraceful that they tolerate such a lack of
  | integrity in their organization.
 
  | strictfp wrote:
  | "some customers may experience a slight elevation in error
  | rates" --> everything is on fire, absolutely nothing works
 
  | ballenf wrote:
  | https://downdetector.com
  | 
  | Amazing and scary to see all the unrelated services down right
  | now.
 
    | nightpool wrote:
    | I think it's pretty unlikely that both Google and Facebook
    | are affected by this minor AWS outage, whatever DownDetector
    | says. I even did a spot check on some of the smaller websites
    | they report as "down", like canva.com, and didn't see any
    | issues.
 
      | zarkov99 wrote:
      | You might be right about Google and Facebook, but this
      | isn't minor at all. Impact is widespread.
 
  | john37386 wrote:
  | It's starting to show issues now. I agree that it took a while
  | before we could get real visibility into the incident.
 
  | jrockway wrote:
  | I wonder if the other parts of Amazon do this. Like, if their
  | inventory system thinks something is in stock but people can't
  | find it in the warehouse, do they just not send it to you and
  | hope you don't notice? AWS's culture sounds super
  | broken.
  | 
  | My favorite status page, though, is Slack's. You can read an
  | article in the New York Times about how Slack was down for most
  | of a day, and the status page is just like "some percentage of
  | users experienced minor connectivity issues". "Some percentage"
  | is code for "100%" and "minor" is code for "total". Good try.
 
  | whoknew1122 wrote:
  | The problem being that often times you can't actually update
  | the status page. Most internal systems are down.
  | 
  | We can't even update our product to say it's down, because
  | accessing the product requires a process that is currently
  | dead.
 
    | thayne wrote:
    | That's why your status page should be completely independent
    | from the services it is monitoring (minus maybe something
    | that automatically updates it). We use a third party to host
    | our status page specifically so that we can update it even if
    | all our systems are down.
 
      | whoknew1122 wrote:
      | I'm not saying you're wrong, or that the status page is
      | architected properly. I'm just speaking to the current
      | situation.
 
  | davecap1 wrote:
  | On top of that, the "Personalized Health Dashboard" doesn't
  | work because I can't seem to log in to the console.
 
    | meepmorp wrote:
    | I'm logged in; you're missing an error message.
 
      | davecap1 wrote:
      | We have federated login with MFA required (which was
      | failing). It just started working again.
      | 
      | Scratch that... console is not loading at all now :)
 
  | ricardobayes wrote:
  | Wonder why almost all of Amazon's frontend looks like it was
  | written in C++.
 
| [deleted]
 
| davikawasaki wrote:
| EKS works for us
 
| alephnan wrote:
| Vanguard has been slow all day. I'm going to guess Vanguard has a
| dependency on us-east-1
 
| tgtweak wrote:
| I'm going with "BGP error" which is likely config-related, likely
| human error.
| 
| Seems to be the trend with the last 5-6 big cloud outages.
 
| keyle wrote:
| Why does that webpage render like a dog?... I get that it's
| under load but the rendering itself is chugging something rare.
| 
| Edit: wow that webpage is humongous... never heard of paging?
 
| saggy4 wrote:
| I think only the console is down. The CLI is working fine for me
| in ap-southeast-1
 
| kchoudhu wrote:
| At this point I have no idea why anyone would put anything in us-
| east-1.
| 
| Also isolation is not as good as they would have you believe: I
| am unable to login to AWS Quicksight in us-west-2...
 
  | bradhe wrote:
  | Man, some conclusions are being _jumped_ to by this reply.
 
    | InTheArena wrote:
    | There is a very long history of US-east-1 being horrible.
    | Just bad. We've told every client we can to get out of there.
    | It's one of the oldest amazon regions, and I think too much
    | old legacy and weird stuff happens there. Use US-west-2.
 
      | jedberg wrote:
      | Or US-East-2.
 
      | blahyawnblah wrote:
      | Isn't us-east-1 where they deploy everything first? And the
      | only region that has 100% of all available services?
 
  | crad wrote:
  | Been in us-east-1 for a long time. Things like Direct Connect
  | and other integrations aren't easy or cheap to move and when
  | you have other, bigger priorities, moving regions is not an
  | easy decision to prioritize.
 
| RubberShoes wrote:
| "AWS Management Console Home page is currently unavailable. You
| can monitor status on the AWS Service Health Dashboard."
| 
| And then Health Dashboard is 100% green. What a joke.
 
| nurgasemetey wrote:
| It seems that IMDB and Goodreads are also affected
 
  | kingcharles wrote:
  | Yeah, and Audible and Amazon.com search and the Amazon retail
  | stores.
  | 
  | Basically Amazon fucked all their own products too.
 
| picodguyo wrote:
| Funny, I just asked Alexa to set a timer and she said there was a
| problem doing that. Apparently timers require a functioning
| us-east-1 now.
 
  | minig33 wrote:
  | I can't turn on my lights... the future is weird
 
    | lsaferite wrote:
    | And that is why my lighting automation has a baseline
    | requirement that it works 100% without the internet, and
    | preferably without a central controller.
 
      | organsnyder wrote:
      | I love my Home Assistant setup for this reason. I can even
      | get light bulbs pre-flashed with ESPHome now (my wife was
      | bemused when I was updating the firmware on the
      | lightbulbs).
 
      | rodgerd wrote:
      | HomeKit compatibility is a useful proxy for local API,
      | since it's a hard requirement for HomeKit certification.
 
| m12k wrote:
| A former colleague told me years ago that us-east-1 is basically
| the guinea pig where changes get tested before being rolled out
| to the other regions, and as a result is less stable than the
| others. Does anyone know if there's any truth to this?
 
  | wizwit999 wrote:
  | False; it's often fourth, IIRC. SFO (us-west-1) is actually
  | usually first.
 
  | shepherdjerred wrote:
  | At my org it was deployed in the middle, around the fourth wave
  | iirc
 
  | arpinum wrote:
  | This is not true. Lambda updates us-east-1 last.
 
  | treesknees wrote:
  | I can't see why they'd use the most common/popular region as a
  | guinea pig.
 
    | Kye wrote:
    | Problem: you've gone as far as you can go testing internally
    | or with test groups. You know there are edge cases you'll
    | only identify by having enough people test it.
    | 
    | Solution: push it to production on the zone with the most
    | users and see what breaks.
 
  | sharpy wrote:
  | The guideline has been to deploy to it last.
  | 
  | If the team follows pipeline best practices, they are supposed
  | to deploy to a single small region first, wait 24 hours, and
  | then deploy to more, wait more, and deploy to more, until
  | finally deploying to us-east-1.
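  | 
  | In pipeline terms it ends up looking something like this sketch
  | (the wave makeup and bake times are made up; a real pipeline
  | also gates each wave on alarms and blocks promotion on any
  | regression):
  | 
  |     import time
  | 
  |     WAVES = [
  |         {"regions": ["eu-west-3"], "bake_hours": 24},
  |         {"regions": ["us-west-1", "ap-southeast-2"], "bake_hours": 24},
  |         {"regions": ["us-west-2", "eu-west-1"], "bake_hours": 24},
  |         {"regions": ["us-east-1"], "bake_hours": 0},  # biggest last
  |     ]
  | 
  |     def deploy(build, region):
  |         print(f"deploying {build} to {region}")  # real deploy here
  | 
  |     def roll_out(build):
  |         for wave in WAVES:
  |             for region in wave["regions"]:
  |                 deploy(build, region)
  |             time.sleep(wave["bake_hours"] * 3600)  # bake period
  | 
  |     roll_out("build-1234")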
 
| mystcb wrote:
| Not sure it helps, but got this update from someone inside AWS a
| few moments ago.
| 
| "We have identified the root cause of the issues in the US-EAST-1
| Region, which is a network issue with some network devices in
| that Region which is affecting multiple services, including the
| console but also services like S3. We are actively working
| towards recovery."
 
  | rozenmd wrote:
  | They finally updated the status page:
  | https://status.aws.amazon.com/
 
    | mystcb wrote:
    | Ahh, good spot. It does seem that the AWS person I am speaking
    | to has a few more bits than what is shown on the page; they
    | just messaged me the same message there, but added:
    | 
    | "All teams are engaged and continuing to work towards
    | mitigation. We have confirmed the issue is due to multiple
    | impaired network devices in the US-EAST-1 Region."
    | 
    | Doesn't sound like they are having a good day there!
 
  | alpha_squared wrote:
  | That's a copy-paste, we got the same thing from our AWS
  | contact. It's just enough info to confirm there's an issue, but
  | not enough to give any indication on the scope or timeline to
  | resolution.
 
  | alexatalktome wrote:
  | Internally the rumor is that our CICD pipelines failed to stop
  | bad commits to certain AWS services. This isn't due to tests
  | but due to actual pipelines infra failing.
  | 
  | We've been told to disable all pipelines even if we have time
  | blockers or manual approval steps or failing tests
 
  | jdc0589 wrote:
  | I love how they are sharing this stuff out to some clients, but
  | it's technically under NDA.
 
    | alfalfasprout wrote:
    | Yeah, we got updates via NDA too lol. Such a joke that a
    | status page update is considered privileged lol.
 
| romanhotsiy wrote:
| It's funny that the first place I go to learn about the outage is
| Hacker News and not https://status.aws.amazon.com/ (it still
| reports everything as "operating normally"...)
 
  | albatross13 wrote:
  | Yeah, I tend to go off of https://downdetector.com/status/aws-
  | amazon-web-services/
 
    | murph-almighty wrote:
    | I always got the impression that downdetector worked by
    | logging the number of times they get a hit for a particular
    | service and using that as a heuristic to determine if
    | something is down. If so, that's brilliant.
 
      | albatross13 wrote:
      | I think it's a bit simpler for AWS- there's a big red "I
      | have a problem with AWS" button on that page. You click it,
      | tell it what your problem is, and it logs a report. Unless
      | that's what you were driving at and I missed it, it's
      | early. Too early for AWS to be down :(
      | 
      | Some 3600 people have hit that button in the last ~15
      | minutes.
 
      | cmg wrote:
      | It's brilliant until the information is bad.
      | 
      | When Facebook's properties all went down in October, people
      | were saying that AT&T and other cell phone carriers were
      | also down - because they couldn't connect to FB/Insta/etc.
      | There were even some media reports that cited Downdetector,
      | seemingly without understanding that they are basically
      | crowdsourced and sometimes the crowd is wrong.
 
  | bmcahren wrote:
  | I made sure our incident response plan includes checking Hacker
  | News and Twitter for _actual_ updates and information.
  | 
  | As of right now, this thread and one update from a twitter
  | user,
  | https://twitter.com/SiteRelEnby/status/1468253604876333059 are
  | all we have. I went into disaster recovery mode when I saw our
  | traffic dropped to 0 suddenly at 10:30am ET. That was just the
  | SQS/something else preventing our ELB logs from being extracted
  | to DataDog though.
 
    | unethical_ban wrote:
    | So as of the time you posted this comment, were other
    | services actually down? The way the 500 shows up, and the AWS
    | status page, makes it sound like "only" the main landing
    | page/mgt console is unavailable, not AWS services.
 
      | jeremyjh wrote:
      | Yes, they are still publishing lies on their status page.
      | In this thread people are reporting issues with many
      | services. I'm seeing periodic S3 PUT failures for the last
      | 1.5 hours.
 
        | alexatalktome wrote:
        | AWS services are all built against each other so one
        | failing will take down a bunch more, which take down more,
        | like dominoes. Internally there's a list of >20 "public
        | facing" AWS services impacted.
 
  | authed wrote:
  | I usually go on Twitter first for outages.
 
  | 1-6 wrote:
  | Community reporting > internal operations
 
  | taf2 wrote:
  | Now 57 minutes later and it still reports everything as
  | operating normally.
 
    | mijoharas wrote:
    | It shows errors now.
 
      | romanhotsiy wrote:
      | It doesn't show errors with Lambda and we clearly do
      | experience them.
 
| kingcharles wrote:
| Does McDonalds use AWS for the backend to their app?
| 
| If I find out this is why I couldn't get my Happy Meal this
| morning I'm going to be really, really grumpy.
| 
| EDIT: I'm REALLY grumpy now:
| 
| https://aws.amazon.com/blogs/industries/aws-is-how-mcdonalds...
 
  | hoofhearted wrote:
  | The McDonalds App was showing on the frontpage of Down Detector
  | at the same time as all the Amazon dependent services last I
  | checked.
 
  | Crespyl wrote:
  | Apparently Taco Bell too, not being able to place an order and
  | then also not being able to fall back to McDonalds was how I
  | realized there was a larger outage :p
  | 
  | What am I supposed to do for lunch now? Go to the drive through
  | and order like a normal person? /s
  | 
  | Grumble grumble
 
| dwighttk wrote:
| Is this why goodreads hasn't been working today?
 
| TameAntelope wrote:
| We run as much as we can out of us-east-2 because it has more
| uptime than us-east-1, and I don't think I've ever regretted that
| decision.
 
| p2t2p wrote:
| I have alerts going off because of that...
 
| kello wrote:
| Don't envy the engineers working on this right now. Good luck!
 
| nickysielicki wrote:
| The Amazon.com storefront was giving me issues loading search
| results -- this is the worst possible time of year for Amazon to
| have issues. It's horrifying and awesome to imagine hundreds of
| thousands (if not millions) of dollars of lost orders an hour --
| just from sluggish load times. Hugops to those dealing with this.
 
  | throwanem wrote:
  | Third worst time. It's not BFCM and it's not the week before
  | Christmas; from prior high-volume ecommerce experience I
  | suspect their purchase rate is elevated at this time but
  | nowhere near those two peaks.
 
| n0cturne wrote:
| I just walked into the Amazon Books store at our local mall. They
| are letting everyone know at the entrance that "some items aren't
| available for purchase right now because our systems are down."
| 
| So at least Amazon retail is feeling some of the pain from this
| outage!
 
| etimberg wrote:
| Seeing issues in ca-central-1 and us-east
 
| taf2 wrote:
| We're not down and we're in us-east-1... maybe there is more to
| this issue?
 
  | bradhe wrote:
  | I think it could just be the console?
 
  | taf2 wrote:
  | Found that it seems Lambda is impacted
 
  | fsagx wrote:
  | Everything fine with S3 in us-east-1 for me. Also just not able
  | to access the console.
 
  | abarringer wrote:
  | Our services in AWS East are down.
 
___________________________________________________________________
(page generated 2021-12-07 23:00 UTC)