[HN Gopher] 1.1.1.1 lookup failures on October 4th, 2023
___________________________________________________________________
 
1.1.1.1 lookup failures on October 4th, 2023
 
Author : todsacerdoti
Score  : 64 points
Date   : 2023-10-04 19:41 UTC (3 hours ago)
 
web link (blog.cloudflare.com)
 
| homero wrote:
| This got me. I spent an hour trying to figure out why my Internet
| seemingly went down, but not fully.
 
| throwaway67743 wrote:
| [flagged]
 
  | jarym wrote:
  | Up until a few months ago the HN crowd loved Cloudflare. How
  | sentiment has changed in such a short period.
  | 
  | My guess would be their weird 'site protection' stuff is
  | burning too many people and negatively impacting their
  | reputation.
 
    | kkielhofner wrote:
    | > My guess would be their weird 'site protection' stuff is
    | burning too many people and negatively impacting their
    | reputation.
    | 
    | What's always been interesting to me about this take is that
    | Cloudflare isn't randomly inserting itself into internet
    | traffic.
    | 
    | Cloudflare customers have choice in the marketplace and they
    | chose Cloudflare for whatever reasons. If end-users take
    | issue with accessing the site of a Cloudflare customer they
    | should take it up with the owners of the site that chose
    | Cloudflare. Theoretically the Cloudflare customer would take
    | it up with them if it becomes problematic. Cloudflare has no
    | obligation to the site end-users other than meeting the needs
    | of their customer, who does have an obligation to their
    | end-users (theoretically).
    | 
    | Cloudflare is, ostensibly, providing a solution for their
    | customers. How that impacts their customer's end-users is
    | between Cloudflare and the customer.
 
    | reaperman wrote:
    | In general, I find comments along these lines very easy to
    | disprove thoroughly. There has been
    | consistent criticism of Cloudflare for many years, ever since
    | the majority of web traffic started going through their anti-
    | DDOS and anti-bot gateways.
    | 
    | Here's a HN post with lots of very critical comments[0] from
    | 7 years ago, including a fairly scathing one from 'tptacek.
    | Even way back then, you'd get the same comments you hear
    | today like:
    | 
    | > So rather than demand fixes for the fundamental issues that
    | enable ddos attacks (preventing IP spoofing, allowing
    | infected computers to remain connected, etc), we just
    | continue down this path of massive centralization of services
    | into a few big players that can afford the arms race against
    | botnets. Using services like Cloudflare as a 'fix' is
    | wrecking the decentralized principles of the Internet. At
    | that point we might as well just write all apps as Facebook
    | widgets.
    | 
    | 0: https://news.ycombinator.com/item?id=13718947
 
    | throwaway67743 wrote:
    | I've never loved cloudflare - as someone doing this long
    | before they existed I see through their wordy blog posts
    | about rookie mistakes. It's embarrassing really.
 
  | Eduard wrote:
  | Maybe to compensate for Cloudflare's success blog posts, where
  | they usually present themselves as the saviors of the world.
 
    | throwaway67743 wrote:
    | Quite. Nobody else can do what they do! (Brb doing the same
    | thing before Prince was even born)
 
      | kkielhofner wrote:
      | This is peak HN comment.
      | 
      | 300 PoPs around the world delivering 210 Tbps of capacity,
      | mitigation of some of the largest DDoS attacks in history,
      | 20% of internet traffic. Workers, Pages, R2, D1, Zero
      | Trust, Stream, Images, Warp, 1.1.1.1, etc, etc, etc - all
      | at incredible scale.
      | 
      | But yes, of course you have been doing the exact same thing
      | since before Prince was born.
 
        | throwaway67743 wrote:
        | People had global networks of the same scale long before,
        | they just didn't offer the same features because they had
        | different products.
 
  | Zambyte wrote:
  | I would rather they be open about their failures than deceptive
  | about it. Of course simply not failing would be ideal, but we
  | don't live in a perfect world. If a single, external point of
  | failure causes your system to crumble, that's a design problem,
  | not a dependency problem.
 
    | reaperman wrote:
    | To your point, Cloudflare leadership are pretty active on HN.
    | They generally do a pretty good job of providing detailed
    | explanations to good-faith questions here and providing
    | decent post-mortems of major incidents to the HN community.
    | 
    | They do take care to avoid engaging with people who are
    | opposed to their dominance on ideological levels ("no one
    | should be the gatekeeper for that much of the internet", etc)
    | and there are a small handful of questions they seem to avoid
    | (e.g. direct feature-to-feature comparisons between Warp and
    | Mullvad).
 
    | throwaway67743 wrote:
    | They use transparency as a cover for rookie mistakes; it's not
    | the same as actual transparency, especially as these are
    | really bad examples of doing it wrong.
 
  | aftbit wrote:
  | They're practicing "just culture" (as in justice), which
  | rewards explaining and root causing your failures, and rejects
  | the concept that "someone sucks" in favor of "systems can
  | always be improved".
 
| LeoPanthera wrote:
| Did 1.0.0.1 also go down? The article doesn't say.
 
  | homero wrote:
  | Of course it did; it's the same service.
 
    | toast0 wrote:
    | A highly reliable service might run one partition on a
    | completely separate serving stack. It's worth asking.
 
| morugam wrote:
| We noticed this through our own, homegrown scripts that check for
| this, having been screwed by an outage a few years ago. I'm happy
| they so quickly acknowledge and explain these issues. Good work!
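
A minimal sketch of the kind of homegrown check described above, in Python and assuming plain DNS over UDP port 53. It hand-builds an A-record query using only the standard library and reads the response code from the reply header (RCODE 2 is SERVFAIL). The probe name `example.com`, the function names, and the timeout are illustrative assumptions, not the commenter's actual script.

```python
import secrets
import socket
import struct

# A few common DNS RCODE values carried in the low 4 bits of the header flags
RCODE_NAMES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name: str, txid: int) -> bytes:
    """Build a minimal DNS query for an A record with recursion desired.

    Assumes ASCII labels of at most 63 bytes each (no IDN handling).
    """
    # Header: ID, flags (0x0100 = RD bit), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def parse_rcode(response: bytes, txid: int) -> int:
    """Return the RCODE from a DNS response header after basic sanity checks."""
    if len(response) < 12:
        raise ValueError("response shorter than a DNS header")
    resp_id, flags = struct.unpack("!HH", response[:4])
    if resp_id != txid:
        raise ValueError("transaction ID mismatch")
    if not flags & 0x8000:
        raise ValueError("QR bit not set; not a response")
    return flags & 0x000F


def probe(resolver: str, name: str = "example.com", timeout: float = 2.0) -> str:
    """Send one query to a resolver and return a short status string."""
    txid = secrets.randbelow(0x10000)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name, txid), (resolver, 53))
            data, _addr = sock.recvfrom(4096)
        except OSError as exc:  # timeouts and unreachable networks both land here
            return f"ERROR ({type(exc).__name__})"
    try:
        rcode = parse_rcode(data, txid)
    except ValueError as exc:
        return f"ERROR ({exc})"
    return RCODE_NAMES.get(rcode, f"RCODE {rcode}")
```

Calling `probe("1.1.1.1")` and `probe("1.0.0.1")` separately (for example from cron) and alerting on anything other than `NOERROR` reports each address on its own, so a failure affecting only one of them would still be visible.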
 
| suprjami wrote:
| Strangely I noticed this because some parts of eBay stopped
| loading. I spent a while troubleshooting my privacy/adblock
| nonsense because _surely CloudFlare couldn't be down_ but that's
| the only conclusion I could come to.
 
| tedunangst wrote:
| > Visit 1.1.1.1 from any device to get started with our free app
| that makes your Internet faster and safer.
| 
| Ironic.
 
| denysvitali wrote:
| My only concern:
| 
| 7:57 UTC: first reports coming in
| 
| I noticed this issue quite quickly ("reported" at 7:54 UTC [1]),
| and I noticed I wasn't alone thanks to Twitter / X. I tried to
| get in touch with Cloudflare to report this issue - but I haven't
| found any meaningful contact other than Twitter.
| 
| For such an important service, I'm surprised there is no contact
| email or form where you can get in touch with the engineers
| responsible for keeping the service up and running.
| 
| Other than that, kudos for the well written blog post - as
| always!
| 
| [1]: https://nitter.net/DenysVitali/status/1709476961523835246
 
| araes wrote:
| I like how it's a 42 joke.
| 
| 4(0b10) 7:00 ends at 11:02 (4 hr 2 min) on a 4 sum 2x2. And refs
| to 1.1.1.1 vs 1.0.0.1
 
| robhlt wrote:
| The lack of additional alerts in the Remediation section is a
| little bit concerning. Adding an alert for serving stale root
| zone data is great, but I think a few more would be very useful
| too:
| 
| - There's a clear uptick in SERVFAIL responses at 7:00 UTC, but
| they didn't start their response until an hour later, after
| receiving external reports. This uptick should have automatically
| triggered an alert. It can't have been within the normal range
| because they got customer reports about it.
| 
| - The resolver failed to load the root zone data on startup and
| resorted to a fallback path. Even if this isn't an error for the
| resolver it should still be an alert for the static_zone service,
| because its only client is failing to consume its data.
| 
| - The static_zone service should also alert when some percentage
| of instances fail to parse the root zone data, to get ahead of
| potential problems before the existing data becomes stale.
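
The three alerts proposed above can be sketched as plain threshold checks. This is a minimal Python illustration over hypothetical inputs: the counter names, comparing the SERVFAIL ratio against a baseline times a multiplier, and the 10% parse-failure threshold are all assumptions for illustration, not values from Cloudflare's post-mortem or from the comment.

```python
from dataclasses import dataclass


@dataclass
class ResolverStats:
    """Hypothetical per-interval counters for the resolver fleet."""
    total_responses: int
    servfail_responses: int
    # Hypothetical: startups where loading static_zone's root zone data
    # failed and the resolver used its fallback path instead
    fallback_root_loads: int


@dataclass
class StaticZoneStats:
    """Hypothetical counts across static_zone instances."""
    instances: int
    parse_failures: int


def servfail_alert(stats: ResolverStats, baseline_ratio: float, multiplier: float = 3.0) -> bool:
    """Fire when the SERVFAIL ratio exceeds an assumed multiple of its normal baseline."""
    if stats.total_responses == 0:
        return False
    ratio = stats.servfail_responses / stats.total_responses
    return ratio > baseline_ratio * multiplier


def fallback_alert(stats: ResolverStats) -> bool:
    """Fire whenever the resolver could not consume static_zone's data at startup."""
    return stats.fallback_root_loads > 0


def parse_failure_alert(stats: StaticZoneStats, max_fraction: float = 0.10) -> bool:
    """Fire when more than max_fraction of static_zone instances fail to parse the root zone."""
    if stats.instances == 0:
        return False
    return stats.parse_failures / stats.instances > max_fraction


def evaluate(resolver: ResolverStats, zone: StaticZoneStats, baseline_ratio: float) -> list[str]:
    """Return the names of the alerts that would fire for this interval."""
    fired = []
    if servfail_alert(resolver, baseline_ratio):
        fired.append("servfail_rate")
    if fallback_alert(resolver):
        fired.append("root_zone_fallback")
    if parse_failure_alert(zone):
        fired.append("static_zone_parse_failures")
    return fired
```

Comparing against a baseline ratio rather than a fixed count keeps the first check meaningful as traffic volume changes; in practice this logic would more likely live in a monitoring system's alert rules than in application code.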
 
| ChrisArchitect wrote:
| Earlier discussion while outage was active:
| https://news.ycombinator.com/item?id=37763143
 
___________________________________________________________________
(page generated 2023-10-04 23:00 UTC)