[HN Gopher] The case of the recursive resolvers: What happened d...
___________________________________________________________________
 
The case of the recursive resolvers: What happened during Slack's
DNSSEC rollout
 
Author : usrme
Score  : 49 points
Date   : 2021-11-29 11:36 UTC (11 hours ago)
 
web link (slack.engineering)
w3m dump (slack.engineering)
 
| daper wrote:
| Of the described mistakes, two come from a lack of understanding
| of exactly how DNS works. But I agree it's in fact hard (see [1]).
| 
| 1. "This strict DNS spec enforcement will reject a CNAME record
| at the apex of a zone (as per RFC-2181), including the APEX of a
| sub-delegated subdomain. This was the reason that customers using
| VPN providers were disproportionately" - This is non-intuitive
| and many people are surprised by that. You cannot create any
| subdomain (even www.domain.tld) if you created "domain.tld CNAME
| something...". Looks like not every server/resolver enforces that
| restriction.
| 
| 2. "based on expert advice, our understanding at the time was
| that DS records at the .com zone were never cached, so pulling it
| from the registrar would cause resolvers to immediately stop
| performing DNSSEC validation." - like any other record, they can
| be cached. DNS also has negative caching (caching of "not found"
| responses). Moreover, there are resolvers that allow configuring a
| minimum TTL that can be higher than what your NS servers return
| (like unbound - "cache-min-ttl" option) or can be configured to
| serve stale responses in case of resolution failures after the
| cached data expires [2]. That means returning TTL of "1s" will
| not work as you expect.
| 
| [1] https://blog.powerdns.com/2020/11/27/goodbye-dns-goodbye-pow...
| [2] https://www.isc.org/blogs/2020-serve-stale/
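
The caching behaviour described in the comment above can be sketched
with a toy model. This is a minimal illustration under stated
assumptions, not any real resolver's code: `ToyResolverCache`,
`min_ttl`, and `stale_window` are hypothetical names. `min_ttl` plays
the role of unbound's "cache-min-ttl", `stale_window` stands in for
serve-stale behaviour, and a value of `None` models a cached negative
("not found") answer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    value: Optional[str]  # None models a cached negative answer
    stored_at: float      # time the answer was cached, in seconds
    ttl: float            # TTL the resolver decided to honour


class ToyResolverCache:
    """Hypothetical model of a resolver cache, for illustration only."""

    def __init__(self, min_ttl=0.0, stale_window=0.0):
        self.min_ttl = min_ttl            # floor applied to every TTL
        self.stale_window = stale_window  # extra time expired data may be served
        self.entries = {}

    def store(self, name, value, ttl, now):
        # The authoritative TTL is only a request: the resolver may raise it.
        effective_ttl = max(ttl, self.min_ttl)
        self.entries[name] = CacheEntry(value, now, effective_ttl)

    def lookup(self, name, now, upstream_ok=True):
        """Return (hit, value); a hit with value None is a negative answer."""
        entry = self.entries.get(name)
        if entry is None:
            return (False, None)
        age = now - entry.stored_at
        if age < entry.ttl:
            return (True, entry.value)
        # Serve-stale: keep answering from expired data if upstream is failing.
        if not upstream_ok and age < entry.ttl + self.stale_window:
            return (True, entry.value)
        return (False, None)
```

In this model, with `min_ttl=300`, a record published with a TTL of 1
second is still answered from cache a minute later, and with a
non-zero `stale_window` it can be served after expiry whenever
upstream resolution fails.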
 
  | btown wrote:
  | My (basic and conservative) mental model that "in DNS,
  | _everything including the lack of presence of a thing_ can be
  | cached " is why I'm very cautious before rolling out anything
  | from DKIM to DNSSEC. A deep understanding of specifications is
  | vital. I'm somewhat surprised an organization of Slack's scale
  | didn't have a consultant on the level of "I designed DNSSEC" on
  | hand for this.
 
    | belorn wrote:
    | DNS is a bit like network engineering, in that simple errors
    | tend to have large impacts that prevent trial and
    | error. Before working as a sysadmin I thought that doing
    | experimental lab setups was something only researchers and
    | students did, but when you have an old system up and running,
    | it can be quite difficult to get in there and make changes
    | unless you are very sure about what you are doing.
    | 
    | Like networking, there can also be existing protocol errors
    | and plain broken things that have, for one reason or another,
    | been seemingly working for decades without causing a problem.
    | Internet flag day is one of those things that pokes at those
    | problems, and maybe one day we will see a test for CNAME at
    | the apex.
 
      | tptacek wrote:
      | It's worth noting that this by itself is a reason not to do
      | ambitious security things (and a global PKI is nothing if
      | not ambitious) at the layer of DNS. It's an extension of
      | the end-to-end argument, or at least of the logic used
      | in the Saltzer and Reed paper: because it's difficult and
      | error-prone to deploy policy code in the core of the
      | network (here: the "conceptual" core of the protocol
      | stack), we should work to get that policy further up the
      | stack and closer to the applications that actually care
      | about that policy.
      | 
      | The Saltzer and Reed paper, if I'm remembering right, even
      | calls out security as specifically one of those things you
      | don't want to be doing in the middle of the network.
      | 
      | See also: Zero Trust / BeyondCorp.
 
| dogecoinbase wrote:
| In addition to the other note that DNSSEC is _not_ required for
| FedRAMP certification (it's even discouraged by cloud.gov!
| https://cloud.gov/docs/compliance/domain-standards/ ), this is
| some weirdly intellectually dishonest phrasing (linking to
| tptacek's article Against DNSSEC:
| https://sockpuppet.org/blog/2015/01/15/against-dnssec/ ):
| 
| > While we are aware of the debate around the utility of DNSSEC
| among the DNS community, we are still committed to securing Slack
| for our customers.
| 
| The argument is specifically that it doesn't provide that
| security. At least it's neat to see actual begging the question
| in the wild, I guess.
 
  | mpyne wrote:
  | FedRAMP is designed to provide reusable cybersecurity work
  | against the NIST security controls that your Federal agency's
  | Authorizing Official deems your Federal IT system must
  | implement.
  | 
  | Those security controls come from a document NIST SP 800-53, 2
  | of which (that Slack linked to in the linked post-mortem),
  | SC-20 and SC-21, effectively seem to me to conspire to require
  | DNSSEC. Both of these are included as part of the "Low"
  | baseline of security controls, so they are effectively required
  | for all Federal IT systems unless your Agency Authorizing
  | Official wants to walk on the wild side.
  | 
  | So even if you get a FedRAMP certification, if you do it
  | without fully implementing SC-20 and SC-21, that just means
  | your customer needs to either convince their Agency Authorizing
  | Official to sign off on an ATO despite the missing SC-20 and
  | SC-21 security control, convince them to sign off on some sort
  | of Plan of Action and Milestones where Slack will commit to fix
  | this in the future (which is just kicking the can down the
  | road), or somehow manage to implement the same effect
  | completely within the customer end without help from Slack. All
  | you would have done is to spend a lot of money on FedRAMP
  | paperwork without making it appreciably easier for potential
  | customers who have to deal with compliance regimes to buy your
  | product.
  | 
  | Cloud.gov's argument is valid but all they posted is that they
  | don't implement SC-20 or SC-21 for their government customers,
  | and that the OMB M-08-23 mandate for DNSSEC is no longer
  | operative (not that no other DNSSEC mandate applies). Indeed
  | they even explain how their customers should work
  | to enable it (presumably by refusing to use the non-DNSSEC
  | compliant .app.cloud.gov services and instead using only their
  | DNSSEC-compliant custom domains).
  | 
  | FWIW I fully agree with tptacek's arguments against DNSSEC, and
  | will note that I recently stopped being able to navigate to
  | literally the entire .mil on my Linux host until I disabled
  | DNSSEC in systemd, for reasons that are still unclear to me
  | even now.
 
  | tylermenezes wrote:
  | > intellectually dishonest phrasing
  | 
  | Not everyone agrees with the linked argument. For example, I
  | disagree that browsers can't take advantage of DNSSEC, since
  | many are using DoH, and the rest of the article reads like
  | someone complaining that we need to wait for the perfect
  | protocol or nothing at all.
  | 
  | That's the thing about a debate... it's got arguments on both
  | sides.
 
    | dogecoinbase wrote:
    | It's fine to disagree with the linked argument, but you
    | actually have to do so. This is them presupposing that
    | "securing Slack for [their] customers" requires DNSSEC --
    | it's not engaging with the argument at all.
 
    | tptacek wrote:
    | I mean, I agree with you and don't find the language
    | disingenuous (I felt like it was more of a tell that the
    | people working on this cursed project weren't super read into
    | DNSSEC and DNS security in general, which isn't a knock; it's
    | a boring thing to keep up with, especially when the best-
    | practice answer is so simple --- just don't bother with
    | DNSSEC).
    | 
    | But I'd also say that DoH (1) largely obviates any need for
    | DNSSEC (the last-mile DNS problem is the only on-the-wire DNS
    | security problem that needs solving) and (2) doesn't enable
    | DANE in browsers, which is what people are talking about when
    | they talk about DNSSEC intersecting with browsers in any way
    | other than randomly making sites fall off the Internet.
 
| Joe8Bit wrote:
| I know we've all collectively accepted that DNSSEC is a terrible,
| complicated blight on the world but I still find it incredible
| that an organisation with Slack's resources and access to
| expertise can't make it work.
 
  | toomuchtodo wrote:
  | No tech company is infallible. All of them have outages, some
  | lasting hours, even days.
  | 
  | Complex systems can and will fail. Try to do better, of course,
  | but let's acknowledge that perfection will always exceed our
  | grasp. The world will continue to turn regardless.
  | 
  | One day it might just be your turn to break production.
 
    | tptacek wrote:
    | The subtext here isn't that Slack is bad at this (they are
    | not), but that DNSSEC is somehow intrinsically unsafe (it
    | probably is).
 
      | toomuchtodo wrote:
      | I agree with your points about DNSSEC (disclaimer: I have
      | not had the pleasure of having to implement it myself in
      | infra), but was attempting to communicate that DNSSEC isn't
      | the only area of ops that folks get exposed to these sorts
      | of unknowns or edge cases, and that no amount of resourcing
      | enables you to avoid these issues. For Slack, it was
      | DNSSEC. For Roblox, Consul. Facebook/Insta, software
      | defined BGP. Akamai, DNS.
      | 
      | Perhaps I did not read the room appropriately. Mea culpa.
 
  | tptacek wrote:
  | You say Slack, and I agree, that's telling, but you have to add
  | to that _AWS itself_ , which had a DNSSEC bug in its wildcard
  | record support as well. Slack and AWS together couldn't make
  | this feature work. Further: the open source tooling Slack (like
  | most places) relies on for deployment is also DNSSEC-hostile:
  | one of their problems is that Terraform's Route53 provider
  | doesn't safely disable DNSSEC once enabled. It's a mess
  | everywhere you look.
  | 
  | I think another interesting question here is why Slack bothered
  | in the first place. As was pointed out on the other DNSSEC
  | thread today: practically nobody in the technology industry
  | uses DNSSEC in the first place. Presumably, Slack did DNSSEC
  | (they don't anymore!) in service of FedRAMP compliance. Why?
  | Slack has one of the most popular products in all of computing.
  | What bad thing was going to happen if they said "nah, we're
  | going to go with Cloud.gov's recommendation and not this
  | FedRAMP document"?
 
    | x3n0ph3n3 wrote:
    | Because FedRAMP compliance is required for many US federal
    | (and now some state) customers, whom Slack can charge a
    | premium.
 
    | vimda wrote:
    | Gotta be Fedramp compliant to do business with the US
    | government. Even worse, you have to be Fedramp compliant to
    | work with anyone who works with the US government. From a
    | business (if not an engineering) standpoint, there's plenty
    | to gain in going through the motions
 
      | tptacek wrote:
      | As was pointed out downthread, there are tech companies
      | that are "more" FedRAMP compliant (FedRAMP "High") without
      | DNSSEC support.
      | 
      | (Kenn White points out on Twitter that some of this may be
      | due to grandfathering --- though, the FedRAMP DNSSEC
      | requirement is pretty old.)
 
    | mpyne wrote:
    | > Presumably, Slack did DNSSEC (they don't anymore!) in
    | service of FedRAMP compliance. Why? Slack has one of the most
    | popular products in all of computing. What bad thing was
    | going to happen if they said "nah, we're going to go with
    | Cloud.gov's recommendation and not this FedRAMP document"?
    | 
    | As just one example, it's tremendously difficult, if not
    | impossible, to sell your cloud-based SaaS to Navy customers
    | if you have open FedRAMP compliance issues that you aren't at
    | least working to address.
    | 
    | I say "compliance" instead of "security" for a reason as
    | well, as "compliance" truly runs the show in Navy
    | cybersecurity. And if you want to sell to that market (and
    | it's hardly just Navy who runs this way), it's easier to
    | check the checkboxes than it is to argue about whether NIST
    | is right or cloud.gov is right.
 
  | technion wrote:
  | I know HN has collectively accepted that, but every time I'm
  | associated with an organisation that pays for a penetration
  | test it comes in as a high risk finding, so much so that I've
  | given in to deploying it to avoid sitting with non-technical
  | managers doing the "here's why I disagree" all over again.
  | Outside of this group I definitely feel like I'm on my own in
  | that view.
 
  | belorn wrote:
  | _" It turned out that some resolvers become more strict when
  | DNSSEC signing is enabled at the authoritative name servers,
  | even while signing was not enabled at the root name servers
  | (i.e. before DS records were published to COM nameservers).
  | This strict DNS spec enforcement will reject a CNAME record at
  | the apex of a zone (as per RFC-2181), including the APEX of a
  | sub-delegated subdomain"_
  | 
  | Slack's second attempt wasn't a DNSSEC problem. Slack depended
  | on a permissive fallback of resolvers when encountering a plain
  | DNS protocol error. It is similar to how some websites in the
  | past relied on permissive browser implementations when facing
  | broken HTML/JS/CSS. Slack fixed their broken DNS as a result of
  | this.
  | 
  | Slack's third attempt was not the fault of Slack but rather a
  | software bug at Amazon. I would make the argument that Amazon's
  | primary product isn't DNS services, but they did fix their
  | bug after this.
  | 
  | The general conclusion I get from the article is not that
  | DNSSEC is broken, nor that it is too complicated. It is that when
  | making changes to your core infrastructure to make it more
  | secure, bugs that may have been lying dormant might pop up and
  | bite. I am sure some people have had that experience in domains
  | outside of DNS.
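
The CNAME-at-apex rule quoted in the comment above can be checked
mechanically. The following is a minimal sketch, assuming a zone is
given as a flat list of `(owner_name, record_type)` tuples; the
function name `find_cname_conflicts` and the sample zone are
hypothetical. It treats RRSIG and NSEC as the only types allowed next
to a CNAME (the DNSSEC exception, as I read RFC 4035), which is why a
zone apex, which always carries SOA and NS, can never legally hold a
CNAME.

```python
from collections import defaultdict

# Types assumed to be permitted alongside a CNAME in a signed zone.
ALLOWED_WITH_CNAME = {"CNAME", "RRSIG", "NSEC"}


def find_cname_conflicts(records):
    """Map each owner name that has a CNAME plus other data to the
    sorted list of record types that conflict with that CNAME."""
    types_by_name = defaultdict(set)
    for name, rtype in records:
        # Normalise case and any trailing dot so names compare equal.
        types_by_name[name.lower().rstrip(".")].add(rtype.upper())

    conflicts = {}
    for name, types in types_by_name.items():
        if "CNAME" in types:
            extra = types - ALLOWED_WITH_CNAME
            if extra:
                conflicts[name] = sorted(extra)
    return conflicts


# Hypothetical sub-delegated zone whose apex also carries a CNAME.
sample_zone = [
    ("app.example.com.", "SOA"),
    ("app.example.com.", "NS"),
    ("app.example.com.", "CNAME"),
    ("www.app.example.com.", "CNAME"),
    ("www.app.example.com.", "RRSIG"),
]
```

Running `find_cname_conflicts(sample_zone)` flags only the apex name:
a permissive resolver may tolerate that record set, while a strict
one is entitled to reject it.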
 
    | ignoramous wrote:
    | You are not wrong, but by simply avoiding DNSSEC, Slack would
    | not have had the outage they did. Not to mention the drain on
    | eng resources, which perhaps may be even more expensive.
    | 
    | What one can't ignore is the underlying chicken-and-egg
    | problem that DNSSEC must overcome: Not many DNSSEC
    | deployments and hence not much of it has been tested in the
    | real-world, which results in bugs despite the attention of
    | some of the most qualified engs, including the ones running
    | one of the largest nameserver deployments in the world.
    | 
    | https://apenwarr.ca/log/20201227
 
| tptacek wrote:
| Additional discussion, indirectly spurred by this, is here:
| 
| https://news.ycombinator.com/item?id=29381778
| 
| That thread, which is big, is probably the right place to take
| general discussion of DNSSEC itself, though I'll snipe DNSSEC
| here too. :)
 
| [deleted]
 
| teddyh wrote:
| From what I can tell, the problem was not caused by DNSSEC
| directly. It was caused by:
| 
| 1. A bug in Route 53 which caused wildcard records not to work
| with DNSSEC signing. Anyone not using Route 53 would not have had
| any problems with DNSSEC.
| 
| 2. Slack decided to revert the DNSSEC rollout, but botched the
| process _badly_ , effectively locking themselves in the trunk and
| throwing away the key. If they hadn't tried to revert the DNSSEC
| rollout, or if they had been a bit more deliberate and careful
| while doing it, this would not have happened.
 
| jeffbee wrote:
| Seems like an organizational failure, as they got conned by their
| 3PAO into believing that DNSSEC was a requirement for FedRAMP
| moderate when it's not. The disproof of this belief is that
| Google has FedRAMP High (for Google Cloud and Workspace) but does
| not use DNSSEC for google.com.
 
  | goalieca wrote:
  | If you use https everywhere, you will have a server certificate
  | with the hostname embedded in it. This is how TLS knows you're
  | talking to the right server.
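
The hostname check described in the comment above is what a default
TLS client context enforces. A minimal sketch using Python's standard
`ssl` module follows; the helper names `strict_context` and
`peer_dns_names` are hypothetical, and the second function needs
network access to a real server.

```python
import socket
import ssl


def strict_context():
    """Client context that verifies the chain and the hostname."""
    # create_default_context() enables certificate verification and
    # hostname checking for client-side use.
    return ssl.create_default_context()


def peer_dns_names(hostname, port=443, timeout=5.0):
    """Connect over TLS and return the DNS names in the server certificate.

    The handshake raises ssl.SSLCertVerificationError if the certificate
    does not chain to a trusted CA or does not match `hostname`.
    """
    ctx = strict_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]
```

The check happens during the handshake, so a forged DNS answer that
points the client at a different server fails unless that server also
holds a trusted certificate for the requested name.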
 
  | mpyne wrote:
  | The ultimate arbiter of whether a cloud service gets used isn't
  | FedRAMP, it's the Agency Authorizing Official. FedRAMP just
  | makes much of the work reusable. With GCP, you can build
  | something that obeys and uses DNSSEC without needing google.com
  | to participate in DNSSEC.
  | 
  | Google Workspace is a good point though. I know there are many
  | users of it in government... maybe some AOs are fine signing
  | off on it even without the needed security controls, which is
  | an option they have in their discretion with and without
  | FedRAMP.
 
| dsXLII wrote:
| It's always DNS.
 
  | eropple wrote:
  | This is a dirty lie.
  | 
  | Sometimes it's BGP.
 
    | vimda wrote:
    | And sometimes (as in the Facebook outage), it's both!
 
___________________________________________________________________
(page generated 2021-11-29 23:00 UTC)