|
| daper wrote:
| Of the mistakes described, two come from a lack of understanding
| of exactly how DNS works. But I agree it's in fact hard (see [1]).
|
| 1. "This strict DNS spec enforcement will reject a CNAME record
| at the apex of a zone (as per RFC-2181), including the APEX of a
| sub-delegated subdomain. This was the reason that customers using
| VPN providers were disproportionately" - This is non-intuitive
| and many people are surprised by that. You cannot create any
| subdomain (even www.domain.tld) if you created "domain.tld CNAME
| something...". Looks like not every server/resolver enforces that
| restriction.
|
| 2. "based on expert advice, our understanding at the time was
| that DS records at the .com zone were never cached, so pulling it
| from the registrar would cause resolvers to immediately stop
| performing DNSSEC validation." - like any other record, they can
| be cached. DNS also has negative caching (caching of "not found"
| responses). Moreover there are resolvers that allow configuring a
| minimum TTL that can be higher than what your NS servers return
| (like unbound - "cache-min-ttl" option) or can be configured to
| serve stale responses in case of resolution failures after the
| cached data expires [2]. That means returning a TTL of "1s" will
| not work as you expect.
|
| [1] https://blog.powerdns.com/2020/11/27/goodbye-dns-goodbye-pow...
| [2] https://www.isc.org/blogs/2020-serve-stale/
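To make the caching point concrete, here is a minimal sketch of how long a record could plausibly stay in a resolver's cache once a TTL floor and serve-stale are in play. `ResolverPolicy` and `longest_possible_lifetime` are hypothetical names invented for illustration; they are not part of unbound, BIND, or any real resolver API, and the model ignores TTL caps and negative caching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolverPolicy:
    """Hypothetical resolver settings (illustrative only, not a real API)."""
    cache_min_ttl: int = 0       # TTL floor in seconds, in the spirit of unbound's "cache-min-ttl"
    serve_stale_window: int = 0  # extra seconds an expired answer may be served if upstream fails


def longest_possible_lifetime(record_ttl: int, policy: ResolverPolicy,
                              upstream_failing: bool) -> int:
    """Return how many seconds a cached answer may keep being served.

    The authoritative server only suggests a TTL; the resolver may raise it
    to its configured floor, and may keep serving the expired answer while
    it cannot reach the authoritative servers.
    """
    effective_ttl = max(record_ttl, policy.cache_min_ttl)
    if upstream_failing:
        return effective_ttl + policy.serve_stale_window
    return effective_ttl
```

Under these assumptions, a record published with a TTL of 1 second is held for 300 seconds by a resolver whose floor is 300, and for longer still if resolution starts failing while serve-stale is enabled.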
| btown wrote:
| My (basic and conservative) mental model that "in DNS,
| _everything including the lack of presence of a thing_ can be
| cached" is why I'm very cautious before rolling out anything
| from DKIM to DNSSEC. A deep understanding of specifications is
| vital. I'm somewhat surprised an organization of Slack's scale
| didn't have a consultant on the level of "I designed DNSSEC" on
| hand for this.
| belorn wrote:
| DNS is a bit like network engineering, in that simple errors
| tend to have large impacts that prevent trial and
| error. Before working as a sysadmin I thought that doing
| experimental lab setups was something only researchers and
| students did, but when you have an old system up and running,
| it can be quite difficult to get in there and make changes
| unless you are very sure about what you are doing.
|
| Like networking there can also be existing protocol errors
| and plain broken things that have for one reason or another
| been seemingly working for decades without causing a problem.
| Internet flag day is one of those things that pokes at those
| problems, and maybe one day we will see a test for CNAME at
| the apex.
| tptacek wrote:
| It's worth noting that this by itself is a reason not to do
| ambitious security things (and a global PKI is nothing if
| not ambitious) at the layer of DNS. It's an extension of
| the end-to-end argument, or at least of the logic used
| in the Saltzer and Reed paper: because it's difficult and
| error-prone to deploy policy code in the core of the
| network (here: the "conceptual" core of the protocol
| stack), we should work to get that policy further up the
| stack and closer to the applications that actually care
| about that policy.
|
| The Saltzer and Reed paper, if I'm remembering right, even
| calls out security as specifically one of those things you
| don't want to be doing in the middle of the network.
|
| See also: Zero Trust / BeyondCorp.
| dogecoinbase wrote:
| In addition to the other note that DNSSEC is _not_ required for
| FedRAMP certification (it's even discouraged by cloud.gov!
| https://cloud.gov/docs/compliance/domain-standards/ ), this is
| some weirdly intellectually dishonest phrasing (linking to
| tptacek's article Against DNSSEC:
| https://sockpuppet.org/blog/2015/01/15/against-dnssec/ ):
|
| > While we are aware of the debate around the utility of DNSSEC
| among the DNS community, we are still committed to securing Slack
| for our customers.
|
| The argument is specifically that it doesn't provide that
| security. At least it's neat to see actual begging the question
| in the wild, I guess.
| mpyne wrote:
| FedRAMP is designed to provide reusable cybersecurity work
| against the NIST security controls that your Federal agency's
| Authorizing Official deems your Federal IT system must
| implement.
|
| Those security controls come from NIST SP 800-53. Two of them,
| SC-20 and SC-21 (which Slack linked to in its post-mortem),
| effectively seem to me to conspire to require DNSSEC. Both of
| these are included as part of the "Low"
| baseline of security controls, so they are effectively required
| for all Federal IT systems unless your Agency Authorizing
| Official wants to walk on the wild side.
|
| So even if you get a FedRAMP certification, if you do it
| without fully implementing SC-20 and SC-21, that just means
| your customer needs to either convince their Agency Authorizing
| Official to sign off on an ATO despite the missing SC-20 and
| SC-21 security control, convince them to sign off on some sort
| of Plan of Action and Milestones where Slack will commit to fix
| this in the future (which is just kicking the can down the
| road), or somehow manage to implement the same effect
| completely within the customer end without help from Slack. All
| you would have done is to spend a lot of money on FedRAMP
| paperwork without making it appreciably easier for potential
| customers who have to deal with compliance regimes to buy your
| product.
|
| Cloud.gov's argument is valid but all they posted is that they
| don't implement SC-20 or SC-21 for their government customers,
| and that the OMB M-08-23 mandate for DNSSEC is no longer
| operative (not that no other DNSSEC mandate applies). Indeed
| they even explain how their customers should work
| to enable it (presumably by refusing to use the non-DNSSEC
| compliant .app.cloud.gov services and instead using only their
| DNSSEC-compliant custom domains).
|
| FWIW I fully agree with tptacek's arguments against DNSSEC, and
| will note that I recently stopped being able to navigate to
| literally the entire .mil on my Linux host until I disabled
| DNSSEC in systemd, for reasons that are still unclear to me
| even now.
| tylermenezes wrote:
| > intellectually dishonest phrasing
|
| Not everyone agrees with the linked argument. For example, I
| disagree that browsers can't take advantage of DNSSEC, since
| many are using DoH, and the rest of the article reads like
| someone complaining that we need to wait for the perfect
| protocol or nothing at all.
|
| That's the thing about a debate... it's got arguments on both
| sides.
| dogecoinbase wrote:
| It's fine to disagree with the linked argument, but you
| actually have to do so. This is them presupposing that
| "securing Slack for [their] customers" requires DNSSEC --
| it's not engaging with the argument at all.
| tptacek wrote:
| I mean, I agree with you and don't find the language
| disingenuous (I felt like it was more of a tell that the
| people working on this cursed project weren't super read into
| DNSSEC and DNS security in general, which isn't a knock; it's
| a boring thing to keep up with, especially when the best-
| practice answer is so simple --- just don't bother with
| DNSSEC).
|
| But I'd also say that DoH (1) largely obviates any need for
| DNSSEC (the last-mile DNS problem is the only on-the-wire DNS
| security problem that needs solving) and (2) doesn't enable
| DANE in browsers, which is what people are talking about when
| they talk about DNSSEC intersecting with browsers in any way
| other than randomly making sites fall off the Internet.
| Joe8Bit wrote:
| I know we've all collectively accepted that DNSSEC is a terrible,
| complicated blight on the world but I still find it incredible
| that an organisation with Slack's resources and access to
| expertise can't make it work.
| toomuchtodo wrote:
| No tech company is infallible. All of them have outages, some
| lasting hours, even days.
|
| Complex systems can and will fail. Try to do better, of course,
| but let's acknowledge that perfection will always exceed our
| grasp. The world will continue to turn regardless.
|
| One day it might just be your turn to break production.
| tptacek wrote:
| The subtext here isn't that Slack is bad at this (they are
| not), but that DNSSEC is somehow intrinsically unsafe (it
| probably is).
| toomuchtodo wrote:
| I agree with your points about DNSSEC (disclaimer: I have
| not had the pleasure of having to implement it myself in
| infra), but was attempting to communicate that DNSSEC isn't
| the only area of ops that folks get exposed to these sorts
| of unknowns or edge cases, and that no amount of resourcing
| enables you to avoid these issues. For Slack, it was
| DNSSEC. For Roblox, Consul. Facebook/Insta, software
| defined BGP. Akamai, DNS.
|
| Perhaps I did not read the room appropriately. Mea culpa.
| tptacek wrote:
| You say Slack, and I agree, that's telling, but you have to add
| to that _AWS itself_ , which had a DNSSEC bug in its wildcard
| record support as well. Slack and AWS together couldn't make
| this feature work. Further: the open source tooling Slack (like
| most places) relies on for deployment is also DNSSEC-hostile:
| one of their problems is that Terraform's Route53 provider
| doesn't safely disable DNSSEC once enabled. It's a mess
| everywhere you look.
|
| I think another interesting question here is why Slack bothered
| in the first place. As was pointed out on the other DNSSEC
| thread today: practically nobody in the technology industry
| uses DNSSEC in the first place. Presumably, Slack did DNSSEC
| (they don't anymore!) in service of FedRAMP compliance. Why?
| Slack has one of the most popular products in all of computing.
| What bad thing was going to happen if they said "nah, we're
| going to go with Cloud.gov's recommendation and not this
| FedRAMP document"?
| x3n0ph3n3 wrote:
| Because FedRAMP compliance is required for many US federal
| (and now some state) customers, to whom Slack can charge a
| premium.
| vimda wrote:
| Gotta be FedRAMP compliant to do business with the US
| government. Even worse, you have to be FedRAMP compliant to
| work with anyone who works with the US government. From a
| business (if not an engineering) standpoint, there's plenty
| to gain in going through the motions
| tptacek wrote:
| As was pointed out downthread, there are tech companies
| that are "more" FedRAMP compliant (FedRAMP "High") without
| DNSSEC support.
|
| (Kenn White points out on Twitter that some of this may be
| due to grandfathering --- though, the FedRAMP DNSSEC
| requirement is pretty old.)
| mpyne wrote:
| > Presumably, Slack did DNSSEC (they don't anymore!) in
| service of FedRAMP compliance. Why? Slack has one of the most
| popular products in all of computing. What bad thing was
| going to happen if they said "nah, we're going to go with
| Cloud.gov's recommendation and not this FedRAMP document"?
|
| As just one example, it's tremendously difficult, if not
| impossible, to sell your cloud-based SaaS to Navy customers
| if you have open FedRAMP compliance issues that you aren't at
| least working to address.
|
| I say "compliance" instead of "security" for a reason as
| well, as "compliance" truly runs the show in Navy
| cybersecurity. And if you want to sell to that market (and
| it's hardly just Navy who runs this way), it's easier to
| check the checkboxes than it is to argue about whether NIST
| is right or cloud.gov is right.
| technion wrote:
| I know HN has collectively accepted that, but every time I'm
| associated with an organisation that pays for a penetration
| test it comes in as a high risk finding, so much so that I've
| given in to deploying it to avoid sitting with non-technical
| managers doing the "here's why I disagree" all over again.
| Outside of this group I definitely feel like I'm on my own in
| that view.
| belorn wrote:
| _"It turned out that some resolvers become more strict when
| DNSSEC signing is enabled at the authoritative name servers,
| even while signing was not enabled at the root name servers
| (i.e. before DS records were published to COM nameservers).
| This strict DNS spec enforcement will reject a CNAME record at
| the apex of a zone (as per RFC-2181), including the APEX of a
| sub-delegated subdomain"_
|
| Slack's second attempt wasn't a DNSSEC problem. Slack depended
| on a permissive fallback of resolvers when encountering a plain
| DNS protocol error. It is similar to how some websites in the
| past relied on permissive browser implementations when facing
| broken HTML/JS/CSS. Slack fixed their broken DNS as a result of
| this.
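The protocol error in question can be checked for mechanically. Below is a minimal sketch of such a check, assuming a zone is given as simple `(owner, type)` pairs. `cname_conflicts` is a hypothetical helper written for illustration, not part of any DNS library. It assumes the usual rule that a CNAME may not share its owner name with other data, apart from the DNSSEC `RRSIG` and `NSEC` records that a signed zone needs there. Because a zone apex always carries `SOA` and `NS`, a CNAME at the apex always conflicts.

```python
from collections import defaultdict

# Record types assumed here to be allowed next to a CNAME in a signed zone.
DNSSEC_COMPANION_TYPES = {"RRSIG", "NSEC"}


def cname_conflicts(records):
    """Find owner names where a CNAME coexists with other record types.

    `records` is an iterable of (owner_name, record_type) pairs.
    Returns a dict mapping each offending owner name (lowercased, without
    the trailing dot) to a sorted list of the conflicting record types.
    """
    types_by_owner = defaultdict(set)
    for owner, rtype in records:
        types_by_owner[owner.lower().rstrip(".")].add(rtype.upper())

    conflicts = {}
    for owner, types in types_by_owner.items():
        if "CNAME" in types:
            others = types - {"CNAME"} - DNSSEC_COMPANION_TYPES
            if others:
                conflicts[owner] = sorted(others)
    return conflicts
```

For example, a zone with `SOA`, `NS`, and `CNAME` all at `example.com.` is reported as a conflict, while `www.example.com.` with only `CNAME` and `RRSIG` passes.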
|
| Slack's third attempt was not the fault of Slack but rather a
| software bug at Amazon. I would make the argument that Amazon's
| primary product isn't DNS services, but they did fix their
| bug after this.
|
| The general conclusion I get from the article is not that
| DNSSEC is broken, nor that it is too complicated. It is that when
| making changes to your core infrastructure to make it more
| secure, bugs that may have been lying dormant might pop up and
| bite. I am sure some people have had that experience in domains
| outside of DNS.
| ignoramous wrote:
| You are not wrong, but by simply avoiding DNSSEC, Slack would
| have not had the outage they did. Not to mention the drain on
| eng resources, which perhaps may be even more expensive.
|
| What one can't ignore is the underlying chicken-and-egg
| problem that DNSSEC must overcome: Not many DNSSEC
| deployments and hence not much of it has been tested in the
| real-world, which results in bugs despite the attention of
| some of the most qualified engs, including the ones running
| one of the largest nameserver deployments in the world.
|
| https://apenwarr.ca/log/20201227
| tptacek wrote:
| Additional discussion, indirectly spurred by this, is here:
|
| https://news.ycombinator.com/item?id=29381778
|
| That thread, which is big, is probably the right place to take
| general discussion of DNSSEC itself, though I'll snipe DNSSEC
| here too. :)
| [deleted]
| teddyh wrote:
| From what I can tell, the problem was not caused by DNSSEC
| directly. It was caused by:
|
| 1. A bug in Route 53 which caused wildcard records not to work
| with DNSSEC signing. Anyone not using Route 53 would not have had
| any problems with DNSSEC.
|
| 2. Slack decided to revert the DNSSEC rollout, but botched the
| process _badly_ , effectively locking themselves in the trunk and
| throwing away the key. If they hadn't tried to revert the DNSSEC
| rollout, or if they had been a bit more deliberate and careful
| while doing it, this would not have happened.
| jeffbee wrote:
| Seems like an organizational failure, as they got conned by their
| 3PAO into believing that DNSSEC was a requirement for FedRAMP
| moderate when it's not. The disproof of this belief is that
| Google has FedRAMP High (for Google Cloud and Workspace) but does
| not use DNSSEC for google.com.
| goalieca wrote:
| If you use https everywhere, you will have a server certificate
| with the hostname embedded in it. This is how TLS knows you're
| talking to the right server.
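As a minimal sketch of that check on the client side, the following uses Python's standard `ssl` module. `strict_client_context` is a name chosen for illustration. The two settings it applies are already the defaults of `ssl.create_default_context()`, and are written out only to show which knobs tie the certificate to the hostname.

```python
import ssl


def strict_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that verifies the server's identity."""
    context = ssl.create_default_context()
    # Reject the connection unless the certificate matches the requested hostname.
    context.check_hostname = True
    # Require a certificate that chains to a trusted CA.
    context.verify_mode = ssl.CERT_REQUIRED
    return context
```

The hostname to compare against is the one passed as `server_hostname` to `context.wrap_socket(...)`. A forged DNS answer alone therefore does not get an attacker past the TLS handshake unless they also hold a valid certificate for that name.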
| mpyne wrote:
| The ultimate arbiter of whether a cloud service gets used isn't
| FedRAMP, it's the Agency Authorizing Official. FedRAMP just
| makes much of the work reusable. With GCP, you can build
| something that obeys and uses DNSSEC without needing google.com
| to participate in DNSSEC.
|
| Google Workspace is a good point though. I know there are many
| users of it in government... maybe some AOs are fine signing
| off on it even without the needed security controls, which is
| an option they have in their discretion with and without
| FedRAMP.
| dsXLII wrote:
| It's always DNS.
| eropple wrote:
| This is a dirty lie.
|
| Sometimes it's BGP.
| vimda wrote:
| And sometimes (as in the Facebook outage), it's both!
___________________________________________________________________
(page generated 2021-11-29 23:00 UTC) |