
	[HN Gopher] The case of the recursive resolvers: What happened d...
___________________________________________________________________
The case of the recursive resolvers: What happened during Slack's DNSSEC rollout
Author : usrme
Score  : 49 points
Date   : 2021-11-29 11:36 UTC (11 hours ago)
	web link (slack.engineering)
	\| daper wrote: \| Of the mistakes described, two come from a lack of understanding \| of how exactly DNS works. But I agree it is in fact hard (see [1]). \| \| 1. "This strict DNS spec enforcement will reject a CNAME record \| at the apex of a zone (as per RFC-2181), including the APEX of a \| sub-delegated subdomain. This was the reason that customers using \| VPN providers were disproportionately" - This is non-intuitive \| and many people are surprised by it. You cannot create any \| subdomain (even www.domain.tld) if you created "domain.tld CNAME \| something...". Looks like not every server/resolver enforces that \| restriction. \| \| 2. "based on expert advice, our understanding at the time was \| that DS records at the .com zone were never cached, so pulling it \| from the registrar would cause resolvers to immediately stop \| performing DNSSEC validation." - like any other record, they can \| be cached. DNS also has negative caching (caching of "not found" \| responses). Moreover, there are resolvers that allow configuring a \| minimum TTL that can be higher than what your NS servers return \| (like unbound's "cache-min-ttl" option) or that can be configured to \| serve stale responses in case of resolution failures after the \| cached data expires [2]. That means returning a TTL of "1s" will \| not work as you expect. \| \| [1] https://blog.powerdns.com/2020/11/27/goodbye-dns-goodbye- \| pow... [2] https://www.isc.org/blogs/2020-serve-stale/ \| btown wrote: \| My (basic and conservative) mental model that "in DNS, \| _everything including the lack of presence of a thing_ can be \| cached" is why I'm very cautious before rolling out anything \| from DKIM to DNSSEC. A deep understanding of specifications is \| vital. I'm somewhat surprised an organization of Slack's scale \| didn't have a consultant on the level of "I designed DNSSEC" on \| hand for this. 
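The caching behaviour described above (minimum-TTL clamping plus serve-stale) can be sketched roughly as follows. This is a simplified model under stated assumptions, not any resolver's actual implementation: `effective_ttl`, `cache_state`, and `stale_window` are hypothetical names, and only the two clamp parameters are modelled on unbound's `cache-min-ttl` and `cache-max-ttl` options. It only illustrates why publishing a 1-second TTL does not guarantee that a record leaves caches after 1 second.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    stored_at: float  # time the answer was cached, in seconds
    ttl: float        # TTL the resolver actually applies (after clamping)


def effective_ttl(record_ttl: float,
                  cache_min_ttl: float = 0,
                  cache_max_ttl: float = 86400) -> float:
    """Clamp the TTL published by the authoritative server to resolver policy.

    Assumption: a simple min/max clamp, loosely modelled on unbound's
    cache-min-ttl / cache-max-ttl settings.
    """
    return max(cache_min_ttl, min(record_ttl, cache_max_ttl))


def cache_state(entry: CacheEntry, now: float, upstream_ok: bool,
                stale_window: float = 0) -> str:
    """Classify a cached answer as 'fresh', 'stale' (still served), or 'miss'.

    Assumption: stale answers are served only while upstream resolution
    is failing and only within `stale_window` seconds after expiry.
    """
    age = now - entry.stored_at
    if age < entry.ttl:
        return "fresh"
    if not upstream_ok and age < entry.ttl + stale_window:
        return "stale"
    return "miss"


if __name__ == "__main__":
    # The zone publishes TTL=1s, but this resolver enforces a 300s minimum.
    ttl = effective_ttl(1, cache_min_ttl=300)
    entry = CacheEntry(stored_at=0, ttl=ttl)
    print(ttl)
    print(cache_state(entry, now=10, upstream_ok=True))
    # After expiry, a failing upstream can keep the old answer alive.
    print(cache_state(entry, now=400, upstream_ok=False, stale_window=3600))
```

The same reasoning applies to a removed DS record and to negative answers: the resolver's policy, not just the zone's published TTL, decides when the old state is finally forgotten.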
\| belorn wrote: \| DNS is a bit like network engineering, in that simple errors \| tend to have large impacts that prevent trial and \| error. Before working as a sysadmin I thought that doing \| experimental lab setups was something only researchers and \| students did, but when you have an old system up and running, \| it can be quite difficult to get in there and make changes \| unless you are very sure about what you are doing. \| \| As in networking, there can also be existing protocol errors \| and plain broken things that have for one reason or another \| been seemingly working for decades without causing a problem. \| Internet flag day is one of those things that pokes at those \| problems, and maybe one day we will see a test for CNAME at \| the apex. \| tptacek wrote: \| It's worth noting that this by itself is a reason not to do \| ambitious security things (and a global PKI is nothing if \| not ambitious) at the layer of DNS. It's an extension of \| the end-to-end argument, or at least of the logic used \| in the Saltzer and Reed paper: because it's difficult and \| error-prone to deploy policy code in the core of the \| network (here: the "conceptual" core of the protocol \| stack), we should work to get that policy further up the \| stack and closer to the applications that actually care \| about that policy. \| \| The Saltzer and Reed paper, if I'm remembering right, even \| calls out security as specifically one of those things you \| don't want to be doing in the middle of the network. \| \| See also: Zero Trust / BeyondCorp. \| dogecoinbase wrote: \| In addition to the other note that DNSSEC is _not_ required for \| FedRAMP certification (it's even discouraged by cloud.gov! 
\| https://cloud.gov/docs/compliance/domain-standards/ ), this is \| some weirdly intellectually dishonest phrasing (linking to \| tptacek's article Against DNSSEC: \| https://sockpuppet.org/blog/2015/01/15/against-dnssec/ ): \| \| > While we are aware of the debate around the utility of DNSSEC \| among the DNS community, we are still committed to securing Slack \| for our customers. \| \| The argument is specifically that it doesn't provide that \| security. At least it's neat to see actual begging the question \| in the wild, I guess. \| mpyne wrote: \| FedRAMP is designed to provide reusable cybersecurity work \| against the NIST security controls that your Federal agency's \| Authorizing Official deems your Federal IT system must \| implement. \| \| Those security controls come from a document NIST SP 800-53, 2 \| of which (that Slack linked to in the linked post-mortem), \| SC-20 and SC-21, effectively seem to me to conspire to require \| DNSSEC. Both of these are included as part of the "Low" \| baseline of security controls, so they are effectively required \| for all Federal IT systems unless your Agency Authorizing \| Official wants to walk on the wild side. \| \| So even if you get a FedRAMP certification, if you do it \| without fully implementing SC-20 and SC-21, that just means \| your customer needs to either convince their Agency Authorizing \| Official to sign off on an ATO despite the missing SC-20 and \| SC-21 security control, convince them to sign off on some sort \| of Plan of Action and Milestones where Slack will commit to fix \| this in the future (which is just kicking the can down the \| road), or somehow manage to implement the same effect \| completely within the customer end without help from Slack. All \| you would have done is to spend a lot of money on FedRAMP \| paperwork without making it appreciably easier for potential \| customers who have to deal with compliance regimes to buy your \| product. 
\| \| Cloud.gov's argument is valid but all they posted is that they \| don't implement SC-20 or SC-21 for their government customers, \| and that the OMB M-08-23 mandate for DNSSEC is no longer \| operative (not that no other DNSSEC mandate applies). Indeed \| they even give explanation for how their customers should work \| to enable it (presumably by refusing to use the non-DNSSEC \| compliant .app.cloud.gov services and instead using only their \| DNSSEC-compliant custom domains). \| \| FWIW I fully agree with tptacek's arguments against DNSSEC, and \| will note that I recently stopped being able to navigate to \| literally the entire .mil on my Linux host until I disabled \| DNSSEC in systemd, for reasons that are still unclear to me \| even now. \| tylermenezes wrote: \| > intellectually dishonest phrasing \| \| Not everyone agrees with the linked argument. For example, I \| disagree that browsers can't take advantage of DNSSEC, since \| many are using DoH, and the rest of the article reads like \| someone complaining that we need to wait for the perfect \| protocol or nothing at all. \| \| That's the thing about a debate... it's got arguments on both \| sides. \| dogecoinbase wrote: \| It's fine to disagree with the linked argument, but you \| actually have to do so. This is them presupposing that \| "securing Slack for [their] customers" requires DNSSEC -- \| it's not engaging with the argument at all. \| tptacek wrote: \| I mean, I agree with you and don't find the language \| disingenuous (I felt like it was more of a tell that the \| people working on this cursed project weren't super read into \| DNSSEC and DNS security in general, which isn't a knock; it's \| a boring thing to keep up with, especially when the best- \| practice answer is so simple --- just don't bother with \| DNSSEC). 
\| \| But I'd also say that DoH (1) largely obviates any need for \| DNSSEC (the last-mile DNS problem is the only on-the-wire DNS \| security problem that needs solving) and (2) doesn't enable \| DANE in browsers, which is what people are talking about when \| they talk about DNSSEC intersecting with browsers in any way \| other than randomly making sites fall off the Internet. \| Joe8Bit wrote: \| I know we've all collectively accepted that DNSSEC is a terrible, \| complicated blight on the world, but I still find it incredible \| that an organisation with Slack's resources and access to \| expertise can't make it work. \| toomuchtodo wrote: \| No tech company is infallible. All of them have outages, some \| lasting hours, even days. \| \| Complex systems can and will fail. Try to do better, of course, \| but let's acknowledge that perfection will always exceed our \| grasp. The world will continue to turn regardless. \| \| One day it might just be your turn to break production. \| tptacek wrote: \| The subtext here isn't that Slack is bad at this (they are \| not), but that DNSSEC is somehow intrinsically unsafe (it \| probably is). \| toomuchtodo wrote: \| I agree with your points about DNSSEC (disclaimer: I have \| not had the pleasure of having to implement it myself in \| infra), but was attempting to communicate that DNSSEC isn't \| the only area of ops where folks get exposed to these sorts \| of unknowns or edge cases, and that no amount of resourcing \| enables you to avoid these issues. For Slack, it was \| DNSSEC. For Roblox, Consul. Facebook/Insta, software \| defined BGP. Akamai, DNS. \| \| Perhaps I did not read the room appropriately. Mea culpa. \| tptacek wrote: \| You say Slack, and I agree, that's telling, but you have to add \| to that _AWS itself_ , which had a DNSSEC bug in its wildcard \| record support as well. Slack and AWS together couldn't make \| this feature work. 
Further: the open source tooling Slack (like \| most places) relies on for deployment is also DNSSEC-hostile: \| one of their problems is that Terraform's Route53 provider \| doesn't safely disable DNSSEC once enabled. It's a mess \| everywhere you look. \| \| I think another interesting question here is why Slack bothered \| in the first place. As was pointed out on the other DNSSEC \| thread today: practically nobody in the technology industry \| uses DNSSEC in the first place. Presumably, Slack did DNSSEC \| (they don't anymore!) in service of FedRAMP compliance. Why? \| Slack has one of the most popular products in all of computing. \| What bad thing was going to happen if they said "nah, we're \| going to go with Cloud.gov's recommendation and not this \| FedRAMP document"? \| x3n0ph3n3 wrote: \| Because FedRAMP compliance is required for many US federal \| (and now some state) customers, which Slack can charge a \| premium. \| vimda wrote: \| Gotta be Fedramp compliant to do business with the US \| government. Even worse, you have to be Fedramp compliant to \| work with anyone who works with the US government. From a \| business (if not an engineering) standpoint, there's plenty \| to gain in going through the motions \| tptacek wrote: \| As was pointed out downthread, there are tech companies \| that are "more" FedRAMP compliant (FedRAMP "High") without \| DNSSEC support. \| \| (Kenn White points out on Twitter that some of this may be \| due to grandfathering --- though, the FedRAMP DNSSEC \| requirement is pretty old.) \| mpyne wrote: \| > Presumably, Slack did DNSSEC (they don't anymore!) in \| service of FedRAMP compliance. Why? Slack has one of the most \| popular products in all of computing. What bad thing was \| going to happen if they said "nah, we're going to go with \| Cloud.gov's recommendation and not this FedRAMP document"? 
\| \| As just one example, it's tremendously difficult, if not \| impossible, to sell your cloud-based SaaS to Navy customers \| if you have open FedRAMP compliance issues that you aren't at \| least working to address. \| \| I say "compliance" instead of "security" for a reason as \| well, as "compliance" truly runs the show in Navy \| cybersecurity. And if you want to sell to that market (and \| it's hardly just Navy who runs this way), it's easier to \| check the checkboxes than it is to argue about whether NIST \| is right or cloud.gov is right. \| technion wrote: \| I know HN has collectively accepted that, but every time I'm \| associated with an organisation that pays for a penetration \| test, the lack of DNSSEC comes in as a high-risk finding, so much so that I've \| given in to deploying it to avoid sitting with non-technical \| managers doing the "here's why I disagree" all over again. \| Outside of this group I definitely feel like I'm on my own in \| that view. \| belorn wrote: \| _" It turned out that some resolvers become more strict when \| DNSSEC signing is enabled at the authoritative name servers, \| even while signing was not enabled at the root name servers \| (i.e. before DS records were published to COM nameservers). \| This strict DNS spec enforcement will reject a CNAME record at \| the apex of a zone (as per RFC-2181), including the APEX of a \| sub-delegated subdomain"_ \| \| Slack's second attempt wasn't a DNSSEC problem. Slack depended \| on a permissive fallback of resolvers when encountering a plain \| DNS protocol error. It is similar to how some websites in the \| past relied on permissive browser implementations when facing \| broken HTML/JS/CSS. Slack fixed their broken DNS as a result of \| this. \| \| Slack's third attempt was not the fault of Slack but rather a \| software bug at Amazon. I would make the argument that Amazon's \| primary product isn't DNS services, but they did fix their \| bug after this. 
\| \| The general conclusion I get from the article is not that \| DNSSEC is broken, nor that it is too complicated. It is that when \| making changes to your core infrastructure to make it more \| secure, bugs that may have been lying dormant might pop up and \| bite. I am sure some people have had that experience in domains \| outside of DNS. \| ignoramous wrote: \| You are not wrong, but by simply avoiding DNSSEC, Slack would \| not have had the outage they did. Not to mention the drain on \| eng resources, which perhaps may be even more expensive. \| \| What one can't ignore is the underlying chicken-and-egg \| problem that DNSSEC must overcome: there are not many DNSSEC \| deployments, and hence not much of it has been tested in the \| real world, which results in bugs despite the attention of \| some of the most qualified engs, including the ones running \| one of the largest nameserver deployments in the world. \| \| https://apenwarr.ca/log/20201227 \| tptacek wrote: \| Additional discussion, indirectly spurred by this, is here: \| \| https://news.ycombinator.com/item?id=29381778 \| \| That thread, which is big, is probably the right place to take \| general discussion of DNSSEC itself, though I'll snipe DNSSEC \| here too. :) \| [deleted] \| teddyh wrote: \| From what I can tell, the problem was not caused by DNSSEC \| directly. It was caused by: \| \| 1. A bug in Route 53 which caused wildcard records not to work \| with DNSSEC signing. Anyone not using Route 53 would not have had \| any problems with DNSSEC. \| \| 2. Slack decided to revert the DNSSEC rollout, but botched the \| process _badly_ , effectively locking themselves in the trunk and \| throwing away the key. If they hadn't tried to revert the DNSSEC \| rollout, or if they had been a bit more deliberate and careful \| while doing it, this would not have happened. 
\| jeffbee wrote: \| Seems like an organizational failure, as they got conned by their \| 3PAO into believing that DNSSEC was a requirement for FedRAMP \| moderate when it's not. The disproof of this belief is that \| Google has FedRAMP High (for Google Cloud and Workspace) but does \| not use DNSSEC for google.com. \| goalieca wrote: \| If you use https everywhere, you will have a server certificate \| with the hostname embedded in it. This is how TLS knows you're \| talking to the right server. \| mpyne wrote: \| The ultimate arbiter of whether a cloud service gets used isn't \| FedRAMP, it's the Agency Authorizing Official. FedRAMP just \| makes much of the work reusable. With GCP, you can build \| something that obeys and uses DNSSEC without needing google.com \| to participate in DNSSEC. \| \| Google Workspace is a good point though. I know there are many \| users of it in government... maybe some AOs are fine signing \| off on it even without the needed security controls, which is \| an option they have in their discretion with and without \| FedRAMP. \| dsXLII wrote: \| It's always DNS. \| eropple wrote: \| This is a dirty lie. \| \| Sometimes it's BGP. \| vimda wrote: \| And sometimes (as in the Facebook outage), it's both! ___________________________________________________________________ (page generated 2021-11-29 23:00 UTC)