proxy70

	[HN Gopher] Static analysis at GitHub ___________________________________________________________________ Static analysis at GitHub Author : trptcolin Score : 132 points Date : 2022-03-30 13:27 UTC (9 hours ago)
	web link (cacm.acm.org)
	w3m dump (cacm.acm.org)
	\| evmar wrote: \| I contributed to a similar system used within Google (partially \| open source at kythe.io), that took the very different approach \| of integrating with the language-native toolchain for each \| language. \| \| As this article describes, doing this requires per-language \| integrations and also effectively being able to "run the build" \| for any given code (because e.g. the C++ header search path can \| vary on a per-source-file basis), which is untenable for a \| codebase as large and varied as GitHub's. However, if you can \| make it work, you get the benefit of having the compiler's \| understanding of the semantics of the code, which is especially \| finicky in complex languages like C++ or, say, Rust. \| \| For example, if you look at this[1] method call it refers to a \| symbol generated by a chain of macros, but the browser is still \| able to point you at the definition of it. \| \| It's an interesting tradeoff to make: the GitHub approach likely \| doesn't handle corner cases like the above but it makes up for it \| in broad applicability and performance. I recall an IDE developer \| once telling me they made a similar tradeoff in code completion, \| in that it's better DX to pop up completions quickly even if \| they're "only" 99% correct. \| \| (To be clear, I absolutely think the approach taken in the \| article was the right one for the domain they're working in, I \| was just contrasting it against my experience in a similar \| problem where we took a very different approach.) \| \| [1] \| https://source.chromium.org/chromium/chromium/src/+/main:v8/... \| dcreager wrote: \| Note that this article describes our implementation of "search- \| based" or "ctags-like" Code Navigation, which definitely has \| the imprecision that you describe. We've also been working over \| the previous ~year on a framework called Stack Graphs [1,2,3], \| which lets us tackle "precise" Code Navigation while still \| having the zero-config and incremental aspects that are \| described in the paper. \| \| The build-based approach that you describe is also used by the \| Language Server Protocol (LSP) ecosystem. You've summarized the \| tradeoffs quite well! I've described a bit more about why we \| decided against a build-based/LSP approach here [4]. One of the \| biggest deciding factors is that at our scale, incremental \| processing is an absolute necessity, not a nice-to-have. \| \| [1] https://github.blog/2021-12-09-introducing-stack-graphs/ \| \| [2] https://dcreager.net/talks/2021-strange-loop/ \| \| [3] https://news.ycombinator.com/item?id=29500602 \| \| [4] https://news.ycombinator.com/item?id=29501824 \| evmar wrote: \| I read about stack graphs before, it sounds interesting! \| \| I think they help, but ultimately I expect you need a \| compiler solve the absolute madness of the totality of C++. \| For example I think getting argument-dependent lookup right \| in the presence of 'auto' requires type information? And \| there are other categories of things (like header search \| paths) where I think you are forced to involve the build \| system too. \| dmoy wrote: \| Yup, it is probably fair to say that C++ accounts for like \| 50% of the complexity of Kythe at Google. Or certainly it \| feels like it. \| \| And it is also worth noting that Kythe goes a bit deeper \| than what LSP can accomplish. In particular Kythe is built \| around a sort of two-layer graph, where it separates the \| physical code/line representation from a more abstract \| semantic graph. This allows us to accomplish some things \| that are very difficult to do in LSP. \| \| Finally, Kythe at least internally has a big reliance on a \| unified build system (Blaze, or Bazel). It becomes rapidly \| more difficult to do when you have to hook in N different \| build systems up front, which is why search-based \| references are so appealing. Build integration is hard. \| nerdponx wrote: \| Has Tree Sitter been useful to projects like this? Does it \| have promise to be useful in the future? It seems to be \| gaining a lot of adoption among Neovim users and plugin \| developers, but not really anywhere else. I'm curious if \| that's because of lack of familiarity, or because it's \| technically deficient somehow. \| [deleted] \| beliu wrote: \| Sourcegraph CTO here. It's interesting to read about GitHub's \| approach and how it contrasts with the approach we've taken at \| Sourcegraph. One of the key tradeoffs the article highlights is \| GitHub's decision to take the "shallow-but-wide" approach to code \| navigation, which has enabled them to provide some level of code \| navigation for most open-source repositories on GitHub, but at \| the expense of precision/accuracy (i.e., the system can't \| necessarily differentiate between different symbols with the same \| name). \| \| Sourcegraph decided early on to take the opposite approach, \| favoring precision and accuracy over supporting every public \| codebase. Part of the reason why is that we aren't a code host \| that hosts millions of open-source repositories, so we didn't \| feel the need to support all of those at once. Another big reason \| is we heard from our users and customers that code navigation \| accuracy was critical for exploring their private code and \| enabling them to stay in flow (inaccurate results would break the \| train of thought because you'd have to actively think about how \| to navigate to the referenced symbol). We actually built out a \| language-agnostic search-based code navigation, but increasingly \| user feedback has driven us to adopt a more precise model, based \| at first on our own protocol (https://srclib.org) and also the \| LSIF protocol open-sourced by Microsoft that now enables code \| navigation for many popular editor extensions. \| \| This is not to say that GitHub's approach is wrong, but more to \| say that it's interesting how different goals and constraints \| have led to systems that are quite different despite tackling the \| same general problem. GitHub aiming to provide some level of \| navigation to every repository on GitHub, and Sourcegraph aiming \| to provide best-in-class navigation for private codebases and \| dependencies. \| \| (Btw, hats off to the GitHub team for open-sourcing tree-sitter, \| a great library which we've incorporated into parts of our stack. \| We actually hosted the creator of tree-sitter, Max Brunsfeld, on \| our podcast awhile back and it was a really fun and insightful \| conversation if people are interested in hearing some of the \| backstory of tree-sitter: \| https://about.sourcegraph.com/podcast/max-brunsfeld.) \| mistrial9 wrote: \| figure 2 is repeated for some reason? \| spatulon wrote: \| That's odd. The error was not present in the original \| publication of this article in ACM's Queue magazine: \| https://queue.acm.org/detail.cfm?id=3487022 \| marceloabsousa wrote: \| This article more about parsing at scale than static analysis at \| scale. \| dcreager wrote: \| Parsing is definitely a big part of it, and it's a fair point \| that for search-based Code Navigation, we don't have to do any \| real heavy lifting on the analysis side. That said, I think the \| article describes our non-functional requirements well (zero- \| config, incremental, language-agnostic). It's those non- \| functional requirements which are most important for the "at \| scale" part. I'd go so far as to suggest that any static \| analysis implementation that can't meet those requirements \| would be nigh impossible for us to roll out across the entire \| GitHub corpus. \| miohtama wrote: \| This is Big Code ___________________________________________________________________ (page generated 2022-03-30 23:01 UTC)