[HN Gopher] Static analysis at GitHub
___________________________________________________________________
 
Static analysis at GitHub
 
Author : trptcolin
Score  : 132 points
Date   : 2022-03-30 13:27 UTC (9 hours ago)
 
web link (cacm.acm.org)
w3m dump (cacm.acm.org)
 
| evmar wrote:
| I contributed to a similar system used within Google (partially
| open source at kythe.io), that took the very different approach
| of integrating with the language-native toolchain for each
| language.
| 
| As this article describes, doing this requires per-language
| integrations and also effectively being able to "run the build"
| for any given code (because e.g. the C++ header search path can
| vary on a per-source-file basis), which is untenable for a
| codebase as large and varied as GitHub's. However, if you can
| make it work, you get the benefit of having the compiler's
| understanding of the semantics of the code, which is especially
| finicky in complex languages like C++ or, say, Rust.
| 
| For example, if you look at this[1] method call it refers to a
| symbol generated by a chain of macros, but the browser is still
| able to point you at the definition of it.
| 
| It's an interesting tradeoff to make: the GitHub approach likely
| doesn't handle corner cases like the above but it makes up for it
| in broad applicability and performance. I recall an IDE developer
| once telling me they made a similar tradeoff in code completion,
| in that it's better DX to pop up completions quickly even if
| they're "only" 99% correct.
| 
| (To be clear, I absolutely think the approach taken in the
| article was the right one for the domain they're working in, I
| was just contrasting it against my experience in a similar
| problem where we took a very different approach.)
| 
| [1]
| https://source.chromium.org/chromium/chromium/src/+/main:v8/...
 
  | dcreager wrote:
  | Note that this article describes our implementation of "search-
  | based" or "ctags-like" Code Navigation, which definitely has
  | the imprecision that you describe. We've also been working over
  | the previous ~year on a framework called Stack Graphs [1,2,3],
  | which lets us tackle "precise" Code Navigation while still
  | having the zero-config and incremental aspects that are
  | described in the paper.
  | 
  | The build-based approach that you describe is also used by the
  | Language Server Protocol (LSP) ecosystem. You've summarized the
  | tradeoffs quite well! I've described a bit more about why we
  | decided against a build-based/LSP approach here [4]. One of the
  | biggest deciding factors is that at our scale, incremental
  | processing is an absolute necessity, not a nice-to-have.
  | 
  | [1] https://github.blog/2021-12-09-introducing-stack-graphs/
  | 
  | [2] https://dcreager.net/talks/2021-strange-loop/
  | 
  | [3] https://news.ycombinator.com/item?id=29500602
  | 
  | [4] https://news.ycombinator.com/item?id=29501824
 
    | evmar wrote:
    | I read about stack graphs before, it sounds interesting!
    | 
    | I think they help, but ultimately I expect you need a
    | compiler solve the absolute madness of the totality of C++.
    | For example I think getting argument-dependent lookup right
    | in the presence of 'auto' requires type information? And
    | there are other categories of things (like header search
    | paths) where I think you are forced to involve the build
    | system too.
 
      | dmoy wrote:
      | Yup, it is probably fair to say that C++ accounts for like
      | 50% of the complexity of Kythe at Google. Or certainly it
      | feels like it.
      | 
      | And it is also worth noting that Kythe goes a bit deeper
      | than what LSP can accomplish. In particular Kythe is built
      | around a sort of two-layer graph, where it separates the
      | physical code/line representation from a more abstract
      | semantic graph. This allows us to accomplish some things
      | that are very difficult to do in LSP.
      | 
      | Finally, Kythe at least internally has a big reliance on a
      | unified build system (Blaze, or Bazel). It becomes rapidly
      | more difficult to do when you have to hook in N different
      | build systems up front, which is why search-based
      | references are so appealing. Build integration is hard.
 
    | nerdponx wrote:
    | Has Tree Sitter been useful to projects like this? Does it
    | have promise to be useful in the future? It seems to be
    | gaining a lot of adoption among Neovim users and plugin
    | developers, but not really anywhere else. I'm curious if
    | that's because of lack of familiarity, or because it's
    | technically deficient somehow.
 
| [deleted]
 
| beliu wrote:
| Sourcegraph CTO here. It's interesting to read about GitHub's
| approach and how it contrasts with the approach we've taken at
| Sourcegraph. One of the key tradeoffs the article highlights is
| GitHub's decision to take the "shallow-but-wide" approach to code
| navigation, which has enabled them to provide some level of code
| navigation for most open-source repositories on GitHub, but at
| the expense of precision/accuracy (i.e., the system can't
| necessarily differentiate between different symbols with the same
| name).
| 
| Sourcegraph decided early on to take the opposite approach,
| favoring precision and accuracy over supporting every public
| codebase. Part of the reason why is that we aren't a code host
| that hosts millions of open-source repositories, so we didn't
| feel the need to support all of those at once. Another big reason
| is we heard from our users and customers that code navigation
| accuracy was critical for exploring their private code and
| enabling them to stay in flow (inaccurate results would break the
| train of thought because you'd have to actively think about how
| to navigate to the referenced symbol). We actually built out a
| language-agnostic search-based code navigation, but increasingly
| user feedback has driven us to adopt a more precise model, based
| at first on our own protocol (https://srclib.org) and also the
| LSIF protocol open-sourced by Microsoft that now enables code
| navigation for many popular editor extensions.
| 
| This is not to say that GitHub's approach is wrong, but more to
| say that it's interesting how different goals and constraints
| have led to systems that are quite different despite tackling the
| same general problem. GitHub aiming to provide some level of
| navigation to every repository on GitHub, and Sourcegraph aiming
| to provide best-in-class navigation for private codebases and
| dependencies.
| 
| (Btw, hats off to the GitHub team for open-sourcing tree-sitter,
| a great library which we've incorporated into parts of our stack.
| We actually hosted the creator of tree-sitter, Max Brunsfeld, on
| our podcast awhile back and it was a really fun and insightful
| conversation if people are interested in hearing some of the
| backstory of tree-sitter:
| https://about.sourcegraph.com/podcast/max-brunsfeld.)
 
| mistrial9 wrote:
| figure 2 is repeated for some reason?
 
  | spatulon wrote:
  | That's odd. The error was not present in the original
  | publication of this article in ACM's Queue magazine:
  | https://queue.acm.org/detail.cfm?id=3487022
 
| marceloabsousa wrote:
| This article more about parsing at scale than static analysis at
| scale.
 
  | dcreager wrote:
  | Parsing is definitely a big part of it, and it's a fair point
  | that for search-based Code Navigation, we don't have to do any
  | real heavy lifting on the analysis side. That said, I think the
  | article describes our non-functional requirements well (zero-
  | config, incremental, language-agnostic). It's those non-
  | functional requirements which are most important for the "at
  | scale" part. I'd go so far as to suggest that any static
  | analysis implementation that can't meet those requirements
  | would be nigh impossible for us to roll out across the entire
  | GitHub corpus.
 
| miohtama wrote:
| This is Big Code
 
___________________________________________________________________
(page generated 2022-03-30 23:01 UTC)