|
| evmar wrote:
| I contributed to a similar system used within Google (partially
| open source at kythe.io), that took the very different approach
| of integrating with the language-native toolchain for each
| language.
|
| As this article describes, doing this requires per-language
| integrations and also effectively being able to "run the build"
| for any given code (because e.g. the C++ header search path can
| vary on a per-source-file basis), which is untenable for a
| codebase as large and varied as GitHub's. However, if you can
| make it work, you get the benefit of having the compiler's
| understanding of the semantics of the code, which is especially
| finicky in complex languages like C++ or, say, Rust.
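
A minimal sketch of why the per-file header search path matters, assuming a toy repository in which two targets each ship their own `config.h`. The file names, the `INCLUDE_PATHS` table, and `resolve_include` are all hypothetical and only illustrate the point made above; a real toolchain gets this information from the build system.

```python
# Hypothetical per-file include paths, as a build system might report them.
INCLUDE_PATHS = {
    "client/main.cc": ["client/include", "common/include"],
    "server/main.cc": ["server/include", "common/include"],
}

# Hypothetical set of header files that exist in the repository.
HEADERS = {
    "client/include/config.h",
    "server/include/config.h",
    "common/include/log.h",
}


def resolve_include(source_file, header):
    """Return the first existing header along source_file's search path."""
    for directory in INCLUDE_PATHS[source_file]:
        candidate = f"{directory}/{header}"
        if candidate in HEADERS:
            return candidate
    # Nothing on this file's search path provides the header.
    return None
```

The same `#include "config.h"` resolves to a different file depending on which source file contains it, so an indexer without the build configuration cannot know which header is meant.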
|
| For example, if you look at this[1] method call it refers to a
| symbol generated by a chain of macros, but the browser is still
| able to point you at the definition of it.
|
| It's an interesting tradeoff to make: the GitHub approach likely
| doesn't handle corner cases like the above but it makes up for it
| in broad applicability and performance. I recall an IDE developer
| once telling me they made a similar tradeoff in code completion,
| in that it's better DX to pop up completions quickly even if
| they're "only" 99% correct.
|
| (To be clear, I absolutely think the approach taken in the
| article was the right one for the domain they're working in, I
| was just contrasting it against my experience in a similar
| problem where we took a very different approach.)
|
| [1]
| https://source.chromium.org/chromium/chromium/src/+/main:v8/...
| dcreager wrote:
| Note that this article describes our implementation of "search-
| based" or "ctags-like" Code Navigation, which definitely has
| the imprecision that you describe. We've also been working over
| the previous ~year on a framework called Stack Graphs [1,2,3],
| which lets us tackle "precise" Code Navigation while still
| having the zero-config and incremental aspects that are
| described in the paper.
|
| The build-based approach that you describe is also used by the
| Language Server Protocol (LSP) ecosystem. You've summarized the
| tradeoffs quite well! I've described a bit more about why we
| decided against a build-based/LSP approach here [4]. One of the
| biggest deciding factors is that at our scale, incremental
| processing is an absolute necessity, not a nice-to-have.
|
| [1] https://github.blog/2021-12-09-introducing-stack-graphs/
|
| [2] https://dcreager.net/talks/2021-strange-loop/
|
| [3] https://news.ycombinator.com/item?id=29500602
|
| [4] https://news.ycombinator.com/item?id=29501824
| evmar wrote:
| I read about stack graphs before, it sounds interesting!
|
| I think they help, but ultimately I expect you need a
| compiler to solve the absolute madness of the totality of C++.
| For example I think getting argument-dependent lookup right
| in the presence of 'auto' requires type information? And
| there are other categories of things (like header search
| paths) where I think you are forced to involve the build
| system too.
| dmoy wrote:
| Yup, it is probably fair to say that C++ accounts for like
| 50% of the complexity of Kythe at Google. Or certainly it
| feels like it.
|
| And it is also worth noting that Kythe goes a bit deeper
| than what LSP can accomplish. In particular Kythe is built
| around a sort of two-layer graph, where it separates the
| physical code/line representation from a more abstract
| semantic graph. This allows us to accomplish some things
| that are very difficult to do in LSP.
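
A rough sketch of the two-layer idea, assuming a physical layer of text spans and a semantic layer of abstract entities joined by labeled edges. The class names, the edge labels `"defines"` and `"ref"`, and the sample data are illustrative assumptions, not Kythe's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    """Physical layer: a span of text in a file."""
    path: str
    start: int
    end: int


@dataclass(frozen=True)
class SemanticNode:
    """Semantic layer: an abstract entity, independent of file location."""
    kind: str
    name: str


RESIZE = SemanticNode("function", "Resize")

# Edges from the physical layer to the semantic layer (hypothetical data).
EDGES = [
    (Anchor("util.h", 120, 126), "defines", RESIZE),
    (Anchor("main.cc", 300, 306), "ref", RESIZE),
    (Anchor("test.cc", 88, 94), "ref", RESIZE),
]


def node_at(anchor):
    """Hop from a text span to the semantic node it is attached to."""
    for source, _kind, node in EDGES:
        if source == anchor:
            return node
    return None


def anchors_for(node, kind):
    """Hop back from a semantic node to every span with the given edge kind."""
    return [src for src, edge_kind, target in EDGES
            if target == node and edge_kind == kind]


def find_definition(anchor):
    """Go to definition: span -> semantic node -> defining spans."""
    node = node_at(anchor)
    return anchors_for(node, "defines") if node is not None else []
```

Because queries pass through the semantic node, the text spans can be re-indexed or moved without changing the identity of the entity they refer to.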
|
| Finally, Kythe at least internally has a big reliance on a
| unified build system (Blaze, or Bazel). It becomes rapidly
| more difficult to do when you have to hook in N different
| build systems up front, which is why search-based
| references are so appealing. Build integration is hard.
| nerdponx wrote:
| Has Tree Sitter been useful to projects like this? Does it
| have promise to be useful in the future? It seems to be
| gaining a lot of adoption among Neovim users and plugin
| developers, but not really anywhere else. I'm curious if
| that's because of lack of familiarity, or because it's
| technically deficient somehow.
| [deleted]
| beliu wrote:
| Sourcegraph CTO here. It's interesting to read about GitHub's
| approach and how it contrasts with the approach we've taken at
| Sourcegraph. One of the key tradeoffs the article highlights is
| GitHub's decision to take the "shallow-but-wide" approach to code
| navigation, which has enabled them to provide some level of code
| navigation for most open-source repositories on GitHub, but at
| the expense of precision/accuracy (i.e., the system can't
| necessarily differentiate between different symbols with the same
| name).
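
A minimal sketch of the imprecision described above, assuming a ctags-like index keyed only on the bare symbol name. The corpus, `build_index`, and `go_to_definition` are hypothetical and are not GitHub's implementation.

```python
from collections import defaultdict

# Hypothetical toy corpus: (file, line, symbol name) for each definition.
DEFINITIONS = [
    ("shapes/circle.py", 10, "area"),
    ("shapes/square.py", 7, "area"),
    ("billing/invoice.py", 42, "total"),
]


def build_index(definitions):
    """Map each bare symbol name to every location that defines it."""
    index = defaultdict(list)
    for path, line, name in definitions:
        index[name].append((path, line))
    return index


def go_to_definition(index, name):
    """Name-only lookup: returns every candidate, because no scope or
    type information is available to choose between them."""
    return list(index.get(name, []))


index = build_index(DEFINITIONS)
candidates = go_to_definition(index, "area")
```

Here `candidates` holds both `area` definitions; picking the right one for a given call site would need the scope or type information that a precise, compiler-backed index provides.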
|
| Sourcegraph decided early on to take the opposite approach,
| favoring precision and accuracy over supporting every public
| codebase. Part of the reason why is that we aren't a code host
| that hosts millions of open-source repositories, so we didn't
| feel the need to support all of those at once. Another big reason
| is we heard from our users and customers that code navigation
| accuracy was critical for exploring their private code and
| enabling them to stay in flow (inaccurate results would break the
| train of thought because you'd have to actively think about how
| to navigate to the referenced symbol). We actually built out a
| language-agnostic search-based code navigation, but increasingly
| user feedback has driven us to adopt a more precise model, based
| at first on our own protocol (https://srclib.org) and also the
| LSIF protocol open-sourced by Microsoft that now enables code
| navigation for many popular editor extensions.
|
| This is not to say that GitHub's approach is wrong, but more to
| say that it's interesting how different goals and constraints
| have led to systems that are quite different despite tackling the
| same general problem. GitHub aims to provide some level of
| navigation to every repository on GitHub, and Sourcegraph aims
| to provide best-in-class navigation for private codebases and
| dependencies.
|
| (Btw, hats off to the GitHub team for open-sourcing tree-sitter,
| a great library which we've incorporated into parts of our stack.
| We actually hosted the creator of tree-sitter, Max Brunsfeld, on
| our podcast a while back and it was a really fun and insightful
| conversation if people are interested in hearing some of the
| backstory of tree-sitter:
| https://about.sourcegraph.com/podcast/max-brunsfeld.)
| mistrial9 wrote:
| figure 2 is repeated for some reason?
| spatulon wrote:
| That's odd. The error was not present in the original
| publication of this article in ACM's Queue magazine:
| https://queue.acm.org/detail.cfm?id=3487022
| marceloabsousa wrote:
| This article is more about parsing at scale than static analysis at
| scale.
| dcreager wrote:
| Parsing is definitely a big part of it, and it's a fair point
| that for search-based Code Navigation, we don't have to do any
| real heavy lifting on the analysis side. That said, I think the
| article describes our non-functional requirements well (zero-
| config, incremental, language-agnostic). It's those non-
| functional requirements which are most important for the "at
| scale" part. I'd go so far as to suggest that any static
| analysis implementation that can't meet those requirements
| would be nigh impossible for us to roll out across the entire
| GitHub corpus.
| miohtama wrote:
| This is Big Code
___________________________________________________________________
(page generated 2022-03-30 23:01 UTC) |