[HN Gopher] Scaling monorepo maintenance
___________________________________________________________________
 
Scaling monorepo maintenance
 
Author : pimterry
Score  : 250 points
Date   : 2021-04-30 09:52 UTC (1 day ago)
 
web link (github.blog)
w3m dump (github.blog)
 
| whymauri wrote:
| I love when people use Git in ways I haven't thought about
| before. Reminds me of the first time I played around with
| 'blobs.'
 
| debarshri wrote:
| It is a great write-up. I wonder how GitLab solves this problem.
 
  | lbotos wrote:
  | GL packs refs at various times and frequencies depending on
  | usage:
  | https://docs.gitlab.com/ee/administration/housekeeping.html
  | 
  | It works well for most repos but as you start to get out to the
  | edges of lots of commits it can cause slowness. GL admins can
  | repack reasonably safely at various times to get access
  | speedups, but the solution that is presented in the blog would
  | def speed packing up.
  | 
  | (I work as a Support Engineering Leader at GL but I'm reading
  | HN for fun <3)
 
  | masklinn wrote:
  | They might have yet to encounter it. GitHub is hosting some
  | really big repos.
 
    | lbotos wrote:
    | Oh, we def have. I've seen some large repos (50+GB) in some
    | GL installations.
 
| the_duke wrote:
| Very well written post and upstream work is always appreciated.
| 
| I also really like monorepos, but Git and GitHub really don't
| work well at all for them.
| 
| On the Git side there is no way to clone only parts of a repo or
| to limit access by user. All the Git tooling out there, from the
| CLI to the various IDE integrations, is ill-adjusted to a huge
| repo with lots of unrelated commits.
| 
| On the Github side there is no separation between the different
| parts of a monorepo in the UI (issues, prs, CI), the workflows,
| or the permission system. Sure, you can hack something together
| with labels and custom bots, but it always feels like a hack.
| 
| Using Git(hub) for monorepos is really painful in my experience.
| 
| There is a reason why Google, Facebook et al. have heaps of
| custom tooling.
 
  | krasin wrote:
  | I really like monorepos. But I find that it's almost never a
  | good idea to hide parts of a source code from developers. And
  | if there's some secret sauce that's so sensitive that only a
  | single-digit number of developers in the whole company can
  | access it, then it's probably okay to have a separate
  | repository just for it.
  | 
  | Working in environments where different people have partial
  | access to different parts of the code never felt productive to
  | me -- often, figuring out who can take on a task and how to
  | grant all the access might take longer than the task itself.
 
  | jayd16 wrote:
  | I wouldn't call it painful exactly but I'll be happy when
  | shallow and sparse cloning become rock solid and boring.
 
  | Beowolve wrote:
  | On this note, GitHub does reach out to customers with monorepos
  | and is aware of their shortcomings. I think over time we will
  | see them change to have better support. It's only a matter of
  | time.
 
  | jeffbee wrote:
  | It's funny that you mention this as if monorepos of course
  | require custom tooling. Google started with off-the-shelf
  | Perforce and that was fine for many years, long after their
  | repo became truly huge. Only when it became _monstrously_ huge
  | did they need custom tools and even then they basically just
  | re-implemented Perforce instead of adopting git concepts. You,
  | too, can just use Perforce. It's even free for up to five
  | users. You won't outgrow its limits until you get about a
  | million engineer-years under your belt.
  | 
  | The reason git doesn't have partial repo cloning is that it
  | was written by people with no regard for the past experience
  | of software development organizations. It is suited to the
  | radically decentralized group of Linux maintainers. It is
  | likely that your organization much more closely resembles
  | Google or Facebook than Linux. Perforce has had partial
  | checkout since ~always, because that's a pretty obvious
  | requirement when you stop and think about what software
  | development _companies_ do all day.
 
    | forrestthewoods wrote:
    | It's somewhat mind boggling that no one has made a better
    | Perforce. It has numerous issues and warts. But it's much
    | closer to what the majority of projects need than Git imho.
    | And for bonus points I can teach an artist/designer how to
    | safely and correctly use Perforce in about 10 minutes.
    | 
    | I've been using Git/Hg for years and I still run into the
    | occasional Gitastrophe where I have to Google how to unbreak
    | myself.
 
    | Chyzwar wrote:
    | Git recently added sparse checkout, and there is also the
    | Virtual File System for Git from Microsoft.
    | 
    | In my experience git/VCS is not the issue for a monorepo.
    | Build, test, automation, deployments, CI/CD are way harder.
    | You will end up with a bunch of shell scripts, Makefiles,
    | Grunt, and a combination of ugly hacks. If you are smart you
    | will adopt something like Bazel and have a dedicated tooling
    | team. If you see everything as a nail, you will split the
    | monorepo into an unmaintainable mess of small repos that
    | slowly rot away.
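    | 
    | For reference, a partial clone plus sparse checkout looks
    | roughly like this with a recent git (the repo URL and paths
    | below are made up):
    | 
    |     git clone --filter=blob:none --no-checkout https://example.com/big-monorepo.git
    |     cd big-monorepo
    |     git sparse-checkout init --cone
    |     git sparse-checkout set services/api libs/shared
    |     git checkout main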
 
  | throwaway894345 wrote:
  | I've always found that the biggest issue with monorepos is the
  | build tooling. I can't get my head around Bazel and other Blaze
  | derivatives enough to extend them to support any interesting
  | case, and Nix has too many usability issues to use productively
  | (and I've been in an organization that gave it an earnest
  | shot).
 
    | krasin wrote:
    | Can you please give an example of such an interesting case? I
    | am genuinely curious.
    | 
    | And I agree with the general point that monorepos require
    | great build tooling to match.
 
    | csnweb wrote:
    | With GitHub Actions you can quite easily specify a workflow
    | for parts of your repo (simple file path filter:
    | https://docs.github.com/en/actions/reference/workflow-syntax...).
    | So you can basically just write one workflow for each project
    | in the monorepo and have only those run where changes
    | occurred.
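    | 
    | A minimal sketch of such a workflow (the project path is made
    | up):
    | 
    |     # .github/workflows/api.yml
    |     name: api
    |     on:
    |       push:
    |         paths:
    |           - 'services/api/**'
    |     jobs:
    |       test:
    |         runs-on: ubuntu-latest
    |         steps:
    |           - uses: actions/checkout@v2
    |           - run: make -C services/api test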
 
| infogulch wrote:
| This was great! My summary:
| 
| A git packfile is an aggregated and indexed collection of
| historical git objects which reduces the time it takes to serve
| requests to those objects, implemented as two files: .pack and
| .idx. GitHub was having issues maintaining packfiles for very
| large repos in particular because regular repacking always has to
| repack the entire history into a single new packfile every time
| -- which is an expensive quadratic algorithm. GitHub's
| engineering team ameliorated this problem in two steps: 1. Enable
| repos to be served from multiple packfiles at once, 2. Design a
| packfile maintenance strategy that uses multiple packfiles
| sustainably.
| 
| Multi-pack indexes are a new git feature, but the initial
| implementation was missing performance-critical reachability
| bitmaps for multi-pack indexes. In general, index files store
| object names in lexicographic order and point to the named
| object's position in the associated packfile. As a first step to
| implement reachability bitmaps for multi-pack indexes, they
| introduced a reverse index file (.rev) which maps packfile object
| positions back to index file name offsets. This alone brought a
| big performance improvement, and it filled in the missing piece
| needed to implement multi-pack bitmaps.
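| 
| To make the two mappings concrete, a toy sketch in Python with
| made-up two-character object names:
| 
|     # .idx: names sorted lexicographically -> position in the .pack
|     idx = {"1a": 2, "7f": 0, "c3": 1}
| 
|     # .rev: pack position -> offset of that name in the sorted
|     # index, i.e. the inverse permutation
|     names = sorted(idx)  # ["1a", "7f", "c3"]
|     rev = {pos: names.index(name) for name, pos in idx.items()}
|     # rev == {2: 0, 0: 1, 1: 2}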
| 
| With the issues of serving repos from multiple packs solved, they
| needed to utilize multiple packs efficiently to reduce maintenance
| overhead. They chose to maintain historical packfiles at
| geometrically increasing sizes. That is, during the maintenance
| job, consider the N most recent packfiles: if the summed size of
| packfiles [1, N] is less than the size of packfile N+1, repack
| [1, N] into a single packfile and stop; if their summed size is
| greater than the size of packfile N+1, iterate and compare
| packfiles [1, N+1] against packfile N+2, and so on. This results
| in a set of packfiles where each file is roughly double the size
| of the previous when ordered by age, which has a number of
| beneficial properties for both serving and the average-case
| maintenance run. Funny enough, this selection procedure struck me
| as similar to the game "2048".
 
| underdeserver wrote:
| 30 minute read + the Git object model = mind boggled.
| 
| I'd have appreciated a series of articles instead of one, for me
| it's way too much info to take in in one sitting.
 
  | iudqnolq wrote:
  | I'm currently working through the book Building Git. Best $30
  | I've spent in a while. It's about 700 pages, but 200 pages in
  | and I can stage files to/from the index, make commits, and see
  | the current status (although not on a repo with packfiles).
  | 
  | I'm thinking about writing a blog post where I write a git
  | commit with hexdump, zlib, and vim.
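  | 
  | (For anyone curious, the loose-object format is small enough to
  | sketch in a few lines of Python -- header, SHA-1, then zlib:
  | 
  |     import hashlib, zlib
  | 
  |     data = b"hello\n"
  |     obj = b"blob %d\x00" % len(data) + data  # "<type> <size>\0<content>"
  |     sha = hashlib.sha1(obj).hexdigest()      # the object's name
  |     # git would store zlib.compress(obj) at
  |     # .git/objects/<sha[:2]>/<sha[2:]>
  | 
  | Commits and trees work the same way, just with different
  | payloads.)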
 
  | georgyo wrote:
  | It was a lot to digest, but it was also all one continuous
  | thought.
  | 
  | If it was broken up, I don't think it would have been nearly as
  | good. And I don't think I would have been able to keep all the
  | context to understand smaller chunks.
  | 
  | I really enjoyed the whole thing.
 
| swiley wrote:
| Mono-repos are like having a flat directory structure.
| 
| Sure, it's simple, but it makes it hard to find anything if you
| have a lot of stuff/people. Submodules and package managers exist
| for a reason.
 
  | no_wizard wrote:
  | Note: for the sake of discussion I'm assuming when we say
  | monorepo we mean _monorepo and associated tools used to manage
  | them_
  | 
  | The trade-off is simplified management of dependencies. With a
  | monorepo, I can control every version of a given dependency so
  | they're uniform across packages. If I update one package it is
  | always going to be linked to the other at its latest version. I
  | can simplify releases and managing my infrastructure in the
  | long term, though there is a trade-off in initial complexity
  | for certain things if you want to do something like, say, only
  | running tests in CI for packages that have changed (useful in
  | some cases).
  | 
  | It's all trade-offs, but the quality of code has been higher on
  | average for our org in a monorepo.
 
    | mr_tristan wrote:
    | I've found that many developers do not pay attention to
    | dependency management, so this approach of "it's either in
    | the repo or it doesn't exist" is actually a nice guard rail.
    | 
    | I'm reading between the lines here, but I'm assuming you've
    | set up your tooling to enforce this. As in: the various
    | projects in the repo don't just optionally decide to have
    | external references, e.g., Maven Central, npm, etc.
    | 
    | This puts quite a lot of "stuff" in the repo, but
    | improvements like the ones this article mentioned make
    | monorepos in git much easier to use.
    | 
    | I'd have to think you could generate a lot of automation and
    | reports triggered off commits pretty easily, too. I'd say
    | that would make the monorepo even easier to observe, with a
    | modicum of the tooling required to maintain independent
    | repositories.
 
      | no_wizard wrote:
      | That is accurate, I wouldn't use a monorepo without
      | tooling, and in the JavaScript / TypeScript ecosystem, you
      | really can't do much without tooling (though npm supports
      | workspaces now, it doesn't support much else yet, like
      | plugins or hooks etc).
      | 
      | In the past I have tried to achieve the same goals without
      | one, particularly around the dependency graph and not
      | duplicating functionality found in shared libraries (a
      | concern that goes hand in hand with another of mine,
      | documentation enforcement). It just wasn't possible in a
      | way I could automate with a high degree of accuracy and
      | confidence without even more complexity, like using some
      | kind of CI integration to pull dependency files across
      | packages and compare them. In a monorepo I have a single
      | tool that does this for _all_ dependencies whenever any
      | package.json file or the lock file is updated.
      | 
      | If you care at all about your dependency graph (and in my
      | not-so-humble opinion every developer should have some
      | high-level awareness here in their given domain), I haven't
      | found a better solution that is less complex than
      | leveraging a monorepo.
 
  | Denvercoder9 wrote:
  | _> Sure it's simple but it makes it hard to find anything if
  | you have a lot of stuff/people._
  | 
  | I think this is a bad analogy. Looking up a file or directory
  | in a monorepo isn't harder than looking up a repository. In
  | fact, I'd argue it's easier, as we've developed decades of
  | tooling for searching through filesystems, while for searching
  | through remotely hosted repositories you're dependent on the
  | search function of the repository host, which is often worse.
 
| cryptica wrote:
| To scale a monorepo, you need to split it up into multiple repos;
| that way each repo can be maintained independently by a separate
| team...
| 
| We can call it a multi-monorepo; that way our brainwashed
| managers will agree to it.
 
  | Orphis wrote:
  | And that way, you can't have atomic updates across the
  | repositories and need to synchronize them all the time, great.
 
    | iudqnolq wrote:
    | What do atomic source updates get you if you don't have
    | atomic deploys? I'm just a student but my impression is that
    | literally no one serious has atomic deploys, not even Google,
    | because the only way to do it is scheduled downtime.
    | 
    | If you need to handle different versions talking to each
    | other in production it doesn't seem any harder to also deal
    | with different versions in source, and I'd worry atomic
    | updates to source would give a false sense of security in
    | deployment.
 
      | status_quo69 wrote:
      | > If you need to handle different versions talking to each
      | other in production it doesn't seem any harder to also deal
      | with different versions in source
      | 
      | It's much more annoying to deal with multi-repo setups and
      | it can be a real productivity killer. Additionally, if you
      | have a shared dependency, now you have to juggle managing
      | that shared dep. For example, repo A needs shared lib
      | Foo@1.2.0 and repo B needs Foo@1.3.4, because developers on
      | team A didn't update their dependencies often enough to
      | keep up with version bumps from the Foo team. Now there's a
      | really weird situation going on in your company where not
      | all teams are on the same page. A naive monorepo forces
      | that shared dep change to be applied across the board at
      | once.
      | 
      | Edit: In regards to your "old code talking to new version"
      | problem, that's a culture problem IMO. At work we must
      | always consider the fact that a deployment rollout takes
      | time, so our changes in sensitive areas (controllers, jobs,
      | etc) should be as backwards compatible as possible for that
      | one deploy barring a rollback of some kind. We have linting
      | rules and a very stupid bot that posts a message reminding
      | us of that fact if we're trying to change something
      | sensitive to version changes, but the main thing that keeps
      | it all sane is we have it all collectively drilled in our
      | heads from the first time that we deploy to production that
      | we support N number of versions backwards. Since we're in a
      | monorepo, the PR to rip out the backwards-compat check
      | usually lands immediately after a deployment is verified
      | as good. In a multi-repo setup, ripping that
      | compat check out would require _another_ version bump and N
      | number of PRs to make sure that everyone is on the same
      | page. It really sucks.
 
    | slver wrote:
    | We have repository systems built for centralized atomic
    | updates, and giant monorepos, like SVN. Question is why are
    | we trying to have Git do this, which was explicitly designed
    | with the exact opposite goal? Is this an attempt to do SVN in
    | Git so we get to keep the benefits of the former, and the
    | cool buzzword-factor of the latter? I don't know.
    | 
    | Also when I try to think about reasons to have atomic cross-
    | project changes, my mind keeps drawing negative examples,
    | such as another team changing the code on your project, is
    | that a good practice? Not really. Well unless all projects
    | are owned by the same team, it'll happen in a monorepo.
    | 
    | Atomic updates not scaling beyond certain technical level is
    | often a good thing, because they also don't scale on human
    | and organizational level.
 
      | alexhutcheson wrote:
      | 1. You determine that a library used by a sizable fraction
      | of the code in your entire org has a problem that's
      | critical to fix (maybe a security issue, or maybe the
      | change could just save millions of dollars in compute
      | resources, etc.), but the fix requires updating the use of
      | that library in ~30 call sites spread across the codebases
      | of ~10 different teams.
      | 
      | 2. You create a PR that fixes the code and the problematic
      | call sites in a single commit. It gets merged and you're
      | done.
      | 
      | In the multi-repo world, you need to instead:
      | 
      | 1. Add conditional branching in your library so that it
      | supports both the old behavior and new behavior. This could
      | be an experiment flag, a new method DoSomethingV2, a new
      | constructor arg, etc. (sketched at the end of this
      | comment). Depending on how you do this, you might
      | dramatically increase the number of call sites that need to
      | be modified.
      | 
      | 2. Either wait for all the problematic clients to update to
      | the new version of your library, or create PRs to manually
      | bump their version. Whoops - turns out a couple of them
      | were on a very old version, and the upgrade is non-trivial.
      | Now that's your problem to resolve before you proceed.
      | 
      | 3. Create PRs to modify the calling code in every repo that
      | includes problematic calls, and follow up with 10 different
      | reviewers to get them merged.
      | 
      | 4. If you still have the stamina, go through steps 1-3
      | again to clean up the conditional logic you added to your
      | library in step 1.
      | 
      | Basically, if code calls libraries that exist in different
      | repos, then making backwards-incompatible changes to those
      | libraries becomes extremely expensive. This is bad, because
      | sometimes backwards-incompatible changes would have very
      | high value.
      | 
      | If the numbers from my example were higher (e.g. 1000 call
      | sites across 100 teams), then the library maintainer in a
      | monorepo would probably still want to use a feature flag or
      | similar to avoid trying to merge a commit that affects 1000
      | files in one go. However, the library maintainer's job is
      | still dramatically easier, because they don't have to deal
      | with 100 individual repos, and they don't need to do
      | anything to ensure that everyone is using the latest
      | version of their library.
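      | 
      | The conditional branching in step 1 is typically a shim
      | along these lines (hypothetical names, sketched in Python):
      | 
      |     def old_behavior(x): ...    # the problematic code path
      |     def fixed_behavior(x): ...  # the fix
      | 
      |     def do_something(x):
      |         # Old entry point, kept so unmigrated callers still
      |         # work; delegates with legacy semantics.
      |         return do_something_v2(x, legacy=True)
      | 
      |     def do_something_v2(x, legacy=False):
      |         return old_behavior(x) if legacy else fixed_behavior(x)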
 
        | slver wrote:
        | Your monorepo scenario makes the following unlikely
        | assumptions:
        | 
        | 1. A critical security/performance fix has no other
        | recourse than breaking the interface compatibility of a
        | library. Far more common scenario is this can be fixed in
        | the implementation without BC breaks (otherwise systems
        | like semver wouldn't make sense).
        | 
        | 2. The person maintaining the library knows the codebases
        | of 10 teams better than those 10 teams do, so that
        | person can patch their projects better and faster than
        | the actual teams.
        | 
        | As a library maintainer, you know the interface of your
        | library. But that's merely the "how" on the other end of
        | those 30 call sites. You don't know the "why". You can
        | easily break their projects even though your code
        | compiles just fine. So that'd be a reckless approach.
        | 
        | Also your multi-repo scenario is artificially contrived.
        | No, you don't need conditional branching and all this
        | nonsense.
        | 
        | In the common scenario, you just push a patch that
        | maintains BC and tell the teams to update and that's it.
        | 
        | And if you do have BC breaks, then:
        | 
        | 1. Push a major version with the BC breaks and the fix.
        | 
        | 2. Push a patch version deprecating that release and
        | telling developers to update.
        | 
        | That's it. You don't need all this nonsense you listed.
 
        | hamandcheese wrote:
        | I've lived both lives. It absolutely is an ordeal making
        | changes across repos. The model you are highlighting
        | opens up substantial risk that folks don't update in a
        | timely manner. What you are describing is basically just
        | throwing code over the wall and hoping for the best.
 
        | howinteresting wrote:
        | Semver is a second-rate coping mechanism for when better
        | coordination systems don't exist.
 
        | slver wrote:
        | Patching the code of 10 projects you don't maintain isn't
        | an example of a "coordination system". It's an example of
        | avoiding having one.
        | 
        | In multithreading this would be basically mutable shared
        | state with no coordination. Every thread sees everything,
        | and is free to mutate any of it at any point. Which as we
        | all know is a best practice in multithreading /s
 
        | howinteresting wrote:
        | The same code can have multiple overlapping sets of
        | maintainers. For example, one team can be responsible for
        | business logic while another team can manage core
        | abstractions shared by many product teams. Yet another
        | team may be responsible for upgrading to newer toolchains
        | and language features. They'll all want to touch the same
        | code but make different, roughly orthogonal changes to
        | it.
        | 
        | Semver provides just a few bits of information, not
        | nearly enough to cover the whole gamut of shared and
        | distributed responsibility.
        | 
        | The comparison with multithreading is not really valid,
        | since monorepos typically linearize history.
 
        | slver wrote:
        | Semver was enough for me to very simply resolve a
        | scenario above that was presented as some kind of
        | insurmountable nightmare. So I think Semver is just fine.
        | It's an example of a simple, well-designed abstraction.
        | Having "more bits" is not a virtue here.
        | 
        | I could have some comments on your "overlapping
        | responsibilities" as well, but your description is too
        | abstract and vague to address, so I'll pass on that. But
        | you literally described the concept of a library at one
        | point. There's nothing overlapping about it.
 
        | iudqnolq wrote:
        | > You create a PR that fixes the code and the problematic
        | call sites in a single commit. It gets merged and you're
        | done.
        | 
        | What happens when you roll this out and partway through
        | the rollout an old version talks to a new version? I
        | thought you still needed backwards compat? I'm a student
        | and I've never worked on a project with no-downtime
        | deploys, so I'm interested in how this can be possible.
 
      | howinteresting wrote:
      | Of course I want people who care about modernizing code to
      | come in and modernize my code (such as upgrades to newer
      | language versions). Why should the burden be distributed
      | when it can be concentrated among experts?
      | 
      | I leverage type systems and write tests to catch any
      | mistakes they might make.
 
    | swiley wrote:
    | Yes you can; it happens when you bump the submodule
    | reference. This is how reasonable people use git.
 
      | Denvercoder9 wrote:
      | Submodules often provide a terrible user experience
      | _because_ they are locked to a single version. To propagate
      | a single commit, you need to update every single dependent
      | repository. In some contexts that can be helpful, but in my
      | experience it's mostly an enormous hassle.
      | 
      | Also it's awful that a simple git pull doesn't actually
      | pull updated submodules; you need to run git submodule
      | update (or sync or whatever it is) as well.
      | 
      | I don't want to work with git submodules ever again. The
      | idea is nice, but the user experience is really terrible.
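      | 
      | For the record, the propagation dance in each dependent
      | repo looks something like this (paths and versions made
      | up):
      | 
      |     cd libs/shared
      |     git fetch && git checkout v1.3.4
      |     cd ../..
      |     git add libs/shared
      |     git commit -m "Bump shared submodule to v1.3.4"
      | 
      | and on everyone else's machine after a plain pull:
      | 
      |     git submodule update --init --recursive
      |     # or pull with: git pull --recurse-submodules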
 
        | fpoling wrote:
        | Looking back I just do not understand why git came up
        | with this awkward mess of submodules. Instead it should
        | have a way to say that a particular directory is self-
        | contained, and any commit affecting it should be two
        | objects. The first is the commit object for the directory
        | using only relative paths. The second is the commit for
        | the rest of the code with a reference to it. Then one
        | could just pull any repository into the main repository
        | without rewriting it and use it normally.
        | 
        | git subtree tries to emulate that, but it does not scale
        | to huge repositories as it needs to change all commits in
        | the subtree to use new nested paths.
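        | 
        | (For comparison, the subtree emulation is roughly:
        | 
        |     git subtree add --prefix=vendor/lib https://example.com/lib.git main
        | 
        | with the path rewriting described above happening in the
        | matching split/pull machinery -- the part that doesn't
        | scale.)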
 
        | mdaniel wrote:
        | And woe unto junior developers who change into the
        | submodule directory and do a git commit, which is made
        | infinitely worse if it's followed by a git push, because
        | now there's a sha hanging out in the repo that works on
        | one machine but that no one else's submodule update will
        | see without surgery.
        | 
        | I'm not at my computer to see if modern git prohibits
        | that behavior, but it is indicative of the "watch out"
        | that comes with advanced git usage: it is a very sharp
        | knife.
 
      | dylan-m wrote:
      | Or define your interfaces properly, version them, and
      | publish libraries (precompiled, ideally) somewhere outside
      | of your source repo. Your associated projects depend on
      | those rather than random chunks of code that happen to be
      | in the same file structure. It's more work, but it
      | encourages better organization in general and saves an
      | incredible amount of time later on for any complex project.
 
        | throwaway894345 wrote:
        | I don't like this because it assumes that all of those
        | repositories are accessible all of the time to everyone
        | who might want to build something. If one repo for some
        | core artifact becomes unreachable, everyone is dead in
        | the water.
        | 
        | Ideally "cached on the network" could be a sort of
        | optional side effect, like with Nix, but you can still
        | reproducibly build from source. That said, I can't
        | recommend Nix, not for philosophical reasons but because
        | of lots of implementation details.
 
    | cryptica wrote:
    | If the project has good separation of concerns, you don't
    | need atomic updates. Good separation of concerns yields many
    | benefits beyond ease of project management. It requires a bit
    | more thought, but if done correctly, it's worth many times
    | the effort.
    | 
    | Good separation of concerns is like earning compound interest
    | on your code.
    | 
    | Just keep the dependencies generic and tailor the higher
    | level logic to the business domain. Then you rarely need to
    | update the dependencies.
    | 
    | I've been doing this on commercial projects (to much success)
    | for decades, since before most of the down-voters on here
    | even wrote their first hello world programs.
 
    | [deleted]
 
| WayToDoor wrote:
| The article is really impressive. It's nice to see GitHub
| contribute changes back to the git project, and to know that the
| two work closely together.
 
  | slver wrote:
  | It's in their mutual interest. Imagine what happens to GIThub
  | if GIT goes out of fashion.
 
    | infogulch wrote:
    | Yes, isn't it nice when interests of multiple parties are
    | aligned such that they help each other make progress towards
    | their shared goals?
 
      | slver wrote:
      | Well, it's nice to see they're rational, indeed.
 
        | jackbravo wrote:
        | Other rational companies could try to fix this without
        | contributing upstream. Doing it upstream benefits
        | competitors like gitlab. So yeah! It's nice seeing this
        | kind of behavior
 
        | slver wrote:
        | First, they not only contributed upstream; upstream
        | developers contributed to this patch. I.e., they got help
        | from outside GitHub to make this patch possible.
        | 
        | Second, if they had decided to fork Git, then they'd have
        | to maintain this fork forever.
        | 
        | Third, this fork could over time become visibly, or even
        | worse subtly, incompatible with stock Git, which is still
        | the Git running on GitHub users' machines, and the two
        | have to interact with each other in a 100% compatible
        | manner.
        | 
        | So, in this case, not contributing upstream was literally
        | a no-go. The only rational choice was to not fork Git.
 
___________________________________________________________________
(page generated 2021-05-01 23:00 UTC)