[HN Gopher] Scaling monorepo maintenance
___________________________________________________________________
 
Scaling monorepo maintenance
 
Author : pimterry
Score  : 250 points
Date   : 2021-04-30 09:52 UTC (1 day ago)
 
web link (github.blog)
w3m dump (github.blog)
 
| whymauri wrote:
| I love when people use Git in ways I haven't thought about
| before. Reminds me of the first time I played around with
| 'blobs.'
 
| debarshri wrote:
| It is a great write-up. I wonder how GitLab solves this problem.
 
  | lbotos wrote:
  | GL packs refs at various times and frequencies depending on
  | usage:
  | https://docs.gitlab.com/ee/administration/housekeeping.html
  | 
  | It works well for most repos but as you start to get out to the
  | edges of lots of commits it can cause slowness. GL admins can
  | repack reasonably safely at various times to get access
  | speedups, but the solution that is presented in the blog would
  | def speed packing up.
  | 
  | (I work as a Support Engineering Leader at GL but I'm reading
  | HN for fun <3)
 
  | masklinn wrote:
  | They might have yet to encounter it. GitHub is hosting some
  | really big repos.
 
    | lbotos wrote:
    | Oh, we def have. I've seen some large repos (50+GB) in some
    | GL installations.
 
| the_duke wrote:
| Very well written post and upstream work is always appreciated.
| 
| I also really like monorepos, but Git and GitHub really don't
| work well at all for them.
| 
| On the Git side there is no way to clone only parts of a repo or
| to limit access by user. All the Git tooling out there, from the
| CLI to the various IDE integrations, is ill-adjusted to a huge
| repo with lots of unrelated commits.
| 
| On the Github side there is no separation between the different
| parts of a monorepo in the UI (issues, prs, CI), the workflows,
| or the permission system. Sure, you can hack something together
| with labels and custom bots, but it always feels like a hack.
| 
| Using Git(hub) for monorepos is really painful in my experience.
| 
| There is a reason why Google, Facebook et al. have heaps of
| custom tooling.
 
  | krasin wrote:
  | I really like monorepos. But I find that it's almost never a
  | good idea to hide parts of a source code from developers. And
  | if there's some secret sauce that's so sensitive that only a
  | single-digit number of developers in the whole company can
  | access it, then it's probably okay to have a separate
  | repository just for it.
  | 
  | Working in environments where different people have partial
  | access to different parts of the code never felt productive to
  | me -- often, figuring out who can take on a task and how to
  | grant all the access might take longer than the task itself.
 
  | jayd16 wrote:
  | I wouldn't call it painful exactly but I'll be happy when
  | shallow and sparse cloning become rock solid and boring.
 
  | Beowolve wrote:
  | On this note, GitHub does reach out to customers with monorepos
  | and is aware of their shortcomings. I think over time we will
  | see them change to have better support. It's only a matter of
  | time.
 
  | jeffbee wrote:
  | It's funny that you mention this as if monorepos of course
  | require custom tooling. Google started with off-the-shelf
  | Perforce and that was fine for many years, long after their
  | repo became truly huge. Only when it became _monstrously_ huge
  | did they need custom tools and even then they basically just
  | re-implemented Perforce instead of adopting git concepts. You,
  | too, can just use Perforce. It's even free for up to five
  | users. You won't outgrow its limits until you get about a
  | million engineer-years under your belt.
  | 
  | The reason git doesn't have partial repo cloning is that it
  | was written by people with no regard for the past experience
  | of software development organizations. It is suited to the
  | radically decentralized group of Linux maintainers. It is
  | likely that your organization much more closely resembles
  | Google or Facebook than Linux. Perforce has had partial
  | checkout since ~always, because that's a pretty obvious
  | requirement when you stop and think about what software
  | development _companies_ do all day.
 
    | forrestthewoods wrote:
    | It's somewhat mind boggling that no one has made a better
    | Perforce. It has numerous issues and warts. But it's much
    | closer to what the majority of projects need than Git imho.
    | And for bonus points I can teach an artist/designer how to
    | safely and correctly use Perforce in about 10 minutes.
    | 
    | I've been using Git/Hg for years and I still run into the
    | occasional Gitastrophe where I have to Google how to unbreak
    | myself.
 
    | Chyzwar wrote:
    | Git recently added sparse checkout, and there is also the
    | Virtual File System for Git from Microsoft.
    | 
    | In my experience git/VCS is not the issue for a monorepo.
    | Build, test, automation, deployments, CI/CD are way harder.
    | You will end up with a bunch of shell scripts, Makefiles,
    | Grunt, and a combination of ugly hacks. If you are smart you
    | will adopt something like Bazel and have a dedicated tooling
    | team. If you see everything as a nail, you will split the
    | monorepo into an unmaintainable mess of small repos that
    | slowly rot away.
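    | 
    | For reference, a partial clone plus sparse checkout looks
    | roughly like this with a recent git (the repo URL and paths
    | below are made up):
    | 
    |     git clone --filter=blob:none --no-checkout https://example.com/big-monorepo.git
    |     cd big-monorepo
    |     git sparse-checkout init --cone
    |     git sparse-checkout set services/api libs/shared
    |     git checkout main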
 
  | throwaway894345 wrote:
  | I've always found that the biggest issue with monorepos is the
  | build tooling. I can't get my head around Bazel and other Blaze
  | derivatives enough to extend them to support any interesting
  | case, and Nix has too many usability issues to use productively
  | (and I've been in an organization that gave it an earnest
  | shot).
 
    | krasin wrote:
    | Can you please give an example of such an interesting case? I
    | am genuinely curious.
    | 
    | And I agree with the general point that monorepos require
    | great build tooling to match.
 
    | csnweb wrote:
    | With GitHub Actions you can quite easily specify a workflow
    | for parts of your repo (simple file path filter:
    | https://docs.github.com/en/actions/reference/workflow-syntax...).
    | So you can basically just write one workflow for each project
    | in the monorepo and have only those run where changes
    | occurred.
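    | 
    | A minimal sketch of such a workflow (the project path is made
    | up):
    | 
    |     # .github/workflows/api.yml
    |     name: api
    |     on:
    |       push:
    |         paths:
    |           - 'services/api/**'
    |     jobs:
    |       test:
    |         runs-on: ubuntu-latest
    |         steps:
    |           - uses: actions/checkout@v2
    |           - run: make -C services/api test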
 
| infogulch wrote:
| This was great! My summary:
| 
| A git packfile is an aggregated and indexed collection of
| historical git objects which reduces the time it takes to serve
| requests to those objects, implemented as two files: .pack and
| .idx. GitHub was having issues maintaining packfiles for very
| large repos in particular because regular repacking always has to
| repack the entire history into a single new packfile every time
| -- which is an expensive quadratic algorithm. GitHub's
| engineering team ameliorated this problem in two steps: 1. Enable
| repos to be served from multiple packfiles at once, 2. Design a
| packfile maintenance strategy that uses multiple packfiles
| sustainably.
| 
| Multi-pack indexes are a new git feature, but the initial
| implementation was missing performance-critical reachability
| bitmaps for multi-pack indexes. In general, index files store
| object names in lexicographic order and point to the named
| object's position in the associated packfile. As a first step to
| implement reachability bitmaps for multi-pack indexes, they
| introduced a reverse index file (.rev) which maps packfile object
| positions back to index file name offsets. This alone brought a
| big performance improvement, and it filled in the missing piece
| needed to implement multi-pack bitmaps.
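| 
| To make the two mappings concrete, a toy sketch in Python with
| made-up two-character object names:
| 
|     # .idx: names sorted lexicographically -> position in the .pack
|     idx = {"1a": 2, "7f": 0, "c3": 1}
| 
|     # .rev: pack position -> offset of that name in the sorted
|     # index, i.e. the inverse permutation
|     names = sorted(idx)  # ["1a", "7f", "c3"]
|     rev = {pos: names.index(name) for name, pos in idx.items()}
|     # rev == {2: 0, 0: 1, 1: 2}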
| 
| With the issues of serving repos from multiple packs solved, they
| needed to utilize multiple packs efficiently to reduce maintenance
| overhead. They chose to maintain historical packfiles at
| geometrically increasing sizes. That is, during the maintenance
| job, consider the N most recent packfiles: if the summed size of
| packfiles [1, N] is less than the size of packfile N+1, repack
| [1, N] into a single packfile and stop; if their summed size is
| greater than the size of packfile N+1, iterate and compare
| packfiles [1, N+1] against packfile N+2, and so on. This results
| in a set of packfiles where each file is roughly double the size
| of the previous when ordered by age, which has a number of
| beneficial properties for both serving and the average-case
| maintenance run. Funny enough, this selection procedure struck me
| as similar to the game "2048".
 
| underdeserver wrote:
| 30 minute read + the Git object model = mind boggled.
| 
| I'd have appreciated a series of articles instead of one, for me
| it's way too much info to take in in one sitting.
 
  | iudqnolq wrote:
  | I'm currently working through the book Building Git. Best $30
  | I've spent in a while. It's about 700 pages, but 200 pages in
  | and I can stage files to/from the index, make commits, and see
  | the current status (although not on a repo with packfiles).
  | 
  | I'm thinking about writing a blog post where I write a git
  | commit with hexdump, zlib, and vim.
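  | 
  | (For anyone curious, the loose-object format is small enough to
  | sketch in a few lines of Python -- header, SHA-1, then zlib:
  | 
  |     import hashlib, zlib
  | 
  |     data = b"hello\n"
  |     obj = b"blob %d\x00" % len(data) + data  # "<type> <size>\0<content>"
  |     sha = hashlib.sha1(obj).hexdigest()      # the object's name
  |     # git would store zlib.compress(obj) at
  |     # .git/objects/<sha[:2]>/<sha[2:]>
  | 
  | Commits and trees work the same way, just with different
  | payloads.)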
 
  | georgyo wrote:
  | It was a lot to digest, but it was also all one continuous
  | thought.
  | 
  | If it was broken up, I don't think it would have been nearly as
  | good. And I don't think I would have been able to keep all the
  | context to understand smaller chunks.
  | 
  | I really enjoyed the whole thing.
 
| swiley wrote:
| Mono-repos are like having a flat directory structure.
| 
| Sure, it's simple, but it makes it hard to find anything if you
| have a lot of stuff/people. Submodules and package managers exist
| for a reason.
 
  | no_wizard wrote:
  | Note: for the sake of discussion I'm assuming when we say
  | monorepo we mean _monorepo and associated tools used to manage
  | them_
  | 
  | The trade-off is simplified management of dependencies. With a
  | monorepo, I can control every version of a given dependency so
  | they're uniform across packages. If I update one package it is
  | always going to be linked to the other at its latest version. I
  | can simplify releases and managing my infrastructure in the
  | long term, though there is a trade-off in initial complexity
  | for certain things if you want to do something like, say, only
  | running tests in CI for packages that have changed (useful in
  | some cases).
  | 
  | It's all trade-offs, but the quality of code has been higher on
  | average for our org in a monorepo.
 
    | mr_tristan wrote:
    | I've found that many developers do not pay attention to
    | dependency management, so this approach of "it's either in
    | the repo or it doesn't exist" is actually a nice guard rail.
    | 
    | I'm reading between the lines here, but I'm assuming you've
    | set up your tooling to enforce this. As in: the various
    | projects in the repo don't just optionally decide to have
    | external references, e.g., Maven Central, npm, etc.
    | 
    | This puts quite a lot of "stuff" in the repo, but
    | improvements like the ones this article mentioned make
    | monorepos in git much easier to use.
    | 
    | I'd have to think you could generate a lot of automation and
    | reports triggered off commits pretty easily, too. I'd say
    | that would make the monorepo even easier to observe, with a
    | modicum of the tooling required to maintain independent
    | repositories.
 
      | no_wizard wrote:
      | That is accurate, I wouldn't use a monorepo without
      | tooling, and in the JavaScript / TypeScript ecosystem, you
      | really can't do much without tooling (though npm supports
      | workspaces now, it doesn't support much else yet, like
      | plugins or hooks etc).
      | 
      | In the past I have tried to achieve the same goals without
      | one, particularly around the dependency graph and not
      | duplicating functionality found in shared libraries (a
      | concern that goes hand in hand with another of mine,
      | documentation enforcement). It just wasn't possible in a
      | way I could automate with a high degree of accuracy and
      | confidence without even more complexity, like using some
      | kind of CI integration to pull dependency files across
      | packages and compare them. In a monorepo I have a single
      | tool that does this for _all_ dependencies whenever any
      | package.json file or the lock file is updated.
      | 
      | If you care at all about your dependency graph (and in my
      | not-so-humble opinion every developer should have some
      | high-level awareness here in their given domain), I haven't
      | found a better solution that is less complex than
      | leveraging a monorepo.
 
  | Denvercoder9 wrote:
  | _> Sure it's simple but it makes it hard to find anything if
  | you have a lot of stuff/people._
  | 
  | I think this is a bad analogy. Looking up a file or directory
  | in a monorepo isn't harder than looking up a repository. In
  | fact, I'd argue it's easier, as we've developed decades of
  | tooling for searching through filesystems, while for searching
  | through remotely hosted repositories you're dependent on the
  | search function of the repository host, which is often worse.
 
| cryptica wrote:
| To scale a monorepo, you need to split it up into multiple repos;
| that way each repo can be maintained independently by a separate
| team...
| 
| We can call it a multi-monorepo; that way our brainwashed
| managers will agree to it.
 
  | Orphis wrote:
  | And that way, you can't have atomic updates across the
  | repositories and need to synchronize them all the time, great.
 
    | iudqnolq wrote:
    | What do atomic source updates get you if you don't have
    | atomic deploys? I'm just a student but my impression is that
    | literally no one serious has atomic deploys, not even Google,
    | because the only way to do it is scheduled downtime.
    | 
    | If you need to handle different versions talking to each
    | other in production it doesn't seem any harder to also deal
    | with different versions in source, and I'd worry atomic
    | updates to source would give a false sense of security in
    | deployment.
 
      | status_quo69 wrote:
      | > If you need to handle different versions talking to each
      | other in production it doesn't seem any harder to also deal
      | with different versions in source
      | 
      | It's much more annoying to deal with multi-repo setups and
      | it can be a real productivity killer. Additionally, if you
      | have a shared dependency, now you have to juggle managing
      | that shared dep. For example, repo A needs shared lib
      | Foo@1.2.0 and repo B needs Foo@1.3.4, because developers on
      | team A didn't update their dependencies often enough to
      | keep up with version bumps from the Foo team. Now there's a
      | really weird situation going on in your company where not
      | all teams are on the same page. A naive monorepo forces
      | that shared dep change to be applied across the board at
      | once.
      | 
      | Edit: In regards to your "old code talking to new version"
      | problem, that's a culture problem IMO. At work we must
      | always consider the fact that a deployment rollout takes
      | time, so our changes in sensitive areas (controllers, jobs,
      | etc) should be as backwards compatible as possible for that
      | one deploy barring a rollback of some kind. We have linting
      | rules and a very stupid bot that posts a message reminding
      | us of that fact if we're trying to change something
      | sensitive to version changes, but the main thing that keeps
      | it all sane is we have it all collectively drilled in our
      | heads from the first time that we deploy to production that
      | we support N number of versions backwards. Since we're in a
      | monorepo, the PR to rip out the backwards-compat check
      | usually lands immediately after a deployment is verified
      | as good. In a multi-repo setup, ripping that
      | compat check out would require _another_ version bump and N
      | number of PRs to make sure that everyone is on the same
      | page. It really sucks.
 
    | slver wrote:
    | We have repository systems built for centralized atomic
    | updates, and giant monorepos, like SVN. Question is why are
    | we trying to have Git do this, which was explicitly designed
    | with the exact opposite goal? Is this an attempt to do SVN in
    | Git so we get to keep the benefits of the former, and the
    | cool buzzword-factor of the latter? I don't know.
    | 
    | Also when I try to think about reasons to have atomic cross-
    | project changes, my mind keeps drawing negative examples,
    | such as another team changing the code on your project, is
    | that a good practice? Not really. Well unless all projects
    | are owned by the same team, it'll happen in a monorepo.
    | 
    | Atomic updates not scaling beyond certain technical level is
    | often a good thing, because they also don't scale on human
    | and organizational level.
 
      | alexhutcheson wrote:
      | 1. You determine that a library used by a sizable fraction
      | of the code in your entire org has a problem that's
      | critical to fix (maybe a security issue, or maybe the
      | change could just save millions of dollars in compute
      | resources, etc.), but the fix requires updating the use of
      | that library in ~30 call sites spread across the codebases
      | of ~10 different teams.
      | 
      | 2. You create a PR that fixes the code and the problematic
      | call sites in a single commit. It gets merged and you're
      | done.
      | 
      | In the multi-repo world, you need to instead:
      | 
      | 1. Add conditional branching in your library so that it
      | supports both the old behavior and new behavior. This could
      | be an experiment flag, a new method DoSomethingV2, a new
      | constructor arg, etc. (sketched at the end of this
      | comment). Depending on how you do this, you might
      | dramatically increase the number of call sites that need to
      | be modified.
      | 
      | 2. Either wait for all the problematic clients to update to
      | the new version of your library, or create PRs to manually
      | bump their version. Whoops - turns out a couple of them
      | were on a very old version, and the upgrade is non-trivial.
      | Now that's your problem to resolve before you proceed.
      | 
      | 3. Create PRs to modify the calling code in every repo that
      | includes problematic calls, and follow up with 10 different
      | reviewers to get them merged.
      | 
      | 4. If you still have the stamina, go through steps 1-3
      | again to clean up the conditional logic you added to your
      | library in step 1.
      | 
      | Basically, if code calls libraries that exist in different
      | repos, then making backwards-incompatible changes to those
      | libraries becomes extremely expensive. This is bad, because
      | sometimes backwards-incompatible changes would have very
      | high value.
      | 
      | If the numbers from my example were higher (e.g. 1000 call
      | sites across 100 teams), then the library maintainer in a
      | monorepo would probably still want to use a feature flag or
      | similar to avoid trying to merge a commit that affects 1000
      | files in one go. However, the library maintainer's job is
      | still dramatically easier, because they don't have to deal
      | with 100 individual repos, and they don't need to do
      | anything to ensure that everyone is using the latest
      | version of their library.
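      | 
      | The conditional branching in step 1 is typically a shim
      | along these lines (hypothetical names, sketched in Python):
      | 
      |     def old_behavior(x): ...    # the problematic code path
      |     def fixed_behavior(x): ...  # the fix
      | 
      |     def do_something(x):
      |         # Old entry point, kept so unmigrated callers still
      |         # work; delegates with legacy semantics.
      |         return do_something_v2(x, legacy=True)
      | 
      |     def do_something_v2(x, legacy=False):
      |         return old_behavior(x) if legacy else fixed_behavior(x)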
 
        | slver wrote:
        | Your monorepo scenario makes the following unlikely
        | assumptions:
        | 
        | 1. A critical security/performance fix has no other
        | recourse than breaking the interface compatibility of a
        | library. Far more common scenario is this can be fixed in
        | the implementation without BC breaks (otherwise systems
        | like semver wouldn't make sense).
        | 
        | 2. The person maintaining the library knows the codebases
        | of 10 teams better than those 10 teams do, so that
        | person can patch their projects better and faster than
        | the actual teams.
        | 
        | As a library maintainer, you know the interface of your
        | library. But that's merely the "how" on the other end of
        | those 30 call sites. You don't know the "why". You can
        | easily break their projects even though your code
        | compiles just fine. So that'd be a reckless approach.
        | 
        | Also your multi-repo scenario is artificially contrived.
        | No, you don't need conditional branching and all this
        | nonsense.
        | 
        | In the common scenario, you just push a patch that
        | maintains BC and tell the teams to update and that's it.
        | 
        | And if you do have BC breaks, then:
        | 
        | 1. Push a major version with the BC breaks and the fix.
        | 
        | 2. Push a patch version deprecating that release and
        | telling developers to update.
        | 
        | That's it. You don't need all this nonsense you listed.
 
        | hamandcheese wrote:
        | I've lived both lives. It absolutely is an ordeal making
        | changes across repos. The model you are highlighting
        | opens up substantial risk that folks don't update in a
        | timely manner. What you are describing is basically just
        | throwing code over the wall and hoping for the best.
 
        | howinteresting wrote:
        | Semver is a second-rate coping mechanism for when better
        | coordination systems don't exist.
 
        | slver wrote:
        | Patching the code of 10 projects you don't maintain isn't
        | an example of a "coordination system". It's an example of
        | avoiding having one.
        | 
        | In multithreading this would be basically mutable shared
        | state with no coordination. Every thread sees everything,
        | and is free to mutate any of it at any point. Which as we
        | all know is a best practice in multithreading /s
 
        | howinteresting wrote:
        | The same code can have multiple overlapping sets of
        | maintainers. For example, one team can be responsible for
        | business logic while another team can manage core
        | abstractions shared by many product teams. Yet another
        | team may be responsible for upgrading to newer toolchains
        | and language features. They'll all want to touch the same
        | code but make different, roughly orthogonal changes to
        | it.
        | 
        | Semver provides just a few bits of information, not
        | nearly enough to cover the whole gamut of shared and
        | distributed responsibility.
        | 
        | The comparison with multithreading is not really valid,
        | since monorepos typically linearize history.
 
        | slver wrote:
        | Semver was enough for me to very simply resolve a
        | scenario above that was presented as some kind of
        | insurmountable nightmare. So I think Semver is just fine.
        | It's an example of a simple, well-designed abstraction.
        | Having "more bits" is not a virtue here.
        | 
        | I could have some comments on your "overlapping
        | responsibilities" as well, but your description is too
        | abstract and vague to address, so I'll pass on that. But
        | you literally described the concept of a library at one
        | point. There's nothing overlapping about it.
 
        | iudqnolq wrote:
        | > You create a PR that fixes the code and the problematic
        | call sites in a single commit. It gets merged and you're
        | done.
        | 
        | What happens when you roll this out and partway through
        | the rollout an old version talks to a new version? I
        | thought you still needed backwards compat? I'm a student
        | and I've never worked on a project with no-downtime
        | deploys, so I'm interested in how this can be possible.
 
      | howinteresting wrote:
      | Of course I want people who care about modernizing code to
      | come in and modernize my code (such as upgrades to newer
      | language versions). Why should the burden be distributed
      | when it can be concentrated among experts?
      | 
      | I leverage type systems and write tests to catch any
      | mistakes they might make.
 
    | swiley wrote:
    | Yes you can; it happens when you bump the submodule
    | reference. This is how reasonable people use git.
 
      | Denvercoder9 wrote:
      | Submodules often provide a terrible user experience
      | _because_ they are locked to a single version. To propagate
      | a single commit, you need to update every single dependent
      | repository. In some contexts that can be helpful, but in my
      | experience it's mostly an enormous hassle.
      | 
      | Also it's awful that a simple git pull doesn't actually
      | pull updated submodules; you need to run git submodule
      | update (or sync or whatever it is) as well.
      | 
      | I don't want to work with git submodules ever again. The
      | idea is nice, but the user experience is really terrible.
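      | 
      | For the record, the propagation dance in each dependent
      | repo looks something like this (paths and versions made
      | up):
      | 
      |     cd libs/shared
      |     git fetch && git checkout v1.3.4
      |     cd ../..
      |     git add libs/shared
      |     git commit -m "Bump shared submodule to v1.3.4"
      | 
      | and on everyone else's machine after a plain pull:
      | 
      |     git submodule update --init --recursive
      |     # or pull with: git pull --recurse-submodules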
 
        | fpoling wrote:
        | Looking back I just do not understand why git came up
        | with this awkward mess of submodules. Instead it should
        | have a way to say that a particular directory is self-
        | contained, and any commit affecting it should be two
        | objects. The first is the commit object for the directory
        | using only relative paths. The second is the commit for
        | the rest of the code with a reference to it. Then one
        | could just pull any repository into the main repository
        | without rewriting it and use it normally.
        | 
        | git subtree tries to emulate that, but it does not scale
        | to huge repositories as it needs to change all commits in
        | the subtree to use new nested paths.
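        | 
        | (For comparison, the subtree emulation is roughly:
        | 
        |     git subtree add --prefix=vendor/lib https://example.com/lib.git main
        | 
        | with the path rewriting described above happening in the
        | matching split/pull machinery -- the part that doesn't
        | scale.)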
 
        | mdaniel wrote:
        | And woe unto junior developers who change into the
        | submodule directory and do a git commit, which is made
        | infinitely worse if it's followed by a git push, because
        | now there's a sha hanging out in the repo that works on
        | one machine but that no one else's submodule update will
        | see without surgery.
        | 
        | I'm not at my computer to see if modern git prohibits
        | that behavior, but it is indicative of the "watch out"
        | that comes with advanced git usage: it is a very sharp
        | knife.
 
      | dylan-m wrote:
      | Or define your interfaces properly, version them, and
      | publish libraries (precompiled, ideally) somewhere outside
      | of your source repo. Your associated projects depend on
      | those rather than random chunks of code that happen to be
      | in the same file structure. It's more work, but it
      | encourages better organization in general and saves an
      | incredible amount of time later on for any complex project.
 
        | throwaway894345 wrote:
        | I don't like this because it assumes that all of those
        | repositories are accessible all of the time to everyone
        | who might want to build something. If one repo for some
        | core artifact becomes unreachable, everyone is dead in
        | the water.
        | 
        | Ideally "cached on the network" could be a sort of
        | optional side effect, like with Nix, but you can still
        | reproducibly build from source. That said, I can't
        | recommend Nix, not for philosophical reasons but because
        | of lots of implementation details.
 
    | cryptica wrote:
    | If the project has good separation of concerns, you don't
    | need atomic updates. Good separation of concerns yields many
    | benefits beyond ease of project management. It requires a bit
    | more thought, but if done correctly, it's worth many times
    | the effort.
    | 
    | Good separation of concerns is like earning compound interest
    | on your code.
    | 
    | Just keep the dependencies generic and tailor the higher
    | level logic to the business domain. Then you rarely need to
    | update the dependencies.
    | 
    | I've been doing this on commercial projects (to much success)
    | for decades, since before most of the down-voters on here
    | even wrote their first hello world programs.
 
    | [deleted]
 
| WayToDoor wrote:
| The article is really impressive. It's nice to see GitHub
| contribute changes back to the git project, and to know that the
| two work closely together.
 
  | slver wrote:
  | It's in their mutual interest. Imagine what happens to GIThub
  | if GIT goes out of fashion.
 
    | infogulch wrote:
    | Yes, isn't it nice when interests of multiple parties are
    | aligned such that they help each other make progress towards
    | their shared goals?
 
      | slver wrote:
      | Well, it's nice to see they're rational, indeed.
 
        | jackbravo wrote:
        | Other rational companies could try to fix this without
        | contributing upstream. Doing it upstream benefits
        | competitors like gitlab. So yeah! It's nice seeing this
        | kind of behavior
 
        | slver wrote:
        | First, they not only contributed upstream; upstream
        | developers contributed to this patch. I.e., they got help
        | from outside GitHub to make this patch possible.
        | 
        | Second, if they had decided to fork Git, then they'd have
        | to maintain this fork forever.
        | 
        | Third, this fork could over time become visibly, or even
        | worse subtly, incompatible with stock Git, which is still
        | the Git running on GitHub users' machines, and the two
        | have to interact with each other in a 100% compatible
        | manner.
        | 
        | So, in this case, not contributing upstream was literally
        | a no-go. The only rational choice was to not fork Git.
 
___________________________________________________________________
(page generated 2021-05-01 23:00 UTC)