|
| whymauri wrote:
| I love when people use Git in ways I haven't thought about
| before. Reminds me of the first time I played around with
| 'blobs.'
| debarshri wrote:
| It is a great writeup. I wonder how gitlab solves this problem.
| lbotos wrote:
| GL packs refs at various times and frequencies depending on
| usage:
| https://docs.gitlab.com/ee/administration/housekeeping.html
|
| It works well for most repos, but as you get out to the edge
| cases with lots of commits it can cause slowness. GL admins
| can repack reasonably safely at various times to get access
| speedups, but the solution presented in the blog would def
| speed packing up.
|
| (I work as a Support Engineering Leader at GL but I'm reading
| HN for fun <3)
| masklinn wrote:
| They might have yet to encounter it. Git is hosting some really
| big repos.
| lbotos wrote:
| Oh, we def have. I've seen some large repos (50+GB) in some
| GL installations.
| the_duke wrote:
| Very well written post and upstream work is always appreciated.
|
| I also really like monorepos, but Git and GitHub really don't
| work well at all for them.
|
| On the Git side there is no way to clone only parts of a repo or
| to limit access by user. All the Git tooling out there, from the
| CLI to the various IDE integrations, is ill-adjusted to a huge
| repo with lots of unrelated commits.
|
| On the Github side there is no separation between the different
| parts of a monorepo in the UI (issues, prs, CI), the workflows,
| or the permission system. Sure, you can hack something together
| with labels and custom bots, but it always feels like a hack.
|
| Using Git(hub) for monorepos is really painful in my experience.
|
| There is a reason why Google, Facebook et al. have heaps of
| custom tooling.
| krasin wrote:
| I really like monorepos. But I find that it's almost never a
| good idea to hide parts of the source code from developers.
| And if there's some secret sauce that's so sensitive that only
| a single-digit number of developers in the whole company can
| access it, then it's probably okay to have a separate
| repository just for it.
|
| Working in environments where different people have partial
| access to different parts of the code never felt productive to
| me -- often, figuring out who can take on a task and how to
| grant all the access takes longer than the task itself.
| jayd16 wrote:
| I wouldn't call it painful exactly but I'll be happy when
| shallow and sparse cloning become rock solid and boring.
| Beowolve wrote:
| On this note, GitHub does reach out to customers with monorepos
| and is aware of the shortcomings. I think over time we will
| see them change to have better support. It's only a matter of
| time.
| jeffbee wrote:
| It's funny that you mention this as if monorepos of course
| require custom tooling. Google started with off-the-shelf
| Perforce and that was fine for many years, long after their
| repo became truly huge. Only when it became _monstrously_ huge
| did they need custom tools and even then they basically just
| re-implemented Perforce instead of adopting git concepts. You,
| too, can just use Perforce. It's even free for up to five
| users. You won't outgrow its limits until you get about a
| million engineer-years under your belt.
|
| The reason git doesn't have partial repo cloning is that it
| was written by people without regard to the past experience of
| software development organizations. It is suited to the
| radically decentralized group of Linux maintainers. It is
| likely that your organization much more closely resembles
| Google or Facebook than Linux. Perforce has had partial
| checkout since ~always, because that's a pretty obvious
| requirement when you stop and think about what software
| development _companies_ do all day.
| forrestthewoods wrote:
| It's somewhat mind boggling that no one has made a better
| Perforce. It has numerous issues and warts. But it's much
| closer to what the majority of projects need than Git, imho.
| And for bonus points I can teach an artist/designer how to
| safely and correctly use Perforce in about 10 minutes.
|
| I've been using Git/Hg for years and I still run into the
| occasional Gitastrophe where I have to Google how to unbreak
| myself.
| Chyzwar wrote:
| Git recently added sparse checkout, and there is also the
| Virtual File System for Git from Microsoft.
|
| In my experience, git/VCS is not the issue for a monorepo.
| Build, test, automation, deployments, CI/CD are way harder.
| You will end up with a bunch of shell scripts, makefiles,
| Grunt, and a combination of ugly hacks. If you are smart you
| will adopt something like Bazel and have a dedicated tooling
| team. If you see everything as a nail, you will split the
| monorepo into an unmaintainable mess of small repos that
| slowly rot away.
| throwaway894345 wrote:
| I've always found that the biggest issue with monorepos is the
| build tooling. I can't get my head around Bazel and other Blaze
| derivatives enough to extend them to support any interesting
| case, and Nix has too many usability issues to use productively
| (and I've been in an organization that gave it an earnest
| shot).
| krasin wrote:
| Can you please give an example of such an interesting case? I
| am genuinely curious.
|
| And I agree with the general point that monorepos require
| great build tooling to match.
| csnweb wrote:
| With GitHub Actions you can quite easily specify a workflow
| for parts of your repo (with a simple file path filter:
| https://docs.github.com/en/actions/reference/workflow-
| syntax...). So you can basically just write one workflow for
| each project in the monorepo and have only those run where
| changes occurred.
| infogulch wrote:
| This was great! My summary:
|
| A git packfile is an aggregated and indexed collection of
| historical git objects which reduces the time it takes to serve
| requests for those objects, implemented as two files: .pack and
| .idx. GitHub was having issues maintaining packfiles for very
| large repos in particular, because regular repacking always has
| to repack the entire history into a single new packfile every
| time -- which makes the cumulative repacking work quadratic.
| GitHub's engineering team ameliorated this problem in two steps:
| 1. Enable repos to be served from multiple packfiles at once,
| 2. Design a packfile maintenance strategy that uses multiple
| packfiles sustainably.
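|
| As a back-of-the-envelope illustration of that quadratic cost
| (my own toy numbers, not from the post): if a repo gains d new
| objects between maintenance runs and every run rewrites the
| whole history into one pack, the total number of objects
| rewritten after k runs is d + 2d + ... + kd, i.e. O(k^2 * d):
|
|     # Hypothetical growth rate and number of maintenance runs.
|     d, k = 1000, 50
|     total_rewritten = sum(i * d for i in range(1, k + 1))
|     print(total_rewritten)  # 1275000; a geometric scheme would
|                             # rewrite each object only ~log2(k)
|                             # times instead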
|
| Multi-pack indexes are a new git feature, but the initial
| implementation was missing performance-critical reachability
| bitmaps for multi-pack indexes. In general, index files store
| object names in lexicographic order and point to the named
| object's position in the associated packfile. As a first step
| toward reachability bitmaps for multi-pack indexes, they
| introduced a reverse index file (.rev) which maps packfile
| object positions back to positions in the name-sorted index.
| This alone brought a big performance improvement, and it also
| filled in the missing piece needed to implement multi-pack
| bitmaps.
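|
| To make the two orderings concrete, here is a rough sketch of
| the idea (my own illustration, not the on-disk format): the
| .idx orders objects by object ID, while bitmaps and offset
| lookups want them in pack order, and the .rev file is
| essentially the precomputed permutation between the two.
|
|     # Made-up, abbreviated object IDs with their pack offsets,
|     # sorted by object ID as they would be in an .idx file.
|     idx_entries = [
|         ("07c3...", 4123),
|         ("3af1...", 12),
|         ("9b2d...", 900),
|     ]
|
|     # rev[pack_position] = index_position, i.e. index positions
|     # reordered by where each object sits in the packfile.
|     rev = sorted(range(len(idx_entries)),
|                  key=lambda i: idx_entries[i][1])
|
|     # "Which object is 2nd in pack order?" becomes a table
|     # lookup instead of an on-the-fly sort of all offsets.
|     print(idx_entries[rev[1]][0])   # -> "9b2d..."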
|
| With the issue of serving repos from multiple packs solved, they
| needed to use multiple packs in a way that keeps maintenance
| overhead down. They chose to maintain historical packfiles in
| geometrically increasing sizes. I.e., during the maintenance
| job, consider the N most recent packfiles: if the summed size of
| packfiles [1, N] is less than the size of packfile N+1, repack
| [1, N] into a single packfile and stop; if their summed size is
| greater than the size of packfile N+1, iterate and consider
| packfiles [1, N+1] against packfile N+2, and so on. This results
| in a set of packfiles where each file is roughly double the size
| of the previous one when ordered by age, which has a number of
| beneficial properties for both serving and the average
| maintenance run. Funnily enough, this selection procedure struck
| me as similar to the game "2048".
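|
| A rough sketch of that selection loop (the real 'git repack
| --geometric=<factor>' works on object counts and handles more
| edge cases; this just shows the shape of it, with made-up
| sizes):
|
|     def packs_to_roll_up(sizes, factor=2):
|         """sizes: pack sizes, newest/smallest first. Returns
|         how many leading packs to combine into one new pack."""
|         n, total = 1, sizes[0]
|         while n < len(sizes) and sizes[n] < factor * total:
|             # The next pack is too small relative to the packs
|             # already selected, so it gets rolled up as well.
|             total += sizes[n]
|             n += 1
|         return n
|
|     sizes = [1, 1, 3, 32, 64]       # hypothetical pack sizes
|     n = packs_to_roll_up(sizes)     # -> 3: the three smallest
|                                     # become one ~5-unit pack,
|                                     # leaving [5, 32, 64]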
| underdeserver wrote:
| 30 minute read + the Git object model = mind boggled.
|
| I'd have appreciated a series of articles instead of one, for me
| it's way too much info to take in in one sitting.
| iudqnolq wrote:
| I'm currently working through the book Building Git. Best $30
| I've spent in a while. It's about 700 pages, but 200 pages in
| and I can stage files to/from the index, make commits, and see
| the current status (although not on a repo with packfiles).
|
| I'm thinking about writing a blog post where I write a git
| commit with hexdump, zlib, and vim.
| georgyo wrote:
| It was a lot to digest, but it was also all one continuous
| thought.
|
| If it was broken up, I don't think it would have been nearly as
| good. And I don't think I would have been able to keep all the
| context to understand smaller chunks.
|
| I really enjoyed the whole thing.
| swiley wrote:
| Mono-repos are like having a flat directory structure.
|
| Sure, it's simple, but it makes it hard to find anything if you
| have a lot of stuff/people. Submodules and package managers
| exist for a reason.
| no_wizard wrote:
| Note: for the sake of discussion I'm assuming when we say
| monorepo we mean _monorepo and associated tools used to manage
| them_
|
| The trade-off is simplified management of dependencies. With a
| monorepo, I can control every version of a given dependency so
| they're uniform across packages. If I update one package it is
| always going to be linked to the other in its latest version. I
| can simplify releases and managing my infrastructure in the
| long term, though there is a trade-off in initial complexity
| for certain things if you want to do something like, say, only
| running tests for packages that have changed in CI (useful in
| some cases).
|
| It's all trade-offs, but the quality of code has been higher
| for our org in a monorepo, on average.
| mr_tristan wrote:
| I've found that many developers do not pay attention to
| dependency management, so this approach of "it's either in
| the repo or it doesn't exist" is actually a nice guard rail.
|
| I'm reading between the lines here, but I'm assuming you've
| set up your tooling to enforce this. As in: the various
| projects in the repo don't just optionally decide to have
| external references, e.g., Maven Central, npm, etc.
|
| This puts quite a lot of "stuff" in the repo, but with
| improvements like this article mentioned, makes monorepos in
| git much easier to use.
|
| I'd have to think you could generate a lot of automation and
| reports triggered by commits pretty easily, too. I'd say that
| would make the monorepo even easier to observe, with a fraction
| of the tooling required to maintain independent repositories.
| no_wizard wrote:
| That is accurate: I wouldn't use a monorepo without
| tooling, and in the JavaScript / TypeScript ecosystem, you
| really can't do much without tooling (though npm supports
| workspaces now, it doesn't support much else yet, like
| plugins or hooks, etc.).
|
| I have tried in the past to achieve the same goals,
| particularly around the dependency graph and not duplicating
| functionality found in shared libraries (though this concern
| goes hand in hand with solving another concern I have, which
| is documentation enforcement), and it was just not really
| possible in a way that I could automate with a high degree of
| accuracy and confidence without even more complexity, like
| having to use some kind of CI integration to pull dependency
| files across packages and compare them. In a monorepo I have
| a single tool that does this for _all_ dependencies whenever
| any package.json file or the lock file is updated.
|
| If you care at all about your dependency graph (and in my not
| so humble opinion every developer should have some high-level
| awareness here in their given domain), I haven't found a
| better solution that is less complex than leveraging a
| monorepo.
| Denvercoder9 wrote:
| _> Sure it's simple but it makes it hard to find anything if
| you have a lot of stuff/people._
|
| I think this is a bad analogy. Looking up a file or directory
| in a monorepo isn't harder than looking up a repository. In
| fact, I'd argue it's easier, as we've developed decades of
| tooling for searching through filesystems, while for searching
| through remotely hosted repositories you're dependent on the
| search function of the repository host, which is often worse.
| cryptica wrote:
| To scale a monorepo, you need to split it up into multiple repos;
| that way each repo can be maintained independently by a separate
| team...
|
| We can call it a multi-monorepo; that way our brainwashed
| managers will agree to it.
| Orphis wrote:
| And that way, you can't have atomic updates across the
| repositories and need to synchronize them all the time, great.
| iudqnolq wrote:
| What do atomic source updates get you if you don't have
| atomic deploys? I'm just a student but my impression is that
| literally no one serious has atomic deploys, not even Google,
| because the only way to do it is scheduled downtime.
|
| If you need to handle different versions talking to each
| other in production it doesn't seem any harder to also deal
| with different versions in source, and I'd worry atomic
| updates to source would give a false sense of security in
| deployment.
| status_quo69 wrote:
| > If you need to handle different versions talking to each
| other in production it doesn't seem any harder to also deal
| with different versions in source
|
| It's much more annoying to deal with multi-repo setups and
| it can be a real productivity killer. Additionally, if you
| have a shared dependency, now you have to juggle managing
| that shared dep. For example, repo A needs shared lib
| Foo@1.2.0 and repo B needs Foo@1.3.4, because developers on
| team A didn't update their dependencies often enough to
| keep up with version bumps from the Foo team. Now there's a
| really weird situation going on in your company where not
| all teams are on the same page. A naive monorepo forces
| that shared dep change to be applied across the board at
| once.
|
| Edit: In regards to your "old code talking to new version"
| problem, that's a culture problem IMO. At work we must
| always consider the fact that a deployment rollout takes
| time, so our changes in sensitive areas (controllers, jobs,
| etc) should be as backwards compatible as possible for that
| one deploy barring a rollback of some kind. We have linting
| rules and a very stupid bot that posts a message reminding
| us of that fact if we're trying to change something
| sensitive to version changes, but the main thing that keeps
| it all sane is we have it all collectively drilled in our
| heads from the first time that we deploy to production that
| we support N versions backwards. Since we're in a monorepo, the
| backwards-compat check is usually ripped out in a PR immediately
| after a deployment is verified as good. In a multi-repo setup,
| ripping that compat check out would require _another_ version
| bump and N PRs to make sure that everyone is on the same page.
| It really sucks.
| slver wrote:
| We have repository systems built for centralized atomic
| updates and giant monorepos, like SVN. The question is why we
| are trying to have Git do this, when it was explicitly
| designed with the exact opposite goal. Is this an attempt to
| do SVN in Git, so we get to keep the benefits of the former
| and the cool buzzword factor of the latter? I don't know.
|
| Also, when I try to think about reasons to have atomic cross-
| project changes, my mind keeps drawing negative examples, such
| as another team changing the code of your project. Is that a
| good practice? Not really. Yet unless all projects are owned
| by the same team, it'll happen in a monorepo.
|
| Atomic updates not scaling beyond a certain technical level is
| often a good thing, because they also don't scale on the human
| and organizational level.
| alexhutcheson wrote:
| 1. You determine that a library used by a sizable fraction
| of the code in your entire org has a problem that's
| critical to fix (maybe a security issue, or maybe the
| change could just save millions of dollars in compute
| resources, etc.), but the fix requires updating the use of
| that library in ~30 call sites spread across the codebases
| of ~10 different teams.
|
| 2. You create a PR that fixes the code and the problematic
| call sites in a single commit. It gets merged and you're
| done.
|
| In the multi-repo world, you need to instead:
|
| 1. Add conditional branching in your library so that it
| supports both the old and the new behavior (see the sketch
| after this list). This could be an experiment flag, a new
| method DoSomethingV2, a new constructor arg, etc. Depending on
| how you do this, you might dramatically increase the number of
| call sites that need to be modified.
|
| 2. Either wait for all the problematic clients to update to
| the new version of your library, or create PRs to manually
| bump their version. Whoops - turns out a couple of them
| were on a very old version, and the upgrade is non-trivial.
| Now that's your problem to resolve before you proceed.
|
| 3. Create PRs to modify the calling code in every repo that
| includes problematic calls, and follow up with 10 different
| reviewers to get them merged.
|
| 4. If you still have the stamina, go through steps 1-3
| again to clean up the conditional logic you added to your
| library in step 1.
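|
| A minimal sketch of what step 1 might look like (hypothetical
| names, and just one of the options above: an experiment flag
| on the old entry point that routes to the fixed behavior):
|
|     # Hypothetical library code in the multi-repo scenario.
|     def do_something(data, use_fixed_behavior=False):
|         """Old entry point. Defaults to the legacy behavior so
|         existing callers keep working until they migrate."""
|         if use_fixed_behavior:
|             return do_something_v2(data)
|         return sorted(data)             # stand-in legacy logic
|
|     def do_something_v2(data):
|         """New entry point with the fix; callers flip the flag
|         or move here one repo at a time."""
|         return sorted(set(data))        # stand-in fixed logic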
|
| Basically, if code calls libraries that exist in different
| repos, then making backwards-incompatible changes to those
| libraries becomes extremely expensive. This is bad, because
| sometimes backwards-incompatible changes would have very
| high value.
|
| If the numbers from my example were higher (e.g. 1000 call
| sites across 100 teams), then the library maintainer in a
| monorepo would probably still want to use a feature flag or
| similar to avoid trying to merge a commit that affects 1000
| files in one go. However, the library maintainer's job is
| still dramatically easier, because they don't have to deal
| with 100 individual repos, and they don't need to do
| anything to ensure that everyone is using the latest
| version of their library.
| slver wrote:
| Your monorepo scenario makes the following unlikely
| assumptions:
|
| 1. A critical security/performance fix has no other
| recourse than breaking the interface compatibility of a
| library. A far more common scenario is that it can be fixed
| in the implementation without BC breaks (otherwise systems
| like semver wouldn't make sense).
|
| 2. The person maintaining the library knows the codebases
| of 10 teams better than those 10 teams do, so that person
| can patch their projects better and faster than the actual
| teams.
|
| As a library maintainer, you know the interface of your
| library. But that's merely the "how" on the other end of
| those 30 call sites. You don't know the "why". You can
| easily break their projects even though your code compiles
| just fine. So that would be a reckless approach.
|
| Also your multi-repo scenario is artificially contrived.
| No, you don't need conditional branching and all this
| nonsense.
|
| In the common scenario, you just push a patch that
| maintains BC and tell the teams to update and that's it.
|
| And if you do have BC breaks, then:
|
| 1. Push a major version with the BC breaks and the fix.
|
| 2. Push a patch version deprecating that release and
| telling developers to update.
|
| That's it. You don't need all this nonsense you listed.
| hamandcheese wrote:
| I've lived both lives. It absolutely is an ordeal making
| changes across repos. The model you are highlighting
| opens up substantial risk that folks don't update in a
| timely manner. What you are describing is basically just
| throwing code over the wall and hoping for the best.
| howinteresting wrote:
| Semver is a second-rate coping mechanism for when better
| coordination systems don't exist.
| slver wrote:
| Patching the code of 10 projects you don't maintain isn't
| an example of a "coordination system". It's an example of
| avoiding having one.
|
| In multithreading this would be basically mutable shared
| state with no coordination. Every thread sees everything,
| and is free to mutate any of it at any point. Which as we
| all know is a best practice in multithreading /s
| howinteresting wrote:
| The same code can have multiple overlapping sets of
| maintainers. For example, one team can be responsible for
| business logic while another team can manage core
| abstractions shared by many product teams. Yet another
| team may be responsible for upgrading to newer toolchains
| and language features. They'll all want to touch the same
| code but make different, roughly orthogonal changes to
| it.
|
| Semver provides just a few bits of information, not
| nearly enough to cover the whole gamut of shared and
| distributed responsibility.
|
| The comparison with multithreading is not really valid,
| since monorepos typically linearize history.
| slver wrote:
| Semver was enough for me to resolve very simply a
| scenario above that was presented as some kind of
| unsurmountable nightmare. So I think Semver is just fine.
| It's an example of a simple, well designed abstraction.
| Having "more bits" is not a virtue here.
|
| I could have some comments on your "overlapping
| responsibilities" as well, but your description is too
| abstract and vague to address, so I'm pass on that. But
| you literally described the concept of library at one
| point. There's nothing overlapping about it.
| iudqnolq wrote:
| > You create a PR that fixes the code and the problematic
| call sites in a single commit. It gets merged and you're
| done.
|
| What happens when you roll this out and partway through
| the rollout an old version talks to a new version? I
| thought you still needed backwards compat? I'm a student
| and I've never worked on a project with no-downtime
| deploys, so I'm interested in how this can be possible.
| howinteresting wrote:
| Of course I want people who care about modernizing code to
| come in and modernize my code (such as upgrades to newer
| language versions). Why should the burden be distributed
| when it can be concentrated among experts?
|
| I leverage type systems and write tests to catch any
| mistakes they might make.
| swiley wrote:
| Yes you can; it happens when you bump the submodule
| reference. This is how reasonable people use git.
| Denvercoder9 wrote:
| Submodules often provide a terrible user experience
| _because_ they are locked to a single version. To propagate
| a single commit, you need to update every single dependent
| repository. In some contexts that can be helpful, but in my
| experience it's mostly an enormous hassle.
|
| Also, it's awful that a simple git pull doesn't actually
| pull updated submodules; you need to run git submodule
| update (or sync or whatever it is) as well.
|
| I don't want to work with git submodules ever again. The
| idea is nice, but the user experience is really terrible.
| fpoling wrote:
| Looking back I just do not understand why git came up
| with this awkward mess of submodules. Instead it should
| have a way to say that a particular directory is self-
| contained and any commit affecting it should be two
| objects. The first is the commit object for the directory
| using only relative paths. The second is a commit for the
| rest of the code with a reference to it. Then one can just
| pull any repository into the main repository and use it
| normally.
|
| git subtree tries to emulate that, but it does not scale
| to huge repositories as it needs to change all commits in
| the subtree to use new nested paths.
| mdaniel wrote:
| And woe unto junior developers who change into the
| submodule directory and do a git commit, which is made
| infinitely worse if it's followed by a git push, because now
| there's a sha hanging out in the repo which works on one
| machine but that no one else's submodule update will see
| without surgery.
|
| I'm not at my computer to see if modern git prohibits
| that behavior, but it is indicative of the "watch out"
| that comes with advanced git usage: it is a very sharp
| knife
| dylan-m wrote:
| Or define your interfaces properly, version them, and
| publish libraries (precompiled, ideally) somewhere outside
| of your source repo. Your associated projects depend on
| those rather than random chunks of code that happen to be
| in the same file structure. It's more work, but it
| encourages better organization in general and saves an
| incredible amount of time later on for any complex project.
| throwaway894345 wrote:
| I don't like this because it assumes that all of those
| repositories are accessible all of the time to everyone
| who might want to build something. If one repo for some
| core artifact becomes unreachable, everyone is dead in
| the water.
|
| Ideally "cached on the network" could be a sort of
| optional side effect, like with Nix, but you can still
| reproducibly build from source. That said, I can't
| recommend Nix, not for philosophical reasons, but for
| lots of implementation details.
| cryptica wrote:
| If the project has good separation of concerns, you don't
| need atomic updates. Good separation of concerns yields many
| benefits beyond ease of project management. It requires a bit
| more thought, but if done correctly, it's worth many times
| the effort.
|
| Good separation of concerns is like earning compound interest
| on your code.
|
| Just keep the dependencies generic and tailor the higher
| level logic to the business domain. Then you rarely need to
| update the dependencies.
|
| I've been doing this on commercial projects (to much success)
| for decades; before most of the down-voters on here even
| wrote their first hello world programs.
| [deleted]
| WayToDoor wrote:
| The article is really impressive. It's nice to see GitHub
| contribute changes back to the git project, and to know that the
| two work closely together.
| slver wrote:
| It's in their mutual interest. Imagine what happens to GIThub
| if GIT goes out of fashion.
| infogulch wrote:
| Yes, isn't it nice when interests of multiple parties are
| aligned such that they help each other make progress towards
| their shared goals?
| slver wrote:
| Well, it's nice to see they're rational, indeed.
| jackbravo wrote:
| Other rational companies could try to fix this without
| contributing upstream. Doing it upstream benefits
| competitors like GitLab. So yeah! It's nice seeing this
| kind of behavior.
| slver wrote:
| First, not only did they contribute upstream, upstream
| developers also contributed to this patch, i.e. they got
| help from outside GitHub to make this patch possible.
|
| Second, if they had decided to fork Git, then they'd have
| to maintain this fork forever.
|
| Third, this fork could over time become visibly, or even
| worse subtly, incompatible with stock Git, which is still
| the Git running on GitHub users' machines, and the two have
| to interact with each other in a 100% compatible manner.
|
| So, in this case, not contributing upstream was literally a
| no-go. The only rational choice was to not fork Git.