Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new RFC for monorepo #31

Merged
merged 3 commits into from Apr 19, 2021
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
115 changes: 115 additions & 0 deletions rfcs/031-monorepo.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
- Feature Name: monorepo
- Start Date: 2021-03-30
- RFC PR: [amundsen-io/rfcs#0030](/~https://github.com/amundsen-io/rfcs/pull/30)
- Amundsen Issue: [amundsen-io/amundsen#0000](/~https://github.com/amundsen-io/amundsen/issues/0000) (leave this empty for now)

# Move to Monorepo

## Summary

At the moment, Amnundsen project is divided into 6 repositories:
- amundsen
- amundsenfrontendlibrary
- amundsenmetadatalibrary
- amundsensearchlibrary
- amundsencommon
- amudnsendatabuilder

This RFC proposes merging all of these repositories into a single repo.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to add amundsengremlin to the list as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the intention for amundsengremlin to get merged into one of the proxies, or is the long-term intention to leave it as a separate package? I'm not super familiar with the code split there

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's a point that could have more attention in this rfc. We currently have two approaches for proxy development:

  1. 'old' proxies like neo4j, elasticsearch or atlas are kept directly in amundsenmetadata and amundsensearch respectively
  2. 'new' proxies like rds or gremlin are separate git repositories.

Maybe we should unify the approach ? afaik the idea with having proxies in different repositories made contributing easier - you could have a team that would be maintainers only for proxy repo, not core amundsen repos. @feng-tao would be more knowledgeable here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make more extensive use of codeowners files for each package, so we keep the ownership clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proposed some language to consider the incubating repos. Open to feedback!

My thought is the separate packages should only be used for incubating, unsupported packages. Once it it officially supported by the project, it should require CR and RFCs from approved maintainers. Otherwise, we will have officially-supported parts of the project that aren't necessarily governed by our normal processes (RFC and CR). If the outside contributor is still making meaningful contributions after the package is landed, we should consider suggesting them as a maintainer (in which case @Golodhros's suggestion of stronger codeowners would make sense -- it might make sense to have that new maintainer start as specialized to the specific area they were previously working on, and if they want to contribute outside of that, we add that on later via codeowner additions)

Copy link
Member

@feng-tao feng-tao Apr 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently neptune and RDS backend is introduced as part of RFC process. Though I think it is easier for maintain with different repos. But I wouldn't call them incubating repos. This is same as Atlas which the Lyft engineering team built the whole integration with neo4j while atlas is maintained by @bolkedebruin ING team. Both are officially supported backend in Amundsen for long time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes sense.

Though I think it is easier for maintain with different package

Can you elaborate on what is made easier here? I'd like to capture the benefits of those packages without the overhead of actually having separate repos, if that makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RFC here is to propose mono repo with different packages. For RDS, gremlin based, it seemed to be easier to maintain them separately given the people who developed those are not part of the core committer team if that makes sense. It will help to iterate those features faster without asking existing committer to signoff.

But once the repos/features are mature, we should bring them into same repo each of which to be a new package.


There will be no change to the built artifacts. Users who use PyPi packages or Docker images will have no change to their deployments. Users with forks or submodule based customization will have to make minor changes to their CI/CD pipelines to pull from the correct directories, but the required changes will be minor.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smooth change management, nice !

However, with current structure, when someone is using the submodule approach can freely choose versions (even refs) for each product separately (I can use frontend in different version than metadata and search). This change somehow limits that capability. It's worth to explicitly mention it in drawbacks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, good point, added



## Motivation

The submodule architecture creates the following issues and is making it hard for companies to adopt Amundsen, and to customize it accordingly. It also causes substantial developer toil in the form of having to create multiple small PRs to merge one

1. Dependency Management: Python packages are pretty much the same for each proxy, which makes it hard to sync across all the above repositories. As a result most of the time, each proxy is running its own version of the dependency.
A few examples:

| Package | amundsenmetadata | amundsenfrontend |
| -------------- |:------------------:| -----------------:|
| amundsen-common | >=0.8.1 | ==0.6.0 |
| flake8 | ==3.5.0 | ==3.8.4 |
| flake8-tidy-imports | ==1.1.0 | ==4.2.1 |

and many more... This change will put all the packages in just one place, making practically to keep dependencies similar accross verisons, simply because it will be possible to open a single PR that updates multiple packages. Practically speaking this will improve the status quo slightly.

1. Development Efforts: Having multiple repositories makes it really hard to implement a feature. Implementation and testing require efforts to synchronize and then code reviews, and finally, all the PRs across multiple repositories need to land in master at a certain time or at the same time. This change results in faster development, one PR to fix a bug or a feature, no dependency hell as we are facing today, hence attract more contributors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a question - do we have any hard data (like results of questionnaires) to back the validity of those issues ? or is it more the result of interacting with users and observation ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't -- we did the call for feedback in December or January from community members but only got a handful of detailed responses. This is from our interactions with many community members (both longer-time installs) as well as people new to the project. Both groups are impacted, but in different ways.


1. Customization & Deployment: One of the frequently mentioned topics re Amundsen is the complexity of the architecture, and the efforts required to customize Amundsen. Having multiple repositories results in multiple components to install, customize, manage and fix. If you have forks, there is more work to apply your changes on a per-repo basis. Having a single repository to work from will reduce toil.

1. Code Duplication: From requirements to config files, models, helper functions, exceptions handling, CI/CD pipelines, Docker files, and other repository management like licenses, PR/Issues templates, etc., is pretty much the same in all these microservices. Change in one repository deviates that repository from others at the moment. This change will move all the above into a single repository making it easy to change anything without keeping track of multiple repositories, and redundant code.


### Contrib/incubating packages

From time-to-time, the Amundsen project creates third party packages to allow substantial new additions to be developed by non-maintainers. The monorepo per se doesn't preclude us having incubator repositories, they can be handled similarly as today.

We currently have two approaches for proxy development:

1. 'old' proxies like neo4j, elasticsearch or atlas are kept directly in amundsenmetadata and amundsensearch respectively
1. 'new' proxies like rds or gremlin are separate git repositories.

Our intention in the monorepo would be to keep incubating repos separate only until they are stable, upon which they should be merged into the main repo. Any officially-supported proxies should get CR by maintainers, not exist outside of that process.


## Transition Path

The transition will be done on a new branch on the `amundsen` repo, named `main`. We will move all code there and leave `master` as the default branch in GitHub until the migration is complete (plus a waiting period).

### Phase 1:
We create a new branch on the `amundsen` repo, named `main`. All work that follows in this description will occur in this branch.

### Phase 2:
Raw code import and CI. We import all of the code as a snapshot into the repository.

At this point, we migrate the CI/CD pipeline, and automated release mechanisms, from the consituent repositories to the main repository. They will publish packages using a `-beta` postfix.

We will merge the codeowners files at the same time here.

### Phase 3:
We freeze code in all the submodules: no new PRs in. Existing PRs will have 1 week to land. We will assist authors in getting nearly-mergable PRs landed. For PRs that don't make it, we will provide instructions to port their patches to the new repo, if they choose to re-submit.

### Phase 4:

We re-copy all code from the submodules into the main repository using [/~https://github.com/shopsys/monorepo-tools](/~https://github.com/shopsys/monorepo-tools) to preserve git commit history for each of the branches.

PRs against the main repo will now be allowed from the community. This means there was a 1 week period where no new PRs were accepted to the project. This is necessary for us to merge out the old remaining PRs, and to avoid any annoying merge conflicts once those land.

The old repositories will now be frozen to new contributions from non-maintainers (maintainers can still create critical hotfixes if absolutely necessary). There will be a deprecation notice added to the top of the README for each subrepo.

### Phase 4:
We mark the `main` branch on GitHub as the main branch, so that new pulls use the new project. We update the documentation pipeline to build from the Main branch, so that the new documentation goes onto the website.


### Phase 5:
After 1 year, we will remove all of the sub-repos, as well as the `master` branch on the Amundsen repository.

## How We Communicate This

- We'll promote this RFC frequently on Amundsen Slack and other social media channels, to get feedback from the community.
- During the community meetings, update on the progress and phases.
- We'll write a Blog/Tutorial about this change on how it's recommended to do customizations (we have been workshopping it here https://medium.com/stemma/amundsen-deployment-best-practices-740a1800518e and would like to bring it upstream once it's better tested with the monorepo).

## Drawbacks

There is a transition cost both internally, and for certain users who do customizations.

Currently, it is possible for implementors who do code-level customization to keep different versions of the various microservices. This will make that challenging or impossible within a single repo.


## Alternatives

We've discussed alternatives that also bundle together service-level


## Unresolved questions

None, yet

## Future possibilities

It may be desirable to make architectural changes after this code structure change is done, for example consolidating components or changing how shared dependencies are managed. However, we'd like to keep that out-of-scope for this RFC and keep those discussions separate (unless if it impacts this portion).

We will also have the option to use [pip constraint files](https://pip.pypa.io/en/stable/user_guide/#constraints-files), which would likely help with the dependency issues we've encountered. This could be considered deeper and implemented as a PR after the transition occurs.