Alex Li 50688f5c36 SERVER-112240 Add modularity docs to the docs/ folder (#47479)
GitOrigin-RevId: 98218a81412f15e4a3aa82874a69ae0e131f2305
2026-02-05 17:14:57 +00:00


# Modules API
## What is a module
A module:
- Provides a coherent public API
- Has internal details that are not intended to be directly accessed from outside the module
- Is a set of files covering the API (headers), implementations (headers and cpp files), and tests
### Submodules
TODO
## Why are we doing this?
Having a clear delineation between public and private APIs for each module will improve the
maintainability and velocity of our codebase. Teams will have more freedom to evolve their
internal implementation details without affecting consumers. Consumers will benefit from
knowing what APIs are intended for their consumption.
## Assigning files to modules
The file `modules_poc/modules.yaml` contains a list of modules, each containing
a list of files. Each file must be contained in only one module. Note that
module assignment is not required to map neatly to team ownership.
In cases where multiple globs match a file, the current rule is that the
longest glob wins. This is used as a simpler-to-implement version of
most-specific glob wins, which we may switch to in the future.
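As a minimal sketch of the longest-glob-wins rule, assuming `fnmatch`-style matching (the real tooling may match globs differently) and a made-up glob table rather than the actual contents of `modules_poc/modules.yaml`:

```python
from fnmatch import fnmatch

# Hypothetical glob-to-module table, for illustration only.
MODULE_GLOBS = {
    "src/mongo/db/*": "core",
    "src/mongo/db/query/*": "query",
}


def assign_module(path, module_globs=MODULE_GLOBS):
    """Return the module of the longest glob matching `path`, or None if nothing matches."""
    matches = [glob for glob in module_globs if fnmatch(path, glob)]
    if not matches:
        return None
    # "Longest glob wins": compare by the length of the pattern text.
    return module_globs[max(matches, key=len)]


print(assign_module("src/mongo/db/query/plan.h"))
print(assign_module("src/mongo/db/catalog.h"))
```

Both globs match the first path, and the longer one decides the module. Comparing pattern lengths is only a stand-in for "most specific", which is why the rule may change in the future.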
## How do I mark API visibility?
This section will just describe the basic process. Later sections will cover the tooling
available to help, along with caveats to be aware of.
First read the documentation in [src/mongo/util/modules.h](https://github.com/mongodb/mongo/blob/master/src/mongo/util/modules.h)
for the canonical list and description of visibility levels. As a brief overview of the main
levels from least to most restrictive:
- `OPEN`: This is available for usage _and inheritance_ from anywhere in the codebase
- `PUBLIC`: This is available for usage from anywhere in the codebase. For types, subclasses may
only be defined in the same module.
- `NEEDS_REPLACEMENT` and `USE_REPLACEMENT(...)`: These are collectively considered
"unfortunately public" and are available for use, but should be avoided
- `PARENT_PRIVATE`: This is similar to `PRIVATE`, but allows usage from any file in the parent
module, including other submodules
- `PRIVATE`: This may only be used from the current module or one of its submodules
- `FILE_PRIVATE`: This may only be used from the current "file family" (roughly, header \+ cpp
\+ tests). It may not be used by other files, even from the same module.
You can think of public vs private similarly to how you would the sections of a `class`: they
indicate whether something is intended to be part of the API or an implementation detail. The
difference is that they apply at a wider granularity of code than a single class, with
implementation details available to either the full module (and its submodules) for `PRIVATE`
or the file family for `FILE_PRIVATE`.
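To make the levels concrete, here is a rough sketch of the access rules as described above. It is not the tooling's implementation. The dotted `parent.child` module names, the `file_family` heuristic, and treating `PARENT_PRIVATE` in a top-level module like `PRIVATE` are all assumptions made for illustration.

```python
import re

# Levels that can be used from anywhere in the codebase.
USABLE_ANYWHERE = {"OPEN", "PUBLIC", "NEEDS_REPLACEMENT", "USE_REPLACEMENT"}


def file_family(path):
    """Approximate a file family by dropping the extension and a trailing _test."""
    stem = re.sub(r"\.(h|cpp)$", "", path)
    return re.sub(r"_test$", "", stem)


def within(module, ancestor):
    """True if `module` is `ancestor` or one of its (hypothetically dotted) submodules."""
    return module == ancestor or module.startswith(ancestor + ".")


def may_use(visibility, decl_module, decl_file, user_module, user_file):
    """Would a usage from user_file be allowed for a declaration with this visibility?"""
    if visibility in USABLE_ANYWHERE:
        return True
    if visibility == "PRIVATE":
        return within(user_module, decl_module)
    if visibility == "PARENT_PRIVATE":
        # Assumption: a top-level module has no parent, so fall back to the module itself.
        parent = decl_module.rpartition(".")[0] or decl_module
        return within(user_module, parent)
    if visibility == "FILE_PRIVATE":
        return file_family(decl_file) == file_family(user_file)
    raise ValueError(f"unknown visibility: {visibility}")


# A sibling submodule may use PARENT_PRIVATE declarations but not PRIVATE ones.
print(may_use("PARENT_PRIVATE", "query.exec", "a.h", "query.plan", "b.cpp"))
print(may_use("PRIVATE", "query.exec", "a.h", "query.plan", "b.cpp"))
```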
The macros in that header file are attached to declarations and set the visibility level for
that declaration and all of its "semantic children"[^1]. The macros are C++ attributes which
means that they need to go in specific places that differ based on what is being marked (for
templates, the location does not change and is always somewhere after the `template <...>` part):
- `MONGO_MOD_PUBLIC;` by itself as the first line after includes in a header sets the default
for that header (only `PUBLIC`, `PARENT_PRIVATE`, and `FILE_PRIVATE` are allowed here)
- `namespace MONGO_MOD mongo {` (this does not work with nested namespaces in a single
declaration like `namespace mongo::repl`)
- `class MONGO_MOD Foo {` (Ditto for `enum`, `struct`, and `union`)
- `MONGO_MOD void func(...);`
- `MONGO_MOD int var;`
- `concept isFooable MONGO_MOD = ...;`
For the cases where it goes at the beginning of the line, if clang-format chooses an unfortunate
place to break the line, it usually helps to undo the formatting then put the macro on its own
line above the declaration.
APIs are marked one header at a time, by including `"mongo/util/modules.h"` in the header.
This causes the header to be treated as "modularized" which has the following effects:
- All declarations in that header (not transitive includes) default to `PRIVATE`, meaning that
the public API is what must be marked.
- Members in `private:` sections in classes default to `PRIVATE`, regardless of the visibility
of the class. The only way the language would allow them to be used from outside of the module
is if you have cross-module friendships, which should generally be avoided. If needed
temporarily, favor `NEEDS_REPLACEMENT` over `PUBLIC` for these declarations.
- Declarations ending in `_forTest` default to `FILE_PRIVATE` to support the common case where
they are only intended for testing that class. If they are actually intended to support testing
of consumers, not just the type they are defined on, they can be explicitly given `PUBLIC` or
`PRIVATE` visibility.
- Internal and detail namespaces default to `PRIVATE` and cannot be made less restricted, but
can still be marked as `FILE_PRIVATE`. Individual declarations within the namespace can be
exposed as necessary, but they cannot be exposed in bulk without changing the name of the
namespace to something that doesn't imply private.
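The defaulting rules above can be summarized in a small sketch. The function name, its parameters, and the order in which the rules are checked are assumptions for illustration. The regex is the one given for internal/detail namespaces in the caveats section.

```python
import re

# Regex quoted in the caveats section for internal/detail namespaces.
INTERNAL_NAMESPACE = re.compile(r"(detail|internal)s?$")


def default_visibility(name, namespace, in_private_section=False, parent_marker=None):
    """Visibility of an unmarked declaration in a header that includes modules.h.

    `parent_marker` is the explicit marker on the nearest semantic parent, if any.
    """
    if name.endswith("_forTest"):
        return "FILE_PRIVATE"
    if in_private_section:
        return "PRIVATE"  # regardless of the visibility of the class
    if INTERNAL_NAMESPACE.search(namespace):
        # Such namespaces can be restricted further, but never made less restricted.
        return "FILE_PRIVATE" if parent_marker == "FILE_PRIVATE" else "PRIVATE"
    return parent_marker or "PRIVATE"


print(default_visibility("reset_forTest", "mongo", parent_marker="PUBLIC"))
print(default_visibility("helper", "mongo::repl::details", parent_marker="PUBLIC"))
```

An explicit marker on the declaration itself always takes the place of these defaults, so the sketch only covers declarations that carry no marker of their own.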
For internal headers of a module which do not contribute to its public API, simply including
`modules.h` is sufficient. There is a [tool](#the-private-header-marker) to automate this
process. You may additionally want to consider whether any APIs should be marked `FILE_PRIVATE`,
but that is optional.
For IDL files, you mark visibility of whole types (`struct`, `enum`, and `command`) with the
`mod_visibility` option. The value should be the same as one of the `MONGO_MOD` macros, but
lowercase and without the prefix, for example `mod_visibility: public`. You can set the default
visibility for all types in that IDL file by putting that in the `global:` section. You cannot
control visibility of individual functions within the type. Please let us know if you have a
compelling use case for this.
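The naming rule for `mod_visibility` values is mechanical, as this small sketch shows. The helper name is made up, and `MONGO_MOD_FILE_PRIVATE` is assumed to follow the same naming pattern as `MONGO_MOD_PUBLIC`.

```python
def idl_mod_visibility(macro):
    """Convert a MONGO_MOD_* macro name into the matching IDL mod_visibility value."""
    prefix = "MONGO_MOD_"
    if not macro.startswith(prefix):
        raise ValueError(f"not a MONGO_MOD macro: {macro}")
    return macro[len(prefix):].lower()


print(idl_mod_visibility("MONGO_MOD_PUBLIC"))
print(idl_mod_visibility("MONGO_MOD_FILE_PRIVATE"))
```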
## What tooling exists to help me?
Note that all tooling should be run from within a properly set-up python virtual environment.
This includes running `buildscripts/poetry_sync.sh` to ensure you have the correct dependencies.
### The scanner and merger
The merger generates a cross reference of all first-party usages of first-party code and stores
it in `merged_decls.json`, which is used by the rest of our tooling. It is also where we validate
that there are no disallowed accesses. It will be invoked for you by the browser when you ask it
to rescan, or you can also manually run it as `modules_poc/merge_decls.py`. If you are interested
in analyzing that file, [`jq`](https://jqlang.org/) is a powerful tool, or you can just write
some python.
As a rather extreme example of what you can do with `jq`, here is how the progress reports are
generated:
```shell
# For each mod (and TOTAL):
# For each file:
# consider it marked if it has no UNKNOWNs
# Compute a done percentage
# Format to a nice string
jq 'map(., .mod = "TOTAL") | group_by(.mod)[] | group_by(.loc | split(":")[0]) | {mod: .[0].[0].mod, total: length, marked: map(select(any(.visibility == "UNKNOWN") | not)) | length} | .done = (1000 * .marked / .total | round) / 10 | "\(.mod): \(" " * (.mod | 40-length)) \(.done)% (\(.marked) / \(.total))"' -r merged_decls.json
```
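If you prefer Python, the same report can be approximated as follows. This sketch assumes only what the `jq` command above relies on: a top-level list of entries with `mod`, `loc`, and `visibility` fields, where the file path is the part of `loc` before the first `:`. The sample entries are made up, and number formatting and rounding of exact halves may differ slightly from `jq`.

```python
from collections import defaultdict


def progress_report(decls):
    """Summarize, per module and in TOTAL, how many files have no UNKNOWN declarations."""
    # fully_marked[mod][path] stays True until an UNKNOWN declaration is seen in that file.
    fully_marked = defaultdict(dict)
    for decl in decls:
        path = decl["loc"].split(":")[0]
        known = decl["visibility"] != "UNKNOWN"
        for mod in (decl["mod"], "TOTAL"):
            fully_marked[mod][path] = fully_marked[mod].get(path, True) and known

    lines = []
    for mod in sorted(fully_marked):
        files = fully_marked[mod]
        marked = sum(files.values())
        done = round(1000 * marked / len(files)) / 10
        lines.append(f"{mod}: {' ' * (40 - len(mod))} {done}% ({marked} / {len(files)})")
    return lines


# Made-up sample entries; real input would come from json.load() on merged_decls.json.
SAMPLE = [
    {"mod": "core", "loc": "src/mongo/a.h:10:5", "visibility": "PUBLIC"},
    {"mod": "core", "loc": "src/mongo/a.h:20:5", "visibility": "UNKNOWN"},
    {"mod": "core", "loc": "src/mongo/b.h:3:1", "visibility": "PRIVATE"},
    {"mod": "query", "loc": "src/mongo/q.h:1:1", "visibility": "PRIVATE"},
]

for line in progress_report(SAMPLE):
    print(line)
```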
Internally, the merger invokes `bazel build --config=mod-scanner //src/mongo/...`
to run the scanner over the whole codebase (or the parts that have changed since the last scan),
taking advantage of bazel remote execution to achieve very high levels of parallelism.
### The browser
The main piece of tooling to run is the browser, which is launched by running
`modules_poc/browse.py`. If you haven't scanned the codebase recently, it will offer to run a scan
for you, which will take a few minutes. After modifying the source code, you can rescan at any
time by pressing `r`. It will only rescan files that have been modified or that transitively
include modified headers.
The browser is primarily intended to assist in labeling public APIs, so the files are sorted
with the largest number of unlabeled declarations ("unknowns") first. You can search for a file
by pressing `f` or press `m` to filter the files by module.
The list of available key bindings is shown on the right. You can toggle that by pressing `?`.
Other keybindings of note: you can press `g` to go to the currently highlighted
declaration or location in your editor (only when running in the vscode or nvim terminal),
and `p` to toggle an inline preview of the location within the browser. You can press `Tab ↹`
to toggle between the tree and the code preview. The mouse is fully supported for scrolling
and expanding rows in the tree, and there are aliases for some basic vim keybinds (`hjkl/`).
### The private header marker
Once you have scanned the codebase and produced a `merged_decls.json`,
`modules_poc/private_headers.py` can be used to find all header and IDL files where there are
no currently detected external usages and automatically mark them as fully private to the
module. This does not necessarily mean that all automatically marked headers are intended to
be private. A human should review to ensure that the marked headers match intent. You can pass
flags to filter on any/all of module, owning team, or path glob. For headers matching the filter,
the script will also warn about `_forTest` declarations that are used outside their file family.
In marked headers these default to being available only to the file family, so they may need to
be marked `PRIVATE` to make them available to the whole module.
Make sure to run `buildscripts/clang_format.py format-my` or `bazel run format` after using it
to modify any C++ files.
Example usage:
```shell
./modules_poc/private_headers.py --team=server_programmability --module=core --glob="src/mongo/executor/*"
```
`--dry-run` can be added to view all of the changes without applying them.
### The PR comment generator
You can run `modules_poc/mod_diff.py` to output a brief summary of all of the API (including
visibility levels and usage counts) for each file modified in your branch. When putting up a PR
to mark API visibility, you should add a comment with its output to the PR as an aid to
reviewers. The output is intended to be close enough to C++ that you should put it in a
` ```cpp ` block when making your PR comment to make it more readable. You can also
pipe it through `bat -lcpp` to make it colorful locally. Note that it will use the last
scan output, so if you've modified any headers, you should run a rescan prior to running this
tool.
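If you generate these comments often, a small wrapper can add the fence for you. This is only a sketch: the helper names are made up, and it assumes `modules_poc/mod_diff.py` is run from the repository root without arguments and writes its summary to standard output.

```python
import subprocess


def format_pr_comment(mod_diff_output):
    """Wrap mod_diff.py output in a cpp fence so the PR comment is syntax highlighted."""
    fence = "`" * 3  # built at runtime to avoid a literal fence inside this example
    return f"{fence}cpp\n{mod_diff_output.rstrip()}\n{fence}\n"


def generate_pr_comment():
    """Run the generator from the repository root and return a ready-to-paste comment."""
    result = subprocess.run(
        ["modules_poc/mod_diff.py"], capture_output=True, text=True, check=True
    )
    return format_pr_comment(result.stdout)


print(format_pr_comment("class Foo;\n"))
```

Remember to rescan first if you have modified any headers, since the generator reads the last scan output.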
## Workflow
The workflow for each PR will generally be the same:
1. Ensure that you are in a python virtualenv, creating one if needed, and run
`buildscripts/poetry_sync.sh` to update python deps.
2. Run [the merger](#the-scanner-and-merger) to scan the code base: `modules_poc/merge_decls.py`
3. Mark some headers
4. Rerun the merger to ensure that there are no violations, and update `merged_decls.json`
5. Run [the PR comment generator](#the-pr-comment-generator) to show the APIs that you have marked
   - Look through this to ensure that everything is as you expect.
6. Put up a PR and include the generated comment in a ` ```cpp ` block
   - I suggest keeping PRs small (say, no more than 10 files at a time) so that they are
     manageable by reviewers. As an exception it seems reasonable to auto-mark many headers as
     private in a single PR, as long as those PRs are separate from those containing any manual
     marking.
When first starting to mark a module, I suggest running the [`modules_poc/private_headers.py`](#the-private-header-marker)
script with `--dry-run` (or `-n`) and `--module=YOUR_MODULE`. For larger modules (in particular,
the `query` mega module) you may want to pass a `--glob` so that you can focus on a smaller
subset of the code initially. That will give you an overview of the files that are used from
outside your module (which contain de facto public APIs today) and those that do not (which can
automatically be marked as private implementation details).
If all of the de facto private headers seem like they should be private, you can remove the
dry-run flag to have it automatically mark them as private. Be sure to validate that their
contents are actually intended to be private. Remember that the point of having a human doing
the marking is to ensure that we correctly capture intent. You can optionally mark implementation
details within each header as `FILE_PRIVATE`, if you would like to prevent them from being used
elsewhere even within the module.
You can then open [the browser](#the-browser) (`modules_poc/browse.py`) to look at the remaining
headers. It will show you what is used and from where. It will be particularly useful for things
that seem like they should be private, but are being used externally.
### What should I do when an internal API is currently being used?
1. If it is only used from a small number of external files, first check if those files should
actually belong to your module. We tried to correctly map all files in phase 1, but some files
may have been assigned to the wrong module. If that happens, try adjusting the globs in
`modules_poc/modules.yaml` to move them.
2. If there is already a public API that callers should use instead, mark it as
`USE_REPLACEMENT(better_api)`. The argument accepts any C++ tokens, but the intent is where
possible to use the name of the replacement. This will generate a ticket for all teams using
that code.
   1. If there are very few users, consider just cleaning them up.
3. Reconsider making this API public if other modules need its functionality, and this is
the only way to get it.
4. Otherwise, if there is no public API that fulfills the needs of the callers, but you
don't want the current API to remain public long-term, use `NEEDS_REPLACEMENT`. This will
generate a ticket for the team that owns that code.
   1. If the API was "obviously" intended to be private (e.g. it is in a `details` namespace)
      and callers would be reasonably able to implement the functionality themselves, possibly
      by writing their own version, it seems acceptable to use
      `USE_REPLACEMENT(do not use internal details)`
## Caveats and Limitations
**OVERARCHING GUIDELINE**: Always try to mark declarations correctly according to intent,
even if it will not be enforced by the current tooling. This is both to provide the correct
information to human readers, as well as to avoid issues if we improve the tooling in the
future to eliminate these limitations.
The rest of this section is fairly technical and probably not necessary for most readers unless
they notice something "weird" going on and want to dive into why. Most of these limitations are
more likely to affect the core modules since most of the rest of our code does not expose APIs
via macros and templates or have APIs only consumed by templates, and those are where most of
these issues come up.
- We do not track usages of namespaces at all, only the declarations within namespaces. When
a namespace is marked with a visibility, it does not affect the visibility of the namespace
itself (since it doesn't have one), it sets the default visibility for all declarations within
**that namespace block**. Each time a namespace is reopened it is a separate block and the
visibility markers on other blocks of the same namespace do not apply.
- The scanner only knows about declarations that it sees being used. For implementation reasons,
it only discovers declarations by seeing what every usage is using. This can either cause or be
caused by other limitations.
- Usages in templates may not be seen. This is especially the case for "dependent types and
values" which are things that are not known by the compiler before the template is instantiated.
  - This is a problem for functions where any arguments are dependent if it can't figure out
    which overload will be selected. It is even worse for free-functions called unqualified
    (`f(blah)` rather than `ns::f(blah)` or `x.f(blah)`) since due to ADL, overload resolution
    is _always_ delayed for them.
- Everything that results from a macro expansion is treated as if it were written at the point
of expansion. This applies to both declarations and usages. If you have an API that should
only be used via the defined macros, mark it as `PUBLIC_FOR_TECHNICAL_REASONS` to signal
to readers that they should avoid direct usage, even if the tooling won't prevent it. We may
improve this in the future.
- Template variables are completely ignored due to some unfortunate clang bugs. Still, try
to mark them correctly since we may change this in the future.
- Method calls are assigned to the static type at the call site. This has two important effects:
  - A subclass's overridden method may seem unused if it is only used via calls through a base
    class pointer/reference
  - Calls through a base class pointer/reference count as calls of that class's method, not of
    the interface's
- Defaulted members (methods, ctors, dtors) are treated as usages of the class itself,
regardless of whether they are implicitly or explicitly defaulted. This is because clang does not
provide an API to distinguish between those cases.
- Template normalization woes: we try really hard to report declarations as the template
`foo<T>` rather than separate instantiations like `foo<int>`, `foo<string>`, etc, **unless**
they are explicitly specialized, meaning that the instantiation has its own definition different
from the main template. Unfortunately, clang does a bad job at this and we have a number of
kludgy workarounds. The most important effects:
  - Explicit specializations of function and variable templates are ignored and always converted
    to the primary template.
  - We do treat explicit specializations of types as separate (using the heuristic of having a
    separate location than the main template), because they can have a different shape and API than
    the main template. In general they should probably have the same visibility though, unless the
    instantiation is using a private type which should be unavailable to consumers anyway.
  - Clang assigns many locations to the site of explicit template instantiations and extern
    template declarations, even when there is a better location that it can see. Luckily these
    are fairly rare.
- Sometimes clang reports the resolved destination of `using` declarations and type aliases, but
usually it reports the `using` declaration itself. A few notable cases (these are trends and
may not be absolute\!):
  - `using Base::foo;` to expose a member of a base class is resolved as a usage of `Base::foo`
    rather than `Derived::foo`. This is especially notable when the `Base` class is intended to be
    a private implementation detail. You will need to mark all exposed methods as public.
  - `using Base::Base;` to pull in the base constructors is the opposite and is recorded as a
    usage of `Derived::Base(args)`, which is odd because such a declaration doesn't actually exist.
- Internal/details namespaces (currently defined as matching the regex `(detail|internal)s?$`)
have an implicit default visibility of `PRIVATE` if `modules.h` is included. It is not
possible to give the namespace a public visibility, but you can restrict it further with
`FILE_PRIVATE`. If you want declarations inside it to be usable from outside your module you
must mark children of the namespace explicitly, or rename it to not use a name that implies
that it is for internal usage only. A somewhat common case will be marking internal declarations
that are only intended to be used via macros with `PUBLIC_FOR_TECHNICAL_REASONS`.
- Be very careful with forward declarations. Try to avoid them wherever possible (unless there
is a significant benefit). Especially avoid forward declaring anything from another module\!
Where forward declarations must be used, make sure that they have the same visibility as the
real definition. As an exception, if every TU that sees the forward declaration will also see
the definition it is OK to omit marking the forward declaration. This may happen when they are
both in the same header, or the forward declaration is in a private implementation detail header
which is included by the defining header. Be aware of the implicit visibility marking, which also
applies to forward declarations if they are the only declaration seen in the TU.
- Never forward declare functions to avoid including a header. They are much more problematic
than types, both in general in C++ and specifically for this tooling.
- We try to use the definition location for types defined in headers, but the "canonical"
location (clang's term for the first declaration seen in the current TU) for everything else.
If the type is defined in a .cpp, we use the canonical location.
- We only consider declarations in headers, never in .cpp files.
- Be mindful of `_forTest` functions. They default to `FILE_PRIVATE` since they are typically
intended only for use when testing the type they are defined on, not when testing consumers.
In the cases where they _are_ intended as part of the API for testing consumers, you can
explicitly mark them `PUBLIC` or `PRIVATE` depending on whether they should be usable from
outside your module or not.
- Things used implicitly (e.g. implicit conversion operators) are still counted as usages even
if they are not specifically named at the call site.
- When merging information from multiple TUs, definitions always replace the metadata gathered
from TUs that only saw a declaration.
- Note that we aren't guaranteed to see every definition, in particular for functions that
are not called from the TU that they are defined in. So this cannot be used to find places
where we deleted the definition but forgot to delete the declaration (we wouldn't see them
anyway, since we only track things that are used, and undefined things can't really be used,
except trivially, without breaking the build).
- `private` members of classes are implicitly `PRIVATE`, and must be explicitly marked otherwise
if desired. They should probably never be made `PUBLIC` since that implies cross-module
friendship. In the few places where we have that today, they have been made one of the flavors
of unfortunately public: `NEEDS_REPLACEMENT` or `USE_REPLACEMENT`.
- `public` members of `private` types do not inherit the implicit `PRIVATE` and follow the
normal rule of looking for their nearest semantic parent with an explicit marker. That means
that they may be `PUBLIC`. However, the language rules still apply and as long as an
instance of the type is never handed to consumers they will have no way of accessing those
members.
- `protected` members do not default to `PRIVATE`, but because we only allow subclassing from
`OPEN` classes, the language visibility rules will disallow access from outside the module
unless you choose to allow it by using `OPEN` classes or `friend`s. Note that making any
subclass `OPEN` exposes all `protected` members of parents unless they are marked `PRIVATE`.
- `friend` declarations are mostly ignored, except when they are a definition. So the
definitions using the "hidden friend" pattern are tracked, but we ignore it if the definition
is in a cpp file.
[^1]:
    Clang distinguishes between "semantic" and "lexical" parents. The primary differences
    are that members of classes (including member types) are semantic children of the class even
    when defined out of line, and conversely `friend` declarations are not, and instead are
    considered semantic children of the nearest namespace.