23 KiB
Modules API
What is a module
A module:
- Provides a coherent public API
- Has internal details that are not intended to be directly accessed from outside the module
- Is a set of files covering the API (headers), implementations (headers and cpp files), and tests
Submodules
TODO
Why are we doing this?
Having a clear delineation between public and private APIs for each module will improve the maintainability and velocity of our codebase. Teams will have more freedom to evolve their internal implementation details without affecting consumers. Consumers will benefit from knowing what APIs are intended for their consumption.
Assigning files to modules
The file modules_poc/modules.yaml contains a list of modules, each containing
a list of files. Each file must be contained in only one module. Note that
module assignment is not required to map neatly to team ownership.
In cases where multiple globs match a file, the current rule is that the longest glob wins. This is used as a simpler-to-implement version of most-specific glob wins, which we may switch to in the future.
How do I mark API visibility?
This section will just describe the basic process. Later sections will cover the tooling available to help, along with caveats to be aware of.
First read the documentation in src/mongo/util/modules.h for the canonical list and description of visibility levels. As a brief overview of the main levels from least to most restrictive:
OPEN: This is available for usage and inheritance from anywhere in the codebasePUBLIC: This is available for usage from anywhere in the codebase. For types, subclasses may only be defined in the same module.NEEDS_REPLACEMENTandUSE_REPLACEMENT(...): These are collectively considered "unfortunately public" and are available for use, but should be avoidedPARENT_PRIVATE: This is similar toPRIVATE, but allows usage from any file in the parent module, including other submodulesPRIVATE: This may only be used from the current module or one of its submodulesFILE_PRIVATE: This may only be used from the current "file family" (roughly, header + cpp + tests). It may not be used by other files, even from the same module.
You can think of public vs private similarly to how you would the sections of a class: they
indicate whether something is intended to be part of the API or an implementation detail. The
difference is that they apply at a wider granularity of code than a single class, with
implementation details available to either the full module (and its submodules) for PRIVATE
or the file family for FILE_PRIVATE.
The macros in that header file are attached to declarations and set the visibility level for
that declaration and all of its "semantic children"1 . The macros are C++ attributes which
means that they need to go in specific places that differ based on what is being marked (for
templates, the location does not change and is always somewhere after the template <...> part):
MONGO_MOD_PUBLIC;by itself as the first line after includes in a header sets the default for that header (onlyPUBLIC,PARENT_PRIVATE, andFILE_PRIVATEare allowed here)namespace MONGO_MOD mongo {(this does not work with nested namespaces in a single declaration likenamespace mongo::repl)class MONGO_MOD Foo {(Ditto forenum,struct, andunion)MONGO_MOD void func(...);MONGO_MOD int var;concept isFooable MONGO_MOD {
For the cases where it goes at the beginning of the line, if clang-format chooses an unfortunate place to break the line, it usually helps to undo the formatting then put the macro on its own line above the declaration.
APIs are marked one header at a time, by including "mongo/util/modules.h" in the header.
This causes the header to be treated as "modularized" which has the following effects:
- All declarations in that header (not transitive includes) default to
PRIVATE, meaning that the public API is what must be marked. - Members in
private:sections in classes default toPRIVATE, regardless of the visibility of the class. The only way the language would allow them to be used from outside of the module is if you have cross-module friendships, which should generally be avoided. If needed temporarily, favorNEEDS_REPLACEMENToverPUBLICfor these declarations. - Declarations ending in
_forTestdefault toFILE_PRIVATEto support the common case where they are only intended for testing that class. If they are actually intended to support testing of consumers, not just the type they are defined on, they can be explicitly givenPUBLICorPRIVATEvisibility. - Internal and detail namespaces default to
PRIVATEand cannot be made less restricted, but can still be marked asFILE_PRIVATE. Individual declarations within the namespace can be exposed as necessary, but they cannot be exposed in bulk without changing the name of the namespace to something that doesn't imply private.
For internal headers of a module which do not contribute to its public API, simply including
modules.h is sufficient. There is a tool to automate this
process. You may additionally want to consider whether any APIs should be marked FILE_PRIVATE,
but that is optional.
For IDL files, you mark visibility of whole types (struct, enum, and command) with the
mod_visibility option. The value should be the same as one of the MONGO_MOD macros, but
lowercase and without the prefix, for example mod_visibility: public. You can set the default
visibility for all types in that IDL file by putting that in the global: section. You cannot
control visibility of individual functions within the type. Please let us know if you have a
compelling use case for this.
What tooling exists to help me?
Note that all tooling should be run from within a properly set-up python virtual environment.
This includes running buildscripts/poetry_sync.sh to ensure you have the correct dependencies.
The scanner and merger
The merger generates a cross reference of all first-party usages of first-party code and stores
it in merged_decls.json, which is used by the rest of our tooling. It is also where we validate
that there are no disallowed accesses. It will be invoked for you by the browser when you ask it
to rescan, or you can also manually run it as modules_poc/merge_decls.py. If you are interested
in analyzing that file, jq is a powerful tool, or you can just write
some python.
As a rather extreme example of what you can do with jq, here is how the progress reports are
generated:
# For each mod (and TOTAL):
# For each file:
# consider it marked if it has no UNKNOWNs
# Compute a done percentage
# Format to a nice string
jq 'map(., .mod = "TOTAL") | group_by(.mod)[] | group_by(.loc | split(":")[0]) | {mod: .[0].[0].mod, total: length, marked: map(select(any(.visibility == "UNKNOWN") | not)) | length} | .done = (1000 * .marked / .total | round) / 10 | "\(.mod): \(" " * (.mod | 40-length)) \(.done)% (\(.marked) / \(.total))"' -r merged_decls.json
Internally, the merger will internally invoke bazel build --config=mod-scanner //src/mongo/...
to run the scanner over the whole codebase (or the parts that have changed since the last scan),
taking advantage of bazel remote execution to achieve very high levels of parallelism.
The browser
The main piece of tooling to run is the browser, which is launched by running
modules_poc/browse.py. If you haven't scanned the codebase recently, it will offer to run it
for you which will take a few minutes. After modifying the source code, you can rescan at any
time by pressing r. It will only rescan files that have been modified or that transitively
include modified headers.
The browser is primarily intended to assist in labeling public APIs, so the files are sorted
with the most number of unlabeled declarations ("unknowns") first. You can search for a file
by pressing f or press m to filter the files by module.
The list of available key bindings is shown on the right. You can toggle that by pressing ?.
Other keybinding of note are that you can press g to go to the currently highlighted
declaration or location in your editor (only when running in the vscode or nvim terminal),
and p to toggle an inline preview of the location within the browser. You can press Tab ↹
to toggle between the tree and the code preview. The mouse is fully supported for scrolling
and expanding rows in the tree, and there are aliases for some basic vim keybinds (hjkl/).
The private header marker
Once you have scanned the codebase and produced a merged_decls.json,
modules_poc/private_headers.py can be used to find all header and IDL files where there are
no currently detected external usages and automatically mark them as fully private to the
module. This does not necessarily mean that all automatically marked headers are intended to
be private. A human should review to ensure that the marked headers match intent. You can pass
flags to filter on any/all of module, owning team, or path glob. For headers matching the filter,
the script will also warn of usages of _forTest external to the file family that may need to
be marked PRIVATE to make them available to the whole module since they default to only being
available to the file family for marked headers.
Make sure to run buildscripts/clang_format.py format-my or bazel run format after using it
to modify any C++ files.
Example usage:
./modules_poc/private_headers.py --team=server_programmability --module=core --glob="src/mongo/executor/*"
--dry-run can be added to view all of the changes without applying them.
The PR comment generator
You can run modules_poc/mod_diff.py to output a brief summary of all of the API (including
visibility levels and usages counts) for each file modified in your branch. When putting up a PR
to mark API visibility, you should add a comment with its output to the PR as an aide to
reviewers. The output is intended to be close enough to C++ that you should put it in a
```cpp block when making your PR comment to make it more readable. You can also
pipe it through bat -lcpp to make it colorful locally. Note that it will use the last
scan output, so if you've modified any headers, you should run a rescan prior to running this
tool.
Workflow
The general workflow for each PR will generally be the same:
- Ensure that you are in a python virtualenv, creating one if needed, and run
buildscripts/poetry_sync.shto update python deps. - Run the merger to scan the code base:
modules_poc/merge_decls.py - Mark some headers
- Rerun the merger to ensure that there are no violations, and update
merged_decls.json - Run the pr comment generator to show the APIs that you have marked
- Look through this to ensure that everything is as you expect.
- Put up a PR and include the generated comment in a
```cppblock- I suggest keeping PRs small (say, no more than 10 files at a time) so that they are manageable by reviewers. As an exception it seems reasonable to auto-mark many headers as private in a single PR, as long as those PRs are separate from those containing any manual marking.
When first starting to mark a module, I suggest running the modules_poc/private_headers.py
script with --dry-run (or -n) and --module=YOUR_MODULE. For larger modules (in particular,
the query mega module) you may want to pass a --glob so that you can focus on a smaller
subset of the code initially. That will give you an overview of the files that are used from
outside your module (which contain defacto public APIs today) and those that do not (which can
automatically be marked as private implementation details).
If all of the defacto private headers seem like they should be private, you can remove the
dry-run flag to have it automatically mark them as private. Be sure to validate that their
contents are actually intended to be private. Remember that the point of having a human doing
the marking is to ensure that we correctly capture intent. You can optionally mark implementation
details within each header as FILE_PRIVATE, if you would like to prevent them from being used
elsewhere even within the module.
You can then open the browser (modules_poc/browse.py) to look at the remaining
headers. It will show you what is used and from where. It will be particularly useful for things
that seem like they should be private, but are being used externally.
What should I do when an internal API is currently being used?
- If it is only used from a small number of external files, first check if those files should
actually belong to your module. We tried to correctly map all files in phase 1, but some files
may have been assigned to the wrong module. If that happens, try adjusting the globs in
modules_poc/modules.yamlto move them. - If there is already a public API that callers should use instead, mark it as
USE_REPLACEMENT(better_api). The argument accepts any C++ tokens, but the intent is where possible to use the name of the replacement. This will generate a ticket for all teams using that code.- If there are very few users, consider just cleaning them up.
- Reconsider making this API public if other modules need its functionality, and this is the only way to get it.
- Otherwise, if there is no public API that fulfills the needs of the callers, but you
don't want the current API to remain public long-term, use
NEEDS_REPLACEMENT. This will generate a ticket for the team that owns that code.- If the API was "obviously" intended to be private (eg it is in a
detailsnamespace) and callers would be reasonably able to implement the functionality themselves, possibly by writing their own version, it seems acceptable to useUSE_REPLACEMENT(do not use internal details)
- If the API was "obviously" intended to be private (eg it is in a
Caveats and Limitations
OVERARCHING GUIDELINE: Always try to mark declarations correctly according to intent, even if it will not be enforced by the current tooling. This is both to provide the correct information to human readers, as well as to avoid issues if we improve the tooling in the future to eliminate these limitations
The rest of this section is fairly technical and probably not necessary for most readers unless they notice something "weird" going on and want to dive into why. Most of these limitations are more likely to affect the core modules since most of the rest of our code does not expose APIs via macros and templates or have APIs only consumed by templates, and those are where most of these issues come up.
- We do not track usages of namespaces at all, only the declarations within namespaces. When a namespace is marked with a visibility, it does not affect the visibility of the namespace itself (since it doesn't have one), it sets the default visibility for all declarations within that namespace block. Each time a namespace is reopened it is a separate block and the visibility markers on other blocks of the same namespace do not apply.
- The scanner only knows about declarations that it sees being used. For implementation reasons, it only discovers declarations by seeing what every usage is using. This can either cause or be caused by other limitations.
- Usages in templates may not be seen. This is especially the case for "dependent types and
values" which are things that are not known by the compiler before the template is instantiated.
- This is a problem for functions where any arguments are dependent if it can't figure out
which overload will be selected. It is even worse for free-functions called unqualified
(
f(blah)rather thanns::f(blah)orx.f(blah)) since due to ADL, overload resolution is always delayed for them.
- This is a problem for functions where any arguments are dependent if it can't figure out
which overload will be selected. It is even worse for free-functions called unqualified
(
- Everything that results from a macro expansion is treated as-if it was written at the point
of expansion. This applies to both declarations and usages. If you have an API that should
only be used via the defined macros, mark it as
MOD_PUBLIC_FOR_TECHNICAL_REASONSto signal to readers that they should avoid direct usage, even if the tooling won't prevent it. We may improve this in the future. - Template variables are completely ignored due to some unfortunate clang bugs. Still, try to mark them correctly since we may change this in the future.
- Method calls are assigned to the static type at the call site. This has two important effects:
- A subclass's overridden method may seem unused if it is only used via calls through a base class pointer/reference
- Calls through a base class pointer/reference count as calls of that class's method, not of the interface's
- Defaulted members (methods, ctors, dtors) are treated as usages of the class itself, regardless of whether they implicitly or explicitly defaulted. This is because clang does not provide an API to distinguish between those cases.
- Template normalization woes: we try really hard to report declarations as the template
foo<T>rather than separate instantiations likefoo<int>,foo<string>, etc, unless they are explicitly specialized, meaning that the instantiation has its own definition different from the main template. Unfortunately, clang does a bad job at this and we have a number of kludgy workarounds. The most important effects:- Explicit specializations of function and variable templates are ignored and always converted to the primary template.
- We do treat explicit specializations of types as separate (using the heuristic of having a separate location than the main template), because they can have a different shape and API than the main template. In general they should probably have the same visibility though, unless the instantiation is using a private type which should be unavailable to consumers anyway.
- Clang assigns many locations to the site of explicit template instantiations and extern template declarations, even when there is a better location that it can see. Luckily these are fairly rare.
- Sometimes clang reports the resolved destination of
usingdeclarations and type alias, but usually it reports theusingdeclaration itself. A few notable cases (these are trends and may not be absolute!)using Base::foo;to expose a member of a base class is resolved as a usage ofBase::foorather thanDerived::foo. This is especially notable when theBaseclass is intended to be a private implementation detail. You will need to mark all exposed methods as public.using Base::Base;to pull in the base constructors is the opposite and is recorded as a usage ofDerived::Base(args), which is odd because such a declaration doesn't actually exist.
- Internal/details namespaces (currently defined as matching the regex
(detail|internal)s?$) implicitly have implicit default visibility of private ifmodules.his included. It is not possible to give the namespace a public visibility, but you can restrict it further withFILE_PRIVATE. If you want declarations inside it to be usable from outside your module you must mark children of the namespace explicitly, or rename it to not use a name that implies that it is for internal usage only. A somewhat common case will be marking internal declarations that are only intended to be used via macros withPUBLIC_FOR_TECHNICAL_REASONS. - Be very careful with forward declarations. Try to avoid them wherever possible (unless there
is a significant benefit). Especially avoid forward declaring anything from another module!
Where forward declarations must be used, make sure that they have the same visibility as the
real definition. As an exception, if every TU that sees the forward declaration will also see
the definition it is OK to omit marking the forward definition. This may happen when they are
both in the same header, or the forward declaration is in a private implementation detail header
which is included by the defining header. Be aware of the implicit visibility marking which also
applies to forward declaration, if they are the only declaration seen in the TU.
- Never forward declare functions to avoid including a header. They are much more problematic than types, both in general in C++ and specifically for this tooling.
- We try to use the definition location for types defined in headers, but the "canonical" location (clang's term for the first declaration seen in the current TU) for everything else. If the type is defined in a .cpp, we use the canonical location.
- We only consider declarations in headers, never in .cpp files.
- Be mindful of
_forTestfunctions. They default toFILE_PRIVATEsince they are typically intended only for use when testing the type they are defined on, not when testing consumers. In the cases where they are intended as part of the API for testing consumers, you can explicitly mark themPUBLICorPRIVATEdepending on whether they should be usable from outside your module or not. - Things used implicitly (eg implicit conversion operators) are still counted as usages even if they are not specifically named at the call site
- When merging information from multiple TUs, definitions always replace the metadata gathered
from TUs that only saw a declaration.
- Note that we aren't guaranteed to see every definition, in particular for functions that are not called from the TU that they are defined in. So this cannot be used to find places where we deleted the definition but forgot to delete the declaration (we wouldn't see them anyway, since we only track things that are used, and undefined things can't really be used, except trivially, without breaking the build).
privatemembers of classes are implicitlyPRIVATE, and must be explicitly marked otherwise if desired. They should probably never be madePUBLICsince that implies cross-module friendship. In the few places where we have that today, they have been made one of the flavors of unfortunately public:NEEDS_REPLACEMENTorUSE_INSTEAD.publicmembers ofprivatetypes do not inherit the implicitPRIVATEand follow the normal rule of looking for their nearest semantic parent with an explicit marker. That means that they may bePUBLIC. However, the language rules still apply and as long as an instance of the type is never handed to consumers they will have no way of accessing those members.protectedmembers do not default toPRIVATE, but because we only allow subclassing fromOPENclasses, the language visibility rules will disallow access from outside the module unless you choose to allow it by useOPENclasses orfriends. Note that making any subclassOPENexposes allprotectedmembers of parents unless they are markedPRIVATE.
frienddeclarations are mostly ignored, except when they are a definition. So the definitions using the "hidden friend" pattern are tracked, but we ignore it if the definition is in a cpp file.
-
Clang distinguishes between "semantic" and "lexical" parents. The primary differences are that members of classes (including member types) are semantic children of the class even when defined out of line, and conversely
frienddeclarations are not, and instead are considered semantic children of the nearest namespace. ↩︎