Architectural Decision Log
Mapping to ISO 27001 Controls
- A.14.1.1 "Information security requirements analysis and specification"
- A.14.2.4 "Restrictions on Changes to Software Packages"
What are architectural decisions?
Architectural decisions are high-level technical decisions that affect most stakeholders, in particular Welkin developers, administrators and users. A non-exhaustive list of architectural decisions is as follows:
- adding or removing tools;
- adding or removing components;
- changing which components communicate with each other;
- major (in the SemVer sense) component upgrades.
Architectural decisions should be taken as directions to follow for future development and not issues to be fixed immediately.
What triggers an architectural decision?
An architectural decision generally starts with one of the following:
- A new feature was requested by product management.
- An improvement was requested by engineering management.
- A new risk was discovered, usually by the architect, but also by any stakeholder.
- A new technology was discovered that may help with a new feature or an improvement, or help mitigate a risk.
How are architectural decisions captured?
Architectural decisions are captured via Architectural Decision Records or the tech radar. Both are stored in Git, hence a decision log is also captured as part of the Git commit messages.
How are architectural decisions taken?
Architectural decisions need to mitigate the following information security risks:
- a component might not fulfill advertised expectations;
- a component might be abandoned;
- a component might change direction and deviate from expectations;
- a component might require a lot of (initial or ongoing) training;
- a component might not take security seriously;
- a component might change its license, prohibiting its reuse or making its use expensive.
The Welkin architect has overall responsibility for these risks.
How are these risks mitigated?
Before adopting any new component into Welkin, we investigate and evaluate it. We prefer components that are:
- community-driven open-source projects, to reduce the risk of a component becoming abandoned, changing its license or changing direction in the interest of a single entity; as far as possible, we choose CNCF projects (preferably graduated ones) or projects which are governed by at least 3 different entities;
- projects with a good security track record, to avoid unexpected security vulnerabilities or delays in fixing security vulnerabilities; as far as possible, we choose projects with a clear security disclosure process and a clear security announcement process;
- projects that are popular, both from a usage and contribution perspective; as far as possible, we choose projects featuring well-known users and many maintainers;
- projects that rely on technologies that our team is already trained on, to reduce the risk of requiring a lot of (initial or ongoing) training; as far as possible, we choose projects that overlap with the projects already on our tech radar;
- projects that are simple to install and manage, to reduce required training and burden on administrators.
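The objectively checkable preferences in the list above can be recorded as a small checklist. The following is only a minimal sketch: the `Candidate` class, its field names, and the `unmet_preferences` function are hypothetical and not part of Welkin. Popularity and ease of installation are left out because they need human judgment.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record of what an evaluation found about a component."""
    name: str
    cncf_project: bool = False
    governing_entities: int = 1
    has_security_disclosure_process: bool = False
    has_security_announcement_process: bool = False
    on_tech_radar: bool = False


def unmet_preferences(c: Candidate) -> list[str]:
    """Return the checkable preferences that the candidate does not meet."""
    unmet = []
    # Community-driven: a CNCF project, or governed by at least 3 entities.
    if not (c.cncf_project or c.governing_entities >= 3):
        unmet.append("community-driven governance")
    # Security: both a disclosure process and an announcement process.
    if not (c.has_security_disclosure_process
            and c.has_security_announcement_process):
        unmet.append("security track record")
    # Training: overlap with projects already on the tech radar.
    if not c.on_tech_radar:
        unmet.append("overlap with tech radar")
    return unmet
```

A non-empty result would not block adoption by itself; it only indicates which points the Architectural Decision Record should address.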
Often, it is not possible to fulfill all of the above criteria. In that case, we apply the following mitigations:
- Architectural Decision Records include recommendations on training to be taken by administrators.
- Closed-source or "as-a-Service" alternatives are used if they are easy to replace thanks to broad API compatibility or standardization.
These mitigations may be relaxed for components that are part of alpha or beta features, as these features -- and required components -- can be removed at our discretion.
ADRs
This log lists the architectural decisions for Welkin.
- ADR-0000 - Use Markdown Architectural Decision Records
- ADR-0001 - Use Rook for Storage Orchestrator
- ADR-0002 - Use Kubespray for Cluster Life-cycle
- ADR-0003 - [Superseded by ADR-0019] Push Metrics via InfluxDB
- ADR-0004 - Plan for Usage without Wrapper Scripts
- ADR-0005 - Use Individual SSH Keys
- ADR-0006 - Use Standard Kubeconfig Mechanisms
- ADR-0007 - Make Monitoring Forwarders Storage Independent
- ADR-0008 - Use HostNetwork or LoadBalancer for Ingress
- ADR-0009 - Use ClusterIssuers for Let's Encrypt
- ADR-0010 - Run managed services in Workload Cluster
- ADR-0011 - [Superseded by ADR-0046] Let upstream projects handle CRDs
- ADR-0012 - [Superseded by ADR-0017] Do not persist Dex
- ADR-0013 - Configure Alerts in On-call Management Tool (e.g., Opsgenie)
- ADR-0014 - Use bats for testing bash wrappers
- ADR-0015 - We believe in community-driven open source
- ADR-0016 - [Superseded by ADR-0040] gid=0 is okay, but not by default
- ADR-0017 - Persist Dex
- ADR-0018 - Use Probe to Measure Uptime of Internal Welkin Services
- ADR-0019 - Push Metrics via Thanos
- ADR-0020 - Filter by cluster label then data source
- ADR-0021 - Default to TLS for performance-insensitive additional services
- ADR-0022 - Use Dedicated Nodes for Additional Services
- ADR-0023 - [Superseded by ADR-0056] Only allow Ingress Configuration Snippet Annotations after Proper Risk Acceptance
- ADR-0024 - Allow a Harbor robot account that can create other robot accounts with full privileges
- ADR-0025 - Use local-volume-provisioner for Managed Services that requires high-speed disks
- ADR-0026 - Use environment-name as the default root of Hierarchical Namespace Controller (HNC)
- ADR-0027 - PostgreSQL - Enable external replication
- ADR-0028 - Harder Pod eviction when nodes are going OOM
- ADR-0029 - Expose Jaeger UI in WC
- ADR-0030 - Run ArgoCD on the Elastisys nodes
- ADR-0031 - Run csi-cinder-controllerplugin on the Elastisys nodes
- ADR-0032 - Boot disk size on nodes
- ADR-0033 - Run Cluster API controllers on Management Cluster
- ADR-0034 - How to run multiple AMS packages of the same type in the same environment
- ADR-0035 - Run Tekton on Management Cluster
- ADR-0036 - Run Ingress-NGINX as a DaemonSet
- ADR-0037 - Enforce TTL on Jobs
- ADR-0038 - Replace the starboard-operator with the trivy-operator
- ADR-0039 - Application developer privilege elevation
- ADR-0040 - Allow running containers with primary and supplementary group id 0
- ADR-0041 - Rely on Infrastructure Provider for encryption-at-rest
- ADR-0042 - ArgoCD with dynamic HNC namespaces
- ADR-0043 - Rclone and Encryption adheres Cryptography Policy
- ADR-0044 - ArgoCD is not allowed to manage its own namespace
- ADR-0045 - Use specialised prebuilt images
- ADR-0046 - Handle all CRDs with the standard Helm CRD management
- ADR-0047 - When to upgrade to new Kubernetes versions
- ADR-0048 - Access Management for Additional Managed Services (AMS-es)
- ADR-0049 - Running NGINX with Chroot Option
- ADR-0050 - Use Cluster Isolation to separate the application and its traces from its logs and metrics
- ADR-0051 - Open cert-manager Network Policies
- ADR-0052 - Azure Encryption-at-Rest for Object Storage and Block Storage
- ADR-0053 - Do not expose platform observability services to end-users
- ADR-0054 - Allow Application Developer write access to Endpoints and EndpointSlices after Proper Risk Acceptance
- ADR-0055 - Welkin to consist of both public and private open source
- ADR-0056 - Only allow Ingress Snippet Annotations after Proper Risk Acceptance
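Some entries in the log carry a bracketed "Superseded by" marker, so finding the decision currently in force can mean following one or more links. The sketch below shows one way to do that. It assumes each entry sits on its own line in the shape "- ADR-NNNN - [Superseded by ADR-MMMM] Title"; the function names are hypothetical and this is not the script used to generate the index.

```python
import re

# Assumed line shape, based on the entries in the log:
#   "- ADR-0003 - [Superseded by ADR-0019] Push Metrics via InfluxDB"
LINE = re.compile(r"^- ADR-(\d{4}) - (?:\[Superseded by ADR-(\d{4})\] )?(.+)$")


def parse_index(lines):
    """Map ADR number -> (title, superseding ADR number or None)."""
    index = {}
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            number, successor, title = m.groups()
            index[number] = (title, successor)
    return index


def current_decision(index, number):
    """Follow 'Superseded by' links until reaching an ADR still in force."""
    seen = set()
    while index[number][1] is not None:
        if number in seen:
            raise ValueError(f"supersession cycle at ADR-{number}")
        seen.add(number)
        number = index[number][1]
    return number


if __name__ == "__main__":
    sample = [
        "- ADR-0012 - [Superseded by ADR-0017] Do not persist Dex",
        "- ADR-0017 - Persist Dex",
    ]
    idx = parse_index(sample)
    print(current_decision(idx, "0012"))  # prints 0017
```

Keeping the numbers as zero-padded strings avoids reformatting them when they are printed back as ADR identifiers.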
For new ADRs, please use template.md as a basis. More information on MADR is available at https://adr.github.io/madr/. General information about architectural decision records is available at https://adr.github.io/.
Index Regeneration
Pre-requisites:
Run make -C docs/adr, then run pre-commit run --all-files.