Architectural Decision Log

Mapping to ISO 27001 Controls

A.14.1.1 "Information security requirements analysis and specification"
A.14.2.4 "Restrictions on Changes to Software Packages"

What are architectural decisions?

Architectural decisions are high-level technical decisions that affect most stakeholders, in particular Welkin developers, administrators and users. A non-exhaustive list of architectural decisions is as follows:

adding or removing tools;
adding or removing components;
changing which components communicate with each other;
major (in the SemVer sense) component upgrades.
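As a minimal sketch of the last item, a "major" upgrade in the SemVer sense is one where the first number of the version increases. The helper names `parse_semver` and `is_major_upgrade` are hypothetical and not part of Welkin. Note that SemVer allows breaking changes in any 0.y.z release, which this sketch does not treat specially.

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH', tolerating a leading 'v' and pre-release/build suffixes."""
    core = version.removeprefix("v").split("+", 1)[0].split("-", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def is_major_upgrade(old: str, new: str) -> bool:
    """True when the MAJOR number increases between the two versions."""
    return parse_semver(new)[0] > parse_semver(old)[0]


if __name__ == "__main__":
    # Illustrative version numbers only.
    print(is_major_upgrade("1.9.3", "2.0.0"))
    print(is_major_upgrade("v1.4.0", "v1.5.0"))
```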

Architectural decisions should be taken as directions to follow in future development, not as issues to be fixed immediately.

What triggers an architectural decision?

An architectural decision generally starts with one of the following:

A new feature was requested by product management.
An improvement was requested by engineering management.
A new risk was discovered, usually by the architect, but possibly by any stakeholder.
A new technology was discovered that may help with a new feature or an improvement, or help mitigate a risk.

How are architectural decisions captured?

Architectural decisions are captured via Architectural Decision Records or the tech radar. Both are stored in Git, so the Git commit messages also serve as a decision log.

How are architectural decisions taken?

Architectural decisions need to mitigate the following information security risks:

a component might not fulfill advertised expectations;
a component might be abandoned;
a component might change direction and deviate from expectations;
a component might require a lot of (initial or ongoing) training;
a component might not take security seriously;
a component might change its license, prohibiting its reuse or making its use expensive.

The Welkin architect has overall responsibility for these risks.

How are these risks mitigated?

Before taking any new component into Welkin, we investigate and evaluate it. We prefer components that are:

community-driven open-source projects, to reduce the risk of a component becoming abandoned, changing its license or changing direction in the interest of a single entity; as far as possible, we choose CNCF projects (preferably graduated ones) or projects which are governed by at least 3 different entities;
projects with a good security track record, to avoid unexpected security vulnerabilities or delays in fixing security vulnerabilities; as far as possible, we choose projects with a clear security disclosure process and a clear security announcement process;
projects that are popular, both from a usage and contribution perspective; as far as possible, we choose projects featuring well-known users and many maintainers;
projects that rely on technologies that our team is already trained on, to reduce the risk of requiring a lot of (initial or ongoing) training; as far as possible, we choose projects that overlap with the projects already on our tech radar;
projects that are simple to install and manage, to reduce required training and burden on administrators.
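The more objective parts of these preferences could be recorded as a simple checklist during an evaluation. The following is only a sketch: the `Candidate` fields and the `unmet_preferences` function are hypothetical, and the popularity and ease-of-management criteria are left out because they call for human judgment.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Facts gathered while evaluating a component (all field names are hypothetical)."""
    name: str
    open_source: bool
    cncf_project: bool
    governing_entities: int
    security_disclosure_process: bool
    security_announcement_process: bool
    on_tech_radar: bool


def unmet_preferences(c: Candidate) -> list[str]:
    """Return the preferences the candidate misses; an empty list means none."""
    gaps = []
    if not c.open_source:
        gaps.append("not open source")
    elif not (c.cncf_project or c.governing_entities >= 3):
        gaps.append("neither a CNCF project nor governed by at least 3 entities")
    if not c.security_disclosure_process:
        gaps.append("no clear security disclosure process")
    if not c.security_announcement_process:
        gaps.append("no clear security announcement process")
    if not c.on_tech_radar:
        gaps.append("no overlap with the tech radar, so training is likely needed")
    return gaps
```

A non-empty result does not rule a component out; it lists the gaps that the Architectural Decision Record would need to address.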

Often, it is not possible to fulfill the above criteria. In that case, we take the following mitigations:

Architectural Decision Records include recommendations on training to be taken by administrators.
Closed-source or "as-a-Service" alternatives are used, if they are easy to replace thanks to broad API compatibility or standardization.

These mitigations may be relaxed for components that are part of alpha or beta features, as these features -- and required components -- can be removed at our discretion.

ADRs

This log lists the architectural decisions for Welkin.

ADR-0000 - Use Markdown Architectural Decision Records
ADR-0001 - Use Rook for Storage Orchestrator
ADR-0002 - Use Kubespray for Cluster Life-cycle
ADR-0003 - [Superseded by ADR-0019] Push Metrics via InfluxDB
ADR-0004 - Plan for Usage without Wrapper Scripts
ADR-0005 - Use Individual SSH Keys
ADR-0006 - Use Standard Kubeconfig Mechanisms
ADR-0007 - Make Monitoring Forwarders Storage Independent
ADR-0008 - Use HostNetwork or LoadBalancer for Ingress
ADR-0009 - Use ClusterIssuers for Let's Encrypt
ADR-0010 - Run managed services in Workload Cluster
ADR-0011 - [Superseded by ADR-0046] Let upstream projects handle CRDs
ADR-0012 - [Superseded by ADR-0017] Do not persist Dex
ADR-0013 - Configure Alerts in On-call Management Tool (e.g., Opsgenie)
ADR-0014 - Use bats for testing bash wrappers
ADR-0015 - We believe in community-driven open source
ADR-0016 - [Superseded by ADR-0040] gid=0 is okay, but not by default
ADR-0017 - Persist Dex
ADR-0018 - Use Probe to Measure Uptime of Internal Welkin Services
ADR-0019 - Push Metrics via Thanos
ADR-0020 - Filter by Cluster label then data source
ADR-0021 - Default to TLS for performance-insensitive additional services
ADR-0022 - Use Dedicated Nodes for Additional Services
ADR-0023 - [Superseded by ADR-0056] Only allow Ingress Configuration Snippet Annotations after Proper Risk Acceptance
ADR-0024 - Allow a Harbor robot account that can create other robot accounts with full privileges
ADR-0025 - Use local-volume-provisioner for Managed Services that requires high-speed disks
ADR-0026 - Use environment-name as the default root of Hierarchical Namespace Controller (HNC)
ADR-0027 - PostgreSQL - Enable external replication
ADR-0028 - Harder Pod eviction when Nodes are going OOM
ADR-0029 - Expose Jaeger UI in WC
ADR-0030 - Run ArgoCD on the Elastisys Nodes
ADR-0031 - Run csi-cinder-controllerplugin on the Elastisys Nodes
ADR-0032 - Boot disk size on Nodes
ADR-0033 - Run Cluster API controllers on Management Cluster
ADR-0034 - How to run multiple AMS packages of the same type in the same environment
ADR-0035 - Run Tekton on Management Cluster
ADR-0036 - Run Ingress-NGINX as a DaemonSet
ADR-0037 - Enforce TTL on Jobs
ADR-0038 - Replace the starboard-operator with the trivy-operator
ADR-0039 - Application developer privilege elevation
ADR-0040 - Allow running containers with primary and supplementary group id 0
ADR-0041 - Rely on Infrastructure Provider for encryption-at-rest
ADR-0042 - ArgoCD with dynamic HNC namespaces
ADR-0043 - Rclone and Encryption adheres Cryptography Policy
ADR-0044 - ArgoCD is not allowed to manage its own namespace
ADR-0045 - Use specialised prebuilt images
ADR-0046 - Handle all CRDs with the standard Helm CRD management
ADR-0047 - When to upgrade to new Kubernetes versions
ADR-0048 - Access Management for Additional Managed Services (AMS-es)
ADR-0049 - Running NGINX with Chroot Option
ADR-0050 - Use Cluster Isolation to separate the application and its traces from its logs and metrics
ADR-0051 - Open cert-manager Network Policies
ADR-0052 - Azure Encryption-at-Rest for Object Storage and Block Storage
ADR-0053 - Do not expose platform observability services to end-users
ADR-0054 - Allow Application Developer write access to Endpoints and EndpointSlices after Proper Risk Acceptance
ADR-0055 - [Superseded by ADR-0059] Welkin to consist of both public and private open source
ADR-0056 - Only allow Ingress Snippet Annotations after Proper Risk Acceptance
ADR-0057 - Do Not Use Managed Kubernetes Services
ADR-0058 - Boot disk size on Nodes
ADR-0059 - Welkin to Consist of Public Open Source Code and Proprietary Documentation

For new ADRs, please use template.md as a basis. More information on MADR is available at https://adr.github.io/madr/. General information about architectural decision records is available at https://adr.github.io/.
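Because several entries in the log carry a "[Superseded by ADR-NNNN]" marker, it can be useful to check that every marker points at an existing entry and to find the decision currently in force. The sketch below assumes the entry format shown in the list above; the function names are hypothetical and this is not the tooling used to generate the log.

```python
import re

# Matches e.g. "ADR-0003 - [Superseded by ADR-0019] Push Metrics via InfluxDB".
ENTRY = re.compile(
    r"^ADR-(?P<num>\d{4}) - (?:\[Superseded by ADR-(?P<by>\d{4})\]\s*)?(?P<title>.+)$"
)


def parse_index(lines):
    """Map ADR number -> (title, superseding ADR number or None)."""
    index = {}
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:
            index[match["num"]] = (match["title"], match["by"])
    return index


def dangling_supersedes(index):
    """Return ADR numbers whose 'Superseded by' target is missing from the index."""
    return sorted(num for num, (_, by) in index.items() if by and by not in index)


def current_decision(index, num):
    """Follow 'Superseded by' markers until reaching an ADR without one."""
    seen = set()
    while index[num][1] is not None:
        if num in seen:
            raise ValueError(f"supersede cycle involving ADR-{num}")
        seen.add(num)
        num = index[num][1]
    return num


# A few entries copied from the log above.
SAMPLE = [
    "ADR-0003 - [Superseded by ADR-0019] Push Metrics via InfluxDB",
    "ADR-0012 - [Superseded by ADR-0017] Do not persist Dex",
    "ADR-0017 - Persist Dex",
    "ADR-0019 - Push Metrics via Thanos",
]

if __name__ == "__main__":
    index = parse_index(SAMPLE)
    print(current_decision(index, "0003"))
    print(dangling_supersedes(index))
```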

Index Regeneration

Prerequisites:

Install npm
Install adr-log
Install make

Run make -C docs/adr, then run pre-commit run --all-files.