Architectural Decision Log

Mapping to ISO 27001 Controls

  • A.14.1.1 "Information security requirements analysis and specification"
  • A.14.2.4 "Restrictions on changes to software packages"

What are architectural decisions?

Architectural decisions are high-level technical decisions that affect most stakeholders, in particular Welkin developers, administrators and users. A non-exhaustive list of architectural decisions is as follows:

  • adding or removing tools;
  • adding or removing components;
  • changing what component talks to what other component;
  • major (in the SemVer sense) component upgrades.
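As a rough illustration of the last item, the sketch below checks whether a version change is a major upgrade in the SemVer sense, that is, whether the first number of MAJOR.MINOR.PATCH increases. The helper names `parse_major` and `is_major_upgrade` are hypothetical and not part of Welkin; the sketch assumes plain version strings with an optional leading "v".

```python
def parse_major(version: str) -> int:
    """Return the MAJOR part of a SemVer-style string such as 'v1.28.3'."""
    # Drop an optional leading "v", then take the text before the first dot.
    return int(version.lstrip("v").split(".")[0])


def is_major_upgrade(old: str, new: str) -> bool:
    """Return True when the MAJOR number increases between old and new."""
    return parse_major(new) > parse_major(old)


if __name__ == "__main__":
    # Hypothetical version pairs, for illustration only.
    for old, new in [("1.9.4", "2.0.0"), ("v1.9.4", "v1.10.0")]:
        kind = "major" if is_major_upgrade(old, new) else "not major"
        print(f"{old} -> {new}: {kind}")
```

Under this reading, only the first pair would call for an architectural decision; a minor or patch bump would not.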

Architectural decisions should be taken as directions to follow for future development, not as issues to be fixed immediately.

What triggers an architectural decision?

An architectural decision generally starts with one of the following:

  • A new feature was requested by product management.
  • An improvement was requested by engineering management.
  • A new risk was discovered, usually by the architect, but possibly by any stakeholder.
  • A new technology was discovered that may help with a new feature, an improvement, or the mitigation of a risk.

How are architectural decisions captured?

Architectural decisions are captured via Architectural Decision Records or the tech radar. Both are stored in Git, hence a decision log is also captured as part of the Git commit messages.

How are architectural decisions taken?

Architectural decisions need to mitigate the following information security risks:

  • a component might not fulfill advertised expectations;
  • a component might be abandoned;
  • a component might change direction and deviate from expectations;
  • a component might require a lot of (initial or ongoing) training;
  • a component might not take security seriously;
  • a component might change its license, prohibiting its reuse or making its use expensive.

The Welkin architect has overall responsibility for these risks.

How are these risks mitigated?

Before adding any new component to Welkin, we investigate and evaluate it. We prefer components that are:

  • community-driven open-source projects, to reduce the risk of a component becoming abandoned, changing its license or changing direction in the interest of a single entity; as far as possible, we choose CNCF projects (preferably graduated ones) or projects which are governed by at least 3 different entities;
  • projects with a good security track record, to avoid unexpected security vulnerabilities or delays in fixing security vulnerabilities; as far as possible, we choose projects with a clear security disclosure process and a clear security announcement process;
  • projects that are popular, both from a usage and contribution perspective; as far as possible, we choose projects featuring well-known users and many maintainers;
  • projects that rely on technologies that our team is already trained on, to reduce the risk of requiring a lot of (initial or ongoing) training; as far as possible, we choose projects that overlap with the projects already on our tech radar;
  • projects that are simple to install and manage, to reduce required training and burden on administrators.

Often, it is not possible to fulfill the above criteria. In that case, we take the following mitigations:

  • Architectural Decision Records include recommendations on training to be taken by administrators.
  • Closed-source or "as-a-Service" alternatives are used, if they are easy to replace thanks to broad API compatibility or standardization.

These mitigations may be relaxed for components that are part of alpha or beta features, as these features -- and required components -- can be removed at our discretion.

ADRs

This log lists the architectural decisions for Welkin.

  • ADR-0000 - Use Markdown Architectural Decision Records
  • ADR-0001 - Adopt Rook as Storage Orchestrator for Multi-Cloud PersistentVolumeClaim Support
  • ADR-0002 - Adopt Kubespray for Multi-Cloud Cluster Lifecycle to Leverage Large Community and Kubeadm Support
  • ADR-0003 - [Superseded by ADR-0019] Adopt InfluxDB for Metric Pushing to Support Current Use-Cases with Minimal Effort
  • ADR-0004 - Retain Wrapper Scripts as Optional Helpers to Ensure Base Tool Accessibility and Flexibility
  • ADR-0005 - Adopt Individual SSH Keys via Ansible for Enhanced Auditability and Security Management
  • ADR-0006 - Adopt Standard Kubeconfig Mechanisms for Better Tool Integration and Least Astonishment
  • ADR-0007 - Use emptyDir for Monitoring Forwarders to Ensure Storage Independence and Self-Healing
  • ADR-0008 - Adopt HostNetwork or LoadBalancer Ingress to Support Hybrid Cloud and Bare-Metal Flexibility
  • ADR-0009 - Adopt ClusterIssuers for Let's Encrypt to Reduce Certificate Fragility and Simplify Management
  • ADR-0010 - Run Managed Services in Workload Cluster for Low Latency and Reusable Network Security
  • ADR-0011 - [Superseded by ADR-0046] Delegate CRD Management to Upstream Projects to Reduce Maintenance and Follow Helm 3 Standards
  • ADR-0012 - [Superseded by ADR-0017] Use Memory Storage for Dex to Simplify Operations and Avoid CRD-Based Complexity
  • ADR-0013 - [Superseded by ADR-0060] Centralize Alert Logic in On-Call Management Tools to Enable Flexible Notification and Escalation
  • ADR-0014 - Adopt bats for Testing Bash Wrappers to Ensure Operational Reliability with Minimal Tool Sprawl
  • ADR-0015 - Prioritize Community-Driven Open Source to Ensure Vendor Independence and Business Continuity
  • ADR-0016 - [Superseded by ADR-0040] Disallow GID 0 by Default to Maintain Secure Defaults While Allowing Case-by-Case Exceptions
  • ADR-0017 - Adopt CRD-Based Persistence for Dex to Improve User Experience During Security Patching
  • ADR-0018 - Use Prometheus Operator Probes to Measure Internal Service Uptime for SLA Compliance
  • ADR-0019 - Adopt Thanos Receive for Metric Pushing to Enable Multi-Cluster Scaling and Community Alignment
  • ADR-0020 - Filter by Cluster Label then Data Source for Seamless Multi-Cluster Dashboards and Tenancy
  • ADR-0021 - Default to TLS for performance-insensitive additional services to Balance Security and Portability
  • ADR-0022 - Use Dedicated Nodes with Taints for Additional Services to Improve Stability and Security
  • ADR-0023 - [Superseded by ADR-0056] Restrict Ingress Configuration Snippets to Require Risk Acceptance for Stability and Security
  • ADR-0024 - Allow Harbor Robot Accounts to Manage Other Robot Accounts to Enable Self-Service Automation
  • ADR-0025 - Adopt local-volume-provisioner for High-Performance Managed Services Needing Local Disk Speed
  • ADR-0026 - Use environment-name as HNC Root to Enhance Developer Autonomy and Cluster Security
  • ADR-0027 - Adopt S3 Backup Cloning for PostgreSQL to Enable User Autonomy Without Risking Cluster Stability
  • ADR-0028 - Adopt Hard Eviction and Priority Classes to Protect Critical Services During Node Memory Pressure
  • ADR-0029 - Expose Jaeger UI via OAuth2-Proxy to Enable Audited Access While Maintaining Platform Security
  • ADR-0030 - Run ArgoCD on Elastisys Nodes to Improve Cost-Efficiency and Platform Stability
  • ADR-0031 - Run Cinder CSI Plugin on Elastisys Nodes to Isolate Storage Logic and Protect Node Stability
  • ADR-0032 - [Superseded by ADR-0058] Increase Boot Disk Size to 100GB to Improve Image Capacity and Reduce Operational Alerts
  • ADR-0033 - Centralize Cluster API Controllers in the Management Cluster for Better Security and Maintenance
  • ADR-0034 - Scale AMS Nodes Horizontally with Dedicated Labels to Ensure Stability and Resource Isolation
  • ADR-0035 - Centralize Tekton on Management Cluster to Secure High-Privilege Credentials and Enable Rollbacks
  • ADR-0036 - Run Ingress-NGINX as DaemonSet to Simplify Operations and Preserve Client Source IP
  • ADR-0037 - Automate Job TTL via Gatekeeper Mutation to Prevent API Server Bloat and Metric Noise
  • ADR-0038 - Adopt Trivy Operator to Replace Deprecated Starboard for Continued Security and Compliance
  • ADR-0039 - Manage Developer Privilege Elevation Case-by-Case to Balance Flexibility and Security
  • ADR-0040 - Allow Group ID 0 by Default to Align with Upstream Kubernetes Pod Security Standards
  • ADR-0041 - Rely on Infrastructure Providers for Encryption-at-Rest to Avoid Security Theater and Complexity
  • ADR-0042 - Require Manual Namespace Creation for ArgoCD to Ensure Security until HNC Support Matures
  • ADR-0043 - Encrypt rclone Replication via Salsa20/Poly1305 to Adhere to Cryptography Policy and Data Privacy
  • ADR-0044 - Exclude argocd-system from Managed Namespaces to Prevent Privilege Escalation and Ensure Security
  • ADR-0045 - Adopt Specialised Prebuilt Images to Secure the Supply Chain and Ensure Runtime Reliability
  • ADR-0046 - Use Standard Helm CRD Management to Prevent Accidental Data Loss and Ensure Layer Separation
  • ADR-0047 - Align Kubernetes Versions between CAPI and Kubespray to Maximize Stability and Reduce Skew
  • ADR-0048 - Secure AMS Access via Pod Labels and Network Policies to Enforce Network Segregation
  • ADR-0049 - Enable Ingress-NGINX Chroot and Seccomp to Mitigate Secret Exposure and CVE Exploitation
  • ADR-0050 - Isolate Logs/Metrics from Apps and Traces via Dedicated Clusters to Align with NIS2/Security Zones
  • ADR-0051 - Open cert-manager Egress for DNS and HTTP to Streamline Certificate Issuance and Reduce Alerts
  • ADR-0052 - Use Azure Default Server-Side Encryption for Storage to Maintain Security and Minimize Complexity
  • ADR-0053 - Restricted Observability Ingestion: Permit Trusted External Clients but Block Direct Public Access
  • ADR-0054 - Grant Endpoint Write Access Only After Formal Risk Acceptance to Mitigate Cross-Namespace Vulnerabilities
  • ADR-0055 - REMOVED
  • ADR-0056 - Restrict Ingress Snippet Annotations to Risk-Accepted Use Cases to Prevent Global Cluster Downtime
  • ADR-0057 - Reject Managed K8s Services to Ensure Stack Portability and Full Control over Compliance
  • ADR-0058 - Optimize Node Boot Disk Sizes by Allowing 50GB on Request to Reduce Infrastructure Footprint
  • ADR-0059 - REMOVED
  • ADR-0060 - Leverage Alertmanager Grouping to Reduce Alert Fatigue and Streamline Incident Response
  • ADR-0061 - Isolate Azure Customers at the Subscription Level to Simplify Billing and Security Governance
  • ADR-0062 - Replace Azure Bastion with Load Balancer NAT Rules to Optimize Cost and SSH Access
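Several entries above are marked "[Superseded by ADR-NNNN]". The sketch below shows one way such markers could be followed to find the decision currently in force. It is a minimal illustration that assumes the index line format used in this log; the names `ENTRY`, `parse_index` and `current_adr` are hypothetical and not part of the Welkin tooling.

```python
import re

# Matches index lines such as:
#   "ADR-0003 - [Superseded by ADR-0019] Adopt InfluxDB ..."
# Group 1 is the ADR number, group 2 the superseding ADR number (if any).
ENTRY = re.compile(r"ADR-(\d{4}) - (?:\[Superseded by ADR-(\d{4})\] )?")


def parse_index(lines):
    """Map each superseded ADR number to the number of the ADR replacing it."""
    superseded_by = {}
    for line in lines:
        match = ENTRY.search(line)
        if match and match.group(2):
            superseded_by[match.group(1)] = match.group(2)
    return superseded_by


def current_adr(adr, superseded_by):
    """Follow 'superseded by' links until reaching an ADR that is still in force."""
    seen = set()  # guards against accidental cycles in the index
    while adr in superseded_by and adr not in seen:
        seen.add(adr)
        adr = superseded_by[adr]
    return adr


if __name__ == "__main__":
    sample = [
        "ADR-0003 - [Superseded by ADR-0019] Adopt InfluxDB for Metric Pushing",
        "ADR-0019 - Adopt Thanos Receive for Metric Pushing",
    ]
    links = parse_index(sample)
    print(current_adr("0003", links))
```

The loop rather than a single lookup is a design choice: it also handles the case where a superseding ADR is itself superseded later.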

For new ADRs, please use template.md as a basis. More information on MADR is available at https://adr.github.io/madr/. General information about architectural decision records is available at https://adr.github.io/.

Index Regeneration

Pre-requisites:

Run make -C docs/adr, then run pre-commit run --all-files.