Prepare Your Application¶
To make the most out of Welkin, prepare your application so it features:
- some REST endpoints: NodeJS, .NET;
- structured logging: NodeJS, .NET;
- metrics endpoint: NodeJS, .NET;
- Dockerfile, which showcases:
- Helm Chart, which showcases:
- Grafana dashboards for metrics visualization;
- script for local development and testing;
Bonus:
- ability to make it crash (/crash).
Feel free to clone our user demo for inspiration:
git clone https://github.com/elastisys/welkin/
cd welkin/user-demo
Make Sure Your Application Can Terminate Gracefully¶
In Kubernetes, Pods and their Containers will sometimes be terminated. The cause can vary widely, from you updating your application to a new version, to a Node being replaced or running out of memory. Regardless of the cause, your application needs to be able to handle unexpected terminations.
When a Pod termination starts, there is usually a grace period during which the Pod can clean up and shut down gracefully. The grace period defaults to 30 seconds, but can be configured differently. If the Pod has not finished shutting down by the end of this period, it is forcefully shut down. The process usually looks something like this:
- Something triggers the Pod termination.
- Any preStop hooks in the Pod are triggered.
- TERM signal is sent to each Container in the Pod.
- If the preStop hook or the Pod has not terminated gracefully within the grace period, then the KILL signal is sent to all processes in the Pod.
Your application might need to do some cleanup before terminating, like finishing transactions, closing connections, writing data to disk, etc.
If that is the case, then you have two options to utilize the grace period before the Pod is forcefully terminated.
You can utilize the preStop hook to run a script in a container or to make an HTTP call to a container.
You can have one preStop hook per Container in your Pod.
You can also utilize the TERM signal that is sent to the containers by catching it in your application and having it trigger a graceful shutdown.
You can both have preStop hooks and catch the TERM signal for the same container.
You can read more about the Pod termination process in the official Kubernetes documentation.
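To illustrate the second option, here is a minimal Python sketch of catching the TERM signal and using it to trigger a graceful shutdown. The names `serve`, `handle_sigterm` and the `cleanup` callback are hypothetical and not taken from the user demo. The timer that sends TERM to the process only simulates what Kubernetes would do.

```python
import os
import signal
import threading

# Set once the process has been asked to terminate.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Only record the request; the real work happens in the main loop."""
    shutdown_requested.set()


def serve(cleanup, poll_seconds=0.1):
    """Run until TERM arrives, then call cleanup() once and return."""
    signal.signal(signal.SIGTERM, handle_sigterm)
    while not shutdown_requested.is_set():
        # A real application would handle requests here (assumption).
        shutdown_requested.wait(timeout=poll_seconds)
    # Finish transactions, close connections, write data to disk, etc.
    cleanup()


if __name__ == "__main__":
    # Simulate Kubernetes sending TERM shortly after start, so the demo exits.
    threading.Timer(0.2, os.kill, args=(os.getpid(), signal.SIGTERM)).start()
    serve(cleanup=lambda: print("connections closed, exiting"))
```

Keeping the signal handler tiny and doing the cleanup in the main loop is one common design choice. The cleanup still has to complete within the grace period, otherwise the KILL signal ends the process.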
Make Sure Your Application Tolerates Nodes Replacement¶
Important
This section helps you implement ISO 27001, specifically:
- A.12.6.1 Management of Technical Vulnerabilities
Welkin recommends against PodDisruptionBudgets (PDBs). PDBs can easily be misconfigured to block draining Nodes, which interferes with automatic OS patching and compromises the security posture of the environment. Instead, prefer engineering your application to deal with disruptions. The user demo already showcases how to achieve this with replication and topologySpreadConstraints. Make sure to move state, even soft state, to specialized services.
Further reading¶
List of Non-Functional Requirements¶
In some contexts, it is useful to have a more-or-less exhaustive list of non-functional requirements which the application needs to fulfill to make the best use of the underlying platform. Sometimes, these may loosely be called "IT requirements".
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Code | Requirement | Justification for exclusion OR Evidence OR Comment |
---|---|---|
R1 | Architectural requirements | |
R1.1 | Application components MUST be loosely coupled. Data sharing MUST happen exclusively via versioned APIs. Why is this important? This allows the application team to scale. Each application developer only needs to work within a bounded context, reducing cognitive load. | |
R1.2 | The application MUST use separate containers for synchronous API requests and running asynchronous batch jobs. Why is this important? This allows the application to make better use of resources and better tolerate failures. Being more exposed, synchronous code needs to be better tested and more secure than its asynchronous counterpart. Furthermore, the asynchronous code may be managed by a queue to better handle spikes in work. The synchronous code can then focus on quick response times and good interaction with the user, without having to block for things such as sending a confirmation email. | |
R1.3 | The application SHOULD crash if a fatal condition is encountered, such as bad configuration or inability to connect to a downstream service. Why is this important? This allows Kubernetes to stop a rolling update before end-user traffic is impacted. | |
R2 | API requirements | |
R2.1 | The application SHOULD accept inbound requests using a RESTful API. Why is this important? This allows the application to take advantage of the Ingress Controller for improved security via HTTPS, HSTS, IP allowlisting, rate-limiting, etc. | |
R2.2 | All application APIs, internally and externally facing, synchronous (e.g., REST) and asynchronous (e.g., messages), MUST be versioned, and components MUST be able to process old API calls for as long as there are producers of these calls in production. Why is this important? This allows each application component to be released independently with zero downtime via a rolling update strategy. | |
R2.3 | The application MUST validate all incoming API requests as well as responses to outgoing API requests. Why is this important? This is good security hygiene. | |
R2.4 | If the application has asynchronous parts, then the application MUST use the message queue provided by the platform for communicating between its synchronous and asynchronous parts. Why is this important? The platform team will have put a lot of effort into making sure that the message queue is fault-tolerant. The application team can build upon this to improve application fault-tolerance, as opposed to reinventing the wheel. | |
R2.5 | The asynchronous components in the application that consume messages MUST be compatible with old message formats until no producers of the old message format remain deployed. Why is this important? This allows the application to employ rolling updates between components which communicate via the message queue. | |
R2.6 | The application MUST set a reasonable TTL for all messages sent via the platform-provided message queue. Why is this important? This ensures that messages don't accumulate in the message queue in case of application bugs, potentially leading to capacity issues. In the best case, such capacity issues lead to unnecessary costs. In the worst case, such capacity issues can lead to new messages not being accepted and the application malfunctioning. | |
R2.7 | The application MUST gracefully handle connection resets (e.g., due to fail-over) of downstream components. Why is this important? Sometimes the platform team needs to migrate Pods from one Node to another, e.g., for maintenance. The application component consuming a Pod which moved needs to tolerate this. | |
R3 | State management requirements | |
R3.1 | Database requirements | |
R3.1.1 | The application MUST store structured state in a PostgreSQL-compatible database provided by the platform. Why is this important? The platform team will have invested a lot of effort in making sure that the database is fault-tolerant, backed up, etc. Instead of reinventing the wheel, the application team can build upon this. | |
R3.1.2 | The application MUST perform database migration in a backwards-compatible manner. Why is this important? This allows the application team to roll back a buggy application. Despite best QA, some bugs may only manifest with production data and real users. | |
R3.1.3 | The application MUST have a plan for rolling back changes, including dealing with database migrations. Why is this important? This allows the application team to roll back a buggy application. Despite best QA, some bugs may only manifest with production data and real users. | |
R3.1.4 | The application MAY perform database migration in a Kubernetes init container. Why is this important? Database migration may take a long time. It's better to separate this from the container which is providing the actual service. | |
R3.2 | Non-persistent state requirements | |
R3.2.1 | The application MUST store non-persistent data, such as session information and cache, in the Redis-compatible key-value store provided by the platform. Why is this important? The platform team will have invested a lot of effort in making the key-value store fault-tolerant. By moving session state into the key-value store, each application replica is equal. No sticky load-balancing is needed and rolling updates are possible without disturbing the end-user. | |
R3.2.2 | The application MUST set a reasonable TTL for all key-value pairs. Why is this important? This ensures that the key-value store does not get filled with keys which are no longer in use, e.g., due to lack of cleanup or application crash. | |
R3.2.3 | The application MUST gracefully tolerate loss of non-persistent data, e.g., ask the user to re-login. Why is this important? This is really a test for "is this session data?" | |
R3.2.4 | The application SHOULD use the Redis Sentinel protocol to facilitate a highly available key-value store. Why is this important? This ensures the application can take advantage of the fault-tolerant key-value store provided by the platform. | |
R3.3 | Object storage requirements | |
R3.3.1 | The application MUST store large and/or unstructured data, like images, videos and PDF reports, in an S3-compatible object storage provided by the platform. Why is this important? This ensures the application is stateless. This in turn means that the application can be updated at will, without needing to implement a complicated data replication protocol. | |
R4 | Configuration management requirements | |
R4.1 | The application MUST accept configuration only via clearly documented configuration files and environment variables. Why is this important? This is good practice. It allows a container image to be built once and configured in many different ways. | |
R4.2 | The application MUST separate secret from non-secret configuration information. Examples of secret configuration include API keys and database access passwords. Why is this important? This allows tighter access control around secret configuration via Kubernetes RBAC. | |
R5 | Observability requirements | |
R5.1 | Observability requirements (metrics) | |
R5.1.1 | The application SHOULD provide a metrics endpoint exposing application metrics in Prometheus Exposition format. Why is this important? This allows the application to take advantage of the observability stack provided by the platform. | |
R5.1.2 | The metrics provided by the application SHOULD allow the application team to understand if it functions correctly. Why is this important? This allows the application team to close the DevOps loop and understand how their code runs in production. | |
R5.1.3 | The application SHOULD provide alerting rules based on metric thresholds. Why is this important? This allows the application team to discover issues with the application in production, before they are noticed by end-users. | |
R5.1.4 | The application SHOULD provide metrics to understand if its deprecated APIs are still in use. Why is this important? This allows the application team to safely remove deprecated APIs, reducing technical debt without fear of disappointing end-users. | |
R5.2 | Observability requirements (logs) | |
R5.2.1 | The application MUST produce structured logs in single-line JSON format on stdout. Why is this important? This allows the application to take advantage of the observability stack provided by the platform. | |
R5.2.2 | The application MUST log exceptions and errors. Why is this important? This allows the application team to understand if something is malfunctioning with their application in production and issue bugfixes. | |
R5.2.3 | The application SHOULD log all major boundary events, e.g., incoming and outgoing API requests. Why is this important? This allows the application team to understand how their application is working in production. | |
R5.2.4 | The application MUST mark each log record with the relevant log level, e.g., debug, info, warn, error, exception. Why is this important? This allows the application team to filter log records based on importance and direct their attention to where it is needed. | |
R5.3 | Observability requirements (tracing) | |
R5.3.1 | The application SHOULD push traces to an endpoint provided by the platform using the OpenTelemetry standard. Why is this important? This allows the application team the finest possible observability of their application. They could, for example, determine which particular function is slow in some hard-to-replicate conditions, paving the path to a bugfix. | |
R5.4 | Observability requirements (probes) | |
R5.4.1 | The application MUST provide startup, readiness and liveness probes, as relevant to the application component. Why is this important? This allows Kubernetes to understand if the container of the application is running properly and issue corrective actions, such as directing traffic to a different replica or restarting the container. | |
R5.4.2 | The application MUST fail its startup probe during database migration. Why is this important? If application initialization takes a long time, then a startup probe allows the application team to specify a different timeout for that phase. | |
R5.4.3 | The application SHOULD fail its readiness probe if a downstream service is unavailable. Why is this important? This is readiness probe best practice. If the application cannot deliver a useful service, it should make this clear to the outside world to allow Kubernetes to take corrective actions. | |
R5.4.4 | The application MUST handle SIGTERM by failing its readiness probe, draining connections and exiting gracefully. Why is this important? This allows "hitless" rolling updates. | |
R6 | Build requirements | |
R6.1 | The application MUST be containerized according to the OCI standard. Why is this important? This is pretty much a given nowadays, but is specified just to make sure. | |
R6.2 | The application MUST run on the linux/amd64 OCI platform. Why is this important? Welkin only supports Linux Nodes. | |
R6.3 | The application MUST run as non-root. Why is this important? This is a security guardrail provided with Welkin. It improves security by ensuring that the application runs according to the least privilege principle. | |
R7 | Deployment requirements | |
R7.1 | Applications consisting of multiple components SHOULD be packaged in a versioned way (e.g., via Helm Chart). Why is this important? This ensures that the application is tested and deployed as a whole. | |
R7.2 | The application SHOULD scale horizontally, i.e., its throughput should increase as more replicas are added. Why is this important? This ensures that the application can handle load spikes in a cost-efficient manner. | |
R7.3 | The application MUST specify resource requests and limits for CPU and memory, and make suitable runtime configuration. Why is this important? This is a security guardrail provided with Welkin. It enforces good capacity management practices and reduces the risk of downtime due to capacity exhaustion. | |
R7.4 | The application MUST adhere to the principle of least privilege in terms of network communication, and have the strictest possible set of firewall rules (i.e., Kubernetes Network Policies) in place. Why is this important? This is a security guardrail provided with Welkin. Good NetworkPolicies reduce the chance of successfully exploiting some vulnerabilities. | |
R7.5 | The application SHOULD use a Blue/Green or Canary deployment strategy. Why is this important? This gives the application team a chance to detect a bug before it affects too many end-users and roll back. | |
R7.6 | The application SHOULD use HorizontalPodAutoscaler to scale the number of replicas, as needed to react to load spikes. Why is this important? This ensures that the application can handle load spikes in a cost-efficient manner. | |
R7.8 | The application MAY use a rolling upgrade strategy to ensure it can be upgraded with zero downtime. Why is this important? This allows the application team to deliver new features at high velocity, without worrying about downtime. | |
R8 | Availability requirements | |
R8.1 | The application MUST follow the "rule of 2". Every container needs to have at least two replicas. Why is this important? This ensures the application team cannot introduce bugs, e.g., state management via global variables, which compromise application high-availability and scalability. | |
R8.2 | The application MUST tolerate its replicas running in different datacenters with latencies of up to 10 ms. Why is this important? This ensures the application can tolerate a datacenter failure, if this is a requirement. | |
R8.3 | The application MUST tolerate the failure of a replica. Why is this important? This ensures the application is highly available and can tolerate the failure of a Node or (if needed) datacenter. | |
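
As an illustration of the logging requirements (R5.2.1, R5.2.2 and R5.2.4), the following Python sketch writes each log record to stdout as a JSON object carrying a log level. `JsonFormatter` and `build_logger` are hypothetical names, and the field names (`time`, `level`, `logger`, `message`, `exception`) are assumptions rather than a schema required by Welkin. Treat it as a starting point, not as the user demo's implementation.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # json.dumps escapes newlines, so the stack trace stays inside the value.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name="app", stream=None):
    """Return a logger that writes JSON records to stdout (or the given stream)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("request handled")
    try:
        1 / 0
    except ZeroDivisionError:
        log.exception("division failed")
```

Because the exception text is embedded as a JSON string, a stack trace does not split one record across several output lines.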