NullOps Help

NullOps Project Architecture

Purpose of this project

The idea for this project came from a challenge we faced at our workplace: manual QA was always behind the developers because features were complex and there were no parallel testing environments. Only one complex feature could be tested at a time.

So I started my research. I found several solutions, but none of them was exactly what I was looking for.

Here's a breakdown of solutions I've found:

  • Just a k8s cluster: No UI; a DevOps engineer would have to create each new environment by hand.

  • OpenShift: Bound by corporate licenses, lacks automation features, and is complex to set up and use.

  • Developer portals: Either require a lot of coding or are paid products.

  • Manual docker-compose deployment: Requires a developer to SSH into the target machine and manage running environments by hand.

  • Automatic docker-compose deployment via CI/CD: Inflexible, complex to set up, and offers no UI to select service versions or view active deployments.

None of those solutions fit my needs, so I decided to create my own: an OSS platform that allows developers to easily create and manage their own QA environments, with setup pipelines configured by DevOps engineers.

User Roles and Access Model

Authentication

  • Local user accounts stored in PostgreSQL database

  • Users authenticate with username/password

  • SSO integration is planned as an after-MVP feature

User Roles

  • Admin: DevOps engineer. Manages environments, projects, pipelines, secrets, teams, and team assignments.

  • User: Software engineer or QA engineer. Can create and manage deployments based on their team permissions.

Teams and Permissions

Teams are organizational units created and managed by admins. Teams provide access control to projects and environments.

Team Access Levels:

  • Full Access: Can create, modify, and delete deployments. Can view logs, metrics, and configuration.

  • Read-Only: Can view deployment status, logs, metrics, and configuration. Cannot create or modify deployments.

Team Assignment:

  • Admins create teams (e.g., "QA Team", "Backend Team", "Frontend Team")

  • Admins assign teams to projects or environments with specific access levels

  • Multiple teams can be assigned to the same project/environment with different access levels

  • Users are added to teams by admins

  • A user's permissions are determined by their team memberships

Example: Admin creates "QA Team" and "SE Team". When creating a project, admin assigns "QA Team" with "Read-Only" access and "SE Team" with "Full Access". All QA team members can view deployments but only SE team members can create and manage them.
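The access model above can be sketched as a small resolution function. This is an illustrative sketch only; the level constants, function name, and data shapes are assumptions, not the actual NullOps data model.

```python
# Sketch of effective-permission resolution (hypothetical names and structures).
FULL_ACCESS = 2
READ_ONLY = 1
NO_ACCESS = 0

def effective_access(user_teams, project_team_access):
    """Return the highest access level granted by any of the user's teams.

    user_teams: set of team names the user belongs to
    project_team_access: dict mapping team name -> access level for a project
    """
    levels = [project_team_access.get(team, NO_ACCESS) for team in user_teams]
    return max(levels, default=NO_ACCESS)

# Example from the text: "QA Team" is Read-Only, "SE Team" has Full Access.
access = {"QA Team": READ_ONLY, "SE Team": FULL_ACCESS}
print(effective_access({"QA Team"}, access))             # 1 (read-only)
print(effective_access({"QA Team", "SE Team"}, access))  # 2 (full access)
```

Taking the maximum across memberships means a user in both teams gets Full Access, matching the rule that permissions are determined by team memberships.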

Domain-specific terms

  • Environment: A Docker/k8s cluster with several deployment nodes. Has default auto-cleanup settings.

  • Service: A service configuration. Defines container registry, image name, and version tag mode (pinned or selectable).

  • Project: A pipeline of services and components that forms a deployable set. Describes the services, databases, scripts, and checks required to get a working QA environment.

  • Deployment: A project that has been deployed to an environment. Includes its own ingress to access services inside the project.

  • Component: An extendable action in a pipeline. Examples: "Create DB", "Execute SQL script", "Create Keycloak user", etc.

  • Plugin: A Docker container that implements a component. Receives JSON input via stdin, returns JSON output via stdout.

  • Team: A group of users with shared access to projects or environments. Created and managed by admins.

  • Agent: An API service running on deployment nodes. Maintains internal state of deployments on that node and executes pipelines.

  • Container Registry: A Docker registry configured by admins. Can be public (DockerHub) or a private registry with authorization credentials.

  • Tag Mode: Service version selection mode: either "pinned" (locked to a specific tag) or "selectable" (users choose from pattern-matched tags).

Architecture and expected use-case

A DevOps engineer (Admin) defines services and a project, creates secrets, and builds a pipeline for that project. The pipeline describes which services to run, in what order, and what actions to perform in between. Each project also defines restrictions on RAM and CPU.

Admin creates teams and assigns team access to projects or environments with appropriate permission levels.

Then a software engineer (User with Full Access) creates a deployment for a project and selects service versions for all services (unless restricted by DevOps). First, the control plane checks whether any node in the environment has enough resources. If resources are exhausted, the user receives an error and the deployment is not created. If resources are available, the pipeline is triggered and the project is deployed to the environment.

Later the software engineer can access the project's services via ingress at http://servicename.deploymentname.envname.ingress.lan, or share the UI link with a QA engineer.

When the deployment is no longer needed, a user with Full Access can delete it, which triggers automatic cleanup of the environment.

System Component Interaction

[Diagram: system components. Users interact with the Web UI; the UI talks to the Control Plane (backed by PostgreSQL); the control plane maintains WebSocket connections to Agent Node 1 through Agent Node N; each agent manages local Docker containers and pulls images from the Container Registries; the Ingress Service routes external traffic to the containers.]

Component Communication:

  • Users → UI: user interactions and deployment management

  • UI → Control Plane: API calls for all operations

  • Control Plane → Agents: persistent WebSocket connection for commands and status queries

  • Control Plane → Ingress: dynamic routing configuration updates

  • Agents → Docker Containers: manage and monitor local deployments

  • Agents → Container Registries: pull images with credentials

  • Ingress → Docker Containers: route external traffic to services

Note: The control plane initiates WebSocket connections to agents. Agents expose WebSocket endpoints and do not need to know the control plane URL. If a connection drops, the control plane continuously attempts to reconnect until the agent is disabled by an admin.

Deployment Creation Flow

[Sequence diagram: deployment creation flow. The user initiates deployment creation in the UI; the control plane returns project and service details. For each service, selectable-mode tags are queried from the container registry (filtered by pattern) and shown in a dropdown, while pinned tags are displayed locked. After the user confirms, the control plane checks team permissions and resource availability (CPU/RAM on online nodes); if resources are exhausted, an error is displayed and no deployment is created. Otherwise the control plane sends the specific pipeline version to the agent via WebSocket; the agent tracks the deployment in its internal state and executes components one by one (create DB, run script, etc.), streaming real-time logs. The user can cancel at any time; a failed component or cancellation triggers rollback and cleanup, and error details are shown. On success the agent pulls container images with registry credentials, starts services, and runs health checks; the control plane updates ingress routing and the UI shows deployment details, health status, and ingress URLs.]

Resource Management

Allocation and Limits

  • Each project defines CPU and RAM restrictions that apply to all its deployments

  • Before creating a deployment, the control plane checks if any online node in the environment has sufficient resources

  • Resource checks include both allocated resources and actual OS-level resource consumption

  • If resources are exhausted, deployment creation fails immediately with an error message

  • Users do not see available resources unless deployment creation fails
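The pre-deployment check described above can be sketched as follows. This is a minimal illustration of the "any online node with sufficient resources" rule; the field names and node structure are assumptions, and the real check also factors in OS-level consumption.

```python
# Minimal sketch of the pre-deployment resource check (hypothetical structures).
def find_node(nodes, cpu_needed, ram_needed_mb):
    """Return the first online node with enough free CPU and RAM, or None."""
    for node in nodes:
        if not node["online"]:
            continue
        free_cpu = node["cpu_total"] - node["cpu_allocated"]
        free_ram = node["ram_total_mb"] - node["ram_allocated_mb"]
        if free_cpu >= cpu_needed and free_ram >= ram_needed_mb:
            return node
    return None

nodes = [
    {"name": "node-1", "online": True, "cpu_total": 8, "cpu_allocated": 7,
     "ram_total_mb": 16384, "ram_allocated_mb": 12288},
    {"name": "node-2", "online": True, "cpu_total": 16, "cpu_allocated": 4,
     "ram_total_mb": 32768, "ram_allocated_mb": 8192},
]

chosen = find_node(nodes, cpu_needed=4, ram_needed_mb=8192)
print(chosen["name"] if chosen else "Error: Resources Full")  # node-2
```

If `find_node` returns `None`, deployment creation fails immediately with an error, as described above.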

Resource Monitoring

  • OS-Level Monitoring: Agents monitor actual resource consumption from the operating system

  • Docker Limits: Docker enforces CPU and RAM limits on running containers

  • Resource Accounting:

    • Allocated resources tracked by control plane

    • Actual consumption monitored by agents

    • Both system resources and Docker container resources are counted

  • Resource Drift Prevention: Docker container limits prevent processes from exceeding allocated resources

Runtime Behavior

  • Docker enforces CPU and RAM limits on running containers

  • If a deployment attempts to exceed its limits, Docker prevents additional resource consumption

  • Processes within the deployment cannot bypass these restrictions

  • If Docker crashes and resources aren't freed properly, this is considered a user/infrastructure error

Cleanup and Monitoring

  • Agents maintain internal state of which deployments exist on their node

  • Agents automatically clean up:

    • Services from deleted deployments

    • Orphaned resources (zombie deployments)

    • Resources from failed deployments after rollback

  • Control plane communicates state changes to agents via WebSocket to keep internal state synchronized

Multi-tenancy and Isolation

Network Isolation

  • Deployments are isolated from each other using network policies

  • Deployments from different teams or projects cannot interfere with each other

  • Inter-service communication within a single deployment is managed via Docker networks

  • Central ingress service provides controlled access to deployments

Data Isolation

  • Each deployment can have its own database instance as defined by the pipeline

  • Pipeline specifies which databases to spin up for the deployment

  • Database connection strings are managed and injected by the pipeline configuration

Ingress Routing

  • Each service within a deployment is accessible via subdomain-based routing

  • URL format example: http://servicename.deploymentname.envname.example.com

  • Subdomain Configuration: Base subdomain is configurable per environment in environment settings

  • DNS Setup Required: DevOps engineer must manually configure DNS records to point to the ingress entry point

  • Ingress URLs remain stable even when deployments are recreated

  • Deployments can call external APIs as needed

  • Ingress configuration is dynamically updated when deployments are created or destroyed

  • Control plane notifies ingress service of routing changes
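The subdomain scheme and dynamic routing updates above can be sketched like this. The routing-table shape and function names are illustrative assumptions; the real ingress service is reconfigured by the control plane, not by direct dictionary updates.

```python
# Sketch of subdomain-based route construction (hypothetical routing table).
def ingress_host(service, deployment, environment, base_domain="ingress.lan"):
    """Build the servicename.deploymentname.envname.<base> host."""
    return f"{service}.{deployment}.{environment}.{base_domain}"

# Hypothetical routing table: host -> (node address, container port)
routes = {}

def add_deployment_routes(routes, deployment, environment, services, node_addr):
    """Register one route per service when a deployment is created."""
    for name, port in services.items():
        routes[ingress_host(name, deployment, environment)] = (node_addr, port)

add_deployment_routes(routes, "feature-login", "qa",
                      {"backend": 8080, "frontend": 3000}, "10.0.0.12")
print(routes["backend.feature-login.qa.ingress.lan"])  # ('10.0.0.12', 8080)
```

Because the host is derived only from service, deployment, and environment names, the URL stays stable when a deployment is recreated on a different node.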

Security

Communication Security

  • WebSocket Connections: Control plane and agents communicate via persistent WebSocket connections

  • TLS Encryption: All WebSocket connections use TLS encryption

  • Certificate-based Trust:

    • Agent generates a self-signed certificate on first startup

    • Certificate thumbprint is stored in control plane database during agent registration

    • On each connection, control plane validates agent's certificate thumbprint

    • If thumbprint changes (certificate regenerated), agent is marked as untrusted

    • DevOps engineer must manually approve new thumbprint through UI before agent is trusted again
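The thumbprint check above can be sketched as follows, assuming the thumbprint is a SHA-256 digest of the agent's DER-encoded certificate (the exact hash and encoding are assumptions).

```python
# Sketch of certificate thumbprint validation (assumed SHA-256 over DER bytes).
import hashlib

def thumbprint(cert_der: bytes) -> str:
    """Digest of the raw certificate bytes, stored at agent registration."""
    return hashlib.sha256(cert_der).hexdigest()

def validate_agent(cert_der: bytes, stored_thumbprint: str) -> bool:
    """True if the presented certificate matches the registered thumbprint."""
    return thumbprint(cert_der) == stored_thumbprint

cert = b"...DER bytes of the agent certificate..."
stored = thumbprint(cert)                      # captured during registration
print(validate_agent(cert, stored))            # True
print(validate_agent(b"regenerated", stored))  # False -> mark agent untrusted
```

A `False` result corresponds to the "certificate regenerated" case: the agent is marked untrusted until an admin approves the new thumbprint in the UI.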

Token Management

  • Agent Authentication Tokens: Generated by control plane when admin adds a new node

  • Token Rotation: Admins can rotate agent tokens through UI

    • New token is generated

    • Agent must be reconfigured with new token

    • Old token is immediately invalidated

  • Token Revocation: Admin can revoke agent token to immediately disconnect agent

  • Token Storage: Tokens are hashed in control plane database
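The token lifecycle above can be sketched like this. The salted-SHA-256 scheme and function names are illustrative assumptions; the document only states that tokens are hashed in the database.

```python
# Sketch of token issuance and verification (assumed salted-hash storage).
import hashlib, hmac, secrets

def issue_token():
    """Generate a token; only (salt, digest) is stored in the database."""
    token = secrets.token_urlsafe(32)          # given to the agent once
    salt = secrets.token_bytes(16)
    digest = hashlib.sha256(salt + token.encode()).hexdigest()
    return token, salt, digest

def verify_token(presented, salt, digest):
    """Constant-time comparison so timing leaks nothing about the digest."""
    candidate = hashlib.sha256(salt + presented.encode()).hexdigest()
    return hmac.compare_digest(candidate, digest)

token, salt, digest = issue_token()
print(verify_token(token, salt, digest))          # True
print(verify_token("stale-token", salt, digest))  # False (rotated or revoked)
```

Rotation maps naturally onto this sketch: generating a new `(salt, digest)` pair invalidates the old token immediately, since verification only consults the stored digest.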

Encryption at Rest

  • Secret Encryption: All pipeline secrets and container registry credentials are encrypted with AES-256

  • Encryption Key Management:

    • AES encryption key is provided by DevOps engineer during control plane setup

    • Key is stored in environment variable or external Vault (after-MVP)

    • Both control plane/Vault AND database must be compromised to decrypt secrets

  • Encrypted Data:

    • Container registry credentials

    • Pipeline secrets (passwords, API keys, connection strings)

    • Agent authentication tokens (hashed, not encrypted)

Access Control

  • User authentication via username/password stored in PostgreSQL

  • Team-based permissions enforced at control plane level

  • All operations are validated against user's team memberships and access levels

Service Configuration and Container Registries

Container Registry Management

  • Admins configure available container registries through the UI

  • Default Registries: DockerHub and other public registries are included by default

  • Private Registries: Admins can add private registries with authorization credentials

  • Credentials are securely stored and used by agents when pulling images

Service Definition

When creating a service, admins specify:

  • Container Registry: Which registry to use

  • Image Name: The container image name (e.g., postgres, mycompany/backend-api)

  • Tag Mode: How version tags are selected

Tag Modes

Pinned Mode:

  • Service version is locked to a specific tag

  • Users cannot change the version when creating deployments

  • Example use case: Databases that should always use a stable version

  • UI displays the locked tag (e.g., "postgres:14.5")

Selectable Mode:

  • Users can choose from available tags when creating deployments

  • Admins can specify a pattern to filter tags (e.g., features/*, v*, 1.2.*)

  • UI displays a dropdown menu with tags pulled from the container registry

  • Tags are fetched dynamically and filtered by the specified pattern

  • Example: Allow selecting any feature branch tag matching features/*
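The pattern filtering above can be sketched with shell-style glob matching, assuming patterns like features/* and v* are glob patterns (the document does not specify the matching syntax).

```python
# Sketch of selectable-mode tag filtering (assumed glob-style patterns).
from fnmatch import fnmatch

def filter_tags(tags, pattern=None):
    """Return tags matching the admin-defined pattern; all tags if no pattern."""
    if pattern is None:
        return list(tags)
    return [t for t in tags if fnmatch(t, pattern)]

tags = ["latest", "v1.2.0", "v1.3.0", "features/login", "features/search"]
print(filter_tags(tags, "features/*"))  # ['features/login', 'features/search']
print(filter_tags(tags, "v*"))          # ['v1.2.0', 'v1.3.0']
```

The filtered list is what the UI would show in the tag dropdown after fetching tags from the registry.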

Version Selection Flow

  1. User initiates deployment creation

  2. For each service in the project:

    • If tag mode is "pinned": Display locked version, no user input

    • If tag mode is "selectable": Query container registry for available tags

  3. Filter tags based on admin-defined pattern (if specified)

  4. Display filtered tags in dropdown for user selection

  5. User selects versions and proceeds with deployment

Pipeline Configuration

Pipeline Editor

  • Admins configure pipelines using a UI-based node editor

  • Drag-and-drop interface for adding and arranging components

  • Visual connection of nodes to define execution order

  • Each node represents a component (service, database, script, etc.)

Pipeline Structure

  • Nodes are connected to define dependencies and execution sequence

  • Sequential execution: Components run in the order defined by connections

  • Pipeline defines:

    • Which services to deploy

    • Which components to execute and when

    • Configuration for each component

    • Resource limits (CPU/RAM)

Pipeline Execution Flow

  • Agent receives pipeline definition from control plane

  • Agent executes components following the node graph

  • Each component must complete successfully before proceeding to next

  • Any failure triggers rollback of entire pipeline

Agent Architecture

Agent Responsibilities

  • Run as API services on each deployment node

  • Maintain persisted internal state of deployments on the node

  • Execute pipelines for new deployments

  • Monitor and cleanup resources

  • Accept and maintain persistent WebSocket connections from control plane

  • Generate and manage self-signed TLS certificate

  • Pull and execute plugin containers

Communication Model

  • WebSocket Connection: Persistent, bidirectional WebSocket connection between control plane and each agent

  • Connection Initiation: Control plane initiates WebSocket connection to agent

  • TLS Encryption: All WebSocket connections are encrypted using TLS

  • Agent as Server: Agent exposes WebSocket endpoint; control plane connects as client

  • Connection Lifecycle:

    • Control plane establishes WebSocket connection to agent on agent registration

    • Connection remains open for real-time communication

    • If connection drops, control plane attempts to reconnect

    • Agent accepts reconnection attempts from control plane

  • Connection Loss Handling:

    • Agent is marked as "offline" when WebSocket connection is lost

    • Control plane continuously attempts to restore connection

    • Reconnection attempts continue until agent is manually disabled by admin

    • When agent comes back online, state is synchronized from control plane
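The reconnection behavior above can be sketched as a retry loop. `connect` and `agent_disabled` are stand-ins for the real WebSocket client and the admin's disable flag, and the exponential backoff is an assumption (the document only says the control plane retries continuously).

```python
# Sketch of the control plane's reconnection loop (hypothetical callables).
import time

def reconnect_loop(connect, agent_disabled, base_delay=1.0, max_delay=60.0):
    """Retry until a connection succeeds or the admin disables the agent."""
    delay = base_delay
    while not agent_disabled():
        if connect():
            return True                       # back online; resync state next
        time.sleep(min(delay, max_delay))     # back off between attempts
        delay *= 2
    return False                              # admin disabled the agent

attempts = iter([False, False, True])         # fail twice, then succeed
print(reconnect_loop(lambda: next(attempts), lambda: False, base_delay=0.001))
```

On a `True` result the control plane would push the full deployment state to the agent, per the state-reconciliation rules below.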

Agent Health and Availability

  • Health Status: Determined by WebSocket connection state

  • Online: WebSocket connection is active and healthy

  • Offline: WebSocket connection is lost or failed

  • When Agent Goes Offline:

    • All deployments on that agent are marked as offline (not accessible)

    • Ingress is automatically reconfigured to ignore the offline node

    • Deployments are NOT migrated to other nodes

    • Deployments remain in offline state until agent reconnects

  • Version Compatibility:

    • Control plane checks agent version on connection

    • Mixed agent versions are NOT supported

    • Agent with incompatible version is rejected and marked as incompatible

State Management

  • Control Plane Authority: Control plane is always the source of truth for deployment state

  • State Reconciliation: When agent reconnects after being offline:

    • Agent receives full deployment state from control plane

    • Agent's internal state is updated to match control plane

    • Deployments are verified and restarted if necessary

  • State Corruption: If agent's internal state becomes corrupted:

    • Only mounted volume data may be lost

    • Agent is fully reconfigured by control plane on next connection

    • Deployment state is restored from control plane database

Internal State

  • Track which deployments are active on the node

  • Store deployment configurations

  • Maintain pipeline execution history

  • Persist state locally to survive agent restarts

  • State synchronization with control plane on reconnection

Pipeline Execution and Plugin System

Component System

  • Components are implemented as Docker container plugins

  • Custom components can be added to extend pipeline functionality

  • Component dependencies (Component A requiring Component B) are not supported

  • Examples: "Create DB", "Execute SQL script", "Create Keycloak user"

Docker Container Plugin Architecture

Plugin Contract:

  • Plugins are Docker containers that implement a simple stdin/stdout interface

  • Input: JSON via stdin

  • Output: JSON via stdout

  • Exit code: 0 for success, non-zero for failure

Example Plugin Input:

{
  "action": "create",
  "parameters": {
    "database": "myapp_qa",
    "version": "14"
  },
  "secrets": {
    "adminPassword": "encrypted_value"
  }
}

Example Plugin Output:

{
  "success": true,
  "output": "Database created successfully",
  "outputs": {
    "connectionString": "postgres://localhost:5432/myapp_qa",
    "databaseName": "myapp_qa"
  }
}

Plugin Execution:

# Agent executes plugin like this:
echo '{"action":"create",...}' | docker run --rm plugin-image:v1
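From the agent's side, driving this contract amounts to writing JSON to the plugin's stdin and reading JSON from its stdout. The sketch below uses a Python one-liner as a stand-in plugin so it is runnable; the real agent would invoke `docker run --rm <image>` instead, and `run_plugin` is a hypothetical helper name.

```python
# Sketch of the agent side of the stdin/stdout plugin contract.
import json, subprocess, sys

def run_plugin(command, payload, timeout=300):
    """Send JSON on stdin, read JSON from stdout; non-zero exit is a failure."""
    proc = subprocess.run(command, input=json.dumps(payload).encode(),
                          capture_output=True, timeout=timeout)
    if proc.returncode != 0:
        raise RuntimeError(f"plugin failed: {proc.stderr.decode()}")
    return json.loads(proc.stdout)

# Stand-in "plugin": a Python one-liner that echoes a success envelope.
echo_plugin = [sys.executable, "-c",
               "import json,sys; data=json.load(sys.stdin);"
               "json.dump({'success': True, 'output': data['action']}, sys.stdout)"]

result = run_plugin(echo_plugin, {"action": "create"})
print(result)  # {'success': True, 'output': 'create'}
```

The `timeout` parameter mirrors the configurable plugin execution timeout described later (default 5 minutes).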

Plugin Distribution and Management

  • Storage: Plugins are stored as Docker images in container registries

  • Configuration: Admins configure available plugins through the UI by specifying Docker image names

  • Distribution: When plugin list changes, control plane notifies all agents via WebSocket

  • Loading: Agents pull plugin images from configured container registries (same registries used for services)

  • Discovery: Agents maintain a registry of available plugin images

  • Versioning: Plugin versions are managed through Docker image tags (e.g., mycompany/plugin-postgres:v1.2.0)

  • Plugin Updates: Updating a plugin version does not affect running deployments (they use the plugin version specified in their pipeline version)

  • Isolation: Each plugin execution runs in an isolated Docker container

  • Failed Downloads: If plugin image pull fails, the component using that plugin will fail during pipeline execution

Plugin Registration in UI:

  • Admins specify plugin details:

    • Plugin name (e.g., "PostgreSQL Database Creator")

    • Docker image (e.g., mycompany/plugin-postgres:v1.2.0)

    • Container registry (select from configured registries)

    • Description and parameter schema

    • Resource limits (CPU/RAM for plugin execution)

Plugin Development:

  • Plugins can be written in any language (Python, Go, Bash, Node.js, etc.)

  • Plugin developers package their code as Docker images

  • Plugins can use any libraries or tools available in their container

  • Plugins must read JSON from stdin and write JSON to stdout

  • Exit code 0 indicates success, non-zero indicates failure

Example Plugin Implementation (Python):

#!/usr/bin/env python3
import json
import sys

import psycopg2


def main():
    # Read input from stdin
    input_data = json.load(sys.stdin)
    try:
        # Execute plugin logic
        conn = psycopg2.connect(input_data['secrets']['adminConnection'])
        conn.autocommit = True
        cursor = conn.cursor()
        cursor.execute(f"CREATE DATABASE {input_data['parameters']['database']}")
        # Return success result
        result = {
            "success": True,
            "output": f"Database {input_data['parameters']['database']} created",
            "outputs": {
                "databaseName": input_data['parameters']['database']
            }
        }
        json.dump(result, sys.stdout)
        sys.exit(0)
    except Exception as e:
        # Return failure result
        result = {
            "success": False,
            "output": f"Failed to create database: {str(e)}"
        }
        json.dump(result, sys.stdout)
        sys.exit(1)


if __name__ == "__main__":
    main()

Pipeline Versioning

  • Version on Save: Each time a pipeline is modified and saved, a new version is created

  • Running Deployments: Currently running/active deployments use the specific pipeline version they were created with

  • Immutable Versions: Historical pipeline versions cannot be modified

  • Admin Modifications: Admins can modify pipelines while deployments are running without affecting those deployments

  • New Deployments: New deployments always use the latest pipeline version

  • Plugin Versions: Pipeline versions store specific plugin image tags, ensuring deployments always use the same plugin versions

Execution Flow

  • Pipelines are executed by agents on deployment nodes

  • Control plane sends pipeline definition (specific version) to agent via WebSocket

  • Control plane monitors pipeline execution status in real-time

  • For each component in the pipeline:

    1. Agent pulls the plugin Docker image (if not already cached)

    2. Agent prepares JSON input with parameters and secrets

    3. Agent executes plugin container: docker run --rm plugin-image < input.json > output.json

    4. Agent captures stdout, stderr, and exit code

    5. Agent parses JSON output and checks exit code

    6. If successful, proceed to next component

    7. If failed, trigger rollback

  • Pipeline executes components sequentially in defined order

  • If any component fails, the entire pipeline fails

  • Failed deployments trigger automatic rollback by the agent

  • Users can retry failed deployments through the UI (uses latest pipeline version)
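The sequential execute-or-rollback flow above can be sketched as follows. The `Component` class and its `run`/`rollback` methods are illustrative stand-ins for plugin executions, not the actual agent API.

```python
# Sketch of sequential pipeline execution with rollback on failure.
def execute_pipeline(components):
    """Run components in order; on failure, roll back completed ones in reverse."""
    completed = []
    for comp in components:
        try:
            comp.run()
            completed.append(comp)
        except Exception:
            for done in reversed(completed):   # undo in reverse order
                done.rollback()
            return "failed"
    return "success"

class Component:
    """Stand-in for a plugin execution; records what happened to it."""
    def __init__(self, name, fail=False):
        self.name, self.fail, self.log = name, fail, []
    def run(self):
        if self.fail:
            raise RuntimeError(self.name)
        self.log.append("run")
    def rollback(self):
        self.log.append("rollback")

db, script = Component("create-db"), Component("run-script", fail=True)
print(execute_pipeline([db, script]))  # failed
print(db.log)                          # ['run', 'rollback']
```

Rolling back in reverse order mirrors how resources created by earlier components (databases, users) should be removed after a later component fails.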

Plugin Execution Environment

  • Plugins run in isolated Docker containers

  • Plugins have no access to agent internals or other deployments

  • Plugins can be configured with resource limits (CPU/RAM per execution)

  • Plugin containers are ephemeral (removed after execution)

  • Plugin execution logs are captured and sent to control plane for user visibility

  • Plugins can access deployment network if needed (configured per plugin)

Failure Handling

  • Component failure = pipeline failure

  • Rollback removes all resources created during the failed deployment attempt

  • Agents ensure cleanup of partial deployments

  • Error details are displayed to all users (admins and regular users see the same errors)

  • External Service Failures: If container registry is unreachable or external service fails during pipeline execution, deployment fails

  • Plugin Failures: If plugin container crashes or returns non-zero exit code, deployment fails

  • Plugin Timeout: Plugins can be configured with execution timeout (default: 5 minutes)

Secrets and Configuration Management

Secret Injection

  • Secrets are managed and defined in the pipeline configuration

  • Pipeline specifies how secrets are injected into plugin containers

  • Secrets are passed to plugins via JSON input (encrypted in transit)

  • Admins configure secret sources and injection methods per project

  • Secrets are stored in control plane database (encrypted at rest)

  • Agents retrieve secrets from control plane when executing pipelines

  • Secrets are never persisted on agent nodes

Configuration Override

  • Users can only override service versions when creating deployments (if tag mode is "selectable")

  • All other configurations are locked and defined by the pipeline

  • Pipeline-defined configurations include:

    • Environment variables

    • Database connection strings

    • Component parameters

    • Resource limits

    • Plugin parameters

Deployment Lifecycle

Creation and Versioning

  • Users initiate deployment creation for a project

  • For each service in the project:

    • Pinned tag mode: Version is locked, displayed but not selectable

    • Selectable tag mode: Users choose from available tags (filtered by pattern if specified)

  • Users can only modify service versions, all other configuration is locked

  • To use different versions, users must create a new deployment (versions cannot be changed after creation)

  • Each deployment gets unique ingress URLs that remain stable across recreations

  • Deployments use the current pipeline version at the time of creation

Deployment Cancellation

  • In-Progress Cancellation: Users can cancel deployments that are currently being created

  • Cancellation = Deletion: Cancelling a deploying deployment is treated the same as deleting it

  • Rollback on Cancel: Agent performs rollback and cleanup of partially created resources

  • Delete During Creation: If user attempts to delete a deployment while it's being created, it is cancelled and rolled back

Deployment Cloning

  • Users can clone an existing deployment

  • Cloning creates a new deployment with the same configuration

  • Users can modify service versions during cloning (if tag mode is selectable)

  • Cloned deployment gets new unique ingress URLs

  • Cloned deployment uses the latest pipeline version (not the original deployment's pipeline version)

Auto-cleanup Configuration

  • Environment Level: Default auto-cleanup timer (e.g., 7 days of inactivity)

  • Project Level: Can override environment default with project-specific timer

  • Configurable per environment and overrideable per project

Auto-cleanup Behavior

  • Timer starts counting from last deployment activity

  • UI displays countdown timer showing time remaining before auto-deletion

  • Users with Full Access can extend deployment lifetime through UI

  • Lifetime Extension: When user extends deployment, timer resets completely

  • Maximum Lifetime: No hard limit on deployment lifetime (can extend indefinitely)

  • System sends no additional warnings - timer in UI is the warning mechanism

  • Manual deletion by users with Full Access triggers immediate cleanup

  • Deployment pausing is not supported
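The countdown shown in the UI can be sketched as simple timestamp arithmetic. The field names are assumptions; the key point from the text is that an extension resets the timer completely.

```python
# Sketch of the auto-cleanup countdown (hypothetical field names).
from datetime import datetime, timedelta

def time_remaining(last_activity, ttl, now):
    """Time left before auto-deletion; an extension just updates last_activity."""
    return (last_activity + ttl) - now

now = datetime(2024, 1, 10, 12, 0)
last = datetime(2024, 1, 5, 12, 0)
ttl = timedelta(days=7)                     # environment or project default

remaining = time_remaining(last, ttl, now)
print(remaining)                            # 2 days, 0:00:00
print(remaining <= timedelta(0))            # False -> not yet eligible for cleanup

extended = time_remaining(now, ttl, now)    # user extended: timer resets fully
print(extended)                             # 7 days, 0:00:00
```

When the remaining time reaches zero the deployment becomes eligible for automatic deletion; there is no hard cap, so repeated extensions can keep it alive indefinitely.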

Observability

Logging

  • UI provides access to logs from all services within a deployment

  • Control plane queries agents for service logs

  • Logs are available to all users with access to the deployment (based on team permissions)

  • Plugin execution logs are captured and displayed during pipeline execution

Metrics and Health

  • Health checks are visible through the UI

  • Metrics are available for monitoring deployment performance and resource usage

  • Agents report metrics back to control plane

Audit and History

  • Deployment history tracks who deployed what and when

  • Actions are logged for audit purposes

  • History is accessible through the UI based on team permissions

Edge Cases and Failure Scenarios

Network Connection Loss

  • During Deployment: If WebSocket connection drops during deployment:

    • Agent is marked as offline

    • Deployment is paused in its current state

    • Agent continues attempting to complete pipeline locally

    • When connection is restored, agent reports current state to control plane

    • Control plane resumes monitoring the deployment

  • Connection Recovery: Control plane and agent both attempt to restore connection

  • Deployment State After Recovery: Deployment continues from where it left off

Node Failures

  • If a deployment node goes down completely, all deployments on that node are marked offline

  • Ingress automatically reconfigures to remove offline node from routing

  • Deployments are NOT automatically migrated to other nodes

  • When node comes back online:

    • Agent reconnects to control plane

    • State is synchronized from control plane

    • Deployments are verified and services restarted if needed

    • Ingress is reconfigured to include the node again

Node Maintenance and Decommissioning

  • Graceful Decommissioning:

    1. Admin disables node for new deployments through UI

    2. Existing deployments continue running

    3. Admin waits for deployments to be deleted naturally or manually deletes them

    4. Once node is empty, admin can safely remove it

  • Node Reboot:

    • Services continue running in Docker during reboot

    • Agent reconnects on startup

    • Agent re-syncs state with control plane

    • No deployment interruption if Docker services remain running

  • Future Enhancement: Automatic deployment migration between nodes (after-MVP)

Long-running Operations

  • System supports long-running deployments and pipeline operations

  • Large Image Pulls: If container image pull takes excessive time (30+ minutes), this is considered user error

  • Users should optimize container images for reasonable pull times

  • No specific timeout enforced, but extremely long operations may cause issues

  • Progress Indication: Real-time pipeline status and logs visible in UI during long operations

  • Plugin Timeouts: Individual plugins can be configured with execution timeouts (default: 5 minutes)

Concurrent Deployments

  • Multiple users can deploy the same project simultaneously

  • Each deployment is isolated and receives unique ingress URLs

  • Resource availability is checked independently for each deployment request

  • Race Conditions: If two users deploy simultaneously and resources are borderline, no special handling

    • System designed for simplicity over complex resource locking

    • One deployment may fail due to insufficient resources

Agent Version Compatibility

  • Control plane checks agent version on connection

  • Mixed agent versions are NOT supported across the environment

  • Agents with incompatible versions are rejected

  • All agents must be upgraded together when system version changes

Environment Scaling and Node Management

Adding Nodes to Environment

  • Admins can add new deployment nodes to an existing environment dynamically

  • Nodes are manually configured by DevOps engineers through the UI

  • Control plane does not automatically discover agents

  • Admin provides:

    • Node hostname or IP address (where agent is accessible)

    • Agent API port

    • Authentication token for secure communication

  • New nodes must be:

    • Provisioned with Docker/k8s infrastructure

    • Running the agent service

    • Accessible from control plane (network connectivity)

    • Registered in the control plane via UI by admin

    • Configured with appropriate network policies for deployment isolation

  • DNS Configuration: Admin must configure DNS for ingress subdomain to point to ingress entry point

Agent Installation Process

  1. Admin navigates to environment configuration in UI

  2. Admin adds new node to the environment, providing:

    • Node hostname or IP address (where control plane can reach the agent)

    • Agent WebSocket port

    • Authentication token (generated by UI)

  3. Admin spins up agent as a Docker container on the target node:

    docker run -d \
      -p 5000:5000 \
      -e AGENT_TOKEN=<token-from-ui> \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v /var/lib/nullops-agent:/data \
      nullops/agent:latest
  4. Agent starts and generates self-signed TLS certificate (if not already present)

  5. Agent exposes WebSocket endpoint and waits for control plane connection

  6. Control plane initiates WebSocket connection to agent using configured hostname and port

  7. Control plane authenticates using token and stores agent certificate thumbprint

  8. Control plane synchronizes plugin list and deployment state with agent

  9. Node becomes available for deployment allocation

Certificate Management:

  • Agent generates self-signed certificate on first startup

  • Certificate is stored in mounted volume (/data) and persists across restarts

  • If certificate is deleted or regenerated, agent will be untrusted

  • Admin must approve new certificate thumbprint through UI

Note: Agent does not need to know control plane URL. Agent exposes a WebSocket endpoint and control plane connects to it.
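Steps 6-7 amount to trust-on-first-use certificate pinning. A standard-library sketch follows; the function names and the choice of SHA-256 for the thumbprint are assumptions (the document only says a thumbprint is stored):

```python
import hashlib
import socket
import ssl

def cert_thumbprint(der_bytes: bytes) -> str:
    # Thumbprint over the DER-encoded certificate (SHA-256 is an assumption).
    return hashlib.sha256(der_bytes).hexdigest()

def fetch_agent_thumbprint(host: str, port: int) -> str:
    # The agent's certificate is self-signed, so chain verification is
    # disabled and trust is pinned to the stored thumbprint instead.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return cert_thumbprint(tls.getpeercert(binary_form=True))
```

On the first connection the control plane would store the thumbprint; if a later connection presents a different one (see Certificate Management above), it is refused until an admin approves the new thumbprint in the UI.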

Node Configuration in Control Plane

  • Admin provides node connection details in UI:

    • Hostname or IP address (where agent is accessible)

    • Agent WebSocket port

    • Authentication token

  • Control plane initiates WebSocket connection to agent endpoint

  • Control plane authenticates using token and stores agent certificate thumbprint

  • Control plane synchronizes plugin list and state with agent via WebSocket

  • Node becomes available for deployment allocation

  • Node Disabling: Admin can disable node to prevent new deployments while existing ones complete
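A node record covering the connection details above, plus the disable flag, might look like this; the field names are illustrative assumptions, not the actual NullOps schema:

```python
from dataclasses import dataclass

@dataclass
class NodeConfig:
    hostname: str              # where the agent is accessible
    ws_port: int               # agent WebSocket port
    auth_token: str            # token generated in the admin UI
    cert_thumbprint: str = ""  # pinned on first connection
    enabled: bool = True       # disabled nodes accept no new deployments

def allocatable(nodes):
    # Disabled nodes keep serving existing deployments but are skipped
    # when allocating new ones.
    return [n for n in nodes if n.enabled]
```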

Resource Management and Quotas

Resource Limits

  • This system is designed for internal use only within an organization

  • No per-user or per-team resource quotas are implemented

  • No limits on concurrent deployments per user or team

  • Resource availability is managed at the project level (CPU/RAM limits per deployment)

  • Trust-based model: Users are expected to be responsible with resource usage

Internal Use Scope

  • System assumes a cooperative environment

  • All users are internal team members

  • Resource management relies on project-level restrictions and available hardware

  • Fair use policies are organizational rather than system-enforced

Future Enhancements (After-MVP)

Planned Features

  • SSO Integration: Support for Single Sign-On authentication providers

  • Vault Integration: Store AES encryption key in external Vault instead of environment variable

  • Deployment Migration: Automatically migrate deployments between nodes during maintenance

  • Deployment Diff: Compare configuration differences between two deployments

  • Pipeline Dry-Run: Test/validate pipelines without actually deploying

  • Advanced Monitoring and Alerting:

    • System-wide metrics dashboard

    • Alerting when control plane or agents have issues

    • Comprehensive audit logs with detailed "who did what when"

  • Deployment Templates: DevOps engineers can create templates for common configurations

  • Disaster Recovery: Comprehensive backup and restore procedures for environments and deployments

  • JavaScript Plugins: Optional lightweight scripting plugins for simple transformations and validations

Technical Architecture Components

Control Plane / Main Server

  • Central orchestration service

  • Manages users, teams, projects, environments, and container registries

  • Handles authentication and authorization (local accounts)

  • Maintains persistent WebSocket connections with all agents

  • Monitors agent health via WebSocket connection status

  • Queries agents for status, logs, and metrics via WebSocket

  • Notifies ingress service of deployment changes for dynamic routing updates

  • Provides API for UI

  • Data Persistence: All state persisted to PostgreSQL database with connection pooling

  • User Management: Stores user accounts, team memberships, and permissions

  • Configuration: Stores projects, pipelines (with versions), environments, plugin configurations, and container registry credentials

  • Secret Management: Securely stores and provides encrypted secrets to agents during pipeline execution

  • Certificate Management: Stores agent certificate thumbprints for trust validation

  • Schema Management: Automated database migrations, no manual schema changes required
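Because agent health is inferred purely from WebSocket connection status, the monitoring side can be sketched as a registry keyed on connection events; the class and method names are assumptions:

```python
# Minimal sketch: an agent is "online" exactly while its WebSocket
# connection is up; losing the connection is the offline signal.
class AgentRegistry:
    def __init__(self):
        self.status = {}

    def on_connect(self, node_id: str) -> None:
        self.status[node_id] = "online"
        # Here the control plane would also resync the plugin list and
        # deployment state with the agent.

    def on_disconnect(self, node_id: str) -> None:
        self.status[node_id] = "offline"
        # Here the control plane would notify the ingress service to
        # drop the node's routes.

    def online_nodes(self):
        return [n for n, s in self.status.items() if s == "online"]
```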

Agents

  • API services running on each deployment node

  • Maintain internal state of local deployments

  • Execute pipelines locally

  • Monitor and cleanup resources

  • Accept and maintain persistent WebSocket connections from control plane

  • Expose WebSocket endpoint for control plane connections

  • Image Management: Pull container images from configured registries using provided credentials

  • Plugin Execution: Pull and execute plugin containers during pipeline execution

  • Authentication: Validate incoming WebSocket connections using configured token

  • Certificate Management: Generate and manage self-signed TLS certificate for secure WebSocket connections

  • Installation: Deployed as Docker containers with generated tokens from admin UI

  • Version Compatibility: Must match control plane version; mixed versions not supported

  • Health Reporting: Connection status indicates agent health (online/offline)

  • State Synchronization: Receive full state from control plane on reconnection

  • Plugin Safety: Execute plugins in isolated Docker containers with proper error handling

  • Network Model: Agent exposes WebSocket endpoint; control plane connects as client; agent does not need control plane URL

Ingress Service

  • Central routing for all deployment services

  • Subdomain-based routing (format: servicename.deploymentname.envname.<configured-domain>)

  • Network isolation enforcement

  • Dynamic Configuration: Automatically reconfigures routing when deployments are created or destroyed

  • Receives routing updates from control plane

  • Maintains routing table for all active deployments across all online nodes

  • DNS Requirements: DevOps engineer must configure external DNS to point subdomain to ingress entry point

  • Base subdomain is configurable per environment

  • High Availability: Hosted near control plane; HA not required for internal-use system

  • Automatically removes offline nodes from routing table
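The subdomain scheme above implies a host-header lookup like the following; the routing-table shape and function name are assumptions:

```python
# Hypothetical lookup for the documented subdomain scheme
# servicename.deploymentname.envname.<configured-domain>.
def resolve(host: str, base_domain: str, routes: dict):
    if not host.endswith("." + base_domain):
        return None
    labels = host[: -len(base_domain) - 1].split(".")
    if len(labels) != 3:
        return None  # not a deployment subdomain
    service, deployment, env = labels
    # routes maps (env, deployment, service) -> upstream "node-host:port";
    # entries for offline nodes are removed by the ingress service.
    return routes.get((env, deployment, service))
```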

Plugin System

  • Docker container-based plugin architecture

  • Language-agnostic component execution

  • Used by agents to run pipeline components

  • Packaging Format: Docker images

  • Repository: Container registries (same as used for services)

  • Notification: Control plane notifies agents via WebSocket when plugin configuration changes

  • Distribution: Agents pull plugin images from configured container registries

  • Versioning: Plugin versions managed through Docker image tags

  • Isolation: Each plugin execution runs in isolated container

  • Contract: Simple stdin/stdout JSON interface

  • Developer Freedom: Plugins can use any language, libraries, or tools packaged in the container
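A plugin honoring the stdin/stdout JSON contract could be as small as this; the field names (`input`, `output`) are assumptions, since the document specifies only that JSON flows over stdin and stdout:

```python
import json
import sys

def handle(request: dict) -> dict:
    # Example transformation: uppercase a value from the pipeline context.
    return {"output": str(request.get("input", "")).upper()}

def main() -> None:
    # Read one JSON request from stdin, write one JSON response to stdout.
    json.dump(handle(json.load(sys.stdin)), sys.stdout)

if __name__ == "__main__":
    main()
```

Packaged as a Docker image, a script like this is pulled and executed by the agent exactly like any service image, which is what makes the plugin system language-agnostic.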

UI

  • Web interface for users and admins

  • Displays deployment status and countdown timers

  • Provides log viewing and real-time log streaming

  • Allows deployment management based on permissions

  • Admin Capabilities:

    • Configure container registries (add/remove registries, manage credentials)

    • Define services with tag modes and patterns

    • Create and edit pipelines using visual node-based editor (creates new version on save)

    • Configure plugins (specify Docker images from container registries)

    • Manage nodes and environments (including ingress subdomain configuration and DNS setup)

    • Create and manage teams

    • Assign team permissions to projects/environments

    • Generate and rotate agent authentication tokens

    • Approve agent certificate thumbprints after certificate changes

    • Configure node connection details (hostname, WebSocket port, authentication token)

    • Disable/enable nodes for maintenance

  • User Capabilities:

    • Create deployments with service version selection

    • Clone existing deployments with version modifications

    • View deployment status, logs, metrics, and health checks

    • Cancel in-progress deployments

    • Delete deployments (with Full Access permission)

    • Extend deployment lifetime before auto-cleanup

19 October 2025