
Operations Guide - Calmia Nexus

Audience: DevOps, SRE, System Administrators
Last updated: February 2026
Version: 1.0.0


Contents

  1. Infrastructure Architecture
  2. Services and Components
  3. Docker and Containers
  4. Environment Variables
  5. PostgreSQL Database
  6. RabbitMQ Message Broker
  7. Monitoring and Observability
  8. Health Checks
  9. Production Deployment
  10. Operational Runbooks
  11. Troubleshooting
  12. Backup and Recovery
  13. Operational Security

1. Infrastructure Architecture

1.1 Architecture Diagram

graph TB
    subgraph "Internet"
        USER[Users]
        WH_SENTRY[Sentry Webhooks]
        WH_WA[WhatsApp Webhooks]
        WH_CU[ClickUp Webhooks]
    end

    subgraph "Cloudflare"
        CF[Cloudflare DNS + WAF]
    end

    subgraph "Production - ns3032187"
        subgraph "IIS Web Server"
            ADMIN[Admin UI<br/>app.kalmiazen.com<br/>:443]
            API[API REST<br/>api.kalmiazen.com<br/>:443]
            MCP[MCP Server<br/>mcp.kalmiazen.com<br/>:443]
        end

        subgraph "Local Services"
            WORKERS[Orchestrator.Workers<br/>Background Jobs]
            WA_SVC[WhatsApp Service<br/>:8000]
        end

        subgraph "Infrastructure"
            PG[(PostgreSQL 16<br/>:5432)]
            RMQ[(RabbitMQ<br/>:5672/:15672)]
        end

        subgraph "Tunnels"
            NGROK[ngrok<br/>Webhooks]
        end
    end

    subgraph "External Services"
        CLAUDE[Anthropic Claude API]
        CLICKUP[ClickUp API]
        SENTRY[Sentry.io]
        ULTRAMSG[UltraMsg WhatsApp]
        HOLDED[Holded API]
    end

    USER --> CF --> ADMIN & API & MCP
    WH_SENTRY --> NGROK --> API
    WH_WA --> NGROK --> WA_SVC
    WH_CU --> API

    ADMIN --> API
    API --> PG & RMQ & WORKERS
    WORKERS --> PG & RMQ
    WA_SVC --> PG & RMQ

    API --> CLAUDE & CLICKUP & SENTRY & HOLDED
    WA_SVC --> ULTRAMSG

1.2 Technology Stack

| Layer | Technology | Version | Port |
|---|---|---|---|
| Web server | IIS 10 | Windows Server 2019 | 80/443 |
| Runtime | .NET 9 | 9.0.x | - |
| Database | PostgreSQL 16 | Alpine | 5432 |
| Message broker | RabbitMQ | 3-management | 5672/15672 |
| Reverse proxy | Cloudflare | - | - |
| Containers | Docker Desktop | 4.x | - |
| Tunneling | ngrok | 3.x | - |

1.3 Production URLs

| Service | URL | Description |
|---|---|---|
| Admin UI | https://app.kalmiazen.com | Blazor administration panel |
| REST API | https://api.kalmiazen.com | API endpoints |
| API Swagger | https://api.kalmiazen.com/swagger | OpenAPI documentation |
| MCP Server | https://mcp.kalmiazen.com | Model Context Protocol |
| MCP Swagger | https://mcp.kalmiazen.com/swagger | MCP documentation |
| RabbitMQ UI | http://localhost:15672 | Management (local only) |

2. Services and Components

2.1 Service Map

graph LR
    subgraph "Frontend"
        A[Orchestrator.Admin<br/>Blazor SSR]
    end

    subgraph "Backend"
        B[Orchestrator.Api<br/>REST + SignalR]
        C[Orchestrator.Workers<br/>Background Jobs]
        D[Orchestrator.Mcp.Remote<br/>MCP Server]
    end

    subgraph "Messaging"
        E[WhatsappService<br/>UltraMsg Integration]
    end

    subgraph "Shared"
        F[Shared.Admin<br/>Entities + Services]
        G[Shared.Events<br/>DTOs + Events]
    end

    A --> B
    B --> C
    B --> F & G
    C --> F & G
    D --> B
    E --> F & G

2.2 Responsibilities by Service

| Service | Responsibility | Dependencies |
|---|---|---|
| Orchestrator.Admin | Blazor UI, Workspace, Chat | API |
| Orchestrator.Api | REST endpoints, SignalR hubs, Auth | PostgreSQL, RabbitMQ |
| Orchestrator.Workers | Background jobs, synchronization | PostgreSQL, RabbitMQ |
| Orchestrator.Mcp.Remote | MCP OAuth2, Tools, Resources | API |
| WhatsappService | UltraMsg webhooks, messages | PostgreSQL, RabbitMQ |

2.3 Ports by Environment

| Service | Development | Docker | Production (IIS) |
|---|---|---|---|
| Admin | 5002 | - | 443 (app.kalmiazen.com) |
| API | 5001 | - | 443 (api.kalmiazen.com) |
| MCP | 5003 | 5003 | 443 (mcp.kalmiazen.com) |
| Workers | - | - | Background process (no port) |
| WhatsApp | 5004 | 8000 | 8000 |
| PostgreSQL | 5432 | 5432 | 5432 |
| RabbitMQ | 5672/15672 | 5672/15672 | 5672/15672 |
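To confirm that the ports in the table are actually accepting connections on a host (for example after a restart), a plain TCP probe is enough. A minimal sketch in Python, meant to run on the server itself; the `PORTS` mapping copies the production values from the table, and `is_port_open` / `check_ports` are illustrative helpers, not part of the platform:

```python
import socket

# Production ports from the table above, as seen from the server itself
PORTS = {
    "IIS HTTPS": 443,
    "WhatsApp": 8000,
    "PostgreSQL": 5432,
    "RabbitMQ AMQP": 5672,
    "RabbitMQ Management": 15672,
}


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable or timed out
        return False


def check_ports(host: str = "127.0.0.1", ports: dict = PORTS) -> dict:
    """Map each service name to whether its port accepts TCP connections."""
    return {name: is_port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, ok in check_ports().items():
        print(f"{name:<22} {'OPEN' if ok else 'CLOSED'}")
```

An open port only means something accepted the TCP connection; whether the service behind it is healthy is what the /health endpoints in section 8 report.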

3. Docker and Containers

3.1 Docker Compose - Development

# docker-compose.yml
version: "3.9"

services:
  rabbitmq:
    image: rabbitmq:3-management
    container_name: rabbitmq
    ports:
      - "5672:5672"   # AMQP
      - "15672:15672" # Management UI
    environment:
      - RABBITMQ_DEFAULT_USER=${RABBITMQ_DEFAULT_USER:-admin}
      - RABBITMQ_DEFAULT_PASS=${RABBITMQ_DEFAULT_PASS:-admin}
    volumes:
      - rabbitmq_data:/var/lib/rabbitmq
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 5s
      timeout: 5s
      retries: 10
      start_period: 30s

  postgres:
    image: postgres:16-alpine
    container_name: nexus-db
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_USER=${POSTGRES_USER:-nexus}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD:-nexus_dev_2026}
      - POSTGRES_DB=${POSTGRES_DB:-nexus}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      # $$ escapes $ for Compose, so pg_isready uses the container's own variables
      test: ["CMD-SHELL", "pg_isready -U $${POSTGRES_USER} -d $${POSTGRES_DB}"]
      interval: 5s
      timeout: 5s
      retries: 5

  whatsapp-service:
    build:
      context: ./Messaging/WhatsappService
      dockerfile: Dockerfile
    container_name: whatsapp-service
    ports:
      - "8000:80"
    environment:
      - ASPNETCORE_ENVIRONMENT=Docker
      - ASPNETCORE_URLS=http://+:80
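      # Inside the container, localhost is the container itself: point these connection
      # strings at the Compose service names (postgres, rabbitmq) instead of localhost
      - ConnectionStrings__NexusDb=${ConnectionStrings__NexusDb}
      - RabbitMq__ConnectionString=${RabbitMq__ConnectionString}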
      - UltraMsg__InstanceId=${UltraMsg__InstanceId}
      - UltraMsg__Token=${UltraMsg__Token}
      - Sentry__Dsn=${Sentry__Dsn}
    depends_on:
      rabbitmq:
        condition: service_healthy
      postgres:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 10s
    volumes:
      - ./logs/whatsapp:/app/logs

  mcp-remote:
    build:
      context: ./Orchestrator/src/Orchestrator.Mcp.Remote
      dockerfile: Dockerfile
    container_name: mcp-remote
    profiles:
      - mcp
    ports:
      - "5003:5003"
    environment:
      - ASPNETCORE_ENVIRONMENT=Docker
      - ASPNETCORE_URLS=http://+:5003
      - NEXUS_MCP_REMOTE_Orchestrator__BaseUrl=http://host.docker.internal:5001
      - NEXUS_MCP_REMOTE_OAuth2__SigningKey=${OAuth2__SigningKey}
      - NEXUS_MCP_REMOTE_OAuth2__Issuer=https://mcp.kalmiazen.com
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:5003/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 10s

volumes:
  rabbitmq_data:
  postgres_data:

3.2 Common Docker Commands

# Start infrastructure (PostgreSQL + RabbitMQ)
docker-compose up -d postgres rabbitmq

# Start the WhatsApp Service
docker-compose up -d whatsapp-service

# Start MCP Remote (requires the mcp profile)
docker-compose --profile mcp up -d mcp-remote

# Follow logs in real time
docker-compose logs -f whatsapp-service

# Rebuild a specific image without using the cache
docker-compose build --no-cache whatsapp-service

# Stop everything and delete volumes (CAUTION: deletes data)
docker-compose down -v

# Show service status
docker-compose ps

# Open psql inside the database container
docker exec -it nexus-db psql -U nexus -d nexus

3.3 Dockerfile - WhatsApp Service

FROM mcr.microsoft.com/dotnet/aspnet:9.0 AS base
WORKDIR /app
EXPOSE 80
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

FROM mcr.microsoft.com/dotnet/sdk:9.0 AS build
WORKDIR /src
COPY ["WhatsappService.csproj", "."]
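# Docker cannot COPY from outside the build context (./Messaging/WhatsappService in
# docker-compose.yml), so this line cannot reach a Shared folder in the parent
# directory; build from a parent directory with -f and adjust the paths if it is needed
COPY ["../Shared/", "../Shared/"]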
RUN dotnet restore "WhatsappService.csproj"
COPY . .
RUN dotnet build "WhatsappService.csproj" -c Release -o /app/build

FROM build AS publish
RUN dotnet publish "WhatsappService.csproj" -c Release -o /app/publish

FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
ENTRYPOINT ["dotnet", "WhatsappService.dll"]

3.4 Dockerfile - MCP Remote (hardened)

FROM mcr.microsoft.com/dotnet/sdk:9.0 AS build
WORKDIR /src

COPY ["Orchestrator.Mcp.Remote.csproj", "./"]
RUN dotnet restore "Orchestrator.Mcp.Remote.csproj"

COPY . .
RUN dotnet build "Orchestrator.Mcp.Remote.csproj" -c Release -o /app/build
RUN dotnet publish "Orchestrator.Mcp.Remote.csproj" -c Release -o /app/publish

FROM mcr.microsoft.com/dotnet/aspnet:9.0 AS final
WORKDIR /app

# Install curl for health checks
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

# Create a non-root user
RUN adduser --disabled-password --gecos "" appuser
USER appuser

COPY --from=build /app/publish .

ENV ASPNETCORE_URLS=http://+:5003
ENV ASPNETCORE_ENVIRONMENT=Production

EXPOSE 5003

HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:5003/health || exit 1

ENTRYPOINT ["dotnet", "Orchestrator.Mcp.Remote.dll"]

4. Environment Variables

4.1 .env.example File

# ============================================
# CALMIA NEXUS - ENVIRONMENT VARIABLES
# ============================================
# Copy this file to .env and fill in the values
# NEVER commit the .env file to the repository

# --------------------------------------------
# DATABASE - PostgreSQL
# --------------------------------------------
DB_HOST=localhost
DB_PORT=5432
DB_NAME=nexus
DB_USER=nexus
DB_PASSWORD=<REQUIRED>

# For Docker
POSTGRES_USER=nexus
POSTGRES_DB=nexus
POSTGRES_PASSWORD=<REQUIRED>

# Full connection string
ConnectionStrings__NexusDb=Host=localhost;Port=5432;Database=nexus;Username=nexus;Password=<PASSWORD>;Encoding=UTF8

# --------------------------------------------
# MESSAGE BROKER - RabbitMQ
# --------------------------------------------
RABBITMQ_HOST=localhost
RABBITMQ_PORT=5672
RABBITMQ_USER=admin
RABBITMQ_PASSWORD=<REQUIRED>

# For Docker
RABBITMQ_DEFAULT_USER=admin
RABBITMQ_DEFAULT_PASS=<REQUIRED>

# Connection string
RabbitMq__ConnectionString=amqp://admin:<PASSWORD>@localhost:5672/

# --------------------------------------------
# API KEYS - AI SERVICES
# --------------------------------------------
# Anthropic Claude
Claude__ApiKey=<REQUIRED>
Claude__Model=claude-sonnet-4-20250514

# --------------------------------------------
# EXTERNAL INTEGRATIONS
# --------------------------------------------
# ClickUp (task management)
ClickUp__ApiToken=<REQUIRED>
ClickUp__WebhookSecret=<REQUIRED>
ClickUp__DefaultTeamId=90152294725
ClickUp__DefaultSpaceId=90159677191

# UltraMsg (WhatsApp)
UltraMsg__InstanceId=<REQUIRED>
UltraMsg__Token=<REQUIRED>

# Holded (projects/invoicing)
Holded__ApiKey=<REQUIRED>

# --------------------------------------------
# MONITORING - Sentry
# --------------------------------------------
Sentry__Dsn=<REQUIRED>
SENTRY_AUTH_TOKEN=<REQUIRED>
Sentry__OrganizationSlug=<REQUIRED>
Sentry__ProjectSlug=<REQUIRED>

# --------------------------------------------
# EMAIL - SMTP
# --------------------------------------------
Smtp__Host=smtp.gmail.com
Smtp__Port=587
Smtp__Username=<REQUIRED>
Smtp__Password=<REQUIRED>
Smtp__UseSsl=true
Smtp__FromEmail=<REQUIRED>
Smtp__FromName=Nexus Platform

# --------------------------------------------
# SECURITY
# --------------------------------------------
# Encryption key (32+ characters)
Encryption__Key=<REQUIRED>

# JWT for the Tunnel Service
Tunnel__SecretKey=<REQUIRED>
Tunnel__Issuer=nexus-tunnel
Tunnel__Audience=tunnel-clients

# OAuth2 for MCP Remote (32+ characters)
OAuth2__SigningKey=<REQUIRED>

# OAuth2 clients
OAuth2__Clients__0__ClientSecret=<REQUIRED>  # Admin
OAuth2__Clients__1__ClientSecret=<REQUIRED>  # Desktop
OAuth2__Clients__3__ClientSecret=<REQUIRED>  # Swagger

# OAuth2 users
OAuth2__Users__0__Username=<REQUIRED>
OAuth2__Users__0__Password=<REQUIRED>
OAuth2__Users__0__DisplayName=<REQUIRED>

# --------------------------------------------
# SERVICE URLS
# --------------------------------------------
Api__BaseUrl=https://api.kalmiazen.com
Admin__BaseUrl=https://app.kalmiazen.com
Mcp__BaseUrl=https://mcp.kalmiazen.com

# --------------------------------------------
# NOTIFICATIONS
# --------------------------------------------
Notifications__AdminWhatsapp=<ADMIN_NUMBER>
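`RabbitMq__ConnectionString` is an AMQP URI, so a password containing characters such as `@`, `/`, `:` or `+` (common in generated secrets) has to be percent-encoded, or the client may misread where the credentials end and the host begins. A minimal sketch of building the value safely in Python; `build_amqp_uri` is a hypothetical helper that reproduces the `amqp://user:password@host:port/` shape of the example above:

```python
from urllib.parse import quote


def build_amqp_uri(user: str, password: str, host: str = "localhost", port: int = 5672) -> str:
    """Return an AMQP URI shaped like the .env example, with encoded credentials.

    quote(..., safe="") percent-encodes every reserved character,
    including '/', ':' and '@', so the URI parses unambiguously.
    """
    return f"amqp://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}/"


if __name__ == "__main__":
    print(build_amqp_uri("admin", "p@ss/w+rd="))
```

The printed value, `amqp://admin:p%40ss%2Fw%2Brd%3D@localhost:5672/`, can be pasted into `.env` as is; the client decodes the credentials when it parses the URI.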

4.2 Variable Validation

# Script to validate required variables
$required = @(
    "ConnectionStrings__NexusDb",
    "RabbitMq__ConnectionString",
    "Claude__ApiKey",
    "Sentry__Dsn",
    "Encryption__Key"
)

$missing = @()
foreach ($var in $required) {
    if (-not [Environment]::GetEnvironmentVariable($var)) {
        $missing += $var
    }
}

if ($missing.Count -gt 0) {
    Write-Host "Missing variables:" -ForegroundColor Red
    $missing | ForEach-Object { Write-Host "  - $_" }
    exit 1
}

Write-Host "All required variables are set" -ForegroundColor Green

4.3 Loading Variables from .env

# cargar-env.ps1
function Load-EnvFile {
    param([string]$Path = ".env")

    if (Test-Path $Path) {
        Get-Content $Path | ForEach-Object {
            # Match NAME=VALUE lines; lines starting with '#' are skipped
            if ($_ -match '^([^#][^=]+)=(.*)$') {
                $name = $matches[1].Trim()
                # Drop inline comments such as "<value>  # Admin"
                $value = ($matches[2] -replace '\s+#.*$', '').Trim()
                [Environment]::SetEnvironmentVariable($name, $value, "Process")
                Write-Host "  Loaded: $name" -ForegroundColor DarkGray
            }
        }
        Write-Host "Variables loaded from $Path" -ForegroundColor Green
    } else {
        Write-Host "File $Path not found" -ForegroundColor Yellow
    }
}

# Usage
Load-EnvFile -Path ".env"
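The loading and validation steps above can be combined into a single pre-flight check that also catches values left as placeholders from `.env.example`. A minimal cross-platform sketch in Python, assuming the same `NAME=VALUE` format with `#` comments, and assuming that any value written as `<...>` is an unfilled placeholder; `parse_env` and `find_problems` are hypothetical helpers:

```python
import re

# Same list as the PowerShell validation script
REQUIRED = [
    "ConnectionStrings__NexusDb",
    "RabbitMq__ConnectionString",
    "Claude__ApiKey",
    "Sentry__Dsn",
    "Encryption__Key",
]

LINE = re.compile(r"^\s*([^#=\s][^=]*?)\s*=(.*)$")  # NAME=VALUE, not a comment
PLACEHOLDER = re.compile(r"^<[^>]+>$")               # e.g. <REQUIRED>


def parse_env(text: str) -> dict:
    """Parse NAME=VALUE lines, skipping blanks/comments and dropping inline ' # ...' comments."""
    values = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            value = re.sub(r"\s+#.*$", "", match.group(2)).strip()
            values[match.group(1)] = value
    return values


def find_problems(values: dict, required=REQUIRED) -> dict:
    """Return {name: reason} for required variables that are missing, empty or placeholders."""
    problems = {}
    for name in required:
        value = values.get(name, "")
        if not value:
            problems[name] = "missing"
        elif PLACEHOLDER.match(value):
            problems[name] = "placeholder"
    return problems
```

Because the first `=` ends the name, values that themselves contain `=` (such as the Npgsql connection string) are kept whole.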

5. PostgreSQL Database

5.1 Connection Configuration

// appsettings.json (no credentials)
{
  "ConnectionStrings": {
    "NexusDb": ""  // Read from the ConnectionStrings__NexusDb environment variable
  }
}

// Program.cs
builder.Services.AddDbContext<NexusDbContext>(options =>
{
    var connectionString = builder.Configuration.GetConnectionString("NexusDb");
    options.UseNpgsql(connectionString, npgsqlOptions =>
    {
        npgsqlOptions.EnableRetryOnFailure(
            maxRetryCount: 5,
            maxRetryDelay: TimeSpan.FromSeconds(30),
            errorCodesToAdd: null
        );
        npgsqlOptions.CommandTimeout(120);
    });
});

5.2 Entity Framework Migrations

# Create a new migration
dotnet ef migrations add MigrationName -p Orchestrator.Api -c NexusDbContext

# Apply migrations
dotnet ef database update -p Orchestrator.Api

# List migrations, including pending ones
dotnet ef migrations list -p Orchestrator.Api

# Generate a SQL script
dotnet ef migrations script -p Orchestrator.Api -o migration.sql

# Roll the database back to an earlier migration (replace PreviousMigration with its name)
dotnet ef database update PreviousMigration -p Orchestrator.Api

5.3 PostgreSQL Maintenance

-- Table sizes (largest first)
SELECT
    relname AS table_name,
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
    pg_size_pretty(pg_relation_size(relid)) AS data_size,
    pg_size_pretty(pg_indexes_size(relid)) AS index_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;

-- Unused indexes (no scans since statistics were last reset)
SELECT
    schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

-- Vacuum and analyze
VACUUM ANALYZE;

-- Connections to the nexus database
SELECT
    pid, usename, application_name, client_addr,
    state, query_start, query
FROM pg_stat_activity
WHERE datname = 'nexus';

-- Terminate a specific connection (replace <pid> with a pid from the query above)
SELECT pg_terminate_backend(<pid>);

-- Blocked sessions and the sessions blocking them (PostgreSQL 9.6+)
SELECT
    blocked.pid AS blocked_pid,
    blocking.pid AS blocking_pid,
    blocked.usename AS blocked_user,
    blocking.usename AS blocking_user,
    blocked.query AS blocked_query
FROM pg_catalog.pg_stat_activity AS blocked
JOIN pg_catalog.pg_stat_activity AS blocking
    ON blocking.pid = ANY (pg_catalog.pg_blocking_pids(blocked.pid));

5.4 Backup and Restore

# Full backup (custom format, restorable with pg_restore; $(date ...) is bash syntax)
pg_dump -U nexus -d nexus -F c -f backup_$(date +%Y%m%d_%H%M%S).dump

# Schema-only backup
pg_dump -U nexus -d nexus --schema-only -f schema.sql

# Data-only backup
pg_dump -U nexus -d nexus --data-only -f data.sql

# Restore a custom-format dump (-c drops existing objects before recreating them)
pg_restore -U nexus -d nexus -c backup.dump

# Compressed plain-SQL backup
pg_dump -U nexus -d nexus | gzip > backup.sql.gz

# Restore from the compressed backup
gunzip -c backup.sql.gz | psql -U nexus -d nexus

6. RabbitMQ Message Broker

6.1 Configuration

// Client configuration (RabbitMQ.Client 6.x API; 7.x replaces CreateConnection with CreateConnectionAsync)
builder.Services.AddSingleton<IConnection>(sp =>
{
    var factory = new ConnectionFactory
    {
        Uri = new Uri(builder.Configuration["RabbitMq:ConnectionString"]!),
        AutomaticRecoveryEnabled = true,
        NetworkRecoveryInterval = TimeSpan.FromSeconds(10)
    };
    return factory.CreateConnection();
});

6.2 Defined Queues

| Queue | Purpose | Consumers |
|---|---|---|
| task_executions | Task executions | Workers |
| copilot_requests | Copilot requests | Workers |
| whatsapp_messages | WhatsApp messages | WhatsApp Service |
| notifications | General notifications | Workers |
| sync_events | Synchronization events | Workers |

6.3 Administration Commands

# List queues with message and consumer counts
rabbitmqctl list_queues name messages consumers

# List connections
rabbitmqctl list_connections user peer_host state

# Purge a queue (CAUTION: deletes all pending messages)
rabbitmqctl purge_queue task_executions

# Node status and statistics
rabbitmqctl status

# Export definitions (backup)
rabbitmqctl export_definitions definitions.json

# Import definitions (restore)
rabbitmqctl import_definitions definitions.json

6.4 Management UI

  • URL: http://localhost:15672
  • User: admin
  • Password: (see .env)

Features:

  • View queues and pending messages
  • Monitor throughput
  • Create/delete queues and exchanges
  • View active connections


7. Monitoring and Observability

7.1 Monitoring Stack

graph LR
    APP[Applications] --> SENTRY[Sentry<br/>Errors]
    APP --> SERILOG[Serilog<br/>Logs]

    SERILOG --> CONSOLE[Console]
    SERILOG --> FILE[Files]
    SERILOG --> PG_LOGS[(PostgreSQL<br/>system_logs)]

    SENTRY --> DASHBOARD[Dashboard<br/>Sentry.io]

7.2 Sentry Configuration

// Program.cs
builder.WebHost.UseSentry(options =>
{
    options.Dsn = builder.Configuration["Sentry:Dsn"];
    options.Environment = builder.Environment.EnvironmentName;
    options.Release = "calmia-nexus@1.0.0";
    options.AutoSessionTracking = true;
    options.TracesSampleRate = 1.0; // Traces 100% of requests; lower it if trace volume or quota becomes an issue
    options.Debug = builder.Environment.IsDevelopment();
    options.SetBeforeSend((sentryEvent, hint) =>
    {
        sentryEvent.SetTag("service", "orchestrator-api");
        return sentryEvent;
    });
});

// Middleware
app.UseSentryTracing();

7.3 Serilog Configuration

// Full configuration
Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Information()
    .MinimumLevel.Override("Microsoft.AspNetCore", LogEventLevel.Warning)
    .MinimumLevel.Override("Microsoft.EntityFrameworkCore", LogEventLevel.Warning)
    .Enrich.FromLogContext()
    .Enrich.WithMachineName()
    .Enrich.WithEnvironmentName()
    .WriteTo.Console(
        outputTemplate: "[{Timestamp:HH:mm:ss} {Level:u3}] {Message:lj}{NewLine}{Exception}"
    )
    .WriteTo.File(
        path: "logs/nexus-.log",
        rollingInterval: RollingInterval.Day,
        retainedFileCountLimit: 30
    )
    .WriteTo.PostgreSQL(
        connectionString: connectionString,
        tableName: "system_logs",
        needAutoCreateTable: true
    )
    .WriteTo.Sentry()
    .CreateLogger();

7.4 Key Metrics to Monitor

| Metric | Warning threshold | Critical threshold | Action |
|---|---|---|---|
| CPU usage | >70% | >90% | Scale up |
| Memory usage | >80% | >95% | Restart/scale |
| Request latency (P95) | >500 ms | >2000 ms | Investigate |
| Error rate | >1% | >5% | Alert |
| Queue depth (messages) | >1000 | >5000 | Add workers |
| DB connections | >80 | >95 | Review the connection pool |
| Disk usage | >70% | >90% | Clean up/expand |
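The thresholds in the table can be encoded directly so that an alerting script classifies a sample the same way an operator would. A sketch in Python, assuming each metric arrives as a plain number in the table's unit (percent, milliseconds, messages, connections); the metric keys are hypothetical names, not ones emitted by the platform:

```python
# (warning, critical) thresholds from the table; a value strictly above a
# threshold triggers that level, matching the ">" used in the table.
THRESHOLDS = {
    "cpu_percent": (70, 90),
    "memory_percent": (80, 95),
    "latency_p95_ms": (500, 2000),
    "error_rate_percent": (1, 5),
    "queue_depth": (1000, 5000),
    "db_connections": (80, 95),
    "disk_percent": (70, 90),
}


def classify(metric: str, value: float) -> str:
    """Return "CRITICAL", "WARNING" or "OK" for one metric sample."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "CRITICAL"
    if value > warning:
        return "WARNING"
    return "OK"


def classify_all(samples: dict) -> dict:
    """Classify every known metric in samples; unknown keys are ignored."""
    return {m: classify(m, v) for m, v in samples.items() if m in THRESHOLDS}
```

Keeping the thresholds in one dictionary means a change to the table is a one-line change in the script.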

7.5 Structured Logging

// Structured logging example: each placeholder becomes a named property
_logger.LogInformation(
    "Processing task {TaskId} for session {SessionId}, type: {TaskType}",
    taskId, sessionId, taskType
);

// With a scope: UserId and OrganizationId are attached to every entry logged inside it
using (_logger.BeginScope(new Dictionary<string, object>
{
    ["UserId"] = userId,
    ["OrganizationId"] = orgId
}))
{
    _logger.LogInformation("Starting operation");
    // ... operation
    _logger.LogInformation("Operation completed");
}

8. Health Checks

8.1 Health Endpoints

| Service | Endpoint | Checks |
|---|---|---|
| API | /health | DB, RabbitMQ, Claude API, Sentry |
| Admin | /health | API connectivity |
| MCP | /health | API, OAuth2 |
| WhatsApp | /health | DB, RabbitMQ, UltraMsg |

8.2 ASP.NET Core Implementation

// Program.cs
builder.Services.AddHealthChecks()
    .AddNpgSql(
        connectionString,
        name: "postgresql",
        tags: new[] { "db", "sql", "postgresql", "ready" }
    )
    .AddRabbitMQ(
        rabbitConnectionString,
        name: "rabbitmq",
        tags: new[] { "messaging", "rabbitmq", "ready" }
    )
    .AddUrlGroup(
        new Uri("https://api.anthropic.com/v1/health"),
        name: "claude-api",
        tags: new[] { "external", "ai" }
    )
    .AddCheck<SentryHealthCheck>("sentry");

// Endpoint
app.MapHealthChecks("/health", new HealthCheckOptions
{
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

// Readiness: only the checks tagged "ready" (PostgreSQL and RabbitMQ)
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready")
});

app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false // Runs no checks: only confirms that the app responds
});
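`UIResponseWriter.WriteHealthCheckUIResponse` (from the AspNetCore.HealthChecks.UI.Client package) writes JSON with an overall `status` and one object per registered check under `entries`, and by default ASP.NET Core answers HTTP 503 when the result is Unhealthy. A monitoring script usually only needs the names of the checks that are not `Healthy`. A sketch in Python, assuming that response shape; `fetch_health` and `failing_checks` are hypothetical helpers and the test payload is illustrative, not captured from the platform:

```python
import json
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_health(url: str, timeout: float = 10.0) -> str:
    """GET a health endpoint; Unhealthy arrives as HTTP 503, so read the error body too."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except HTTPError as err:
        return err.read().decode("utf-8")


def overall_status(body: str) -> str:
    """Return the top-level status (Healthy, Degraded or Unhealthy)."""
    return json.loads(body).get("status", "Unknown")


def failing_checks(body: str) -> dict:
    """Return {check_name: status} for every entry whose status is not Healthy."""
    entries = json.loads(body).get("entries", {})
    return {
        name: entry.get("status", "Unknown")
        for name, entry in entries.items()
        if entry.get("status") != "Healthy"
    }
```

For example, `failing_checks(fetch_health("https://api.kalmiazen.com/health"))` returns an empty dict when every check passes.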

8.3 Custom Health Check

public class SentryHealthCheck : IHealthCheck
{
    private readonly IConfiguration _config;

    public SentryHealthCheck(IConfiguration config) => _config = config;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var dsn = _config["Sentry:Dsn"];

        if (string.IsNullOrEmpty(dsn))
            return Task.FromResult(HealthCheckResult.Degraded("Sentry DSN not configured"));

        return Task.FromResult(HealthCheckResult.Healthy("Sentry configured"));
    }
}

8.4 Docker Health Checks

# docker-compose.yml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s

9. Production Deployment

9.1 Main Script: restart-platform.ps1

# Parameters
param(
    [switch]$SkipBuild,      # Skip the build
    [switch]$Force,          # Force the restart even if errors occur
    [switch]$WaitForReady    # Wait until everything is ready
)

# Configuration
$IIS_PATHS = @{
    Admin = "C:\inetpub\app.kalmiazen.com"
    Api   = "C:\inetpub\api.kalmiazen.com"
    Mcp   = "C:\inetpub\mcp.kalmiazen.com"
}

$APP_POOLS = @("NexusAdminPool", "NexusApiPool", "NexusMcpPool")

# appcmd.exe lives in the IIS folder, which is not on PATH by default
$APPCMD = Join-Path $env:windir "System32\inetsrv\appcmd.exe"

# 1. Load environment variables
. .\cargar-env.ps1

# 2. Stop services
Write-Host "Stopping services..." -ForegroundColor Yellow
foreach ($pool in $APP_POOLS) {
    & $APPCMD stop apppool /apppool.name:$pool 2>$null
}
Stop-Process -Name "Orchestrator.Workers" -Force -ErrorAction SilentlyContinue
Stop-Process -Name "WhatsappService" -Force -ErrorAction SilentlyContinue
Stop-Process -Name "ngrok" -Force -ErrorAction SilentlyContinue

# 3. Check infrastructure
Write-Host "Checking infrastructure..." -ForegroundColor Yellow

# PostgreSQL
$pgService = Get-Service "postgresql-x64-16" -ErrorAction SilentlyContinue
if ($pgService.Status -ne "Running") {
    Start-Service "postgresql-x64-16"
    Start-Sleep -Seconds 5
}

# RabbitMQ
$rmqService = Get-Service "RabbitMQ" -ErrorAction SilentlyContinue
if ($rmqService.Status -ne "Running") {
    Start-Service "RabbitMQ"
    Start-Sleep -Seconds 10
}

# 4. Build (unless -SkipBuild)
if (-not $SkipBuild) {
    Write-Host "Building solution..." -ForegroundColor Yellow
    dotnet build --configuration Release
    if ($LASTEXITCODE -ne 0) { throw "Build failed" }
}

# 5. Publish to IIS
Write-Host "Publishing to IIS..." -ForegroundColor Yellow
dotnet publish Orchestrator/src/Orchestrator.Admin -c Release -o $IIS_PATHS.Admin
dotnet publish Orchestrator/src/Orchestrator.Api -c Release -o $IIS_PATHS.Api
dotnet publish Orchestrator/src/Orchestrator.Mcp.Remote -c Release -o $IIS_PATHS.Mcp

# Copy .env into each site directory
foreach ($path in $IIS_PATHS.Values) {
    Copy-Item ".env" -Destination $path -Force
}

# 6. Apply migrations
Write-Host "Applying migrations..." -ForegroundColor Yellow
Push-Location Orchestrator/src/Orchestrator.Api
dotnet ef database update
Pop-Location

# 7. Start Workers
Write-Host "Starting Workers..." -ForegroundColor Yellow
Start-Process -FilePath "dotnet" -ArgumentList "run --configuration Release" `
    -WorkingDirectory "Orchestrator/src/Orchestrator.Workers" `
    -WindowStyle Hidden

# 8. Start the WhatsApp Service
Write-Host "Starting WhatsApp Service..." -ForegroundColor Yellow
Start-Process -FilePath "dotnet" -ArgumentList "run --configuration Release" `
    -WorkingDirectory "Messaging/WhatsappService" `
    -WindowStyle Hidden

# 9. Start the App Pools
Write-Host "Starting IIS App Pools..." -ForegroundColor Yellow
foreach ($pool in $APP_POOLS) {
    & $APPCMD start apppool /apppool.name:$pool
}

# 10. Start ngrok
Write-Host "Starting ngrok..." -ForegroundColor Yellow
Start-Process -FilePath "ngrok" -ArgumentList "start --all --config=ngrok.yml" `
    -WindowStyle Hidden

# 11. Summary
Write-Host "`n=== DEPLOYMENT COMPLETE ===" -ForegroundColor Green
Write-Host "Admin:   https://app.kalmiazen.com"
Write-Host "API:     https://api.kalmiazen.com"
Write-Host "MCP:     https://mcp.kalmiazen.com"
Write-Host "Swagger: https://api.kalmiazen.com/swagger"
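The script declares a `-WaitForReady` switch but does not show what it waits for. One way to give it meaning is to poll the public health endpoints until they all answer, or until a deadline passes. A sketch of that logic in Python; the timeout, interval and helper names are assumptions, and `check`, `sleep` and `now` are injectable so the loop can be exercised without a network:

```python
import time
from typing import Callable, Iterable
from urllib.request import urlopen

HEALTH_URLS = [
    "https://app.kalmiazen.com/health",
    "https://api.kalmiazen.com/health",
    "https://mcp.kalmiazen.com/health",
]


def http_ok(url: str) -> bool:
    """True if url answers with HTTP 2xx within 10 seconds."""
    try:
        with urlopen(url, timeout=10) as resp:
            return 200 <= resp.status < 300
    except OSError:  # HTTPError/URLError (e.g. 503 while warming up) and timeouts
        return False


def wait_until_ready(urls: Iterable[str], check: Callable[[str], bool] = http_ok,
                     timeout: float = 300, interval: float = 10,
                     sleep: Callable[[float], None] = time.sleep,
                     now: Callable[[], float] = time.monotonic) -> list:
    """Poll until every URL passes check or timeout expires.

    Returns the URLs still failing; an empty list means everything is ready.
    """
    pending = list(urls)
    deadline = now() + timeout
    while True:
        pending = [u for u in pending if not check(u)]
        if not pending or now() >= deadline:
            return pending
        sleep(interval)
```

A URL that has passed once is not polled again, so a service that fails after its first success is not caught here; the post-deployment health-check item in the checklist still applies.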

9.2 Deployment Checklist

## Pre-Deployment
- [ ] Database backup taken
- [ ] Environment variables updated
- [ ] Tests passing in CI
- [ ] Changelog updated
- [ ] Version tag created

## Deployment
- [ ] Services stopped cleanly
- [ ] Build succeeded
- [ ] Migrations applied
- [ ] Files published
- [ ] .env copied to the target directories

## Post-Deployment
- [ ] Health checks passing
- [ ] No critical errors in the logs
- [ ] Key features verified
- [ ] Monitoring shows normal metrics
- [ ] Team notified

9.3 Rollback

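# rollback.ps1
param(
    [Parameter(Mandatory)]
    [string]$Version  # Git tag to restore
)

# 1. Stop services (restart-platform.ps1 has no stop-only switch, so stop them here)
$appcmd = Join-Path $env:windir "System32\inetsrv\appcmd.exe"
foreach ($pool in @("NexusAdminPool", "NexusApiPool", "NexusMcpPool")) {
    & $appcmd stop apppool /apppool.name:$pool 2>$null
}
Stop-Process -Name "Orchestrator.Workers", "WhatsappService", "ngrok" -Force -ErrorAction SilentlyContinue

# 2. Check out the previous version
git checkout $Version
if ($LASTEXITCODE -ne 0) { throw "git checkout $Version failed" }

# 3. Restore the database backup
# Assumes a pre-deployment dump named backup_<tag>.dump in the current directory
$backupFile = "backup_$Version.dump"
if (Test-Path $backupFile) {
    pg_restore -U nexus -d nexus -c $backupFile
} else {
    Write-Warning "Backup $backupFile not found; the database was not restored"
}

# 4. Redeploy (full build)
.\restart-platform.ps1

Write-Host "Rollback to $Version complete" -ForegroundColor Green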

10. Operational Runbooks

10.1 RB-001: Service Not Responding

## Symptom
The service (API/Admin/MCP) does not respond to requests

## Diagnosis
1. Check the health endpoint: curl https://api.kalmiazen.com/health
2. Check IIS events: Get-EventLog -LogName Application -Source "IIS*" -Newest 20
3. Check the application logs: Get-Content C:\inetpub\api.kalmiazen.com\logs\*.log -Tail 100

## Resolution
1. Recycle the App Pool:
   appcmd.exe recycle apppool /apppool.name:NexusApiPool

2. If the problem persists, restart IIS:
   iisreset

3. If it still fails, check:
   - Connectivity to PostgreSQL
   - Connectivity to RabbitMQ
   - Environment variables

## Escalation
If it is not resolved within 15 minutes, escalate to development

10.2 RB-002: Slow Database

## Symptom
Slow queries, frequent timeouts

## Diagnosis
1. Active queries:
   SELECT * FROM pg_stat_activity WHERE state = 'active';

2. Slowest queries (requires the pg_stat_statements extension; since PostgreSQL 13 the column is total_exec_time):
   SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;

3. Locks waiting to be granted:
   SELECT * FROM pg_locks WHERE granted = false;

## Resolution
1. Terminate the offending queries (replace <pid> with the pid from step 1):
   SELECT pg_terminate_backend(<pid>);

2. Run VACUUM:
   VACUUM ANALYZE;

3. If indexes are missing:
   CREATE INDEX CONCURRENTLY idx_name ON table_name (column_name);

## Prevention
- Review the EXPLAIN plans of new queries
- Monitor pg_stat_statements regularly

10.3 RB-003: RabbitMQ Queue Backlog

## Symptom
Messages accumulate in a queue and workers do not process them

## Diagnosis
1. Queue status:
   rabbitmqctl list_queues name messages consumers

2. Consumers:
   rabbitmqctl list_consumers

## Resolution
1. Check that Workers is running:
   Get-Process -Name "*Orchestrator.Workers*"

2. Restart Workers (same working directory as in restart-platform.ps1):
   Stop-Process -Name "Orchestrator.Workers" -Force
   Start-Process dotnet "run --configuration Release" -WorkingDirectory "Orchestrator/src/Orchestrator.Workers" -WindowStyle Hidden

3. If there are corrupt messages, purge the queue (CAUTION: deletes all pending messages):
   rabbitmqctl purge_queue queue_name

## Escalation
If messages keep accumulating, add workers or investigate the bottleneck
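The diagnosis above can be automated through the RabbitMQ management HTTP API: `GET /api/queues` on port 15672 returns a JSON array in which each queue has, among other fields, `name`, `messages` and `consumers`. A sketch in Python that flags the two patterns this runbook looks for: a queue with pending messages but no consumers, and a queue above the critical depth from the metrics table in 7.4. The constant and helper names are assumptions:

```python
import base64
import json
from urllib.request import Request, urlopen

CRITICAL_DEPTH = 5000  # critical queue depth from the metrics table in 7.4


def fetch_queues(base_url: str, user: str, password: str) -> list:
    """GET /api/queues from the RabbitMQ management API using basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = Request(f"{base_url}/api/queues", headers={"Authorization": f"Basic {token}"})
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)


def find_backlogs(queues: list, critical_depth: int = CRITICAL_DEPTH) -> list:
    """Return (queue_name, reason) pairs for queues that need attention."""
    findings = []
    for q in queues:
        # Counters can be missing until the management plugin has collected stats
        messages = q.get("messages", 0)
        if messages > 0 and q.get("consumers", 0) == 0:
            findings.append((q["name"], "no consumers"))
        elif messages > critical_depth:
            findings.append((q["name"], "critical depth"))
    return findings
```

Usage: `find_backlogs(fetch_queues("http://localhost:15672", "admin", password))`. A "no consumers" finding points to step 1 of the resolution (Workers not running); a "critical depth" finding with consumers present points to the escalation path.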

10.4 RB-004: Out of Memory

## Symptom
OutOfMemoryException, application crashes

## Diagnosis
1. Check memory usage (with IIS in-process hosting, the ASP.NET Core default, Admin/API/MCP run inside w3wp.exe rather than dotnet.exe):
   Get-Process -Name "w3wp", "dotnet", "Orchestrator.Workers", "WhatsappService" -ErrorAction SilentlyContinue | Select-Object Name, Id, WorkingSet64

2. Check Sentry for the stack trace

## Resolution
1. Recycle the App Pool:
   appcmd.exe recycle apppool /apppool.name:NexusApiPool

2. If the problem persists, raise the memory limit in IIS:
   - Application Pools > Advanced Settings
   - Private Memory Limit (KB)

3. Investigate memory leaks with dotnet-dump

## Prevention
- Configure periodic App Pool recycling
- Dispose IDisposable objects correctly

10.5 RB-005: Expired SSL Certificate

## Symptom
ERR_CERT_DATE_INVALID in the browser

## Diagnosis
1. Check the certificate's validity dates (behind the Cloudflare proxy this shows the edge certificate, not the origin one):
   openssl s_client -connect api.kalmiazen.com:443 -servername api.kalmiazen.com </dev/null | openssl x509 -noout -dates

## Resolution (Cloudflare)
1. Cloudflare dashboard > SSL/TLS > Edge Certificates
2. Check that the certificate is active
3. If an Origin Certificate is used, renew it on the server

## Resolution (direct IIS)
1. Issue a new certificate with certbot/win-acme
2. Import it under IIS > Server Certificates
3. Update the site's HTTPS binding

## Prevention
- Configure expiry alerts (30 days in advance)
- Use Cloudflare with automatic renewal
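For the 30-day expiry alert, Python's standard `ssl` module can read the certificate presented on port 443 and compute the days remaining. A sketch; the threshold comes from the prevention section above, the helper names are hypothetical, and behind the Cloudflare proxy it measures the edge certificate rather than the origin one:

```python
import socket
import ssl
import time
from typing import Optional

ALERT_DAYS = 30  # alert threshold from the runbook's prevention section


def days_until(not_after: str, now: Optional[float] = None) -> float:
    """Days from now until a certificate 'notAfter' value such as 'Jun  1 12:00:00 2026 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def cert_days_left(host: str, port: int = 443) -> float:
    """Open a verified TLS connection and return the days left on the server certificate.

    An already expired certificate fails verification, so this raises
    ssl.SSLCertVerificationError instead of returning a negative number.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return days_until(tls.getpeercert()["notAfter"])


def needs_alert(days_left: float, threshold: int = ALERT_DAYS) -> bool:
    """True when the certificate expires within the alert window."""
    return days_left <= threshold
```

A scheduled job would call `needs_alert(cert_days_left("api.kalmiazen.com"))` for each hostname and treat an `ssl.SSLCertVerificationError` as an alert in its own right.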

11. Troubleshooting

11.1 Common Problems

| Problem | Probable cause | Fix |
|---|---|---|
| 502 Bad Gateway | App Pool stopped | appcmd start apppool /apppool.name:NexusApiPool |
| Connection refused (DB) | PostgreSQL stopped | Start-Service postgresql-x64-16 |
| Queue not found | RabbitMQ restarted | Restart the app so it recreates its queues |
| Sentry not receiving events | Invalid DSN | Check the Sentry__Dsn variable |
| SignalR disconnects | IIS timeout | Increase the timeout in web.config |
| Memory leak | Misused IDisposable | Recycle the App Pool, investigate the code |

11.2 Diagnostic Commands

# Ports in use
netstat -ano | findstr "5001 5002 5003 5432 5672 8000"

# .NET processes (apps hosted in-process by IIS run inside w3wp)
Get-Process -Name "dotnet", "w3wp" -ErrorAction SilentlyContinue | Select-Object Id, ProcessName, WorkingSet64, CPU

# Windows event logs
Get-EventLog -LogName Application -Newest 50 | Where-Object { $_.Source -match "IIS|ASP.NET" }

# Test the PostgreSQL connection
psql -U nexus -d nexus -h localhost -c "SELECT 1"

# Check RabbitMQ node status
rabbitmqctl status

# Latest unresolved Sentry issues via the API (bash syntax; replace ORG and PROJECT with your slugs)
curl -H "Authorization: Bearer $SENTRY_AUTH_TOKEN" \
     "https://sentry.io/api/0/projects/ORG/PROJECT/issues/?query=is:unresolved"

11.3 Logs by Service

| Service | Log location |
|---|---|
| API | C:\inetpub\api.kalmiazen.com\logs\ |
| Admin | C:\inetpub\app.kalmiazen.com\logs\ |
| MCP | C:\inetpub\mcp.kalmiazen.com\logs\ |
| Workers | Orchestrator.Workers\logs\ |
| WhatsApp | Messaging\WhatsappService\logs\ |
| IIS | C:\inetpub\logs\LogFiles\ |
| PostgreSQL | C:\Program Files\PostgreSQL\16\data\log\ |

12. Backup and Recovery

12.1 Backup Strategy

| Component | Frequency | Retention | Method |
|---|---|---|---|
| PostgreSQL full | Daily at 03:00 | 30 days | pg_dump |
| PostgreSQL incremental | Every 6 h | 7 days | WAL archiving |
| RabbitMQ definitions | Daily | 7 days | export_definitions |
| .env files | On every change | Indefinite | Git (private repo) |
| Logs | - | 30 days | Automatic rotation |

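12.2 Automated Backup Script

# backup-nexus.ps1
$timestamp = Get-Date -Format "yyyyMMdd_HHmmss"
$backupDir = "D:\Backups\nexus"

# Create the directory if it does not exist
New-Item -ItemType Directory -Force -Path $backupDir | Out-Null

# PostgreSQL backup (custom format); pg_dump reads the password from
# PGPASSWORD or %APPDATA%\postgresql\pgpass.conf
$pgBackup = Join-Path $backupDir "nexus_$timestamp.dump"
pg_dump -U nexus -d nexus -F c -f $pgBackup
if ($LASTEXITCODE -ne 0) { throw "pg_dump failed" }
Write-Host "PostgreSQL backup: $pgBackup"

# RabbitMQ definitions backup
$rmqBackup = Join-Path $backupDir "rabbitmq_$timestamp.json"
rabbitmqctl export_definitions $rmqBackup
if ($LASTEXITCODE -ne 0) { throw "rabbitmqctl export_definitions failed" }
Write-Host "RabbitMQ backup: $rmqBackup"

# Delete old backups using the retention periods from 12.1
# (database dumps: 30 days, RabbitMQ definitions: 7 days)
$retentionDays = @{ "nexus_*.dump" = 30; "rabbitmq_*.json" = 7 }
foreach ($pattern in $retentionDays.Keys) {
    $cutoff = (Get-Date).AddDays(-1 * $retentionDays[$pattern])
    Get-ChildItem $backupDir -File -Filter $pattern |
        Where-Object { $_.LastWriteTime -lt $cutoff } |
        Remove-Item -Force
}

Write-Host "Backup completed: $timestamp"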

12.3 Recovery Procedure

## Full Recovery

1. Restore PostgreSQL:
   pg_restore -U nexus -d nexus -c backup.dump

2. Restore RabbitMQ:
   rabbitmqctl import_definitions rabbitmq.json

3. Check the environment variables (.env)

4. Apply pending migrations (from Orchestrator/src/Orchestrator.Api):
   dotnet ef database update

5. Restart services:
   .\restart-platform.ps1

6. Check the health endpoint:
   curl https://api.kalmiazen.com/health

13. Operational Security

13.1 Security Checklist

## Access and Authentication
- [ ] Database passwords rotated every 90 days
- [ ] API keys limited to the minimum required scopes
- [ ] Strong OAuth2 secrets (32+ characters)
- [ ] MFA enabled for server access

## Network
- [ ] Firewall configured (only the required ports open)
- [ ] HTTPS enforced (HTTP redirects to HTTPS)
- [ ] Cloudflare WAF enabled
- [ ] Rate limiting configured

## Data
- [ ] Backups encrypted
- [ ] No sensitive data (PII) in logs
- [ ] Environment variables protected
- [ ] .env kept out of the repository

## Monitoring
- [ ] Security alerts configured in Sentry
- [ ] Access logs reviewed
- [ ] Dependencies scanned (npm audit, dotnet list package --vulnerable)

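13.2 Secrets and Rotation

# generar-secretos.ps1
# Uses a cryptographically secure RNG: Get-Random is not suitable for secrets, and
# "Get-Random -Count 64" over 62 characters returns only 62 distinct characters
$rng = [System.Security.Cryptography.RandomNumberGenerator]::Create()

function New-RandomBytes([int]$Count) {
    $bytes = New-Object byte[] $Count
    $rng.GetBytes($bytes)
    return ,$bytes
}

# Encryption key (256 bits, Base64)
$encryptionKey = [Convert]::ToBase64String((New-RandomBytes 32))
Write-Host "Encryption__Key=$encryptionKey"

# OAuth2 signing key (512 bits, Base64)
$oauth2Key = [Convert]::ToBase64String((New-RandomBytes 64))
Write-Host "OAuth2__SigningKey=$oauth2Key"

# Database password (128 bits as 32 hex characters, safe inside connection strings)
$dbPassword = -join ((New-RandomBytes 16) | ForEach-Object { $_.ToString("x2") })
Write-Host "DB_PASSWORD=$dbPassword"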

13.3 Firewall Rules

# Configure the Windows firewall
# These Allow rules only restrict access if the profile's default inbound action is
# Block and no other rule (for example one added by an installer) opens the same ports

# Allow PostgreSQL from localhost only
New-NetFirewallRule -DisplayName "PostgreSQL Local" -Direction Inbound `
    -LocalPort 5432 -Protocol TCP -Action Allow -RemoteAddress 127.0.0.1

# Allow RabbitMQ from localhost only
New-NetFirewallRule -DisplayName "RabbitMQ Local" -Direction Inbound `
    -LocalPort 5672,15672 -Protocol TCP -Action Allow -RemoteAddress 127.0.0.1

# Allow HTTP/HTTPS from anywhere (for IIS)
New-NetFirewallRule -DisplayName "Web Traffic" -Direction Inbound `
    -LocalPort 80,443 -Protocol TCP -Action Allow

Appendix A: Escalation Contacts

| Level | Time | Contact | Channel |
|---|---|---|---|
| L1 | 0-15 min | DevOps on-call | Slack #ops-alerts |
| L2 | 15-60 min | Tech Lead | WhatsApp |
| L3 | >60 min | CTO | Phone |

Appendix B: SLAs

| Service | Availability | RTO | RPO |
|---|---|---|---|
| API | 99.5% | 1 h | 6 h |
| Admin | 99% | 2 h | 24 h |
| Workers | 95% | 4 h | 24 h |
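An availability target translates into a downtime budget: over a 30-day month (720 hours), 99.5% allows 0.5% × 720 h = 3.6 h of downtime, 99% allows 7.2 h and 95% allows 36 h. The same arithmetic as a small Python helper, assuming a 30-day month as the measurement window (the SLA table does not state the window):

```python
def downtime_budget_hours(availability_percent: float, window_hours: float = 30 * 24) -> float:
    """Allowed downtime in hours for an availability target over window_hours."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return window_hours * (100 - availability_percent) / 100


# Availability targets from the SLA table
SLAS = {"API": 99.5, "Admin": 99.0, "Workers": 95.0}

if __name__ == "__main__":
    for service, target in SLAS.items():
        print(f"{service}: {downtime_budget_hours(target):.1f} h/month")
```

Comparing the monthly budget with the RTO shows how much slack there is: a single API incident that uses its full 1 h RTO consumes more than a quarter of the 3.6 h monthly budget.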

Change History

| Date | Version | Changes | Author |
|---|---|---|---|
| 2026-02 | 1.0.0 | Initial document | DevOps Team |