How do I monitor Multi-Protocol Gateway service health?
DataPower Multi-Protocol Gateway (MPG) services are the core runtime components processing integration traffic (REST APIs, SOAP web services, EDI X12/EDIFACT, MQ messages). Service health monitoring detects crashes, manual stops, and configuration issues before they impact business operations.
Service Health Monitoring via SOMA API
The Nodinite DataPower Monitoring Agent polls service status through SOMA (SOAP Management), part of the DataPower XML Management Interface.
Step 1: Create Service Resource in Nodinite
- Navigate: Nodinite Web Client → Repository → Monitoring Resources
- Create New Resource:
  - Resource type: Service
  - DataPower appliance: Prod-Primary (or your appliance name)
  - Domain: TradingPartner (DataPower domain hosting the service)
  - Service name: TradingPartner-MPG (exact service name as configured in DataPower)
  - Service class: MultiProtocolGateway (DataPower object class)
Step 2: Configure Agent Polling Interval
- Set polling frequency:
  - Default: 5 minutes (288 health checks per day)
  - High-priority services: 1 minute (1,440 health checks per day, faster failure detection)
  - Low-priority development services: 15 minutes (96 health checks per day, reduced network overhead)
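The checks-per-day figures follow directly from the interval length. A minimal Python sketch of that arithmetic, where the tier names in `POLLING_INTERVALS` are hypothetical labels and not Nodinite settings:

```python
from datetime import timedelta

# Hypothetical tier names; the interval values are the ones listed above.
POLLING_INTERVALS = {
    "default": timedelta(minutes=5),
    "high_priority": timedelta(minutes=1),
    "low_priority_dev": timedelta(minutes=15),
}


def checks_per_day(interval: timedelta) -> int:
    """Number of health checks in 24 hours at a fixed polling interval."""
    return int(timedelta(days=1) / interval)


if __name__ == "__main__":
    for tier, interval in POLLING_INTERVALS.items():
        print(f"{tier}: {checks_per_day(interval)} checks per day")
```

The interval is also the worst-case detection delay: a failure that occurs right after a poll is not seen until the next one.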
Step 3: SOMA API Request/Response
At each polling interval (5 minutes by default), the agent sends a SOMA XML request:
<dp:request domain="TradingPartner">
<dp:get-status class="MultiProtocolGateway"/>
<dp:filter>TradingPartner-MPG</dp:filter>
</dp:request>
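For illustration, the simplified request above could be assembled as follows. This is only a sketch: `build_status_request` is a hypothetical helper, not part of the Nodinite agent, and it reproduces the fragment exactly as shown here. A real SOMA call would also need the SOAP envelope, the `dp` namespace declaration, and authenticated transport, none of which are covered.

```python
from xml.sax.saxutils import escape, quoteattr


def build_status_request(domain: str, service_class: str, service_name: str) -> str:
    """Build the simplified status request fragment shown in this step.

    Hypothetical helper: values are XML-escaped so that names containing
    characters such as '&' or '<' do not break the request.
    """
    return (
        f"<dp:request domain={quoteattr(domain)}>\n"
        f"<dp:get-status class={quoteattr(service_class)}/>\n"
        f"<dp:filter>{escape(service_name)}</dp:filter>\n"
        "</dp:request>"
    )


if __name__ == "__main__":
    print(build_status_request("TradingPartner", "MultiProtocolGateway", "TradingPartner-MPG"))
```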
DataPower responds with service status:
<dp:response>
<dp:status class="MultiProtocolGateway">
<Name>TradingPartner-MPG</Name>
<OpState>up</OpState>
<AdminState>enabled</AdminState>
<ConfigState>saved</ConfigState>
<QuiesceState>normal</QuiesceState>
</dp:status>
</dp:response>
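A response of this shape can be read with the Python standard library. The sketch below is not the agent's actual parser. It assumes the `dp` prefix is bound to the DataPower management namespace URI shown in `DP_NS` (the sample above omits the declaration, so it is added here), and `parse_service_status` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Assumption: namespace URI bound to the dp prefix in the real response.
DP_NS = "http://www.datapower.com/schemas/management"

SAMPLE_RESPONSE = f"""
<dp:response xmlns:dp="{DP_NS}">
  <dp:status class="MultiProtocolGateway">
    <Name>TradingPartner-MPG</Name>
    <OpState>up</OpState>
    <AdminState>enabled</AdminState>
    <ConfigState>saved</ConfigState>
    <QuiesceState>normal</QuiesceState>
  </dp:status>
</dp:response>
"""


def parse_service_status(xml_text: str, service_name: str) -> Optional[Dict[str, str]]:
    """Return the status fields for one service, or None if it is not listed."""
    root = ET.fromstring(xml_text.strip())
    for status in root.iter(f"{{{DP_NS}}}status"):
        # Child elements (Name, OpState, ...) carry no namespace in the sample.
        fields = {child.tag: (child.text or "").strip() for child in status}
        if fields.get("Name") == service_name:
            return fields
    return None


if __name__ == "__main__":
    print(parse_service_status(SAMPLE_RESPONSE, "TradingPartner-MPG"))
```

Returning `None` for a missing service keeps "service not found" distinct from "service found but not up", which are different operational problems.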
Step 4: OpState Values and Meanings
The agent parses the <OpState> element to determine service health:
| OpState Value | Meaning | Typical Causes |
|---|---|---|
| up | Service running normally | Healthy state, processing traffic |
| down | Service crashed/failed | OutOfMemoryError, configuration error, backend unreachable |
| stopped | Service manually disabled | Administrator disabled via WebGUI, planned maintenance |
| starting | Service initializing | Appliance rebooting, service recently enabled (transient state) |
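The table can be expressed as a small lookup. In this sketch the `ServiceHealth` names are hypothetical labels for the four rows, and treating any unrecognized value as `UNKNOWN` is an assumption rather than documented agent behavior.

```python
from enum import Enum


class ServiceHealth(Enum):
    """Hypothetical labels for the OpState rows in the table above."""
    HEALTHY = "up"
    FAILED = "down"
    DISABLED = "stopped"
    TRANSIENT = "starting"
    UNKNOWN = "unknown"  # assumption: fallback for values not in the table


def classify_op_state(raw: str) -> ServiceHealth:
    """Map a raw <OpState> value to a health label, ignoring case and whitespace."""
    normalized = raw.strip().lower()
    for health in ServiceHealth:
        if health is not ServiceHealth.UNKNOWN and health.value == normalized:
            return health
    return ServiceHealth.UNKNOWN


if __name__ == "__main__":
    for value in ("up", "down", "stopped", "starting", "quiesced"):
        print(value, "->", classify_op_state(value).name)
```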
Step 5: Threshold Evaluation
The agent compares the actual OpState against the configured expected state:
Scenario 1: Service crashed unexpectedly
- Expected state: running (24/7 production service)
- Actual OpState: down
- Alert: Error alert fires → "Service TradingPartner-MPG crashed unexpectedly at 2024-10-16 14:23:47 UTC"
- Actions: PagerDuty page on-call engineer, investigate service logs via Remote Action "View Service Logs"
Scenario 2: Service manually stopped (unexpected)
- Expected state: running (24/7 production service)
- Actual OpState: stopped
- Alert: Warning alert fires → "Service TradingPartner-MPG manually disabled, investigate if intentional"
- Actions: Email operations team, verify if planned maintenance (if not, escalate to network ops)
Scenario 3: Service stopped during scheduled maintenance (expected)
- Expected state: stopped Saturday 2-6 AM (configured maintenance window)
- Actual OpState: stopped (Saturday 3:15 AM)
- Alert: No alert (expected state matches actual state)
Scenario 4: Service stuck in "starting" state
- Expected state: running
- Actual OpState: starting (15 minutes elapsed)
- Alert: Warning alert fires → "Service TradingPartner-MPG stuck starting for 15 minutes, possible configuration issue"
- Actions: Investigate DataPower logs, check backend dependencies (database connections, MQ queue managers)
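The four scenarios reduce to a small decision function. The following is a sketch of that logic, not the agent's implementation: the function name, the severity strings, and the 15-minute `STARTING_TIMEOUT` (taken from Scenario 4) are assumptions, as is the handling of a service that is up while expected to be stopped.

```python
from datetime import timedelta
from typing import Optional

# Assumption: how long "starting" may last before it counts as stuck (Scenario 4).
STARTING_TIMEOUT = timedelta(minutes=15)


def evaluate(expected: str, op_state: str,
             time_in_state: timedelta = timedelta(0)) -> Optional[str]:
    """Return "error", "warning", or None (no alert) for one health check."""
    if expected == "running":
        if op_state == "up":
            return None
        if op_state == "down":
            return "error"    # Scenario 1: crashed unexpectedly
        if op_state == "stopped":
            return "warning"  # Scenario 2: manually stopped
        if op_state == "starting":
            # Scenario 4: starting is transient, so warn only once it lasts too long.
            return "warning" if time_in_state >= STARTING_TIMEOUT else None
        return "warning"      # assumption: unrecognized OpState values are flagged
    if expected == "stopped":
        # Scenario 3: stopped while expected to be stopped raises no alert.
        # Assumption: only a service that is up while expected stopped is flagged.
        return "warning" if op_state == "up" else None
    raise ValueError(f"unknown expected state: {expected!r}")


if __name__ == "__main__":
    print(evaluate("running", "down"))                            # Scenario 1
    print(evaluate("running", "stopped"))                         # Scenario 2
    print(evaluate("stopped", "stopped"))                         # Scenario 3
    print(evaluate("running", "starting", timedelta(minutes=15))) # Scenario 4
```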
Expected State Configuration
Configure per-service expected state for intelligent alerting:
Production Services (24/7 uptime)
- Expected state: Running 24/7
- Alert if: OpState = down/stopped at any time
- Use case: Payment gateway, customer-facing APIs, partner EDI connections
Development Services (Business hours only)
- Expected state: Running Mon-Fri 8 AM - 6 PM, Stopped outside business hours and on weekends
- Alert if:
  - OpState = stopped during business hours (should be running)
  - OpState = up outside business hours (wasting resources, potential security issue)
- Use case: Development/QA environments with limited operating hours
Scheduled Maintenance Windows
- Expected state: Running except Saturday 2-6 AM weekly
- Alert if: OpState = down/stopped outside maintenance window
- Use case: Production services with scheduled patching/backups
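The three expected-state profiles above can be sketched as functions of the current time. The function names are hypothetical, and the sketch assumes naive local timestamps with half-open windows (start included, end excluded). Time zone handling in a real configuration is not covered here.

```python
from datetime import datetime, time


def expected_state_production(now: datetime) -> str:
    """Production services: expected to run 24/7."""
    return "running"


def expected_state_business_hours(now: datetime) -> str:
    """Development services: running Mon-Fri 8 AM - 6 PM, stopped otherwise."""
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    in_hours = time(8, 0) <= now.time() < time(18, 0)
    return "running" if is_weekday and in_hours else "stopped"


def expected_state_weekly_maintenance(now: datetime) -> str:
    """Running except during the Saturday 2-6 AM maintenance window."""
    in_window = now.weekday() == 5 and time(2, 0) <= now.time() < time(6, 0)
    return "stopped" if in_window else "running"


if __name__ == "__main__":
    saturday_night = datetime(2024, 10, 19, 3, 15)  # a Saturday, 3:15 AM
    print(expected_state_weekly_maintenance(saturday_night))
    print(expected_state_business_hours(saturday_night))
```

The result of one of these functions is the expected state that a health check compares against the actual OpState.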
Alert Email Example
When a service crashes unexpectedly, the operations team receives an email like this:
Subject: CRITICAL: DataPower Service TradingPartner-MPG DOWN
Body:
Alert: DataPower service failure detected
Appliance: Prod-Primary
Domain: TradingPartner
Service Name: TradingPartner-MPG
Service Class: MultiProtocolGateway
Previous State: up (running normally)
Current State: down (service crashed)
State Change Time: 2024-10-16 14:23:47 UTC
Expected State: Running 24/7 (production service)
Possible Causes:
- OutOfMemoryError (Java heap exhaustion from memory leak)
- Configuration error (invalid backend URL, missing certificate)
- Backend service unreachable (database down, MQ queue manager stopped)
Immediate Actions:
1. Check service logs via Nodinite Remote Action "View Service Logs"
2. Review recent configuration changes in DataPower domain "TradingPartner"
3. Verify backend service availability (database ping, MQ queue manager status)
4. Restart service if transient issue, escalate to development team if recurring
View service health history in Nodinite Monitor View:
https://nodinite.company.com/monitor/datapower-services/TradingPartner-MPG
Last known good state: 2024-10-16 14:18:32 UTC (5 minutes ago)
Service uptime (last 30 days): 99.87% (3 outages totaling 56 minutes)
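The uptime percentage in the example email follows from the outage total and the reporting window. A minimal sketch of the calculation, using a hypothetical helper name:

```python
from datetime import timedelta


def uptime_percent(window: timedelta, downtime: timedelta) -> float:
    """Share of the reporting window during which the service was available."""
    return 100.0 * (1 - downtime / window)


if __name__ == "__main__":
    # 56 minutes of outages over a 30-day window, as in the example email.
    value = uptime_percent(timedelta(days=30), timedelta(minutes=56))
    print(f"{value:.2f}%")
```

With 56 minutes of downtime in 43,200 minutes (30 days), this gives roughly 99.87%, matching the figure in the email.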
Related Topics
- DataPower Monitoring Agent Installation - Step-by-step resource creation guide, SOMA API configuration
- Alert Plugins Configuration - Configure PagerDuty for on-call engineer escalation
Next Steps
- Create Resource: Set up service health monitoring for your critical DataPower services
- Configure Polling: Set 5-minute polling interval for production services, adjust for development
- Set Expected States: Configure per-service expected state (24/7 vs business hours)
- Alert Routing: Configure email/Slack/PagerDuty alerts for service failures
- Monitor Dashboard: Create a service health dashboard to track uptime trends
For more scenarios: