Retrying instance(s) Category
Nodinite BizTalk Server Monitoring Agent empowers you to monitor BizTalk Server Retrying instances for all BizTalk Applications in your group. Instantly detect issues, automate actions, and maintain business continuity.
- Retrying instances in BizTalk are listed within Nodinite as Resources named 'Retrying instance(s)'.
- Nodinite mirrors the Application structure, providing a 1:1 mapping with BizTalk Applications.
- Retrying instances are grouped by the Category Retrying instance(s) for streamlined management.

Here's an example of a Monitor View filtered by the 'Retrying instance(s)' category.
Note
All User operations within Nodinite are Log Audited, supporting your security and corporate governance compliance policies.
Understanding BizTalk Server Retrying Instances
Retrying Instances represent service instances (messaging or orchestrations) that have encountered transient failures during processing and are actively being retried by BizTalk Server's retry mechanism. These instances are in a special state where BizTalk automatically re-attempts delivery or processing based on configured retry intervals and retry counts.
What Are Retry Mechanisms in BizTalk?
When BizTalk sends a message to a destination (send port, web service, file system, database) or processes through a pipeline, transient failures can occur:
- Network timeouts – Temporary network disruptions preventing message delivery
- Backend system unavailability – Target web service, database, or server temporarily offline
- Connection failures – Database connection pool exhaustion, dropped TCP connections
- File system locks – Target file locked by another process, directory temporarily unavailable
- Authentication/SSL errors – Certificate expiration, transient authentication provider issues
- Throttling responses – Downstream system returning 429/503 status codes requesting retry
Rather than immediately suspending these instances, BizTalk automatically retries based on:
- Retry count – Number of retry attempts configured on send port (e.g., 3 retries)
- Retry interval – Time delay between retries (e.g., 5 minutes)
- Transport-specific retry logic – Some adapters (HTTP, SOAP, WCF) have additional retry mechanisms
Normal retry flow (see the sketch after this list):
- Message delivery fails with transient error
- Instance enters "Retrying" state
- BizTalk waits for retry interval (e.g., 5 minutes)
- Automatic retry attempt #1
- If still fails → wait retry interval → retry attempt #2
- If retry count exhausted (e.g., 3 retries) → instance suspended (non-resumable)
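As a rough illustration of this flow, here is a minimal Python sketch assuming a hypothetical `attempt_delivery` callback and the example settings above (3 retries, 5-minute interval); the real behavior is driven entirely by the send port configuration in BizTalk:

```python
import time

def deliver_with_retries(attempt_delivery, retry_count=3, retry_interval_minutes=5):
    """Illustrative model of BizTalk send port retry behavior (not actual BizTalk code).

    attempt_delivery: callable returning True on success, False on a transient failure.
    """
    # One initial delivery attempt plus the configured number of retries.
    for attempt in range(retry_count + 1):
        if attempt_delivery():
            return "Completed"                       # delivery succeeded
        if attempt == retry_count:
            return "Suspended"                       # retries exhausted -> instance suspends
        # The instance sits in the "Retrying" state while waiting for the next attempt.
        print(f"Retry {attempt + 1} of {retry_count} scheduled in {retry_interval_minutes} min")
        time.sleep(retry_interval_minutes * 60)      # BizTalk waits for the retry interval

# Example run: the backend recovers on the second retry, so the instance completes.
attempts = iter([False, False, True])
print(deliver_with_retries(lambda: next(attempts), retry_interval_minutes=0))
```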
Why Monitoring Retrying Instances is Critical
Retrying instances indicate ongoing integration failures that may or may not resolve automatically. Excessive retrying instances signal serious problems:
- Time-Sensitive Delays – Messages waiting for retries violate SLAs; customer orders, invoices, or notifications delayed
- Backend System Failures – High retry counts indicate sustained downstream system outages (web services down, databases offline, file shares unavailable)
- Retry Exhaustion Risk – Instances nearing final retry attempt will suspend if backend doesn't recover, requiring manual intervention
- Configuration Issues – Incorrect URLs, expired certificates, wrong credentials causing repeated failures
- Integration Health Indicator – Spike in retrying instances = broader integration landscape problems (network issues, infrastructure failures)
- MessageBox Pressure – Retrying instances store state in MessageBox, consuming resources during retry wait periods
- Business Process Interruption – Orders not fulfilled, payments not processed, partner integrations failing
What Triggers Retry State?
Send port delivery failures:
- HTTP/SOAP/WCF adapters – Target endpoint unreachable, HTTP 500/502/503/504 errors, connection timeouts
- File adapter – Network share unavailable, disk full, permission denied, file locked
- FTP/SFTP adapter – Connection refused, authentication failure, remote directory missing
- Database adapter – SQL connection timeout, deadlock, tempdb full, service unavailable
- MSMQ/MQSeries adapter – Queue manager down, queue not found, remote queue unreachable
- Email adapter – SMTP server down, authentication failure, mailbox full
- Custom adapters – Any transient error returned by custom adapter code
Pipeline failures (retryable):
- Transient database lookup failures in custom pipeline components
- Temporary external service calls during message enrichment
Normal vs. Problematic Retry Patterns
Normal retry scenarios (expected behavior):
- Brief network glitches – 1-2 instances retrying for a few minutes during temporary network hiccup, then succeed
- Scheduled backend maintenance – Predictable retry spike during known maintenance windows (e.g., nightly database backups)
- Isolated transient failures – Occasional single instance retry due to random timeouts (acceptable in distributed systems)
- Retry success rate – Most retrying instances succeed on retry attempt #1 or #2 (not exhausting all retries)
Problematic retry patterns (require investigation):
- Sustained high counts – Dozens/hundreds of instances in retry state for extended periods (>30 minutes)
- All retries exhausting – Instances consistently failing all retry attempts and suspending (backend definitely down)
- Specific destination failures – All retries to specific endpoint/partner/system (configuration error or sustained outage)
- Growing retry queues – Retry counts increasing linearly over time without resolution
- Cyclical retry patterns – Same services retrying at predictable intervals (intermittent connectivity issues)
- Single-retry failures – Instances failing immediately on first retry (not transient – likely permanent error)
Root causes of excessive retrying:
- Backend system outages – Web services crashed, databases offline, file servers down, partner systems unavailable
- Network infrastructure issues – Firewall misconfigurations, DNS failures, routing problems, VPN disconnections
- Configuration errors – Wrong URLs after deployment, expired SSL certificates, incorrect authentication credentials
- Resource exhaustion – Backend connection pool limits, disk space full, memory pressure causing rejections
- Throttling/rate limiting – Downstream systems rejecting requests due to volume (need backoff strategy)
- Intermittent connectivity – Unstable network links, flapping connections, DNS resolution failures
- Scheduled maintenance conflicts – Backend maintenance overlapping with BizTalk processing windows
Nodinite evaluates both count (how many instances retrying) and time (how long instances have been retrying), enabling detection of both mass failures (backend outage) and prolonged retry scenarios (configuration errors preventing eventual success).
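A minimal sketch of this dual-threshold evaluation, with illustrative threshold values; it mirrors the OK/Warning/Error model described below but is not the agent's actual implementation:

```python
from datetime import timedelta

def evaluate_retrying_instances(retry_ages, count_warning, count_error,
                                time_warning, time_error):
    """Return 'OK', 'Warning' or 'Error' from count- and time-based thresholds combined.

    retry_ages: one timedelta per retrying instance (how long it has been retrying).
    """
    count = len(retry_ages)
    oldest = max(retry_ages, default=timedelta(0))

    # Error takes precedence; breaching either dimension is enough to raise the state.
    if count > count_error or oldest > time_error:
        return "Error"
    if count > count_warning or oldest > time_warning:
        return "Warning"
    return "OK"

# Example: 12 instances retrying, the oldest for 9 minutes.
ages = [timedelta(minutes=9)] + [timedelta(minutes=3)] * 11
print(evaluate_retrying_instances(ages,
                                  count_warning=10, count_error=50,
                                  time_warning=timedelta(minutes=9),
                                  time_error=timedelta(minutes=13)))   # -> Warning (count breached)
```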
What are the key features for Monitoring BizTalk Server Retrying instances?
Nodinite's Retrying Instance monitoring provides dual-threshold evaluation combined with transient failure visibility, enabling proactive detection of backend outages and integration issues before retry exhaustion causes message loss:
- Dual-Threshold Evaluation – Intelligent monitoring using both count-based (how many instances retrying) and time-based (how long retrying) thresholds to detect mass backend failures vs. prolonged configuration errors.
- Transient Failure Detection – View retrying instances to identify which endpoints, partners, or systems are experiencing failures, pinpointing integration health issues.
- Retry Exhaustion Prevention – Early alerts enable intervention before instances exhaust all retry attempts and suspend (preventing message loss).
- Application-Specific Configuration – Tailor threshold settings per BizTalk Application to accommodate different integration reliability profiles (stable partners vs. flaky systems).
- Backend Health Indicator – Spike in retrying instances signals downstream system problems (databases, web services, file shares) before users report failures.
- SLA Risk Management – Detect retry delays impacting time-sensitive business processes (orders, payments, notifications) before SLA violations occur.
What is evaluated for BizTalk Retrying instances?
The monitoring agent continuously queries BizTalk's MessageBox to assess retrying instance counts and retry durations across all applications. Nodinite evaluates instances against both count and time thresholds, providing comprehensive integration health assessment:
| State | Status | Description | Actions |
|---|---|---|---|
| Unavailable | Resource not available | Evaluation of the 'Retrying instance(s)' is not possible due to network or security-related problems | Review prerequisites |
| Error | Error threshold is breached | More Retrying instances exist than allowed by the Error threshold | Details, Edit thresholds |
| Warning | Warning threshold is breached | More Retrying instances exist than allowed by the Warning threshold | Details, Edit thresholds |
| OK | Within user-defined thresholds | The number of Retrying instances is within the user-defined thresholds | Details, Edit thresholds |
Tip
You can reconfigure the evaluated state using the Expected State feature on every Resource within Nodinite. For integrations with known unreliable partners or systems with scheduled maintenance windows, you can expect Warning states during predictable retry periods without generating false alarms.
Actions
When retrying instances accumulate or remain in retry state beyond expected recovery time, immediate investigation prevents retry exhaustion and message suspension. Nodinite provides Remote Actions for rapid failure diagnosis and backend system health assessment.
These actions enable operations teams to identify which endpoints are failing and coordinate with infrastructure/backend teams for remediation. All actions are audit logged for compliance tracking.
Available Actions for Retrying Instances
The following Remote Actions are available for the Retrying instance(s) Category:

Retrying instance Actions Menu in Nodinite Web Client.
Details
When alerts indicate excessive retrying instances or prolonged retry durations, the Details view provides critical diagnostic information about which integrations are failing and why. This interface reveals transient failure patterns without requiring BizTalk Group Hub access.
What you can see:
- Instance details – Service name, send port, destination endpoint, instance ID
- Retry information – Current retry attempt number (e.g., "Retry 2 of 3"), next retry timestamp
- Failure details – Error message, exception type, HTTP status code (if applicable)
- Retry timestamp – When instance entered retry state (critical for time-based alerting)
- Target endpoint – Which URL, file path, database, or system is failing
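Conceptually, each retrying instance in this view is one record; the sketch below models it with hypothetical field names (only the Details modal itself is authoritative):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class RetryingInstanceDetails:
    """Illustrative record of what the Details view surfaces; field names are assumptions."""
    instance_id: str                     # BizTalk service instance ID
    service_name: str                    # messaging service or orchestration
    send_port: str                       # send port attempting delivery
    target_endpoint: str                 # URL, file path, database or system being retried
    retry_attempt: int                   # current attempt number, e.g. 2
    retry_count: int                     # configured maximum, e.g. 3
    next_retry_at: Optional[datetime]    # when the next attempt is scheduled
    entered_retry_state: datetime        # basis for time-based alerting
    error_message: str                   # last failure text, exception type or HTTP status

    def near_exhaustion(self) -> bool:
        """True when the next failed attempt would suspend the instance."""
        return self.retry_attempt >= self.retry_count
```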
When to use this view:
- Backend outage detection – Identify which downstream systems are unavailable (all retries to specific endpoint)
- Configuration error diagnosis – Find wrong URLs, expired certificates, incorrect credentials causing repeated failures
- Retry exhaustion prevention – See instances nearing final retry attempt to enable manual intervention before suspension
- Partner/vendor communication – Identify failing partner endpoints to escalate to external support teams
- Pattern analysis – Determine if failures are specific service, time-based (nightly), or intermittent
- SLA impact assessment – Identify time-sensitive messages stuck in retry (orders, payments, notifications)
Common diagnostic patterns:
Pattern: All retries to single endpoint
- Diagnosis: Specific backend system down (web service, database, file share)
- Action: Contact backend team, verify system health, check firewall/network
- Example: All retries show "Connection refused" to https://partner-api.example.com → partner's API server offline
Pattern: Mixed retry attempts (some at retry 1, some at retry 3)
- Diagnosis: Intermittent connectivity or flapping connection
- Action: Investigate network stability, DNS resolution, load balancer health
- Example: Some succeed on retry, others exhaust → unstable network link
Pattern: All instances failing at retry attempt 1
- Diagnosis: Not transient – permanent error (wrong URL, 404, authentication failure)
- Action: Fix configuration error immediately (instances will suspend after retries exhausted)
- Example: Certificate expired, wrong credentials, endpoint URL changed post-deployment
Pattern: Retrying instances for specific message type/partner
- Diagnosis: Partner system issues, routing configuration, adapter-specific problem
- Action: Test endpoint manually, verify adapter configuration, check partner system status
- Example: All EDI messages to specific trading partner retrying → partner's AS2 endpoint down
Pattern: Cyclical retry pattern (spike every 5 minutes)
- Diagnosis: Retry interval timing (all instances retrying simultaneously at interval)
- Action: Normal behavior if backend recovers before exhaustion; if sustained, backend needs attention
- Example: 50 instances retry every 5 minutes at exactly 10:00, 10:05, 10:10 → backend still failing
Pattern: Instances retrying for hours without suspension
- Diagnosis: Very high retry count configured (e.g., 999 retries) or infinite retry adapter setting
- Action: Evaluate if extended retries appropriate; may need to manually suspend to prevent indefinite delay
- Example: File adapter with infinite retry waiting for locked file that won't unlock
Tip
Check error messages for HTTP status codes: 503 Service Unavailable = backend overloaded (temporary), 401/403 = authentication (configuration), 404 = wrong URL (configuration), 500 = backend crash (contact backend team).
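That triage rule can be captured in a small lookup; this sketch covers only the status codes mentioned in the tip above:

```python
def triage_http_status(status: int) -> str:
    """Map an HTTP status code from a retry error message to a likely cause and owner."""
    if status == 503:
        return "Backend overloaded (temporary) - likely to recover; watch retry duration"
    if status in (401, 403):
        return "Authentication/authorization issue - check credentials and certificates"
    if status == 404:
        return "Wrong URL - fix send port configuration; retries will not help"
    if status >= 500:
        return "Backend error - contact the backend team"
    return "Review the full error message in the Details view"

print(triage_http_status(503))
```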
Warning
Instances nearing final retry attempt (e.g., "Retry 3 of 3") will suspend (non-resumable) if next attempt fails. Messages in non-resumable suspended state cannot be automatically resumed – manual resubmission or backend recovery + new message required.
To access retry diagnostics, press the Action button and select the Details menu item:

Action button menu with 'Details' option.
The modal displays comprehensive retry information including failure details and retry attempt progress:

Details modal showing retrying instances with error messages and retry attempt counts.
Edit thresholds
Retrying instance monitoring uses dual-threshold evaluation—both count-based (how many instances retrying) and time-based (how long instances have been retrying)—to detect different failure patterns. This enables detection of both mass backend outages (count spikes) and prolonged configuration errors (extended retry durations).
When to adjust thresholds:
- Count: After improving backend reliability, for integrations with unreliable partners, based on retry intervals, during scheduled maintenance
- Time: Based on retry configuration (count × interval), for time-sensitive SLA messages, to catch configuration errors early
Threshold tuning strategy:
- Calculate retry exhaustion: (Retry count × Retry interval)
- Set time Warning at 60%, Error at 85% of exhaustion
- Set count Warning >5-10 instances, Error >20-50 instances
Per-application overrides: B2B/EDI partners (higher), Internal databases (very low), Real-time APIs (SLA-critical)
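The per-application idea can be modeled as overrides on top of a default profile. The sketch below is conceptual only; application names and values are invented, and in Nodinite the actual thresholds are set per Application via Edit thresholds or Remote Configuration:

```python
from datetime import timedelta

# Illustrative defaults and per-application overrides (names and values are invented).
DEFAULT = {"count_warning": 10, "count_error": 30,
           "time_warning": timedelta(minutes=9), "time_error": timedelta(minutes=13)}

OVERRIDES = {
    # Flaky external partners: tolerate more simultaneous retries.
    "B2B.EDI.Partners":   {"count_warning": 50, "count_error": 150},
    # Internal databases should almost never retry.
    "Internal.Orders.DB": {"count_warning": 3, "count_error": 10},
    # Real-time APIs: very tight time thresholds to protect the SLA.
    "RealTime.Payments":  {"time_warning": timedelta(seconds=40),
                           "time_error": timedelta(seconds=50)},
}

def thresholds_for(application: str) -> dict:
    """Merge an application's overrides onto the default profile."""
    return {**DEFAULT, **OVERRIDES.get(application, {})}

print(thresholds_for("RealTime.Payments"))
```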
Thresholds can be managed through the Actions menu or via Remote Configuration for bulk adjustments.
To manage the Retrying instance(s) threshold for the selected BizTalk Server Application, press the Action button and select the Edit thresholds menu item:

Action button menu with Edit thresholds option.
The modal allows you to configure both time-based and count-based alert thresholds:

Dual-threshold configuration for comprehensive retry monitoring.
Time-based evaluation
Time-based evaluation detects instances stuck in retry state longer than expected, enabling intervention before retry exhaustion and message suspension. Nodinite tracks how long each instance has been retrying, alerting when duration approaches configured retry limits.
Time-based evaluation is always active. If you don't want time-based alerting, set thresholds longer than your maximum retry duration (retry count × retry interval), or use Expected State to accept Warning states during known backend maintenance.
Why time thresholds are critical for retries:
Retries have finite duration before exhaustion. Once all retry attempts fail, instances suspend (non-resumable) and messages are lost unless manually resubmitted. Time thresholds enable intervention before suspension by alerting when:
- Instances have been retrying for 80-90% of maximum retry duration
- Backend hasn't recovered in expected timeframe
- Configuration errors preventing eventual success are detected early
Calculating appropriate time thresholds:
Formula: Warning = (Retry Count × Retry Interval) × 0.6, Error = (Retry Count × Retry Interval) × 0.85
Examples:
- 3 retries × 5 min = 15 min max: Warning: 9 min, Error: 13 min
- 10 retries × 2 min = 20 min max: Warning: 12 min, Error: 17 min
- Real-time (2 × 30 sec = 1 min): Warning: 40 sec, Error: 50 sec
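The arithmetic behind these examples as a short sketch, applying the 60% and 85% factors from the formula above (the printed values are exact; the examples above round to whole minutes):

```python
from datetime import timedelta

def suggest_time_thresholds(retry_count: int, retry_interval: timedelta):
    """Warning at 60% and Error at 85% of the maximum retry duration (count x interval)."""
    max_duration = retry_count * retry_interval
    return max_duration * 0.6, max_duration * 0.85

# 3 retries x 5 minutes = 15 minutes maximum retry duration.
warning, error = suggest_time_thresholds(3, timedelta(minutes=5))
print(warning, error)   # 0:09:00 0:12:45 -> roughly the 9 and 13 minutes in the examples above
```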
Diagnostic value of time-based retry alerts:
- Alert within first retry interval → Likely configuration error (wrong URL, expired cert); won't resolve via retry
- Alert after 50-70% of max duration → Backend hasn't recovered; likely sustained outage requiring intervention
- Alert just before exhaustion → Last chance to manually intervene, contact backend team, or suspend gracefully
- Multiple instances at similar retry duration → Mass backend failure affecting all integrations to that system
- Single instance retrying for extended time → Isolated issue; specific message, routing, or partner problem
Warning
When instances reach final retry attempt (e.g., "Retry 5 of 5"), next failure causes non-resumable suspension. Messages cannot be automatically resumed—requires manual resubmission or backend recovery + message resend from source system. Time-based alerts provide intervention window to prevent message loss.
Tip
For backends with predictable recovery times (e.g., "database restarts take 5-10 minutes"), set Error threshold slightly beyond typical recovery to avoid false alarms but catch prolonged outages. Example: DB typically restarts in 8 min → set Error at 12 min.
| Name | Data Type | Description |
|---|---|---|
| Warning TimeSpan | Timespan, e.g., 00:13:37 (13 minutes 37 seconds) | If any retrying instance has been in retry state longer than this timespan, a Warning alert is raised. Set at 60-70% of maximum retry duration to enable early intervention. Format: days.hours:minutes:seconds (e.g., 0.00:10:00 = 10 minutes) |
| Error TimeSpan | Timespan, e.g., 01:10:00 (1 hour 10 minutes) | If any retrying instance has been in retry state longer than this timespan, an Error alert is raised. Set at 80-90% of maximum retry duration to alert before final retry attempt and suspension. Format: days.hours:minutes:seconds (e.g., 0.00:15:00 = 15 minutes) |
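For reference, a minimal sketch that converts the days.hours:minutes:seconds format used in the table into a Python timedelta, assuming the days part is optional and fractional seconds are not used:

```python
from datetime import timedelta

def parse_timespan(value: str) -> timedelta:
    """Parse 'days.hours:minutes:seconds' (days optional), e.g. '0.00:10:00' or '00:13:37'."""
    days = 0
    if "." in value:
        day_part, value = value.split(".", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)

print(parse_timespan("0.00:10:00"))   # 0:10:00 -> 10 minutes
print(parse_timespan("00:13:37"))     # 0:13:37 -> the Warning TimeSpan example above
```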
Count-based evaluation
Count-based evaluation detects mass backend failures where multiple instances retry simultaneously, indicating downstream system outages, network failures, or infrastructure problems affecting the broader integration landscape.
What retry counts reveal:
- 0-2 instances – Normal transient failures (random timeouts, occasional network glitches)
- 3-10 instances – Elevated retry rate; monitor for pattern (specific endpoint? time-based?)
- 10-50 instances – Likely backend system issue; investigate endpoint health, network connectivity
- 50-100 instances – Confirmed backend outage or mass configuration error; immediate action required
- 100+ instances – Severe sustained backend failure; coordinate with infrastructure/backend teams urgently
How to set count thresholds:
- Establish baseline retry rate during normal operations
- Set Warning above baseline, Error at backend failure threshold
- Account for message volume (higher throughput → scale thresholds)
Example thresholds:
- Low-volume partners (50 msg/hr): Warning 3, Error 10
- Medium-volume internal (500 msg/hr): Warning 10, Error 30
- High-volume B2B (5000 msg/hr): Warning 50, Error 150
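A minimal sketch that turns these example tiers into a lookup by hourly volume; in practice you would refine the result against your measured baseline retry rate:

```python
import bisect

# Example tiers from the list above: (messages per hour, Warning count, Error count).
EXAMPLE_TIERS = [(50, 3, 10), (500, 10, 30), (5000, 50, 150)]

def suggest_count_thresholds(messages_per_hour: int):
    """Pick the closest tier at or below the given volume (falls back to the lowest tier)."""
    volumes = [volume for volume, _, _ in EXAMPLE_TIERS]
    index = max(0, bisect.bisect_right(volumes, messages_per_hour) - 1)
    _, warning, error = EXAMPLE_TIERS[index]
    return warning, error

print(suggest_count_thresholds(800))   # medium-volume tier -> (10, 30)
```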
Diagnostic patterns:
- Sudden spike (0 → 50+ instances): Backend just went offline → contact team immediately
- Gradual increase (5 → 30 over hours): Backend degrading (memory leak, disk filling) → prevent full outage
- All instances exhaust and suspend: Sustained backend outage → fix urgently, prepare resubmission
Warning
High retry counts (>50 instances for medium-volume apps) indicate:
- Backend system offline – Web service, database, file share, or partner system unavailable
- Network infrastructure failure – Firewall blocking, DNS failure, routing issue
- Mass configuration error – Wrong URL deployed to all send ports, certificate expired
- Cascading failures risk – If backend doesn't recover, all retrying instances will suspend
Tip
Correlation with other metrics: High retry counts + high Ready to Run counts = backend backpressure (system slow, not down). High retry counts + normal Ready to Run = backend outage (system completely unavailable).
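A minimal sketch of that correlation rule, assuming you already have both counts available (for example from the corresponding Nodinite Resources) and using illustrative cut-offs for "high":

```python
def diagnose(retrying_count: int, ready_to_run_count: int,
             retrying_high: int = 50, ready_high: int = 100) -> str:
    """Correlate Retrying and Ready to Run counts; the 'high' cut-offs are illustrative."""
    if retrying_count >= retrying_high and ready_to_run_count >= ready_high:
        return "Backend backpressure: system is slow, not down"
    if retrying_count >= retrying_high:
        return "Backend outage: system appears completely unavailable"
    return "Retry levels within normal range"

print(diagnose(retrying_count=80, ready_to_run_count=20))   # -> backend outage (Ready to Run is normal)
```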
Tip
Backend team coordination: When Error threshold breached, immediately engage backend/infrastructure teams. Provide failing endpoint details from Details view. Every retry cycle brings instances closer to exhaustion and message loss.
| Name | Data Type | Description |
|---|---|---|
| Warning Count | Integer | If the total number of retrying instances exceeds this value, a Warning alert is raised. Set slightly above the baseline transient failure rate to detect emerging backend issues early. |
| Error Count | Integer | If the total number of retrying instances exceeds this value, an Error alert is raised. Set at a level indicating a confirmed backend outage or mass configuration error requiring immediate remediation. |
Next Step
Add or manage a Monitoring Agent Configuration
Configuration
Related Topics
BizTalk Monitoring Agent
Administration
Monitoring Agents
Add or manage a Monitoring Agent Configuration
Remote Configuration