Retrying instance(s) Category
Nodinite BizTalk Server Monitoring Agent empowers you to monitor BizTalk Server Retrying instances for all BizTalk Applications in your group. Instantly detect issues, automate actions, and maintain business continuity.
- Retrying instances in BizTalk are listed within Nodinite as Resources named 'Retrying instance(s)'.
- Nodinite mirrors the Application structure, providing a 1:1 mapping with BizTalk Applications.
- Retrying instances are grouped by the Category Retrying instance(s) for streamlined management.

Here's an example of a Monitor View filtered by the 'Retrying instance(s)' category.
Note
All User operations within Nodinite are Log Audited, supporting your security and corporate governance compliance policies.
Understanding BizTalk Server Retrying Instances
Retrying Instances represent service instances (messaging or orchestrations) that have encountered transient failures during processing and are actively being retried by BizTalk Server's retry mechanism. These instances are in a special state where BizTalk automatically re-attempts delivery or processing based on configured retry intervals and retry counts.
What Are Retry Mechanisms in BizTalk?
When BizTalk sends a message to a destination (send port, web service, file system, database) or processes through a pipeline, transient failures can occur:
- Network timeouts – Temporary network disruptions preventing message delivery
- Backend system unavailability – Target web service, database, or server temporarily offline
- Connection failures – Database connection pool exhaustion, dropped TCP connections
- File system locks – Target file locked by another process, directory temporarily unavailable
- Authentication/SSL errors – Certificate expiration, transient authentication provider issues
- Throttling responses – Downstream system returning 429/503 status codes requesting retry
Rather than immediately suspending these instances, BizTalk automatically retries based on:
- Retry count – Number of retry attempts configured on send port (e.g., 3 retries)
- Retry interval – Time delay between retries (e.g., 5 minutes)
- Transport-specific retry logic – Some adapters (HTTP, SOAP, WCF) have additional retry mechanisms
Normal retry flow (see the sketch after this list):
- Message delivery fails with transient error
- Instance enters "Retrying" state
- BizTalk waits for retry interval (e.g., 5 minutes)
- Automatic retry attempt #1
- If still fails → wait retry interval → retry attempt #2
- If retry count exhausted (e.g., 3 retries) → instance suspended (non-resumable)
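As a rough illustration of this flow, here is a minimal Python sketch assuming a hypothetical `attempt_delivery` callback and the example settings above (3 retries, 5-minute interval); the real behavior is driven entirely by the send port configuration in BizTalk:

```python
import time

def deliver_with_retries(attempt_delivery, retry_count=3, retry_interval_minutes=5):
    """Illustrative model of BizTalk send port retry behavior (not actual BizTalk code).

    attempt_delivery: callable returning True on success, False on a transient failure.
    """
    # One initial delivery attempt plus the configured number of retries.
    for attempt in range(retry_count + 1):
        if attempt_delivery():
            return "Completed"                       # delivery succeeded
        if attempt == retry_count:
            return "Suspended"                       # retries exhausted -> instance suspends
        # The instance sits in the "Retrying" state while waiting for the next attempt.
        print(f"Retry {attempt + 1} of {retry_count} scheduled in {retry_interval_minutes} min")
        time.sleep(retry_interval_minutes * 60)      # BizTalk waits for the retry interval

# Example run: the backend recovers on the second retry, so the instance completes.
attempts = iter([False, False, True])
print(deliver_with_retries(lambda: next(attempts), retry_interval_minutes=0))
```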
Why Monitoring Retrying Instances is Critical
Retrying instances indicate ongoing integration failures that may or may not resolve automatically. Excessive retrying instances signal serious problems:
- Time-Sensitive Delays – Messages waiting for retries violate SLAs; customer orders, invoices, or notifications delayed
- Backend System Failures – High retry counts indicate sustained downstream system outages (web services down, databases offline, file shares unavailable)
- Retry Exhaustion Risk – Instances nearing final retry attempt will suspend if backend doesn't recover, requiring manual intervention
- Configuration Issues – Incorrect URLs, expired certificates, wrong credentials causing repeated failures
- Integration Health Indicator – Spike in retrying instances = broader integration landscape problems (network issues, infrastructure failures)
- MessageBox Pressure – Retrying instances store state in MessageBox, consuming resources during retry wait periods
- Business Process Interruption – Orders not fulfilled, payments not processed, partner integrations failing
What Triggers Retry State?
Send port delivery failures:
- HTTP/SOAP/WCF adapters – Target endpoint unreachable, HTTP 500/502/503/504 errors, connection timeouts
- File adapter – Network share unavailable, disk full, permission denied, file locked
- FTP/SFTP adapter – Connection refused, authentication failure, remote directory missing
- Database adapter – SQL connection timeout, deadlock, tempdb full, service unavailable
- MSMQ/MQSeries adapter – Queue manager down, queue not found, remote queue unreachable
- Email adapter – SMTP server down, authentication failure, mailbox full
- Custom adapters – Any transient error returned by custom adapter code
Pipeline failures (retryable):
- Transient database lookup failures in custom pipeline components
- Temporary external service calls during message enrichment
Normal vs. Problematic Retry Patterns
Normal retry scenarios (expected behavior):
- Brief network glitches – 1-2 instances retrying for a few minutes during temporary network hiccup, then succeed
- Scheduled backend maintenance – Predictable retry spike during known maintenance windows (e.g., nightly database backups)
- Isolated transient failures – Occasional single instance retry due to random timeouts (acceptable in distributed systems)
- Retry success rate – Most retrying instances succeed on retry attempt #1 or #2 (not exhausting all retries)
Problematic retry patterns (require investigation):
- Sustained high counts – Dozens/hundreds of instances in retry state for extended periods (>30 minutes)
- All retries exhausting – Instances consistently failing all retry attempts and suspending (backend definitely down)
- Specific destination failures – All retries to specific endpoint/partner/system (configuration error or sustained outage)
- Growing retry queues – Retry counts increasing linearly over time without resolution
- Cyclical retry patterns – Same services retrying at predictable intervals (intermittent connectivity issues)
- Single-retry failures – Instances failing immediately on first retry (not transient – likely permanent error)
Root causes of excessive retrying:
- Backend system outages – Web services crashed, databases offline, file servers down, partner systems unavailable
- Network infrastructure issues – Firewall misconfigurations, DNS failures, routing problems, VPN disconnections
- Configuration errors – Wrong URLs after deployment, expired SSL certificates, incorrect authentication credentials
- Resource exhaustion – Backend connection pool limits, disk space full, memory pressure causing rejections
- Throttling/rate limiting – Downstream systems rejecting requests due to volume (need backoff strategy)
- Intermittent connectivity – Unstable network links, flapping connections, DNS resolution failures
- Scheduled maintenance conflicts – Backend maintenance overlapping with BizTalk processing windows
Nodinite evaluates both count (how many instances retrying) and time (how long instances have been retrying), enabling detection of both mass failures (backend outage) and prolonged retry scenarios (configuration errors preventing eventual success).
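A minimal sketch of this dual-threshold evaluation, with illustrative threshold values; it mirrors the OK/Warning/Error model described below but is not the agent's actual implementation:

```python
from datetime import timedelta

def evaluate_retrying_instances(retry_ages, count_warning, count_error,
                                time_warning, time_error):
    """Return 'OK', 'Warning' or 'Error' from count- and time-based thresholds combined.

    retry_ages: one timedelta per retrying instance (how long it has been retrying).
    """
    count = len(retry_ages)
    oldest = max(retry_ages, default=timedelta(0))

    # Error takes precedence; breaching either dimension is enough to raise the state.
    if count > count_error or oldest > time_error:
        return "Error"
    if count > count_warning or oldest > time_warning:
        return "Warning"
    return "OK"

# Example: 12 instances retrying, the oldest for 9 minutes.
ages = [timedelta(minutes=9)] + [timedelta(minutes=3)] * 11
print(evaluate_retrying_instances(ages,
                                  count_warning=10, count_error=50,
                                  time_warning=timedelta(minutes=9),
                                  time_error=timedelta(minutes=13)))   # -> Warning (count breached)
```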
What are the key features for Monitoring BizTalk Server Retrying instances?
Nodinite's Retrying Instance monitoring provides dual-threshold evaluation combined with transient failure visibility, enabling proactive detection of backend outages and integration issues before retry exhaustion causes message loss:
- Dual-Threshold Evaluation – Intelligent monitoring using both count-based (how many instances retrying) and time-based (how long retrying) thresholds to detect mass backend failures vs. prolonged configuration errors.
- Transient Failure Detection – View retrying instances to identify which endpoints, partners, or systems are experiencing failures, pinpointing integration health issues.
- Retry Exhaustion Prevention – Early alerts enable intervention before instances exhaust all retry attempts and suspend (preventing message loss).
- Application-Specific Configuration – Tailor threshold settings per BizTalk Application to accommodate different integration reliability profiles (stable partners vs. flaky systems).
- Backend Health Indicator – Spike in retrying instances signals downstream system problems (databases, web services, file shares) before users report failures.
- SLA Risk Management – Detect retry delays impacting time-sensitive business processes (orders, payments, notifications) before SLA violations occur.
What is evaluated for BizTalk Retrying instances?
The monitoring agent continuously queries BizTalk's MessageBox to assess retrying instance counts and retry durations across all applications. Nodinite evaluates instances against both count and time thresholds, providing comprehensive integration health assessment:
| State | Status | Description | Actions |
|---|---|---|---|
| Unavailable | Resource not available | Evaluation of the 'Retrying instance(s)' is not possible due to network or security-related problems | Review prerequisites |
| Error | Error threshold is breached | More Retrying instances exist than allowed by the Error threshold | Details, Edit thresholds |
| Warning | Warning threshold is breached | More Retrying instances exist than allowed by the Warning threshold | Details, Edit thresholds |
| OK | Within user-defined thresholds | The number of Retrying instances is within the user-defined thresholds | Details, Edit thresholds |
Tip
You can reconfigure the evaluated state using the Expected State feature on every Resource within Nodinite. For integrations with known unreliable partners or systems with scheduled maintenance windows, you can expect Warning states during predictable retry periods without generating false alarms.
Actions
When retrying instances accumulate or remain in retry state beyond expected recovery time, immediate investigation prevents retry exhaustion and message suspension. Nodinite provides Remote Actions for rapid failure diagnosis and backend system health assessment.
These actions enable operations teams to identify which endpoints are failing and coordinate with infrastructure/backend teams for remediation. All actions are audit logged for compliance tracking.
Available Actions for Retrying Instances
The following Remote Actions are available for the Retrying instance(s) Category:

Retrying instance Actions Menu in Nodinite Web Client.
Details
When alerts indicate excessive retrying instances or prolonged retry durations, the Details view provides critical diagnostic information about which integrations are failing and why. This interface reveals transient failure patterns without requiring BizTalk Group Hub access.
What you can see:
- Instance details – Service name, send port, destination endpoint, instance ID
- Retry information – Current retry attempt number (e.g., "Retry 2 of 3"), next retry timestamp
- Failure details – Error message, exception type, HTTP status code (if applicable)
- Retry timestamp – When instance entered retry state (critical for time-based alerting)
- Target endpoint – Which URL, file path, database, or system is failing
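Conceptually, each retrying instance in this view is one record; the sketch below models it with hypothetical field names (only the Details modal itself is authoritative):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class RetryingInstanceDetails:
    """Illustrative record of what the Details view surfaces; field names are assumptions."""
    instance_id: str                     # BizTalk service instance ID
    service_name: str                    # messaging service or orchestration
    send_port: str                       # send port attempting delivery
    target_endpoint: str                 # URL, file path, database or system being retried
    retry_attempt: int                   # current attempt number, e.g. 2
    retry_count: int                     # configured maximum, e.g. 3
    next_retry_at: Optional[datetime]    # when the next attempt is scheduled
    entered_retry_state: datetime        # basis for time-based alerting
    error_message: str                   # last failure text, exception type or HTTP status

    def near_exhaustion(self) -> bool:
        """True when the next failed attempt would suspend the instance."""
        return self.retry_attempt >= self.retry_count
```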
When to use this view:
- Backend outage detection – Identify which downstream systems are unavailable (all retries to specific endpoint)
- Configuration error diagnosis – Find wrong URLs, expired certificates, incorrect credentials causing repeated failures
- Retry exhaustion prevention – See instances nearing final retry attempt to enable manual intervention before suspension
- Partner/vendor communication – Identify failing partner endpoints to escalate to external support teams
- Pattern analysis – Determine if failures are specific service, time-based (nightly), or intermittent
- SLA impact assessment – Identify time-sensitive messages stuck in retry (orders, payments, notifications)
Common diagnostic patterns:
Pattern: All retries to single endpoint
- Diagnosis: Specific backend system down (web service, database, file share)
- Action: Contact backend team, verify system health, check firewall/network
- Example: All retries show "Connection refused" to https://partner-api.example.com → partner's API server offline
Pattern: Mixed retry attempts (some at retry 1, some at retry 3)
- Diagnosis: Intermittent connectivity or flapping connection
- Action: Investigate network stability, DNS resolution, load balancer health
- Example: Some succeed on retry, others exhaust → unstable network link
Pattern: All instances failing at retry attempt 1
- Diagnosis: Not transient – permanent error (wrong URL, 404, authentication failure)
- Action: Fix configuration error immediately (instances will suspend after retries exhausted)
- Example: Certificate expired, wrong credentials, endpoint URL changed post-deployment
Pattern: Retrying instances for specific message type/partner
- Diagnosis: Partner system issues, routing configuration, adapter-specific problem
- Action: Test endpoint manually, verify adapter configuration, check partner system status
- Example: All EDI messages to specific trading partner retrying → partner's AS2 endpoint down
Pattern: Cyclical retry pattern (spike every 5 minutes)
- Diagnosis: Retry interval timing (all instances retrying simultaneously at interval)
- Action: Normal behavior if backend recovers before exhaustion; if sustained, backend needs attention
- Example: 50 instances retry every 5 minutes at exactly 10:00, 10:05, 10:10 → backend still failing
Pattern: Instances retrying for hours without suspension
- Diagnosis: Very high retry count configured (e.g., 999 retries) or infinite retry adapter setting
- Action: Evaluate if extended retries appropriate; may need to manually suspend to prevent indefinite delay
- Example: File adapter with infinite retry waiting for locked file that won't unlock
Tip
Check error messages for HTTP status codes: 503 Service Unavailable = backend overloaded (temporary), 401/403 = authentication (configuration), 404 = wrong URL (configuration), 500 = backend crash (contact backend team).
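That triage rule can be captured in a small lookup; this sketch covers only the status codes mentioned in the tip above:

```python
def triage_http_status(status: int) -> str:
    """Map an HTTP status code from a retry error message to a likely cause and owner."""
    if status == 503:
        return "Backend overloaded (temporary) - likely to recover; watch retry duration"
    if status in (401, 403):
        return "Authentication/authorization issue - check credentials and certificates"
    if status == 404:
        return "Wrong URL - fix send port configuration; retries will not help"
    if status >= 500:
        return "Backend error - contact the backend team"
    return "Review the full error message in the Details view"

print(triage_http_status(503))
```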
Warning
Instances nearing final retry attempt (e.g., "Retry 3 of 3") will suspend (non-resumable) if next attempt fails. Messages in non-resumable suspended state cannot be automatically resumed – manual resubmission or backend recovery + new message required.
To access retry diagnostics, press the Action button and select the Details menu item:

Action button menu with 'Details' option.
The modal displays comprehensive retry information including failure details and retry attempt progress:

Details modal showing retrying instances with error messages and retry attempt counts.
Edit thresholds
Retrying instance monitoring uses dual-threshold evaluation—both count-based (how many instances retrying) and time-based (how long instances have been retrying)—to detect different failure patterns. This enables detection of both mass backend outages (count spikes) and prolonged configuration errors (extended retry durations).
When to adjust thresholds:
- Count: After improving backend reliability, for integrations with unreliable partners, based on retry intervals, during scheduled maintenance
- Time: Based on retry configuration (count × interval), for time-sensitive SLA messages, to catch configuration errors early
Threshold tuning strategy:
- Calculate retry exhaustion: (Retry count × Retry interval)
- Set time Warning at 60%, Error at 85% of exhaustion
- Set count Warning >5-10 instances, Error >20-50 instances
Per-application overrides: B2B/EDI partners (higher), Internal databases (very low), Real-time APIs (SLA-critical)
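The per-application idea can be modeled as overrides on top of a default profile. The sketch below is conceptual only; application names and values are invented, and in Nodinite the actual thresholds are set per Application via Edit thresholds or Remote Configuration:

```python
from datetime import timedelta

# Illustrative defaults and per-application overrides (names and values are invented).
DEFAULT = {"count_warning": 10, "count_error": 30,
           "time_warning": timedelta(minutes=9), "time_error": timedelta(minutes=13)}

OVERRIDES = {
    # Flaky external partners: tolerate more simultaneous retries.
    "B2B.EDI.Partners":   {"count_warning": 50, "count_error": 150},
    # Internal databases should almost never retry.
    "Internal.Orders.DB": {"count_warning": 3, "count_error": 10},
    # Real-time APIs: very tight time thresholds to protect the SLA.
    "RealTime.Payments":  {"time_warning": timedelta(seconds=40),
                           "time_error": timedelta(seconds=50)},
}

def thresholds_for(application: str) -> dict:
    """Merge an application's overrides onto the default profile."""
    return {**DEFAULT, **OVERRIDES.get(application, {})}

print(thresholds_for("RealTime.Payments"))
```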
Thresholds can be managed through the Actions menu or via Remote Configuration for bulk adjustments.
To manage the Retrying instance(s) threshold for the selected BizTalk Server Application, press the Action button and select the Edit thresholds menu item:

Action button menu with Edit thresholds option.
The modal allows you to configure both time-based and count-based alert thresholds:

Dual-threshold configuration for comprehensive retry monitoring.
Time-based evaluation
Time-based evaluation detects instances stuck in retry state longer than expected, enabling intervention before retry exhaustion and message suspension. Nodinite tracks how long each instance has been retrying, alerting when duration approaches configured retry limits.
Time-based evaluation is always active. If you don't want time-based alerting, set thresholds longer than your maximum retry duration (retry count × retry interval), or use Expected State to accept Warning states during known backend maintenance.
Why time thresholds are critical for retries:
Retries have finite duration before exhaustion. Once all retry attempts fail, instances suspend (non-resumable) and messages are lost unless manually resubmitted. Time thresholds enable intervention before suspension by alerting when:
- Instances have been retrying for 80-90% of maximum retry duration
- Backend hasn't recovered in expected timeframe
- Configuration errors preventing eventual success are detected early
Calculating appropriate time thresholds:
Formula: Warning = (Retry Count × Retry Interval) × 0.6, Error = (Retry Count × Retry Interval) × 0.85
Examples:
- 3 retries × 5 min = 15 min max: Warning: 9 min, Error: 13 min
- 10 retries × 2 min = 20 min max: Warning: 12 min, Error: 17 min
- Real-time (2 × 30 sec = 1 min): Warning: 40 sec, Error: 50 sec
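The arithmetic behind these examples as a short sketch, applying the 60% and 85% factors from the formula above (the printed values are exact; the examples above round to whole minutes):

```python
from datetime import timedelta

def suggest_time_thresholds(retry_count: int, retry_interval: timedelta):
    """Warning at 60% and Error at 85% of the maximum retry duration (count x interval)."""
    max_duration = retry_count * retry_interval
    return max_duration * 0.6, max_duration * 0.85

# 3 retries x 5 minutes = 15 minutes maximum retry duration.
warning, error = suggest_time_thresholds(3, timedelta(minutes=5))
print(warning, error)   # 0:09:00 0:12:45 -> roughly the 9 and 13 minutes in the examples above
```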
Diagnostic value of time-based retry alerts:
- Alert within first retry interval → Likely configuration error (wrong URL, expired cert); won't resolve via retry
- Alert after 50-70% of max duration → Backend hasn't recovered; likely sustained outage requiring intervention
- Alert just before exhaustion → Last chance to manually intervene, contact backend team, or suspend gracefully
- Multiple instances at similar retry duration → Mass backend failure affecting all integrations to that system
- Single instance retrying for extended time → Isolated issue; specific message, routing, or partner problem
Warning
When instances reach final retry attempt (e.g., "Retry 5 of 5"), next failure causes non-resumable suspension. Messages cannot be automatically resumed—requires manual resubmission or backend recovery + message resend from source system. Time-based alerts provide intervention window to prevent message loss.
Tip
For backends with predictable recovery times (e.g., "database restarts take 5-10 minutes"), set Error threshold slightly beyond typical recovery to avoid false alarms but catch prolonged outages. Example: DB typically restarts in 8 min → set Error at 12 min.
| Name | Data Type | Description |
|---|---|---|
| Warning TimeSpan | Timespan, e.g., 00:13:37 (13 minutes 37 seconds) | If any retrying instance has been in retry state longer than this timespan, a Warning alert is raised. Set at 60-70% of maximum retry duration to enable early intervention. Format: days.hours:minutes:seconds (e.g., 0.00:10:00 = 10 minutes) |
| Error TimeSpan | Timespan, e.g., 01:10:00 (1 hour 10 minutes) | If any retrying instance has been in retry state longer than this timespan, an Error alert is raised. Set at 80-90% of maximum retry duration to alert before final retry attempt and suspension. Format: days.hours:minutes:seconds (e.g., 0.00:15:00 = 15 minutes) |
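For reference, a minimal sketch that converts the days.hours:minutes:seconds format used in the table into a Python timedelta, assuming the days part is optional and fractional seconds are not used:

```python
from datetime import timedelta

def parse_timespan(value: str) -> timedelta:
    """Parse 'days.hours:minutes:seconds' (days optional), e.g. '0.00:10:00' or '00:13:37'."""
    days = 0
    if "." in value:
        day_part, value = value.split(".", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)

print(parse_timespan("0.00:10:00"))   # 0:10:00 -> 10 minutes
print(parse_timespan("00:13:37"))     # 0:13:37 -> the Warning TimeSpan example above
```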
Count-based evaluation
Count-based evaluation detects mass backend failures where multiple instances retry simultaneously, indicating downstream system outages, network failures, or infrastructure problems affecting the broader integration landscape.
What retry counts reveal:
- 0-2 instances – Normal transient failures (random timeouts, occasional network glitches)
- 3-10 instances – Elevated retry rate; monitor for pattern (specific endpoint? time-based?)
- 10-50 instances – Likely backend system issue; investigate endpoint health, network connectivity
- 50-100 instances – Confirmed backend outage or mass configuration error; immediate action required
- 100+ instances – Severe sustained backend failure; coordinate with infrastructure/backend teams urgently
How to set count thresholds:
- Establish baseline retry rate during normal operations
- Set Warning above baseline, Error at backend failure threshold
- Account for message volume (higher throughput → scale thresholds)
Example thresholds:
- Low-volume partners (50 msg/hr): Warning 3, Error 10
- Medium-volume internal (500 msg/hr): Warning 10, Error 30
- High-volume B2B (5000 msg/hr): Warning 50, Error 150
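A minimal sketch that turns these example tiers into a lookup by hourly volume; in practice you would refine the result against your measured baseline retry rate:

```python
import bisect

# Example tiers from the list above: (messages per hour, Warning count, Error count).
EXAMPLE_TIERS = [(50, 3, 10), (500, 10, 30), (5000, 50, 150)]

def suggest_count_thresholds(messages_per_hour: int):
    """Pick the closest tier at or below the given volume (falls back to the lowest tier)."""
    volumes = [volume for volume, _, _ in EXAMPLE_TIERS]
    index = max(0, bisect.bisect_right(volumes, messages_per_hour) - 1)
    _, warning, error = EXAMPLE_TIERS[index]
    return warning, error

print(suggest_count_thresholds(800))   # medium-volume tier -> (10, 30)
```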
Diagnostic patterns:
- Sudden spike (0 → 50+ instances): Backend just went offline → contact team immediately
- Gradual increase (5 → 30 over hours): Backend degrading (memory leak, disk filling) → prevent full outage
- All instances exhaust and suspend: Sustained backend outage → fix urgently, prepare resubmission
Warning
High retry counts (>50 instances for medium-volume apps) indicate:
- Backend system offline – Web service, database, file share, or partner system unavailable
- Network infrastructure failure – Firewall blocking, DNS failure, routing issue
- Mass configuration error – Wrong URL deployed to all send ports, certificate expired
- Cascading failures risk – If backend doesn't recover, all retrying instances will suspend
Tip
Correlation with other metrics: High retry counts + high Ready to Run counts = backend backpressure (system slow, not down). High retry counts + normal Ready to Run = backend outage (system completely unavailable).
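A minimal sketch of that correlation rule, assuming you already have both counts available (for example from the corresponding Nodinite Resources) and using illustrative cut-offs for "high":

```python
def diagnose(retrying_count: int, ready_to_run_count: int,
             retrying_high: int = 50, ready_high: int = 100) -> str:
    """Correlate Retrying and Ready to Run counts; the 'high' cut-offs are illustrative."""
    if retrying_count >= retrying_high and ready_to_run_count >= ready_high:
        return "Backend backpressure: system is slow, not down"
    if retrying_count >= retrying_high:
        return "Backend outage: system appears completely unavailable"
    return "Retry levels within normal range"

print(diagnose(retrying_count=80, ready_to_run_count=20))   # -> backend outage (Ready to Run is normal)
```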
Tip
Backend team coordination: When Error threshold breached, immediately engage backend/infrastructure teams. Provide failing endpoint details from Details view. Every retry cycle brings instances closer to exhaustion and message loss.
| Name | Data Type | Description |
|---|---|---|
| Warning Count | Integer | If the total number of retrying instances exceeds this value, a Warning alert is raised. Set slightly above the baseline transient failure rate to detect emerging backend issues early. |
| Error Count | Integer | If the total number of retrying instances exceeds this value, an Error alert is raised. Set at a level indicating a confirmed backend outage or mass configuration error requiring immediate remediation. |
Next Step
Add or manage a Monitoring Agent Configuration
Configuration
Related Topics
BizTalk Monitoring Agent
Administration
Monitoring Agents
Add or manage a Monitoring Agent Configuration
Remote Configuration