Opening

“We got a timeout error, but the customer says the payment went through!”

3 AM, emergency call. A TimeoutException had fired in our real-time settlement system, yet the money had already landed in the customer’s account. That call started our war against timeouts, and against the many ways they surface in the layered structure of financial payment systems.

The Beginning - When Inconsistency Strikes

While developing our real-time settlement and payment agency system, we built the following architecture:

CLIENT (Merchants/Users) → PG (Our System) → Financial VAN (HYPEN/DOZN) → BANK

The core issue in this seemingly peaceful structure was this: we marked the transaction as failed due to timeout, but the bank had actually processed the settlement successfully. Upon checking the logs, we found a success callback from the VAN provider that arrived 5 minutes after our Read Timeout occurred.

The Trinity of TimeoutException - Connection, Read, and Socket

1. Connection Timeout: Knocking on the Door

Connection Timeout is the maximum time we wait for the TCP 3-way handshake to complete. Think of it as how long you are willing to wait in line outside a popular restaurant.

Main causes of Connection Timeout in financial VAN integration:

  • Firewall issues: Security is paramount in finance. Missing IP whitelisting means no connection at all.
  • Network routing problems: Issues with dedicated financial network connections.
  • VAN server downtime: Scheduled maintenance windows we overlooked, or unexpected failures.

2. Read Timeout: Waiting for Your Order

Read Timeout occurs when the connection is established but no response arrives within the allowed time. It’s like being seated at the restaurant and placing your order, but the food never comes.

Why Read Timeout is most troublesome in payment gateway integration:

  • Bank processing delays: Large transfers or suspicious transactions go through additional verification processes within the bank.
  • VAN queuing: Transaction queues get longer during month-end or payroll days.
  • Network packet loss: Even dedicated financial networks aren’t 100% perfect.

3. Socket Timeout: The Breathing Space Between Packets

Socket Timeout is the maximum allowed gap between individual packets. When a server sends its response in multiple packets and the gap between two of them grows too long, this timeout fires. It’s mainly an issue when receiving large settlement data.
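
To make the difference concrete, here is a minimal Python sketch using only the standard `socket` module. The names `send_request` and `CallOutcome` and the default values are illustrative assumptions, not our production client. Python sockets have a single per-operation idle timeout, so in this sketch one setting plays the role of both Read Timeout and Socket Timeout.

```python
import socket
from enum import Enum


class CallOutcome(Enum):
    OK = "ok"
    CONNECT_TIMEOUT = "connect_timeout"  # handshake did not finish in time
    CONNECT_ERROR = "connect_error"      # refused, unreachable, etc.
    READ_TIMEOUT = "read_timeout"        # connected, but no data arrived in time


def send_request(host, port, payload, connect_timeout=3.0, read_timeout=20.0):
    """Send payload and read one chunk, reporting which phase timed out."""
    try:
        # The timeout passed here only bounds the TCP handshake.
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except socket.timeout:
        return CallOutcome.CONNECT_TIMEOUT, b""
    except OSError:
        return CallOutcome.CONNECT_ERROR, b""

    with sock:
        # From here on, each blocking send/recv may wait at most read_timeout.
        sock.settimeout(read_timeout)
        try:
            sock.sendall(payload)
            data = sock.recv(4096)
        except socket.timeout:
            # The request may already have reached the peer: outcome unknown.
            return CallOutcome.READ_TIMEOUT, b""
        return CallOutcome.OK, data
```

The useful part is the return value: a caller can tell "never connected" apart from "connected, then silence", which matters for everything that follows.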

The Trap of Hierarchical Timeout Settings

The biggest problem we discovered during actual operations was the inversion of the timeout chain.

Ideal timeout chain:

  • Client Timeout: 60 seconds
  • PG Timeout: 50 seconds
  • VAN Timeout: 40 seconds
  • Bank Timeout: 30 seconds

Lower layers (those closer to the bank) should have shorter timeouts, so that an upper layer is still waiting when the layer below it gives up and can respond appropriately. In reality, a configuration error had inverted part of our chain: the VAN timeout was 40 seconds, but PG waited only 30 seconds. As a result, while the VAN was still waiting for the bank’s response, we had already marked the transaction as a timeout failure.
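
A check like the following could catch this kind of inversion at deploy time. It is a sketch: the function name `find_inversions` and the list-of-pairs layout are assumptions made for illustration.

```python
from typing import List, Tuple


def find_inversions(chain: List[Tuple[str, float]], margin: float = 0.0) -> List[str]:
    """Return one message per caller that does not outwait its callee.

    `chain` is ordered from the outermost caller to the innermost callee,
    with timeouts in seconds.
    """
    problems = []
    for (outer, outer_t), (inner, inner_t) in zip(chain, chain[1:]):
        if outer_t <= inner_t + margin:
            problems.append(
                f"{outer} ({outer_t}s) must wait longer than {inner} ({inner_t}s)"
            )
    return problems


ideal = [("client", 60), ("pg", 50), ("van", 40), ("bank", 30)]
broken = [("client", 60), ("pg", 30), ("van", 40), ("bank", 30)]

print(find_inversions(ideal))   # no inversions
print(find_inversions(broken))  # flags the pg/van pair
```

The optional `margin` lets you demand some headroom between layers, since a caller needs time to process the callee's timeout response before its own deadline passes.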

VAN-Specific Timeout Characteristics and Responses

HYPEN

HYPEN typically responds fast but becomes unstable during peak times. During the 9-10 AM rush and the 3-4 PM bank closing window in particular, responses were 3-5 times slower than usual. Our strategy was a baseline of 3 seconds for Connection and 20 seconds for Read, raised dynamically during peak hours.

DOZN

DOZN is more stable than HYPEN but generally slower to respond. For large transactions and batch transfers in particular, processing time grew roughly linearly with the size of the request. A baseline of 5 seconds for Connection and 30 seconds for Read, adjusted by transaction amount or count, proved effective.
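
The two policies can be sketched as one lookup function. The baselines and peak windows come from the text above; the peak multiplier, the per-item increment, and the 45-second ceiling are made-up values for illustration only.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Timeouts:
    connect: float  # seconds
    read: float     # seconds


BASELINES = {"HYPEN": Timeouts(3.0, 20.0), "DOZN": Timeouts(5.0, 30.0)}
HYPEN_PEAK_WINDOWS = [(time(9, 0), time(10, 0)), (time(15, 0), time(16, 0))]

# Assumed tuning values, not measured ones.
HYPEN_PEAK_FACTOR = 2.0
DOZN_SECONDS_PER_EXTRA_ITEM = 0.5
MAX_READ = 45.0  # keep the VAN call below whatever the PG's own caller allows


def timeouts_for(van: str, at: time, item_count: int = 1) -> Timeouts:
    """Pick connect/read timeouts for one VAN call."""
    base = BASELINES[van]
    read = base.read
    if van == "HYPEN" and any(start <= at < end for start, end in HYPEN_PEAK_WINDOWS):
        read *= HYPEN_PEAK_FACTOR
    elif van == "DOZN" and item_count > 1:
        read += DOZN_SECONDS_PER_EXTRA_ITEM * (item_count - 1)
    return Timeouts(base.connect, min(read, MAX_READ))
```

The ceiling matters: a dynamically raised timeout must still fit inside the timeout of the layer above, or the adjustment recreates the inversion problem.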

Retry Strategy When Timeouts Occur

Financial VAN providers typically offer retry guidelines and transaction status inquiry APIs for timeout cases. However, beyond simply following VAN guidelines, the key is how the PG side handles data and ensures business continuity.

1. Transaction State Management Architecture

The first thing to build is a sophisticated state management system. Simply having success/failure states isn’t enough to handle timeout situations properly. You need granular states like PENDING, PROCESSING, TIMEOUT, SUCCESS, and FAILED, recording timestamps and reasons for each state transition.

The TIMEOUT state is particularly important. It does not mean “failed”; it means the outcome is not yet known, so the transaction requires periodic status checks and, if necessary, reprocessing.
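
A minimal sketch of such a state model follows. The five states are the ones named above; the table of allowed transitions is our assumption about a reasonable design, and a real system would persist the history rather than keep it in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Tuple


class TxState(Enum):
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    TIMEOUT = "TIMEOUT"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"


# Assumed transition table. TIMEOUT is not terminal: it can still resolve.
ALLOWED = {
    TxState.PENDING: {TxState.PROCESSING, TxState.FAILED},
    TxState.PROCESSING: {TxState.SUCCESS, TxState.FAILED, TxState.TIMEOUT},
    TxState.TIMEOUT: {TxState.PROCESSING, TxState.SUCCESS, TxState.FAILED},
    TxState.SUCCESS: set(),
    TxState.FAILED: set(),
}


@dataclass
class Transaction:
    tx_id: str
    state: TxState = TxState.PENDING
    # Each entry: (timestamp, from_state, to_state, reason)
    history: List[Tuple[datetime, TxState, TxState, str]] = field(default_factory=list)

    def transition(self, new_state: TxState, reason: str) -> None:
        """Move to new_state, recording when and why; reject illegal moves."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.history.append((datetime.now(timezone.utc), self.state, new_state, reason))
        self.state = new_state
```

Making SUCCESS and FAILED terminal means a late callback can never silently flip a settled transaction; it raises instead, which is exactly the kind of event you want to see in the logs.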

2. Intelligent Retry Strategy

When a timeout occurs, immediately call the VAN’s transaction inquiry API. There are four possible outcomes:

SUCCESS (Phantom Success): We marked it as failed due to timeout, but it actually succeeded. In this case, immediately update the internal state to success and notify the client. Also check settlement data consistency.

PROCESSING: Still being processed by VAN or bank. Use exponential backoff strategy to increase the recheck interval.

NOT_FOUND: The VAN can’t find the transaction either. The request was most likely lost at the network level, so send it again within the retry limit, reusing the same idempotency key.

FAILED: Clear failure. Analyze the failure reason to determine if retry is possible.
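
The four outcomes map naturally onto a small decision function. This is a sketch under assumptions: the enum names, the limit of 5 attempts, and the 30-second base delay are illustrative, and whether a failure is retryable would come from analyzing the VAN's failure code.

```python
from enum import Enum
from typing import Tuple


class InquiryResult(Enum):
    SUCCESS = "SUCCESS"
    PROCESSING = "PROCESSING"
    NOT_FOUND = "NOT_FOUND"
    FAILED = "FAILED"


class NextAction(Enum):
    MARK_SUCCESS_AND_NOTIFY = "mark_success_and_notify"
    RECHECK_LATER = "recheck_later"
    RESEND_WITH_SAME_KEY = "resend_with_same_key"
    MARK_FAILED = "mark_failed"
    ESCALATE = "escalate"


def decide(
    result: InquiryResult,
    attempts: int,
    max_attempts: int = 5,
    base_delay: float = 30.0,
    failure_is_retryable: bool = False,
) -> Tuple[NextAction, float]:
    """Return the next action and how many seconds to wait before taking it."""
    if result is InquiryResult.SUCCESS:
        return NextAction.MARK_SUCCESS_AND_NOTIFY, 0.0
    if result is InquiryResult.FAILED and not failure_is_retryable:
        return NextAction.MARK_FAILED, 0.0
    if attempts >= max_attempts:
        # Out of budget: hand over to a human or an alternative route.
        return NextAction.ESCALATE, 0.0
    if result is InquiryResult.PROCESSING:
        # Exponential backoff: 30s, 60s, 120s, ...
        return NextAction.RECHECK_LATER, base_delay * (2 ** attempts)
    # NOT_FOUND, or a FAILED result judged retryable.
    return NextAction.RESEND_WITH_SAME_KEY, 0.0
```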

3. Compensation Transaction Pattern

When retry limits are reached or continuous problems occur with a specific VAN, create a compensation transaction. This is a new transaction separate from the original, with these characteristics:

  • Clearly maintain linkage with the original transaction
  • Route processing through an alternative VAN
  • More conservative timeout settings (double the normal)
  • Immediate manual intervention alert on failure

4. Data Consistency Guarantee Strategy

Data consistency is paramount in financial transactions. To ensure this even during timeout situations:

Distributed Locking: Prevents the same transaction from being processed simultaneously by multiple threads. This is especially important during post-timeout reprocessing, when the original transaction might still be in progress.

Idempotency Key Management: Every transaction must have a unique idempotency key, and the same key must be used for retries. VAN providers also filter duplicate requests using this key.

State Synchronization: Maintain locks for a certain period even after timeout to prevent race conditions with asynchronous VAN callbacks.
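
The interplay of locking and idempotency keys can be sketched in a single process. Here `threading.Lock` stands in for a real distributed lock and a dict stands in for a durable result store; both substitutions, and the class name `IdempotentExecutor`, are assumptions made to keep the example self-contained.

```python
import threading
import uuid
from typing import Callable, Dict


def new_idempotency_key() -> str:
    """Generate the key once per transaction and store it with the transaction."""
    return uuid.uuid4().hex


class IdempotentExecutor:
    def __init__(self) -> None:
        self._results: Dict[str, str] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the _locks dict itself

    def _lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def execute(self, key: str, send: Callable[[str], str]) -> str:
        """Run send(key) at most once per key, even under concurrent callers."""
        with self._lock_for(key):
            if key in self._results:
                return self._results[key]  # a retry or callback already settled it
            # If send raises (for example on a timeout), nothing is stored,
            # so a later retry with the SAME key goes out again and relies
            # on the VAN to filter the duplicate.
            result = send(key)
            self._results[key] = result
            return result
```

The per-key lock is what keeps a retry worker and an asynchronous callback handler from both acting on the same transaction at once.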

5. Real-time Monitoring and Self-healing

Timeouts aren’t isolated events but have patterns. You need a system that monitors these in real-time and responds automatically:

Pattern Analysis: Analyze timeouts from the last 30 minutes to detect performance degradation of specific VANs or spikes at certain times.

Automatic Actions: Automatically adjust VAN traffic ratios, dynamically change timeout thresholds, activate alternative routes.

Long-pending Transaction Handling: Move transactions that have been in TIMEOUT state for over 10 minutes to a separate queue for focused management.
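
Two of these pieces, the 30-minute window and the 10-minute sweep, fit in a few lines. This is a sketch: the function names and the in-memory inputs are assumptions, and a production version would read from a metrics store and the transaction table.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple

WINDOW = timedelta(minutes=30)
LONG_PENDING = timedelta(minutes=10)


def timeouts_per_van(events: Iterable[Tuple[datetime, str]], now: datetime) -> Counter:
    """Count timeout events per VAN within the last 30 minutes.

    Each event is (timestamp, van_name).
    """
    cutoff = now - WINDOW
    return Counter(van for ts, van in events if cutoff <= ts <= now)


def long_pending(timeout_since: Dict[str, datetime], now: datetime) -> List[str]:
    """Return IDs of transactions stuck in TIMEOUT for more than 10 minutes.

    `timeout_since` maps transaction ID to the moment it entered TIMEOUT.
    """
    return sorted(tx for tx, since in timeout_since.items() if now - since > LONG_PENDING)
```

The per-VAN counts are the input for the automatic actions: a VAN whose count spikes relative to the others is the candidate for reduced traffic or a longer timeout.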

Real-world Troubleshooting Cases

Case 1: The 9 AM Nightmare

Every day at exactly 9 AM, massive timeouts occurred. The cause: every merchant requested the previous day’s settlement at the same moment.

The solution was surprisingly simple. We spread the requests out by delaying each merchant by 0 to 300 seconds, with the delay derived from a hash of the merchant ID. We also applied the Circuit Breaker pattern to automatically delay requests or route them to an alternative VAN when certain thresholds were exceeded.
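
The delay calculation is tiny. The sketch below assumes SHA-256 as the hash and a string merchant ID; the function name is illustrative.

```python
import hashlib

MAX_DELAY_SECONDS = 300


def settlement_delay(merchant_id: str) -> int:
    """Map a merchant ID to a stable delay in the range 0..300 seconds.

    A cryptographic hash spreads IDs evenly. Python's built-in hash() is
    avoided because it is randomized per process for strings, so the same
    merchant would get a different slot after every restart.
    """
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (MAX_DELAY_SECONDS + 1)
```

Because the delay is a pure function of the merchant ID, each merchant lands in the same slot every day, which keeps the load shape predictable and makes individual delays easy to explain to a merchant who asks.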

Case 2: The Mystery of Phantom Transactions

Read Timeout occurred but the customer’s account showed the deposit completed. There was a delay in the VAN-Bank segment, but processing ultimately succeeded.

To handle such cases, we schedule a status check 30 seconds after the timeout occurs and send a “processing” notification to the customer. Status checks are attempted up to 5 times, with the wait between attempts growing roughly exponentially: 30 seconds, 1 minute, 2 minutes, then 5 minutes.
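
Here is one reading of that schedule as code: the first check fires 30 seconds after the timeout, and up to four more follow after waits of 30, 60, 120, and 300 seconds. The `inquire` and `sleep` callables are injected assumptions so the loop can be exercised without a real VAN or real waiting.

```python
from typing import Callable, List, Optional

FIRST_CHECK_DELAY = 30                # seconds after the timeout
FOLLOW_UP_WAITS = [30, 60, 120, 300]  # seconds between later checks


def check_offsets() -> List[int]:
    """Seconds after the timeout at which each of the 5 checks fires."""
    offsets = [FIRST_CHECK_DELAY]
    for wait in FOLLOW_UP_WAITS:
        offsets.append(offsets[-1] + wait)
    return offsets


def poll_status(
    inquire: Callable[[], str],
    sleep: Callable[[float], None],
) -> Optional[str]:
    """Poll until a final status arrives or the 5 attempts are used up.

    Returns "SUCCESS" or "FAILED", or None if the outcome is still unknown
    and the transaction should move to manual or long-pending handling.
    """
    for wait in [FIRST_CHECK_DELAY] + FOLLOW_UP_WAITS:
        sleep(wait)
        status = inquire()
        if status in ("SUCCESS", "FAILED"):
            return status
    return None
```

Under this reading the last check lands 9 minutes after the timeout, just before the 10-minute mark at which a transaction counts as long-pending.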

Case 3: Distinguishing Network Disconnection vs Timeout

Connection Timeout and Read Timeout require completely different handling approaches.

Connection Timeout likely means the request wasn’t delivered at all, so immediate retry is possible. Read Timeout means the request was delivered but response is delayed, so check status first before taking action.

Socket Timeout often points to packet loss, so classify it as a network quality issue and track it separately.

Building Monitoring and Alert Systems

Timeout Dashboard Configuration

For effective monitoring, track these metrics in real-time:

  • Timeout count by type: Frequency of Connection, Read, Socket occurrences
  • Distribution by VAN: Which VAN mainly experiences issues
  • Time-based patterns: Whether concentrated at specific times
  • Phantom success rate: Percentage that actually succeeded after timeout
  • Average recovery time: Time from timeout to final status confirmation

Real-time Alert Configuration

Timeout alerts need appropriate sensitivity adjustment:

  • Connection Timeout: Alert ops team if 10+ occurrences within 5 minutes
  • Read Timeout: Alert dev team if 50+ occurrences within 10 minutes
  • Phantom Success: Immediate check for even 1 occurrence
  • Recovery Failure: Emergency alert if failed after 3 retries
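
These rules can be expressed as data plus one evaluation function. The thresholds for Connection and Read Timeout come from the list above; the event names, the one-minute windows for the single-occurrence rules, and the assumption that a `recovery_failure` event is emitted only after the third failed retry are ours.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class AlertRule:
    event: str
    threshold: int     # fire when count within window >= threshold
    window: timedelta
    target: str        # who gets notified


RULES = [
    AlertRule("connection_timeout", 10, timedelta(minutes=5), "ops"),
    AlertRule("read_timeout", 50, timedelta(minutes=10), "dev"),
    # Single-occurrence rules; the 1-minute window is an assumed
    # evaluation interval, not a value from our configuration.
    AlertRule("phantom_success", 1, timedelta(minutes=1), "immediate-check"),
    AlertRule("recovery_failure", 1, timedelta(minutes=1), "emergency"),
]


def fired_alerts(events: Iterable[Tuple[datetime, str]], now: datetime) -> List[str]:
    """Evaluate every rule against (timestamp, event_name) pairs."""
    events = list(events)
    fired = []
    for rule in RULES:
        start = now - rule.window
        count = sum(1 for ts, name in events if name == rule.event and start <= ts <= now)
        if count >= rule.threshold:
            fired.append(f"{rule.target}: {rule.event} x{count}")
    return fired
```

Keeping the thresholds in a table rather than in code makes the sensitivity adjustment mentioned above a configuration change instead of a deployment.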

Lessons and Best Practices

1. Design Timeouts Hierarchically

Higher layers should have longer timeouts. This allows upper layers to properly handle delays occurring in lower layers.

2. Timeout ≠ Failure

Especially in financial transactions, actual transactions may succeed despite timeouts. Always prioritize status verification and avoid hasty failure processing.

3. Idempotency is Mandatory, Not Optional

All financial APIs must have idempotency keys, and the same key must be used for retries after timeout. This is the most basic safeguard against duplicate transactions.

4. Logging is the Beginning of Debugging

Accurately record timeout occurrence time, duration, and location. It’s especially important to distinguish which segment (PG-VAN, VAN-Bank) experienced the issue.

5. Load Testing Should Mirror Real Environment

Test scenarios must include timeout cases. Simulate various situations including not just normal responses but delayed responses, timeouts, and network disconnections to avoid surprises in actual operations.

Closing

TimeoutException is an unavoidable phenomenon in distributed systems. It becomes even more complex in environments like financial payment systems where multiple institutions are interconnected.

What’s important:

  1. Don’t treat a timeout as a failure: Doing so can be fatal, especially in financial transactions
  2. Hierarchical timeout design: Longer timeouts for upper layers
  3. Status verification mechanism: Always check transaction status after timeout
  4. Idempotency guarantee: Prevent retries from becoming duplicate transactions
  5. Monitoring and analysis: Identify patterns for proactive response

The war against timeouts never ends. But with sufficient understanding and preparation, you can definitely reduce those 3 AM emergency calls.

And remember: “Getting a timeout” and “transaction failed” are completely different stories.