Solve Potential Too-Many-Request Token Blow-Up

So we are moving towards using OAuth authentication for the new FedEx and UPS integration, where the payloads will be in JSON not XML. Yay!

A reminder that in this approach, we first send a request to retrieve an access token from the 3rd party API. This token will then be used and re-used in every subsequent request (instead of the plain username-password credential) until it expires. When it expires, we have to renew this access token and so on.

Get Token

So the approach I’ve implemented so far is like this (pseudo-code, okay?):
var bearerToken = requestToken(CACHE or RENEWED)

// Check if token exists in cache (cache is set to expire/be evicted according to how long the token is valid for... this "expiration" value is provided by their API when we request a new token - for FedEx, it expires every hour)
- If YES, retrieve cache token
- If NO, request a new token and then cache it, in this case, for 1 hour

So this bearerToken is then used in the HTTP header in a subsequent request
addHeaderParam("Bearer " + bearerToken)
This looks fine: we cache the token and we use this cached token for up to an hour. No risk here, we are not requesting a new token every time. Looks great.
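To make the cache-or-renew step a bit more concrete, here is a minimal Java sketch. It is not the actual integration code: `CachedToken`, `TokenCache`, the `tokenEndpoint` supplier (standing in for the real HTTP call) and the 30-second `SAFETY_MARGIN` are all assumptions made up for illustration. The margin just treats the token as expired slightly early so it is less likely to expire mid-flight.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Hypothetical token holder: the value plus the instant the API says it expires. */
final class CachedToken {
    final String value;
    final Instant expiresAt;

    CachedToken(String value, Instant expiresAt) {
        this.value = value;
        this.expiresAt = expiresAt;
    }
}

/** Minimal cache-or-renew sketch (assumed names, not the real implementation). */
final class TokenCache {
    // Assumption: stop trusting the token a little before its official expiry.
    private static final Duration SAFETY_MARGIN = Duration.ofSeconds(30);

    private final Supplier<CachedToken> tokenEndpoint; // stands in for the real token request
    private CachedToken cached;                        // null means nothing cached yet
    int renewCount = 0;                                // exposed only so the sketch can be inspected

    TokenCache(Supplier<CachedToken> tokenEndpoint) {
        this.tokenEndpoint = tokenEndpoint;
    }

    /** Returns the cached token if it is still usable at {@code now}; otherwise renews and caches. */
    synchronized String getToken(Instant now) {
        boolean missing = cached == null;
        if (missing || !now.isBefore(cached.expiresAt.minus(SAFETY_MARGIN))) {
            cached = tokenEndpoint.get();
            renewCount++;
        }
        return cached.value;
    }
}
```

With a one-hour token, a call at minute 30 is served from the cache, while a call in the last 30 seconds of the hour triggers a renewal.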

Renew Token

But then the retry logic is implemented, because it is possible that a token that you have retrieved from cache, or even freshly renewed, could be expired. This can happen when, for example:
Expired token: the cached token we retrieve expires right at the millisecond or nanosecond we use it to authenticate the request (say it takes half a server round-trip for our token+request to reach FedEx API... it has expired some time during this trip).
401 Unauthorized response will be returned by FedEx API (signalling expired token).
Outdated/invalid token: Some other user, say USER B, for whatever reason, renews the token right before your supposedly valid token has been sent to their API. This means the token you are holding is no longer valid.
401 Unauthorized error will be returned also (signalling invalid/outdated token)

If the HTTP request using that token returns 401 Unauthorized, we then request a new bearer token and then retry the failed request with a new token. So the code is like:
var bearerToken = requestToken(RENEWED)
Looks fine so far. The #1 retry attempt should successfully authenticate now, because we use a freshly renewed token.
But it is still possible that we get a 401 Unauthorized response again, so we retry one more time. Actually we've set the maximum number of retries to 3. So requestToken() will be triggered 4 times in total: the first is the normal call where we get a token from either the cache or a renewal, and the rest are the 3 retry attempts. So the code now looks like:
var bearerToken = requestToken(CACHE OR RENEWED) // gimme some token
var bearerToken = requestToken(RENEWED) // retry #1
var bearerToken = requestToken(RENEWED) // retry #2
var bearerToken = requestToken(RENEWED) // retry #3

How many times have we renewed the tokens now?

Note that every time we request a new token, we also cache it. But it makes no sense for us to retrieve a CACHED token during retry #1, #2, or #3, right? We have to freshly/force-renew a token in between the retry attempts. Otherwise, you would be retrieving the cached token which had led to invalid authentication in the first place.
So now if you look at the code, we could hypothetically be renewing a token 4 times for a single method call for a single user. How does this fare? Not so good.
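The counting above can be checked with a small Java sketch. Everything here is hypothetical (`RetryingCaller`, the `api` function that maps a token to an HTTP status, the fake `token-N` values); it only models the flow described in the text, with `MAX_RETRIES` set to the 3 we currently use.

```java
import java.util.function.ToIntFunction;

/** Sketch of the current flow: one normal token request, then a forced renewal per retry. */
final class RetryingCaller {
    static final int MAX_RETRIES = 3; // value quoted in the text

    int tokenRequests = 0;            // how many times the token endpoint was hit
    private String cachedToken;       // null models an empty cache

    /** CACHE-or-RENEWED when forceRenew is false, RENEWED when it is true. */
    private String requestToken(boolean forceRenew) {
        if (forceRenew || cachedToken == null) {
            tokenRequests++;
            cachedToken = "token-" + tokenRequests; // stand-in for the real endpoint response
        }
        return cachedToken;
    }

    /** Calls the API and, on each 401, force-renews and retries. Returns the last status. */
    int call(ToIntFunction<String> api) {
        int status = api.applyAsInt(requestToken(false));
        for (int retry = 1; retry <= MAX_RETRIES && status == 401; retry++) {
            status = api.applyAsInt(requestToken(true));
        }
        return status;
    }
}
```

If the API keeps answering 401, one method call from one user hits the token endpoint 4 times with an empty cache and 3 times with a warm one. In the happy case where the first renewal fixes things, it is just one extra token request.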

FedEx Quota for Token Renewal

“Burst threshold: 3 hits per second, continuously, during a span of 5 seconds... Once a public IP address violates [this limit], will then be penalized for 10 minutes, and all further requests during this 10-minute timeframe will receive a “403 Forbidden” status code”

What could go wrong?

These are rare edge cases, but they can happen if, say, our cache system is down. That is the worst-case scenario, and there is not much we can do about it: with no cache, every call requests a fresh token, which pretty much guarantees we will be banned for 10 minutes straight, or actually and very likely kinda forever... if at least one user triggers at least one FedEx-related call, the ban period will keep extending.
This forever-blocking ourselves can also happen if we stumble upon another edge case that would trigger the over-firing of requests (see section below).

Disaster case: how to handle?

Maybe have the application flip a breaker switch or something that imposes a total ban on any communication with FedEx API. The alternative, if we don’t want to forever block ourselves, is for our consultants to co-ordinate with 10+ clients with 50+ users to stop clicking on FedEx buttons, etc. which is not a good PR exercise for anyone...
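A rough idea of what such a breaker switch could look like, as a Java sketch. `FedExBreaker` and its methods are made-up names, and the cool-down length is an assumption (the 10-minute penalty from the quota text, possibly plus a margin). The caller would check `allowsCall()` before any FedEx request and call `trip()` on a 403.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical breaker: once tripped, blocks every outbound call until the cool-down has passed. */
final class FedExBreaker {
    private final Duration coolDown;
    private Instant openUntil = Instant.MIN; // never tripped yet

    FedExBreaker(Duration coolDown) {
        this.coolDown = coolDown;
    }

    /** Call this when the API answers 403 Forbidden (the penalty status in the quota text). */
    synchronized void trip(Instant now) {
        openUntil = now.plus(coolDown);
    }

    /** True if calls may go out; false means fail fast without touching the network. */
    synchronized boolean allowsCall(Instant now) {
        return !now.isBefore(openUntil);
    }
}
```

While the breaker is open we would show the users an error straight away instead of sending anything to FedEx, so our own traffic cannot keep the penalty going.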

Action Point #1 🏴󠁧󠁢󠁥󠁮󠁧󠁿 What do you think we should do?

Other Edge Cases?

Remember that there is an interval, however short, say 50ms or 100ms, between a token getting renewed and that token getting used. To be more specific: the timestamp at which our request carrying the newly issued token reaches the FedEx server MINUS the timestamp at which FedEx issued that token. This actually includes a server round trip.
So what could go wrong here? Will this round-trip interval complicate things? Will it contribute to users invalidating each other’s token in-between this round-trip interval?
What if 4-5 concurrent users grab a to-be-expired token from the cache (which can happen every hour), all reach the FedEx server at about the same time, all get 401 errors, and all trigger separate retry attempts? The first user who receives a 401 error requests a new token, then 2ms later the second user requests a new token, then 2ms later the third user does the same, and likewise the fourth and fifth. So by the time the first user sends his newly issued token in the request 15ms or 50ms or 200ms later, it has already been invalidated four times over, so he triggers another retry to renew the token, which then invalidates the second/third/fourth/fifth users’ tokens, and so each of them triggers their own wave of retries. Meanwhile the sixth, seventh and eighth users join in at some point during this bout of activity, retrieve whatever no-longer-valid token some user deposited in the cache, and each separately triggers their own series of requests, invalidating the first five users’ tokens and the other two late joiners’ tokens. If this goes on for a while, it will blow our quota. We’ll get blocked. Is this a valid concern or am I overthinking?
Yes, this can happen and becomes more likely with increased traffic and users and the number of API endpoints integrated, especially if retry uses constant backoff rather than exponential backoff. This can be solved by:
Reduce the number of retry requests which freshly renew a token every time, from 3 attempts to 2 attempts (or even 1 attempt)
We should throttle/rate-limit HTTP calls to the token renewal endpoint... we can also implement a sort of concurrency lock or a queue that limits competing concurrent users from outbidding or out-invalidating each other
This narrows the competition down from one between application users to one between our two instances: appprd11 and appprd12. It does not remove it, but it is an obvious improvement.
Note again that the 3 consecutive retry attempts all call the token endpoint to freshly renew it - per 1 method call per 1 user! So it has nearly exhausted the 3 calls/second quota already!
Does this mean we should space out the retry? By how much? 200ms? 300ms? Constant interval or exponential backoff? Exponential backoff sounds good and fancy, but the requests are time-sensitive, right? Should we not prefer to be greedy? Is backing off retried requests like 100ms -> 300ms -> 600ms better than 300ms -> 300ms -> 300ms or 200ms -> 200ms -> 200ms? Why wait 300ms or 600ms in the last two tries when we could reduce users’ wait time by retrying at 200ms instead? How do you balance out the call quota with the acceptable wait time for application users? What should take priority? Should we avoid spamming the token endpoint at all cost due to the risk of a 10-minute ban by making the users wait longer due to spacing out the retries more? But does spacing out the retries per each method call per each user compensate for the fact that we can have 500+ concurrent users at the same time?
Throttling calls/rate-limitation is a clear solution to this
But is spacing out the retry interval a good thing to begin with? Because this can drag out the requestToken() x4 calls for longer... which would increase the chance of us hitting the 5-consecutive-seconds (length) quota, especially with many users in the system. Or should we space out less, firing a burst of renewal requests disregarding frequency per second? I suppose we can go over the limit of 3-requests-per-second (frequency) as much as we like, as long as we don’t spam them for longer than 5 seconds. But how can we make sure it won’t last longer than 5 seconds with many users in the system? Is this a valid concern or am I overthinking?
Throttling calls/rate-limitation is a clear solution to this
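To put numbers on the backoff schedules floated above, here is a tiny Java helper. `BackoffSchedules` and both method names are made up for this sketch. It only adds up the delays themselves and ignores network time, so treat the totals as lower bounds on what a user would actually wait.

```java
import java.util.Arrays;

/** Arithmetic helpers for comparing retry schedules (hypothetical names). */
final class BackoffSchedules {
    /** Total time spent waiting if every retry in the schedule is needed (delays only). */
    static long totalWaitMillis(long... delaysMillis) {
        return Arrays.stream(delaysMillis).sum();
    }

    /** Delay before retry n (0-based) when doubling from a base: base, 2*base, 4*base, ... */
    static long exponentialDelayMillis(long baseMillis, int retryIndex) {
        return baseMillis << retryIndex;
    }
}
```

For the schedules in the text: 100ms -> 300ms -> 600ms adds up to 1000ms, 300ms x3 to 900ms, and 200ms x3 to 600ms. So whichever one we pick, a single failing call still fires its 3 renewals within roughly one second, which suggests that spacing alone does not do much for the per-second quota.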

Action Point #2 🏴󠁧󠁢󠁥󠁮󠁧󠁿 What do you guys think? Anything to help! This is driving me nuts...

Some Solutions/Ideas:

Currently, if the 3 retries are maxed out due to repeated 401 error responses, an error message that looks something like “Issue with authentication. Please try again later.” is shown to users. I guess we can add a more specific instruction, like “Please try again in 5 minutes” or whatever, to stop them from triggering a wave in succession. However, this will not guarantee that our users will follow the instruction.
From the brief initial talk with Markus, perhaps we can implement a throttling or a breaker algorithm that will stop the users or our own codebase from manually DDoS’ing 3rd party API:
Will be implemented for FedEx & UPS APIs
Salisa thinks the token retry attempts per method call should be decreased from 3-4 to 1-2. This should at least halve the risk of going over quota.
Because the “Please try again” message is shown, the users can manually retry the action
Wave points out that the requests are limited per IP address, not per credential or account or ClientID-ClientSecret pair: “Threshold mechanisms are based on a user’s public IP address. If a user sets up 10 virtual machine instances behind one public IP, then all requests from that IP address will count toward the threshold limit.”
This means we have a bigger problem. If 10 of our clients use FedEx API, then all of the application users from the 10 clients will share the 3 token request/second quota! This is quite ridiculous! The risk is now much higher than a ban per account.
We have 2 application instances on 2 server nodes ... there is no getting around this and no way to effectively “lock” or “synchronise” a resource, a process, a quota between the instances
The safest bet is to ration the quota, so from 3 reqs/s, we end up with 1.5 reqs/s per instance. We can narrow it down to 1 req/s in case infra/platform would like to spin up another server node in the future and we can’t really have a 1.5 req per second model anyway
Rate-limiting will be subject to this restriction as well. The holy number, at least for the punitive FedEx, will be 1 token request per second
Tanguy suggests something about concurrency. This may be the answer. I think some mutex stuff is implemented by Anna for authz (internal token) [TODO: is it? please confirm]. No such thing exists for authx (external token). This can at least limit the concurrent users invalidating each other’s tokens.
Current approach is the use of a singleton FedEx class that handles token caching/renewing. But do you know how things work under the hood? Threads, workers, clusters?
We should have, say, a FedExJWTAuthorization class whose getAccessToken() method is accessible by 1 concurrent user at a time.
But then you need to freshly renew token for each retry attempt anyway, so not sure if a lock will solve anything related to quota
If a “synchronized” method is used, however, it should not be used in any JWTAuthorization class... but in FedExServiceController.makeHTTPCall(), but this is quite bad for performance [TODO: think more on where to lock stuff]
We can use the “synchronized” keyword on the renewToken() method in the JWTAuthorization class, but not on retrieveFromCache(). We can then refactor the code so that retry attempts will try to retrieve the value from cache
renewToken() <----- lock here
retrieveFromCache() <-- if renewToken() is locked/ongoing, we retrieve from here
This can still fail. Don’t forget that we have 2 application instances and there is no feasible way to synchronise between them, so it is possible that two servers will invalidate each other’s tokens (but this is still somewhat manageable and not as risky... so lock solves some issue here)
The logic will get increasingly complicated here, because:
1. What if the retry attempt still fetches an old/invalid token from the cache and results in repeated failures for 20+ users? The alternative is each one of them freshly renewing separately and hoping for the best, which may result in a retry storm << trade-off
2. I don’t like the heuristic and overly complex implementation when we can simply rate-limit (?)
I’m not sure about implementing a concurrency lock to begin with, a queue may even make more sense, because 1 user can renew at a time, so the subsequent users will almost be guaranteed to fetch a valid token (if the other application instance doesn’t invalidate the token in-between the round trip)
Concurrency lock is also not enough... it DOES NOT directly solve the quota problem
Perhaps it solves it indirectly by reducing the rate of requests
Also performance? I hate bottlenecks on something as essential and basic as an authentication process
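One way the renewToken()/retrieveFromCache() split could look is sketched below in Java. It also pokes at the worry that every retry must renew anyway: the caller passes in the token that just failed, and we only hit the endpoint if the cache still holds that same token. `SingleFlightTokenStore`, the `failedToken` parameter and the `tokenEndpoint` supplier are assumptions for illustration, not existing code, and this does nothing about the second application instance.

```java
import java.util.function.Supplier;

/** Sketch: locked renewal that is skipped when somebody else has already renewed. */
final class SingleFlightTokenStore {
    private final Supplier<String> tokenEndpoint; // stand-in for the real token request
    private volatile String cached;               // null means nothing cached yet
    int endpointCalls = 0;                        // exposed only so the sketch can be inspected

    SingleFlightTokenStore(Supplier<String> tokenEndpoint) {
        this.tokenEndpoint = tokenEndpoint;
    }

    /** Unlocked read, as in retrieveFromCache(); may return null when nothing is cached. */
    String retrieveFromCache() {
        return cached;
    }

    /**
     * Renews only if the cache is empty or still holds the token that just failed.
     * A caller that lost the race gets the token another thread already renewed.
     */
    synchronized String renewToken(String failedToken) {
        if (cached == null || cached.equals(failedToken)) {
            cached = tokenEndpoint.get();
            endpointCalls++;
        }
        return cached;
    }
}
```

In the 5-concurrent-users scenario, all five get a 401 with the same old token and call renewToken() with it. The first one through the lock renews, and the other four find a different token in the cache and reuse it, so that wave costs 1 token request instead of 5 (within one instance).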

But again, the quota is PER IP ADDRESS, so let’s think on whether any of these concurrency-based improvement ideas will actually improve our situation.
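Since the quota is per IP, the rationing idea (3 reqs/s split across instances, rounded down) plus a per-instance limiter on the token endpoint could be sketched like this in Java. `RenewalRateLimiter`, `perInstanceQuota()` and `tryAcquire()` are hypothetical names. The limiter simply enforces a minimum gap between granted renewals, and the caller passes in the clock reading (e.g. from `System.nanoTime()`).

```java
/** Sketch of a per-instance limiter for token renewals (assumed names, minimum-interval approach). */
final class RenewalRateLimiter {
    private final long minIntervalNanos;
    private long lastGrantedNanos;
    private boolean hasGranted = false;

    /** Whole requests per second each instance may use when the per-IP quota is split evenly. */
    static int perInstanceQuota(int quotaPerSecond, int instances) {
        return quotaPerSecond / instances; // integer division: 3 / 2 = 1
    }

    RenewalRateLimiter(int permitsPerSecond) {
        if (permitsPerSecond < 1) {
            throw new IllegalArgumentException("permitsPerSecond must be at least 1");
        }
        this.minIntervalNanos = 1_000_000_000L / permitsPerSecond;
    }

    /** Returns true and reserves the slot if a renewal may be sent at nowNanos; false otherwise. */
    synchronized boolean tryAcquire(long nowNanos) {
        if (hasGranted && nowNanos - lastGrantedNanos < minIntervalNanos) {
            return false;
        }
        lastGrantedNanos = nowNanos;
        hasGranted = true;
        return true;
    }
}
```

With 2 instances (or 3) this gives 1 renewal per second per instance, so even if both instances renew at full speed we stay at 2 reqs/s against the 3 reqs/s threshold. A caller that gets `false` would have to wait, reuse whatever is in the cache, or fail fast; which of those is right is still the open question above.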
