Popular, but sluggish secure server? Popularity might not be the reason

As mentioned in another of my TLSProber articles, we have constantly been adding functionality to the prober's engine.

One of the preliminary steps before adding the renegotiation tests was to add support for testing SSL Session Resume, as that was needed to test some of the renegotiation corners I was planning to look into.

The Sessions feature has been part of the SSL/TLS protocol since SSL v2. It allows multiple connections to use the same negotiated secret key data to calculate encryption keys for each connection, instead of performing a full negotiation to determine them. This allows a secure connection to be established very quickly with no loss of security, since the two parties are just reusing data they exchanged securely during an earlier full handshake.
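To make the client side of this concrete, here is a minimal sketch using Python's `ssl` module; the function name is my own, not any site's API. Note that under TLS 1.3 the session ticket may only arrive after the handshake, so `tls.session` can be `None` until some data has been read; treat this as a sketch of the classic session-ID flow described above.

```python
import socket
import ssl

def connect_twice(host: str, port: int = 443) -> bool:
    """Connect to `host` twice, offering the first connection's TLS
    session on the second attempt.  Returns True if the second
    handshake was an abbreviated (resumed) one."""
    ctx = ssl.create_default_context()

    # First connection: a full handshake; the negotiated session is
    # exposed on the SSLSocket afterwards.
    with socket.create_connection((host, port)) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            session = tls.session          # an ssl.SSLSession object

    # Second connection: offer the cached session; if the server still
    # remembers it, the expensive private-key operation is skipped.
    with socket.create_connection((host, port)) as raw:
        with ctx.wrap_socket(raw, server_hostname=host,
                             session=session) as tls:
            return tls.session_reused      # True on an abbreviated handshake
```

Calling `connect_twice()` against a server that resumes sessions will typically return `True`; against the non-resuming servers discussed below it returns `False`.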

Since the server-side part of a full negotiation is very costly, due to the CPU power needed to decrypt the secret data from the client with the server's private RSA key (a cost that is being quadrupled by the ongoing transition to 2048-bit RSA keys), being able to reuse this key data across connections is very valuable. It frees the server to do other things, like serving content or handling more individual users. Some sites split this cost between an SSL front-end server and an ordinary HTTP back end.

The lifetime of a session varies, depending on the server's capacity to store such sessions securely. The general recommendation is up to 24 hours, but just keeping a session for a few hours can have a great effect on the efficiency of a server.

For a single server, sessions are easy to deploy, but it gets more challenging when many servers host the same secure site, since the information for each session has to be shared with all the other servers hosting the site. All of this is handled by the server's configuration.
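As an illustration of what such configuration looks like on a single server, this is roughly how a shared session cache is enabled in nginx (the directive values and paths are examples, not recommendations; a multi-server setup additionally needs the cache synchronized between hosts, or a session-aware load balancer in front):

```nginx
server {
    listen 443 ssl;
    server_name secure.example.com;

    ssl_certificate     /etc/nginx/certs/example.pem;
    ssl_certificate_key /etc/nginx/certs/example.key;

    # Cache negotiated sessions in a region shared by all worker
    # processes; roughly 4000 sessions fit per megabyte.
    ssl_session_cache   shared:SSL:10m;

    # Keep sessions resumable for a few hours, in line with the
    # lifetime recommendation above.
    ssl_session_timeout 4h;
}
```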

The TLSProber is not the only tool that has investigated how well session resumption is used. Ivan Ristic has also done so, and our numbers match his findings: about 90% of websites resume sessions and 10% do not, the latter evenly divided between sites that have disabled sessions completely and sites that simply do not resume them.

However, given the results of our recent investigation into the Renego patch status for popular sites (popular sites are less likely to patch), and in part prompted by a discussion in a W3C WebID Incubator session at the recent W3C Federated Social Web Workshop in Berlin, I decided to look at how popular sites fare when it comes to session resume support.

As you can see from the above graph, 29% (860 of 2957) of the secure servers we probed that are hosted by Top 100 sites do not resume sessions, compared to only 8.6% for the non-Alexa-rated sites.

Among the sites with servers in this category are:

  • Yahoo!
  • Live.com
  • Twitter's search and mobile servers
  • MSN
  • EBay

It is a bit difficult to tell exactly how the servers in question are used by the sites, but given the number of servers, I think it is likely that many are heavily used by customers.

What are the consequences of not having session resume enabled on an SSL/TLS server?

When a server always refuses to resume a session, this negatively affects the speed of a client that opens multiple connections to the server, because the client (at least in Opera) will delay the other connections it is establishing to the same server at the same time. The new connections are held back because the result of a full negotiation affects all of them, and they need to know the new session identifier before they can continue. If a server never resumes a session, webpages loaded from it will load more slowly than they could have, because new connections take longer to establish; even when they can be opened in parallel, setting them up takes longer.

If a client connects to a site only once on a given day (a single connection), session resume would not really improve server performance; in fact, it would just require more internal bookkeeping.

If the client establishes more than one connection to the site on a given day, particularly within a short period, like when loading a webpage with images, the situation changes, depending on what the server produces and how heavy its normal workload is compared to the cost of the SSL/TLS handshake.

If the normal cost of work for a connection (w) is 1/100 of the cost of the RSA handshake, and the server handles both negotiation and responses, I estimate that 10 connections from a single client would roughly increase the load on the server 9-fold, while 100 connections increase it 50-fold. If the difference in CPU usage between the two operations grows even larger, the extra load on the server increases further, approaching the number of connections established by the client within a time period.

On the other hand, if the normal cost of work on a connection is the same as for the RSA handshake, or higher (w>=1), then the load on the server at most doubles, no matter how many connections the client establishes, because the handshake is no longer the largest CPU cost. However, a website that fits this profile should probably examine its code for bottlenecks that reduce performance, or for better ways to perform the server's tasks.

If the SSL handshake is handled by a front end, the increased workload on the front end follows the number of connections per client: 10-fold for 10 connections, 100-fold if the client uses 100 connections within the period a session would be valid.

Given that, in many cases, the handshake is the most costly operation, not having session resume significantly increases the load on your servers, almost linearly in proportion to the average number of connections each client establishes with the server in a given time frame.

The upper load limit depends on the relative difference between the handshake cost and the other operations of the site. For normal operations (w at most 10% of the handshake cost, between 10 and 100 connections), my calculations indicate that this upper limit is probably in the range of 5 to 50 times the load of using session resume.

In other words, according to my estimates, if you disable session resume for a heavily trafficked, secure site it is very likely that you must install 5-50 times (or more, depending on site profile) the number of servers you would have needed if you had used session resume.

This quickly snowballs into a considerably higher need for electric power, maintenance personnel, server rooms, capital requirements, and so on, ultimately reducing the site's profitability and its ability to work on new, inventive features. And that is before we even consider the factor many people now want included: the impact on the environment. Unless I am mistaken, there is one big winner when you disable session resume: the server hardware vendor. (The electricity company and the landlord are probably delighted as well.)

SSL/TLS has always had a reputation for being very costly compared to not using encryption, although that has previously been debunked by Bob Lord, as well as by Adam Langley of Google. Still, given the above, I find myself wondering whether part of that reputation comes from not using all the possibilities for optimization that exist in the protocol.

The handshake operation is no longer prohibitively expensive, thanks to advances in CPU architectures, but it is very likely still the most expensive single operation in a single transaction with the client. It is also an operation that does not have to be performed frequently for each client; once every few (3-24) hours is often enough for most purposes.

So, if a popular and secure website seems to take a long time to load, the reason might not be (just) that it is popular; it might be that the site is not optimized correctly. It might be that the site administrator found it easier to throw money at the performance problem rather than investigating more closely to discover what the bottleneck really was, and whether the issue could have been solved without more hardware.

Appendix:

The estimate of extra work load when not using session resume is calculated as follows:

  • L : Load for all connections with no session resume, compared to using session resume
  • N : Number of connections per client
  • R : CPU cost of doing full handshake
  • C : CPU cost of the rest of the connection, excluding the full handshake (includes connection key calculation)
L(N,R,C) = N*(R+C)/(R+N*C)

If we eliminate the absolute CPU costs and set

w = C/R

(the relative work for a connection compared to the RSA operation) we get:

L(N,w) = N*(1+w)/(1+N*w)
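A quick numeric check of this formula (a Python sketch of my own) reproduces the estimates given in the article: roughly 9-fold for 10 connections at w = 0.01, roughly 50-fold for 100 connections, the lower end of the 5-50 range at w = 0.1, and at most a doubling once w >= 1, since the limit for large N is (1+w)/w:

```python
def load_factor(n: int, w: float) -> float:
    """L(N, w) = N*(1+w)/(1+N*w): server load without session resume,
    relative to the load when sessions are resumed."""
    return n * (1 + w) / (1 + n * w)

print(load_factor(10, 0.01))    # ~9.18:  ~9-fold for 10 connections
print(load_factor(100, 0.01))   # ~50.5:  ~50-fold for 100 connections
print(load_factor(10, 0.10))    # ~5.5:   lower end of the 5-50 range
print(load_factor(1000, 1.0))   # ~2.0:   with w >= 1 the load at most doubles
```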

The fraction of CPU power used for the handshake is as follows:

F(Session resume) = 1/(1+N*w)
F(no session resume) = 1/(1+w)

As an example, Adam Langley of Google has stated that their All-SSL sites, which support session resume, only consume 1% CPU for SSL handshakes.

Assuming that a user connects to the server 400 times or more in a session (not really a large number if a user is processing a lot of email in Gmail, for example), this could be equivalent to 50 or more documents loaded with full pipelining in Opera.

Based on this, I estimate that the per-connection cost of Google's services is 25% of a full handshake; if the number of connections per client increases to 1000, the per-connection cost drops to 10%.
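These figures can be back-solved from the fraction formula above by inverting F = 1/(1 + N*w); a small Python sketch of my own, using the numbers quoted above:

```python
def relative_work(f: float, n: int) -> float:
    """Invert F(session resume) = 1/(1 + N*w) to recover w, the
    per-connection cost relative to a full handshake, from a measured
    handshake CPU share f and N connections per client."""
    return (1 / f - 1) / n

print(relative_work(0.01, 400))   # ~0.2475: ~25% of a full handshake
print(relative_work(0.01, 1000))  # ~0.099:  ~10% per connection
```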

3 thoughts on “Popular, but sluggish secure server? Popularity might not be the reason”

  1. Very interesting, and very conclusive. However, assuming that the admins of eBay, MS, Yahoo etc. are no idiots (and I actually believe they are very clever people), I wonder why they disable this feature if it 1) helped their users AND 2) helped themselves save money (hardware, electrical power…) while increasing performance and lowering response time. So wouldn’t this be a win-win situation (without any cons)?

  2. When you have multiple servers, you either need to share all the session information between the servers (Apache does this using a database; IIS can’t), or have something in front of the servers (a load balancer) that routes the connections to the right server. If you are not aware of that need, you will not design your system that way. 24% of the Alexa top-100 servers send a (supposedly resumable) session ID but do not resume it, indicating missing synchronization between multiple servers, or intentional configuration. Another 5% (the same as overall) do not send a session ID at all, which has to be intentional configuration. A possibility, as I mention above, is that as load on a server increases, the administrators, taken in by the myth that “SSL is expensive” and perhaps unaware of the SSL Session feature, will not look closer, but just add more hardware (and perhaps relocate content to non-HTTPS servers to reduce load). Another possibility is that they think the profile of the site (e.g. users will only visit once a day) is such that storing the session would be unnecessary overhead.

  3. I see, thanks a lot for the detailed explanation! Load balancers and the issues caused by them might be a reason not to enable it, I understand. Well, still I hope that more content providers switch on this feature, for the better of both sides of the line 🙂

Comments are closed.