

An Initial Statistical Analysis of the Performance
of the UK National JANET Cache

Michael Sparks, George Neisser, Richard Hanby
Manchester Computing
University of Manchester
England, UK
Michael.Sparks@mcc.ac.uk, George.Neisser@mcc.ac.uk
http://wwwcache.ja.net/

Abstract

The UK National JANET Cache serves the UK academic & research community, which includes universities, colleges, research establishments & other organisations, and is funded by the UK Higher Education Funding Council. Its primary objectives are: a) to minimise request response time (or latency); and b) to minimise the cost of using expensive communications links. In other words, it aims to deliver an excellent service whilst ensuring maximum value for money.

In this paper we describe briefly the JANET Cache and the national caching infrastructure in which the JANET Cache plays a pivotal role. We then move on to the main theme, namely the presentation of an initial analysis of JANET Cache usage and performance. Because of the vast amount of logging information available, the bulk of this analysis has been restricted to a four week period in September and October 1998. It is our intention, once additional computing resources come online, to extend this analysis to cover the UK National JANET Cache Service from its inception on 1st August 1997 to the present day. This will be the subject of a future paper.

The initial analysis has yielded many interesting and useful results and conclusions which are summarised and discussed below. In particular it indicates how effective caching is from a user service perspective and the amount of bandwidth actually saved in a real production environment.

The UK National JANET Cache

The UK National JANET Cache is funded by the UK Higher Education Funding Council and is hosted by the Universities of Manchester and Loughborough. It supports all UK Universities, many research establishments, Colleges of Further Education and various other organisations connected to the UK Joint Academic Network (JANET). Over 150 organisations are using the JANET Cache with more joining every month (we expect this number to exceed 250 in the near future).

The UK JANET Cache comprises some 25 machines based on Intel, Sun and Irix platforms, supporting FreeBSD, Linux, Solaris and Irix operating systems. Typically each machine has 256MB RAM and around 20GB of disk space. The machines are 'standalone'; in other words, they do not share each other's filestores. Service resilience is provided by locating machines at both Manchester and Loughborough. The Squid [1] proxy cache software (version 1.NOVM.20) provides the caching functionality, recording a range of useful information associated with every request for subsequent analysis.

All organisations eligible to use the JANET Cache are asked to establish a local cache (usually Squid based) and to encourage users to point their Web browsers at their local cache (direct browser access to the Cache is discouraged). Cache managers are advised to configure their local caches to connect to the JANET Cache and use it in parent mode. This, in effect, creates a two-level national caching infrastructure with the JANET Cache at the top level and universities and other organisations at the second level.

This paper presents the results obtained from an initial analysis of the national caching operation. We state as a disclaimer that this work is by no means complete. It represents the first stage in what we intend to be a comprehensive analysis of usage patterns, user and service requirements, cost-effectiveness and the efficacy of caching on a national scale.

Introduction

This document details the results of a wide-ranging analysis of the performance of the JANET Cache, focusing on a 4 week period from September 15th to October 12th 1998. During this period the load on the JANET Cache was increasing daily due to the return of student populations to universities. The longer term statistics relating to the cost effectiveness of the service are based on historical data covering a period of about 9 months.

Key results are:

Data Sources & Terminology

Data Sources

Data has been culled from the following 2 sources:

The original log files.

The results from circa 9 months' worth of squidtimes [2] analyses (see footnote 1).

Where possible, the former has been used, since it contains the level of detail required to fully assess machine characteristics. The latter has mainly been used to collect data volume figures, since the rest of its calculated statistics consist largely of arithmetic means. As we shall see, whilst useful for "ballpark" figures, means are in general useless in a caching context, due to the high variation in times caused by cache hits and cache misses at times of high internet congestion.

Note we say where possible we use the original log files. Due to the size of such log files, the processing requirements for this analysis have stretched our stats processing machine to its complete limit. For this reason all analyses of the cluster as a whole are only valid over the 4 week period from 15th September 1998 through to 12th October 1998.

Terminology

We describe many graphs in terms of response time, pages, cache hits & cache misses. For the purposes of this document we define:

Page This represents a single request for any one item from the cache. In practice each page can be a gif, an html page, a postscript file, etc.

Response time This is culled directly from the squid logs. This is the period of time elapsed between the cache accepting the request and the point at which every byte of data has been accepted by the client. This can be considered the actual service time.

Cost Rate For the purposes of this document, this is calculated from the following formula:

\includegraphics[width=5.0in]{testing}

When working from the logs, this is the best estimate that can be made, since .com sites can be situated in the UK and .co.uk sites may be situated abroad, and so on.

Cache miss This means the cache did not contain the page requested, and was forced to request the page from the origin server.

Cache hit Squid logs several kinds of cache hit. For the purposes of this document, if the page requested was served from the cache store then this is a cache hit. Specifically in terms of squid logs, we define all the following as hits:

As far as we can ascertain, Squidtimes only takes into account hits of type 1, whereas in terms of cost effectiveness, all 4 of these have major relevance. Hits of type 3 are particularly useful since they represent an improvement in data availability: the data is available despite the remote host being uncontactable (within a reasonable time frame).

There is also a further kind of cache hit, logged in Squid 1.NOVM as a TCP_HIT and in Squid 2 as a TCP_MEM_HIT. This is a hit that has been satisfied by having the requested data object in memory. These hits are many orders of magnitude faster and show up on Class 2 graphs (see below) as dots all the way along the bottom of the graph up to and beyond the characteristic knee. If many more hits can be converted to TCP_MEM_HITs then response times of the cache would fall dramatically.
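As a minimal sketch of how such a classification might be applied, the Python below parses lines in Squid's native access.log layout and counts requests and bytes as hits or misses against a configurable set of result codes. The function names and the sample lines are hypothetical. The default set contains only TCP_HIT, the one code named in the text; the other kinds of hit counted in this document would need to be added to it.

```python
from collections import Counter

# Result codes treated as hits. Only TCP_HIT is named in the text; extend
# this set with whichever other hit codes the analysis should count.
DEFAULT_HIT_CODES = frozenset({"TCP_HIT"})


def parse_line(line):
    """Split one native-format access.log line into the fields used here."""
    fields = line.split()
    # Native layout: time elapsed client code/status bytes method URL ...
    result_code = fields[3].split("/")[0]
    return {
        "timestamp": float(fields[0]),
        "elapsed_ms": int(fields[1]),
        "code": result_code,
        "bytes": int(fields[4]),
    }


def count_hits(lines, hit_codes=DEFAULT_HIT_CODES):
    """Return a Counter of requests and bytes, split into hits and misses."""
    totals = Counter()
    for line in lines:
        entry = parse_line(line)
        kind = "hit" if entry["code"] in hit_codes else "miss"
        totals[kind + "_requests"] += 1
        totals[kind + "_bytes"] += entry["bytes"]
    return totals


# Made-up sample lines, for illustration only.
SAMPLE = [
    "906000000.123 45 10.0.0.1 TCP_HIT/200 2048 GET http://example.org/a.gif - NONE/- image/gif",
    "906000001.456 980 10.0.0.2 TCP_MISS/200 8192 GET http://example.org/b.html - DIRECT/example.org text/html",
]
totals = count_hits(SAMPLE)
```

Passing a larger `hit_codes` set changes only the classification, so the same log can be re-counted under a narrow or a broad definition of a hit.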

Standard statistical terminology used

All of the following are used with their normal definitions.

Mean The sum of all values divided by the number of values.

Median This is the middle number of the values in a sorted sequence.

Upper Quartile This is the median of the values greater than the median, i.e. 25% of all values are greater than this.

Lower Quartile This is the median of the values less than the median, i.e. 25% of all values are less than this.
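The definitions above can be written down directly. The following is a small sketch using Python's standard `statistics` module; the function name and the sample response times are hypothetical.

```python
from statistics import median


def quartiles(values):
    """Return (lower quartile, median, upper quartile) as defined above.

    The quartiles are the medians of the values strictly below and strictly
    above the overall median, so both of those groups must be non-empty.
    """
    ordered = sorted(values)
    mid = median(ordered)
    lower = [v for v in ordered if v < mid]
    upper = [v for v in ordered if v > mid]
    return median(lower), mid, median(upper)


# Hypothetical response times in milliseconds: mostly fast, one very slow.
times_ms = [40, 50, 60, 70, 80, 90, 100, 110, 9000]
lq, med, uq = quartiles(times_ms)
```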

Presentation of results & Mode of data extraction

Presentation of results

Most of the results of the analysis are available as graphs, a selection of which is presented in this document. In order to make trends easier to see, the graph of daily savings made by the JANET Cache only shows weekdays. There are 10 main classes of graphs available for every machine in the working set; these are broken down as follows:

All of the above are available for every machine, and additionally classes 3, 4 & 5 are available for the JANET Cache cluster as a whole. All the above graphs are scatter graphs, with the exception of classes 4 & 5, which are presented as line graphs to make trends easier to see.

Scatter graphs of hit rate vs bytes and hit rate vs pages served per hour were also produced; surprisingly, these do contain useful information, as discussed later.

All graphs are available via this report's auxiliary website. [3]

Mode of Data Extraction

Squidtimes

Data from squidtimes was extracted from the summaries we make available on a daily basis via our website. [4] Amongst the data these pages contain is the following information per machine in the working set for that day:

The mean hit rate over the entire day

Total number of pages shipped

Total amount of data shipped

Using the data from these pages, the long term hit rate was found by the following algorithm:

We calculate the number of bytes delivered as hits for each machine. We use the following formula for this: hit rate * total data shipped = hit bytes. This is a valid calculation for a good estimate as we will show in section 4.2.

We then sum all the hit bytes for all the machines on a particular day - this allows us to gain an overall hitrate for the JANET Cache on a particular day.

This operation is performed for every day since February that squidtimes analyses exist. By taking the sum of all hit bytes and total bytes delivered per day over all days, a long term hitrate has been calculated. This time period is over 9 months.
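The three steps above can be sketched as follows. This is an illustration only: the function name, the input layout and the figures are hypothetical, and the hit rate is assumed to be supplied as a fraction rather than a percentage.

```python
def long_term_hit_rate(daily_summaries):
    """Byte-weighted hit rate over all days and machines.

    daily_summaries maps a day label to a list of per-machine tuples
    (hit_rate, total_bytes), with hit_rate given as a fraction in [0, 1].
    """
    all_hit_bytes = 0.0
    all_total_bytes = 0.0
    for machines in daily_summaries.values():
        # Step 1: estimate hit bytes per machine as hit rate * data shipped.
        day_hit_bytes = sum(rate * total for rate, total in machines)
        day_total_bytes = sum(total for _, total in machines)
        # Steps 2 and 3: accumulate the daily sums across every day.
        all_hit_bytes += day_hit_bytes
        all_total_bytes += day_total_bytes
    return all_hit_bytes / all_total_bytes


# Hypothetical figures: two days, two machines each (hit rate, bytes shipped).
example = {
    "day 1": [(0.20, 4_000_000_000), (0.25, 6_000_000_000)],
    "day 2": [(0.30, 5_000_000_000), (0.10, 5_000_000_000)],
}
overall = long_term_hit_rate(example)
```

Because the sums are taken over bytes rather than over per-day percentages, busy days and busy machines carry proportionally more weight in the result.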

The savings presented were calculated from this data. As noted above, as far as we can see squidtimes only logs a TCP_HIT as a cache hit, which is only 1 kind of hit. As a result, these are minimum savings per day. Since the long term hit rate was calculated from this data, it is a minimum hitrate, and as a result the longer term extrapolations made about caching cost effectiveness must also be viewed as worst case savings.

Note some major dips in the daily savings graph are due to failures of the stats processing machine to process all the logs for all machines for some days.

Extraction from log files

For each machine comprising the JANET Cache working set, for every hour of every day in the time period 15/9/1998 - 12/10/1998, the following data was calculated:

Each of the following for hits, misses and all requests irrespective of being a hit or miss:

From this data the total cost bytes over the period were calculated, along with the total miss bytes. This allows the calculation of a cost rate as defined in 3.2.
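A per-hour breakdown of this kind, for hits, misses and all requests, might be sketched as below. The tuple layout, the function name and the choice of statistics (request count, bytes and median response time) are assumptions made for illustration; they are not the scripts used for this analysis.

```python
from collections import defaultdict
from statistics import median


def hourly_summary(entries):
    """Summarise log entries per hour for hits, misses and all requests.

    entries is an iterable of (unix_time, elapsed_ms, size_bytes, is_hit).
    Returns {hour_start: {kind: stats}} where kind is "hit", "miss" or "all"
    (only kinds seen in that hour are present) and stats holds the request
    count, total bytes and median response time.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for unix_time, elapsed_ms, size_bytes, is_hit in entries:
        hour_start = int(unix_time) - int(unix_time) % 3600
        for kind in ("hit" if is_hit else "miss", "all"):
            buckets[hour_start][kind].append((elapsed_ms, size_bytes))

    summary = {}
    for hour_start, kinds in buckets.items():
        summary[hour_start] = {
            kind: {
                "requests": len(rows),
                "bytes": sum(size for _, size in rows),
                "median_ms": median(ms for ms, _ in rows),
            }
            for kind, rows in kinds.items()
        }
    return summary


# Made-up entries: three requests in one hour, one in a later hour.
SAMPLE_ENTRIES = [
    (7200, 40, 1000, True),
    (7300, 900, 5000, False),
    (7400, 60, 2000, True),
    (10800, 500, 3000, False),
]
summary = hourly_summary(SAMPLE_ENTRIES)
```

Holding every response time for an hour in memory is what makes medians more expensive than means, which only need a running sum and a count.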

Also from this data, daily summary information for each machine and the cluster as a whole were created. These contain information summarised over the following 3 time periods:

The data summarised for each time period:

Cost Effectiveness of the JANET Cache

Current Savings of the JANET Cache

The direct savings being produced per day are presented below:

\includegraphics[width=5.0in]{daily_savings}

This is calculated by multiplying the hit bytes per day by the current rate of charging - £20.48 per gigabyte. In order to make the overall trend clearer, this graph only shows weekday savings. As can clearly be seen the JANET Cache is currently saving approx £600 per day. (See proviso in 3.3.2 about this being worst case saving)

Cost Effectiveness of the cache vs the off-peak period

Looking at the class 4 graph for the JANET Cache as a whole,

\includegraphics[width=5.0in]{cluster4}

we can clearly see that the number of requests satisfied by the cache rather than going to origin servers dwarfs the number of requests made during the off-peak period.

This means that despite the availability of the "free" period, most users do the majority of their browsing during the peak periods. As a result, the savings they make during the off-peak time are negligible with respect to savings the caching service represents, even if the cost of cache misses was split between the institutions using the cache service.

An assumption here is that individual site caches do not redirect traffic away from the JANET Cache during the off-peak period. Whilst the above graph represents pages served per day, we see that the same overall shape is followed if we replace pages by bytes:

\includegraphics[width=5.0in]{cluster5}

This is to be expected however since there is a very high correlation between number of pages served per day vs number of bytes served per day:

\includegraphics[width=5.0in]{cluster3}

This trend of the majority of traffic being during peak hours also clearly shows up if you plot number of requests per hour vs hour of day: (for a typical machine)

\includegraphics[width=5.0in]{ginger7}

This trend also shows up if you replace pages with bytes. (See web pages)

Cache Cost Effectiveness under varying load conditions

We looked at how the hit rate varies throughout the day, and also under varying load levels. If we examine hit rate vs hour of the day, for any particular machine we don't see any major trends:

\includegraphics[width=5.0in]{ginger9}

But if we examine number of pages served per hour vs hit rate:

\includegraphics[width=5.0in]{hit_rate}

Again, the graph in terms of bytes vs hit rate has identical characteristics. To save paper this is left available on the website.

We notice an interesting point: the minimum likely hit rate increases proportionally to the number of pages served per hour. Unfortunately we also notice a second characteristic: the maximum likely hit rate decreases from 100% fairly linearly with respect to number of pages served per hour. This however is largely down to the fact that as the number of requests per hour increases, the sample size increases.

What it does provide, though, is a technique that could be reapplied to new configurations: find the upper and lower limits, and from them approximate a possible long term hit rate based on only a relatively short period of time.

Experience with local site cache interworking shows that a 20% hit rate for a local machine can also be a 20% hit rate for a sibling cache (without traffic partitioning). This suggests that if this graph were redone for the entire cluster, once the cache has an efficient method of cache interworking, the point at which the hit-rate maxima & minima lines intersect would be raised. Further investigation is required here.
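The sample-size effect can be illustrated with a short sketch. It assumes, purely for illustration, that each request is an independent hit with a fixed probability, which real traffic will not satisfy exactly, and it uses the normal approximation to the binomial. The function name and the choice of two standard errors are ours.

```python
from math import sqrt


def hit_rate_band(true_rate, requests, z=2.0):
    """Approximate range of observed hourly hit rates for a fixed true rate.

    Uses the normal approximation to the binomial: the observed rate lies
    within about z standard errors of the true rate, clipped to [0, 1].
    """
    std_error = sqrt(true_rate * (1.0 - true_rate) / requests)
    low = max(0.0, true_rate - z * std_error)
    high = min(1.0, true_rate + z * std_error)
    return low, high


# Print the band for a 22% true hit rate at increasing hourly request counts.
for requests in (10, 100, 10_000, 1_000_000):
    low, high = hit_rate_band(0.22, requests)
    print(f"{requests:>9} requests/hour: {low:.3f} to {high:.3f}")
```

Under these assumptions the band narrows as the number of requests per hour grows, with the lower limit rising and the upper limit falling towards the underlying rate.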

First Approximation to Projected Savings of the JANET Cache over the coming year

To dampen the effect weekends have on JANET Cache traffic we work on the assumption that the traffic of 4 weekend days equals that of 1 working day. (This is an underestimate: looking at actual traffic figures, each weekend day has 1/3-1/2 the traffic of a weekday.)

This means we have 365/7*5 + ((365/7)*2)/4 = 286 working days' traffic in the year (rounding down).

By examining 9 months' worth of logs we find the JANET Cache has a reliable mean hit rate of 22%.

Our initial estimate of growth in the caching service was that the number of requests would triple over the course of the first year, from 3 million to 9 million requests per day. The service is currently servicing 16 million requests per day (see footnote 2), almost double that estimate. Also, 12 months ago the service was delivering approximately 10 GB per day. Currently the JANET Cache is supplying 120 GB per day as the norm, occasionally peaking at up to 130 GB in one day. This represents a growth of about 10 fold.

As a result, if we limit our projected savings to a 3 fold growth in traffic we are likely to be demonstrating the likely minimum saving.

Bandwidth use per day at year start: 100 GB
Bandwidth use per day at year end: 300 GB
If we assume a constant rate of growth over the year, we obtain the following total bandwidth consumed over the coming year:

((100+300)/2)*286 GB = 57,200 GB, i.e. 57.2 TB

Since the hit rate has remained reliable over the 9 month period it was calculated from, and the hit-rate vs pages per hour graph implies that this is likely to stay this way, we gain a bandwidth saving over the year of:

(57.2 TB * 22)/100 = 12.6 TB

This represents a financial saving of:

£257,000

We can also see the effect of increasing the hit rate by 1 percentage point:
(57.2 TB * 1)/100 = 0.572 TB

representing a saving of £11,700. When we factor in the Cost Rate as defined in 3.2, we get:

Total potential saving of the JANET cache over the entire year:
£257,720 * .9949 = £256,000

Potential saving of each percentage point over the entire year:
£11,700 * .9949 = £11,600
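The arithmetic of this projection can be reproduced in a few lines. The sketch below uses only figures quoted in this section; the variable names are ours, and the day count is truncated to match the 286 used above. The text rounds intermediate results, so its final figures differ slightly from the unrounded ones computed here.

```python
COST_PER_GB = 20.48   # pounds per gigabyte, the charging rate quoted above
HIT_RATE = 0.22       # long term mean hit rate
COST_RATE = 0.9949    # Cost Rate factor used in the text

weeks = 365 / 7
# Weekdays count fully; each weekend day counts as a quarter of a weekday.
# Truncated to a whole number of days, as in the text.
working_days = int(weeks * 5 + (weeks * 2) / 4)

start_gb, end_gb = 100, 300
# Constant growth over the year, so the average daily volume is the midpoint.
total_gb = (start_gb + end_gb) / 2 * working_days

saved_gb = total_gb * HIT_RATE
saving = saved_gb * COST_PER_GB
saving_per_point = total_gb * 0.01 * COST_PER_GB

adjusted_saving = saving * COST_RATE
adjusted_saving_per_point = saving_per_point * COST_RATE

print(f"working days: {working_days}, total: {total_gb:.0f} GB")
print(f"saving: {saving:.0f}, per point: {saving_per_point:.0f}")
print(f"adjusted: {adjusted_saving:.0f}, per point: {adjusted_saving_per_point:.0f}")
```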

Performance of the JANET Cache

Caching versus going direct

When the cache requests a page that is a miss, the origin server should take no longer to deliver it to the cache than it would to deliver it directly to a client of the JANET Cache. As a result, we use cache miss times as a stand-in for going direct and compare them with cache hit times. Taking again a typical machine and plotting hit & miss times vs page requests per hour, we obtain the following graph:

\includegraphics[width=5.0in]{ginger6}

The times shown are median response times.

As we can see, under light loading there is very little difference. As load increases, however, hit & miss times increase fairly linearly, with more scattering in miss times, up to a "knee" point at which hit response time collapses and miss times become completely unreliable. This knee point occurs under the heaviest loading of the cache, which typically happens during early to mid-afternoon. This corresponds to a time when the internet is generally congested, and hence you would expect miss times to become unreliable.

The user perception will be that the cache is being slow for 3/4 of pages and coping OK with 1/4. The fact, however, is that for 1/4 of pages, response times are being substantially improved during times of internet congestion.

Effect of the Cache on Overall response times

If we plot the median, upper & lower quartile response times for the caches vs pages requested per hour, we clearly see a graph very similar to that in section 5.1.

\includegraphics[width=5.0in]{ginger1}

One thing to notice is that even the upper quartile response time is generally good up to the knee point, at which point it collapses. The same is true, more so, for the median response times and, even more so, for the lower quartile response times.

This knee shows up to a greater or lesser extent dependent on how overworked the machine has been. For example a more overworked machine:

\includegraphics[width=5.0in]{crash1}

The overriding factor in the effective reliability is the cache hits. This dominates the lower quartile response time, and is a major factor in keeping the median response time to a respectable level.

If we examine the various cache machines and their specifications we notice that each machine has the same overall graph, each with this characteristic knee at some point or another, with newer machines being able to handle substantially higher loads than the older ones before reaching their knee.

We are currently investigating all possible characteristics of this curve, and currently one of the newest cache machines (dead.wwwcache.ja.net) is providing the clearest evidence yet that the faster the disks in the machine, the better the cache machine runs. (Intuition is useful, but proof is better) There is also possible evidence that FreeBSD runs better than Linux, though this is still to be fully investigated.

The effect of Loading on machine efficiency

If we plot pages served per hour vs bytes served per hour for individual machines we notice that, rather than following a curve, the overall trend is for the number of misses & number of hits to increase in a fairly linear fashion up to a certain load, and then begin to level off.

\includegraphics[width=5.0in]{ginger2}

Whilst this does seem at odds with the result in section 4.3, it should be noted that slight spreading at lower request levels will show up as a wide variation in the hit rate. As loading increases, the total spreading becomes less in percentage terms, hence the narrowing of the hit rate in section 4.3.

The Usefulness of Means in Performance analysis

Whilst the arithmetic mean is one of the most commonly cited useful averages, it turns out to be almost useless in the caching context. This is clearly demonstrated if we take the same graph as section 5.2 and instead of plotting the median we plot against the mean response time. The resulting graph:

\includegraphics[width=5.0in]{ginger0}

This does give us a vague indication that a knee exists, but it leaves us puzzled as to the cause of the scattering and makes us doubt the consistency of the cache in its ability to give a response within a certain time period for various loading levels.

As a result, means are useful only for producing a fast ballpark figure and are otherwise fairly useless as a basis for major decisions. The implications for the statistics currently generated on a daily basis by readily available (and highly regarded) scripts and programs are quite large: the majority of them produce means rather than medians, quartiles or percentiles, and their results must therefore be viewed with caution.
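A small, entirely hypothetical example shows why the mean scatters while the median does not. The two hours below are made-up response times that differ in a single request, one miss that stalled on a congested link.

```python
from statistics import mean, median

# Hypothetical response times (ms) for two hours under the same load.
# The second hour differs only in its last request, which stalled.
hour_a = [40, 45, 50, 55, 60, 800, 900, 1000, 1100]
hour_b = [40, 45, 50, 55, 60, 800, 900, 1000, 60000]

for label, times in (("hour A", hour_a), ("hour B", hour_b)):
    print(f"{label}: median {median(times)} ms, mean {mean(times):.0f} ms")
```

The median is the same for both hours, while the mean of the second hour is more than ten times that of the first. Plotted against load, points like these would line up on a median graph and scatter widely on a mean graph.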

The extra level of detail we see in plotting medians and quartiles clearly shows the advantages of analysing the data directly from the logs. Such statistics are however much more CPU intensive, and currently producing medians/quartiles for an entire day's worth of logs is beyond the memory capacity of the stats processor (see footnote 3). Also, since it took the stats processor 3 weeks to produce 4 weeks of summary files, it is doubtful it could process larger chunks in an acceptable time frame.

Increase in Traffic through the JANET Cache over the summer of 1998

The following 2 graphs speak for themselves:

\includegraphics[width=5.0in]{bandwidth}

\includegraphics[width=5.0in]{requests}

Closing Points

This report ends the first iteration on the production of comprehensive data on the efficiency and cost-effectiveness of the JANET Cache. Assuming a continuing trend of producing statistics, a breakdown in terms of which subnets/IPs are using the JANET Cache would be particularly useful in terms of balancing the load on the caches more evenly.

The rationale behind this is:

Each machine has an effective capacity before its knee, as demonstrated above. Now that we have the scripts in place, calculating this is relatively simple.

Each institution makes a certain number of requests. Scripts to calculate this should be fairly trivial to write. Using this, it should be possible to take the results of both, and automatically load balance based on this, improving user perception.
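One way such automatic balancing might work is a simple greedy assignment, sketched below. This is not an implemented part of the service: the function name, machine names, institution names and figures are all hypothetical, and a real scheme would also need to account for resilience and changing load.

```python
def assign_institutions(machine_capacity, institution_load):
    """Greedy assignment of institutions to cache machines.

    machine_capacity: {machine: requests/hour it can serve before its knee}
    institution_load: {institution: peak requests/hour it generates}
    Returns {machine: [institutions]}. The heaviest institutions are placed
    first, each on the machine whose utilisation would be lowest after
    taking it on.
    """
    assigned = {machine: [] for machine in machine_capacity}
    used = {machine: 0 for machine in machine_capacity}
    by_load = sorted(institution_load.items(), key=lambda item: item[1], reverse=True)
    for institution, load in by_load:
        target = min(
            machine_capacity,
            key=lambda m: (used[m] + load) / machine_capacity[m],
        )
        assigned[target].append(institution)
        used[target] += load
    return assigned


# Hypothetical capacities and loads, in requests per hour.
capacity = {"cache-a": 60_000, "cache-b": 30_000}
load = {"site-1": 30_000, "site-2": 20_000, "site-3": 10_000, "site-4": 10_000}
plan = assign_institutions(capacity, load)
```

Measuring utilisation as a fraction of each machine's capacity before its knee, rather than as a raw request count, lets newer machines take proportionally more of the load.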

If extra hardware was bought that increased the hit rate by just 2%, it would pay for itself more than twice over if it cost less than £12,000. Methods for increasing the hit rate are:

Extra disk capacity (to avoid LRU expulsion)

Level 7 routers, such as the ArrowPoint product [5] (if it proves to be a useful product) or the EddieWare front end system. [6] Squid 2 itself also looks like it may be able to supply the same functionality in the form of basic CARP support. At this time however Squid 2's CARP support is still basic, and not quite what we require. Since it would represent a very neat solution we are watching its development with interest.

Increased cache co-operation. If level 7 routing turns out not to be viable then the use of cache digests at the national level may be the only way forward. The key difference between CARP (and level 7 routing in general) & squid digests is that CARP doesn't send large digests round the network, and never performs a check before forwarding the request to the "correct" cache.
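The idea of routing each request to a "correct" cache without exchanging digests can be illustrated with highest-score (rendezvous) hashing, the general principle that CARP builds on. The sketch below is not the CARP hash function and ignores CARP's load factors; MD5 is an arbitrary choice of hash, and the cache names and URLs are hypothetical.

```python
import hashlib


def score(cache_name, url):
    """Combined hash of a cache name and a URL, as a large integer."""
    digest = hashlib.md5((cache_name + "|" + url).encode("utf-8")).hexdigest()
    return int(digest, 16)


def pick_cache(caches, url):
    """Return the cache with the highest score for this URL."""
    return max(caches, key=lambda name: score(name, url))


# Hypothetical cache names and URLs.
caches = ["cache-a", "cache-b", "cache-c"]
urls = [f"http://example.org/page{i}.html" for i in range(200)]

# Route every URL with all three caches, then again with cache-c removed.
before = {url: pick_cache(caches, url) for url in urls}
after = {url: pick_cache(caches[:-1], url) for url in urls}
moved = [url for url in urls if before[url] != after[url]]
```

Every front end computes the same answer for a given URL from the list of caches alone, so no state is exchanged. When a cache is removed, only the URLs that were assigned to it move elsewhere.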

Clearly if load does increase by three times over the coming year, we will need to increase hit-rates using 1 or more of these methods so this will be an ongoing goal in the coming year. If hit rates are substantially increased (say doubled...) then this will substantially affect the perceived reliability of the JANET Cache as well since it will result in increased stability in median and upper quartile response times.

Finally, it does seem that the JANET Cache is financially cost effective, with a likely saving over the coming year of £1/4 million. If any of the measures above result in an increase of the hit rate, for each percentage point we gain, we save roughly an extra £12,000.

Acknowledgments

This work is supported by JISC, the Joint Information Systems Committee of the UK Higher Education Funding Councils.

Bibliography

1
Squid Proxy Cache Software, http://squid.nlanr.net/

2
Squidtimes, http://www.cineca.it/~nico/squidtimes.html , A squid log analysis program that produces an HTML report with a wide variety of statistics and includes a summary at the top of the HTML page.

3
http://wwwcache.ja.net/Reports/Preliminary_analysis/
This URL has the rest of the graphs for all the other machines in the JANET Cache.

4
http://wwwcache.ja.net/Statistics/Squidtimes_summaries/
This links to pages that summarise the Squidtimes summaries.

5
Arrowpoint CS100 Switch
http://www.arrowpoint.com/
From our perspective this provides a load balancing solution at the switching level. Unlike Layer 3 switches, though, it can spoof TCP connections in order to route connections to appropriate servers based on the contents of the HTTP connection. Since this is done in a centralised fashion, more than one would be needed to avoid a single point of failure.

6
Eddieware
http://www.eddieware.org/
Designed to load balance a distributed web site without a single point of failure.


Footnotes

...squidtimes 1
Part of our daily statistics run includes generating this for all the machines. These are then summarised on the website. [4]
... day,2
As of Nov 1998
... processor3
As a result of this we have now procured newer hardware to enable a much longer term analysis in more depth



Michael Sparks, March 1999