Performance Comparison: Remote Usage, NFS, S3-fuse, EBS

From CSWiki


Experiment Description

This experiment measures the task dispatch and execution performance of Swift on different hardware platforms. So far we have tested different types of Amazon instances as well as the tp-login server and a laptop. All test data are stored locally. We use a simple cats code for testing, with different sizes and numbers of input files (100 files of 10 MB each, 1000 files of 1 MB each, and 10000 files of 100 KB each). We measure performance with the time command: the "real" time gives the rate with I/O time, and the "user + sys" time gives the rate without I/O time. We run Swift in local mode; the input and output data are stored on the EC2 working node, and there is only one working node, the one running the Swift program.
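As a rough illustration of the two rates described above, the sketch below converts the three values printed by the time command into tasks per second. The function name and the sample numbers are hypothetical and are not measurements from this experiment.

```python
def dispatch_rates(n_tasks, real_s, user_s, sys_s):
    """Return (rate with I/O time, rate without I/O time) in tasks per second.

    real_s, user_s and sys_s are the three values printed by the time command.
    """
    rate_with_io = n_tasks / real_s               # wall-clock time includes I/O waits
    rate_without_io = n_tasks / (user_s + sys_s)  # CPU time only
    return rate_with_io, rate_without_io


# Hypothetical numbers for a 1000-task run (not measured data).
with_io, without_io = dispatch_rates(1000, real_s=250.0, user_s=80.0, sys_s=20.0)
print(f"with I/O: {with_io:.1f} tasks/s, without I/O: {without_io:.1f} tasks/s")
```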

This experiment tests the performance of Swift in different usage modes: data stored on the client side (running Swift remotely, the simplest and most typical usage), data stored on the EC2 NFS, on Amazon EBS, and on Amazon S3.

There are two test applications. One is a CPU-intensive application called OOPS; the other is an I/O-intensive application called mycat (the same code used in the "Task dispatch and execution rates on different machines/instances" experiment). For a 100-job run, OOPS has only about 50 MB of input and output data with 5-10 minutes of CPU time per job, compared to cats, which has 1 GB of input and output data with very little CPU usage per job.

We compare the total running time of the two applications for different EC2 cluster sizes. We also measure the average I/O rate for the client's computer, local EC2 storage, S3, and EBS. In addition, we watch the CPU usage of each node, especially the EC2 head node (the PBS head node, with Swift installed), which may give us some clues about the bottleneck of the system.

The goal of the experiment is to find the scenarios in which the system performs best and worst and, if possible, the bottleneck in each scenario. After enough experiments and evaluations, we may find a way to tune Swift specifically for the cloud and get optimal performance.


Experiment Methods

First, we created a PBS cluster with 100 working nodes (Standard Small) and 1 head node (Standard Small) on Amazon EC2 using Context-agent, then ran the two applications, OOPS (CPU-intensive) and CATS (I/O-intensive), in the following usage modes:

1. (Swift Remotely) Swift is installed on the client's computer. The client submits jobs to the EC2 cluster via Globus middleware (GRAM2, GridFTP). The input data is transferred from the client to the EC2 cluster via GridFTP, and the output data is transferred back after the job is done.

2. (Swift S3) Swift is installed on the PBS head node. Data is stored on Amazon S3 and mounted on the head node. Swift submits jobs directly to the PBS server (bypassing the Globus middleware). Data is pulled from S3 and written back after the job finishes.

3. (Swift Locally) Same as mode 2, except that data is stored on the EC2 PBS head node.

4. (Swift EBS) Same as mode 2, except that data is stored on Amazon EBS and mounted on the PBS head node.

Since Swift needs a shared file system to store its own intermediate data, NFS has been configured on the /home/ directory.

For each run (100 job submissions), OOPS has roughly 50 MB of input data and produces roughly the same amount of output data; CAT has 1 GB of input data and 1 GB of output data.


Hardware Platform

Amazon Standard Small
  • 1.7 GB memory
  • 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit)
  • 160 GB instance storage (150 GB plus 10 GB root partition)
  • 32-bit platform
  • I/O Performance: Moderate
  • Price: $0.10 per instance hour
Amazon High-CPU Medium
  • 1.7 GB of memory
  • 5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each)
  • 350 GB of instance storage
  • 32-bit platform
  • I/O Performance: Moderate
  • Price: $0.20 per instance hour
Teraport-Login Server
  • 4.0 GB memory
  • AMD Opteron(tm) Processor 248 2.2GHz
  • 3.6T instance storage (NFS)
  • I/O performance: Unknown
  • price: Unknown
Teraport-Workers Nodes
  • CPU Unknown
  • price: Unknown

Swift Environment

For Swift Remotely ( Swift installed on Client's Computer)
<config>
 <pool handle="ec2_basecluster">
   <gridftp url="gsiftp://ec2-75-101-170-6.compute-1.amazonaws.com" />
   <jobmanager universe="vanilla" url="ec2-75-101-170-6.compute-1.amazonaws.com/jobmanager-pbs" major="2" />
   <workdirectory >/home/torqueuser/swiftshare</workdirectory>
   <profile namespace="karajan" key="jobThrottle">100</profile>
 </pool>
</config>
For Other Usage Mode ( Swift installed on EC2 PBS Headnode)
<config>
 <pool handle="ec2_basecluster">
   <gridftp url="gsiftp://ec2-75-101-170-6.compute-1.amazonaws.com" />
   <jobmanager universe="vanilla" url="ec2-75-101-170-6.compute-1.amazonaws.com/jobmanager-pbs" major="2" />
   <workdirectory >/home/torqueuser/swiftshare</workdirectory>
   <profile namespace="karajan" key="jobThrottle">100</profile>
 </pool>
</config>
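A quick way to sanity-check a site file like the ones above is to parse it and print the settings that matter for these tests. The sketch below is a generic XML check written with the Python standard library, not a Swift tool; the helper name summarize_pools is hypothetical, and the embedded XML is a copy of the configuration shown above.

```python
import xml.etree.ElementTree as ET

SITES_XML = """
<config>
 <pool handle="ec2_basecluster">
   <gridftp url="gsiftp://ec2-75-101-170-6.compute-1.amazonaws.com" />
   <jobmanager universe="vanilla" url="ec2-75-101-170-6.compute-1.amazonaws.com/jobmanager-pbs" major="2" />
   <workdirectory>/home/torqueuser/swiftshare</workdirectory>
   <profile namespace="karajan" key="jobThrottle">100</profile>
 </pool>
</config>
"""


def summarize_pools(xml_text):
    """Return {pool handle: {"workdirectory": ..., "jobThrottle": ...}}."""
    root = ET.fromstring(xml_text.strip())
    summary = {}
    for pool in root.findall("pool"):
        throttle = None
        for profile in pool.findall("profile"):
            if profile.get("key") == "jobThrottle":
                throttle = profile.text.strip()
        summary[pool.get("handle")] = {
            "workdirectory": pool.findtext("workdirectory", default="").strip(),
            "jobThrottle": throttle,
        }
    return summary


print(summarize_pools(SITES_XML))
```

A file with a missing space between two attributes fails to parse, so this check also catches that kind of typo.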


Experiment Result

Image:Performance test htm m6ea7baf.jpg Image:Performance test htm 60089c8f.jpg

Average Transmission Rate:


CI Server (via GridFTP) -------29.84MByte/sec------------> EC2 NFS*

EC2 NFS* ----------27.28MByte/sec--------------------> CI Server (via GridFTP)


EC2 S3 --------13.5MB/s -----------------------> EC2 NFS*

EC2 NFS* ---------9.09MB/s --------------------> EC2 S3


EC2 EBS -------------57.18MB/s-----------------> EC2 NFS*

EC2 NFS* -------------57.18MB/s----------------->EC2 EBS


EC2 Locally ---------37.14MB/s------------------>EC2 NFS*

EC2 NFS*--------------38.25MB/s------------------>EC2 Locally


Speed Rank (to EC2 Shared File System):

EBS (57) > Local Access* (37) > CI Server (27) > S3 via fuse (11)


  • Local Access means using the temporary disk space that is mounted when an instance is created.
  • EC2 NFS is a Shared File System on EC2 Cluster used for Intermediate data storage by Swift.
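To make the ranking more concrete, the sketch below estimates how long it would take to stage 1 GB (the CAT input size) into the EC2 shared file system at each measured rate. It uses only the "to EC2 NFS" rates listed above; treating 1 GB as 1024 MB and ignoring per-file overhead are simplifying assumptions.

```python
# Measured average rates toward the EC2 shared file system (MB/s), from the list above.
RATES_TO_NFS = {
    "EBS": 57.18,
    "EC2 local disk": 37.14,
    "CI server (GridFTP)": 29.84,
    "S3 via fuse": 13.5,
}


def stage_in_seconds(size_mb, rate_mb_per_s):
    """Time to move size_mb at a constant rate, ignoring per-file overhead."""
    return size_mb / rate_mb_per_s


ranked = sorted(RATES_TO_NFS.items(), key=lambda kv: kv[1], reverse=True)
for name, rate in ranked:
    # 1 GB is taken as 1024 MB here; this is an assumption about the units.
    seconds = stage_in_seconds(1024, rate)
    print(f"{name:22s} {rate:6.2f} MB/s  ~{seconds:5.1f} s per 1 GB")
```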

Experiment LogFile

Some Observations During the Test

Issue One: CPU Resource Starvation on EC2 Cluster Headnode

When Swift submits jobs remotely (i.e. usage mode 1), the PBS head node consistently shows 100% CPU usage. When this happens, the job submission rate drops dramatically to 5-10 jobs per minute. As a result, some working nodes are always idle while Swift still has many jobs waiting to be submitted, which leaves the system underutilized. With Swift installed on the head node (usage modes 2, 3, and 4) this never happened: 100 jobs were submitted to the PBS queue in about 10-20 seconds, and the CPU usage on the PBS head node was less than 20%.

This issue happens with both the OOPS and CAT apps.

Potential Reason:

Since it happens with both the OOPS and CAT apps, and OOPS has only small data files, data transmission delay is probably not the cause. On the other hand, the head node shows 100% CPU usage, so it is very likely that the head node is too busy to handle the incoming jobs. The high CPU usage occurs only in remote mode, and only remote mode involves the Globus middleware, so we suspect that GRAM2 or GridFTP is starving the head node of CPU.

Solution:

We may use an Amazon High-CPU instance type for the head node instead of the default one (Standard Small). According to the Amazon web page, the High-CPU Extra Large instance provides 8 virtual cores with 2.5 EC2 Compute Units each (20 EC2 Compute Units in total), compared to the Small instance's 1 virtual core with 1 EC2 Compute Unit, which is 20x the compute capacity. This may solve the CPU resource starvation problem on the EC2 cluster head node.

Issue Two: For CAT apps, performance decreases as the number of working nodes increases

On a normal grid site, system performance should increase as the cluster grows (more working nodes), but for the CAT test code we get the opposite result: for one test run (100 jobs), the 100-node cluster performs the worst and the 10-node cluster performs the best.

Potential Reason

During the experiments, on a cluster with 100 working nodes, Swift submits all 100 jobs to the PBS queue almost instantly. Since enough working nodes are available, the PBS scheduler starts all of the submitted jobs at once, which creates many simultaneous file transfers and may cause traffic congestion inside the cloud. On a cluster with fewer nodes (e.g. 10 working nodes), the extra submitted jobs wait in the PBS queue, so there is less chance of traffic congestion.

(Correct me if I am wrong)

Solution

Amazon EC2 may not perform well on bursty I/O between nodes. To avoid this scenario, we can change the Swift throttle/score settings so that jobs are submitted more smoothly.
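The idea of a submission throttle can be illustrated with a small sketch that caps how many jobs run at the same time and records the peak concurrency actually reached. This is only an illustration of the concept using Python's standard thread pool; it is not how Swift implements its throttle, and the function name run_with_throttle is hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_with_throttle(n_jobs, throttle, job):
    """Run job(0..n_jobs-1) with at most `throttle` jobs in flight.

    Returns (list of results in job order, peak number of concurrent jobs).
    """
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def wrapped(i):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        try:
            return job(i)
        finally:
            with lock:
                state["active"] -= 1

    # The pool size is the throttle: extra jobs wait in the pool's queue.
    with ThreadPoolExecutor(max_workers=throttle) as pool:
        results = list(pool.map(wrapped, range(n_jobs)))
    return results, state["peak"]


results, peak = run_with_throttle(100, 10, lambda i: i)
print(f"{len(results)} jobs finished, at most {peak} running at once")
```

With a throttle of 10, the other 90 jobs wait their turn, which is the same smoothing effect the 10-node cluster produced through the PBS queue.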

Experiment Evaluation

The data transfer between EBS and EC2 NFS has the best I/O performance, while S3-fuse* has the worst. Comparing the prices of the two storage services, EBS costs $0.10 per allocated GB per month plus $0.10 per 1 million I/O requests, while S3 costs $0.150 per GB per month plus $10.00 per 1 million PUT, COPY, POST, or LIST requests.

According to the test results, EBS is much cheaper and provides better performance, so we recommend using EBS as permanent storage for user data.

  • S3 via fuse may have lower performance than using the Amazon S3 API tools to get/put data from/to S3 storage.

Performance Issues

Looking carefully at Figure 1 (OOPS Performance Test), we find some performance issues. Ideally, as the number of working nodes increases, the execution time should decrease proportionally, but the figure does not show that. When the number of nodes increases from 10 to 25 (2.5x), the execution time only drops by half. When the number of nodes increases from 50 to 100, there is relatively little change in execution time.

After rerunning the test and carefully analyzing the run-time status, we think we have found the cause: the long tail problem. The total execution time is measured from the submission of the first job until the last job finishes. As the run approaches its end, some nodes sit idle because there are not enough jobs left in the queue. The higher the ratio of nodes to total jobs, the stronger this effect. In our first experiment there were 100 jobs to run, so the long tail problem had a large effect on the "100 nodes" test, and the performance was not as good as we expected.

To check this assumption, we ran another OOPS performance test. In this test we varied the number of working nodes from 1 to 10, and for each test we ran 50 jobs, each taking 30-50 seconds on average. This setting reduces the effect of the long tail, so we hoped to get the result we expected. The CPU time of each job is shown in the Swift wrap log, so we can calculate the communication time using this formula:
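The long tail effect can be reproduced with a small scheduling simulation. The sketch below assigns each job to whichever node becomes free first and compares the resulting total time with the ideal (total work divided by node count). The job durations are synthetic values between 5 and 10 time units, chosen only to resemble the spread of OOPS job times; they are not measured data.

```python
import heapq


def makespan(job_times, n_nodes):
    """Greedy list scheduling: each job goes to the node that frees up first."""
    nodes = [0.0] * n_nodes  # time at which each node becomes free
    heapq.heapify(nodes)
    for t in job_times:
        free_at = heapq.heappop(nodes)
        heapq.heappush(nodes, free_at + t)
    return max(nodes)


# 100 synthetic jobs with durations cycling through 5..10 (an assumption).
jobs = [5 + (i % 6) for i in range(100)]

for n in (10, 25, 50, 100):
    actual = makespan(jobs, n)
    ideal = sum(jobs) / n
    print(f"{n:3d} nodes: actual {actual:6.1f}, ideal {ideal:6.1f}, "
          f"efficiency {ideal / actual:.0%}")
```

With 100 jobs on 100 nodes every node runs exactly one job, so the total time equals the longest single job no matter how short the others are, and the nodes that drew short jobs sit idle.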

Total_Running_Time = Computing Time + Communication Time

The Total_Running_Time (from the submission of the first job until the last job finishes) can be measured easily, and the per-job computing time can be found by analyzing the Swift wrap log. We calculate the computing time as:

Computing time = (C1 + C2 + C3 + ... + Cn) / number of working nodes

Cx: the computing time of the xth job. n: the total number of jobs to run.

By simple subtraction we can calculate the communication time, which includes Swift stage-in, stage-out, data copy, job scheduling, etc.
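The calculation can be sketched as follows, assuming the computing time is the summed per-job CPU time spread evenly over the working nodes. The function name and the sample numbers are hypothetical and are not taken from the experiment logs.

```python
def split_running_time(total_running_time, job_cpu_times, n_nodes):
    """Return (computing time, communication time) in seconds.

    Assumes the per-job CPU times are spread evenly over n_nodes.
    """
    computing_time = sum(job_cpu_times) / n_nodes
    communication_time = total_running_time - computing_time
    return computing_time, communication_time


# Hypothetical run: 50 jobs of 40 s CPU each on 5 nodes, 520 s wall-clock.
computing, communication = split_running_time(520.0, [40.0] * 50, 5)
print(f"computing {computing:.0f} s, communication {communication:.0f} s")
```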

Image:OOPSresult1.jpg

This time the result is close to what we expected: the execution time roughly halves when we double the number of nodes (e.g. from 1 to 2, from 2 to 4, from 4 to 8). There is still some long tail effect when the number of nodes is large (e.g. 8, 9, 10), but the result still falls within our acceptable range.


Image:Oopsresult2.jpg

We also evaluated the computing time and communication time of each run; the effective rate (compute time / total time) is between 60% and 75%. The communication time consists of stage-in, stage-out, job scheduling, and internal network transmission when working nodes access data through NFS. Michael mentioned that they are trying to eliminate the shared file system requirement for Swift by transferring data directly to each working node. If this feature is completed, we can expect a better efficiency rate in this test.

Performance & Cost Analysis (Comparison among CI Teraport, Amazon Small, Amazon Medium (High-CPU))

We used the same method to compare the performance of CI Teraport, the Amazon Small instance type, and the Amazon Medium (High-CPU) instance type. We did not test the Amazon Extra Large instance, since it requires a 64-bit machine, which is not compatible with our existing AMI image file. The results are shown below:

Image:Teraport smal medium.jpg


According to the results, the Amazon Medium instance has CPU power similar to Teraport, and the Amazon Small instance has roughly half the CPU speed of the other two.

As for cost, for the 50-job test the Amazon Medium type costs about 2162.22 secs * 11 nodes (10 worker nodes + 1 head node) * $0.20 per hour = USD 1.36, while the Amazon Small type costs about 4160.63 secs * 11 nodes * $0.10 per hour = USD 1.27.

As a result, the Amazon Small instance type is slightly (about 7%) more cost-efficient than the Medium type, while taking nearly twice as long. In a real scenario, the choice between Small and Medium will depend on the time constraints of the jobs.
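The cost formula above can be written as a short function. The sketch assumes pro-rated billing (run time converted to hours, with no rounding up to whole instance hours); the function name is hypothetical. With the run times quoted above it gives roughly USD 1.27 for Small and roughly USD 1.32 for Medium, so the 1.36 quoted for Medium may reflect rounding or a slightly different measured time.

```python
def run_cost_usd(run_seconds, n_instances, price_per_hour):
    """Pro-rated cost: instance-hours used times hourly price (no rounding up)."""
    return run_seconds / 3600.0 * n_instances * price_per_hour


# 10 worker nodes + 1 head node, times and prices as quoted in the text.
small = run_cost_usd(4160.63, 11, 0.10)
medium = run_cost_usd(2162.22, 11, 0.20)
print(f"Small:  USD {small:.2f}")
print(f"Medium: USD {medium:.2f}")
```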

Image:Test 1.jpg Image:Test 2.jpg

So I plotted two more graphs to show the "efficient rate" of our previous experiment.

The first graph shows the difference between the ideal execution time and the actual execution time for 1 to 10 nodes. The ideal time is calculated as follows: if the same 50 jobs take x secs on 1 node, the ideal time on N nodes is x / N.

The second graph shows the efficient rate, calculated by dividing the ideal time by the actual time. The result shows that the efficient rate ranges from 83% down to 74%, which is "good" in my opinion. --YiZhu 14:48, 13 August 2009 (CDT)
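The two quantities plotted in these graphs can be sketched as below, taking the efficient rate as ideal time divided by actual time so that it stays at or below 100%. The function names and the sample numbers are hypothetical and are not the measured values behind the graphs.

```python
def ideal_time(single_node_seconds, n_nodes):
    """Ideal time on n_nodes if the single-node time scaled perfectly."""
    return single_node_seconds / n_nodes


def efficient_rate(single_node_seconds, n_nodes, actual_seconds):
    """Ratio of ideal to actual execution time (1.0 means perfect scaling)."""
    return ideal_time(single_node_seconds, n_nodes) / actual_seconds


# Hypothetical: 2000 s on one node, 250 s measured on 10 nodes.
rate = efficient_rate(2000.0, 10, 250.0)
print(f"ideal {ideal_time(2000.0, 10):.0f} s, efficient rate {rate:.0%}")
```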