
Disk and service constraint indicators in AzCopy

Ze Qian Zhang edited this page Apr 23, 2019 · 21 revisions

Here’s when AzCopy v10 will display the messages ‘Disk may be limiting speed’ or ‘Service may be limiting speed’. This description applies as of version 10.0.9, the release candidate. This wiki page covers the on-screen messages; see the final section below for how to get the same information out of the logs.

Disk may be limiting speed

This message never displays in the first 30 seconds of a job, since we want throughput to stabilize before we consider displaying it.

After the first 30 seconds of a job, it displays in the following cases:

  1. If data is moving over the network faster than it can be read from or written to disk. When uploading, we detect this by measuring the queue of file chunks that have been read from disk but not yet sent out over the network. If that queue is full, the disk is faster than the network, so no message is displayed; if the queue is empty, the network is faster than the disk, so we display the message. (In practice, the queue is virtually never half full; it tends to be either full or empty.) Similar logic applies in reverse for downloading. Suggested support response in this case:

    1. Explain to the customer that their speed appears to be limited by disk performance
    2. There are currently no options in AzCopy v10 to tune disk read behavior. (Some options may be added in a future release, but in most customer situations there’s not a lot we can do. The disk speed is what it is.)
    3. For downloads, if the customer wants to test how fast the tool can go without the disk constraint, they can download to /dev/null. In AzCopy v10, that works on both Linux and Windows, and it just throws the data away after it is received.
    4. For uploads, if the customer wants to test how fast the tool can go without the disk constraint, ask the product group for guidance. There is an undocumented command-line option that can be used, but it takes a little work to set up. (Which is why it’s not publicly documented for GA.)
  2. If scanning is so slow that we can’t feed new files to the rest of AzCopy quickly enough. This can happen if there are millions of small files. It might also happen when reading from an older SAN or NAS where enumerating directory contents is very slow. We detect it using the same queue-based method as #1. The difference here is that AzCopy’s on-screen status line will say “scanning”. You can also tell from the logs that scanning is still happening: while it is in progress, we log lines that look like this: scheduling JobID=, Part#=x, Transfer#=y, priority=0. Suggested support response in this case:

    1. Explain to the customer that the cause appears to be due to the time taken to scan the file system.
    2. Contact the product team, because if these cases are happening we need to know about it, so that we can decide whether to performance-tune the scanning. Maybe one day we will speed it up.
    3. If possible, supply the log file from the customer to the product group.
  3. (This last case only affects customers with unusually fast networks.) If the number of files in the job is less than 10, AND the user has included --put-md5 on the command line, AND the available network bandwidth is greater than about 3 Gbps, there’s currently a bug where AzCopy can say “Disk may be limiting speed” when really it’s our MD5 computations that are limiting the speed.

    1. Suggested debugging steps in this case
      • Does the customer have lots of other files that also need uploading? If possible, have them use larger upload jobs (i.e. more than 10 files per job). If they do that, this problem will almost always go away.
      • Also, if the customer would like to run a test without --put-md5, that will show them the true disk performance without the overhead of MD5 calculations. For some customers this will be just a test, because for production usage they need the MD5 hash. Others may not need the hash, and can run in this faster mode as a permanent solution to the problem.
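For the download-to-/dev/null test mentioned in case 1, a run might look like the sketch below. The account, container, and SAS token are placeholders, not real values, and the exact source URL syntax depends on the customer's scenario:

```shell
# Placeholder URL: substitute the customer's real container URL and SAS token.
# azcopy copy "https://myaccount.blob.core.windows.net/mycontainer/*?<SAS>" "/dev/null"

# /dev/null simply discards everything written to it, so such a run measures
# network and service throughput with no disk-write cost:
printf 'downloaded bytes' > /dev/null && echo "data discarded"
```

Because nothing is persisted, any speed difference versus a normal download isolates the disk's contribution.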

Service may be limiting speed

This message never displays in the first 30 seconds of a job, since we want throughput to stabilize before we consider displaying it.

After the first 30 seconds of a job, it displays in the following case:

  1. If the job includes at least one page blob and the Storage Service is limiting the transfer rate for that page blob. As of April 2019, all page blobs have per-blob throughput limits, and in many cases those limits are much stricter than the overall throughput limit on the storage account. If those limits affect the transfer of any file in the job, this message is displayed.
    1. Suggested support response
      • Explain the situation to the customer
      • If you want to see the decisions that AzCopy is making about speed, search the log file for “Page Blob”. This will show you two types of log lines: lines where the Service returned a 503 status (telling us to slow down), and lines that report the speed AzCopy has chosen as a result of those 503s. AzCopy chooses a separate speed for each page blob, based on the 503s for that particular blob. The speed will drop after a burst of 503s, climb back towards its last-known-best speed, linger there for a while, then probe upwards to see if it can go any faster.
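On Linux, that search is a one-line grep. The stand-in log file below is invented purely so the command can be run end to end; its line contents do not reflect AzCopy's real log format:

```shell
# Stand-in log file; the line contents here are made up for illustration
# and do NOT match AzCopy's actual log line format:
printf '%s\n' \
  'RESPONSE 503 from service for Page Blob (slow down)' \
  'INFO: some unrelated line' > sample.log

# The search itself is the real technique: pull out only the page-blob
# pacing lines (the pattern is case-sensitive):
grep "Page Blob" sample.log
```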

Note on use of log files, and brief tips on debugging other performance issues not covered above

When searching AzCopy 10.0.9 log files, it is advisable to use grep (on Linux) or Select-String (in PowerShell on Windows). For large log files, that’s much more practical than trying to open the whole file in a text editor. For example, in PowerShell here’s how to extract all the performance information from a log file into a separate file named “perfLines.txt”:

select-string .\858ebb30-4796-914f-632e-dc355cda0e1c.log -Pattern "PERF" | Out-File perfLines.txt
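On Linux, the equivalent extraction with grep looks like this. It is sketched against a one-line stand-in log so the commands can be run end to end; a real log has the job's GUID as its file name:

```shell
# Stand-in log containing one PERF line plus one non-PERF line:
printf '%s\n' \
  '2019/04/03 06:42:16 PERF: primary performance constraint is Unknown. States: R: 0, D: 289, W: 1046, F: 0, B: 12, T: 1347' \
  '2019/04/03 06:42:16 INFO: some other line' > sample.log

# Extract only the performance lines into a separate file:
grep "PERF" sample.log > perfLines.txt
```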

The performance log lines look something like this:

858ebb30-4796-914f-632e-dc355cda0e1c.log:94549:2019/04/03 06:42:16 PERF: primary performance constraint is Unknown. States: R: 0, D: 289, W: 1046, F: 0, B: 12, T: 1347

The section that says “primary performance constraint is” corresponds to the messages described above. It may say that the primary constraint is:

  • Disk (this is what gets logged when the screen says “Disk may be limiting speed”)
  • Service (this gets logged when the screen says “Service may be limiting speed”)
  • Unknown. This is logged in all other cases, i.e. when there is no limit message on screen. “Unknown” doesn’t mean that nothing is limiting throughput (there’s always something); it just means that AzCopy hasn’t figured out exactly what the constraint is. It may be:
    • CPU (to diagnose, check CPU usage on the AzCopy machine, using Task Manager or similar. Solution is to run on a machine with more CPUs, if possible).
    • Memory (to diagnose, check memory usage on the AzCopy machine, using Task Manager or similar. Solution is to ensure no other apps are using a lot of RAM on the same machine).
    • Specs of network card/interface
      • Check the specs of the network adapter (including both host and VM if virtualized). E.g. you won’t get more than 1 Gbps if you only have a 1 Gbps network card.
      • If an Azure VM, check the maximum documented network throughput for the VM size being used. E.g. see public docs such as: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-general
    • Configuration of network interface (downloads only). For downloads, sometimes it can help to configure the network adapter/interface to use larger buffers.
    • Provisioned network bandwidth (if run on-premises). To diagnose, discuss with the networking staff at the customer. Consider both the pipe out to the internet and the internal network. E.g. you can’t fill a 10 Gbps internet pipe if the machine is connected to a portion of the internal network that only supports 1 Gbps.
    • Available network bandwidth (this is provisioned bandwidth, minus bandwidth used by other traffic). This can be a tricky one to diagnose. The easiest way is probably to ask the networking staff at the customer whether they have records or telemetry of typical throughput when AzCopy is not running.
    • AzCopy concurrency value. In the PERF lines from the log, extracted as above, look at the number after “B” (which stands for HTTP Body). If uploading, use the “B” value on its own; if downloading, use “H” PLUS “B”. If that number is typically about the same in all the PERF lines, see if it is approximately equal to the AZCOPY_CONCURRENCY_VALUE. That value defaults to 32 when the machine has 4 or fewer CPUs, 300 when the machine has more than 18 CPUs, and 16 × number-of-CPUs in all other cases. If the number you find in the logs is consistently equal to the AZCOPY_CONCURRENCY_VALUE, you can try setting the AZCOPY_CONCURRENCY_VALUE environment variable to a higher number (e.g. 500).
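The default-concurrency rule can be sketched as follows (the function name is illustrative, not part of AzCopy):

```shell
# Default AZCOPY_CONCURRENCY_VALUE, per the rule described above:
# 4 or fewer CPUs -> 32; more than 18 CPUs -> 300; otherwise 16 * CPUs.
default_concurrency() {
  cpus=$1
  if [ "$cpus" -le 4 ]; then
    echo 32
  elif [ "$cpus" -gt 18 ]; then
    echo 300
  else
    echo $((16 * cpus))
  fi
}

default_concurrency 4    # -> 32
default_concurrency 8    # -> 128
default_concurrency 24   # -> 300

# To override the default, set the environment variable before the run:
# export AZCOPY_CONCURRENCY_VALUE=500
```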

Other

Supply the extracted PERF lines to the product team for analysis.
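Before sending the lines, you can eyeball the “B” values yourself. The sketch below runs against stand-in PERF lines that copy the format shown above; real ones come from the grep/Select-String extraction step:

```shell
# Stand-in extracted PERF lines, in the format shown earlier on this page:
printf '%s\n' \
  'PERF: primary performance constraint is Unknown. States: R: 0, D: 289, W: 1046, F: 0, B: 12, T: 1347' \
  'PERF: primary performance constraint is Unknown. States: R: 0, D: 290, W: 1040, F: 0, B: 32, T: 1362' > perfLines.txt

# Print just the B (HTTP body) counts, one per line, to see whether they
# sit consistently at the concurrency value:
grep -o 'B: [0-9]*' perfLines.txt | awk '{print $2}'
```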