archive/zip: compression performance #20031
Comments
Please provide some sample code for your Go implementation. Perhaps there is
an obvious improvement that can be suggested.
…On Wed, 19 Apr 2017, 07:49 bobjalex ***@***.***> wrote:
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
go version go1.8.1 windows/amd64
What operating system and processor architecture are you using (go env)?
set GOARCH=amd64
set GOBIN=
set GOEXE=.exe
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=C:\GoLib
set GORACE=
set GOROOT=C:\Go
set GOTOOLDIR=C:\Go\pkg\tool\windows_amd64
set GCCGO=gccgo
set CC=gcc
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0
-fdebug-prefix-map=C:\Users\Bob\AppData\Local\Temp\go-build786174103=/tmp/go-build
-gno-record-gcc-switches
set CXX=g++
set CGO_ENABLED=1
set PKG_CONFIG=pkg-config
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
What did you do?
Create a zip archive using archive/zip.
If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.
Not sure how to provide a program for this. Recipe is, in general, create
an archive using archive/zip and compare its run time with other
implementations. I provided information on my timing comparisons below.
What did you expect to see?
Timing in line with other existing implementations.
What did you see instead?
Writing the ZIP archive is several times slower than other ZIP
implementations.
At least for large archives. Not so noticeable with small archives, but
painfully slow for large ones.
Based on comparison of Go archive/zip with archive/tar and with the
included libraries of Python and Java distributions. For most operations,
comparisons are pretty close, but for writing a ZIP archive, Go is *way*
slower than the others.
Of course, the problem could be with my code. But the archive-writing part
is pretty simple and is based on the documentation examples. And, my
similar code that does the TGZ archiving performs OK.
Here is a table of the results of my experiments, followed by a profile of
the archived hierarchy.
ZIP (468.2 MB archive file size)
    Read all metadata
        Go      1ms
        Java    32ms
        Python  210ms
    Unpack all data
        Go      31s
        Java    43s
        Python  27s
    Pack all data
        Go      5m3s *!!!!*
        Java    38s
        Python  28s
TGZ (466.9 MB archive file size)
    (Java JDK does not have a tar module in its distribution, so a 3rd party
    package org.apache.commons.compress.archivers.tar is used.)
    Read all metadata
        Go      7.4s
        Java    4.9s
        Python  3.4s
    Unpack all data
        Go      29s
        Java    38s
        Python  34s
    Pack all data
        Go      23s
        Java    29s
        Python  34s
Profile of archived hierarchy:
    Directory count: 416
    File count:      2918
    Total size:      1.1G
    Average size:    361K
    Median size:     3898
    Maximum size:    53M
    Size distribution:
        0         : 19
        1..10     : 8
        10..100   : 168
        100..1000 : 639
        1000..10K : 1036
        10K..100K : 513
        100K..1M  : 96
        1M..10M   : 428
        10M..100M : 11
|
CC @dsnet |
It's really hard to optimize something without a performance test to optimize for. I'm inclined to blame compression, but the |
I learned how to use markdown, so here is part of my original submission, formatted. Didn't know that my cool, indented text would be flattened :)

What did you see instead?

Writing the ZIP archive is several times slower than other ZIP implementations.

Based on comparison of Go archive/zip with archive/tar and with the included libraries of Python and Java distributions. For most operations, comparisons are pretty close, but for writing a ZIP archive, Go is way slower than the others.

Of course, the problem could be with my code. But the archive-writing part is pretty simple and is based on the documentation examples. And, my similar code that does the TGZ archiving performs OK.

Here is a table of the results of my experiments, followed by a profile of the archived hierarchy.

ZIP (468.2 MB archive file size)
TGZ (466.9 MB archive file size)
Profile of archived hierarchy:
|
@bobjalex, Markdown is cool but code so we could get reproducible numbers would be even better. Also, do you have a CPU profile? Could you share your code & data? We can host it if you don't have a place. |
Furthermore, it's great that you have a distribution based on file sizes, but I'm willing to bet that a single file from that distribution is sufficient to demonstrate the performance slow-down. |
Here is the code snippet that archives a specified directory. I did try using buffered streams (bufio) in both directions, but it made no difference. It seems that the underlying code does a decent job of buffering. Suggestions???

BTW, I agree that this smells like a compression issue. It seems to be exacerbated by large archives/files. But maybe that's because it's just not as frustrating with small files :)

func (a *ArchiverZip) Write(zipPath, rootPath string) error {
	file, err := os.Create(zipPath)
	if err != nil {
		return err
	}
	defer file.Close()
	w := zip.NewWriter(file)
	filepath.Walk(rootPath,
		func(path string, info os.FileInfo, err error) error {
			if err != nil {
				fmt.Printf("Error, not stored: %s, %s\n", path, err)
				return nil
			}
			if info.IsDir() {
				return nil
			}
			relPath, err := filepath.Rel(rootPath, path)
			if err != nil {
				fmt.Printf(
					"Error creating file relative path, not stored: %s, %s\n",
					path, err)
				return nil
			}
			inFile, err := os.Open(path)
			if err != nil {
				fmt.Printf("Error opening file, not stored: %s, %s\n",
					path, err)
				return nil
			}
			defer inFile.Close()
			hdr, err := zip.FileInfoHeader(info)
			if err != nil {
				fmt.Printf("Error getting ZIP entry info, not stored: %s, %s\n",
					path, err)
				return nil
			}
			hdr.Name = filepath.ToSlash(relPath)
			hdr.Method = zip.Deflate
			// Mod time is offset such that local time ends up being stored
			// in the DOS-like time field, to be compatible with most
			// existing ZIP implementations.
			hdr.SetModTime(localAsUtc(info.ModTime()))
			//hdr.SetModTime(info.ModTime())
			//hdr.Flags |= 2 // set "max compression" bit (no effect in Go 1.6)
			f, err := w.CreateHeader(hdr)
			if err != nil {
				fmt.Printf("Error creating ZIP entry, not stored: %s, %s\n",
					path, err)
				return nil
			}
			_, err = io.Copy(f, inFile)
			if err != nil {
				fmt.Printf("Error writing file, not stored: %s, %s\n",
					path, err)
				return nil
			}
			err = inFile.Close()
			if err != nil {
				fmt.Printf("Error closing file,: %s, %s\n", path, err)
				return nil
			}
			return nil
		})
	// Close the archive.
	return w.Close()
}

func localAsUtc(local time.Time) time.Time {
	_, offset := local.Zone()
	return local.UTC().Add(time.Duration(offset) * time.Second)
}
|
I made a little test program from the code I sent: |
Experimenting with the little test program I posted, I learned something that changes the nature of the problem. The timings I sent were writing the archive to a USB3 drive (I failed to mention that). When writing to a real disk, the time is much better. BUT, the difference between disk and USB performance is unique to zip. While the Java and Python implementations took only a little bit longer writing to the USB drive, the Go implementation took a lot longer. The issue now is: why is the zip-archive-writing I/O so much slower when writing to a USB drive? It seems to be something specific to zip, since tgz does not show much difference for USB. New ZIP timings on disk and USB drive:
New TGZ timings on disk and USB drive:
|
Your USB3 stick's filesystem is probably mounted sync (default safe option), so every VFS write is one end-to-end write to flash. You want a bufio.Writer to do a few big writes to the OS. |
Looks like there's nothing to do here. Can we close this bug? |
OK with me. Thanks. |
According to golang/go#20031 (comment) the Go zlib is now on par with the C zlib implementation in terms of speed.