Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

weblist gives wrong flat count #720

Open
prattmic opened this issue Aug 5, 2022 · 4 comments
Open

weblist gives wrong flat count #720

prattmic opened this issue Aug 5, 2022 · 4 comments
Labels
Priority: p2 Buganizer priority - P2 type: bug Buganizer type - Bug

Comments

@prattmic
Copy link
Member

prattmic commented Aug 5, 2022

Initially filed at golang/go#52000 by @fumin. Moving here as this seems to be an issue in pprof.

What version of Go are you using (go version)?

$ go version
go version go1.17.6 windows/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\a3367\AppData\Local\go-build
set GOENV=C:\Users\a3367\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\Users\a3367\go\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=C:\Users\a3367\go
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=C:\Program Files\Go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\Program Files\Go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.17.6
set GCCGO=gccgo
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=NUL
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\a3367\AppData\Local\Temp\go-build2334633044=/tmp/go-build -gno-record-gcc-switches
gdb --version: GNU gdb (GDB) 8.1

What did you do?

I profiled the standard library's png.Encode function and looked at this attached profile:

png_encode.zip

I then ran both weblist and list on the png.filter function.

What did you expect to see?

I expect both weblist and list to show the same result.

What did you see instead?

I saw list showing the correct result, with the sum of the flat time of png.abs8 in png.filter tallying with the total time spent on png.abs8.
In particular, the below log shows that png.abs8 took 1.56 seconds in total.

PS C:\Users\a3367\work\misc\seg\output> go tool pprof .\png_encode.pprof
Type: cpu
Time: Mar 29, 2022 at 1:28pm (CST)
Duration: 31.43s, Total samples = 17.53s (55.77%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 16.92s, 96.52% of 17.53s total
Dropped 91 nodes (cum <= 0.09s)
Showing top 10 nodes out of 29
      flat  flat%   sum%        cum   cum%
     7.96s 45.41% 45.41%     13.07s 74.56%  image/png.filter
        2s 11.41% 56.82%      2.01s 11.47%  image/png.abs
     1.74s  9.93% 66.74%      1.74s  9.93%  runtime.memmove
     1.56s  8.90% 75.64%      1.56s  8.90%  image/png.abs8
     1.44s  8.21% 83.86%      3.46s 19.74%  image/png.paeth
     0.90s  5.13% 88.99%      0.90s  5.13%  hash/adler32.update
     0.80s  4.56% 93.55%      0.81s  4.62%  compress/flate.(*deflateFast).matchLen
     0.29s  1.65% 95.21%      0.31s  1.77%  compress/flate.(*huffmanEncoder).bitCounts
     0.12s  0.68% 95.89%      0.12s  0.68%  runtime.asyncPreempt
     0.11s  0.63% 96.52%      1.04s  5.93%  compress/flate.(*deflateFast).encode
(pprof)

Adding the time spent on png.abs8 in png.filter also shows the same 1.56 seconds.

(pprof) list png.filter
Total: 17.53s
ROUTINE ======================== image/png.filter in C:\Program Files\Go\src\image\png\writer.go
     7.96s     13.07s (flat, cum) 74.56% of Total
         .          .    202:   return len(b), nil
         .          .    203:}
         .          .    204:
         .          .    205:// Chooses the filter to use for encoding the current row, and applies it.
         .          .    206:// The return value is the index of the filter and also of the row in cr that has had it applied.
      10ms       10ms    207:func filter(cr *[nFilter][]byte, pr []byte, bpp int) int {
         .          .    208:   // We try all five filter types, and pick the one that minimizes the sum of absolute differences.
         .          .    209:   // This is the same heuristic that libpng uses, although the filters are attempted in order of
         .          .    210:   // estimated most likely to be minimal (ftUp, ftPaeth, ftNone, ftSub, ftAverage), rather than
         .          .    211:   // in their enumeration order (ftNone, ftSub, ftUp, ftAverage, ftPaeth).
         .          .    212:   cdat0 := cr[0][1:]
     110ms      110ms    213:   cdat1 := cr[1][1:]
         .          .    214:   cdat2 := cr[2][1:]
         .          .    215:   cdat3 := cr[3][1:]
         .          .    216:   cdat4 := cr[4][1:]
         .          .    217:   pdat := pr[1:]
         .          .    218:   n := len(cdat0)
         .          .    219:
         .          .    220:   // The up filter.
         .          .    221:   sum := 0
     310ms      310ms    222:   for i := 0; i < n; i++ {
     1.02s      1.03s    223:           cdat2[i] = cdat0[i] - pdat[i]
     230ms      940ms    224:           sum += abs8(cdat2[i])
         .          .    225:   }
         .          .    226:   best := sum
         .          .    227:   filter := ftUp
         .          .    228:
         .          .    229:   // The Paeth filter.
         .          .    230:   sum = 0
         .          .    231:   for i := 0; i < bpp; i++ {
         .          .    232:           cdat4[i] = cdat0[i] - pdat[i]
         .          .    233:           sum += abs8(cdat4[i])
         .          .    234:   }
     790ms      790ms    235:   for i := bpp; i < n; i++ {
     3.49s      6.99s    236:           cdat4[i] = cdat0[i] - paeth(cdat0[i-bpp], pdat[i], pdat[i-bpp])
     240ms      760ms    237:           sum += abs8(cdat4[i])
     180ms      180ms    238:           if sum >= best {
         .          .    239:                   break
         .          .    240:           }
         .          .    241:   }
      40ms       40ms    242:   if sum < best {
         .          .    243:           best = sum
         .          .    244:           filter = ftPaeth
         .          .    245:   }
         .          .    246:
         .          .    247:   // The none filter.
         .          .    248:   sum = 0
         .          .    249:   for i := 0; i < n; i++ {
         .          .    250:           sum += abs8(cdat0[i])
      10ms       10ms    251:           if sum >= best {
         .          .    252:                   break
         .          .    253:           }
         .          .    254:   }
         .          .    255:   if sum < best {
         .          .    256:           best = sum
         .          .    257:           filter = ftNone
         .          .    258:   }
         .          .    259:
         .          .    260:   // The sub filter.
         .          .    261:   sum = 0
         .          .    262:   for i := 0; i < bpp; i++ {
         .          .    263:           cdat1[i] = cdat0[i]
         .          .    264:           sum += abs8(cdat1[i])
         .          .    265:   }
     190ms      210ms    266:   for i := bpp; i < n; i++ {
     990ms         1s    267:           cdat1[i] = cdat0[i] - cdat0[i-bpp]
      80ms      390ms    268:           sum += abs8(cdat1[i])
     150ms      150ms    269:           if sum >= best {
         .          .    270:                   break
         .          .    271:           }
         .          .    272:   }
         .          .    273:   if sum < best {
         .          .    274:           best = sum
         .          .    275:           filter = ftSub
         .          .    276:   }
         .          .    277:
         .          .    278:   // The average filter.
         .          .    279:   sum = 0
         .          .    280:   for i := 0; i < bpp; i++ {
      70ms       80ms    281:           cdat3[i] = cdat0[i] - pdat[i]/2
         .          .    282:           sum += abs8(cdat3[i])
         .          .    283:   }
         .          .    284:   for i := bpp; i < n; i++ {
      30ms       30ms    285:           cdat3[i] = cdat0[i] - uint8((int(cdat0[i-bpp])+int(pdat[i]))/2)
      10ms       30ms    286:           sum += abs8(cdat3[i])
      10ms       10ms    287:           if sum >= best {
         .          .    288:                   break
         .          .    289:           }
         .          .    290:   }
         .          .    291:   if sum < best {
         .          .    292:           filter = ftAverage
(pprof)

However, weblist shows the wrong result, it seems to exagerate the result.
Pprof listing.zip
For example, on line 268, list shows only 80ms spent, whereas weblist shows 360ms spent.

@prattmic
Copy link
Member Author

prattmic commented Aug 5, 2022

Plain text of the HTML from Pprof listing.zip to see L268 referred above without opening the file:

image/png.filter
C:\Program Files\Go\src\image\png\writer.go

  Total:       9.41s     13.07s (flat, cum) 74.56%
    202            .          .           	return len(b), nil 
    203            .          .           } 
    204            .          .            
    205            .          .           // Chooses the filter to use for encoding the current row, and applies it. 
    206            .          .           // The return value is the index of the filter and also of the row in cr that has had it applied. 
    207         10ms       10ms           func filter(cr *[nFilter][]byte, pr []byte, bpp int) int { 
    208            .          .           	// We try all five filter types, and pick the one that minimizes the sum of absolute differences. 
    209            .          .           	// This is the same heuristic that libpng uses, although the filters are attempted in order of 
    210            .          .           	// estimated most likely to be minimal (ftUp, ftPaeth, ftNone, ftSub, ftAverage), rather than 
    211            .          .           	// in their enumeration order (ftNone, ftSub, ftUp, ftAverage, ftPaeth). 
    212            .          .           	cdat0 := cr[0][1:] 
    213        110ms      110ms           	cdat1 := cr[1][1:] 
    214            .          .           	cdat2 := cr[2][1:] 
    215            .          .           	cdat3 := cr[3][1:] 
    216            .          .           	cdat4 := cr[4][1:] 
    217            .          .           	pdat := pr[1:] 
    218            .          .           	n := len(cdat0) 
    219            .          .            
    220            .          .           	// The up filter. 
    221            .          .           	sum := 0 
    222        310ms      310ms           	for i := 0; i < n; i++ { 
    223        1.02s      1.03s           		cdat2[i] = cdat0[i] - pdat[i] 
    224        870ms      940ms           		sum += abs8(cdat2[i]) 
    225            .          .           	} 
    226            .          .           	best := sum 
    227            .          .           	filter := ftUp 
    228            .          .            
    229            .          .           	// The Paeth filter. 
    230            .          .           	sum = 0 
    231            .          .           	for i := 0; i < bpp; i++ { 
    232            .          .           		cdat4[i] = cdat0[i] - pdat[i] 
    233            .          .           		sum += abs8(cdat4[i]) 
    234            .          .           	} 
    235        790ms      790ms           	for i := bpp; i < n; i++ { 
    236        3.49s      6.99s           		cdat4[i] = cdat0[i] - paeth(cdat0[i-bpp], pdat[i], pdat[i-bpp]) 
    237        750ms      760ms           		sum += abs8(cdat4[i]) 
    238        180ms      180ms           		if sum >= best { 
    239            .          .           			break 
    240            .          .           		} 
    241            .          .           	} 
    242         40ms       40ms           	if sum < best { 
    243            .          .           		best = sum 
    244            .          .           		filter = ftPaeth 
    245            .          .           	} 
    246            .          .            
    247            .          .           	// The none filter. 
    248            .          .           	sum = 0 
    249            .          .           	for i := 0; i < n; i++ { 
    250            .          .           		sum += abs8(cdat0[i]) 
    251         10ms       10ms           		if sum >= best { 
    252            .          .           			break 
    253            .          .           		} 
    254            .          .           	} 
    255            .          .           	if sum < best { 
    256            .          .           		best = sum 
    257            .          .           		filter = ftNone 
    258            .          .           	} 
    259            .          .            
    260            .          .           	// The sub filter. 
    261            .          .           	sum = 0 
    262            .          .           	for i := 0; i < bpp; i++ { 
    263            .          .           		cdat1[i] = cdat0[i] 
    264            .          .           		sum += abs8(cdat1[i]) 
    265            .          .           	} 
    266        190ms      210ms           	for i := bpp; i < n; i++ { 
    267        990ms         1s           		cdat1[i] = cdat0[i] - cdat0[i-bpp] 
    268        360ms      390ms           		sum += abs8(cdat1[i]) 
    269        150ms      150ms           		if sum >= best { 
    270            .          .           			break 
    271            .          .           		} 
    272            .          .           	} 
    273            .          .           	if sum < best { 
    274            .          .           		best = sum 
    275            .          .           		filter = ftSub 
    276            .          .           	} 
    277            .          .            
    278            .          .           	// The average filter. 
    279            .          .           	sum = 0 
    280            .          .           	for i := 0; i < bpp; i++ { 
    281         70ms       80ms           		cdat3[i] = cdat0[i] - pdat[i]/2 
    282            .          .           		sum += abs8(cdat3[i]) 
    283            .          .           	} 
    284            .          .           	for i := bpp; i < n; i++ { 
    285         30ms       30ms           		cdat3[i] = cdat0[i] - uint8((int(cdat0[i-bpp])+int(pdat[i]))/2) 
    286         30ms       30ms           		sum += abs8(cdat3[i]) 
    287         10ms       10ms           		if sum >= best { 
    288            .          .           			break 
    289            .          .           		} 
    290            .          .           	} 
    291            .          .           	if sum < best { 
    292            .          .           		filter = ftAverage 

@prattmic
Copy link
Member Author

prattmic commented Aug 5, 2022

I notice that there are separate implementations of computing flat and cum counts in list vs weblist.

If I understand correctly, the values in list come from https://github.com/google/pprof/blob/main/internal/report/source.go#L925, and weblist from https://github.com/google/pprof/blob/main/internal/report/source.go#L633. So differences/bugs between the two seems reasonable.

I wonder if abs8 is inlined and the function they are getting assigned to differs between the two implementations? @fumin could you check if abs8 is inlined into filter?

@fumin
Copy link

fumin commented Aug 5, 2022

@prattmic Using the following commands to reproduce the problem, and check if abs8 is inlined or not:

# Enter the standard library's png package folder.
pwd
# C:\Program Files\Go\src\image\png

# Generate a profile containing this bug
go test -v -bench . -cpuprofile="C:\Users\a3367\Desktop\tmp\cpu.prof" -o="C:\Users\a3367\Desktop\tmp\png.test.exe"

go tool pprof "C:\Users\a3367\Desktop\tmp\cpu.prof"
# In the pprof console, compare the following commands to see the bug:
# list png.filter$
# weblist png.filter$

# Dump the assembly to check if abs8 is inlined or not?
go tool objdump -s "png.filter$" "C:\Users\a3367\Desktop\tmp\png.test.exe"

I get the following assembly:

writer.go:223         0x592eef                4839c8                  CMPQ CX, AX                             
  writer.go:223         0x592ef2                0f83e2040000            JAE 0x5933da                            
  writer.go:223         0x592ef8                440fb60c38              MOVZX 0(AX)(DI*1), R9                   
  writer.go:223         0x592efd                4529cd                  SUBL R9, R13                            
  writer.go:223         0x592f00                4c39d8                  CMPQ R11, AX                            
  writer.go:223         0x592f03                0f83c9040000            JAE 0x5933d2                            
  writer.go:223         0x592f09                45882c04                MOVB R13, 0(R12)(AX*1)                  
  writer.go:90          0x592f0d                4180fd80                CMPL $0x80, R13                         
  writer.go:90          0x592f11                7306                    JAE 0x592f19                            
  writer.go:91          0x592f13                450fb6cd                MOVZX R13, R9                           
  writer.go:224         0x592f17                ebc1                    JMP 0x592eda                            
  writer.go:93          0x592f19                450fb6cd                MOVZX R13, R9                           
  writer.go:93          0x592f1d                4981c100ffffff          ADDQ $-0x100, R9                        
  writer.go:93          0x592f24                49f7d9                  NEGQ R9                                 
  writer.go:224         0x592f27                ebb1                    JMP 0x592eda

Given the absense of the CALL instruction in the above assembly,
I guess this implies abs8 is indeed inlined?

@raidancampbell
Copy link
Contributor

I came across this issue a couple of days ago and I think I've got an idea about what's going on here. There's a setup description and minimal replication here.

A discussion back on #628 points to a section of code that accumulates the samples for a given address. I suspect, but have not been able to fully verify, that this accumulation for weblist commands is duplicated per unique callstack if the binary is disassambled.


I've tried to fix it, but have been running into issues where a homebuilt pprof where fail the check here, as textSegment.Addr is 0x100000000, but start is 0 (inherited further up the callstack from here). go tool pprof continues to function as expected and recognizes the binary. This is a completely unrelated issue and is likely something on my side: I'd like to avoid muddying the topic at hand, but if it's a trivial fix, I'd love to have a pass at fixing this duplicate counting issue.

@Louis-Ye Louis-Ye added type: bug Buganizer type - Bug Priority: p2 Buganizer priority - P2 labels Jun 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Priority: p2 Buganizer priority - P2 type: bug Buganizer type - Bug
Projects
None yet
Development

No branches or pull requests

4 participants