Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test(bug): txtar to reproduce the VM hanging during Render #1736

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

jefft0
Copy link
Contributor

@jefft0 jefft0 commented Mar 5, 2024

This PR adds cpu-cycle-overrun1.txtar in gno/gno.land/cmd/gnoland/testdata . To run it, in gno/gno.land/cmd/gnoland enter:

go test -v . --run TestTestdata/cpu-cycle-overrun1

The call to Render hangs. You can see that the Render function is not very complicated, but some combination of avl.Tree and pointers seems to hang the GnoVM. We originally observed the problem in GnoSocial, but this code has been simplified as much as I can.

As explained in the file, if you comment out the final call to AddData, then Render returns quickly. So it is not a problem with too much data in memory. My wild guess is that there is indeed a CPU cycle overrun, but it hangs instead of quitting and returning the "CPU cycle overrun" status.

@jefft0 jefft0 requested a review from moul as a code owner March 5, 2024 14:37
@github-actions github-actions bot added the 📦 ⛰️ gno.land Issues or PRs gno.land package related label Mar 5, 2024
@jefft0 jefft0 force-pushed the chore/add-cpu-cycle-overrun1.txtar branch from e0f3370 to 57953f1 Compare March 5, 2024 14:55
jefft0 added 2 commits March 5, 2024 18:15
Signed-off-by: Jeff Thompson <jeff@thefirst.org>
…args

Signed-off-by: Jeff Thompson <jeff@thefirst.org>
@jefft0 jefft0 force-pushed the chore/add-cpu-cycle-overrun1.txtar branch from 57953f1 to 2f78ae2 Compare March 5, 2024 17:18
@jefft0 jefft0 requested a review from gfanton as a code owner March 5, 2024 17:18
@gfanton gfanton requested a review from thehowl March 7, 2024 10:08
@gfanton gfanton mentioned this pull request Mar 7, 2024
7 tasks
@thehowl
Copy link
Member

thehowl commented Mar 8, 2024

I'm still investigating, but my preliminary findings is that this is actually occurring while attempting to do Machine.String(). Generating a string value for an AVL node this big takes a truckload of time, up to where it looks like it's stuck (it might spit out something after a couple of hours). Though nothing to do with the actual VM cycles. Will provide more details as soon as I have them...

@thehowl
Copy link
Member

thehowl commented Mar 8, 2024

Okay, as I was mentioning in the last comment this has to do with printing "very recursive" values like the avl trees you're working with.

If we apply this patch, which essentially forces a maximum depth of 8 when printing values, we get to see the "true" error:

diff --git a/gnovm/pkg/gnolang/values_string.go b/gnovm/pkg/gnolang/values_string.go
index 34187e32..86274b19 100644
--- a/gnovm/pkg/gnolang/values_string.go
+++ b/gnovm/pkg/gnolang/values_string.go
@@ -11,19 +11,22 @@ type protectedStringer interface {
 	ProtectedString(*seenValues) string
 }
 
-// This indicates the maximum ancticipated depth of the stack when printing a Value type.
-const defaultSeenValuesSize = 32
+const seenValuesSize = 8
 
 type seenValues struct {
-	values []Value
+	values [seenValuesSize]Value
+	len    int
 }
 
+func (sv *seenValues) Full() bool { return sv.len == seenValuesSize }
+
 func (sv *seenValues) Put(v Value) {
-	sv.values = append(sv.values, v)
+	sv.values[sv.len] = v
+	sv.len++
 }
 
 func (sv *seenValues) Contains(v Value) bool {
-	for _, vv := range sv.values {
+	for _, vv := range sv.values[:sv.len] {
 		if vv == v {
 			return true
 		}
@@ -41,11 +44,11 @@ func (sv *seenValues) Contains(v Value) bool {
 //     being printed if the value has already been print
 //   - this is NOT recursion and SHOULD be printed
 func (sv *seenValues) Pop() {
-	sv.values = sv.values[:len(sv.values)-1]
+	sv.len--
 }
 
 func newSeenValues() *seenValues {
-	return &seenValues{values: make([]Value, 0, defaultSeenValuesSize)}
+	return &seenValues{}
 }
 
 func (v StringValue) String() string {
@@ -72,6 +75,9 @@ func (av *ArrayValue) ProtectedString(seen *seenValues) string {
 	if seen.Contains(av) {
 		return fmt.Sprintf("%p", av)
 	}
+	if seen.Full() {
+		return "<trimmed>"
+	}
 
 	seen.Put(av)
 	defer seen.Pop()
@@ -104,6 +110,9 @@ func (sv *SliceValue) ProtectedString(seen *seenValues) string {
 	if seen.Contains(sv) {
 		return fmt.Sprintf("%p", sv)
 	}
+	if seen.Full() {
+		return "<trimmed>"
+	}
 
 	if ref, ok := sv.Base.(RefValue); ok {
 		return fmt.Sprintf("slice[%v]", ref)
@@ -134,6 +143,9 @@ func (pv PointerValue) ProtectedString(seen *seenValues) string {
 	if seen.Contains(pv) {
 		return fmt.Sprintf("%p", &pv)
 	}
+	if seen.Full() {
+		return "<trimmed>"
+	}
 
 	seen.Put(pv)
 	defer seen.Pop()
@@ -154,6 +166,9 @@ func (sv *StructValue) ProtectedString(seen *seenValues) string {
 	if seen.Contains(sv) {
 		return fmt.Sprintf("%p", sv)
 	}
+	if seen.Full() {
+		return "<trimmed>"
+	}
 
 	seen.Put(sv)
 	defer seen.Pop()
@@ -204,6 +219,9 @@ func (mv *MapValue) ProtectedString(seen *seenValues) string {
 	if seen.Contains(mv) {
 		return fmt.Sprintf("%p", mv)
 	}
+	if seen.Full() {
+		return "<trimmed>"
+	}
 
 	seen.Put(mv)
 	defer seen.Pop()
@@ -286,6 +304,9 @@ func (tv *TypedValue) ProtectedSprint(seen *seenValues, considerDeclaredType boo
 	if seen.Contains(tv.V) {
 		return fmt.Sprintf("%p", tv)
 	}
+	if seen.Full() {
+		return "<trimmed>"
+	}
 
 	// print declared type
 	if _, ok := tv.T.(*DeclaredType); ok && considerDeclaredType {

The "true" error:

        [stderr]
        "gnokey" error: allocation limit exceeded

        FAIL: testdata/cpu-cycle-overrun1.txtar:22: unexpected "gnokey" command failure: allocation limit exceeded

(The VM encounters this error, then attempts to dump the machine state for debugging into the error message, and thus "hangs" while trying to print the state).

As for the solution, I'm not terribly sure. What I've thought of:

  • This approach works for now, but is naive. Ultimately the underlying problem is that we're getting too much data to print out, both in the stringer but also because we're dumping each block, often containing redundant data, (ie. already contained and printed in other blocks), or contained in multiple variables, which together with the recursiveness blows up the sheer amount of data printed in Machine.String(), and as a consequence also allocated just to return the error.
  • Also, reaching a depth of 8, while not terribly common, is still possible even with a data structure that is not recursive like the avl.Tree (think of an AST tree)
  • So a better approach seems to have a limit of how many values can be printed when stringing a VM value (say 1000)
  • ... but in order for it to maintain effectiveness of dumped values as debugging tools, they should be printed in a "breadth-first" manner. (ie. print all items at depth=0, then depth=1, depth=2... until you reach the limit of items).

I'm not sure how/if the last part can be implemented in a nice and efficient way. cc/ @deelawn @mvertes @ajnavarro as you probably have ideas.

@ajnavarro
Copy link
Contributor

@thehowl IMHO, Machine.String() output as it is right now, should not be used as a log output, even adding depth limitations.

If we really want to do what Machinge.String() is doing, we have to change the method to return a reader, avoiding using huge amounts of memory.

@thehowl
Copy link
Member

thehowl commented Mar 14, 2024

Yeah.

What would be best is to implement a way to print an actual stacktrace (of the Gno code alone) so we can use that to debug in place of the machine state dump :/

@jefft0
Copy link
Contributor Author

jefft0 commented Mar 21, 2024

I reverted commit 2f78ae2 and merged from master which now has the same fix from PR #1745 .

Copy link

codecov bot commented Mar 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.73%. Comparing base (dc0cc26) to head (bc9f816).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1736      +/-   ##
==========================================
- Coverage   60.85%   60.73%   -0.12%     
==========================================
  Files         563      563              
  Lines       75193    74979     -214     
==========================================
- Hits        45760    45541     -219     
- Misses      26065    26078      +13     
+ Partials     3368     3360       -8     
Flag Coverage Δ
contribs/gnodev 61.46% <ø> (ø)
contribs/gnofaucet 14.46% <ø> (ø)
gnovm 65.63% <ø> (ø)
misc/genstd 80.54% <ø> (ø)
misc/logos 19.88% <ø> (-0.36%) ⬇️
tm2 62.07% <ø> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@thehowl
Copy link
Member

thehowl commented Mar 21, 2024

From the review meeting: the agreed solution is to switch from doing Machine.String() to printing a stacktrace. (This is likely to be more useful and parseable information by the programmer anyway... :) )

Copy link
Member

@zivkovicmilos zivkovicmilos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for isolating this in a txtar 🙏
I've left a small comment on how we can include this txtar, but not run it until we fix it (have it as a regression)

@@ -0,0 +1,64 @@
loadpkg gno.land/p/demo/avl
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we run this txtar conditionally, so it doesn't hang in the local suite, but is present in the codebase? cc @gfanton

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not have failing tests in the codebase unless we have a system that expects them to fail and checks that they are still failing

Ie. what we should do for the "challenges" directory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the "bug" label so that this PR gets attention as demonstrating a bug.

@jefft0 jefft0 added the 🐞 bug Something isn't working label Jun 28, 2024
@jefft0 jefft0 changed the title chore: txtar to reproduce the VM hanging during Render test(bug): txtar to reproduce the VM hanging during Render Jul 3, 2024
@thehowl thehowl mentioned this pull request Oct 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🐞 bug Something isn't working 📦 ⛰️ gno.land Issues or PRs gno.land package related
Projects
Status: In Progress
Status: No status
Status: In Review
Development

Successfully merging this pull request may close these issues.

4 participants