Proposal: Use osmosis-labs/cosmos-sdk as default import #941

Closed
ethanfrey opened this issue Aug 15, 2022 · 40 comments

Comments

@ethanfrey
Member

ethanfrey commented Aug 15, 2022

Given the poor track record of the upstream cosmos-sdk repo:

  • Development on the upstream cosmos/cosmos-sdk repo has been painfully slow
  • QA has historically been very poor and I would wait for 4+ mainnet deployments (and a 0.47 or 0.48 release) to consider the current 0.46 branch safe
  • The major feature in 0.46 (groups and govs) is not so useful for CosmWasm chains and brings a very dangerous attack surface
  • The defence of heavy use of non-deterministic items in the repo, which they don't even see as bugs to be fixed (last 2 Juno halts were shrugged off as "expected behavior" from their side)
  • Lack of progress on features many chains have requested (like faster storage layer, solid orm)

Given the solid track record and aligned goals of the osmosis-labs fork:

  • Patching a number of 0-day exploits before the upstream SDK
  • Making a number of performance improvements
  • Desire to replace all use of map with BTreeMap generic or such, to avoid random sort order
  • Desire to stabilize core modules to eventually make StargateMsg and StargateQuery safe, like in a sane blockchain sdk

I propose:

  • Wasmd make an official release using the osmosis-labs fork (e.g. v0.28.0-osmosis)
  • From the next release on, we make this default, and maintain the "original" sdk as another tag (eg v0.29.0, v0.29.0-cosmos)
  • Confio work closely with Osmosis labs to provide proper stability guarantees in the cosmos-sdk needed for CosmWasm chains
  • We invite all other organizations (especially IG) to collaborate on this fork focused on the demanding needs of all CosmWasm-enabled chains

I would first focus on stabilising the 0.45.x branch and then collaborate on a new release that ports over any useful work from 0.46 (like mempool integration). And I would consider deprecating cosmos/cosmos-sdk@0.46, as they just showed was possible with Tendermint. Sometimes you need to break a few eggs to move forward...

@ethanfrey
Member Author

ethanfrey commented Aug 15, 2022

This impacts the whole ecosystem and I would like many comments on it. We still plan to support those who stick with the upstream sdk, but will not be expending the effort to debug all those sdk integration pain points, when the bug is clearly in the sdk.

@sunnya97 @ValarDragon @JakeHartnell @marbar3778 @shanev @assafmo @yihuang @jackzampolin happy for feedback and other options. The decision is not taken, but the frustration has grown enough that sticking with cosmos/cosmos-sdk and upgrading to 0.46 does not seem like a realistic path forward.

I would like the sdk development to focus on the critical needs affecting many chains in production, not imaginary problems.

@maurolacy
Contributor

maurolacy commented Aug 16, 2022

👍🏼 Agree with this.

Regarding

Desire to replace all use of map with BTreeMap generic or such, to avoid random sort order

Best option IMO would be to remove that silly randomisation in the golang code base directly. It is one thing to tell developers not to rely on the iteration order of map keys, but it is quite another to explicitly introduce randomness into them.

As a developer, I can accept not to rely on a specific order for map keys. But I don't see why I should accept not to rely on a fixed order for them. At least within the same golang version, and between calls to the same underlying map.

@faddat
Contributor

faddat commented Aug 16, 2022

Commentary (general)

https://twitter.com/gadikian/status/1559381295737647105

I think it would be a good idea (great idea) for IG to rebrand, or for the best of the best at IG to seek roles in more commercial organizations with funding sources other than the ICF. We should likely keep in mind that there is indeed clear responsibility for things like SMTs. For appointments that don't lead to delivery. For v35's that explode and subsequent denialism.

I'm not in favor of rebranding concepts that can confuse the IBC network with the interchain foundation, and I don't want IG confused with the interchain foundation either. I believe that this situation is a consequence of the interchain foundation's REFUSAL to engage with the community on an open, public basis, as they claim to do. Sadly, hollow claims do not help the community to grow, kinda like SMTs delivered by folks working on alternative Gaias whose ultimate goal most likely was never to deliver SMTs but instead supplant the hub. (eg: vulcanize/laconic) (4 years in stealth, 4 years funded by icf....)

Unique to the cosmos hub, contributors are treated very poorly, which I've discussed extensively on the cosmos hub forum:

https://forum.cosmos.network/u/jacobgadikian/activity

Sorry to bring branding and finance into what is properly a technical discussion. Sadly, I am certain that the branding and finance topic is related to what's being discussed here, on the technical side, and I did not want for these things to remain unsaid. So I said them.

Commentary (technical)

Unfortunately, I agree.

I also think that it would make a ton of sense to convene a call. I'd love to be on that call. I reckon that Marko should be on it, too, because in my experience he is a plain dealer unaffected by the maladies stemming from the ICF. I would much prefer to see a

Realistically, the fork relationship in the osmosis-sdk may be the end problem. EG: maybe it is no longer a fork.

Then we need to think about other issues, like shared security and IBC-go generally speaking. I uh, upgraded that to 46. Seems a bit of a waste though, as I did that mainly to ship a chain that uses both groups and cosmwasm, because we used groups to replace gov.

Anyhow, daodao is much better than groups but didn't exist when we began to build craft. I'm interested in helping it to be able to more directly administer chains.

Confio/CW team shouldn't have to deal with the burden of maintaining two upstream SDK's.

Proposal (technical and political and brand and the like)

The modularity being planned for SDK 47 seems good. I also think that by decomposing the sdk some, we enable swapping in and out pieces of the SDK. This is much like my own journey from enormous bits of code, to progressively smaller and smaller bits of code.

What we've observed at Notional is that, from a business perspective, CosmWasm has become de rigueur -- that is why we put so much effort into getting SDK 46 and CW to play together nicely.

I think that it would be wisest to try our best to converge on a vision for the SDK that doesn't lead to further painful bifurcation.

@tac0turtle
Contributor

It's unfortunate this issue has come up. I have tried super hard to hear the pain points of users and in the sdk community call I present the quarterly roadmaps and sprint themes and ask for feedback and if there is something that people feel like we are missing. This is the sort of thing that I would love to hear from people; the team working on cosmos/cosmos-sdk is willing to move things around or downright throw things out if the community says they won't use it or don't want it. I talk with dev and others with the osmosis team on how to uncork their work, and we are already working on a trajectory on how to do this. We recently caught up with the osmosis fork of iavl with 0.19.1 (https://github.com/cosmos/iavl/releases/tag/v0.19.1). I would love to expand the community call to anyone that would like to join and voice concerns over what the cosmos/cosmos-sdk team is working on. If you come in and say cosmwasm needs you to fix this, we will fix it. If you say X is horrible and provide reasoning as to why, we will take the feedback and work on a different design. We are here to serve you, the community not anyone else.

I understand the frustration and want to work on immediate steps to remedy it. We are working on a plan to remedy the concerns you have and more. Before migrating, let's talk, and I would like you to join the cosmos-sdk community call, where we discuss these sorts of things.

The defence of heavy use of non-deterministic items in the repo, which they don't even see as bugs to be fixed (last 2 Juno halts were shrugged off as "expected behavior" from their side)

Want to bring attention to this point. We have been discussing on how to fix this and others on the sdk team have been pushing for bringing events into consensus for a while, well over a year. Anton and I did the implementation on the tendermint side to enable this in a seamless manner but there was push back and we ended up reverting it. No one is defending events being non-deterministic since we also want them to be deterministic, but the current design is that they are not. 0.46 and 0.46.1 can have different events as of now due to this.

@faddat
Contributor

faddat commented Aug 16, 2022

Marko, your efforts to make the SDK serve the community at large have always been very, very clear to me, and I hope that Confio joins the next community call.

@assafmo
Contributor

assafmo commented Aug 16, 2022

Thanks for the tag @ethanfrey.

I generally agree with the goal and share the pain. This however puts more responsibility in Osmosis' hands, which I'm not sure is fair to them. For example before using Osmosis' fork I'd like to know what customizations they did make compared to the vanilla SDK. E.g. are there gov/staking/distribution/etc changes? Did they cherry-pick features from 0.46 or other appchains?

With that said, I'm sure the Osmosis team will kick ass doing this. Their fork is already very impressive.

@tomtau

tomtau commented Aug 16, 2022

How will this work with ibc-go? It seems the fork still keeps the original name: https://github.com/osmosis-labs/cosmos-sdk/blob/osmosis-main/go.mod#L3 so it should just be a drop-in replacement that needs to be specified in downstream go.mod via replace, right?
One other potential headache may be for existing networks to migrate (I haven't looked into the fork, so not sure how "compatible" it is on the front, e.g. executing custom store migrations only meaningful when coming from the fork or assuming the previous version was already on the fork / had something different which it didn't if it was on the upstream SDK).
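
For reference, a sketch of what that `replace` could look like in a downstream chain's go.mod, assuming the fork keeps the upstream module path as the linked line suggests. The module name and the fork version below are placeholders, not real tags from either repo:

```text
module example.com/mychain // hypothetical downstream chain

go 1.18

require github.com/cosmos/cosmos-sdk v0.45.0

// Redirect the upstream import path to the fork. The version on the right
// is a placeholder: substitute a real tag or pseudo-version from the fork.
replace github.com/cosmos/cosmos-sdk => github.com/osmosis-labs/cosmos-sdk v0.45.0-placeholder
```

Go only honours `replace` directives in the main module's go.mod, so ibc-go and wasmd would keep importing `github.com/cosmos/cosmos-sdk` unchanged, and each chain binary would have to carry this line itself.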

@ethanfrey
Member Author

ethanfrey commented Aug 16, 2022

I have tried super hard to hear the pain points of users and in the sdk community call I present the quarterly roadmaps and sprint themes and ask for feedback and if there is something that people feel like we are missing

Thank you for running these. I have not attended these in well over a year, as they were my Friday evening and historically run by Regen as long, pointless discussions as they had already made their decisions and didn't really listen to feedback or just kept arguing til you agreed (consensus by attrition). I was unaware of any significant changes recently, just the mess of 0.46

No one is defending events being non-deterministic since we also want them to be deterministic, but the current design is that they are not.

Both Zaki and Bez argued this point in a slack channel with me. The last time I complained about non-deterministic queries (also using maps), I got similar feedback (not sure from whom, but I think it was Regen).

I would love to expand the community call to anyone that would like to join and voice concerns over what the cosmos/cosmos-sdk team is working on. If you come in and say cosmwasm needs you to fix this, we will fix it. If you say X is horrible and provide reasoning as to why, we will take the feedback and work on a different design. We are here to serve you, the community not anyone else

I will show up. Please send me a new invite (ideally not Friday evening). CosmWasm needs stability and the ability to call into the sdk without constant fear it will break consensus. All CosmWasm based chain halts in the last year have been due to some place in the sdk being non-deterministic when we assumed otherwise. And there has never been any interest to even fix these bugs, let alone systematically address them.

I do appreciate your help @marbar3778 in getting 0.45 out which resolved a number of unbilled gas issues that CosmWasm contracts could exploit.

@faddat
Contributor

faddat commented Aug 16, 2022

I want to leave my commentary above, because I think it is really important that CW have a safe playground-- and CW basically makes the sdk into a playground for devs. I am sure that we all see this as desirable.

I also see a coherent set of libraries with rapid releases as desirable, and I think that the mainline cosmos-sdk is heading in that direction. So, despite my initial gut feeling being: "I agree" -- I do wonder if we can't get this going more smoothly, eg:

Yes, osmosis has baked a ton of desirable features into their fork.

I've put too much work into 46-- and even deployed it on a live network-- for me to support its retraction, unless we were to do a 43 style retraction with an immediate upgrade to 47, but I am not sure that it would work that way. I think that a lot of my work would end up as one of the broken eggs. I mean months and months of it.

I think that this is the real issue:

But I am not going to pretend to be able to build the needed solution on the cheap. To detect the issues in v0.35.* for example, one could not use typical CI-- well, one could but you'd need to stand up and tear down nodes located across the world.

Fact is though, neither ibc-go nor cw tracked 46 related changes, so we couldn't really test them out.

We did of course, but they got caught up in the series of explosions surrounding v0.35.*

pragmatism

The pragmatic solution here is to ensure that certain projects treat one another as first class citizens, and sadly:

  • tendermint failed sdk
  • sdk failed cw
  • ibc-go failed sdk
  • cw failed sdk
  • no one maintained tm-db and efforts to do so were consistently ignored, except by Marko & Osmosis team

and by doing so, this cluster of projects failed to deliver on their full potential

I suppose that it's reasonable to point out that Osmosis... pays for work.
Juno.... pays for work

That is not and has never been the case with the icf (at least for Notional, as we know, they prefer the dudleys/vulcanizes/laconics of the world) and the random strings of the world.

As far as I am aware, Osmosis and Juno do not have the same penchant for malinvestment.

Additionally, the Osmosis and Juno teams have always been highly responsive to any and all queries from developers and would regard failing to respond to queries as a failure. I believe that Marko feels that way, too, but the interchain foundation does not. This is highly evident from my communications with both (or really lack of communication with icf, other than to admonish me for speaking out on egregious failures -- or worse, threaten me with code of conduct violations). Thus, I am concerned about repo control issues & organizational issues.

:(

@aaronc

aaronc commented Aug 16, 2022

All CosmWasm based chain halts in the last year have been due to some place in the sdk being non-deterministic when we assumed otherwise.

I'm sorry this happened to CosmWasm chains, but please take responsibility for what was your responsibility to understand. It was not fair to place blame for this on the SDK. The key point here is that you "assumed otherwise". The SDK's policy on events and queries not being part of consensus was around since well before I ever worked on the SDK. The reason for this is actually quite simple - it allowed events and queries to be changed in patch releases which historically was important for apps building on top of the SDK. There have also been discussions about bringing events and queries into consensus and it is something many of us, including myself, would like to see. However, it is not as simple as just calling it a bug and saying that it should be changed from one day to the next with little discussion because there was a misunderstanding from CosmWasm team. All I'm asking is please take responsibility for what is your responsibility without unfairly pointing fingers at other contributors in this open source ecosystem. Beyond that we are definitely willing to participate in discussions on how to properly bring events and queries into consensus.

Thank you for running these. I have not attended these in well over a year, as they were my Friday evening and historically run by Regen as long, pointless discussions as they had already made their decisions and didn't really listen to feedback or just kept arguing til you agreed (consensus by attrition). I was unaware of any significant changes recently, just the mess of 0.46

Wow these are some pretty heavy accusations. Maybe it did happen that way in some cases and if that's so I'm sorry because that was never our intention. I also do recall us having changed directions many times because of feedback we received and I have very few recollections of you having attended enough of these calls to make such claims.

The major feature in 0.46 (groups and govs) is not so useful for CosmWasm chains and brings a very dangerous attack surface

This is the first I'm hearing of these and if there is such a dangerous attack surface, we would like to know about it to address issues.

In general, I want to call attention to the tone of this discussion. I see a lot of finger pointing and I want to encourage the participants here to be respectful and sensitive to the other contributors to these projects.

@faddat
Contributor

faddat commented Aug 16, 2022

It is not even remotely a heavy accusation to say that 46 has been a mess.

It is also not a heavy accusation to say that the interchain foundation listens poorly at best.

I know this and I know this for certain because Notional is the largest user of 46. Feel free to disprove if you'd like. Of course, it doesn't seem like anybody is addressing this as a systemic failure (and it was a systemic failure). Oh yeah sure, let's throw Mr cosmwasm under the bus just like we did Jacob....

sure that's going to work

I wish to point clearly to the reality that the ideal outcome here is that we work together and we improve the SDK and tendermint.

@maurolacy
Contributor

maurolacy commented Aug 16, 2022

As a developer, I can accept not to rely on a specific order for map keys. But I don't see why I should accept not to rely on a fixed order for them. At least within the same golang version, and between calls to the same underlying map.

Taking a look at the Go code (https://github.com/golang/go/blob/a55793835f16d0242be18aff4ec0bd13494175bd/src/runtime/map.go#L844-L852), the random order introduces performance advantages, and (it seems) it's mostly there because of that. See commit 55c458e05f3.

That makes more sense. Anyway, I think there could be an option when creating the map (or better, the iterator), in which you specify which behaviour you want: Random but more performant, or fixed order.

Will create an issue in the go repository, and work on a tentative implementation.

Update: Created proposal 54500.

@codehans

codehans commented Aug 16, 2022

I am a relative newcomer to the Cosmos dev ecosystem, and as such I don't have a full picture of the context (or politics) here, however I do have more years of engineering experience than I wish to count, and skin in the game with this proposal.

Apologies for being blunt (and obviously take this given my limited context), but this feels like an easy way out. A cut-and-run approach.

This is a significant module within the ecosystem, probably the most important after the SDK itself. From my own point of view, having an upstream dependency change will create an enormous surface for divergence between the two, where the "original" version remains the de-facto SDK, especially for non-cw chains. It's also likely to cause confusion amongst new entrants to the space and make the whole getting-started with Cosmos more difficult.

If there are problems upstream, in my opinion they should be tackled head-on, albeit possibly the more challenging option. We're talking about the foundations of an ecosystem worth billions of dollars. The decision here should be the correct one, not the easy one.

@jackzampolin
Contributor

@codehans I strongly agree. Also @ethanfrey I think we can get your concerns with SDK maintainership addressed.

@ethanfrey
Member Author

If there are problems upstream, in my opinion they should be tackled head-on, albeit possibly the more challenging option. We're talking about the foundations of an ecosystem worth billions of dollars. The decision here should be the correct one, not the easy one.

I do agree that would be the preferable path. Given the track record, I had felt this was impossible to change. As "no one" seems to have responsibility or power to actually change the status quo besides making forks. (This "no one" has come from a few conversations with people I assumed actually had some authority). If no one in the ICF or IG has this authority, then what can a downstream project do?

In the meantime, @marbar3778 has written out to me about a restructured SDK team inside IG and I will attend their next public call and listen. These problems existed well before Regen took over the SDK, but with multiple other teams capable of taking over maintenance, the status quo is no longer bearable.

@the-frey
Contributor

Two thoughts -

  1. This has been brewing a while and this conversation needed to happen. I'm glad it has stayed relatively civil and constructive.
  2. The promise of SDK modularity is a nice idea - will believe it when I see it, I suppose. It would be nice to stay on the main SDK, but assuming we actually reach a modular design in future, I actually don't see an issue with a de-facto cosmwasm-specific SDK and a mainline SDK as long as they can talk IBC to each other. Although, as a dev, and as @codehans suggests, the thought of having to keep two SDKs in my head does bring on a slight migraine.

@faddat
Contributor

faddat commented Aug 17, 2022

@the-frey and @ethanfrey, hereafter "The Freys"

The Freys:

The sdk is actually going modular, and 100% of the new chains Notional is working on use cw. The demand is that strong. Thus, upstream should keep with the modularization, and be accommodating. I think a lot of the pains that @ethanfrey described don't exist currently, but I can confirm that they most certainly did.

@alexanderbez
Contributor

alexanderbez commented Aug 17, 2022

@marbar3778 and I, along with the rest of the SDK crew, have the ability to hear, understand, and address all your concerns @ethanfrey. We're happy and excited to get the SDK back on track to be an amazing framework to use and build apps on and we see CW as a major customer of said framework. We're extremely excited and happy to work with you and address all the needs that you might have.

I do think @codehans brings up some really valid points. The Osmosis fork has some impressive key improvements -- I know as I've worked on some of them, but I think sticking with the main repo and trying to bring in those changes as much as possible is best for the ecosystem at large.

SDK + CW ❤️

@faddat
Contributor

faddat commented Aug 17, 2022

Amazing, @alexanderbez

I just said the same thing and will shortly be distributing related meme propaganda

To be more technical:

100% of our work at Notional uses the SDK and CW.

@ethanfrey
Member Author

Thank you @alexanderbez ❤️

@ValarDragon

ValarDragon commented Aug 17, 2022

I share a ton of the frustration laid out in the OP, and unless there is radical change in SDK development process and roadmap in the next 3 months, I think Osmosis will need to maintain a permanent fork or be moving increasingly more core functionality to our own repo that we maintain, as well.

EDIT: To be clear, I think radical change is underway right now, and want to help see it through to shipping to remedy such concerns

I want to highlight that the systemic risk to cosmos is the sdk, from the following perspectives:

  • app developer
  • State machine / module complexity
  • client developer (go, JS, modules, rust)
  • security risk management
  • node operator
  • performance

Developing against the SDK is painful, and is security-hole laden. These problems have been delayed for a year, but under recent direction from Marko have now been very helpfully put onto core SDK roadmap. Historically, core focus has been patchwork applied that net increases complexity. I feel like osmosis has been ~the only team doing work to significantly improve these fronts in the last year, with exception to Notional and their performance improvements. This feels like it's been changing for the better in the last two months.

The continual focus on patchwork which hides the mess underneath but not fixing any of it, has been leaving us with a system that's less understandable, hardly maintainable and of high security risk. (When core team is most well positioned to be fixing the interfaces / guarantees of different sub-components)

Core team designs have been increasing architectural complexity and introducing security holes rather than fixing them.

  • Authz has introduced problems of vm escape, and patchwork added message execution in an unsafe way, rather than unified safe formats. This causes cascading problems, with chains adding more work to ante handlers, ica, cosmwasm, gov, and modules using messages.
  • Dependency injection has been a multi-month patchwork over the current insane interfaces that make no sense, rather than going through and doing literally 1-2 weeks of work to fix the interfaces that are incomprehensible in the first place.
  • (Many more examples)

There is a large list of layering on complexity and more app-dev LOC, and me getting tons of resistance in any effort to improve these. As a result Osmosis just builds a lot of improvements ourselves in our own repo (most of our core work isn't even in the SDK fork, and I don't bring up in calls to avoid bikesheds to death).

We've measured, upstreaming to the SDK is awful. PR's or minor issues are randomly closed, or get mired in bikesheds. The social coordination to push even minor things is incredibly draining. Branch management and fork problems are notably high as well. It's often the case that the upstreaming process takes 4x the time, and minor nits we are asked to do in the PR actually make it harder for us to ever update our SDK fork - this is not a flow that makes any sense. We often hold off on basic improvements to core interfaces so that we can at least retain full compatibility, instead making issues on the sdk that get bikeshed and ~never get into releases / branches we can timely upgrade to. I'm hugely appreciative of recent directions of the SDK team to minimize such bikeshed, or carry them over into subsequent work. This has been notably improving for the better, and I want to shout out @marbar3778 @alexanderbez for work towards this.

We've been coming to the conclusion that upstreaming things, or relying on SDK to ship them has empirically been too much overhead. The alternative is we move more changes and interfaces into our own Osmosis repo for our own productivity, and maintaining some level of module interface compatibility with the sdk.

There are often promises of fixes or long adrs, e.g. codegen cli, on which we are asked to provide feedback. We do, trust the sdk team to deliver in a timely way, and adjust our internal dev cycles around this landing, and nothing happens. It's worse than if there was never an ADR in the first place in many cases, because there is no 'get things shipped' mindset. I believe recent direction has stopped this, so some things become shipped.

The SDK release process is both slow, and unsafe. SDK 44 was a debacle, and I have little reason to believe in solutions that have mitigated such problems from being present in 46, so I feel both unsafe upgrading to it for months, and unclear as to why it even matters since imo the gov change wasn't done well.

I'm pretty confident in @marbar3778 having come in and improving a lot of this / acting as product owner to eliminate overheads. But until the mainline priorities are shifted, and it produces releases Osmosis can actually upgrade to under our fork UX, we're being pushed towards committing further to a long term fork, and making a very divergent dev UX.

@tac0turtle
Contributor

I share a ton of the frustration laid out in the OP, and unless there is radical change in SDK development process and roadmap in the next 3 months, I think Osmosis will need to maintain a permanent fork or be moving increasingly more core functionality to our own repo that we maintain, as well.

It's unfortunate to hear this, as we have talked at great length about the sdk, and we work on a lot of the feedback we have gotten from you. You have been attending the community call where I ask for feedback on the roadmap and whether anyone thinks we should remove something and/or add something. We have had conversations on how to integrate the sdk team closer to osmosis to tighten the feedback loop with the team pushing the software the hardest, and offered that we start listening in on your standups. I have followed up a few times to ask how we can do this and what the next steps are. To say we have not been trying to accommodate teams feels like a personal attack and makes me want to leave this project, because we are all trying to hear users.

  • Authz has introduced problems of vm escape, and patchwork added message execution in an unsafe way, rather than unified safe formats. This causes cascading problems, with chains adding more work to ante handlers, ica, cosmwasm, gov, and modules using messages.

Authz was audited by Informal and went through extensive testing; the issues you brought up were not found. Maybe we didn't document the assumptions made, but I'm not sure this falls 100% on the sdk team.

  • Dependency injection has been a multi-month patchwork over the current insane interfaces that make no sense, rather than going through and doing literally 1-2 weeks of work to fix the interfaces that are incomprehensible in the first place.

There is a long term plan to replace all the interfaces; this was the core module that was in review. It was recently closed in order to write an ADR that better shows what the goal is. While this falls on us for not documenting the potential changes beforehand, there is constant work on fixing these interfaces. Everyone on the team agrees that these interfaces are horrible and need to be changed; this was the goal and still is. We will write an ADR on how things will change.

We've measured, upstreaming to the SDK is awful. PR's or minor issues are randomly closed, or get mired in bikesheds. The social coordination to push even minor things is incredibly draining. Branch management and fork problems are notably high as well. It's often the case that the upstreaming process takes 4x the time, and then minor nits we are asked to do in the PR actually make it harder for us to ever update our SDK fork - this is not a flow that makes any sense. We often hold off on basic improvements to core interfaces so that we can at least retain full compatibility, instead making issues on the sdk that get bikeshed and ~never get into releases / branches we can timely upgrade to.

I have made this a huge focus, and in the past 2 months many of the recent PRs from osmosis have been merged within days, if not hours, of being opened. I have mentioned that the sdk team will upstream osmosis changes so you don't have to, as it's unfair since you also have your own product to build in an increasingly competitive landscape. I have offered my own time to do this as well, on top of the many things I have to do. We see the pain you encounter and have offered remedies.

There are often promises of fixes or long ADRs, e.g. codegen cli, on which we are asked to provide feedback. We do, trust the sdk team to deliver in a timely way, adjust our internal dev cycles around this landing, and nothing happens. It's worse than if there was never an ADR in the first place in many cases, because there is no 'get things shipped' mindset.

This has been changing. I closed or merged a lot of ADRs because we need to focus on 1-2 things instead of 5. We needed to rebuild our foundation as a team and ship things instead of making little progress on many things. This has been the methodology the team has been following. I take the blame for trying to slow the team down in order to deliver things instead of working on many scopes.

The SDK release process is both slow, and unsafe. SDK 44 was a debacle, and I have little reason to believe in solutions that have mitigated such problems from being present in 46, so I feel both unsafe upgrading to it for months, and unclear as to why it even matters since imo the gov change wasn't done well.

This is another conversation we have had many times. Yes, we know that 0.46 was long, and it took 2 months longer due to issues not related to the sdk. We are already planning our next release for 2 months from now and were one week away from spinning modules out into their own go mods. We had the idea to try to spin out at least 5 modules in the next sprint, but due to this issue and now diverting focus to other items, we will be aiming for 1-2.

I'm pretty confident in @marbar3778 having come in and improving a lot of this / acting as product owner to eliminate overheads. But until the mainline priorities are shifted, and produce releases Osmosis can actually upgrade to under our fork UX, we're being pushed towards committing further to a long-term fork, and making a very divergent dev UX.

Thank you, but I am part of the team you are calling out. I stand with them and if you are calling them out you are calling me out.

We have been hearing the pain points of the ecosystem and are actively working on them. I asked the team to be focused on larger epics instead of grabbing random issues. We try to focus on larger epics as themes, like spinning modules out, rewriting tests to use mocks, deprecating the param module, and rewriting cli tests to be cli interface tests. On top of these, every sprint we grab a few issues that are not related to the sprint to help push these along. The team is focused and moving faster than ever. We have even been asked to slow down by a team or two because they were trying to test things with the sdk and couldn't keep up. The development process has changed; we want feedback and a tighter feedback loop. I was hoping the community call in which we ask for this feedback would work, but it seems it is not. That's on me; I need to find a new way to get this feedback, as the current way is not working.

At the end of the day, we operate the sdk as a steward on behalf of the community. We want feedback; we seek feedback on how things are going and whether we are focusing on the right things. I want to invite anyone here, and anyone reading in the future, to reach out to me to be added to the community call, so you can at least gain better insight into what the sdk team is working on and say if we are headed in the wrong direction. Secondly, please check out our project board: https://github.com/orgs/cosmos/projects/26. We have spent time trying to make it self-describing so the ecosystem can easily see what we are working on, what our focus is, and how we are progressing. If anyone sees issues they think we are wasting our time on, please reach out and tell me. Thirdly, if anyone wants to know what the sdk team is working on right now, or does not understand what the project board shows, I'm happy to hop on a 1:1 call to explain things and get your feedback.

@faddat
Contributor

faddat commented Aug 18, 2022

Hey Marko, Hey Dev hey jack, hey bez, aaron, and -- my, it is a full house. Guess we all care very much about CW & cosmos :)

There's a really awful scene around -- in fact maybe not the sdk itself-- maybe the org funding it.

You're both super-earnest, and yes, I've sadly seen this thing with the "fight huge for the minor wins"

I've had it happen with myself, and heh, it's quite a pattern at this point. There are IMO-- super clear reasons why everyone's rallying around Marko & Bez in this situation, and I suppose, at risk of sounding gauche-- they work with us.

I'd like to shift to a set of purely practical concerns for a moment, and guide the conversation into a bit of a flow. I suppose some praise is in order:

  • @ethanfrey I am sure that you would not have this particular problem right now, if CosmWasm hadn't exceeded everyone's expectations for it. Thanks! Best way to build ibc apps, hands down in my opinion. Sometime in december of last year it became obvious that CW would be supported on almost any chain in Cosmos-- due to excellence.
  • @aaronc hi there sir, I really like the groups module, and so does Vuong. We've used it to kinda sorta replace gov in craft, and we're happy with how it came out.
  • Dev, Bez, Marko, you've all been amazing teachers to both myself and the notional team
  • I reckon the IBC-go team will read this too-- we got off to a slow start, but the grind towards v5 with you was really fun and educational
  • thanks to all the folks at tendermint the company, who helped me learn too

What I had in mind for a stripped down sdk, designed around the needs of cw, was more like what I think sdk v0.47.0 (v1?) is shaping up to be, something where components can be used more selectively. Craft.... is at present unshippable, due in my opinion to issues of software integration (ibc-go v5 is in beta and before that it was tendermint). I have a saying

"software integration is the doom beast from hell"

and this has led me more and more towards working in multi-language, single product software repositories that make really heavy use of various automation tooling:

  • linters
  • static analysis
  • per-commit integration tests between separate components of the same product

I am repeating myself, and apologize for that. I continue to feel that the core issues are:

  • ICF (certainly neither Marko nor Bez) acts in a super-entitled manner, and doesn't always keep the best interests of the community in mind (claiming transparency but not doing it, is super bad if you ask me)
  • Somehow, under the icf umbrella, disjoint teams formed:
    • tendermint core did not test against the sdk in ci
    • sdk wasn't testing against ibc-go
    • ibc-go waited and waited to support sdk v46
    • tm-db was ignored (though it clearly no longer is, that required creating cosmos-db)
    • CW wished to see the cosmos hub use a new sdk version, before officially supporting it, because they felt that was all the ICF team had incentive to care about (I am reading into this a bit, and expressing my own thoughts, too -- if disagree, feel free to say so)
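
Each gap in the list above is a pair of components that was not being tested against the other. As a minimal sketch of what a per-commit integration matrix could enumerate, assuming a hypothetical component list (the names below are illustrative labels, not real CI job names), every unordered pair can be generated like this:

```go
package main

import "fmt"

// Pair names two components whose integration should be exercised together.
type Pair struct{ A, B string }

// IntegrationPairs returns every unordered pair of distinct components,
// preserving the order in which the components were listed.
func IntegrationPairs(components []string) []Pair {
	var pairs []Pair
	for i := 0; i < len(components); i++ {
		for j := i + 1; j < len(components); j++ {
			pairs = append(pairs, Pair{A: components[i], B: components[j]})
		}
	}
	return pairs
}

func main() {
	// Hypothetical component labels, chosen for illustration only.
	components := []string{"tendermint", "cosmos-sdk", "ibc-go", "wasmd"}
	for _, p := range IntegrationPairs(components) {
		fmt.Printf("integration job: %s x %s\n", p.A, p.B)
	}
}
```

With four components this yields six jobs; a real pipeline would probably prune pairs that share no interface.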

I also strongly believe that folks aren't comfortable speaking openly about issues, and who could blame them? I've all too frequently seen discussions of fact, made about my person to distract from the facts.

I want to figure out how we can ensure that the pieces fit together well in the future. My biggest concern at present is that I cannot easily articulate a "next year ought to look like" type statement.

I think we need to describe and define this possible future SDK and we've got to think deeply about its ibc components

what if we tried to put together a bulleted list? Together, I mean. The things Notional and our clients are looking for:

  • compatibility in as many places as possible
  • transparency from orgs like icf
  • mix and match modularity where possible
  • roadmaps for predictability
  • improved onboarding processes (at times I think that the way to get this is to have less code overall)
  • governance directed projects-- or removal of governance

I don't have any doubt that osmosis team can make a banging fork. I am simply concerned, from a maintenance perspective, about what that will eventually entail.

I don't quite know how to get these things. I am very happy to talk with anyone here, without restriction, so that we can have a faster moving codebase overall.

little concerned I did not express myself well, but here goes, click

@faddat
Contributor

faddat commented Aug 18, 2022

I'd like to call attention to this comment:

#941 (comment)

@ethanfrey
Member Author

I think there have been plenty of voices on this issue and thank you all for the feedback.

I know you are on vacation this week @marbar3778 and I feel bad having inadvertently pulled you into this when you should be getting some well deserved rest. I very much appreciate and respect your work on the sdk and on the cosmos ecosystem in general, starting with the builder's program and every other initiative you have taken to improve things.

I think everyone has had a chance to express themselves and I would request that we all let the issue calm down a bit while we re-read and reflect on what others have written (and maybe @ValarDragon can answer that question from @tomtau)

I look forward to concrete and productive discussion at the next Cosmos SDK meeting (next Thursday) and hope @ValarDragon can also join with a desire to try to find a common ground.

For those of you from ICF, IG, or Regen listening in here, I ask you to reflect on the following:

  • Do you recognise the pain points various downstream projects are expressing?
  • Do you believe that @marbar3778 and @alexanderbez are capable of leading a Cosmos SDK that addresses them far better than current track record?
  • How can you support those two individuals, so they feel empowered to enact any changes they need to make?

Such changes as Marko discussed only work when he has organisational support, not constant resistance and "death by paper cuts". I don't ask you to answer this publicly, but please do reflect and communicate directly with Marko and Bez on this. For them to be able to enact changes, they need support and authority to do so.

If they do not have such support, be honest with them, so they can stop trying to enact a doomed mission. If they do have such support, please give it to them explicitly, so they can be empowered leaders and actually negotiate with other downstream projects to find a shared solution, we can all trust them to be able to actually implement.

@ValarDragon

ValarDragon commented Aug 18, 2022

I want to say again, I very much appreciate and respect the work @marbar3778 has done, especially in hearing Osmosis needs to get upgrades, and making upstreaming doable. The team has had tons of improvement in getting user needs hit. You're right, it was unfair of me in that main post to talk about upstream UX, when you are committing time to get things upstreamed, and get things merged within hours, and user issues well prioritized.

I also want to highlight, the big thing I thought has shipped amazingly: Upgrades, migrations, Cosmovisor. This has been a huge win, just has dev UX improvements to go!

I share a ton of the frustration laid out in the OP, and unless there is radical change in SDK development process and roadmap in the next 3 months, ...

I should have better specified. I do think there has been radical change undergoing for the last few months. I should have rewritten that message, better stating that and with the optimism I do truly have. E.g. the clear sprint priorities, with making independent go mods reliably ship soon. Tighter release schedules, talking to downstream users. We do just have a large sunk cost, that the ecosystem has to bite with v0.46.0

Many of the components of that message were expressing frustrations of the last 1 year, many of which you have in the last 3 months eliminated. (E.g. too many ADRs, clear goals and reliable progress towards those goals, measurably fewer bikesheds) I'm very excited for the work in a more constrained and usable SDK v0.47!

Some parts that weren't addressed / haven't been, is captured by Frey. The intent of my message was truly twofold:

  1. Highlight that reducing systemic complexity is crucial, and should be a major focus. (And work of the last year has not been doing this, and should be treated as a failure that the team is now excitedly improving)
  2. Highlight the problems and the areas you've made them better, to further the goal Frey elegantly stated:

Such changes as Marko discussed only work when he has organisational support, not constant resistance and "death by paper cuts". I don't ask you to answer this publicly, but please do reflect and communicate directly with Marko and Bez on this. For them to be able to enact changes, they need support and authority to do so.

This was referenced Aug 21, 2022
@robert-zaremba

A few observations to share:

  • delivering something for one team is always easier than delivering to the whole ecosystem. Of course Osmosis can do their own changes faster.
  • I know the pain of upstreaming a feature. Requested changes are painful - but we need to understand that they are made in good faith and for a good reason: QA.
  • Using an "other" fork doesn't necessarily mean that other teams won't fork as well - upstreaming to Osmosis will not be easier, because Osmosis has their own priorities and maybe they won't have time to analyze new features or take on the risk of potential bugs and maintenance.
  • I know that many teams are keeping a fork of SDK - because it's easier to change something in the core rather than creating wraps. I truly believe that the new module wiring system will greatly improve customization and will create foundations to move it to the next levels (here I want to thank @aaronc for the massive work he did towards that).
  • everyone is welcome to propose an improvement to the SDK. Personally I spent a lot of time discussing some improvements, and sometimes I think it would be better to just do a PoC directly.

@robert-zaremba

Also wanted to second @faddat, that a good testing framework for core modules integration (tendermint, SDK, IBC, and CW) with latest functionality (trunk) is really important for smooth adoption. We spent a lot of time testing and migrating core dependencies (IBC, GB, ibcbech32, peggo) to SDK 0.46, and then we will need to wait for CW to be migrated to 0.46.
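
One way such a trunk check might start is by pinning each core dependency to the tip of its development branch before running the test suite. The sketch below only prints the commands rather than executing them. The `Module` type, the `TrunkUpgradeCommands` helper, and the branch names are assumptions for illustration; the branch that counts as trunk, and the major-version suffix on a module path, must match what each repository actually declares.

```go
package main

import "fmt"

// Module pairs a Go module path with the branch treated as its trunk.
// Both values are illustrative assumptions, not taken from this thread.
type Module struct {
	Path  string
	Trunk string
}

// TrunkUpgradeCommands returns one `go get` invocation per module,
// each pinning that dependency to the tip of its trunk branch.
func TrunkUpgradeCommands(mods []Module) []string {
	cmds := make([]string, 0, len(mods))
	for _, m := range mods {
		cmds = append(cmds, fmt.Sprintf("go get %s@%s", m.Path, m.Trunk))
	}
	return cmds
}

func main() {
	// Assumed branch names; verify against each repository before use.
	core := []Module{
		{Path: "github.com/tendermint/tendermint", Trunk: "main"},
		{Path: "github.com/cosmos/cosmos-sdk", Trunk: "main"},
	}
	for _, c := range TrunkUpgradeCommands(core) {
		fmt.Println(c)
	}
	// After pinning, tidy the module graph and run the chain's own tests.
	fmt.Println("go mod tidy")
	fmt.Println("go test ./...")
}
```

Running this on a schedule, or on every commit of a downstream chain, would surface breakage from trunk early instead of at release time.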

@faddat
Contributor

faddat commented Aug 22, 2022

Also wanted to second @faddat, that a good testing framework for core modules integration (tendermint, SDK, IBC, and CW) with latest functionality (trunk) is really important for smooth adoption. We spent a lot of time testing and migrating core dependencies (IBC, GB, ibcbech32, peggo) to SDK 0.46, and then we will need to wait for CW to be migrated to 0.46.

Yes, upgrades were really tough for 46, and it does seem to be a generalized, systemic failure. I consider CW to be part of the "core" now. We no longer have projects that do not use cw.

@alexanderbez
Contributor

upgrades were really tough for 46

What were the toughest parts? What can we improve, or where did we fail greatly?

.. and it does seem to be a generalized, systemic failure ...

What and why?

@faddat
Contributor

faddat commented Aug 22, 2022

Okay sir you got it

What I'm going to do is type this out here and then make it an issue on The SDK although a lot of it is covered in my test core modules against one another issue. Everybody please keep in mind I'm basically speaking from my own commercial reality, and I deeply recognize that everybody's reality is different.

SDK 46 Fails

  • Tendermint 35

    • Problem: To be crazy specific about this, I think that we were the first team to hit this problem. The first craft test net failed. Well it didn't exactly fail, it did all kinds of weird and crazy things, just like Celestia did, and just like penumbra did. We didn't know why it did that and we ended up trying to fix our own code, over and over, in vain. We did not suspect upstream until sometime later.
    • Solution: Cosmos SDK 46 has a power feature in my opinion: the test net command. I love it. However, in this case it masked the peer-to-peer problems, because it was not deployed onto a global set of nodes the way that we do when we start a live network. I think that the solution is actually to automate kicking off a live network with every commit, since we already do pretty extensive CI. It should be global, and we could even get a little freaky with it; this would torture the network some.
  • IBC (general. Notional upgraded IBC many times for both the 34 flavor and the 35 flavor)

    • Problem: I think it's fair to say that the IBC team was maybe a little slow on the uptake. I want to give them super super solid praise for the way that they walked me through the version 5 upgrade, spending a lot of time on it and engaging in very serious hand holding. It is appreciated. So I guess the issue here is speed.
    • Solution: The IBC team should migrate at beta or earlier to ensure that the community has an easy time upgrading.

CW: The enormous success story

  • Problem: users would like to see a blessed version. The team is rightly gun-shy.
  • solutions:
    • We could fork and use the osmosis SDK. Robert is at least in part right, all teams will have some maintenance difficulties. This said, if I thought that Ethan was point-blank wrong, I would say so but overall I think that Ethan is correct and that there have been some pretty serious issues surrounding 46. I hope that for simplicity's sake we can now just call it the osmosis SDK, I have been doing that for a year now. This is not my preference.
    • We can maintain a more blessed version using the osmosis SDK, and a less blessed version using the cosmos SDK. This is also not my preference.
    • We can carefully outline what needs to come from osmosis into mainline, and we can have a small group call completely dedicated to fixing what are called insane interfaces (I assume that @ValarDragon was referring to the interfaces of the modules themselves, and I do think he's probably right), document that, and make banging it out the top priority of 47.

SDK 46 wins

  • Groups
  • Gov

I feel like I might be alone here, but I really do like the way that governance works now, and while we did end up forking the groups module to do this, we did some really cool stuff with the groups module for craft. In fact we replaced governance with groups. Sort of.

  • math module
  • Errors

DX stragglers

  • We still have several docker containers but we could have just one ecosystem wide and it could contain all of the tooling that is needed to build a chain or a contract in go or rust
  • We could use that container everywhere
  • That container could be the canonical way to build protocol buffer files
  • Protocol buffers builds are still painful even after buf
  • We could connect Dan Lynch with the ignite team, and we could fix the ignite CLI and we could get it into the SDK. I actually think this is a surprisingly important item. His expertise in AST should not go untapped.
  • Whether or not the ignite CLI lands inside the SDK itself, the SDK needs a templating tool. I'm fine with that being the ignite CLI, I really like the team, despite removing that code from Juno. Denis has always struck me as extremely extremely capable as have the other developers on the ignite CLI.
    • If we aren't using the ignite CLI we should remove all mention of it. I prefer that we use it after we make it use ASTs.
    • If we are using the ignite CLI then relevant packages need to be pushed to Dart's pub package manager. They're currently unmaintainable.
  • We should figure out if there's anything in the cosmos SDK that we can plain remove.

a broad concern

I usually work in monorepos. We dramatically accelerated development at pylons by moving everything into the same repository and the second piece of that was that everybody on the team was able to see what everybody else was doing.

There was a unifying effect.

I have been pitching the idea that osmosis should move to a mono repo for a fairly long time and I still think that's the case, because if we did we would be able to test for example, changes in the chain against the UI, automatically.

In juno, we test some contracts in the CI, but they're not in the repository. I'll probably put them there soon, to make sure that everything we do in the repository gets tested in CI.

The problem with this mono repo approach is that here's how I view the cosmos stack currently:

the cosmos stack

  • Tendermint
    • TMDb
  • Cosmos SDK
  • Cosmos db
  • Iavl
  • IBC Go
  • Cosmwasm

It's way too many components for a monorepo, maybe. Thing is, any reasonably sized chain uses all of these and this is why we absolutely need a template in the SDK so that when we make changes to the SDK, that template is used to spin up a chain with all of these components.

the happy part

Look, in the ecosystem I'm sometimes known as a critic or what have you, and that'll probably always be the case. But my friends, my colleagues, my internet homies: we work on an almost uniquely valuable piece of software, or pieces of software, that are beginning to have the types of problems that Linux has, and guys, this is good, this is utterly amazing. Like, I don't know if this is our Ubuntu from Debian moment or something much earlier, but we should be realistic about the fact that these are problems of success. From my own personal perspective, all of the issues that I'm having with the SDK are driven by overwhelming demand for the SDK.

If we end up having specialized distributions of the cosmos SDK, it has crossed my mind recently that that could be a good thing. Am I certain of this?

Hell no

But it really is a pleasure working with all of you and I feel quite lucky to have the opportunity to do so.

@alpe
Contributor

alpe commented Sep 28, 2022

A quick update on this topic: @ethanfrey, @marbar3778 and I have already had very good conversations and established a direct channel to communicate and address the stated concerns and future demands. Working closer together and sharing the other side's pain and success is vital in this environment, IMHO.

The plan discussed was to skip SDK v0.46 and integrate with v0.47 early.

@robert-zaremba

@alpe why skip 0.46? Do we know when 0.47 will be released? Usually it takes 1-3 months once the beta is released.

@tac0turtle
Contributor

tac0turtle commented Sep 28, 2022

End of October is the goal. 1-3 months was for 0.46, which had major scope creep. In the past it was faster.

Thank you @alpe looking forward to further collaboration with you and many others ❤️

@robert-zaremba

I don't want to sound skeptical, but it would be good to plan carefully to make sure we are not really talking about Feb next year, because the other migrations (IBC etc...) take time. 0.47 is more complex than 0.45, which was just a bunch of backports.

Umee wants to integrate CosmWasm asap. On the other hand, Juno wants SDK 0.46, but it's blocked by the wasmd migration. Crypto.com is also working on the 0.46 migration, but will not finish until CosmWasm is migrated.

@the-frey
Contributor

the-frey commented Sep 28, 2022

Well, to chip in - Juno will go with the judgement of the wasmd maintainers on 0.46. The release that would have had LSM + 46 will just be a bit further in the future I guess and be LSM + 47.

@alpe alpe moved this from 🆕 New to ❓ Needs more info in wasmd backlog Sep 30, 2022
@faddat
Contributor

faddat commented Oct 2, 2022

+1 @the-frey

@robert-zaremba

I was chatting recently with @ethanfrey. He is against migrating to 0.46 for various reasons (not tested enough, too many things, etc...). Personally, I don't know how 0.47 will make it better.

@faddat
Contributor

faddat commented Oct 23, 2022

Hey Robert, you know what's terrible? I fully agree with you. This means that I have changed my position: I thought that we were going to abandon 46 per Bucky's post on the Cosmos Hub forum, but that doesn't seem to be the case, and 47 has been delayed due to dragon fire. Therefore, I think it is especially important that we make 46 happen, so I am going to resume my previous efforts to ensure that 46 happens and happens in a bug-free fashion.

@ethanfrey
Member Author

Closing this as we will be going with 0.47 #1028

I think this could be much closer if they trim down the size. Seems like some large epics that are barely started that are blocking release. Happy for other views and comments here: cosmos/cosmos-sdk#13456 (comment)

We still plan to go straight to 0.47. Just hope it is sooner rather than later.

Repository owner moved this from ❓ Needs more info to ✅ Done in wasmd backlog Nov 4, 2022
@alpe alpe removed this from wasmd backlog Mar 3, 2023