Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some crates in the crates.io index have invalid json #1168

Closed
est31 opened this issue Nov 12, 2017 · 8 comments
Closed

Some crates in the crates.io index have invalid json #1168

est31 opened this issue Nov 12, 2017 · 8 comments
Labels
C-bug 🐞 Category: unintended, undesired behavior

Comments

@est31
Copy link
Member

est31 commented Nov 12, 2017

I'm trying to parse the crates.io index using the serde and json crates. As we have a specification now, I know what I can require! I've created some structs and am iterating over the directory hierarchy, parsing every single file.

Now I have encountered some crates where the crates.io index violates that specification.

There are 100 older (read: 2014) crates that inside their dependencies list are either lacking a "kind" attribute, or have it set to null. (the only crate where kind is set to null is rope, due to being yanked).

Click to expand the list of affected crates
"unicode_names_macros"
"tempan"
"pipes"
"pidfile"
"plugin"
"uchardet-sys"
"uchardet"
"julius"
"openssl-sys"
"opentype"
"openssl2-sys"
"leveldb"
"tgff"
"vorbisfile-sys"
"vorbis-sys"
"glium"
"gl_generator"
"glutin"
"lazy"
"lapack"
"rope"
"ogg-sys"
"epsilonz"
"matrix"
"email"
"hotspot"
"miniz-sys"
"redis"
"resources_package"
"rethinkdb"
"basehangul"
"free"
"fractran_macros"
"chrono"
"docopt_macros"
"date"
"event"
"event-emitter"
"cssparser"
"obj"
"xsv"
"zip"
"mio"
"url"
"utp"
"rusqlite"
"rust-crypto"
"capnp-rpc"
"capnpc"
"probability"
"image"
"git2"
"fingertree"
"diecast"
"blas"
"slow_primes"
"gl"
"enforce"
"encoding-index-tradchinese"
"encoding"
"encoding-index-simpchinese"
"encoding-index-japanese"
"encoding-index-korean"
"encoding-index-singlebyte"
"bzip2-sys"
"bzip2"
"conduit-log-requests"
"conduit-middleware"
"conduit-utils"
"conduit-conditional-get"
"conduit-router"
"conduit"
"conduit-test"
"conduit-static"
"conduit-json-parser"
"typemap"
"libssh2-sys"
"liblapack-sys"
"libz-sys"
"libgit2-sys"
"libressl-pnacl-sys"
"quickcheck_macros"
"error"
"tiled"
"time"
"civet"
"ssh2"
"stream"
"strided"
"string_telephone"
"sfunc"
"ordered-float"
"flate2"
"monad"
"modifier"
"tailrec"
"tabwriter"
"taskpool"
"fftw3-sys"
"grabbag"

As the index repo doesn't have a bugtracker, I chose to file the bug here. The missing of that attribute seems to be a violation of that specification. If you disagree that this is a violation, I'd really love to see an amendment of the spec that explicitly states that missing or null deps fields are allowed, together with a statement about the value you can assume if the deps field is not set.

@carols10cents
Copy link
Member

Yes, we should amend the spec.

@sfackler
Copy link
Member

@alexcrichton how would you feel about rewriting these hundred records? It seems like it'd be nicer to not have to have a weird exception around kind being required just because a small number of bad records were created.

@alexcrichton
Copy link
Member

I'd personally prefer to not update old entries in the registry, presumably with Serde this is a one line fix, no?

@est31
Copy link
Member Author

est31 commented Jan 22, 2018

@alexcrichton ultimately it should be a one liner but right now it is not.

Generally I'm against editing of existing crates (crates.io being append only is a really good thing) but there are a few issues present in the crates.io index where I think some light editing is a good idea. This is one such an issue, others are #1201 #1177 #1179 .

@wycats
Copy link
Contributor

wycats commented Feb 26, 2018

If we want to allow editing, we would want a strict policy about when it's appropriate. At minimum, any cleanups should not affect older versions of Cargo.

One thing to keep in mind is that wholesale editing of the index creates commits that further bloat the size of the repository. We might want to avail ourselves of the option to flatten the repository back to a single commit whenever we do any rewriting, but as we've never done it before, we need to investigate what the logistics of that would be.

@withoutboats
Copy link
Contributor

We discussed this in the cargo meeting today and we decided that rather than rewrite the history, we'd prefer to modify the spec for the index format. In particular, we think that when writing entries to an index, implementations MUST write the kind field, but when parsing entries, implementations SHOULD accept that the kind field may be missing or null, and if it is should interpret it as the default value.

@est31
Copy link
Member Author

est31 commented Feb 26, 2018

If we want to allow editing, we would want a strict policy about when it's appropriate. At minimum, any cleanups should not affect older versions of Cargo.

Having a policy on this would be a great idea. I've just seen edits both by carol and by alex so I've thought edits that are just cleaning up after a technical mishap/bug that maintain the general spirit of crates.io are okay. With "spirit" I mean that people who want to force their users to use the latest versions of their libs and are asking the crates.io team to delete their older versions are being refused.

All I am asking for is an edit similar to its predecessors. A policy that states this would be a good idea. If we are talking about this, I'd also love transparency about admin interactions. E.g. why was the nom_lua crate deleted? When did admins have to delete crates due to a CoC violation? When did admins give a crate to someone else or helped someone who was locked out of their account? lobste.rs has a very nice moderation log. Discussions on this probably better fit into its own RFC, I'd love to write one if there is interest in the team.

I think this particular issue can be resolved with changing the spec only, but for issues like #1177 or #1201 or #1179 there is no fitting fix in the spec... Do you really think that allowing two hashes for one (crate, version) tuple is a good idea :)? Or keeping crates in the index that can't be downloaded from S3? I think if you can't recover the crate's content, you should admit that it is unrecoverably lost, or if there was a conscious decision to delete those crates, you should delete them from the index as well.

We might want to avail ourselves of the option to flatten the repository back to a single commit whenever we do any rewriting

Rewriting the history would be really bad. The index is the only public document on when and in which order crates got added or modified (the API exists as well but you can't easily bulk download the entire history like you can download the index and it is filled with repetitive and predictable content... the API is less a document and more a service). In order to keep the index small, having shallow clones is enough. Then you only get the blobs actually used by the latest commit.

Also, and very importantly, one could imagine having a "time machine" feature for cargo where you say "get me to crates.io of oct 3, 2015". All it would have to do is to get the full history from github and then use the last index commit before that date instead of latest master.

@Turbo87
Copy link
Member

Turbo87 commented Oct 13, 2020

with rust-lang/rfcs#2301 and #1168 (comment) it looks like this issue is covered, so I'll go ahead and close this now.

@Turbo87 Turbo87 closed this as completed Oct 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug 🐞 Category: unintended, undesired behavior
Projects
None yet
Development

No branches or pull requests

7 participants