@superbobry superbobry commented Dec 15, 2017

What changes were proposed in this pull request?

The format of event logs uses a redundant representation for storage
levels, for instance, StorageLevel.DISK_ONLY is represented as

{"Use Disk":true,"Use Memory":false,"Deserialized":false,"Replication":1}

which is 64 bytes longer than the constant name DISK_ONLY. This commit changes the event log representation
of the StorageLevel to predefined constants: NONE, DISK_ONLY, etc. The
change is fully backwards compatible.
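
The 64-byte figure can be checked with a few lines of Scala. This is only a quick sketch: the object name `StorageLevelSizeSketch` is made up for illustration, and the count ignores the two quote characters that a JSON string adds around the name (with them the saving per occurrence is 62 bytes).

```scala
object StorageLevelSizeSketch {
  // Verbose event-log form of StorageLevel.DISK_ONLY, copied from the description.
  val verbose: String =
    """{"Use Disk":true,"Use Memory":false,"Deserialized":false,"Replication":1}"""

  // Proposed compact form: the constant's name, without JSON quotes.
  val compact: String = "DISK_ONLY"

  // Both strings are pure ASCII, so the character count equals the UTF-8 byte count.
  val savedBytes: Int = verbose.length - compact.length
}
```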

How was this patch tested?

core unit tests.

The format of event logs uses a redundant representation for storage
levels, for instance, StorageLevel.DISK_ONLY is represented as

    {"Use Disk":true,"Use Memory":false,"Deserialized":false,"Replication":1}

which is 64 bytes longer than the constant name DISK_ONLY. This commit changes the event log representation
of the StorageLevel to predefined constants: NONE, DISK_ONLY, etc. The
change is fully backward compatible, because

* StorageLevel constructor is private, meaning that existing event
  logs can only contain these predefined levels;
* The JsonProtocol supports reading both the old format and the new one.
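
The two read paths described above can be sketched as follows. This is a minimal illustration, not the code from the patch: `Json`, `JStr`, `JBool`, `JNum`, `JObj` and `Level` are hypothetical stand-ins for the json4s values and for `StorageLevel`, and only three predefined levels are listed.

```scala
object CompactStorageLevelSketch {
  // Hypothetical stand-ins for json4s values; only what the sketch needs.
  sealed trait Json
  final case class JStr(value: String) extends Json
  final case class JBool(value: Boolean) extends Json
  final case class JNum(value: Int) extends Json
  final case class JObj(fields: Map[String, Json]) extends Json

  // Hypothetical stand-in for StorageLevel (the off-heap flag is left out).
  final case class Level(useDisk: Boolean, useMemory: Boolean,
      deserialized: Boolean, replication: Int)

  val predefined: Seq[(Level, String)] = Seq(
    Level(false, false, false, 1) -> "NONE",
    Level(true, false, false, 1) -> "DISK_ONLY",
    Level(false, true, true, 1) -> "MEMORY_ONLY"
  )

  // Old, verbose format: one field per flag.
  def verbose(level: Level): Json = JObj(Map(
    "Use Disk" -> JBool(level.useDisk),
    "Use Memory" -> JBool(level.useMemory),
    "Deserialized" -> JBool(level.deserialized),
    "Replication" -> JNum(level.replication)
  ))

  // Write path: a predefined level becomes its name, anything else stays verbose.
  def toJson(level: Level): Json =
    predefined
      .collectFirst[Json] { case (l, name) if l == level => JStr(name) }
      .getOrElse(verbose(level))

  // Read path: accept both the new (string) and the old (object) format.
  def fromJson(json: Json): Level = json match {
    case JStr(name) =>
      predefined
        .collectFirst { case (level, n) if n == name => level }
        .getOrElse(throw new IllegalArgumentException(s"Unknown storage level: $name"))
    case JObj(fields) =>
      (fields.get("Use Disk"), fields.get("Use Memory"),
        fields.get("Deserialized"), fields.get("Replication")) match {
        case (Some(JBool(d)), Some(JBool(m)), Some(JBool(s)), Some(JNum(r))) =>
          Level(d, m, s, r)
        case _ =>
          throw new IllegalArgumentException(s"Invalid storage level: $json")
      }
    case _ =>
      throw new IllegalArgumentException(s"Invalid storage level: $json")
  }
}
```

Because the reader dispatches on the shape of the JSON value, event logs written before the change and logs written after it go through the same entry point.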
("Deserialized" -> storageLevel.deserialized) ~
("Replication" -> storageLevel.replication)
def storageLevelToJson(storageLevel: StorageLevel): JValue = storageLevel match {
case StorageLevel.NONE => "NONE"
Contributor Author

Maybe we can change StorageLevel.toString to do this or add another method e.g. StorageLevel.name.

Member

Agree, that would be more robust.

Member

@srowen srowen left a comment

@andrewor14 I think you touched this last, and a long time ago. Just pinging you in case you have an opinion.

val replication = (json \ "Replication").extract[Int]
StorageLevel(useDisk, useMemory, deserialized, replication)
def storageLevelFromJson(json: JValue): StorageLevel = json match {
case _: JString => StorageLevel.fromString(json.extract[String])
Member

It's probably worth some comments about why there are two read paths

Contributor Author

Done.

| "Port": 300
| },
| "Block ID": "rdd_0_0",
| "Storage Level": {
Member

We'll also want to retain some tests of the old format to ensure it's still read. Maybe they are outside of the diff and I'm not seeing them.

Contributor Author

Added a to/from JSON test for a custom StorageLevel.

Member

I guess I mean, doesn't this no longer test whether it can read the verbose, old-style format, like this test does here and the ones above that are being removed?

Contributor Author

These strings are used with testEvent, which asserts that the serialized representation matches the one given in the string literal. See https://github.com/criteo-forks/spark/blob/7869e63a569a6fb6725996084f0c5c55fc130ac8/core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala#L457

Contributor

I share Sean's concern, and I don't think your response addresses it. You added one test on L151 that makes sure some event which is not predefined still works. But you don't have a test making sure that the old, verbose string can still be parsed (or is it somewhere else?)

This is probably covered indirectly by HistoryServerSuite, but a more direct test would be better.

Contributor Author

I've added a test ensuring all predefined storage levels can be read from the legacy format.

Sidenote: I've also noticed that the legacy format incorrectly handles the predefined StorageLevel.OFF_HEAP and, in fact, any other custom storage level with useOffHeap = true. It looks like a bug to me, wdyt?

Contributor

Yup, I completely agree that off heap is not respected in the JSON format. Can you file a bug? I think it's still relevant even after this goes in, for custom levels.
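
To make the off-heap problem concrete, here is a small sketch of a round trip through a four-field legacy encoding. `Level` and `LegacyFields` are hypothetical stand-ins (not Spark classes), and the flag values of the sample level are chosen for illustration only.

```scala
object OffHeapRoundTripSketch {
  // Hypothetical stand-in for StorageLevel, including the off-heap flag.
  final case class Level(useDisk: Boolean, useMemory: Boolean, useOffHeap: Boolean,
      deserialized: Boolean, replication: Int)

  // The legacy JSON object carries these four fields and nothing about off-heap.
  final case class LegacyFields(useDisk: Boolean, useMemory: Boolean,
      deserialized: Boolean, replication: Int)

  def toLegacy(level: Level): LegacyFields =
    LegacyFields(level.useDisk, level.useMemory, level.deserialized, level.replication)

  // Reading back has nothing to restore useOffHeap from, so it ends up false.
  def fromLegacy(f: LegacyFields): Level =
    Level(useDisk = f.useDisk, useMemory = f.useMemory, useOffHeap = false,
      deserialized = f.deserialized, replication = f.replication)

  val offHeap: Level = Level(useDisk = true, useMemory = true, useOffHeap = true,
    deserialized = false, replication = 1)

  // The round-tripped level differs from the original only in useOffHeap.
  val roundTripped: Level = fromLegacy(toLegacy(offHeap))
}
```

Storing a predefined level by name sidesteps the loss for that level, but a custom level with `useOffHeap = true` would still go through the verbose path.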

Note that the previous commit contained a bug -- user-defined storage
levels caused an exception in JsonProtocol.
override def hashCode(): Int = toInt * 41 + replication

/** Name of the storage level if it is predefined or [[None]] otherwise. */
def name: Option[String] = this match {
Contributor

fromString below has basically the opposite of this. How about storing the mapping in a Seq[(StorageLevel, String)] and using that in both methods? e.g. here it would be:

knownLevels.collect { case (level, name) if level == this => name }.headOption

And pretty similar code in fromString.
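
A minimal sketch of that suggestion, assuming a simplified `Level` case class in place of `StorageLevel` and only three entries in `knownLevels` (the real list would cover every predefined level):

```scala
object KnownLevelsSketch {
  // Hypothetical stand-in for StorageLevel.
  final case class Level(useDisk: Boolean, useMemory: Boolean,
      deserialized: Boolean, replication: Int) {
    // Name of the level if it is predefined, None otherwise.
    def name: Option[String] = {
      knownLevels.collectFirst { case (level, n) if level == this => n }
    }
  }

  // Single source of truth shared by name and fromString.
  val knownLevels: Seq[(Level, String)] = Seq(
    Level(false, false, false, 1) -> "NONE",
    Level(true, false, false, 1) -> "DISK_ONLY",
    Level(false, true, true, 1) -> "MEMORY_ONLY"
  )

  // Inverse lookup over the same table.
  def fromString(s: String): Level = {
    knownLevels
      .collectFirst { case (level, n) if n == s => level }
      .getOrElse(throw new IllegalArgumentException(s"Invalid StorageLevel: $s"))
  }
}
```

With one table, adding a predefined level means adding one pair, and the two lookups cannot drift apart.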

Contributor Author

This sounds good, will do. A slightly unrelated point: I feel that the name fromString somehow implies that it's the opposite of toString. What do you think about renaming it to fromName now that we have name?

Contributor

> What do you think about renaming

It's a public method so renaming means breaking compatibility.

@vanzin
Contributor

vanzin commented Dec 16, 2017

ok to test


SparkQA commented Dec 16, 2017

Test build #84984 has finished for PR 19992 at commit 58e93bb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 16, 2017

Test build #85006 has finished for PR 19992 at commit 13fe385.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@superbobry
Contributor Author

superbobry commented Dec 16, 2017

Minor update: I've simulated #18162 on one of our 80G event logs and (unless there is a bug in the filtering code) the log shrank to 157M. The effect of this patch was almost negligible: it brought the size down to 155M. It is unclear for now whether this pattern generalizes to other workloads. See JIRA ticket for details.


SparkQA commented Dec 16, 2017

Test build #85012 has finished for PR 19992 at commit e171f03.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 17, 2017

Test build #85019 has finished for PR 19992 at commit 81b980f.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@superbobry
Contributor Author

Can someone have a look at the tests, please? I can't see the failure (and in theory, the change should not affect SparkR).

@srowen
Member

srowen commented Dec 18, 2017

Ignore the error, it's being fixed separately.


SparkQA commented Dec 19, 2017

Test build #4016 has finished for PR 19992 at commit 81b980f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/** Name of the storage level if it is predefined or `None` otherwise. */
def name: Option[String] = StorageLevel.PREDEFINED
.collectFirst { case (storageLevel, name) if storageLevel == this => name }
Contributor

If the method body needs multiple lines it's better to wrap it with { }. Same thing below.

("Deserialized" -> storageLevel.deserialized) ~
("Replication" -> storageLevel.replication)
}
def storageLevelToJson(storageLevel: StorageLevel): JValue =
Contributor

Braces.

Contributor Author

Sorry, missed it. I've decided not to add braces to storageLevelFromJson because it seems to look OK with the top-level match.


SparkQA commented Dec 20, 2017

Test build #85135 has finished for PR 19992 at commit 59ed873.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

("Replication" -> storageLevel.replication)
}
def storageLevelToJson(storageLevel: StorageLevel): JValue =
storageLevel.name match {
Contributor

nit:

storageLevel.name.getOrElse(...)

Contributor Author

Done. Sadly, in this case, getOrElse requires explicit type.

Contributor Author

Actually, after seeing the compilation error, I recall why I went with a match instead of getOrElse -- the former does not require an explicit conversion to JString.
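
The difference between the two forms can be sketched with stand-in types. Here `Json`, `JStr` and `JObj` are hypothetical substitutes for the json4s values, and `stringToJson` plays the role of the implicit String-to-JValue conversion that is assumed to come from the json4s DSL import:

```scala
import scala.language.implicitConversions

object MatchVsGetOrElseSketch {
  // Hypothetical stand-ins for json4s values.
  sealed trait Json
  final case class JStr(value: String) extends Json
  final case class JObj(fields: Map[String, Json]) extends Json

  // Stand-in for the implicit String-to-JValue conversion.
  implicit def stringToJson(s: String): Json = JStr(s)

  // Placeholder for the verbose object form.
  val verbose: Json = JObj(Map.empty)

  // With a match, each branch is typed against the declared result type,
  // so the String branch is converted implicitly.
  def viaMatch(name: Option[String]): Json = name match {
    case Some(n) => n
    case None => verbose
  }

  // With getOrElse, the Option[String] has to be mapped to a Json value
  // explicitly, and the element type is spelled out here for clarity.
  def viaGetOrElse(name: Option[String]): Json =
    name.map(JStr(_)).getOrElse[Json](verbose)
}
```

Both functions return the same values; the match version simply leans on the expected type of each branch.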

val replication = (json \ "Replication").extract[Int]
StorageLevel(useDisk, useMemory, deserialized, replication)
case _ =>
throw new IllegalArgumentException(json.toString)
Contributor

nit:

throw new IllegalArgumentException(s"Invalid storage level from json: $json.")

Contributor Author

Done. I've changed the message to match the one in accumValueFromJson.


SparkQA commented Dec 26, 2017

Test build #85402 has finished for PR 19992 at commit 1fc6e75.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@superbobry superbobry force-pushed the compact-storage-level branch from 1fc6e75 to b1e6f5f Compare December 26, 2017 12:28
@jiangxb1987
Contributor

LGTM


SparkQA commented Dec 26, 2017

Test build #85403 has finished for PR 19992 at commit b1e6f5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 1, 2018

Test build #85579 has finished for PR 19992 at commit 9fbfe40.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@superbobry superbobry force-pushed the compact-storage-level branch from 9fbfe40 to cb1fe6a Compare January 1, 2018 13:32

SparkQA commented Jan 1, 2018

Test build #85580 has finished for PR 19992 at commit cb1fe6a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito commented Jan 5, 2018

The change is fine, but from the discussion on the JIRA I'm unclear if this is really worth it -- the gain seems pretty small after the other fix in 2.3.

@AmplabJenkins

Can one of the admins verify this patch?

@superbobry
Contributor Author

@squito I think it's fine to just close the PR/JIRA issue.

@squito
Contributor

squito commented Jan 22, 2018

Thanks for looking into this @superbobry -- can you actually close this yourself? We can't directly close it (there is a way but it's more complicated).

@superbobry superbobry closed this Jan 22, 2018
@Willymontaz Willymontaz deleted the compact-storage-level branch April 2, 2019 15:06