
Maintain consistency when deserializing to JSON #5114

Merged — 11 commits, Jun 19, 2023

Conversation

@dhruvmanila (Member) commented Jun 15, 2023

Summary

Maintain consistency when serializing a Jupyter notebook back to JSON. The following
changes were made:

  1. Use a string array to store the source value, as that's the nbformat default (https://github.com/jupyter/nbformat/blob/57817204230a8f7173b8bec380e571b5d410de0d/nbformat/v4/nbjson.py#L56-L57)
  2. Remove unused structs and enums
  3. Reorder the keys alphabetically, as that's the default (https://github.com/jupyter/nbformat/blob/57817204230a8f7173b8bec380e571b5d410de0d/nbformat/v4/nbjson.py#L51)

Side effect

Removing the preserve_order feature means that the keys in the JSON output (--format json) will appear in alphabetical order. This is because the value is represented using serde_json::Value, which is internally a BTreeMap and thus sorts entries by their string keys. If this turns out to be undesirable, we could define a struct representing the JSON object; the order of the struct fields would then determine the key order in the JSON string.
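The BTreeMap-backed ordering can be demonstrated with the standard library alone; a minimal sketch (std only, independent of serde_json):

```rust
use std::collections::BTreeMap;

/// Collect keys in the order a BTreeMap yields them: always sorted
/// lexicographically by the string key, regardless of insertion order.
fn sorted_keys(pairs: &[(&str, i32)]) -> Vec<String> {
    let map: BTreeMap<String, i32> = pairs
        .iter()
        .map(|(k, v)| (k.to_string(), *v))
        .collect();
    map.keys().cloned().collect()
}

fn main() {
    // Inserted out of order, iterated alphabetically.
    let keys = sorted_keys(&[("version", 3), ("name", 1), ("mimetype", 2)]);
    assert_eq!(keys, ["mimetype", "name", "version"]);
    println!("{keys:?}");
}
```

serde_json's default Map is backed by exactly this type, which is why dropping preserve_order makes the serialized key order alphabetical.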

Test Plan

Add a test case to assert the raw JSON string.


Edit: Both issues below have been fixed; keeping them here for posterity.

Still a few inconsistencies...

1

execution_count is a required field only in code cells, and disallowed in other
cell types. We use a single Cell struct with a cell_type field instead of having
separate structs for each cell type (CodeCell, MarkdownCell). This means the
execution_count field will always be added to the JSON string.

What to do then?

  • We could use a custom deserializer and check the cell_type.
  • Or split the Cell struct into 3 distinct types using an enum:
enum CellType {
    RawCell(RawCell),
    MarkdownCell(MarkdownCell),
    CodeCell(CodeCell)
}

struct RawCell {
    ...
}

struct MarkdownCell {
    ...
}

struct CodeCell {
    ...
}

2

For some objects, additionalProperties is true, which means unknown properties
may be present and we have to preserve them while deserializing. We do this
using:

pub struct RawNotebookMetadata {
    // ... other fields

    #[serde(flatten)]
    pub other: BTreeMap<String, Value>,
}

But now the order won't be alphabetical: wherever the other field is present,
all of its keys will be flattened at that position in the struct.

In the following example, the nbconvert_exporter and version keys are extra.
As the other field in our struct is declared last, both extra keys have been
moved to the end.

    "language_info": {
     "codemirror_mode": {
      "name": "ipython",
      "version": 3
     },
    "file_extension": ".py",
    "mimetype": "text/x-python",
    "name": "python",
-   "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
+   "nbconvert_exporter": "python",
    "version": "3.11.3"
   }

@dhruvmanila (Member, Author)

Actually, we should resolve (1), as otherwise the validation fails:

NotebookValidationError: Additional properties are not allowed ('execution_count' was unexpected)

Failed validating 'additionalProperties' in markdown_cell:

On instance['cells'][0]:
{'cell_type': 'markdown',
 'execution_count': None,
 'id': 'f3c286e9-fa52-4440-816f-4449232f199a',
 'metadata': {},
 'source': '# Ruff Test'}

@github-actions bot (Contributor) commented Jun 15, 2023

PR Check Results

Ecosystem

✅ ecosystem check detected no changes.

Benchmark

Linux

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.00      7.5±0.02ms     5.4 MB/sec    1.01      7.6±0.02ms     5.4 MB/sec
formatter/numpy/ctypeslib.py               1.00   1572.2±5.75µs    10.6 MB/sec    1.01   1583.7±9.09µs    10.5 MB/sec
formatter/numpy/globals.py                 1.00    152.3±0.53µs    19.4 MB/sec    1.00    152.5±1.02µs    19.4 MB/sec
formatter/pydantic/types.py                1.00      3.1±0.01ms     8.3 MB/sec    1.01      3.1±0.01ms     8.2 MB/sec
linter/all-rules/large/dataset.py          1.01     16.2±0.10ms     2.5 MB/sec    1.00     16.0±0.08ms     2.5 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      3.9±0.01ms     4.3 MB/sec    1.00      3.9±0.01ms     4.2 MB/sec
linter/all-rules/numpy/globals.py          1.00    493.3±0.97µs     6.0 MB/sec    1.02    502.7±7.38µs     5.9 MB/sec
linter/all-rules/pydantic/types.py         1.01      6.9±0.11ms     3.7 MB/sec    1.00      6.9±0.02ms     3.7 MB/sec
linter/default-rules/large/dataset.py      1.00      8.0±0.02ms     5.1 MB/sec    1.00      8.0±0.02ms     5.1 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00   1719.8±7.63µs     9.7 MB/sec    1.00   1726.1±7.83µs     9.6 MB/sec
linter/default-rules/numpy/globals.py      1.00    189.8±0.37µs    15.5 MB/sec    1.01    191.3±0.95µs    15.4 MB/sec
linter/default-rules/pydantic/types.py     1.00      3.6±0.01ms     7.1 MB/sec    1.00      3.6±0.01ms     7.0 MB/sec

Windows

group                                      main                                    pr
-----                                      ----                                    --
formatter/large/dataset.py                 1.00      7.8±0.31ms     5.2 MB/sec     1.08      8.4±0.40ms     4.8 MB/sec
formatter/numpy/ctypeslib.py               1.00  1589.7±111.20µs    10.5 MB/sec    1.05  1676.2±81.96µs     9.9 MB/sec
formatter/numpy/globals.py                 1.00   153.7±11.20µs    19.2 MB/sec     1.02   156.1±11.42µs    18.9 MB/sec
formatter/pydantic/types.py                1.00      3.2±0.16ms     8.0 MB/sec     1.08      3.4±0.20ms     7.4 MB/sec
linter/all-rules/large/dataset.py          1.00     15.9±0.49ms     2.6 MB/sec     1.03     16.4±0.50ms     2.5 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      4.2±0.17ms     4.0 MB/sec     1.00      4.2±0.16ms     4.0 MB/sec
linter/all-rules/numpy/globals.py          1.00   506.5±32.20µs     5.8 MB/sec     1.00   505.1±30.05µs     5.8 MB/sec
linter/all-rules/pydantic/types.py         1.00      7.1±0.34ms     3.6 MB/sec     1.02      7.2±0.32ms     3.5 MB/sec
linter/default-rules/large/dataset.py      1.00      8.3±0.21ms     4.9 MB/sec     1.01      8.3±0.26ms     4.9 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00  1767.2±81.47µs     9.4 MB/sec     1.00  1762.8±77.52µs     9.4 MB/sec
linter/default-rules/numpy/globals.py      1.00   198.9±11.73µs    14.8 MB/sec     1.00   198.2±12.87µs    14.9 MB/sec
linter/default-rules/pydantic/types.py     1.00      3.7±0.15ms     6.8 MB/sec     1.01      3.7±0.19ms     6.8 MB/sec

@konstin (Member) commented Jun 15, 2023

Re 1: My intuition goes towards using an enum and #[serde(flatten)], but either should work. Maybe you can put only the cell type and execution_count on each variant and tell serde to dispatch on cell_type via a container attribute.

Re 2: That's a tricky problem! Maybe https://stackoverflow.com/a/67792465/3549270 would work?

@konstin (Member) commented Jun 15, 2023

I tried the following:

>    "execution_count": null,
14a16
>    "execution_count": null,
112c114,115
<     "\n"
---
>     "\n",
>     ""
395d397
<     "  model_name = f\"rank_{rank_num}\"\n",
476c478,479
<     "\"\"\"))\n"
---
>     "\"\"\"))\n",
>     ""
504a508
>    "execution_count": null,
610,617d613
<   "accelerator": "GPU",
<   "colab": {
<    "gpuType": "T4",
<    "include_colab_link": true,
<    "machine_shape": "hm",
<    "name": "AlphaFold2.ipynb",
<    "provenance": []
<   },
631d626
<    "nbconvert_exporter": "python",
632a628
>    "nbconvert_exporter": "python",
633a630,637
>   },
>   "accelerator": "GPU",
>   "colab": {
>    "gpuType": "T4",
>    "include_colab_link": true,
>    "machine_shape": "hm",
>    "name": "AlphaFold2.ipynb",
>    "provenance": []

This means the editing works correctly (nice!). However, we add an extra "" at the end of lists where Jupyter doesn't, the execution_count field you mentioned shows up on non-code cells, and the metadata fields get reordered.
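A guess at where the extra "" comes from: rebuilding the source array by splitting on '\n' leaves a spurious empty chunk whenever the cell text ends with a newline. A std-only sketch of the pitfall and a fix (hypothetical helpers, not the PR's actual code):

```rust
/// Naive line splitting: a trailing newline yields a spurious "" element
/// that Jupyter's own source arrays never contain.
fn naive_lines(text: &str) -> Vec<String> {
    text.split('\n').map(str::to_string).collect()
}

/// split_inclusive keeps each '\n' attached to its line and emits no
/// trailing empty element, matching nbformat's representation.
fn source_lines(text: &str) -> Vec<String> {
    text.split_inclusive('\n').map(str::to_string).collect()
}

fn main() {
    let text = "import os\n";
    assert_eq!(naive_lines(text), ["import os", ""]);
    assert_eq!(source_lines(text), ["import os\n"]);
}
```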

@dhruvmanila (Member, Author)

Re 2: That's a tricky problem! Maybe stackoverflow.com/a/67792465/3549270 would work?

I tried this, but it didn't work; it just keeps the same order as defined in the struct, which is the default behavior.

we add an extra "" at the end of list where jupyter doesn't,

Nice! Let me take a look at this.

@dhruvmanila (Member, Author)

@konstin Do you mean something like this? (I'm not sure how to use #[serde(flatten)] here)

#[derive(Debug, Serialize, Deserialize, Clone, PartialEq)]
#[serde(tag = "cell_type")]
pub enum Cell {
    #[serde(rename = "code")]
    Code(CodeCell),
    #[serde(rename = "markdown")]
    Markdown(MarkdownCell),
    #[serde(rename = "raw")]
    Raw(RawCell),
}

#[skip_serializing_none]
#[derive(Clone, Debug, Serialize, Deserialize, PartialEq)]
pub struct RawCell {
    pub attachments: Option<Value>,
    pub id: Option<String>,
    pub metadata: Value,
    pub source: SourceValue,
}

/// Notebook markdown cell.
#[skip_serializing_none]
#[derive(Clone, Debug, Serialize, Deserialize, PartialEq)]
pub struct MarkdownCell {
    pub attachments: Option<Value>,
    pub id: Option<String>,
    pub metadata: Value,
    pub source: SourceValue,
}

/// Notebook code cell.
#[derive(Clone, Debug, Serialize, Deserialize, PartialEq)]
pub struct CodeCell {
    pub execution_count: Option<i64>,
    pub id: Option<String>,
    pub metadata: Value,
    pub outputs: Vec<Value>,
    pub source: SourceValue,
}

@charliermarsh charliermarsh requested a review from konstin June 15, 2023 18:48
@konstin (Member) commented Jun 15, 2023

Like this (untested, but I've used flatten in other projects):

use serde::{Serialize, Deserialize};

#[derive(Debug, Serialize, Deserialize, Clone, PartialEq)]
#[serde(tag = "cell_type")]
pub enum Cell {
    #[serde(rename = "code")]
    Code {
        // Note: enum variant fields are implicitly public, so no `pub` here.
        #[serde(flatten)]
        cell: CellInner,
        execution_count: Option<i64>,
    },
    #[serde(rename = "markdown")]
    Markdown {
        #[serde(flatten)]
        cell: CellInner,
    },
    #[serde(rename = "raw")]
    Raw {
        #[serde(flatten)]
        cell: CellInner,
    },
}

#[skip_serializing_none]
#[derive(Clone, Debug, Serialize, Deserialize, PartialEq)]
pub struct CellInner {
    pub attachments: Option<Value>,
    pub id: Option<String>,
    pub metadata: Value,
    pub source: SourceValue,
}

@konstin (Member) commented Jun 15, 2023

Could you add a notebook saved from the web interface of Jupyter Notebook and the same notebook after running it through ruff --fix, and then a test that checks that the string output of ruff --fix matches the saved fixed notebook? This way we can both see how the differences look in the repo and have a way to ensure we don't accidentally change the JSON output. I'd compare the raw text and not the parsed JSON, because that is what people will see in their git diffs.

@dhruvmanila (Member, Author) commented Jun 17, 2023

I would argue to go with my implementation because:

  • Using flatten will give us the same problem as mentioned in (2): the order of the fields will probably not be alphabetical.

     {
       "cell_type": "code",  // coming from the tag
       // inner
       "id": "5e3ef98e-224c-450a-80e6-be442ad50907",
       "metadata": {
         "tags": []
       },
       "outputs": [],
       "source": [
         "import contextlib\n",
         "import random"
       ],
       // cell type specific fields
       "execution_count": 1
     }
  • There are other keys which have the same problem as execution_count:

    • outputs is required only in code cells and not others
    • attachments is required only in raw and markdown cells and not the code cell

@dhruvmanila (Member, Author) commented Jun 17, 2023

Oh, I found the reason why the sorting wasn't working earlier: we've declared serde_json with the preserve_order feature, which preserves insertion order, and that's why the keys were not sorted alphabetically.
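For context, the feature is just a Cargo manifest flag; removing it looks roughly like this (a sketch, assuming a serde_json 1.x dependency):

```toml
# Before: preserve_order swaps serde_json's Map for an IndexMap-backed
# implementation that keeps insertion order.
serde_json = { version = "1.0", features = ["preserve_order"] }

# After: the default BTreeMap-backed Map serializes keys alphabetically.
serde_json = "1.0"
```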

@konstin It seems like the feature was added during the initial PR to support Jupyter notebook (https://github.com/astral-sh/ruff/pull/3440/files#diff-2e9d962a08321605940b5a657135052fbcef87b5e360662bb527c96d9a615542). Any reason why this was added and can we remove it?

@konstin (Member) commented Jun 18, 2023

Using flatten will give us the same problem as mentioned in (2). The order of the fields will probably not be alphabetical.

That's a pity; yours is clearly the better solution then.

It seems like the feature was added during the initial PR to support Jupyter notebook (https://github.com/astral-sh/ruff/pull/3440/files#diff-2e9d962a08321605940b5a657135052fbcef87b5e360662bb527c96d9a615542). Any reason why this was added and can we remove it?

none beyond "feels useful", feel free to remove it

@dhruvmanila (Member, Author)

Update

  • Update the Cell schema to use a tagged enum so that each cell type has an independent structure. This accommodates fields specific to a cell type.
  • Sort the notebook fields in alphabetical order manually, to allow any field order in the struct definition.
  • Add a test case to assert JSON consistency (raw JSON, not parsed).
  • Update the output snapshots, as the preserve_order feature was removed from the serde_json dependency; the fields just got reordered.

That trailing newline...

Well, the original JSON string might contain a trailing newline, which black handles here: https://github.com/psf/black/blob/01b8d3d4095ebdb91d0d39012a517931625c63cb/src/black/__init__.py#L1024. Currently, we don't write the JSON string with a trailing newline; this needs to be handled.
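One way to handle it, sketched with std only (a hypothetical helper, assuming we keep the original file contents around to check for a trailing newline):

```rust
/// Append a trailing newline to the serialized JSON iff the original
/// notebook file ended with one, mirroring black's behavior for .ipynb.
fn restore_trailing_newline(mut json: String, original: &str) -> String {
    if original.ends_with('\n') && !json.ends_with('\n') {
        json.push('\n');
    }
    json
}

fn main() {
    assert_eq!(restore_trailing_newline("{}".to_string(), "{}\n"), "{}\n");
    assert_eq!(restore_trailing_newline("{}".to_string(), "{}"), "{}");
}
```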

@dhruvmanila dhruvmanila requested review from konstin and removed request for konstin June 19, 2023 08:11
@dhruvmanila (Member, Author)

@konstin I believe the integration test for Jupyter notebook here can be removed?

#[cfg(test)]
#[cfg(feature = "jupyter_notebook")]
mod test {
    use std::path::PathBuf;
    use std::str::FromStr;

    use anyhow::Result;
    use path_absolutize::Absolutize;

    use ruff::logging::LogLevel;
    use ruff::resolver::{PyprojectConfig, PyprojectDiscoveryStrategy};
    use ruff::settings::configuration::{Configuration, RuleSelection};
    use ruff::settings::flags::FixMode;
    use ruff::settings::flags::{Cache, Noqa};
    use ruff::settings::types::SerializationFormat;
    use ruff::settings::AllSettings;
    use ruff::RuleSelector;

    use crate::args::Overrides;
    use crate::printer::{Flags, Printer};

    use super::run;

    #[test]
    fn test_jupyter_notebook_integration() -> Result<()> {
        let overrides: Overrides = Overrides {
            select: Some(vec![
                RuleSelector::from_str("B")?,
                RuleSelector::from_str("F")?,
            ]),
            ..Default::default()
        };
        let mut configuration = Configuration::default();
        configuration.rule_selections.push(RuleSelection {
            select: Some(vec![
                RuleSelector::from_str("B")?,
                RuleSelector::from_str("F")?,
            ]),
            ..Default::default()
        });
        let root_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
            .join("..")
            .join("ruff")
            .join("resources")
            .join("test")
            .join("fixtures")
            .join("jupyter");
        let diagnostics = run(
            &[root_path.join("valid.ipynb")],
            &PyprojectConfig::new(
                PyprojectDiscoveryStrategy::Fixed,
                AllSettings::from_configuration(configuration, &root_path)?,
                None,
            ),
            &overrides,
            Cache::Disabled,
            Noqa::Enabled,
            FixMode::Generate,
        )?;
        let printer = Printer::new(
            SerializationFormat::Text,
            LogLevel::Default,
            FixMode::Generate,
            Flags::SHOW_VIOLATIONS,
        );
        let mut writer: Vec<u8> = Vec::new();
        // Mute the terminal color codes.
        colored::control::set_override(false);
        printer.write_once(&diagnostics, &mut writer)?;
        // TODO(konstin): Set jupyter notebooks as none-fixable for now
        // TODO(konstin): Make jupyter notebooks fixable
        let expected = format!(
            "{valid_ipynb}:cell 1:2:5: F841 [*] Local variable `x` is assigned to but never used
{valid_ipynb}:cell 3:1:24: B006 Do not use mutable data structures for argument defaults
Found 2 errors.
[*] 1 potentially fixable with the --fix option.
",
            valid_ipynb = root_path.join("valid.ipynb").absolutize()?.display()
        );
        assert_eq!(expected, String::from_utf8(writer)?);
        Ok(())
    }
}

@konstin (Member) commented Jun 19, 2023

I believe the integration test for Jupyter notebook here can be removed?

yep feel free to remove it

@konstin (Member) left a review comment:

The test looks good!

Review comment on crates/ruff/src/jupyter/schema.rs (resolved)
@dhruvmanila dhruvmanila merged commit 48f4f2d into main Jun 19, 2023
@dhruvmanila dhruvmanila deleted the dhruv/jupyter-json-consistency branch June 19, 2023 18:17