Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for --wrap option #129

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

mhucka
Copy link

@mhucka mhucka commented Feb 5, 2019

The previous behavior of bagit.py was asymmetrical with respect to how
it treated long tag lines: it would handle folded lines when reading
the bag-info.txt file, but it would not fold long lines when writing
long tag values into bag-info.txt. This was hardwired into the
function _make_tag_file(), which explicitly stripped line endings from
tag values before writing them.

Section 2.2.2 of the BagIt spec (https://tools.ietf.org/html/rfc8493)
states the following:

It is RECOMMENDED that lines not exceed 79 characters in length.
Long values MAY be continued onto the next line by inserting a LF,
CR, or CRLF, and then indenting the next line with one or more linear
white space characters (spaces or tabs). Except for linebreaks, such
padding does not form part of the value.

This PR adds a new command-line option, --wrap, and a new
parameter named line_width to the functions make_bag() and
save(), to make it possible to follow the recommendation. The
default value is 0, which means don't wrap (which is the original
behavior). An integer value greater than 0 causes line-wrapping to be
performed on a best-effort basis to limit line lengths to the given
value.

This addresses issue #126.

The previous behavior of `bagit.py` was asymmetrical with respect to how
it treated long tag lines: it would handle folded lines when reading
the `bag-info.txt` file, but it would not fold long lines when writing
long tag values into `bag-info.txt`.  This was hardwired into the
function `_make_tag_file()`, which explicitly stripped line endings from
tag values before writing them.

Section 2.2.2 of the BagIt spec (https://tools.ietf.org/html/rfc8493)
states the following:

   It is RECOMMENDED that lines not exceed 79 characters in length.
   Long values MAY be continued onto the next line by inserting a LF,
   CR, or CRLF, and then indenting the next line with one or more linear
   white space characters (spaces or tabs).  Except for linebreaks, such
   padding does not form part of the value.

This commit adds a new command-line option, `--wrap`, and a new
parameter named `line_width` to the functions `make_bag()` and
`save()`, to make it possible to follow the recommendation. The
default value is 0, which means don't wrap (which is the original
behavior).  An integer value greater than 0 causes line-wrapping to be
performed on a best-effort basis to limit line lengths to the given
value.
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 83.303% when pulling 80686e8 on caltechlibrary:wrap-tag-lines into 8a8263e on LibraryOfCongress:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 83.303% when pulling 80686e8 on caltechlibrary:wrap-tag-lines into 8a8263e on LibraryOfCongress:master.

@coveralls
Copy link

coveralls commented Feb 5, 2019

Coverage Status

Coverage increased (+0.3%) to 83.848% when pulling ad9a27d on caltechlibrary:wrap-tag-lines into 8a8263e on LibraryOfCongress:master.

If the given line length is shorter than the length of a tag/property
name, then we should start the value on the next line in order to try
to keep to the desired length.  This commit adds a test for that
situation and makes line wrapping behave a little bit better.
@acdha acdha self-requested a review February 6, 2019 20:22
bagit.py Outdated
@@ -1499,6 +1527,17 @@ def _make_parser():
" without performing checksum validation to detect corruption."
),
)
parser.add_argument(
"--wrap",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea but I was wondering whether “wrap” might be a little too generic a name for the option. What do you think about something like --wrap-tags or --max-tag-line-length?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I struggled with naming the option! The current result was the outcome of trying to satisfy two goals: (1) emulating the existing bagit.py style of using mostly single words (for those options other than the metadata fields), and (2) accounting for the fact that the argument parser displays the associated variable name as part of the help text. The latter means that one ends up having to think about the argument name and the variable jointly. The current combination leads to the following output for bagit.py -h:

--wrap LINE_LENGTH    Limit line lengths in the tag file (bag-info.txt) by
                      wrapping long values and indenting subsequent lines
                      with a space character. (Default: don't.)

I had hoped this was clear enough, but I have to agree that by itself, the current combination leaves ambiguous what is being wrapped. I'm fine with changing it. I lean towards --wrap-tags.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on --wrap-tags. I'd prefer clarity to single words and with that we don't to touch either dest or add a metavar option.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Just mentioning it in case it wasn't obvious: all these discussed changes are in the update to the PR made a few days ago.)

bagit.py Outdated
@@ -137,7 +138,8 @@ def find_locale_dir():


def make_bag(
bag_dir, bag_info=None, processes=1, checksums=None, checksum=None, encoding="utf-8"
bag_dir, bag_info=None, processes=1, checksums=None, checksum=None,
encoding="utf-8", line_length=0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering whether line_length should be something like max_tag_line_length so it's a little more obvious what it applies to

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree line_length is a bit too general. In thinking what would make sense with an option named --wrap-tags (c.f. conversation above), I worry max_tag_line_length might be too much:

  --wrap-tags MAX_TAG_LINE_LENGTH
                        Limit line lengths in the tag file (bag-info.txt) by
                        wrapping long values and indenting subsequent lines
                        with a space character. (Default: don't.)

What about tag_wrap_column? That leads to:

  --wrap-tags TAG_WRAP_COLUMN
                        Limit line lengths in the tag file (bag-info.txt) by
                        wrapping long values and indenting subsequent lines
                        with a space character. (Default: don't.)

bagit.py Outdated
first_break = prop_width + len(txt.split(None, 1)[0])
if line_length <= first_break:
# Start value on the next line.
txt = '\n'.join(textwrap.wrap(txt, width=line_length,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two calls are almost identical but it takes time to confirm that. Do you think it would be worth assigning initial_indent first so we can have an identical textwrap call? I'm also wondering whether this block of code is now long enough that we should break it into a separate function so this would just be for txt in values: txt = wrap_tag(h, txt, line_length)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the duplication: although initial_indent could be pulled out so that there's only one call to wrap, the problem is that there's a 2nd step after the call to wrap that depends on the same condition (see line 1257). So, the resulting alternative code would have a conditional for setting initial_indent, followed by the call to wrap, followed by an identical conditional to adjust the output of wrap. I find this alternative less readable than the current code (I tried it to see), although I realize this may be a matter of personal preference. Still, what about simply moving the parameter initial_indent higher up in the call to wrap, so that it's easier to spot the difference quickly?

Regarding splitting it out to a separate function: agreed. However, I might recommend naming the function differently, if you don't object, because in this case it wouldn't necessarily wrap the tag value. So perhaps something like _format_tag_value?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and implemented the changes discussed above in the latest commit.

This attempts to address issues raised during the review of [pull
request
129](LibraryOfCongress#129).
Changes involve renaming the `--wrap` argument and associated internal
variable, and factoring out the line wrapping code into a separate function.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants