Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extension handling and refactor #105

Merged
merged 33 commits into from
Apr 4, 2018
Merged

Conversation

nawagers
Copy link
Contributor

Here is a rather big pull request. It will fix #73 #81 #98 in addition to a big step towards pr #102.
@tkrajina and @gluap, have a look

Changes in dev:

  • Change Parsers: Minidom is slower and less memory efficient than lxml

and (c)ElementTree from the Standard Library. The only good reason to keep

minidom is to have StdLib solution, but StdLib's ElementTree package API is

nearly the same as lxml. The default parser is now lxml and falls back to

cElementTree and ElementTree.

  • Parser interface removed: Many functions had a parser option. This

option no longer makes sense and was removed. This may break old code.

  • Remove XML parse wrappers: Many functions for node traversal were

wrapped to switch between minidom and eTree. The functions have all be

written back to their native eTree calls and in some cases, the logic was

cleaned up to be more Pythonic. Functionality from XMLParser, LXMLParser and

GPXParser have all been condensed down to GPXParser.

  • Improve parsing speed: After removing the abstraction layers needed to

support minidom, parsing speed improved a fair amount for lxml. More

interesting is that cElementTree runs through all the unittests 2-3x faster

than lxml. lxml is generally faster than the StdLib for serialization, but

gpxpy has it's own serialization routines since it's not in an ElementTree

structure. Moving forward, consider dropping lxml completely. Someone should

do so benchmarking on vastly different platforms (webservers, desktops).

  • Extensions are list of Elements: Extensions were previously a dict

with tags as keys. This made the assumption that there was only one

extension, and that it was only 1 level deep, and no interesting information

was in xml attributes. The schema supports nearly arbitrary xml and multiple

extensions. The easiest way to represent this is a list of Element (and

their subElements). It now parses and serializes in this way.

  • Namespace handling: Previously, gpxpy was namespace agnostic, and

read/wrote invalid GPX 1.1 when handling extensions. Extensions must live in

their own namespace. GPXParser now uses Regular Expressions to read in all

the namespace declarations and save them to the nsmap dictionary. In nsmap,

'defaultns' is the GPX namespace. This is read in on parsing, but

overwritten on serialization. During serialization, the xsi namespace is

also added. On serialization of extensions, the Clark notation is converted

back to a prefix automatically. Anyone building an extension instead of

parsing one just needs to add the prefix/URI to the dictionary.

  • Drop support for 2.6: Unfortunately, the namespace handling of the

StdLib ElementTree is poor and was even worse in 2.6, making it

incompatible. gpxpy on Python 2.6 still works with lxml, but the CI testing

was dropped. No one should be writing new code on 2.6 anyway.

  • CI Testing: The package does not offer the user the option of choosing the

parser, so the tests need 2 environments: with and without lxml. The new

environment variable XMLPARSER is set and installs lxml appropriately. For

python 3.2, lxml is pinned to version 3.6, the last supported version. A

function was added to GPXParser to return which implementation is used. A

unittest was added to check that lxml loads properly when running in the

LXML environment. Unittests were being run twice, once inside and once

outside of the coverage container. The extra run was removed, also pep8 was

changed to pycodestyle (new name).

  • No empty tags: During serialization, empty "container" tags were made

even if there was no data in subelements. These fields were defined by the

*POINT_FIELDS and were just stored as the text of the tag, like 'copyright'.

The tags are now aware of the subelements. If none of the dependents are

there, all fields are suppressed until the close of that tag. For example,

'copyright' depends on copyright_author, copyright_year, and

copyright_license. The new string is

'copyright:copyright_author:copyright_year:copyright_license'. Specifically

required subelements should be prefaced by an '@', like 'link' requires a

subelement named 'link' so the new string is link:@link. Currently '@' is

stripped, but enforcement isn't implemented yet.

  • Hash functions: The __hash__ dunder should not be replaced for mutable

objects. The default behavior if __hash__ is written such that hash(A) ==

hash(B) only when A is B and not necessarily when A == B. Implementing

__hash__ on mutable objects will break the usage on a number of StdLib

collections. All __hash__ functions have been removed. Unittests for the

hash functions have also been removed.

  • gpx 1.1 with all fields: The gpx for unittests was updated to include

properly namespaced extensions (required by schema).

  • gpx with extensions: A new test file was added for unittests. The

extensions feature multiple extensions, depth, namespaced attributes, text

and tail text.

  • Unittest: The GPX 1.1 with all fields test was updated for the new

extension syntax. A new test was added to read a complex extension from xml

and verify it reads. Another test builds a minimal gpx from scratch, puts a

new extension in each field, serializes the gpx, parses it back, and

verifies each field. An additional test adds several extensions, then

serializes with version 1.0 to verify that no extensions are written.

  • String handling: The language preferred way to format strings is with

the .format() function instead of the % operator. This was changed in many

places. Long string concatenation should be done by appending to a list and

then using ''.join(). Strings are immutable, so a new string is created for

every += operation.

  • to_xml functions: The extensions need access to the nsmap, so all

to_xml functions have a nsmap argument that defaults to None. Also added was

prettyprint and indent. These default to True and '' respectively and

implement prettyprinting by increasing the indent for every nested element.

  • default output version: Extension data can only be stored in version

1.1. If the gpx is written in 1.0, all extension data will be lost. Since

it's better to preserve data, changing default version to 1.1. Version 1.1

probably more widely supported also.

nawagers and others added 30 commits December 10, 2017 13:55
removed get_children, get_node_data, get_node_attribute from abstraction layer and put code directly in gpxfield parser
Removed extra checking from get_first_child. Updated travis to only run tests once (just with coveralls)
Completely remove XMLParser and consolidate some code in GPXParser
Std Lib doesn't read in comments and doesn't support the remove_comments parser option. Std Lib also doesn't support QName methods the same as lxml.
Fixed a circular import on older versions of python, expected to fail unittests with 2.x and LXML
This function seems to do nothing of consequence. Also adding htmlcov (the coverage html folder) to .gitignore, and a few pycodestyle fixes in parser.py.
Added a new __library function to provide convenient access to whether LXML is loaded or not. The LXML status should only be needed when running the test suite.
Best guess for Python 4+ is that strings will remain unicode
removed deprecated getchildren() call by using node.find().
Numerous changes here:
changed extensions to list of ETrees
added serialization of extensions
changed namespace handling to properly prefix and search
removed __hash__ functions on mutable objects (makes no sense and broken with extensions)
removed tests using hash()
OK now... time for a successful build :(
ElementTree in 2.6 does not support register_namespaces, which is a key part of namespace handing for prefixes. No one should be writing new code on 2.6 anyway...
This reverts commit f607dd7.
Added attribute serialization for ETree nodes
Updated several serialization routines and changed handling for empty containter tags like <link></link>.  Laid out some of the framework for prettyprint
Finished adding pretty print handling for serialization.
Added unittests for each extension spot, initialized all extensions as lists instead of None. Code formatting cleanups throughout.
changed to set individual attribs as key, value pairs
Always run unittests one last time before committing...
Changed default output version to 1.1, added a unittest to verify no extensions are written in 1.0, found a small bug in the Email handling.
Accidentally left work in progress
@coveralls
Copy link

coveralls commented Dec 14, 2017

Coverage Status

Coverage decreased (-1.6%) to 83.167% when pulling 6c0dfa2 on nawagers:ExtensionETree into e6c8022 on tkrajina:dev.

@coveralls
Copy link

coveralls commented Dec 14, 2017

Coverage Status

Coverage increased (+0.2%) to 85.0% when pulling 93ee2d8 on nawagers:ExtensionETree into e6c8022 on tkrajina:dev.

"""
Parse the XML and return a GPX object.

Args:
version: str or None indicating the GPX Schema to use.
Options are '1.0', '1.1' and None. When version is None
the version is read from the file or falls back on 1.0.

the version is read from the file or falls back on 1.0.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the change it now falls back on 1.1.

@ghost
Copy link

ghost commented Dec 28, 2017

I would like to support this pull request and I suggest pulling it in before #102. The refactorings here allow for cleaner re-write of #102:

  1. With nsmap handed down in this PR, extensions can be serialized with abbreviated namespaces instead of specifying the full namespace of the elements for each extension element (as is currently the case in Add support for Garmin and Cluetrust Trackpoint extension #102)
  2. The now-stored tree below the extension element now allows for very convenient access to the data -- this was harder before the removal of minidom. Serialization can also be handled conveniently by the ElementTree/lxml modules instead of hand-writing strings.

Furthermore I think that the removal of abstraction layers that became obsolete when removing minidom will make it easier for newcomers to contribute.

Well done @nawagers!

@nawagers
Copy link
Contributor Author

@tkrajina Any estimate when you'll have a chance to look at this? Would it help if it was broken into many smaller PRs? Do you want another person to help maintain the package by reviewing and approving PRs?

@javisantana
Copy link

I've been working with this branch and I didn't find any problem (a little bit faster with lxml but not impressive)

@nawagers
Copy link
Contributor Author

Glad to hear it's working for you.

@tkrajina
Copy link
Owner

tkrajina commented Jan 26, 2018

@nawagers assuming the email in your commits is correct, I sent you an email now.

@micooke
Copy link

micooke commented Mar 28, 2018

Working for me. Just a note that the test to confirm the extensions children contents is not fleshed out.

@tkrajina tkrajina merged commit 93ee2d8 into tkrajina:dev Apr 4, 2018
@tkrajina
Copy link
Owner

tkrajina commented Apr 4, 2018

Merged in dev, will be included in the next release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants