Extension handling and refactor #105

nawagers · 2017-12-14T18:57:44Z

Here is a rather big pull request. It fixes #73, #81 and #98, and is a big step towards PR #102.
@tkrajina and @gluap, have a look

Changes in dev:

Change Parsers: Minidom is slower and less memory efficient than lxml and (c)ElementTree from the Standard Library. The only good reason to keep minidom is to have a StdLib solution, but the StdLib's ElementTree API is nearly the same as lxml's. The default parser is now lxml, falling back to cElementTree and then ElementTree.
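A fallback chain like the one described above is usually written as nested `try`/`except ImportError` blocks. The following is only a minimal sketch under that assumption, not the code from this PR; the names `mod_etree`, `XML_LIBRARY` and `library()` are hypothetical.

```python
# Hypothetical sketch of a parser fallback chain; names are illustrative only.
try:
    from lxml import etree as mod_etree  # preferred when lxml is installed
    XML_LIBRARY = "LXML"
except ImportError:
    try:
        # C-accelerated StdLib parser (Python 2 and older Python 3 releases)
        import xml.etree.cElementTree as mod_etree
    except ImportError:
        # Plain StdLib module, always available
        import xml.etree.ElementTree as mod_etree
    XML_LIBRARY = "STDLIB"


def library():
    """Report which implementation was picked up at import time."""
    return XML_LIBRARY


# The ElementTree-style API is the same whichever module was imported.
root = mod_etree.fromstring("<gpx><trk/></gpx>")
print(library(), root.tag, root.find("trk") is not None)
```

Because all three modules share the ElementTree-style API, the rest of the code can use `mod_etree` without caring which one was loaded.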

Parser interface removed: Many functions had a parser option. This option no longer makes sense and was removed. This may break old code.

Remove XML parse wrappers: Many functions for node traversal were wrapped to switch between minidom and eTree. These functions have all been written back to their native eTree calls, and in some cases the logic was cleaned up to be more Pythonic. Functionality from XMLParser, LXMLParser and GPXParser has been condensed into GPXParser.

Improve parsing speed: After removing the abstraction layers needed to support minidom, parsing speed improved a fair amount for lxml. More interesting is that cElementTree runs through all the unittests 2-3x faster than lxml. lxml is generally faster than the StdLib for serialization, but gpxpy has its own serialization routines since its data is not held in an ElementTree structure. Moving forward, consider dropping lxml completely. Someone should do some benchmarking on vastly different platforms (webservers, desktops).

Extensions are a list of Elements: Extensions were previously a dict with tags as keys. This assumed that there was only one extension, that it was only one level deep, and that no interesting information was in XML attributes. The schema supports nearly arbitrary XML and multiple extensions. The easiest way to represent this is a list of Elements (and their subElements). Extensions are now parsed and serialized this way.
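To illustrate why a list of Elements is more expressive than a dict keyed by tag, here is a small sketch using the StdLib ElementTree. The namespace URI, tag names and attribute are made up for the example, and the plain `extensions` list only stands in for the extensions list described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and tag names, for illustration only.
NS = "http://example.com/ext/v1"

# First extension: nested, and carrying an XML attribute.
sensor = ET.Element("{%s}sensor" % NS, {"unit": "bpm"})
hr = ET.SubElement(sensor, "{%s}hr" % NS)
hr.text = "142"

# Second extension with the same shape; a dict keyed by tag could not hold both.
sensor2 = ET.Element("{%s}sensor" % NS, {"unit": "rpm"})
cad = ET.SubElement(sensor2, "{%s}cad" % NS)
cad.text = "88"

extensions = [sensor, sensor2]  # a plain list keeps order, depth and duplicates

for ext in extensions:
    child = ext[0]
    print(ext.tag, ext.get("unit"), child.tag, child.text)
```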

Namespace handling: Previously, gpxpy was namespace agnostic and read/wrote invalid GPX 1.1 when handling extensions. Extensions must live in their own namespace. GPXParser now uses regular expressions to read in all the namespace declarations and save them to the nsmap dictionary. In nsmap, 'defaultns' is the GPX namespace. This is read in on parsing, but overwritten on serialization. During serialization, the xsi namespace is also added. On serialization of extensions, the Clark notation is converted back to a prefix automatically. Anyone building an extension instead of parsing one just needs to add the prefix/URI to the dictionary.
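The two steps described above (collecting namespace declarations with a regular expression, then turning Clark notation back into a prefix) could look roughly like this. It is a sketch, not the PR's implementation: `read_nsmap` and `clark_to_prefixed` are hypothetical helpers, the `ext` prefix and its URI are invented, and the regex only handles double-quoted declarations.

```python
import re

# Matches xmlns="uri" and xmlns:prefix="uri" (double-quoted values only).
XMLNS_RE = re.compile(r'\sxmlns:?([^=\s]*)\s*=\s*"([^"]+)"')


def read_nsmap(xml_text):
    """Collect namespace declarations into a prefix -> URI dict."""
    nsmap = {}
    for prefix, uri in XMLNS_RE.findall(xml_text):
        # An empty prefix is the default namespace; store it as 'defaultns'.
        nsmap[prefix or "defaultns"] = uri
    return nsmap


def clark_to_prefixed(tag, nsmap):
    """Convert '{uri}local' to 'prefix:local' using the nsmap dictionary."""
    if not tag.startswith("{"):
        return tag  # not in Clark notation, nothing to convert
    uri, local = tag[1:].split("}", 1)
    for prefix, known_uri in nsmap.items():
        if known_uri == uri and prefix != "defaultns":
            return "{0}:{1}".format(prefix, local)
    if nsmap.get("defaultns") == uri:
        return local  # default namespace needs no prefix
    raise KeyError("no prefix registered for namespace {0}".format(uri))


sample = ('<gpx xmlns="http://www.topografix.com/GPX/1/1" '
          'xmlns:ext="http://example.com/ext/v1" version="1.1"></gpx>')
nsmap = read_nsmap(sample)
print(nsmap)
print(clark_to_prefixed("{http://example.com/ext/v1}hr", nsmap))
```

Someone building an extension by hand would add their own entry, for example `nsmap["ext"] = "http://example.com/ext/v1"`, before serializing.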

Drop support for 2.6: Unfortunately, the namespace handling of the StdLib ElementTree is poor and was even worse in 2.6, making it incompatible. gpxpy on Python 2.6 still works with lxml, but the CI testing was dropped. No one should be writing new code on 2.6 anyway.

CI Testing: The package does not offer the user the option of choosing the parser, so the tests need two environments: with and without lxml. The new environment variable XMLPARSER is set and installs lxml appropriately. For Python 3.2, lxml is pinned to version 3.6, the last supported version. A function was added to GPXParser to return which implementation is used. A unittest was added to check that lxml loads properly when running in the LXML environment. Unittests were being run twice, once inside and once outside of the coverage container. The extra run was removed. Also, pep8 was changed to pycodestyle (its new name).

No empty tags: During serialization, empty "container" tags were written even if there was no data in their subelements. These fields were defined by the *POINT_FIELDS and were just stored as the text of the tag, like 'copyright'. The tags are now aware of their subelements. If none of the dependents are present, all fields are suppressed until the close of that tag. For example, 'copyright' depends on copyright_author, copyright_year, and copyright_license. The new string is 'copyright:copyright_author:copyright_year:copyright_license'. Specifically required subelements should be prefaced by an '@'. For example, 'link' requires a subelement named 'link', so the new string is 'link:@link'. Currently the '@' is stripped, but enforcement isn't implemented yet.
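As a rough illustration of the 'tag:dep1:@dep2' strings described above, the sketch below splits such a string and decides whether the container tag should be written. Only the two spec strings come from the description; `parse_container_spec`, `should_emit` and the `values` dict are hypothetical.

```python
def parse_container_spec(spec):
    """Split 'tag:dep1:@dep2' into (tag, [(dependent, required), ...])."""
    parts = spec.split(":")
    tag, dependents = parts[0], parts[1:]
    # A leading '@' marks a required subelement; it is stripped from the name.
    return tag, [(d.lstrip("@"), d.startswith("@")) for d in dependents]


def should_emit(spec, values):
    """Emit the container only if at least one dependent has data."""
    _tag, dependents = parse_container_spec(spec)
    return any(values.get(name) is not None for name, _required in dependents)


copyright_spec = "copyright:copyright_author:copyright_year:copyright_license"
link_spec = "link:@link"

print(parse_container_spec(link_spec))
print(should_emit(copyright_spec, {}))                          # nothing set
print(should_emit(copyright_spec, {"copyright_year": "2017"}))  # one field set
```

Like the description says of the PR, this sketch records the required flag but does not enforce it.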

Hash functions: The __hash__ dunder should not be replaced for mutable objects. The default behavior of __hash__ is such that hash(A) == hash(B) only when A is B, and not necessarily when A == B. Implementing __hash__ on mutable objects breaks their use in a number of StdLib collections. All __hash__ functions have been removed. Unittests for the hash functions have also been removed.
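The problem with a value-based `__hash__` on a mutable object can be seen in a few lines. The `Point` class here is a made-up stand-in, not a gpxpy class.

```python
class Point(object):
    """Hypothetical mutable class with a value-based hash (the anti-pattern)."""

    def __init__(self, lat, lon):
        self.lat = lat
        self.lon = lon

    def __eq__(self, other):
        return (self.lat, self.lon) == (other.lat, other.lon)

    def __hash__(self):
        return hash((self.lat, self.lon))


p = Point(1.0, 2.0)
seen = {p}
found_before = p in seen  # the set locates p by its current hash

p.lat = 5.0               # mutating p changes its hash
found_after = p in seen   # the set still files p under the old hash

print(found_before, found_after, len(seen))
```

The object is still in the set, but membership tests can no longer find it, which is the kind of breakage the paragraph above refers to.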

gpx 1.1 with all fields: The gpx for unittests was updated to include properly namespaced extensions (required by schema).

gpx with extensions: A new test file was added for unittests. The extensions feature multiple extensions, depth, namespaced attributes, text and tail text.

Unittest: The GPX 1.1 with all fields test was updated for the new extension syntax. A new test was added to read a complex extension from xml and verify it reads. Another test builds a minimal gpx from scratch, puts a new extension in each field, serializes the gpx, parses it back, and verifies each field. An additional test adds several extensions, then serializes with version 1.0 to verify that no extensions are written.

String handling: The language's preferred way to format strings is the .format() function instead of the % operator. This was changed in many places. Long string concatenation should be done by appending to a list and then using ''.join(). Strings are immutable, so a new string is created for every += operation.
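A short sketch of the pattern described above: format each piece with `.format()`, collect the pieces in a list, and join once at the end. The `build_xml` helper is hypothetical and uses explicit positional indices in the format strings.

```python
def build_xml(tag, children):
    """Serialize (name, value) pairs inside a container tag."""
    parts = ['<{0}>'.format(tag)]
    for name, value in children:
        # Append each fragment to the list instead of using += on a string.
        parts.append('<{0}>{1}</{0}>'.format(name, value))
    parts.append('</{0}>'.format(tag))
    return ''.join(parts)  # one final string is built here


xml = build_xml('wpt', [('name', 'A'), ('ele', '12.5')])
print(xml)
```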

to_xml functions: The extensions need access to the nsmap, so all to_xml functions have an nsmap argument that defaults to None. Also added were prettyprint and indent. These default to True and '' respectively and implement prettyprinting by increasing the indent for every nested element.
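The indent-per-nesting-level idea can be sketched with a small recursive function. This is not gpxpy's `to_xml`; `sketch_to_xml`, its nested-list input format and the two-space step are assumptions made for the example, and only the `prettyprint` and `indent` parameter names and defaults are taken from the description.

```python
def sketch_to_xml(tag, children, prettyprint=True, indent=''):
    """Serialize (name, value) pairs; a list value is a nested element."""
    newline = '\n' if prettyprint else ''
    pad = indent if prettyprint else ''
    child_indent = indent + '  '  # assumed step: two spaces per level
    child_pad = child_indent if prettyprint else ''

    parts = ['{0}<{1}>{2}'.format(pad, tag, newline)]
    for name, value in children:
        if isinstance(value, list):
            # Nested element: recurse with a deeper indent.
            parts.append(sketch_to_xml(name, value, prettyprint, child_indent))
        else:
            parts.append('{0}<{1}>{2}</{1}>{3}'.format(child_pad, name, value, newline))
    parts.append('{0}</{1}>{2}'.format(pad, tag, newline))
    return ''.join(parts)


track = [('name', 'A'), ('trkseg', [('ele', '1')])]
pretty = sketch_to_xml('trk', track)
compact = sketch_to_xml('trk', track, prettyprint=False)
print(pretty)
print(compact)
```

Passing `prettyprint=False` drops both the newlines and the padding, so the same routine produces compact output.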

default output version: Extension data can only be stored in version 1.1. If the gpx is written in 1.0, all extension data will be lost. Since it's better to preserve data, the default version was changed to 1.1. Version 1.1 is probably also more widely supported.

coveralls · 2017-12-14T19:03:34Z

Coverage decreased (-1.6%) to 83.167% when pulling 6c0dfa2 on nawagers:ExtensionETree into e6c8022 on tkrajina:dev.

coveralls · 2017-12-14T19:06:46Z

Coverage increased (+0.2%) to 85.0% when pulling 93ee2d8 on nawagers:ExtensionETree into e6c8022 on tkrajina:dev.

ghost · 2017-12-26T20:49:04Z

gpxpy/parser.py

        """
        Parse the XML and return a GPX object.

        Args:
            version: str or None indicating the GPX Schema to use.
                Options are '1.0', '1.1' and None. When version is None
-                the version is read from the file or falls back on 1.0. 
-
+                the version is read from the file or falls back on 1.0.


After the change it now falls back on 1.1.

ghost · 2017-12-28T20:59:40Z

I would like to support this pull request and I suggest pulling it in before #102. The refactorings here allow for a cleaner rewrite of #102:

- With nsmap handed down in this PR, extensions can be serialized with abbreviated namespaces instead of specifying the full namespace for each extension element (as is currently the case in #102).
- The tree now stored below the extension element allows for very convenient access to the data; this was harder before the removal of minidom. Serialization can also be handled conveniently by the ElementTree/lxml modules instead of hand-writing strings.

Furthermore I think that the removal of abstraction layers that became obsolete when removing minidom will make it easier for newcomers to contribute.

Well done @nawagers!

nawagers · 2018-01-19T14:09:34Z

@tkrajina Any estimate when you'll have a chance to look at this? Would it help if it was broken into many smaller PRs? Do you want another person to help maintain the package by reviewing and approving PRs?

javisantana · 2018-01-25T19:58:34Z

I've been working with this branch and I didn't find any problem (a little bit faster with lxml but not impressive)

nawagers · 2018-01-26T15:33:51Z

Glad to hear it's working for you.

tkrajina · 2018-01-26T16:51:03Z

@nawagers assuming the email in your commits is correct, I sent you an email now.

micooke · 2018-03-28T07:47:32Z

Working for me. Just a note that the test to confirm the contents of the extensions' children is not fleshed out.

tkrajina · 2018-04-04T07:22:31Z

Merged in dev, will be included in the next release.

nawagers and others added 30 commits December 10, 2017 13:55

Refactoring XMLParser

74a3fb9

removed get_children, get_node_data, get_node_attribute from abstraction layer and put code directly in gpxfield parser

Simplify get_first_child, unittest only once

e4675f4

Removed extra checking from get_first_child. Updated travis to only run tests once (just with coveralls)

Remove XMLParser

d8d894c

Completely remove XMLParser and consolidate some code in GPXParser

Regression with cElementTree support

228a1c3

Std Lib doesn't read in comments and doesn't support the remove_comments parser option. Std Lib also doesn't support QName methods the same as lxml.

Regression: circular import, removed unused parser args

a8f9c4b

Fixed a circular import on older versions of python, expected to fail unittests with 2.x and LXML

Removing gpx_check_slots_and_default_values

d4e8e27

This function seems to do nothing of consequence. Also adding htmlcov (the coverage html folder) to .gitignore, and a few pycodestyle fixes in parser.py.

Added GPXParser.__library() for unittests

79dec0d

Added a new __library function to provide convenient access to whether LXML is loaded or not. The LXML status should only be needed when running the test suite.

Switch to super()

d10f571

Roll back super(), forgot about Python2

107a244

Future proof version check

2761769

Best guess for Python 4+ is that strings will remain unicode

switched first_child to find()

fdcdeff

removed deprecated getchildren() call by using node.find().

Switch to .format() and .join() for strings

909c4de

Use explicit positions in format for 2.6 compat

654f1d3

I hate namespaces [skip ci]

c5243d7

namespace prefix map cleanup [skip ci]

437d351

Regression: no maxsplit in older versions of Python

fc56a2b

Regression: Python 2.6 dict syntax

93babf7

Regression 2.6 being whiny about some syntax

174448a

Regression: 2.6 dict syntax

902d712

Regression: Typo

fd1bf9d

OK now... time for a successful build :(

Drop 2.6 & STDLIB from travis

48f2590

ElementTree in 2.6 does not support register_namespaces, which is a key part of namespace handling for prefixes. No one should be writing new code on 2.6 anyway...

.travis format

f607dd7

Revert ".travis format"

f472c14

This reverts commit f607dd7.

Drop 2.6

688e889

Attrib serialization, switching computers [skip ci]

c0aa992

Added attribute serialization for ETree nodes

Serialization changes

055ccc2

Updated several serialization routines and changed handling for empty container tags like <link></link>. Laid out some of the framework for prettyprint

Pretty Print

07bbbe0

Finished adding pretty print handling for serialization.

Add tests, switch extensions to list

60eaca9

Added unittests for each extension spot, initialized all extensions as lists instead of None. Code formatting cleanups throughout.

lxml doesn't support saving attributes as dict

45defc2

changed to set individual attribs as key, value pairs

nawagers added 3 commits December 14, 2017 13:07

Typo...

1776646

Always run unittests one last time before committing...

Default version, 1.0 extension test, small bug

6c0dfa2

Changed default output version to 1.1, added a unittest to verify no extensions are written in 1.0, found a small bug in the Email handling.

Remove extensions.py

93ee2d8

Accidentally left work in progress

ghost reviewed Dec 26, 2017

tkrajina merged commit 93ee2d8 into tkrajina:dev Apr 4, 2018

tkrajina mentioned this pull request Apr 4, 2018

Add support for Garmin and Cluetrust Trackpoint extension #102

Closed

hugovk mentioned this pull request Aug 13, 2018

Drop support for EOL Python #134

Merged
