Merge pull request #11 from UUDigitalHumanitieslab/subj

Subj
UUDigitalHumanitieslab · Feb 15, 2023 · 48749e1
2 parents 4f646e3 + b74a984
commit 48749e1
Showing 40 changed files with 1,325 additions and 190 deletions.
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -0,0 +1,41 @@
+# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
+# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
+
+name: Unit tests
+
+on:
+  workflow_dispatch:
+  push:
+    paths-ignore:
+      - '**.md'
+
+jobs:
+  build:
+
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ['3.7', '3.10']
+
+    steps:
+    - uses: actions/checkout@v3
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/setup-python@v4
+      with:
+        python-version: ${{ matrix.python-version }}
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install flake8
+        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+    - name: Prepare PIP package
+      run: |
+        cd pypi
+        ./prepare.sh
+    - name: Lint with flake8
+      run: |
+        flake8  $(cat pypi/include.txt | grep \.py\$) --count --max-complexity=12 --max-line-length=127 --statistics
+    # - name: Run unit tests
+    #   run: |
+    #     pip install pytest
+    #     python -m pytest
diff --git a/CHAT_Annotation.py b/CHAT_Annotation.py
@@ -9,6 +9,8 @@
 
 CHAT = 'CHAT'
 
+CHAT_explanation = 'Explanation'
+
 monadic = 1
 dyadic = 2
 
@@ -642,7 +644,7 @@ def result(x, y): return SDLOGGER.warning(msg.format(x, y))
                     CHAT_SimpleScopedRegex(r'\[!!\]', keep, False, monadic),
                     simplescopedmetafunction),
     # Duration to be added here @@
-    CHAT_Annotation('Explanation', '8.3:69', '10.3:73',
+    CHAT_Annotation(CHAT_explanation, '8.3:69', '10.3:73',
                     CHAT_ComplexRegex((r'\[=', anybutrb, r'\]'), (keep, eps), False),
                     complexmetafunction),
     CHAT_Annotation('Replacement', '8.3:69', '10.3:73',

diff --git a/Documentation/Tarsp.rst b/Documentation/Tarsp.rst
@@ -92,9 +92,9 @@ Explanation:
 
 *Wh*-questions are defined as follows::
 
-    Tarsp_whq = """((@cat="whq" and @rel="--") or (@cat="whsub"))"""
+    Tarsp_whq = """((@cat="whq" and @rel="--") or (@cat="whsub") or (@cat="whrel" and @rel="--"))"""
     
-covering both main clauses (*whq*) and subordinate clauses (*whsub*).
+covering main clauses (*whq*) and subordinate clauses (*whsub*), as well as clauses that Alpino has analysed as independent relative clauses (*whrel* with relation "--"), which are usually in fact wh-questions.
 
 
 .. _imperatives:
@@ -241,10 +241,12 @@ Language measures such as *WBVC*, *OndVC*, *OndW*, *OndWB*, *OndWBVC*, and sever
     
   where::
     
-        subject = """(@rel="su")"""
+        subject = """(@rel="su" and parent::node[(@cat="smain" or @cat="sv1" or @cat="ssub")])"""
 
+Here we only count as subjects those nodes (full or index nodes) that have a finite clause node as parent. 
+Overt subjects never occur in nonfinite clauses, and subject index nodes should not be counted inside nonfinite clauses. Subject index nodes do occur in finite clauses, e.g., as the "trace" of a wh-movement. 
 
-It differs from the definition of :ref:`T063_Ond` because T063 has to exclude subjects of nonfinite nodes (which is covered in the composed language measures by the conditions on mood) and to cover an additional case (existential *er* as a subject (though maybe that should be included here as well) 
+It differs from the definition of :ref:`T063_Ond` because T063 has to cover an additional case: existential *er* as a subject (though maybe that should be included here as well).
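
The restriction to finite-clause parents can be illustrated outside XPath. The following is a minimal sketch, not the sastadev implementation: it uses only the standard library, and the helper name `finite_subjects` and the simplified toy tree for *Hij heeft gezwommen* are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Categories that the definition above treats as finite clauses or clause bodies.
FINITE_CATS = {"smain", "sv1", "ssub"}

# Toy tree (hypothetical, heavily simplified) for "Hij heeft gezwommen".
TOY_TREE = """
<node cat="smain" rel="--">
  <node rel="su" index="1" pt="vnw" word="hij"/>
  <node rel="hd" pt="ww" word="heeft"/>
  <node cat="ppart" rel="vc">
    <node rel="su" index="1"/>
    <node rel="hd" pt="ww" word="gezwommen"/>
  </node>
</node>
"""


def finite_subjects(root):
    """Return the nodes with rel="su" whose parent is a finite clause node."""
    # ElementTree has no parent axis, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter("node") for child in parent}
    return [
        node
        for node in root.iter("node")
        if node.get("rel") == "su"
        and node in parent_of
        and parent_of[node].get("cat") in FINITE_CATS
    ]


subjects = finite_subjects(ET.fromstring(TOY_TREE))
print([node.get("word") for node in subjects])
```

In this toy tree only *hij* is returned: the bare index node under the participle has a *ppart* parent and is therefore skipped.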
 
 * **Tarsp_B**: see :ref:`T007_B`
 * **Tarsp_VC**: is defined as follows::
@@ -435,16 +437,23 @@ T006: Avn
 * **Implementation**: Xpath
 * **Query** defined as::
 
-    //node[@pt="vnw"  and @vwtype="aanw" and @lemma!="hier" and @lemma!="daar" and @lemma!="er" and 
-    @rel!="det" and (not(@positie) or @positie!="prenom") ]
+    AVn = """(%coreavn% or %avnrel%)"""
+    coreavn = """(@pt="vnw"  and @vwtype="aanw" and @lemma!="hier" and @lemma!="daar" and @lemma!="er" and @rel!="det" and (not(@positie) or @positie!="prenom") )"""
+    avnrel = """(%diedatrel% and parent::node[@cat="rel" and @rel!="mod"])"""
+    diedatrel = """(@pt="vnw" and @vwtype="betr" and @rel="rhd" and (@lemma="die" or @lemma="dat"))"""
 
+The query for *AVn* consists of two subcases: the core case, and the case of *die* and *dat* incorrectly analysed as relative pronouns in an independent relative clause.
 
-* The query with pt equal to *vnw* and vwtype equal to *aanw* selects demonstrative pronouns, but
+The core case (*coreavn*):
+
+* has *pt* equal to *vnw* and *vwtype* equal to *aanw*, which selects demonstrative pronouns, but
 * these include R-pronouns, so they are explicitly excluded
 * the relation must not be *det* (otherwise the pronouns are not used independently)
 * and if a *position* attribute is present it should not have the value *prenom* (otherwise it is not used independently)
 
-* **Schlichting**: "Aanwijzend Voornaamwoord: 'die', 'dit', 'deze', 'dat' zelfstandig gebruikt." fully covered.
+The relative case (*avnrel*) covers the relative pronouns *die* and *dat* (*diedatrel*) in an independent relative clause (i.e. *rel* is not equal to *mod*).  
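
How such nested macros end up as a single XPath condition can be sketched as follows. This is an illustration only: the function `expand` and its regex-based substitution are assumptions, not necessarily the macro mechanism that sastadev itself uses. The macro texts are adapted from the definitions above, with whitespace normalised.

```python
import re

MACROS = {
    "AVn": """(%coreavn% or %avnrel%)""",
    "coreavn": """(@pt="vnw" and @vwtype="aanw" and @lemma!="hier" and """
               """@lemma!="daar" and @lemma!="er" and @rel!="det" and """
               """(not(@positie) or @positie!="prenom"))""",
    "avnrel": """(%diedatrel% and parent::node[@cat="rel" and @rel!="mod"])""",
    "diedatrel": """(@pt="vnw" and @vwtype="betr" and @rel="rhd" and """
                 """(@lemma="die" or @lemma="dat"))""",
}

# A macro call is a name between percent signs, e.g. %coreavn%.
MACRO_CALL = re.compile(r"%(\w+)%")


def expand(text, macros, depth=0, max_depth=20):
    """Replace every %name% in text by its definition, recursively."""
    if depth > max_depth:
        raise RecursionError("macro definitions appear to be circular")

    def substitute(match):
        name = match.group(1)
        if name not in macros:
            raise KeyError(f"undefined macro: {name}")
        return expand(macros[name], macros, depth + 1, max_depth)

    return MACRO_CALL.sub(substitute, text)


print(expand("//node[%AVn%]", MACROS))
```

The expanded query contains no remaining `%...%` calls; *avnrel* pulls in *diedatrel*, so the expansion is three levels deep.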
+
+* **Schlichting**: "Aanwijzend Voornaamwoord: 'die', 'dit', 'deze', 'dat' zelfstandig gebruikt." Fully covered.
 
 
 .. _T007_B:
@@ -2052,26 +2061,24 @@ T063: Ond
 
 Here, **FullOnd** is defined as::
 
-    """((%subject% and (@pt or @cat) ) or %erx%)"""
+    FullOnd = """(%subject% or %erx%)"""
 
 
-where **subject** and **erx** are defined as::
+where **subject** (see :ref:`zinsdelen`) and **erx** are defined as::
 
-    subject = """(@rel="su")"""
+    subject = """(@rel="su" and parent::node[(@cat="smain" or @cat="sv1" or @cat="ssub")])"""
     erx = """((@rel="mod" and @lemma="er" and ../node[@rel="su" and @begin>=../node[@rel="mod" and @lemma="er"]/@end]) or
               (@rel="mod" and @lemma="er" and ../node[@rel="su" and not(@pt) and not(@cat)])
              )
           """
           
-The condition on the presence of *pt* or *cat* is present to exclude (empty) subject of infinitives and participle clauses, e.g. in *Hij heeft gezwommen* *hij* is an antecedent of an (empty) node acting as the subject of the past participle *gezwommen*, and we do not want to include that.
 
 The **erx** macro is to ensure that so-called *expletive er* also counts as a subject. We implemented this in the following manner: *er* is considered (also) expletive (and thus must count as a subject):
 
 * if it precedes the subject  (as in **er** *kwam iemand binnen*), or
 * if there is an empty subject, to cover cases such as *wie zwom* **er**. Note that in *wie heeft* **er** *gezwommen* this *er* is considered a subject because of the empty subject of the participial clause, which is perhaps not what we want.
 
 
-**Remark**: The condition on the presence of *pt* or *cat* incorrectly excludes *wie* in *wie doet dat*, *wie heeft dat gedaan*: *wie* is a *whd* an an antecedent to  an index node with grammatical relation *su* (see e.g. VKLTarsp, sample 3, utterance 25 *weet ik niet* **wie** *daarin zit*. It has no consequences for the scores because *Ond* is not in the form and because in language measures such as OndWB etc a different definition of subject is used. It probably is better to replace the condition on *pt* and *cat* by a condition on the parent node, viz. that it must have as value for the *cat* attribute from one of the values from *smain*, *sv1*, or *ssub* (categories for finite clauses or finite clause bodies).
 
 
 * **Schlichting**: "Dit is de persoon of de zaak die de handeling van het werkwoord uitvoert. Wanneer het onderwerp van een zin in het meervoud staat, staat de persoonsvorm van die zin ook in het meervoud."
@@ -2402,7 +2409,7 @@ Straightforward implementation::
     Tarsp_OndW = """(%declarative% and 
                     %Ond% and  
                     (%Tarsp_W%  or node[%Tarsp_onlyWinVC%]) and 
-                    %realcomplormodnodecount% = 0 )"""
+                    %realcomplormodnodecount% = 1 )"""
 
 See section :ref:`composedmeasures` for details.
 
@@ -3481,28 +3488,16 @@ T106: Vo/bij
 * **Original**: yes
 * **In form**: yes
 * **Page**: 71
-* **Implementation**: Xpath with macros
+* **Implementation**: Python function
 * **Query** defined as::
 
-    //node[node[@pt="vz" and @rel="hd"] and 
-           node[@rel="obj1" and 
-                 ((@index and not(@pt or @cat)) or
-                  (@end < ../node[@rel="hd"]/@begin)
-                 )]]
-
-
-The query searches for a node 
-
-* containing a head adposition node, and
-* containing a node with grammatical relation *obj1*
-
-    * which is an "empty" node
-    * and which precedes the head adposition
+    voslashbij
+ 
+.. autofunction:: queryfunctions::voslashbij
 
 * **Schlichting**: "Voornaamwoordelijk bijwoord, gesplitst. Het gesplitste voornaamwoordelijk bijwoord behoeft niet gescoord te worden bij Vobij, kolom Voornaamwoorden in Fase IV, alleen hier bij de Woordgroepen in Fase V."
 
-* **Remark** Schlichting only gives examples with nonadjacent Rpronouns and adpositions. We agreed with Rob Zwitserlood that adjacent R-pronoun + adposition, even if written separately, is to be counted under *Vobij*. However, the notion "adjacent" is not so easy to define in  inflated tree structure. This still has to be done.
-* **Remark** The condition *@index and not(@pt or @cat)* is better replaced by a macro defined as *@index and not(@word or @cat)*.
+* **Remark** Schlichting only gives examples with nonadjacent R-pronouns and adpositions. We agreed with Rob Zwitserlood that adjacent R-pronoun + adposition, even if written separately, is to be counted under *Vobij*. However, the notion "adjacent" is not so easy to define in inflated tree structures. This has been done in the Python function *adjacent* in the module treebankfunctions, and for this reason this query must also be defined as a Python function.
 
 
 
@@ -3518,24 +3513,20 @@ T107: Vobij
 * **Original**: yes
 * **In form**: yes
 * **Page**: 80
-* **Implementation**: Xpath with macros
+* **Implementation**: Python function
 * **Query** defined as::
 
-    //node[%Vobij%]
+    vobij
+	
+.. autofunction:: queryfunctions::vobij
 
-the macro **Vobij** is defined as follows::
-
-   Vobij = """(@pt="bw" and (contains(@frame,"er_adverb" ) or contains(@frame, "tmp_adverb") or @lemma="daarom") and 
-               @lemma!="er" and @lemma!="daar" and @lemma!="hier" and 
-               (starts-with(@lemma, 'er') or starts-with(@lemma, 'daar') or starts-with(@lemma, 'hier'))
-              )"""
 
 
 
 * **Schlichting**: "Voornaamwoordelijk bijwoord. Het voornaamwoordelijk bijwoord is een combinatie van 'er', 'daar', 'hier', 'waar' met een voorzetsel (bijv. 'aan', 'bij', 'voor') of een bijwoord ('af', 'heen', 'toe')"
 
+* **Remark** Separately written but adjacent R-pronoun + adposition cases are also considered *Vobij*.
 * **Remark** Alpino considers words such as 'af', 'heen', and 'toe' as postpositions, not as adverbs.
-* **Remark** *waar* is lacking. It does not occur in the VKLtarsp data, in the Auris data only *waarom* occurs.
 
 
 
@@ -3604,6 +3595,7 @@ T110: Vr
 
 This language measure actually does not exist. It has been included and implemented because the code *Vr* occurs in the Schlichting appendix  (example 20, p. 91, and example 22, p. 92). Example 20 should have been annotated as *Vr(XY)*, and example 22 as *Vr4*.
 
+.. _VrXY:
 
 T111: Vr(XY)
 """"""""""""
@@ -3624,13 +3616,14 @@ T111: Vr(XY)
 
 Straightforward implementation::
 
-    Tarsp_VrXY = """(%Tarsp_whq% and 
-                     node[@rel="whd"] and
-                     node[@cat="sv1" and 
-                          @rel="body"  and 
-                          %realcomplormodnodecount% = 1
-                         ])"""
+    Tarsp_VrXY = """(%Tarsp_whq% and
+        node[%Tarsp_whqhead%] and
+        node[%whqbody%   and %realcomplormodnodecount% <= 1])"""
 
+where *Tarsp_whqhead* and *whqbody* cover both wh-questions (main and subordinate) and independent relatives::
+
+    Tarsp_whqhead = """(@rel="whd" or @rel="rhd") """
+    whqbody = """((@cat="sv1" or @cat="ssub") and @rel="body")"""
 
 See section :ref:`composedmeasures` for details.
 
@@ -3662,12 +3655,10 @@ T112: Vr4
 Straightforward implementation::
 
     Tarsp_Vr4 = """(%Tarsp_whq% and
-                    node[@rel="whd"] and
-                    node[@cat="sv1" and 
-                         @rel="body"  and 
-                         %realcomplormodnodecount% = 2
-                        ])"""
+        node[%Tarsp_whqhead%] and
+        node[%whqbody%  and %realcomplormodnodecount% = 2])"""
 
+For *Tarsp_whqhead* and *whqbody*, see :ref:`VrXY`
 
 
 See section :ref:`composedmeasures` for details.
@@ -3694,13 +3685,12 @@ T113: Vr5+
 Straightforward implementation::
 
     Tarsp_Vr5plus = """(%Tarsp_whq% and
-                        node[@rel="whd"] and
-                        node[@cat="sv1" and 
-                             @rel="body"  and 
-                             %realcomplormodnodecount% > 2
-                            ])"""
+        node[%Tarsp_whqhead%] and
+        node[%whqbody%  and %realcomplormodnodecount% > 2])"""
+
 
 
+For *Tarsp_whqhead* and *whqbody*, see :ref:`VrXY`
 
 See section :ref:`composedmeasures` for details.
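
The three *Vr* measures differ only in the threshold on the count, so their division of labour can be summarised as below. This is a sketch under assumptions: the function names are hypothetical, the count is passed in as a plain integer rather than computed by the *realcomplormodnodecount* macro, and the clause-level *Tarsp_whq* condition is not checked.

```python
import xml.etree.ElementTree as ET


def is_whqhead(node):
    """Mirror of Tarsp_whqhead: the head is a whd or an rhd."""
    return node.get("rel") in {"whd", "rhd"}


def is_whqbody(node):
    """Mirror of whqbody: an sv1 or ssub node with relation body."""
    return node.get("cat") in {"sv1", "ssub"} and node.get("rel") == "body"


def classify_vr(clause, count):
    """Return the Vr measure for a clause, or None if it lacks a head/body pair.

    `count` stands in for the value of realcomplormodnodecount on the body.
    """
    children = list(clause)
    has_head = any(is_whqhead(child) for child in children)
    has_body = any(is_whqbody(child) for child in children)
    if not (has_head and has_body):
        return None
    if count <= 1:
        return "Vr(XY)"
    if count == 2:
        return "Vr4"
    return "Vr5+"


# Hypothetical, heavily simplified clause with a wh-head and an sv1 body.
clause = ET.fromstring(
    '<node cat="whq" rel="--">'
    '<node rel="whd" word="wie"/>'
    '<node cat="sv1" rel="body"/>'
    '</node>'
)
print([classify_vr(clause, n) for n in (0, 1, 2, 3)])
```

Because the thresholds are `<= 1`, `= 2` and `> 2`, every count falls under exactly one of the three measures.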
 

diff --git a/Documentation/auxiliarymodules.rst b/Documentation/auxiliarymodules.rst
@@ -21,3 +21,16 @@ Deregularise
 ------------
 
 .. automodule:: deregularise
+
+.. _treebankfunctions:
+
+Treebankfunctions
+-----------------
+
+.. _indextransform:
+
+Expansion of bare index nodes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. autofunction:: treebankfunctions::indextransform
+
diff --git a/README.md b/README.md
@@ -1,11 +1,17 @@
 # Sastadev
 
+[![Actions Status](https://github.com/UUDigitalHumanitieslab/sastadev/workflows/Unit%20tests/badge.svg)](https://github.com/UUDigitalHumanitieslab/sastadev/actions)
+
+[pypi sastadev](https://pypi.org/project/sastadev)
+
 Method definitions for use in SASTA
 
 Copy `default_config.py` to your own `config.py` in the `sastadev` directory, and change what you need.
 
 ## Upload to PyPi
 
+Specify the files which should be included in the package in `pypi/include.txt`.
+
 ```bash
 cd pypi
 ./prepare.sh

diff --git a/alpinoparsing.py b/alpinoparsing.py
@@ -19,17 +19,20 @@
 from memoize import memoize
 
 import logging
-from typing import Optional
 #from sastatypes import SynTree, URL
 
 #from config import SDLOGGER
 #from sastatypes import SynTree, URL
 
+urllibrequestversion = urllib.request.__version__
+
 alpino_special_symbols_pattern = r'[\[\]]'
 alpino_special_symbols_re = re.compile(alpino_special_symbols_pattern)
 
 gretelurl = 'https://gretel.hum.uu.nl/api/src/router.php/parse_sentence/'
+#gretelurl = 'http://gretel.hum.uu.nl/api/src/router.php/parse_sentence/'
 previewurltemplate = 'https://gretel.hum.uu.nl/ng/tree?sent={sent}&xml={xml}'
+#previewurltemplate = 'http://gretel.hum.uu.nl/ng/tree?sent={sent}&xml={xml}'
 
 emptypattern = r'^\s*$'
 emptyre = re.compile(emptypattern)
@@ -86,6 +89,8 @@ def parse(origsent: str, escape: bool = True):
             return None
 
 #def previewurl(stree: SynTree) -> URL:
+
+
 def previewurl(stree):
     '''
     The function *previewurl* returns the URL to preview the input SynTree *stree* in the GreTEL application.
@@ -118,9 +123,9 @@ def escape_alpino_input(instr: str) -> str:
     result = ''
     for c in instr:
         if c == '[':
-            newc = '\['
+            newc = '\\['
         elif c == ']':
-            newc = '\]'
+            newc = '\\]'
         else:
             newc = c
         result += newc
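
The change from `'\['` to `'\\['` does not alter the resulting string: in both spellings the value is a backslash followed by a bracket. The difference is that `\[` is not a recognised escape sequence in a Python string literal, which recent Python versions report with a warning, whereas `\\[` spells the backslash explicitly. A minimal sketch of the same escaping, written with `str.translate` rather than the character loop in the hunk above (the function name `escape_brackets` is an assumption for illustration):

```python
# Map each square bracket to its backslash-escaped form.
BRACKET_ESCAPES = str.maketrans({"[": "\\[", "]": "\\]"})


def escape_brackets(instr: str) -> str:
    """Return instr with every '[' and ']' preceded by a backslash."""
    return instr.translate(BRACKET_ESCAPES)


print(escape_brackets("dat is [niet] waar"))
```

A raw string such as `r'\['` would be an equivalent way to write the same two-character value.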