Merge pull request #11 from UUDigitalHumanitieslab/subj
Subj
JeltevanBoheemen authored Feb 15, 2023
2 parents 4f646e3 + b74a984 commit 48749e1
Showing 40 changed files with 1,325 additions and 190 deletions.
41 changes: 41 additions & 0 deletions .github/workflows/tests.yml
@@ -0,0 +1,41 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: Unit tests

on:
workflow_dispatch:
push:
paths-ignore:
- '**.md'

jobs:
build:

runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.7', '3.10']

steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install flake8
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- name: Prepare PIP package
run: |
cd pypi
./prepare.sh
- name: Lint with flake8
run: |
flake8 $(cat pypi/include.txt | grep \.py\$) --count --max-complexity=12 --max-line-length=127 --statistics
# - name: Run unit tests
# run: |
# pip install pytest
# python -m pytest
4 changes: 3 additions & 1 deletion CHAT_Annotation.py
@@ -9,6 +9,8 @@

CHAT = 'CHAT'

CHAT_explanation = 'Explanation'

monadic = 1
dyadic = 2

@@ -642,7 +644,7 @@ def result(x, y): return SDLOGGER.warning(msg.format(x, y))
CHAT_SimpleScopedRegex(r'\[!!\]', keep, False, monadic),
simplescopedmetafunction),
# Duration to be added here @@
CHAT_Annotation('Explanation', '8.3:69', '10.3:73',
CHAT_Annotation(CHAT_explanation, '8.3:69', '10.3:73',
CHAT_ComplexRegex((r'\[=', anybutrb, r'\]'), (keep, eps), False),
complexmetafunction),
CHAT_Annotation('Replacement', '8.3:69', '10.3:73',
102 changes: 46 additions & 56 deletions Documentation/Tarsp.rst
@@ -92,9 +92,9 @@ Explanation:

*Wh*-questions are defined as follows::

Tarsp_whq = """((@cat="whq" and @rel="--") or (@cat="whsub"))"""
Tarsp_whq = """((@cat="whq" and @rel="--") or (@cat="whsub") or (@cat="whrel" and @rel="--"))"""
covering both main clauses (*whq*) and subordinate clauses (*whsub*).
covering both main clauses (*whq*) and subordinate clauses (*whsub*), as well as clauses that Alpino has analysed as independent relative clauses (*whrel* with relation "--"), which in practice are usually wh-questions.


.. _imperatives:
@@ -241,10 +241,12 @@ Language measures such as *WBVC*, *OndVC*, *OndW*, *OndWB*, *OndWBVC*, and sever
where::
subject = """(@rel="su")"""
subject = """(@rel="su" and parent::node[(@cat="smain" or @cat="sv1" or @cat="ssub")])"""

Here we only count as subjects those nodes (full or index nodes) that have a finite clause node as parent.
Overt subjects never occur in nonfinite clauses, and subject index nodes should not be counted inside nonfinite clauses. Subject index nodes do occur in finite clauses, e.g., as the "trace" of a wh-movement.

It differs from the definition of :ref:`T063_Ond` because T063 has to exclude subjects of nonfinite nodes (which is covered in the composed language measures by the conditions on mood) and to cover an additional case: existential *er* as a subject (though maybe that should be included here as well).
It differs from the definition of :ref:`T063_Ond` because T063 has to cover an additional case: existential *er* as a subject (though maybe that should be included here as well).
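The effect of the parent condition can be illustrated with a minimal sketch. It uses a hand-written toy tree for *hij heeft gezwommen* (not real Alpino output) and a hypothetical helper `finite_clause_subjects` that mirrors the new *subject* macro in plain Python instead of XPath:

```python
import xml.etree.ElementTree as ET

# Categories of finite clauses, as listed in the new subject macro.
FINITE_CATS = {"smain", "sv1", "ssub"}

# Toy tree (assumption): the participial clause holds a bare index node as subject.
TREE = ET.fromstring("""
<node cat="smain" rel="--">
  <node rel="su" pt="vnw" lemma="hij" index="1"/>
  <node rel="hd" pt="ww" lemma="hebben"/>
  <node cat="ppart" rel="vc">
    <node rel="su" index="1"/>
    <node rel="hd" pt="ww" lemma="zwemmen"/>
  </node>
</node>
""")


def finite_clause_subjects(tree):
    """Return the su nodes whose parent is a finite clause node."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in tree.iter() for child in parent}
    return [
        node
        for node in tree.iter("node")
        if node.get("rel") == "su"
        and parent_of.get(node) is not None
        and parent_of[node].get("cat") in FINITE_CATS
    ]


# Old definition: every node with rel="su" counts.
all_su = [node for node in TREE.iter("node") if node.get("rel") == "su"]
# New definition: only subjects directly under smain, sv1 or ssub count.
finite_su = finite_clause_subjects(TREE)
print(len(all_su), len(finite_su))
```

In this toy tree the old definition finds two subject nodes, while the new one keeps only *hij*, because the index node sits under a *ppart* node.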

* **Tarsp_B**: see :ref:`T007_B`
* **Tarsp_VC**: is defined as follows::
@@ -435,16 +437,23 @@ T006: Avn
* **Implementation**: Xpath
* **Query** defined as::

//node[@pt="vnw" and @vwtype="aanw" and @lemma!="hier" and @lemma!="daar" and @lemma!="er" and
@rel!="det" and (not(@positie) or @positie!="prenom") ]
AVn = """(%coreavn% or %avnrel%)"""
coreavn = """(@pt="vnw" and @vwtype="aanw" and @lemma!="hier" and @lemma!="daar" and @lemma!="er" and @rel!="det" and (not(@positie) or @positie!="prenom") )"""
avnrel = """(%diedatrel% and parent::node[@cat="rel" and @rel!="mod"])"""
diedatrel = """(@pt="vnw" and @vwtype="betr" and @rel="rhd" and (@lemma="die" or @lemma="dat"))"""

The query for *AVn* consists of two subcases: the core case, and the case of *die* and *dat* incorrectly analysed as a relative pronoun in an independent relative clause.

* The query with pt equal to *vnw* and vwtype equal to *aanw* selects demonstrative pronouns, but
The core case (*coreavn*):

* which has pt equal to *vnw* and vwtype equal to *aanw* selects demonstrative pronouns, but
* these include R-pronouns, so they are explicitly excluded
* the relation must not be *det* (otherwise the pronouns are not used independently)
* and if a *position* attribute is present it should not have the value *prenom* (otherwise it is not used independently)

* **Schlichting**: "Aanwijzend Voornaamwoord: 'die', 'dit', 'deze', 'dat' zelfstandig gebruikt." fully covered.
The relative case (*avnrel*) covers the relative pronouns *die* and *dat* (*diedatrel*) in an independent relative clause (i.e. *rel* is not equal to *mod*).

* **Schlichting**: "Aanwijzend Voornaamwoord: 'die', 'dit', 'deze', 'dat' zelfstandig gebruikt." Fully covered.
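How the `%name%` references in these macros combine into one XPath expression can be sketched as follows. This is only an illustration: the function `expand` and the dictionary `MACROS` are hypothetical, and the actual macro mechanism used by sastadev is not shown in this diff.

```python
import re

# Macro bodies copied from the AVn definition above.
MACROS = {
    "AVn": '(%coreavn% or %avnrel%)',
    "coreavn": '(@pt="vnw" and @vwtype="aanw" and @lemma!="hier" and '
               '@lemma!="daar" and @lemma!="er" and @rel!="det" and '
               '(not(@positie) or @positie!="prenom") )',
    "avnrel": '(%diedatrel% and parent::node[@cat="rel" and @rel!="mod"])',
    "diedatrel": '(@pt="vnw" and @vwtype="betr" and @rel="rhd" and '
                 '(@lemma="die" or @lemma="dat"))',
}

MACRO_CALL = re.compile(r"%(\w+)%")


def expand(expression, macros, depth=0):
    """Recursively replace every %name% by the body of the macro `name`."""
    if depth > 20:
        # Guard against macros that (indirectly) refer to themselves.
        raise RecursionError("macro definitions appear to be circular")

    def substitute(match):
        return expand(macros[match.group(1)], macros, depth + 1)

    return MACRO_CALL.sub(substitute, expression)


expanded = expand("//node[%AVn%]", MACROS)
print(expanded)
```

The expanded string contains no remaining `%` markers and could be passed to any XPath 1.0 engine.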


.. _T007_B:
@@ -2052,26 +2061,24 @@ T063: Ond

Here, **FullOnd** is defined as::

"""((%subject% and (@pt or @cat) ) or %erx%)"""
FullOnd = """(%subject% or %erx%)"""


where **subject** and **erx** are defined as::
where **subject** (see :ref:`zinsdelen`) and **erx** are defined as::

subject = """(@rel="su")"""
subject = """(@rel="su" and parent::node[(@cat="smain" or @cat="sv1" or @cat="ssub")])"""
erx = """((@rel="mod" and @lemma="er" and ../node[@rel="su" and @begin>=../node[@rel="mod" and @lemma="er"]/@end]) or
(@rel="mod" and @lemma="er" and ../node[@rel="su" and not(@pt) and not(@cat)])
)
"""
The condition on the presence of *pt* or *cat* is present to exclude (empty) subject of infinitives and participle clauses, e.g. in *Hij heeft gezwommen* *hij* is an antecedent of an (empty) node acting as the subject of the past participle *gezwommen*, and we do not want to include that.

The **erx** macro ensures that so-called *expletive er* also counts as a subject. We implemented this in the following manner: *er* is considered (also) expletive (and thus must count as a subject):

* if it precedes the subject (as in **er** *kwam iemand binnen*), or
* if there is an empty subject, to cover cases such as *wie zwom* **er**. Note that in *wie heeft* **er** *gezwommen* this *er* is considered a subject because of the empty subject of the participial clause, which is perhaps not what we want.
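The two conditions of **erx** can be restated in plain Python as a minimal sketch. The helper names and the toy clauses are hypothetical, and the sketch assumes that every node carries integer-valued *begin* and *end* attributes:

```python
import xml.etree.ElementTree as ET


def is_expletive_er(node, siblings):
    """Mirror the two erx cases for one candidate node and its sibling nodes."""
    if node.get("rel") != "mod" or node.get("lemma") != "er":
        return False
    for sibling in siblings:
        if sibling.get("rel") != "su":
            continue
        # Case 2: an empty subject, i.e. a node with neither pt nor cat.
        empty_subject = sibling.get("pt") is None and sibling.get("cat") is None
        # Case 1: the subject starts at or after the end of 'er'.
        follows_er = int(sibling.get("begin")) >= int(node.get("end"))
        if empty_subject or follows_er:
            return True
    return False


def expletive_er_nodes(clause):
    """Return the children of a clause node that qualify as expletive er."""
    children = list(clause)
    return [child for child in children if is_expletive_er(child, children)]


# "er kwam iemand binnen": er precedes the subject.
er_first = ET.fromstring(
    '<node cat="smain">'
    '<node rel="mod" lemma="er" pt="vnw" begin="0" end="1"/>'
    '<node rel="hd" lemma="komen" pt="ww" begin="1" end="2"/>'
    '<node rel="su" lemma="iemand" pt="vnw" begin="2" end="3"/>'
    '<node rel="svp" lemma="binnen" pt="vz" begin="3" end="4"/>'
    '</node>')

# "iemand kwam er binnen": the overt subject precedes er.
subject_first = ET.fromstring(
    '<node cat="smain">'
    '<node rel="su" lemma="iemand" pt="vnw" begin="0" end="1"/>'
    '<node rel="hd" lemma="komen" pt="ww" begin="1" end="2"/>'
    '<node rel="mod" lemma="er" pt="vnw" begin="2" end="3"/>'
    '<node rel="svp" lemma="binnen" pt="vz" begin="3" end="4"/>'
    '</node>')

# Body of "wie zwom er": the subject is a bare index node.
empty_subject_body = ET.fromstring(
    '<node cat="sv1" rel="body">'
    '<node rel="su" index="1" begin="0" end="1"/>'
    '<node rel="hd" lemma="zwemmen" pt="ww" begin="1" end="2"/>'
    '<node rel="mod" lemma="er" pt="vnw" begin="2" end="3"/>'
    '</node>')

print(len(expletive_er_nodes(er_first)),
      len(expletive_er_nodes(subject_first)),
      len(expletive_er_nodes(empty_subject_body)))
```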


**Remark**: The condition on the presence of *pt* or *cat* incorrectly excludes *wie* in *wie doet dat*, *wie heeft dat gedaan*: *wie* is a *whd* and an antecedent to an index node with grammatical relation *su* (see e.g. VKLTarsp, sample 3, utterance 25 *weet ik niet* **wie** *daarin zit*). It has no consequences for the scores because *Ond* is not in the form and because language measures such as OndWB use a different definition of subject. It is probably better to replace the condition on *pt* and *cat* by a condition on the parent node, viz. that its *cat* attribute must have one of the values *smain*, *sv1*, or *ssub* (categories for finite clauses or finite clause bodies).


* **Schlichting**: "Dit is de persoon of de zaak die de handeling van het werkwoord uitvoert. Wanneer het onderwerp van een zin in het meervoud staat, staat de persoonsvorm van die zin ook in het meervoud."
@@ -2402,7 +2409,7 @@ Straightforward implementation::
Tarsp_OndW = """(%declarative% and
%Ond% and
(%Tarsp_W% or node[%Tarsp_onlyWinVC%]) and
%realcomplormodnodecount% = 0 )"""
%realcomplormodnodecount% = 1 )"""

See section :ref:`composedmeasures` for details.

@@ -3481,28 +3488,16 @@ T106: Vo/bij
* **Original**: yes
* **In form**: yes
* **Page**: 71
* **Implementation**: Xpath with macros
* **Implementation**: Python function
* **Query** defined as::

//node[node[@pt="vz" and @rel="hd"] and
node[@rel="obj1" and
((@index and not(@pt or @cat)) or
(@end < ../node[@rel="hd"]/@begin)
)]]


The query searches for a node

* containing a head adposition node, and
* containing a node with grammatical relation *obj1*

* which is an "empty" node
* and which precedes the head adposition
voslashbij
.. autofunction:: queryfunctions::voslashbij

* **Schlichting**: "Voornaamwoordelijk bijwoord, gesplitst. Het gesplitste voornaamwoordelijk bijwoord behoeft niet gescoord te worden bij Vobij, kolom Voornaamwoorden in Fase IV, alleen hier bij de Woordgroepen in Fase V."

* **Remark** Schlichting only gives examples with nonadjacent R-pronouns and adpositions. We agreed with Rob Zwitserlood that adjacent R-pronoun + adposition, even if written separately, is to be counted under *Vobij*. However, the notion "adjacent" is not so easy to define in inflated tree structures. This still has to be done.
* **Remark** The condition *@index and not(@pt or @cat)* is better replaced by a macro defined as *@index and not(@word or @cat)*.
* **Remark** Schlichting only gives examples with nonadjacent R-pronouns and adpositions. We agreed with Rob Zwitserlood that adjacent R-pronoun + adposition, even if written separately, is to be counted under *Vobij*. However, the notion "adjacent" is not so easy to define in inflated tree structures. It has been implemented in the Python function *adjacent* in the module treebankfunctions, and for this reason this query must also be defined as a Python function.



@@ -3518,24 +3513,20 @@ T107: Vobij
* **Original**: yes
* **In form**: yes
* **Page**: 80
* **Implementation**: Xpath with macros
* **Implementation**: Python function
* **Query** defined as::

//node[%Vobij%]
vobij
.. autofunction:: queryfunctions::vobij

the macro **Vobij** is defined as follows::

Vobij = """(@pt="bw" and (contains(@frame,"er_adverb" ) or contains(@frame, "tmp_adverb") or @lemma="daarom") and
@lemma!="er" and @lemma!="daar" and @lemma!="hier" and
(starts-with(@lemma, 'er') or starts-with(@lemma, 'daar') or starts-with(@lemma, 'hier'))
)"""



* **Schlichting**: "Voornaamwoordelijk bijwoord. Het voornaamwoordelijk bijwoord is een combinatie van 'er', 'daar', 'hier', 'waar' met een voorzetsel (bijv. 'aan', 'bij', 'voor') of een bijwoord ('af', 'heen', 'toe')"

* **Remark** Separately written but adjacent R-pronoun + adposition cases are also considered *Vobij*.
* **Remark** Alpino considers words such as 'af', 'heen', and 'toe' as postpositions, not as adverbs.
* **Remark** *waar* is lacking: it does not occur in the VKLtarsp data, and in the Auris data only *waarom* occurs.



@@ -3604,6 +3595,7 @@ T110: Vr

This language measure actually does not exist. It has been included and implemented because the code *Vr* occurs in the Schlichting appendix (example 20, p. 91, and example 22, p. 92). Example 20 should have been annotated as *Vr(XY)*, and example 22 as *Vr4*.

.. _VrXY:

T111: Vr(XY)
""""""""""""
@@ -3624,13 +3616,14 @@ T111: Vr(XY)

Straightforward implementation::

Tarsp_VrXY = """(%Tarsp_whq% and
node[@rel="whd"] and
node[@cat="sv1" and
@rel="body" and
%realcomplormodnodecount% = 1
])"""
Tarsp_VrXY = """(%Tarsp_whq% and
node[%Tarsp_whqhead%] and
node[%whqbody% and %realcomplormodnodecount% <= 1])"""

where *Tarsp_whqhead* and *whqbody* cover both wh-questions (main and subordinate) and independent relatives::

Tarsp_whqhead = """(@rel="whd" or @rel="rhd") """
whqbody = """((@cat="sv1" or @cat="ssub") and @rel="body")"""

See section :ref:`composedmeasures` for details.

@@ -3662,12 +3655,10 @@ T112: Vr4
Straightforward implementation::

Tarsp_Vr4 = """(%Tarsp_whq% and
node[@rel="whd"] and
node[@cat="sv1" and
@rel="body" and
%realcomplormodnodecount% = 2
])"""
node[%Tarsp_whqhead%] and
node[%whqbody% and %realcomplormodnodecount% = 2])"""

For *Tarsp_whqhead* and *whqbody*, see :ref:`VrXY`


See section :ref:`composedmeasures` for details.
@@ -3694,13 +3685,12 @@ T113: Vr5+
Straightforward implementation::

Tarsp_Vr5plus = """(%Tarsp_whq% and
node[@rel="whd"] and
node[@cat="sv1" and
@rel="body" and
%realcomplormodnodecount% > 2
])"""
node[%Tarsp_whqhead%] and
node[%whqbody% and %realcomplormodnodecount% > 2])"""



For *Tarsp_whqhead* and *whqbody*, see :ref:`VrXY`

See section :ref:`composedmeasures` for details.
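After this change the three question measures differ only in the threshold on `%realcomplormodnodecount%`. A minimal sketch of that partition follows; `vr_measure` is a hypothetical helper, and it assumes the node already satisfies `%Tarsp_whq%` and has a matching head and body:

```python
def vr_measure(real_compl_or_mod_count: int) -> str:
    """Map the number of real complement or modifier nodes in the body to a measure."""
    if real_compl_or_mod_count <= 1:
        return "Vr(XY)"   # Tarsp_VrXY: count <= 1
    if real_compl_or_mod_count == 2:
        return "Vr4"      # Tarsp_Vr4: count = 2
    return "Vr5+"         # Tarsp_Vr5plus: count > 2


for count in range(5):
    print(count, vr_measure(count))
```

Because the conditions are `<= 1`, `= 2` and `> 2`, every count falls under exactly one of the three measures.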

13 changes: 13 additions & 0 deletions Documentation/auxiliarymodules.rst
@@ -21,3 +21,16 @@ Deregularise
------------

.. automodule:: deregularise

.. _treebankfunctions:

Treebankfunctions
-----------------

.. _indextransform:

Expansion of bare index nodes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. autofunction:: treebankfunctions::indextransform

6 changes: 6 additions & 0 deletions README.md
@@ -1,11 +1,17 @@
# Sastadev

[![Actions Status](https://github.com/UUDigitalHumanitieslab/sastadev/workflows/Unit%20tests/badge.svg)](https://github.com/UUDigitalHumanitieslab/sastadev/actions)

[pypi sastadev](https://pypi.org/project/sastadev)

Method definitions for use in SASTA

Copy `default_config.py` to your own `config.py` in the `sastadev` directory, and change what you need.

## Upload to PyPi

Specify the files which should be included in the package in `pypi/include.txt`.

```bash
cd pypi
./prepare.sh
11 changes: 8 additions & 3 deletions alpinoparsing.py
@@ -19,17 +19,20 @@
from memoize import memoize

import logging
from typing import Optional
#from sastatypes import SynTree, URL

#from config import SDLOGGER
#from sastatypes import SynTree, URL

urllibrequestversion = urllib.request.__version__

alpino_special_symbols_pattern = r'[\[\]]'
alpino_special_symbols_re = re.compile(alpino_special_symbols_pattern)

gretelurl = 'https://gretel.hum.uu.nl/api/src/router.php/parse_sentence/'
#gretelurl = 'http://gretel.hum.uu.nl/api/src/router.php/parse_sentence/'
previewurltemplate = 'https://gretel.hum.uu.nl/ng/tree?sent={sent}&xml={xml}'
#previewurltemplate = 'http://gretel.hum.uu.nl/ng/tree?sent={sent}&xml={xml}'

emptypattern = r'^\s*$'
emptyre = re.compile(emptypattern)
@@ -86,6 +89,8 @@ def parse(origsent: str, escape: bool = True):
return None

#def previewurl(stree: SynTree) -> URL:


def previewurl(stree):
'''
The function *previewurl* returns the URL to preview the input SynTree *stree* in the GreTEL application.
@@ -118,9 +123,9 @@ def escape_alpino_input(instr: str) -> str:
result = ''
for c in instr:
if c == '[':
newc = '\['
newc = '\\['
elif c == ']':
newc = '\]'
newc = '\\]'
else:
newc = c
result += newc
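The change from `'\['` to `'\\['` does not alter the resulting string. It only avoids the invalid escape sequence `\[`, which recent Python versions warn about. As a sketch of the same escaping done in one pass, the hypothetical helper `escape_brackets` below reuses the character class from `alpino_special_symbols_pattern`; it is an illustration, not the function from this commit:

```python
import re

# Same character class as alpino_special_symbols_pattern in alpinoparsing.py.
SPECIAL = re.compile(r'[\[\]]')


def escape_brackets(text: str) -> str:
    """Prefix every '[' and ']' with a backslash (hypothetical helper)."""
    # A function replacement avoids any escape processing of the backslash.
    return SPECIAL.sub(lambda match: '\\' + match.group(0), text)


print(escape_brackets('dat [is] zo'))
```

Using a function as the replacement keeps the backslash literal, so no additional escaping rules of `re.sub` templates come into play.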