NeoSCA

NeoSCA is a rewrite of Xiaofei Lu's L2 Syntactic Complexity Analyzer (L2SCA), with added support for Windows and an improved command-line interface for easier usage. Like L2SCA, NeoSCA takes written English language samples in plain text format as input and computes:

the frequency of 9 structures in the text:
  1. words (W)
  2. sentences (S)
  3. verb phrases (VP)
  4. clauses (C)
  5. T-units (T)
  6. dependent clauses (DC)
  7. complex T-units (CT)
  8. coordinate phrases (CP)
  9. complex nominals (CN), and
14 syntactic complexity indices of the text:
  1. mean length of sentence (MLS)
  2. mean length of T-unit (MLT)
  3. mean length of clause (MLC)
  4. clauses per sentence (C/S)
  5. verb phrases per T-unit (VP/T)
  6. clauses per T-unit (C/T)
  7. dependent clauses per clause (DC/C)
  8. dependent clauses per T-unit (DC/T)
  9. T-units per sentence (T/S)
  10. complex T-unit ratio (CT/T)
  11. coordinate phrases per T-unit (CP/T)
  12. coordinate phrases per clause (CP/C)
  13. complex nominals per T-unit (CN/T)
  14. complex nominals per clause (CN/C)
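Each of the 14 indices is a ratio of two of the 9 frequencies, as the names suggest (for example, MLS is words per sentence). The following is a minimal Python sketch of that relationship, not NeoSCA's own code: the helper `compute_indices`, the `INDEX_DEFINITIONS` table, and the counts in `example` are hypothetical, and returning 0.0 for a zero denominator is a choice made for this sketch only.

```python
# Illustrative only: `compute_indices` is a hypothetical helper, not NeoSCA's API.
# Each index maps to (numerator frequency, denominator frequency).
INDEX_DEFINITIONS = {
    "MLS": ("W", "S"),
    "MLT": ("W", "T"),
    "MLC": ("W", "C"),
    "C/S": ("C", "S"),
    "VP/T": ("VP", "T"),
    "C/T": ("C", "T"),
    "DC/C": ("DC", "C"),
    "DC/T": ("DC", "T"),
    "T/S": ("T", "S"),
    "CT/T": ("CT", "T"),
    "CP/T": ("CP", "T"),
    "CP/C": ("CP", "C"),
    "CN/T": ("CN", "T"),
    "CN/C": ("CN", "C"),
}


def compute_indices(freq):
    """Return the 14 indices for a dict holding the 9 structure frequencies."""
    indices = {}
    for name, (numerator, denominator) in INDEX_DEFINITIONS.items():
        # Assumption of this sketch: report 0.0 when the denominator is zero.
        if freq[denominator]:
            indices[name] = freq[numerator] / freq[denominator]
        else:
            indices[name] = 0.0
    return indices


# Made-up counts, for illustration only.
example = {"W": 100, "S": 5, "VP": 12, "C": 10, "T": 8, "DC": 2, "CT": 2, "CP": 4, "CN": 6}
print(compute_indices(example)["MLS"])  # 100 words / 5 sentences
```

Keeping the definitions in one table makes it easy to check every abbreviation against the two lists above.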

Highlights

  • Works on Windows/macOS/Linux
  • Can reserve (save) intermediate results, i.e., the parsed trees from Stanford Parser and the matched subtrees from Stanford Tregex
  • An improved command-line interface

Install

Install NeoSCA

To install NeoSCA, you need to have Python 3.7 or later installed on your system. You can check if you already have Python installed by running the following command in your terminal:

python --version

If Python is not installed, you can download and install it from the Python website. Once you have Python installed, you can install NeoSCA using pip:

pip install neosca

If you are in China and run into slow downloads or network issues, you can install NeoSCA from the Tsinghua University PyPI mirror instead:

pip install neosca -i https://pypi.tuna.tsinghua.edu.cn/simple

Install Dependencies

NeoSCA depends on Java, Stanford Parser, and Stanford Tregex. After installing NeoSCA, you can run nsca --check-depends to install them. Note that this command requires administrator privileges on Windows.

Basic Usage

To use NeoSCA, run the nsca command in your terminal, followed by the options and arguments you want to use.

Single Input

To analyze a single text file, use the command nsca followed by the file path.

nsca ./samples/sample1.txt
# frequency output: ./result.csv

A result.csv file will be generated in the current directory. You can specify a different output filename using -o.

nsca ./samples/sample1.txt -o sample1.csv
# frequency output: ./sample1.csv

When analyzing a file whose name contains spaces, enclose the file path in double quotes. Assume you have a file named "sample 1.txt" to analyze:

nsca "./samples/sample 1.txt"

This ensures that the entire filename, including the space, is interpreted as a single argument. Without the double quotes, "./samples/sample" and "1.txt" would be passed as two separate arguments and the analysis would fail.
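The effect of quoting can be previewed without running nsca. The sketch below uses Python's standard `shlex.split`, which follows POSIX shell splitting rules, so it approximates what a Unix shell passes to the program. Other shells (for example on Windows) differ in detail, though an unquoted space still separates arguments there.

```python
import shlex

# Split the command lines the way a POSIX shell would before starting nsca.
unquoted = shlex.split("nsca ./samples/sample 1.txt")
quoted = shlex.split('nsca "./samples/sample 1.txt"')

print(unquoted)  # the path is broken into two arguments
print(quoted)    # the path stays one argument
```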

Multiple Input

To analyze multiple text files at once, simply list them after the nsca command.

nsca ./samples/sample1.txt ./samples/sample2.txt

You can also use wildcards to select multiple files at once.

nsca ./samples/sample*.txt     # every file whose name starts with "sample" and ends with ".txt"
nsca ./samples/sample[1-9].txt # sample1.txt, sample2.txt, ..., sample9.txt
nsca ./samples/sample1?.txt    # sample10.txt, sample11.txt, ..., sample19.txt
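To check what a wildcard pattern would select before running an analysis, the patterns can be tried against a list of names. This is a small sketch using Python's standard `fnmatch` module, which implements shell-style wildcards. The file names in `names` and the helper `matching` are hypothetical, and the expansion done by your shell may differ in edge cases.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing, for illustration only.
names = ["sample1.txt", "sample2.txt", "sample10.txt", "sample19.txt", "notes.txt"]


def matching(pattern):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


print(matching("sample*.txt"))      # "*" matches any run of characters
print(matching("sample[1-9].txt"))  # "[1-9]" matches exactly one digit from 1 to 9
print(matching("sample1?.txt"))     # "?" matches exactly one character
```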

Advanced Usage

Output Frequencies in JSON Format

You can generate a JSON file by passing --output-format json:

nsca ./samples/sample1.txt --output-format json
# frequency output: ./result.json

Or specify an output filename with a .json extension:

nsca ./samples/sample1.txt -o sample1.json
# frequency output: ./sample1.json

Pass Text Through the Command Line

If you want to analyze text that is passed directly through the command line, you can use --text followed by the text.

nsca --text 'The quick brown fox jumps over the lazy dog.'
# frequency output: ./result.csv
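When calling nsca from a script, passing the arguments as a list avoids shell quoting problems with text that contains spaces or quotes. The following is a minimal sketch using Python's standard `subprocess` module. The helper `build_text_command` is hypothetical, and combining --text with -o is an assumption based on the options shown in this README.

```python
import shutil
import subprocess


def build_text_command(text, output=None):
    """Build an argument list for `nsca --text ...` (hypothetical helper)."""
    command = ["nsca", "--text", text]
    if output is not None:
        # Assumption: -o can be combined with --text, as with file input.
        command += ["-o", output]
    return command


command = build_text_command("The quick brown fox jumps over the lazy dog.")
print(command)

# Run the command only if nsca is actually available on PATH.
if shutil.which("nsca"):
    subprocess.run(command, check=False)
```

Because each list element is passed as one argument, the text needs no extra quoting.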

Reserve Intermediate Results

To reserve the parsed trees, use -p or --reserve-parsed. To reserve matched subtrees, use -m or --reserve-matched.

nsca samples/sample1.txt -p
# frequency output: ./result.csv
# parsed trees:     ./samples/sample1.parsed

nsca samples/sample1.txt -m
# frequency output: ./result.csv
# matched subtrees: ./result_matches/

nsca samples/sample1.txt -p -m
# frequency output: ./result.csv
# parsed trees:     ./samples/sample1.parsed
# matched subtrees: ./result_matches/

Just Parse Text and Exit

If you only want to save the parsed trees and exit, you can use --no-query. This can be useful if you want to use the parsed trees for other purposes.

nsca samples/sample1.txt --no-query
# parsed trees: ./samples/sample1.parsed

nsca --text 'This is a test.' --no-query
# parsed trees: ./cmdline_text.parsed
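As one example of reusing parsed trees, the sketch below counts node labels in bracketed tree text. It assumes the .parsed file holds Penn Treebank-style bracketed trees, which is the format Stanford Parser normally produces. The helper `count_labels` is hypothetical and the tree string is hand-written for illustration rather than copied from real NeoSCA output.

```python
import re
from collections import Counter


def count_labels(tree_text):
    """Count node labels in Penn Treebank-style bracketed text."""
    # A label is the token that directly follows an opening parenthesis.
    return Counter(re.findall(r"\(([^\s()]+)", tree_text))


# Hand-written example tree; in practice, read the text from a .parsed file.
tree = "(ROOT (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN test))) (. .)))"
labels = count_labels(tree)
print(labels["NP"], labels["VP"])
```

Simple label counts like these are only a rough check. The frequencies NeoSCA reports come from Stanford Tregex pattern matching on the trees.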

List Output Fields

If you are not sure what the output fields represent, you can use --list to print a list of all the available output fields.

nsca --list
W: words
S: sentences
VP: verb phrases
C: clauses
T: T-units
DC: dependent clauses
CT: complex T-units
CP: coordinate phrases
CN: complex nominals
MLS: mean length of sentence
MLT: mean length of T-unit
MLC: mean length of clause
C/S: clauses per sentence
VP/T: verb phrases per T-unit
C/T: clauses per T-unit
DC/C: dependent clauses per clause
DC/T: dependent clauses per T-unit
T/S: T-units per sentence
CT/T: complex T-unit ratio
CP/T: coordinate phrases per T-unit
CP/C: coordinate phrases per clause
CN/T: complex nominals per T-unit
CN/C: complex nominals per clause

Print the Help Message

If you call the nsca command without any arguments or options, it prints a help message.

Citing

If you use NeoSCA in your research, please cite as follows.

BibTeX:

@misc{tan2022neosca,
  title        = {NeoSCA: A Rewrite of L2 Syntactic Complexity Analyzer, version 0.0.35},
  author       = {Long Tan},
  howpublished = {\url{https://github.com/tanloong/neosca}},
  year         = {2022}
}

APA (7th edition):

Tan, L. (2022). NeoSCA (version 0.0.35) [Computer software]. GitHub. https://github.com/tanloong/neosca

MLA (9th edition):

Tan, Long. NeoSCA. version 0.0.35, GitHub, 2022, https://github.com/tanloong/neosca.

Please also cite Xiaofei Lu's article describing L2SCA.

BibTeX:

@article{lu2010automatic,
  title     = {Automatic analysis of syntactic complexity in second language writing},
  author    = {Xiaofei Lu},
  journal   = {International Journal of Corpus Linguistics},
  volume    = {15},
  number    = {4},
  pages     = {474--496},
  year      = {2010},
  publisher = {John Benjamins Publishing Company},
  doi       = {10.1075/ijcl.15.4.02lu},
}

APA (7th edition):

Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4), 474-496.

MLA (9th edition):

Lu, Xiaofei. "Automatic Analysis of Syntactic Complexity in Second Language Writing." International Journal of Corpus Linguistics, vol. 15, no. 4, John Benjamins Publishing Company, 2010, pp. 474-96.

License

NeoSCA is licensed under the GNU General Public License version 2 or later.