diff --git a/.gitignore b/.gitignore index d21474f..1616478 100644 --- a/.gitignore +++ b/.gitignore @@ -3,3 +3,5 @@ venv/* .idea/* webapp.log static/topicmodeling.zip +__pycache__/* +Pipfile.lock diff --git a/Pipfile b/Pipfile index f46a1ce..da55c7c 100644 --- a/Pipfile +++ b/Pipfile @@ -14,6 +14,7 @@ Flask = "==0.12.2" lxml = "==4.1.1" pandas = "==0.21.1" numpy = "==1.14.0" +pyqt = "==5.9.2" [dev-packages] diff --git a/README.md b/README.md index f4a7561..06348d0 100755 --- a/README.md +++ b/README.md @@ -1,25 +1,24 @@ # Topics Explorer: A GUI for Topics – Easy Topic Modeling This application introduces an user-friendly Topic Modeling workflow, basically containing text data preprocessing, the actual modeling using [latent Dirichlet allocation](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) (LDA), as well as various interactive visualizations. -**If you do not know anything about Topic Modeling or programming in general, this is where you start.** +If you do not know anything about Topic Modeling or programming in general, this is where you start. -**Topics Explorer** aims for *simplicity* and *usability*. If you are working with a large corpus (let's say more than 200 documents, 5000 tokens each document) you may wish to use more sophisticated Topic Models such as those implemented in [MALLET](http://mallet.cs.umass.edu/topics.php), which is known to be more robust than standard LDA. Have a look at our Jupyter notebook [introducing Topic Modeling with MALLET](https://github.com/DARIAH-DE/Topics/IntroducingMallet.ipynb). - -![Demonstrator Screenshot](screenshot.png) +## Getting started with the standalone executable +You **do not** have to install a Python interpreter or anything else. There is currently one standalone build for Windows and macOS, respectively. **At the moment, Linux user will have to use the development version**. +1. 
Go to the [release-section](https://github.com/DARIAH-DE/TopicsExplorer/releases) and download the ZIP archive for your OS. +2. Open it by double-clicking. +3. Run the app by double-clicking the file `DARIAH Topics Explorer`. (The files in the folder `src` are basically source code. You do not need to worry about them.) -## Getting started -Although this application is built with Python and some JavaScript, it is possible to run it as if it was a native application, without having to install Python or any related packages. There is currently one build for Windows and macOS, respectively. +**Topics Explorer** aims for simplicity and usability. If you are working with a large corpus (let's say more than 200 documents, 5000 tokens each document) you may wish to use more sophisticated topic models such as those implemented in [MALLET](http://mallet.cs.umass.edu/topics.php), which is known to be more robust than standard LDA. Have a look at our Jupyter notebook [introducing Topic Modeling with MALLET](https://github.com/DARIAH-DE/Topics/blob/master/IntroducingMallet.ipynb). -1. Download `demonstrator-0.0.1-windows.zip` or `demonstrator-0.0.1-mac.zip` from the [release-section](https://github.com/DARIAH-DE/Topics/releases). -2. Open it by double-clicking. -3. Run the app by double-clicking the file `DARIAH Topics Explorer.exe` or `DARIAH Topics Explorer.app`, respectively. +![Demonstrator Screenshot](screenshot.png) ### Troubleshooting * Please be patient. Depending on corpus size and number of iterations, the process may take some time, meaning something between some seconds and some hours. * If you are on a Mac and get an error message saying that the file is from an “unidentified developer”, you can override it by holding control while double-clicking. The error message will still appear, but you will be given an option to run the file anyway. -* Please use [GitHub Issues](https://github.com/DARIAH-DE/TopicsExplorer/issues). 
+* Please use [GitHub issues](https://github.com/DARIAH-DE/TopicsExplorer/issues). ## Working with the development version @@ -32,18 +31,17 @@ Although this application is built with Python and some JavaScript, it is possib ### Requirements Besides the standalone executables, you have the ability to run the development version. In this case, you will have to install some dependencies, but first of all: - * At least Python 3.6, from [here](https://www.python.org/downloads/). Python 2 is *not* supported. -* If you wish to use *Layer 3* (which is not necessary at all): Node.js, from [here](https://nodejs.org/en/download/). -For Python, you will need the following libraries: -* [`dariah_topics`](https://github.com/DARIAH-DE/Topics) 0.0.5. -* [`lda`](https://github.com/lda-project/lda) 1.0.5. -* [`bokeh`](https://github.com/bokeh/bokeh) 0.12.13. -* [`flask`](https://github.com/pallets/flask) 0.12.2. -* [`lxml`](https://github.com/lxml/lxml) 4.1.1. -* [`pandas`](https://github.com/pandas-dev/pandas) 0.21.1. -* [`numpy`](https://github.com/numpy/numpy) 1.14.0. +You will need the following libraries: +* [`dariah_topics`](https://github.com/DARIAH-DE/Topics) 0.0.6 +* [`lda`](https://github.com/lda-project/lda) 1.0.5 +* [`bokeh`](https://github.com/bokeh/bokeh) 0.12.13 +* [`flask`](https://github.com/pallets/flask) 0.12.2 +* [`lxml`](https://github.com/lxml/lxml) 4.1.1 +* [`pandas`](https://github.com/pandas-dev/pandas) 0.21.1 +* [`numpy`](https://github.com/numpy/numpy) 1.14.0 +* [`pyqt5`](https://github.com/baoboa/pyqt5) 5.9.2. You can install all dependencies using [`pipenv`](http://pipenv.readthedocs.io/en/latest/): @@ -51,50 +49,32 @@ You can install all dependencies using [`pipenv`](http://pipenv.readthedocs.io/e pipenv install ``` -> If you are on a UNIX-based machine, remember using `pip3` and `python3` instead of `pip` and `python`. - -So far, you could run the application via `python webapp.py` and go to `http://127.0.0.1:5000` in any web browser. 
If you want a more desktop app-like feeling, you can build *Layer 3* on top with [Electron](https://electronjs.org/), a JavaScript framework for creating native applications with web technologies like JavaScript, HTML, and CSS. The dependencies are: - -* [`electron`](https://github.com/electron/electron) 1.7.10. -* [`request-promise`](https://github.com/request/request-promise) 4.2.2. -* [`request`](https://github.com/request/request) 2.83.1. +> If you are on a UNIX-based machine, remember using `python3` instead of `python`. -Run the following command via [`npm`](https://www.npmjs.com/get-npm): +So far, you could run the application via `python webapp.py` and go to `http://127.0.0.1:5000` in any web browser. If you want a more desktop app-like feeling, you can build *Layer 3* on top and run: ``` -npm install +python topicsexplorer.py ``` + ### Contents -* [`bokeh_templates`](bokeh_templates): HTML templates for `bokeh`. This is only relevant, if you want to freeze the Python part with `pyinstaller`. -* [`hooks`](hooks): Necessary hook files. This is only relevant, if you want to freeze the Python part with `pyinstaller`. -* [`main.js`](main.js): Basically the GUI. -* [`package.json`](package.json): Metadata, dependencies, and scripts for the GUI. +* [`bokeh_templates`](bokeh_templates): HTML templates for `bokeh`. This is only relevant, if you want to freeze the scripts with PyInstaller. +* [`hooks`](hooks): Necessary hook files. This is only relevant, if you want to freeze the Python part with PyInstaller. * [`static`](static) and [`templates`](templates): Static files (e.g. images, CSS, etc.) and HTML templates for the `flask` template engine. * [`test`](test): Unittest for `webapp.py`, testing all functions of the application. * [`webapp.py`](webapp.py): Contains 3rd party functions and communicates with the webserver. -* [`webapp.spec`](webapp.spec): The build script for `pyinstaller` containing metadata. 
+* [`topicsexplorer.py`](topicsexplorer.py): A Qt-based UI displaying the contents of the app by running `webapp.py`. +* [`topicsexplorer.spec`](webapp.spec): The build script for PyInstaller containing metadata. ### Troubleshooting -* When installing `electron` fails, try `sudo npm install -g electron --unsafe-perm=true --allow-root`. -* Please use [GitHub Issues](https://github.com/DARIAH-DE/TopicsExplorer/issues). - +* Please use [GitHub issues](https://github.com/DARIAH-DE/TopicsExplorer/issues). -## Creating a build for Layer 1 and 2 -To freeze the Python part with `pyinstaller`, run on macOS: -``` -pyinstaller --onefile --add-data static:static --add-data templates:templates --add-data bokeh_templates:bokeh_templates --additional-hooks-dir hooks webapp.py -``` - -or, for Windows: -``` -pyinstaller --onefile --add-data static;static --add-data templates;templates --add-data bokeh_templates;bokeh_templates --additional-hooks-dir hooks webapp.py -``` -## Creating a build for the whole application -To freeze the Electron part with `electron-builder`, run: +## Creating a standalone build +To freeze the Python scripts with [PyInstaller](http://www.pyinstaller.org/), simply run: ``` -electron-builder +pyinstaller topicsexplorer.spec ``` diff --git a/bokeh_templates/autoload_js.js b/bokeh_templates/autoload_js.js index 2ee6886..b099055 100755 --- a/bokeh_templates/autoload_js.js +++ b/bokeh_templates/autoload_js.js @@ -1,54 +1,43 @@ {# - Renders JavaScript code - for "autoloading". +Renders JavaScript code for "autoloading". - The code automatically and asynchronously loads BokehJS( - if necessary) and - then replaces the AUTOLOAD_TAG `` < script > `` - tag that - calls it with the rendered model. 
+The code automatically and asynchronously loads BokehJS (if necessary) and +then replaces the AUTOLOAD_TAG `` +#} + diff --git a/bokeh_templates/css_resources.html b/bokeh_templates/css_resources.html index 15a000c..b45979d 100755 --- a/bokeh_templates/css_resources.html +++ b/bokeh_templates/css_resources.html @@ -1,24 +1,20 @@ -{# Renders HTML that loads Bokeh CSS according to the configuration in a Resources object. :param css_files: a list of URIs for CSS files to include :type css_files: list[str] :param css_raw: a list of raw CSS snippets to put between `` - {%- endfor %} diff --git a/bokeh_templates/doc_js.js b/bokeh_templates/doc_js.js index adcdf71..1362479 100755 --- a/bokeh_templates/doc_js.js +++ b/bokeh_templates/doc_js.js @@ -1,23 +1,7 @@ -{ % extends "try_run.js" % -} +{% extends "try_run.js" %} -{ % block code_to_run % -} -var docs_json = { - { - docs_json - } -}; -var render_items = { - { - render_items - } -}; -root.Bokeh.embed.embed_items(docs_json, render_items { % - - if app_path - % -}, "{{ app_path }}" { % -endif - % -} { % - - if absolute_url - % -}, "{{ absolute_url }}" { % -endif - % -}); { % endblock % -} +{% block code_to_run %} + var docs_json = {{ docs_json }}; + var render_items = {{ render_items }}; + root.Bokeh.embed.embed_items(docs_json, render_items{%- if app_path -%}, "{{ app_path }}" {%- endif -%}{%- if absolute_url -%}, "{{ absolute_url }}" {%- endif -%}); +{% endblock %} diff --git a/bokeh_templates/doc_nb_js.js b/bokeh_templates/doc_nb_js.js index c46cc32..a4f44bc 100755 --- a/bokeh_templates/doc_nb_js.js +++ b/bokeh_templates/doc_nb_js.js @@ -1,17 +1,7 @@ -{ % extends "try_run.js" % -} +{% extends "try_run.js" %} -{ % block code_to_run % -} -var docs_json = { - { - docs_json - } -}; -var render_items = { - { - render_items - } -}; -root.Bokeh.embed.embed_items_notebook(docs_json, render_items); { % endblock % -} +{% block code_to_run %} + var docs_json = {{ docs_json }}; + var render_items = {{ render_items }}; + 
root.Bokeh.embed.embed_items_notebook(docs_json, render_items); +{% endblock %} diff --git a/bokeh_templates/file.html b/bokeh_templates/file.html index 2e57b22..34891ca 100755 --- a/bokeh_templates/file.html +++ b/bokeh_templates/file.html @@ -1,29 +1,43 @@ -{# Renders Bokeh models into a basic .html file. :param title: value for `` -
Using Settings:
-Bokeh | -version | -{{ bokeh_version }} | -
---|---|---|
BokehJS | -js | -{{ js_info }} | -
css | -{{ css_info }} | -
{{ warning }}
-{%- endfor %} +:param js_info: information about the location, version, etc. of BokehJS code +:type js_info: str + +:param css_info: information about the location, version, etc. of BokehJS css +:type css_info: str + +:param warnings: a list of warnings to display to user +:type warnings: list[str] + +#} + + {%- if verbose %} + +Using Settings:
+Bokeh | +version | +{{ bokeh_version }} | +
---|---|---|
BokehJS | +js | +{{ js_info }} | +
css | +{{ css_info }} | +
{{ warning }}
+ {%- endfor %} diff --git a/bokeh_templates/plot_div.html b/bokeh_templates/plot_div.html index 222d7ec..a1cf3e3 100755 --- a/bokeh_templates/plot_div.html +++ b/bokeh_templates/plot_div.html @@ -1,5 +1,11 @@ -{# Renders a basic plot div, that can be used in conjunction with PLOT_JS. :param elementid: a unique identifier for the `` -Comments are welcome, as are reports of bugs and typos. Please use the project's issue tracker on GitHub.
+Please use the project's issue tracker on GitHub (https://github.com/DARIAH-DE/TopicsExplorer) for comments, bugs, and typos.
For general questions, write an email to Dr. Steffen Pielström; for technical questions, contact Severin Simmler.
For this workflow, you will need a corpus (a set of texts) as plain text (.txt) or TEI XML (.xml). Use the button below to select multiple text files. To gain better results, choose at least five documents (but the more the +
For this workflow, you will need a corpus (a set of texts) as plain text (.txt) or XML (.xml). TEI encoded XML is fully supported to process only the text part. Use the button below to select multiple text files. To gain better results, choose at least five documents (but the more the better).
An important preprocessing step is tokenization. Without identifying tokens, it is difficult to extract necessary information, such as most frequent tokens, also known as stopwords, or token frequencies in general. In this +
An important preprocessing step is tokenization. Without identifying tokens, it is difficult to extract necessary information, such as token frequencies in general, or most frequent tokens, also known as stopwords. In this
application, one token consists of one or more characters, optionally followed by exactly one punctuation mark (a hyphen or something related), followed by one or more characters. For example, the phrase “her father's arm-chair” will be tokenized
as ["her", "father's", "arm-chair"]
.
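The tokenization rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the application's actual tokenizer: the name `tokenize`, the regular expression, and the set of accepted punctuation marks (straight and curly apostrophe, hyphen) are assumptions made for this example.

```python
import re

# Assumed pattern: one or more letters, at most one inner punctuation mark,
# then one or more letters. [^\W\d_] matches any Unicode letter in the
# standard-library `re` module.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+['’\-]?[^\W\d_]+")


def tokenize(text):
    """Return all substrings of `text` that match TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("her father's arm-chair"))
```

Because the pattern requires letters on both sides of the optional punctuation mark, a token in this sketch has at least two letters, so a single-letter word such as "a" is skipped.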
An iteration is a process of repeating the same action multiple times to achieve a specific goal. This is how LDA works. The number of sampling iterations should be a trade-off between the time taken to complete sampling and the quality of the model. The default value produces quite good results, but feel free to increase the number of iterations.
-When using LDA to explore text collections, we are typically interested in examining texts in terms of their constituent topics (instead of word frequencies). Because the number of topics is so much smaller than the number of unique vocabulary elements (say, 10 versus 10,000), a range of data visualization methods become available. As you will see, all of the provided visualizations are interactive, but you will have the ability to save the plots - as a static image file.
+
All parameters, including some corpus statistics, are summed up in the following table. This kind of information might be useful if you create more than one topic model and want to compare the results. The most common way to evaluate a probabilistic model is to measure the log-likelihood (if you are interested in the evaluation of probabilistic models, have a look at Wallach et al. 2009: Evaluation Methods for Topic Models, a mathematical approach). If you increase the number of iterations, your model gets better, and you will see that the log-likelihood also increases up to a certain point. This is how you might find the ideal number of iterations.
{% for table in parameter %} {{ table|safe }} {% endfor %}
+ As you can see, your corpus is much smaller after cleaning. You either defined a threshold for the most frequent words, or selected an external stopwords list. In addition, so-called hapax legomena have been removed. In corpus linguistics, a hapax legomenon is a word that occurs only once within a context. So, if a word occurs only once in a document, it is very likely that the word is semantically insignificant – meaning not useful for the topic modeling algorithm.
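The removal of hapax legomena mentioned above could look roughly like the following sketch. The helper names `find_hapax_legomena` and `remove_tokens` are hypothetical, and counting occurrences across the whole corpus (rather than per document) is an assumption made for this illustration, not necessarily what the application does.

```python
from collections import Counter


def find_hapax_legomena(tokenized_documents):
    """Return the set of tokens that occur exactly once in the whole corpus."""
    counts = Counter(
        token for document in tokenized_documents for token in document
    )
    return {token for token, count in counts.items() if count == 1}


def remove_tokens(tokenized_documents, unwanted):
    """Return a copy of the corpus without the tokens in `unwanted`."""
    return [
        [token for token in document if token not in unwanted]
        for document in tokenized_documents
    ]


# Tiny hypothetical corpus of already tokenized documents.
corpus = [["the", "old", "arm-chair"], ["the", "old", "house"]]
hapax = find_hapax_legomena(corpus)
cleaned = remove_tokens(corpus, hapax)
print(sorted(hapax))  # ['arm-chair', 'house']
print(cleaned)        # [['the', 'old'], ['the', 'old']]
```

The same `remove_tokens` helper would also work for a stopword list, since it only needs a collection of tokens to drop.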
Topic models are unsupervised. They are called unsupervised because you did not provide any labels describing the semantic structures or anything related, only pure word frequencies. Since the examples given to the algorithm are unlabeled, there is no evaluation of the accuracy, or of how good your model is. So, it is now up to you to inspect the model and decide whether you are satisfied with its performance or not. @@ -89,9 +91,11 @@
In the following graphic, you can access one dimension of the information displayed in the heatmap above. This might be a clearer approach if you are interested in a specific topic, or, more precisely, how the topic is distributed over the documents of your corpus. Use the dropdown menu to select a topic. The proportions you can see by default are based on the first topic.
{{ topics_div|safe }}As above, you can access the other dimension displayed in the heatmap. So, if you are interested in a specific document, you can select it via the dropdown menu and inspect its proportions. The bars displayed by default are based on the first document.
{{ documents_div|safe }}We want to empower users with little or no previous experience and programming skills to create custom workflows mostly using predefined functions within a familiar environment. So, if this practical introduction aroused your interest and diff --git a/templates/modeling.html b/templates/modeling.html index 6e74064..335200c 100644 --- a/templates/modeling.html +++ b/templates/modeling.html @@ -49,7 +49,7 @@