You can also join our Discord server!
You can contact the Harmony team at https://harmonydata.ac.uk/, or Thomas Wood at http://fastdatascience.com/.
Read our guide to contributing to Harmony here.
You can raise an issue in the issue tracker, and you can open a pull request.
Please contact us at https://harmonydata.ac.uk/contact or write to thomas@fastdatascience.com if you would like to be involved in the project.
Please visit https://github.com/harmonydata/harmony
- 📰 The code for training the PDF extraction is here: https://github.com/harmonydata/pdf-questionnaire-extraction
Please visit https://github.com/harmonydata/harmony_original
Harmony is a data harmonisation project that uses Natural Language Processing to help researchers make better use of existing data from different studies by supporting them with the harmonisation of various measures and items used in different studies. Harmony is a collaboration project between the University of Ulster, University College London, the Universidade Federal de Santa Maria in Brazil, and Fast Data Science Ltd in London.
You can read more at https://harmonydata.ac.uk.
There is a live demo at: https://harmonydata.ac.uk/app
Harmony compares questions from different instruments by converting them to a vector representation and calculating their similarity. You can read more at https://harmonydata.ac.uk/how-does-harmony-work/
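As a rough illustration of the idea, the sketch below compares toy vectors with cosine similarity. The three-dimensional vectors are made up for the example, and the choice of cosine similarity as the metric is an assumption. In Harmony itself the vectors come from a sentence transformer model and have many more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the embeddings of three questions.
nervous_en = [0.9, 0.1, 0.2]
nervous_pt = [0.8, 0.2, 0.2]
sleep_en = [0.1, 0.9, 0.3]

similar = cosine_similarity(nervous_en, nervous_pt)
different = cosine_similarity(nervous_en, sleep_en)
print(similar > different)  # True: the first pair points in nearly the same direction
```

Questions with similar meaning should end up with vectors pointing in similar directions, so their score is closer to 1.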
Download and install Docker:
- https://docs.docker.com/desktop/install/mac-install/
- https://docs.docker.com/desktop/install/windows-install/
- https://docs.docker.com/desktop/install/linux-install/
Open a Terminal and run
docker run -p 8000:8000 -p 3000:3000 harmonydata/harmonylocal
Then go to http://localhost:3000 in your browser.
If you are a Docker user, you can run Harmony from a pre-built Docker image.
- https://hub.docker.com/repository/docker/harmonydata/harmonyapi - just the Harmony API
- https://hub.docker.com/repository/docker/harmonydata/harmonylocal - Harmony API and React front end
A prerequisite is Tika, which is a PDF parsing library. This must run as a server in Java. We use the Tika Python bindings.
First, clone the API, making sure to clone with --recurse-submodules.
git clone --recurse-submodules git@github.com:harmonydata/harmonyapi.git
The Harmony API includes the harmony repo as a submodule.
After you have cloned the repository, if the folder inside called harmony is empty, or if at any point you get an error like the one below, please check that you cloned with --recurse-submodules, as below:
git clone --recurse-submodules https://github.com/harmonydata/harmonyapi.git
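A quick way to check the clone is to test whether the submodule folder has any content. The helper below is a hypothetical sketch, not part of Harmony. It only assumes that the submodule folder is called harmony, as described above, and it demonstrates the check on a temporary directory rather than a real clone.

```python
import tempfile
from pathlib import Path


def submodule_is_populated(repo_root: str, submodule_name: str = "harmony") -> bool:
    """Return True if the submodule folder exists and contains at least one entry."""
    submodule_dir = Path(repo_root) / submodule_name
    return submodule_dir.is_dir() and any(submodule_dir.iterdir())


# Demonstration on a throwaway directory rather than a real clone.
with tempfile.TemporaryDirectory() as repo_root:
    (Path(repo_root) / "harmony").mkdir()
    empty_result = submodule_is_populated(repo_root)  # folder exists but is empty
    (Path(repo_root) / "harmony" / "README.md").write_text("placeholder")
    populated_result = submodule_is_populated(repo_root)  # folder now has content

print(empty_result, populated_result)  # False True
```

If the check returns False for your clone, running git submodule update --init --recursive inside the repository is the usual Git way to fetch a missing submodule.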
Download and install Java if you don't have it already. Then download Apache Tika from https://tika.apache.org/download.html and run it on your computer:
java -jar tika-server-standard-2.3.0.jar
docker build -t harmonyapi .
Don't forget to map port 80 in the container to port 8080 on your machine:
docker run -p 8080:80 harmonyapi
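You should now be able to visit http://0.0.0.0:8080/docs and view the interactive API documentation. If 0.0.0.0 does not open in your browser, try http://localhost:8080/docs instead.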
If you want to run the Harmony API container and execute Bash commands inside it, you can run:
docker run -it harmonyapi bash
Harmony team only: see details of how Harmony is deployed on-premises here:
https://github.com/harmonydata/harmony_deployment_ulster_private
You can deploy Harmony with Docker Compose - see docker_compose.yml.
When the app runs, it reads the environment variable HARMONY_DATA_PATH, which is set to /data on the production server. That is where you need to put any data files. On your local machine you can set it to any folder you like, e.g. /home/xxx/data/, put the files there, and the app will find them.
The app looks for the following three files in the /data folder, although it will run without them. Together they form a cached version of the Mental Health Catalogue:
mhc_all_metadatas.json
mhc_embeddings.npy
mhc_questions.json
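To see which of these files are in place, you could run a check like the one below. This is a minimal sketch rather than Harmony's own loading code. The function name is hypothetical, and it assumes the folder is taken from HARMONY_DATA_PATH with the home directory as a fallback.

```python
import os
from collections.abc import Mapping
from pathlib import Path

EXPECTED_FILES = ("mhc_all_metadatas.json", "mhc_embeddings.npy", "mhc_questions.json")


def missing_data_files(environ: Mapping[str, str] = os.environ) -> list[str]:
    """List the expected catalogue files that are absent from the data folder."""
    # Assumption: fall back to the home directory when the variable is unset.
    data_path = Path(environ.get("HARMONY_DATA_PATH", str(Path.home())))
    return [name for name in EXPECTED_FILES if not (data_path / name).is_file()]


# A folder that does not exist is reported as missing all three files.
print(missing_data_files({"HARMONY_DATA_PATH": "/nonexistent-harmony-data"}))
```

Passing a dictionary instead of os.environ keeps the check easy to try without changing your real environment.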
When Harmony is deployed to Azure, there is an Azure Blob Storage container which is mounted under /data.
The data files can be found here: https://github.com/harmonydata/harmony_deployment_ulster_private
There are also other environment variables which tell the API where to look to load the sentence transformer or contact Tika:
environment:
  HARMONY_DATA_PATH: /data
  HARMONY_SENTENCE_TRANSFORMER_PATH: /data/paraphrase-multilingual-MiniLM-L12-v2
  OPENAI_API_KEY:
  GOOGLE_APPLICATION_CREDENTIALS:
  AZURE_OPENAI_API_KEY:
  AZURE_OPENAI_ENDPOINT:
  TIKA_SERVER_ENDPOINT: http://tika:9998
- HARMONY_DATA_PATH - The path used to store, for example, the cache files. Defaults to the home directory.
- OPENAI_API_KEY - The OpenAI API key.
- GOOGLE_APPLICATION_CREDENTIALS - To make use of Google's Vertex AI, fill in this environment variable. The value should be the content of your service account file, so a JSON object is expected. Make sure to give the service account the required Vertex AI role.
- AZURE_OPENAI_API_KEY - The Azure OpenAI API key.
- AZURE_OPENAI_ENDPOINT - The Azure OpenAI endpoint.
- TIKA_SERVER_ENDPOINT - The endpoint where Tika is served from.
- AZURE_STORAGE_URL - The Azure Blob Storage URL. This is required for downloading the catalogue data.
Setting these environment variables tells Harmony where to look for its dependencies and data, but it will work without them (for example, it will download the sentence transformer from the HuggingFace Hub).
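Because GOOGLE_APPLICATION_CREDENTIALS is expected to hold the JSON content of the service account file rather than a path to it, it can be worth validating the value before starting the API. The sketch below is an assumption about how such a check might look. The function name is hypothetical and the project_id value is a placeholder; it does not show how Harmony itself reads the variable.

```python
import json
import os
from collections.abc import Mapping


def load_service_account_info(environ: Mapping[str, str] = os.environ):
    """Parse GOOGLE_APPLICATION_CREDENTIALS as a JSON object, or return None if unset."""
    raw = environ.get("GOOGLE_APPLICATION_CREDENTIALS", "").strip()
    if not raw:
        return None
    try:
        info = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(
            "GOOGLE_APPLICATION_CREDENTIALS must contain JSON, not a file path"
        ) from exc
    if not isinstance(info, dict):
        raise ValueError("GOOGLE_APPLICATION_CREDENTIALS must be a JSON object")
    return info


# Placeholder credentials for illustration only.
example = load_service_account_info(
    {"GOOGLE_APPLICATION_CREDENTIALS": '{"type": "service_account", "project_id": "demo"}'}
)
print(example["type"])  # service_account
```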
The deployed Harmony uses an Azure Function to run spaCy, available in the repository here: https://github.com/harmonydata/spacyfunctionapp
So to run locally with Docker Compose you can do:
docker compose up
If you are working with external third-party services in the API, you may find it convenient to create a .env file in the base folder of the project. You can configure an IDE such as PyCharm to use this .env file. It is listed in .gitignore, so you don't need to worry about accidentally committing your credentials to the repo.
Example content of the .env file:
GOOGLE_APPLICATION_CREDENTIALS='{ "type": "service_account", ... }'
AZURE_OPENAI_API_KEY=f46axxxxxxxxxxxxxxxxxxxxxxxxxxxd
AZURE_OPENAI_ENDPOINT=https://xxxxxxxxx.openai.azure.com/
OPENAI_API_TYPE=azure
OPENAI_API_VERSION=2023-12-01-preview
OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxX
When deploying, you can use these environment variables in your Docker run command, e.g.
docker run -d -p 80:80 -p 3000:3000 -e GOOGLE_APPLICATION_CREDENTIALS=xxx -e AZURE_OPENAI_API_KEY=xxxx -e "HARMONY_DATA_PATH=/data" -v /home/thomaswood/data:/data harmonydata/harmonyapi:[DOCKER_TAG_HERE]
If you are not running with Docker, you can run the individual components of the Harmony API separately.
Architecture of the Harmony implementation on Azure with FastAPI:
You can install from PyPI.
pip install harmonydata
You can read the user guide at ./harmony_pypi_package/README.md.
By default, the Harmony API runs on port 8000 (see screenshot below).
If you get errors when running the API on that port, the cause could be:
- a different program is already using port 8000
- you are trying to run on a privileged port, e.g. port 80, which your computer does not give ordinary programs permission to use
On Windows in particular, you may need to grant the Python program permission to use any port.
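To find out whether a port is available before starting the API, you can try to bind a socket to it. The helper below is a hypothetical sketch using only the Python standard library. It treats any bind failure as "not free", which covers both a port that is already in use and a port you lack permission to bind.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Raised for "address already in use" and for "permission denied".
            return False
    return True


# Demonstration: occupy an OS-chosen port, then probe it.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
    listener.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
    listener.listen(1)
    busy_port = listener.getsockname()[1]
    busy_result = port_is_free(busy_port)

print(busy_result)  # False while the listener holds the port
```

Calling port_is_free(8000) before launching tells you whether another program already holds the default port.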
If you want to read in a raw (unstructured) PDF or Excel file, you can do this via a POST request to the REST API. This will convert the file into an Instrument object in JSON.
curl -X 'POST' \
'https://api.harmonydata.ac.uk/text/parse' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '[
{
"file_id": "d39f31718513413fbfc620c6b6135d0c",
"file_name": "GAD-7.pdf",
"file_type": "pdf",
"content": "data:application/pdf;base64,"
}
]'
curl -X 'POST' \
'https://api.harmonydata.ac.uk/text/parse' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '[
{
"file_id": "1d66bce4b80c4b0eaefe33f00cddedef",
"file_name": "GAD-7.xlsx",
"file_type": "xlsx",
"content": "data:application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;base64,UEsDBBQAAAAIAGmwhFZGWsEMggAAALEAAAAQAAAAZG9jUHJvcHMvYXBwLnhtbE2OTQvCMBBE/0rp3W5V8CAxINSj4Ml7SDc2kGRDdoX8fFPBj9s83jCMuhXKWMQjdzWGxKd+EclHALYLRsND06kZRyUaaVgeQM55ixPZZ8QksBvHA2AVTDPOm/wd7LU65xy8NeIp6au3hZicdJdqMSj4l2vzjoXXvB+2b/lhBb+T+gVQSwMEFAAAAAgAabCEVu3qrybuAAAAKwIAABEAAABkb2NQcm9wcy9jb3JlLnhtbM2Sz0rEMBCHX0VybydpRSR0e1E8KQguKN5CMrsbbP6QjLT79qZ1t4voAwi5ZOaXb76BdDpKHRI+pxAxkcV8NbnBZ6njhh2IogTI+oBO5bokfGnuQnKKyjXtISr9ofYIDec34JCUUaRgBlZxJbK+M1rqhIpCOuGNXvHxMw0LzGjAAR16yiBqAayfJ8bjNHRwAcwwwuTydwHNSlyqf2KXDrBTcsp2TY3jWI/tkis7CHh7enxZ1q2sz6S8xvIqW0nHiBt2nvza3t1vH1jf8Kat+HU520ZI3kpx+z67/vC7CLtg7M7+Y+OzYN/Br3/RfwFQSwMEFAAAAAgAabCEVplcnCMQBgAAnCcAABMAAAB4bC90aGVtZS90aGVtZTEueG1s7Vpbc9o4FH7vr9B4Z/ZtC8Y2gba0E3Npdtu0mYTtTh+FEViNbHlkkYR/v0c2EMuWDe2STbqbPAQs6fvORUfn6Dh58+4uYuiGiJTyeGDZL9vWu7cv3uBXMiQRQTAZp6/wwAqlTF61WmkAwzh9yRMSw9yCiwhLeBTL1lzgWxovI9bqtNvdVoRpbKEYR2RgfV4saEDQVFFab18gtOUfM/gVy1SNZaMBE1dBJrmItPL5bMX82t4+Zc/pOh0ygW4wG1ggf85vp+ROWojhVMLEwGpnP1Zrx9HSSICCyX2UBbpJ9qPTFQgyDTs6nVjOdnz2xO2fjMradDRtGuDj8Xg4tsvSi3AcBOBRu57CnfRsv6RBCbSjadBk2PbarpGmqo1TT9P3fd/rm2icCo1bT9Nrd93TjonGrdB4Db7xT4fDronGq9B062kmJ/2ua6TpFmhCRuPrehIVteVA0yAAWHB21szSA5ZeKfp1lBrZHbvdQVzwWO45iRH+xsUE1mnSGZY0RnKdkAUOADfE0UxQfK9BtorgwpLSXJDWzym1UBoImsiB9UeCIcXcr/31l7vJpDN6nX06zmuUf2mrAaftu5vPk/xz6OSfp5PXTULOcLwsCfH7I1thhyduOxNyOhxnQnzP9vaRpSUyz+/5CutOPGcfVpawXc/P5J6MciO73fZYffZPR24j16nAsyLXlEYkRZ/ILbrkETi1SQ0yEz8InYaYalAcAqQJMZahhvi0xqwR4BN9t74IyN+NiPerb5o9V6FYSdqE+BBGGuKcc+Zz0Wz7B6VG0fZVvNyjl1gVAZcY3zSqNSzF1niVwPGtnDwdExLNlAsGQYaXJCYSqTl+TUgT/iul2v6c00DwlC8k+kqRj2mzI6d0Js3oMxrBRq8bdYdo0jx6/gX5nDUKHJEbHQJnG7NGIYRpu/AerySOmq3CEStCPmIZNhpytRaBtnGphGBaEsbReE7StBH8Waw1kz5gyOzNkXXO1pEOEZJeN0I+Ys6LkBG/HoY4SprtonFYBP2eXsNJweiCy2b9uH6G1TNsLI73R9QXSuQPJqc/6TI0B6OaWQm9hFZqn6qHND6oHjIKBfG5Hj7lengKN5bGvFCugnsB/9HaN8Kr+ILAOX8ufc+l77n0PaHStzcjfWfB04tb3kZuW8T7rjHa1zQuKGNXcs3Ix1SvkynYOZ/A7P1oPp7x7frZJISvmlktIxaQS4GzQSS4/IvK8CrECehkWy
UJy1TTZTeKEp5CG27pU/VKldflr7kouDxb5OmvoXQ+LM/5PF/ntM0LM0O3ckvqtpS+tSY4SvSxzHBOHssMO2c8kh22d6AdNfv2XXbkI6UwU5dDuBpCvgNtup3cOjiemJG5CtNSkG/D+enFeBriOdkEuX2YV23n2NHR++fBUbCj7zyWHceI8qIh7qGGmM/DQ4d5e1+YZ5XGUDQUbWysJCxGt2C41/EsFOBkYC2gB4OvUQLyUlVgMVvGAyuQonxMjEXocOeXXF/j0ZLj26ZltW6vKXcZbSJSOcJpmBNnq8reZbHBVR3PVVvysL5qPbQVTs/+Wa3InwwRThYLEkhjlBemSqLzGVO+5ytJxFU4v0UzthKXGLzj5sdxTlO4Ena2DwIyubs5qXplMWem8t8tDAksW4hZEuJNXe3V55ucrnoidvqXd8Fg8v1wyUcP5TvnX/RdQ65+9t3j+m6TO0hMnHnFEQF0RQIjlRwGFhcy5FDukpAGEwHNlMlE8AKCZKYcgJj6C73yDLkpFc6tPjl/RSyDhk5e0iUSFIqwDAUhF3Lj7++TaneM1/osgW2EVDJk1RfKQ4nBPTNyQ9hUJfOu2iYLhdviVM27Gr4mYEvDem6dLSf/217UPbQXPUbzo5ngHrOHc5t6uMJFrP9Y1h75Mt85cNs63gNe5hMsQ6R+wX2KioARq2K+uq9P+SWcO7R78YEgm/zW26T23eAMfNSrWqVkKxE/Swd8H5IGY4xb9DRfjxRiraaxrcbaMQx5gFjzDKFmON+HRZoaM9WLrDmNCm9B1UDlP9vUDWj2DTQckQVeMZm2NqPkTgo83P7vDbDCxI7h7Yu/AVBLAwQUAAAACABpsIRWZJCgEIMBAADfAgAAGAAAAHhsL3dvcmtzaGVldHMvc2hlZXQxLnhtbH1STU/cMBD9K5bPFV52Ba1QEolSIXpotQW1PTvJJLFwPOl4wsK/70xgoz2UHizPl9+beePigPSYBwA2z2NMubQD83TlXG4GGH0+wwmSZDqk0bO41Ls8Efh2eTRGt91sLt3oQ7JVscT2VBU4cwwJ9mTyPI6eXj5DxENpz+0xcB/6gTXgqmLyPTwA/5z2JJ5bUdowQsoBkyHoSnt9fnW90/ql4FeAQz6xjU5SIz6q87Ut7UYbgggNK4KX6wluIEYFkjb+vGHalVIfntpH9Ntldpml9hluMP4OLQ+l/WRNC52fI9/j4Q7e5rlYG/zi2VcF4cGQzlkVjRoL9yKEVIekKj0wSTYIHVc/Zsjab+FYWtGYa+QIygq1XaG274DcAkioNwnoCef8wfj0HBYDyYgW0PbwH4LdSrB7h+A7sqlBKXwdwTCazDgpeoOJCaMug16k4F807kQeXf03T31I2UTohG1z9vHCGnqV89UR7EWxGplxXMxBfiCQFki+Q+Sjo9tc/3T1F1BLAwQUAAAACABpsIRW4O5QR6kCAAAWCwAADQAAAHhsL3N0eWxlcy54bWzdVtuK2zAQ/RXhD6iTmJq4xIE2ECi0ZWH3oa9KLMcCWXJlOST79Z2RHOeymqXtYxM2Hs3RmTOaGeFd9e6sxHMjhGOnVum+TBrnuk9p2u8b0fL+g+mEBqQ2tuUOlvaQ9p0VvOqR1Kp0MZvlaculTtYrPbTb1vVsbwbtymSWpOtVbfTVs0iCA7byVrAjV2Wy4UrurPR7eSvVObgX6NgbZSxzkIookzl6+tcAz8MKsxzjtFIbi840KITf3bj9BvCPHjZIpe4zA8d61XHnhNVbWHiOd76B2Gi/nDtI7WD5eb74mFwJ/gEiO2MrYe9kgmu9UqJ2QLDy0ODTmS5F0DnTglFJfjCa+xwujFsm860rE9dA6S9hHp0Q89EVBB69k8RoQOZ7odQz7vpZT+nPIf1TzUKfv1bYYobVvJhw5tEMYcIC499GC7Fvwi7+KSzr5NG4LwOcR/v1r8E48WRFLU9+faonfSr6nIgOft516vxZyYNuRTj7HwuuV/zCY42x8hXUcAr34B
A2YUdhndyjBxrky3OqxxpN5fHFuiv85GV4ecrkB95JdVVlu0EqJ/W4amRVCf2m/hDe8R1c+rv4sL8SNR+Ue5nAMrna30Ulh7aYdj1hJcZdV/sbzuA8n24uaEldiZOoNuPSHnbeZGCA6vjx8/uAbP0njlCcgMURxCgdKgOKE1iUzv90niV5noBRuS2jyJLkLElOYMWQjf9SOnFOAZ/4SYsiy/KcquhmE81gQ9Utz/EvHo3KDRmUDir9Xa3pbtMT8v4cUD19b0Kok9KTSJ2UrjUi8bohoyji3aZ0kEF1gZod1I/r4EzFOVmGXaVyo24wjRQFheAsxmc0z4nq5PiN94e6JVlWFHEEsXgGWUYheBtphMoAc6CQLPPvwYf3UXp5T6XX/4TXvwFQSwMEFAAAAAgAabCEVpeKuxzAAAAAEwIAAAsAAABfcmVscy8ucmVsc52SuW7DMAxAf8XQnjAH0CGIM2XxFgT5AVaiD9gSBYpFnb+v2qVxkAsZeT08EtweaUDtOKS2i6kY/RBSaVrVuAFItiWPac6RQq7ULB41h9JARNtjQ7BaLD5ALhlmt71kFqdzpFeIXNedpT3bL09Bb4CvOkxxQmlISzMO8M3SfzL38ww1ReVKI5VbGnjT5f524EnRoSJYFppFydOiHaV/Hcf2kNPpr2MitHpb6PlxaFQKjtxjJYxxYrT+NYLJD+x+AFBLAwQUAAAACABpsIRWGrobqzABAAAjAgAADwAAAHhsL3dvcmtib29rLnhtbI1R0UrDQBD8lXAfYFLRgqXpi0UtiBYrfb8km2bp3W3Y27Tar3eTECz44tPezizDzNzyTHwsiI7Jl3ch5qYRaRdpGssGvI031EJQpib2VnTlQxpbBlvFBkC8S2+zbJ56i8GslpPWltPrhQRKQQoK9sAe4Rx/+X5NThixQIfynZvh7cAkHgN6vECVm8wksaHzCzFeKIh1u5LJudzMRmIPLFj+gXe9yU9bxAERW3xYNZKbeaaCNXKU4WLQt+rxBHo8bp3QEzoBXluBZ6auxXDoZTRFehVj6GGaY4kL/k+NVNdYwprKzkOQsUcG1xsMscE2miRYD7kZLA6BdG6qMZyoq6uqeIFK8KYa/U2mKqgxQPWmOlFxLajcctKPQef27n72oEV0zj0q9h5eyVZTxul/Vj9QSwMEFAAAAAgAabCEViQem6KtAAAA+AEAABoAAAB4bC9fcmVscy93b3JrYm9vay54bWwucmVsc7WRPQ6DMAyFrxLlADVQqUMFTF1YKy4QBfMjEhLFrgq3L4UBkDp0YbKeLX/vyU6faBR3bqC28yRGawbKZMvs7wCkW7SKLs7jME9qF6ziWYYGvNK9ahCSKLpB2DNknu6Zopw8/kN0dd1pfDj9sjjwDzC8XeipRWQpShUa5EzCaLY2wVLiy0yWoqgyGYoqlnBaIOLJIG1pVn2wT06053kXN/dFrs3jCa7fDHB4dP4BUEsDBBQAAAAIAGmwhFZlkHmSGQEAAM8DAAATAAAAW0NvbnRlbnRfVHlwZXNdLnhtbK2TTU7DMBCFrxJlWyUuLFigphtgC11wAWNPGqv+k2da0tszTtpKoBIVhU2seN68z56XrN6PEbDonfXYlB1RfBQCVQdOYh0ieK60ITlJ/Jq2Ikq1k1sQ98vlg1DBE3iqKHuU69UztHJvqXjpeRtN8E2ZwGJZPI3CzGpKGaM1ShLXxcHrH5TqRKi5c9BgZyIuWFCKq4Rc+R1w6ns7QEpGQ7GRiV6lY5XorUA6WsB62uLKGUPbGgU6qL3jlhpjAqmxAyBn69F0MU0mnjCMz7vZ/MFmCsjKTQoRObEEf8edI8ndVWQjSGSmr3ghsvXs+0FOW4O+kc3j/QxpN+SBYljmz/h7xhf/G87xEcLuvz+xvNZOGn/mi+E/Xn8BUEsBAhQDFAAAAAgAabCEVkZawQyCAAAAsQAAABAAAAAAAAAAAAAAAIABAAAAAGRvY1Byb3BzL2FwcC54bWxQSw
ECFAMUAAAACABpsIRW7eqvJu4AAAArAgAAEQAAAAAAAAAAAAAAgAGwAAAAZG9jUHJvcHMvY29yZS54bWxQSwECFAMUAAAACABpsIRWmVycIxAGAACcJwAAEwAAAAAAAAAAAAAAgAHNAQAAeGwvdGhlbWUvdGhlbWUxLnhtbFBLAQIUAxQAAAAIAGmwhFZkkKAQgwEAAN8CAAAYAAAAAAAAAAAAAACAgQ4IAAB4bC93b3Jrc2hlZXRzL3NoZWV0MS54bWxQSwECFAMUAAAACABpsIRW4O5QR6kCAAAWCwAADQAAAAAAAAAAAAAAgAHHCQAAeGwvc3R5bGVzLnhtbFBLAQIUAxQAAAAIAGmwhFaXirscwAAAABMCAAALAAAAAAAAAAAAAACAAZsMAABfcmVscy8ucmVsc1BLAQIUAxQAAAAIAGmwhFYauhurMAEAACMCAAAPAAAAAAAAAAAAAACAAYQNAAB4bC93b3JrYm9vay54bWxQSwECFAMUAAAACABpsIRWJB6boq0AAAD4AQAAGgAAAAAAAAAAAAAAgAHhDgAAeGwvX3JlbHMvd29ya2Jvb2sueG1sLnJlbHNQSwECFAMUAAAACABpsIRWZZB5khkBAADPAwAAEwAAAAAAAAAAAAAAgAHGDwAAW0NvbnRlbnRfVHlwZXNdLnhtbFBLBQYAAAAACQAJAD4CAAAQEQAAAAA="
}
]'
Example response from the /text/parse endpoint:
[
{
"file_id": "fd60a9a64b1b4078a68f4bc06f20253c",
"instrument_id": "7829ba96f48e4848abd97884911b6795",
"instrument_name": "GAD-7 English",
"file_name": "GAD-7.pdf",
"file_type": "pdf",
"file_section": "GAD-7 English",
"language": "en",
"study": "MCS",
"sweep": "Sweep 1",
"questions": [
{
"question_no": "1",
"question_intro": "Over the last two weeks, how often have you been bothered by the following problems?",
"question_text": "Feeling nervous, anxious, or on edge",
"options": [
"Not at all",
"Several days",
"More than half the days",
"Nearly every day"
],
"source_page": 0
}
]
}
]
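The request bodies in the curl examples above can also be built in Python. The sketch below wraps the bytes of a file in a base64 data URI and assembles the JSON array for /text/parse. The helper name and the placeholder bytes are assumptions made for illustration; the field names and MIME types are the ones shown in the examples.

```python
import base64
import json
import uuid

MIME_TYPES = {
    "pdf": "application/pdf",
    "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def build_raw_file(file_name: str, file_bytes: bytes) -> dict:
    """Wrap binary file contents in the JSON structure used by /text/parse."""
    file_type = file_name.rsplit(".", 1)[-1].lower()
    if file_type not in MIME_TYPES:
        raise ValueError(f"unsupported binary file type: {file_type}")
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return {
        "file_id": uuid.uuid4().hex,  # 32 hex characters, like the IDs above
        "file_name": file_name,
        "file_type": file_type,
        "content": f"data:{MIME_TYPES[file_type]};base64,{encoded}",
    }


# Placeholder bytes; in practice read them with open(path, "rb").read().
payload = [build_raw_file("GAD-7.pdf", b"%PDF-1.4 placeholder")]
request_body = json.dumps(payload)
print(payload[0]["content"][:28])  # data:application/pdf;base64,
```

The resulting request_body string can be sent as the body of the POST request with the Content-Type header set to application/json.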
You can request the similarities between instruments with a second POST request:
curl -X 'POST' \
'https://api.harmonydata.ac.uk/text/match' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"instruments": [
{
"file_id": "fd60a9a64b1b4078a68f4bc06f20253c",
"instrument_id": "7829ba96f48e4848abd97884911b6795",
"instrument_name": "GAD-7 English",
"file_name": "GAD-7 EN.pdf",
"file_type": "pdf",
"file_section": "GAD-7 English",
"language": "en",
"questions": [
{
"question_no": "1",
"question_intro": "Over the last two weeks, how often have you been bothered by the following problems?",
"question_text": "Feeling nervous, anxious, or on edge",
"options": [
"Not at all",
"Several days",
"More than half the days",
"Nearly every day"
],
"source_page": 0
},
{
"question_no": "2",
"question_intro": "Over the last two weeks, how often have you been bothered by the following problems?",
"question_text": "Not being able to stop or control worrying",
"options": [
"Not at all",
"Several days",
"More than half the days",
"Nearly every day"
],
"source_page": 0
}
]
},
{
"file_id": "fd60a9a64b1b4078a68f4bc06f20253c",
"instrument_id": "7829ba96f48e4848abd97884911b6795",
"instrument_name": "GAD-7 Portuguese",
"file_name": "GAD-7 PT.pdf",
"file_type": "pdf",
"file_section": "GAD-7 Portuguese",
"language": "pt",
"questions": [
{
"question_no": "1",
"question_intro": "Durante as últimas 2 semanas, com que freqüência você foi incomodado/a pelos problemas abaixo?",
"question_text": "Sentir-se nervoso/a, ansioso/a ou muito tenso/a",
"options": [
"Nenhuma vez",
"Vários dias",
"Mais da metade dos dias",
"Quase todos os dias"
],
"source_page": 0
},
{
"question_no": "2",
"question_intro": "Durante as últimas 2 semanas, com que freqüência você foi incomodado/a pelos problemas abaixo?",
"question_text": "Não ser capaz de impedir ou de controlar as preocupações",
"options": [
"Nenhuma vez",
"Vários dias",
"Mais da metade dos dias",
"Quase todos os dias"
],
"source_page": 0
}
]
}
],
"query": "anxiety",
"parameters": {
"framework": "huggingface",
"model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
}
}'
Example response
The response contains a dictionary with three key-value pairs: questions (the questions matched in order), matches (the matrix of matches between all items), and query_similarity (the degree of similarity to the query term).
{
"questions": [
...
],
"matches": [
[
1.0000001192092896,
...
0.9999998807907104
]
],
"query_similarity": [
0.7244994640350342,
...
]
}
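One way to read such a response is to rank pairs of questions from different instruments by their score. The sketch below assumes that matches is a square matrix whose rows and columns follow the order of questions, and that each question carries an instrument_name. The function name and the scores in the toy response are made up for illustration.

```python
def top_cross_instrument_pairs(response: dict, limit: int = 5) -> list[tuple[float, str, str]]:
    """Rank question pairs from different instruments by similarity, highest first."""
    questions = response["questions"]
    matches = response["matches"]
    pairs = []
    for i in range(len(questions)):
        # Upper triangle only: skips self-matches and mirrored duplicates.
        for j in range(i + 1, len(questions)):
            if questions[i]["instrument_name"] == questions[j]["instrument_name"]:
                continue
            pairs.append(
                (matches[i][j], questions[i]["question_text"], questions[j]["question_text"])
            )
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs[:limit]


# Toy response with invented similarity scores.
toy_response = {
    "questions": [
        {"question_text": "Feeling nervous, anxious, or on edge", "instrument_name": "GAD-7 English"},
        {"question_text": "Not being able to stop or control worrying", "instrument_name": "GAD-7 English"},
        {"question_text": "Sentir-se nervoso/a, ansioso/a ou muito tenso/a", "instrument_name": "GAD-7 Portuguese"},
    ],
    "matches": [
        [1.0, 0.6, 0.9],
        [0.6, 1.0, 0.5],
        [0.9, 0.5, 1.0],
    ],
}

best = top_cross_instrument_pairs(toy_response, limit=1)[0]
print(best[1], "<->", best[2])
```

Pairs within the same instrument are skipped because harmonisation is usually about finding equivalent items across instruments.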
This repository also contains code for an alternative serverless deployment on AWS Lambda. The deployment has been divided into four AWS Lambda functions, managed by Terraform. Please refer to the folder serverless_deployment for details.
License: MIT License
- Docker - Used for deployment to the web
- Apache Tika - Used for parsing PDFs to text
- HuggingFace - Used for machine learning
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 - SentenceBERT model
- spaCy - Used for NLP analysis
- NLTK - Used for NLP analysis
- Scikit-Learn - Used for machine learning
- Apache Tika: Apache 2.0 License
- spaCy: MIT License
- NLTK: Apache 2.0 License
- Scikit-Learn: BSD 3-Clause
If you would like to cite the tool alone, you can cite:
Wood, T.A., McElroy, E., Moltrecht, B., Ploubidis, G.B., Scopel Hoffmann, M., Harmony [Computer software], Version 1.0, accessed at https://harmonydata.ac.uk/app. Ulster University (2022)
A BibTeX entry for LaTeX users is
@unpublished{harmony,
AUTHOR = {Wood, T.A. and McElroy, E. and Moltrecht, B. and Ploubidis, G.B. and Scopel Hoffmann, M.},
TITLE = {Harmony (Computer software), Version 1.0},
YEAR = {2022},
Note = {To appear},
url = {https://harmonydata.ac.uk/app}
}
You can also cite the wider Harmony project, which is registered with the Open Science Framework (OSF):
McElroy, E., Moltrecht, B., Scopel Hoffmann, M., Wood, T. A., & Ploubidis, G. (2023, January 6). Harmony – A global platform for contextual harmonisation, translation and cooperation in mental health research. Retrieved from osf.io/bct6k
@misc{McElroy_Moltrecht_Scopel_Hoffmann_Wood_Ploubidis_2023,
title={Harmony - A global platform for contextual harmonisation, translation and cooperation in mental health research},
url={osf.io/bct6k},
publisher={OSF},
author={McElroy, Eoin and Moltrecht, Bettina and Scopel Hoffmann, Mauricio and Wood, Thomas A and Ploubidis, George},
year={2023},
month={Jan}
}
API Version: 2.
Documentation for Harmony API.
Harmony is a tool using AI which allows you to compare items from questionnaires and identify similar content. You can try Harmony at harmonydata.ac.uk/app and you can read our blog at harmonydata.ac.uk/blog/.
CONTACT
NAME: Thomas Wood URL: https://fastdatascience.com
- HEALTH CHECK
  - 1.1 GET /health-check
- INFO
  - 2.1 GET /info/version
- TEXT
  - 3.1 POST /text/parse
  - 3.2 POST /text/match
  - 3.3 POST /text/examples
  - 3.4 GET /text/cache
Health Check
REQUEST
No request parameters
RESPONSE
STATUS CODE - 200: Successful Response
RESPONSE MODEL - application/json
Show Version
REQUEST
No request parameters
RESPONSE
STATUS CODE - 200: Successful Response
RESPONSE MODEL - application/json
Parse Instruments
Parses PDFs, Excel files or text files into Instruments, and identifies the language.
If the file is binary (Excel or PDF), you must supply each file's content as a data URI: the MIME type followed by the bytes in base64 encoding, like the example RawFile in the schema.
If the file is plain text, supply the file content as a standard string.
REQUEST
REQUEST BODY - application/json
[{
Array of object:
file_id string Unique identifier for the file (UUID-4)
file_name string DEFAULT:Untitled file
The name of the input file
file_type* enum ALLOWED:pdf, xlsx, txt, docx
The file type (pdf, xlsx, txt)
content* string The raw file contents
text_content string The plain text content
tables [undefined]
}]
RESPONSE
STATUS CODE - 200: Successful Response
RESPONSE MODEL - application/json
[{
Array of object:
file_id string Unique identifier for the file (UUID-4)
instrument_id string Unique identifier for the instrument (UUID-4)
instrument_name string DEFAULT:Untitled instrument
Human-readable name of the instrument
file_name string DEFAULT:Untitled file
The name of the input file
file_type enum ALLOWED:pdf, xlsx, txt, docx
The file type (pdf, xlsx, txt)
file_section string The sub-section of the file, e.g. Excel tab
study string The study
sweep string The sweep
metadata {
Optional metadata about the instrument (URL, citation, DOI, copyright holder)
}
language enum DEFAULT:en
ALLOWED:de, el, en, es, fr, it, he, ja, ko, pt, ru,
uk, zh, ar, la, tr, af, ak, am, as, ay, az, be, bg,
bho, bm, bn, bs, ca, ceb, ckb, co, cs, cy, da, doi,
dv, ee, eo, et, eu, fa, fi, fil, fy, ga, gd, gl, gn,
gom, gu, ha, haw, hi, hmn, hr, ht, hu, hy, id, ig,
ilo, is, jv, ka, kk, km, kn, kri, ku, ky, lb, lg,
ln, lo, lt, lus, lv, mai, mg, mi, mk, ml, mn,
mni-mtei, mr, ms, mt, my, ne, nl, no, nso, ny, om, or,
pa, pl, ps, qu, ro, rw, sa, sd, si, sk, sl, sm, sn,
so, sq, sr, st, su, sv, sw, ta, te, tg, th, ti, tk,
tl, ts, tt, ug, ur, uz, vi, xh, yi, yo, zh-tw, zu,
yue
The ISO 639-2 (alpha-2) encoding of the instrument language
questions* [{
Array of object:
question_no string Number of the question
question_intro string Introductory text applying to the question
question_text* string Text of the question
options [string]
source_page integer DEFAULT: 0
The page of the PDF on which the question was located, zero-indexed
instrument_id string Unique identifier for the instrument (UUID-4)
instrument_name string Human readable name for the instrument
topics_auto [undefined]
nearest_match_from_mhc_auto {
Automatically identified nearest MHC match
}
}]
}]
STATUS CODE - 422: Validation Error
RESPONSE MODEL - application/json
{
detail [{
Array of object:
loc*
ANY OF
prop
string
prop
integer
msg* string
type* string
}]
}
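A 422 response following this schema can be turned into readable messages by joining each loc path and appending the message. The sketch below is a hypothetical helper, and the msg and type values in the example body are illustrative rather than taken from a real Harmony response.

```python
def format_validation_errors(error_body: dict) -> list[str]:
    """Turn a 422 response body into one readable line per problem."""
    lines = []
    for item in error_body.get("detail", []):
        # loc mixes strings (field names) and integers (array positions).
        location = ".".join(str(part) for part in item["loc"])
        lines.append(f"{location}: {item['msg']} ({item['type']})")
    return lines


# Hypothetical error body for a request whose first file lacks file_type.
example_error = {
    "detail": [
        {"loc": ["body", 0, "file_type"], "msg": "field required", "type": "value_error.missing"}
    ]
}
print(format_validation_errors(example_error)[0])
# body.0.file_type: field required (value_error.missing)
```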
Match
Match instruments
REQUEST
REQUEST BODY - application/json
{
instruments* [{
Array of object:
file_id string Unique identifier for the file (UUID-4)
instrument_id string Unique identifier for the instrument (UUID-4)
instrument_name string DEFAULT:Untitled instrument
Human-readable name of the instrument
file_name string DEFAULT:Untitled file
The name of the input file
file_type enum ALLOWED:pdf, xlsx, txt, docx
The file type (pdf, xlsx, txt)
file_section string The sub-section of the file, e.g. Excel tab
study string The study
sweep string The sweep
metadata {
Optional metadata about the instrument (URL, citation, DOI, copyright holder)
}
language enum DEFAULT:en
ALLOWED:de, el, en, es, fr, it, he, ja, ko, pt, ru, uk,
zh, ar, la, tr, af, ak, am, as, ay, az, be, bg, bho, bm,
bn, bs, ca, ceb, ckb, co, cs, cy, da, doi, dv, ee, eo,
et, eu, fa, fi, fil, fy, ga, gd, gl, gn, gom, gu, ha,
haw, hi, hmn, hr, ht, hu, hy, id, ig, ilo, is, jv, ka,
kk, km, kn, kri, ku, ky, lb, lg, ln, lo, lt, lus, lv,
mai, mg, mi, mk, ml, mn, mni-mtei, mr, ms, mt, my, ne,
nl, no, nso, ny, om, or, pa, pl, ps, qu, ro, rw, sa, sd,
si, sk, sl, sm, sn, so, sq, sr, st, su, sv, sw, ta, te,
tg, th, ti, tk, tl, ts, tt, ug, ur, uz, vi, xh, yi, yo,
zh-tw, zu, yue
The ISO 639-2 (alpha-2) encoding of the instrument language
questions* [{
Array of object:
question_no string Number of the question
question_intro string Introductory text applying to the question
question_text* string Text of the question
options [string]
source_page integer DEFAULT: 0
The page of the PDF on which the question was located, zero-indexed
instrument_id string Unique identifier for the instrument (UUID-4)
instrument_name string Human readable name for the instrument
topics_auto [undefined]
nearest_match_from_mhc_auto {
Automatically identified nearest MHC match
}
}]
}]
query string Search term
parameters {
Parameters on how to match
framework string DEFAULT:huggingface
The framework to use for matching
model string DEFAULT:sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Model
}
}
RESPONSE
STATUS CODE - 200: Successful Response
RESPONSE MODEL - application/json
{
questions* [{
Array of object:
question_no string Number of the question
question_intro string Introductory text applying to the question
question_text* string Text of the question
options [string]
source_page integer DEFAULT: 0
The page of the PDF on which the question was located, zero-indexed
instrument_id string Unique identifier for the instrument (UUID-4)
instrument_name string Human readable name for the instrument
topics_auto [undefined]
nearest_match_from_mhc_auto {
Automatically identified nearest MHC match
}
}]
matches* [{
Array of object:
}]
query_similarity [undefined]
}
STATUS CODE - 422: Validation Error
RESPONSE MODEL - application/json
{
detail [{
Array of object:
loc*
ANY OF
prop
string
prop
integer
msg* string
type* string
}]
}
Get Example Instruments
Get example instruments
REQUEST
No request parameters
RESPONSE
STATUS CODE - 200: Successful Response
RESPONSE MODEL - application/json
[{
Array of object:
file_id string Unique identifier for the file (UUID-4)
instrument_id string Unique identifier for the instrument (UUID-4)
instrument_name string DEFAULT:Untitled instrument
Human-readable name of the instrument
file_name string DEFAULT:Untitled file
The name of the input file
file_type enum ALLOWED:pdf, xlsx, txt, docx
The file type (pdf, xlsx, txt)
file_section string The sub-section of the file, e.g. Excel tab
study string The study
sweep string The sweep
metadata {
Optional metadata about the instrument (URL, citation, DOI, copyright holder)
}
language enum DEFAULT:en
ALLOWED:de, el, en, es, fr, it, he, ja, ko, pt, ru, uk, zh,
ar, la, tr, af, ak, am, as, ay, az, be, bg, bho, bm, bn,
bs, ca, ceb, ckb, co, cs, cy, da, doi, dv, ee, eo, et,
eu, fa, fi, fil, fy, ga, gd, gl, gn, gom, gu, ha, haw,
hi, hmn, hr, ht, hu, hy, id, ig, ilo, is, jv, ka, kk, km,
kn, kri, ku, ky, lb, lg, ln, lo, lt, lus, lv, mai, mg,
mi, mk, ml, mn, mni-mtei, mr, ms, mt, my, ne, nl, no,
nso, ny, om, or, pa, pl, ps, qu, ro, rw, sa, sd, si, sk,
sl, sm, sn, so, sq, sr, st, su, sv, sw, ta, te, tg, th,
ti, tk, tl, ts, tt, ug, ur, uz, vi, xh, yi, yo, zh-tw,
zu, yue
The ISO 639-2 (alpha-2) encoding of the instrument language
questions* [{
Array of object:
question_no string Number of the question
question_intro string Introductory text applying to the question
question_text* string Text of the question
options [string]
source_page integer DEFAULT: 0
The page of the PDF on which the question was located, zero-indexed
instrument_id string Unique identifier for the instrument (UUID-4)
instrument_name string Human readable name for the instrument
topics_auto [undefined]
nearest_match_from_mhc_auto {
Automatically identified nearest MHC match
}
}]
}]
Get Cache
Get all items in cache
REQUEST
No request parameters
RESPONSE
STATUS CODE - 200: Successful Response
RESPONSE MODEL - application/json
{
instruments* [{
Array of object:
file_id string Unique identifier for the file (UUID-4)
instrument_id string Unique identifier for the instrument (UUID-4)
instrument_name string DEFAULT:Untitled instrument
Human-readable name of the instrument
file_name string DEFAULT:Untitled file
The name of the input file
file_type enum ALLOWED:pdf, xlsx, txt, docx
The file type (pdf, xlsx, txt)
file_section string The sub-section of the file, e.g. Excel tab
study string The study
sweep string The sweep
metadata {
Optional metadata about the instrument (URL, citation, DOI, copyright holder)
}
language enum DEFAULT:en
ALLOWED:de, el, en, es, fr, it, he, ja, ko, pt, ru, uk,
zh, ar, la, tr, af, ak, am, as, ay, az, be, bg, bho,
bm, bn, bs, ca, ceb, ckb, co, cs, cy, da, doi, dv, ee,
eo, et, eu, fa, fi, fil, fy, ga, gd, gl, gn, gom, gu,
ha, haw, hi, hmn, hr, ht, hu, hy, id, ig, ilo, is, jv,
ka, kk, km, kn, kri, ku, ky, lb, lg, ln, lo, lt, lus,
lv, mai, mg, mi, mk, ml, mn, mni-mtei, mr, ms, mt, my,
ne, nl, no, nso, ny, om, or, pa, pl, ps, qu, ro, rw,
sa, sd, si, sk, sl, sm, sn, so, sq, sr, st, su, sv,
sw, ta, te, tg, th, ti, tk, tl, ts, tt, ug, ur, uz,
vi, xh, yi, yo, zh-tw, zu, yue
The ISO 639-2 (alpha-2) encoding of the instrument language
questions* [{
Array of object:
question_no string Number of the question
question_intro string Introductory text applying to the question
question_text* string Text of the question
options [string]
source_page integer DEFAULT: 0
The page of the PDF on which the question was located, zero-indexed
instrument_id string Unique identifier for the instrument (UUID-4)
instrument_name string Human readable name for the instrument
topics_auto [undefined]
nearest_match_from_mhc_auto {
Automatically identified nearest MHC match
}
}]
}]
vectors* [{
Array of object:
}]
}
You can cite our validation paper:
McElroy, Wood, Bond, Mulvenna, Shevlin, Ploubidis, Scopel Hoffmann, Moltrecht, Using natural language processing to facilitate the harmonisation of mental health questionnaires: a validation study using real-world data. BMC Psychiatry 24, 530 (2024), https://doi.org/10.1186/s12888-024-05954-2
A BibTeX entry for LaTeX users is
@article{mcelroy2024using,
title={Using natural language processing to facilitate the harmonisation of mental health questionnaires: a validation study using real-world data},
author={McElroy, Eoin and Wood, Thomas and Bond, Raymond and Mulvenna, Maurice and Shevlin, Mark and Ploubidis, George B and Hoffmann, Mauricio Scopel and Moltrecht, Bettina},
journal={BMC Psychiatry},
volume={24},
number={1},
pages={530},
year={2024},
publisher={Springer}
}