Feature/text extraction #11

Merged
merged 30 commits into from
Jun 3, 2024
30 commits
f719a79
Script for text extraction
u21598012 May 30, 2024
7ba5d6f
Update python-app-test.yml
Yeshlen May 31, 2024
bcf7ac4
Merge branch 'feature/text_extraction' of https://github.com/COS301-S…
u21598012 May 31, 2024
1089a44
Basic File Agnostic
u21598012 Jun 1, 2024
c7482ed
Packaged functionality
u21598012 Jun 1, 2024
efe5142
Refactored to class
u21598012 Jun 1, 2024
d0fd69e
Refactored to subsystems
u21598012 Jun 2, 2024
18883b3
Validator unit tests
u21598012 Jun 2, 2024
c653376
unit tests completed
u21598012 Jun 2, 2024
a37ac7a
removed externals
u21598012 Jun 2, 2024
a57204e
changed from void
u21598012 Jun 2, 2024
b84a2b4
requirements.txt
u21598012 Jun 2, 2024
111a47a
Added Language Detector
Yeshlen Jun 2, 2024
19f1d8b
Merge pull request #12 from COS301-SE-2024/feature/lang_detect
u21598012 Jun 2, 2024
9be39c5
class separation
u21598012 Jun 2, 2024
ccc18db
Unit tests for Lang Detection
Yeshlen Jun 3, 2024
4c21e16
Branch Clean Up
Yeshlen Jun 3, 2024
842e493
Branch Clean Up - Mockdata
Yeshlen Jun 3, 2024
d9f83d2
Merge branch 'develop' into feature/text_extraction
Yudi-G Jun 3, 2024
0cf7556
linting fixes and ignoring useless lint warnings
Yudi-G Jun 3, 2024
5ddfda6
Update README.md
u21598012 Jun 3, 2024
2a95cdd
Update README.md
u21598012 Jun 3, 2024
648effb
Update README.md
u21598012 Jun 3, 2024
7696fed
Update README.md
u21598012 Jun 3, 2024
8f3eb40
testing fixes
Yudi-G Jun 3, 2024
339eb1d
testing fixes
Yudi-G Jun 3, 2024
844267b
actions debugging
Yudi-G Jun 3, 2024
01d4963
actions debugging
Yudi-G Jun 3, 2024
dedf72b
actions debugging
Yudi-G Jun 3, 2024
7c5f156
actions debugging
Yudi-G Jun 3, 2024
2 changes: 1 addition & 1 deletion .github/configs/.flake8
@@ -1,2 +1,2 @@
 [flake8]
-extend-ignore = W292, W291, E302
+extend-ignore = W292, W291, E302, W293, E501
6 changes: 5 additions & 1 deletion .github/workflows/lint-tests.yaml
@@ -39,6 +39,10 @@
VALIDATE_JSCPD: false
VALIDATE_NATURAL_LANGUAGE: false
VALIDATE_PYTHON_FLAKE8: false
VALIDATE_PYTHON_MYPY: false
VALIDATE_GITLEAKS: false
VALIDATE_CSHARP: false


# doing this to use the config
run-flake8-lint:
@@ -65,4 +69,4 @@

- name: Run flake8
run: |
-flake8 --config=.github/configs/.flake8
+flake8 --config=.github/configs/.flake8
5 changes: 4 additions & 1 deletion .github/workflows/python-app-test.yml
@@ -48,14 +48,17 @@ jobs:
run: |
python -m pip install --upgrade pip
pip install flake8 pytest pytest-cov
ls -a
cd backend
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi

- name: Test with pytest
run: |
ls -a
cd backend/Document_parser
pytest --cov=. --cov-report=xml

- name: Upload coverage reports to Codecov
uses: codecov/codecov-action@v4.0.1
with:
token: ${{ secrets.CODECOV_TOKEN }}

8 changes: 8 additions & 0 deletions README.md
@@ -13,6 +13,7 @@ The GDPR Data Noncompliance Detector is a software tool designed to identify ins
## Demos

### Demo 1
[Demo 1 Documentation](https://me-qr.com/mobile/pdf/22767945)

### Demo 2

@@ -32,6 +33,13 @@ The GDPR Data Noncompliance Detector is a software tool designed to identify ins
We use [Monday.com](https://tuks247552.monday.com/boards)

## Testing
[Link to the lang_detection_unit_test.py file](https://github.com/COS301-SE-2024/GDPR-data-noncompliance-detector/blob/feature/text_extraction/backend/Document%20Parser/lang_detection_unit_test.py)

[Link to the storage_and_submission_unit_tests.py](https://github.com/COS301-SE-2024/GDPR-data-noncompliance-detector/blob/feature/text_extraction/backend/Document%20Parser/storage_and_submission_unit_tests.py)

[Link to the text_extractor_unit_tests.py](https://github.com/COS301-SE-2024/GDPR-data-noncompliance-detector/blob/feature/text_extraction/backend/Document%20Parser/text_extractor_unit_tests.py)

[Link to the validator_unit_tests.py](https://github.com/COS301-SE-2024/GDPR-data-noncompliance-detector/blob/feature/text_extraction/backend/Document%20Parser/validator_unit_tests.py)

## Team

47 changes: 47 additions & 0 deletions backend/Document_parser/20240603_095415_o.txt
@@ -0,0 +1,47 @@
Personal Information:

Name: Harvey Spectre
Date of Birth: July 18, 1972
Address: 123 Pearson Street, New York, NY
Email Address: harvey.spectre@example.com
Phone Number: (555) 555-1234
National Identification Number: 123-45-6789
IP Address: 192.168.1.100
Social Media Profile: @HarveySpectreLaw (Twitter)

Special Categories of Personal Data:

Genetic Data: Harvey Spectre has opted to undergo genetic testing for ancestry
purposes. The results indicate a diverse genetic background with ancestry tracing
back to European, African, and Middle Eastern regions. Additionally, the genetic
test reveals a predisposition to cardiovascular diseases based on family medical
history.

Biometric Data: Biometric data, including fingerprint scans and facial recognition
data, is collected as part of Harvey Spectre's access control measures for his law
firm. These biometric identifiers ensure secure entry into restricted areas of the
office.

Health Data: Harvey Spectre's medical records detail his health history, including
treatment for a sports-related injury sustained during his college years and
regular check-ups for managing hypertension. Medication records show prescriptions
for blood pressure management and occasional pain relief medication.

Data revealing Racial and Ethnic Origin: Harvey Spectre self-identifies as
biracial, with a mix of Caucasian and African-American heritage. This information
is included in demographic surveys conducted by his workplace and educational
institutions.

Political Opinions: Harvey Spectre is an active member of a political party and has
publicly expressed his views on various political matters through social media
platforms and participation in local rallies.

Religious or Ideological Convictions: While Harvey Spectre's religious affiliation
is not explicitly stated, his actions and statements indicate a secular humanist
worldview, emphasizing ethical principles and personal responsibility.

Trade Union Membership: As a prominent lawyer, Harvey Spectre is not a member of a
trade union. However, he has represented clients involved in labor disputes and
negotiations with trade unions in his legal practice.


18 binary files not shown.
3 changes: 3 additions & 0 deletions backend/Document_parser/copy.txt
@@ -0,0 +1,3 @@
Empty DataFrame
Columns: [Test Text]
Index: []
24 changes: 24 additions & 0 deletions backend/Document_parser/document_parser.py
@@ -0,0 +1,24 @@
# import os
import sys
from validator import validator
from text_extractor import text_extractor
from storage_and_submission import storage_and_submission


class document_parser:
    def __init__(self, file_path):
        self.file_path = file_path
        self.validator = validator()
        self.text_extractor = text_extractor()
        self.storage_and_submission = storage_and_submission()

    def process(self):
        try:
            extension = self.validator.process_file(self.file_path)
            text = self.text_extractor.extract_text_multi(self.file_path, extension)
            output = self.storage_and_submission.submit(text)
        except SystemExit as e:
            print("An error occurred: ", e)
            sys.exit(1)

        return output
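Review note: `process` chains validation, extraction, and storage, and signals failure through `SystemExit`. As a rough sketch only, the same three-step flow could report failures through a dedicated exception so that callers and tests can catch them without the interpreter exiting. `ParseError`, `run_pipeline`, `fake_validate`, and the lambda stand-ins are hypothetical names introduced for illustration; none of them exist in this PR, and the real `validator` class is not shown in this diff.

```python
from typing import Callable


class ParseError(Exception):
    """Hypothetical error type; not part of this PR."""


def run_pipeline(
    file_path: str,
    validate: Callable[[str], str],
    extract: Callable[[str, str], str],
    store: Callable[[str], str],
) -> str:
    """Validate, extract, and store, mirroring document_parser.process."""
    try:
        extension = validate(file_path)
        text = extract(file_path, extension)
        return store(text)
    except (OSError, ValueError) as exc:
        # Wrap expected failures instead of calling sys.exit()
        raise ParseError(f"could not process {file_path!r}: {exc}") from exc


def fake_validate(path: str) -> str:
    # Stand-in validator (assumption): accept only .pdf for this demo
    if not path.endswith(".pdf"):
        raise ValueError("unsupported file type")
    return ".pdf"


result = run_pipeline(
    "report.pdf",
    validate=fake_validate,
    extract=lambda path, ext: f"text from {path}",
    store=lambda text: "20240603_095415_o.txt",
)
print(result)
```

Passing the three steps in as callables keeps the sketch self-contained; in the PR's code they would correspond to the `validator`, `text_extractor`, and `storage_and_submission` instances.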
42 changes: 42 additions & 0 deletions backend/Document_parser/lang_detection.py
@@ -0,0 +1,42 @@
from langdetect import detect_langs
from langcodes import Language


class location_finder:
    def detect_country(self, file):
        try:
            with open(file, 'r', encoding='utf-8') as f:  # inside try: a missing file yields None
                data = f.read()

            languages = detect_langs(data)
            # primary_language = str(languages[0]).split(':')[0]
            # return primary_language

            possible_languages = []

            for language in languages:
                country_code = language.lang  # a language code, despite the variable name
                full_country_name = Language.make(country_code).display_name()
                possible_languages.append((full_country_name, language.prob))

            return possible_languages

        except Exception as e:
            print("Error:", e)
            return None

# def main():

#     file = "../../mock_data/language_data/polish.txt"
#     countries = detect_country(file)

#     if countries:
#         print("Most probable countries of origin:")
#         for country in countries:
#             print(f'Country: {country[0]}, Probability: {country[1]}')

#     else:
#         print("Could not determine the country of origin.")

# if __name__ == "__main__":
#     main()
32 changes: 32 additions & 0 deletions backend/Document_parser/lang_detection_unit_test.py
@@ -0,0 +1,32 @@
import unittest
from lang_detection import location_finder


class TestLocationFinder(unittest.TestCase):

    def setUp(self):
        self.finder = location_finder()

    def test_detect_country_polish(self):
        result = self.finder.detect_country('../mockdata/polish.txt')
        self.assertIsNotNone(result)
        self.assertTrue(any(lang[0] == 'Polish' for lang in result))

    def test_detect_country_dutch(self):
        result = self.finder.detect_country('../mockdata/dutch.txt')
        self.assertIsNotNone(result)
        self.assertTrue(any(lang[0] == 'Dutch' for lang in result))

    def test_detect_country_german(self):
        result = self.finder.detect_country('../mockdata/german.txt')
        self.assertIsNotNone(result)
        self.assertTrue(any(lang[0] == 'German' for lang in result))

    # Detection is expected to fail (return None) for a nonexistent file
    def test_detect_country_invalid_file(self):
        result = self.finder.detect_country('dummy.txt')
        self.assertIsNone(result)


if __name__ == '__main__':
    unittest.main()
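Review note: the tests above load fixtures through `'../mockdata/...'`, which resolves against the current working directory, so they depend on the workflow's `cd backend/Document_parser` step. A minimal sketch of one alternative, assuming the fixtures stay in a `mockdata` directory one level above the test module: build the path from the module's own location. `mock_path` is a hypothetical helper and the `/repo/...` path is an invented example, neither is part of this PR.

```python
from pathlib import Path


def mock_path(test_file: str, name: str) -> Path:
    """Return <parent of the test module's directory>/mockdata/<name>."""
    # Hypothetical helper: resolves fixtures independently of the cwd
    module_dir = Path(test_file).resolve().parent
    return module_dir.parent / "mockdata" / name


# Inside a test module this would be called as mock_path(__file__, "polish.txt")
fixture = mock_path(
    "/repo/backend/Document_parser/lang_detection_unit_test.py",
    "polish.txt",
)
print(fixture)
```

With this, `pytest` could be started from the repository root or from `backend/Document_parser` and find the same files.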
28 changes: 28 additions & 0 deletions backend/Document_parser/main.py
@@ -0,0 +1,28 @@
from document_parser import document_parser
from lang_detection import location_finder
import sys

def main():

    path = input("File Name: ")
    parser = document_parser(path)
    file = parser.process()
    print(file)
    locale_search = location_finder()
    countries = locale_search.detect_country(file)

    if countries:
        print("Most probable countries of origin:")
        for country in countries:
            print(f'Country: {country[0]}, Probability: {country[1]}')

    else:
        print("Could not determine the country of origin.")


if __name__ == "__main__":
    try:
        main()
    except SystemExit as e:
        print("An error occurred: ", e)
        sys.exit(1)
Binary file added backend/Document_parser/requirements.txt
Binary file not shown.
12 changes: 12 additions & 0 deletions backend/Document_parser/storage_and_submission.py
@@ -0,0 +1,12 @@
from datetime import datetime

class storage_and_submission:
    def __init__(self):
        now = datetime.now()
        self.timestamp_str = now.strftime("%Y%m%d_%H%M%S")
        self.filename = f'{self.timestamp_str}_o.txt'

    def submit(self, text):
        with open(self.filename, 'w', encoding='utf-8') as f:  # lang_detection reads this file as UTF-8
            f.write(text)
        return self.filename
28 changes: 28 additions & 0 deletions backend/Document_parser/storage_and_submission_unit_test.py
@@ -0,0 +1,28 @@
import unittest
import os
from datetime import datetime
from storage_and_submission import storage_and_submission


class TestStorageAndSubmission(unittest.TestCase):
    def setUp(self):
        self.storage = storage_and_submission()

    def test_init(self):
        now = datetime.now()
        timestamp_str = now.strftime("%Y%m%d_%H%M%S")
        filename = f'{timestamp_str}_o.txt'
        self.assertEqual(self.storage.timestamp_str, timestamp_str)
        self.assertEqual(self.storage.filename, filename)

    def test_submit(self):
        text = 'Test text'
        self.storage.submit(text)
        with open(self.storage.filename, 'r') as f:
            file_text = f.read()
        self.assertEqual(file_text, text)
        os.remove(self.storage.filename)


if __name__ == '__main__':
    unittest.main()
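Review note: `test_init` calls `datetime.now()` a second time and compares the formatted strings, so it can fail if the clock ticks over to the next second between `setUp` and the assertion. A minimal sketch of one way around that, assuming the constructor could accept an optional timestamp: `TimestampedStore` and its `now` parameter are hypothetical and only stand in for `storage_and_submission`.

```python
from datetime import datetime
from typing import Optional


class TimestampedStore:
    """Hypothetical stand-in for storage_and_submission with an injectable clock."""

    def __init__(self, now: Optional[datetime] = None):
        # Fall back to the real clock when no timestamp is supplied
        now = now or datetime.now()
        self.timestamp_str = now.strftime("%Y%m%d_%H%M%S")
        self.filename = f"{self.timestamp_str}_o.txt"


# A fixed timestamp makes the expected filename deterministic
fixed = datetime(2024, 6, 3, 9, 54, 15)
store = TimestampedStore(now=fixed)
print(store.filename)  # prints 20240603_095415_o.txt
```

The default argument keeps existing call sites unchanged, while a test can pass a fixed `datetime` and assert on an exact filename.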
35 changes: 35 additions & 0 deletions backend/Document_parser/text_extractor.py
@@ -0,0 +1,35 @@
import pandas as pd
from pdfminer.high_level import extract_text
from docx import Document


class text_extractor:
    def __init__(self):
        self.ext = ''

    def extract_text_from_pdf(self, file_path):
        return extract_text(file_path)

    def extract_text_from_docx(self, file_path):
        doc = Document(file_path)
        return ' '.join([paragraph.text for paragraph in doc.paragraphs])

    def extract_data_from_excel(self, file_path):
        df = pd.read_excel(file_path)
        if df.empty:
            column_name = df.columns[0]
            return column_name
        else:
            return df.to_string(index=False)

    def extract_text_multi(self, file_path, extension):
        if extension == '.pdf':
            text = self.extract_text_from_pdf(file_path)
        elif extension == '.docx':
            text = self.extract_text_from_docx(file_path)
        elif extension in ['.xlsx', '.xls']:
            text = self.extract_data_from_excel(file_path)
        else:
            text = None

        return text
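Review note: `extract_text_multi` returns `None` for an unsupported extension, and that value is then handed to `storage_and_submission.submit`, where `f.write(None)` would raise a `TypeError`. A hedged sketch of an alternative, assuming a lookup table is acceptable here: map extensions to handlers and raise a clear error for anything else. The `read_*` functions are placeholders that only echo their input, and `EXTRACTORS` is a hypothetical name; the real handlers would be the `extract_*` methods in this file.

```python
from typing import Callable, Dict


def read_pdf(path: str) -> str:
    # Placeholder for extract_text_from_pdf
    return f"pdf:{path}"


def read_docx(path: str) -> str:
    # Placeholder for extract_text_from_docx
    return f"docx:{path}"


def read_excel(path: str) -> str:
    # Placeholder for extract_data_from_excel
    return f"excel:{path}"


# Hypothetical dispatch table: extension -> handler
EXTRACTORS: Dict[str, Callable[[str], str]] = {
    ".pdf": read_pdf,
    ".docx": read_docx,
    ".xlsx": read_excel,
    ".xls": read_excel,
}


def extract_text_multi(path: str, extension: str) -> str:
    """Dispatch on the extension; fail loudly for unsupported types."""
    try:
        handler = EXTRACTORS[extension.lower()]
    except KeyError:
        raise ValueError(f"unsupported extension: {extension!r}") from None
    return handler(path)


print(extract_text_multi("report.pdf", ".pdf"))
```

Adding a format then becomes a one-line change to the table, and callers never receive `None` in place of text.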
64 changes: 64 additions & 0 deletions backend/Document_parser/text_extractor_unit_test.py
@@ -0,0 +1,64 @@
import unittest
from text_extractor import text_extractor
import tempfile
import os
from openpyxl import Workbook
from docx import Document
from reportlab.pdfgen import canvas
# import pandas as pd


class TestTextExtractor(unittest.TestCase):
    def setUp(self):
        self.extractor = text_extractor()

    def test_extract_text_from_pdf(self):
        with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as temp:
            c = canvas.Canvas(temp.name)
            c.drawString(100, 750, "Test text")
            c.save()

        result = self.extractor.extract_text_from_pdf(temp.name)
        self.assertEqual(result, 'Test text\n\n\x0c')
        os.remove(temp.name)

    def test_extract_text_from_docx(self):
        doc = Document()
        doc.add_paragraph('Test text')

        with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as temp:
            doc.save(temp.name)
            result = self.extractor.extract_text_from_docx(temp.name)
            self.assertEqual(result, 'Test text')

        os.remove(temp.name)

    def test_extract_data_from_excel(self):
        wb = Workbook()
        ws = wb.active
        ws['A1'] = 'Test text'

        with tempfile.NamedTemporaryFile(suffix=".xls", delete=False) as temp:
            wb.save(temp.name)
            result = self.extractor.extract_data_from_excel(temp.name)
            expected_result = 'Test text'
            self.assertEqual(result, expected_result)

        os.remove(temp.name)

    def test_extract_text_multi(self):
        wb = Workbook()
        ws = wb.active
        ws['A1'] = 'Test text'

        with tempfile.NamedTemporaryFile(suffix=".xlsx", delete=False) as temp:
            wb.save(temp.name)
            result = self.extractor.extract_text_multi(temp.name, '.xlsx')  # exercise the dispatcher
            expected_result = 'Test text'
            self.assertEqual(result, expected_result)

        os.remove(temp.name)


if __name__ == '__main__':
    unittest.main()