feat: Knowledge base import supports zip, xls, xlsx, and csv formats, while knowledge base export supports zip format #1869

Conversation
Adding the `do-not-merge/release-note-label-needed` label because no release-note block was detected. Please follow our release-note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
```python
file_name: str = file.name.lower()
if file_name.endswith(".zip") or file_name.endswith(".ZIP"):
    return True
return False
```
The code appears to be well-structured and follows good practices such as handling exceptions, using appropriate type annotations, and incorporating comments. However, there are a few areas where improvements can be made:

- Code Duplication: The `parse_md_image` function from the `common.util.common` module is used within both `file_to_paragraph` and `get_image_list`. This could be refactored into a separate utility class or method.
- Variable Naming: Some variables have ambiguous names (`buffer`, `value`, `list`) that would benefit from more descriptive naming.
- Exception Handling: While present, the exception handling is minimal. Consider logging detailed error messages for better debugging (see the sketch after this list).
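On the exception-handling point, a logged variant of the catch-and-continue pattern might look like the following; the `safe_parse` wrapper and its logger name are illustrative, not code from this PR:

```python
import logging

logger = logging.getLogger(__name__)

def safe_parse(handle, file):
    """Run one handler, logging the full traceback instead of swallowing it."""
    try:
        return handle(file)
    except Exception:
        # logger.exception records the stack trace for later debugging
        logger.exception("Failed to parse %s", getattr(file, 'name', '<unknown>'))
        return None
```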
Here's an updated version with some of these improvements:
```python
# coding=utf-8
"""
@project: maxkb
@Author: 虎
@file: text_split_handle.py
@date: 2024/3/27 18:19
@desc:
"""
import io
import re
import uuid
import zipfile
from typing import List, Dict, Tuple
from urllib.parse import urljoin

from django.db.models import QuerySet

from common.handle.base_parse_qa_handle import BaseParseQAHandle
from common.handle.impl.qa.csv_parse_qa_handle import CsvParseQAHandle
from common.handle.impl.qa.xls_parse_qa_handle import XlsParseQAHandle
from common.handle.impl.qa.xlsx_parse_qa_handle import XlsxParseQAHandle
from common.util.common_utils import parse_md_images_from_text
from dataset.models import Image


class FileBufferHandle:
    buffer = None

    def get_buffer(self, file):
        # Cache the upload's bytes so several handlers can inspect it
        if self.buffer is None:
            self.buffer = file.read()
        return self.buffer


split_handles = [XlsParseQAHandle(), XlsxParseQAHandle(), CsvParseQAHandle()]


def save_inner_image(image_list):
    """
    Persist the images collected by the sub-handlers.
    @param image_list: Image instances to save
    @return: None
    """
    if image_list is not None and len(image_list) > 0:
        QuerySet(Image).bulk_create(image_list)


def file_to_paragraph(file):
    """
    Convert a file into a paragraph list.
    @param file: the uploaded file
    @return: {
        name: file name
        paragraphs: paragraph list
    }
    """
    get_buffer = FileBufferHandle().get_buffer
    for split_handle in split_handles:
        if split_handle.support(file, get_buffer):
            return split_handle.handle(file, get_buffer, save_inner_image)
    raise Exception("Unsupported file format")


def is_valid_uuid(uuid_str: str) -> bool:
    """
    Check whether a string is a valid UUID.
    @param uuid_str: the string to check
    @return: bool
    """
    try:
        uuid.UUID(uuid_str)
    except ValueError:
        return False
    return True


def extract_image_paths(text: str, root_url: str = '') -> List[Dict[str, str]]:
    """
    Extract every image path referenced in the text.
    @param text: markdown text to scan
    @param root_url: URL prefix to strip from absolute paths
    @return: [{source_file: path, image_id: id}]
    """
    image_paths = []
    for img_path in parse_md_images_from_text(text):
        if root_url and img_path.startswith(root_url):
            img_path = img_path[len(root_url):]
        # Normalize "./" and "../" prefixes so the path can be matched
        # against archive entries later on
        normalized = re.sub(r'^(\./|\.\./)+', '', img_path.lstrip('/'))
        image_paths.append({'source_file': normalized, 'image_id': str(uuid.uuid1())})
    return image_paths


def get_image_urls_and_ids(images_data: List[Dict], base_url: str = '/') -> Tuple[List[str], List[str]]:
    """
    Turn the extracted image paths and ids into URL and UUID lists.
    @param images_data: image data, [{source_file: path, image_id: id}]
    @return: (urls, ids)
    """
    urls = [urljoin(base_url, img['source_file']) for img in images_data]
    ids = [img['image_id'] for img in images_data]
    return urls, ids


class CustomBaseParseQAHandle(BaseParseQAHandle):
    def handle(self, file, get_buffer, save_image):
        buffer = get_buffer(file)
        bytes_io = io.BytesIO(buffer)
        if self.is_zip(file):
            with zipfile.ZipFile(bytes_io, 'r') as zip_ref:
                result = self._handle_zip(zip_ref)
        else:
            result = self._handle_non_zip(file=file, get_buffer=get_buffer)
        # Scan every paragraph for image references and register placeholder
        # Image rows (the binary content is attached elsewhere, hence image=None)
        text = '\n'.join(str(p.get('content', '')) for item in result.values()
                         for p in item.get('paragraphs', []))
        image_paths_data = extract_image_paths(text, root_url='/')
        _, image_ids = get_image_urls_and_ids(image_paths_data)
        save_image([Image(id=image_id, image=None) for image_id in image_ids])
        return result

    def is_zip(self, file):
        # name is lower-cased first, so one suffix check also covers ".ZIP"
        return file.name.lower().endswith('.zip')

    def _handle_non_zip(self, file, get_buffer):
        value = file_to_paragraph(file)
        return {value.get('name', file.name): value}

    def _handle_zip(self, zip_ref: zipfile.ZipFile) -> Dict:
        result = {}
        for name in zip_ref.namelist():
            if name.endswith('/'):
                continue  # skip directory entries
            try:
                with zip_ref.open(name) as member:
                    value = file_to_paragraph(member)
            except Exception:
                continue  # no handler supports this entry
            if isinstance(value, list):
                for item in value:
                    # Key each parsed document by its own name when present
                    result[item.get('name', name.split('.')[0])] = item
            elif 'name' in value:
                result[value['name']] = value
            else:
                # No explicit name: fall back to the archive entry's stem
                result[name.split('.')[0]] = value
        # Filter out empty results
        return {k: v for k, v in result.items() if v.get('paragraphs')}


if __name__ == "__main__":
    # Example usage
    # Implement logic here to process the QA handling
    pass
```
Key Changes:
- Utilization of Common Utility Functions: Combined similar functionality from `file_to_paragraph` and other modules into `extract_image_paths`.
- Description Refinement: Improved the docstrings and variable names for clarity.
- Error Handling: Added basic error handling when reading ZIP files.
- Custom Parse Class: Created a custom class `CustomBaseParseQAHandle` that extends `BaseParseQAHandle`.

These changes should improve maintainability and readability while ensuring robustness to different input scenarios.
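For context, a minimal sketch of how the refactored handler could be driven from an import path; the `import_qa_dataset` wrapper is illustrative and not part of this PR:

```python
def import_qa_dataset(uploaded_file):
    # One entry point for both .zip archives and single xls/xlsx/csv uploads;
    # handle() dispatches internally via is_zip()
    handler = CustomBaseParseQAHandle()
    return handler.handle(uploaded_file, FileBufferHandle().get_buffer, save_inner_image)
```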
```python
        return False

    def get_content(self, file, save_image):
        return ""
```
Overall, the provided code looks generally well-structured and follows best practices. However, there are a few areas that require attention:

Issues to Consider

- Code Coverage: The class `ZipSplitHandle` appears to be incomplete, since most of its methods (`support`, `handle`, and `get_content`) have no implementation.
- Exception Handling: Ensure proper exception handling throughout the codebase to prevent crashes due to unexpected errors.
- Resource Management: Use context managers (like `with zipfile.ZipFile(...)`) for resources such as file streams to ensure they are properly closed after use (see the sketch after this list).
- Performance Optimization: Review the performance of operations such as reading from zip files and processing content. Consider more efficient parsing techniques where applicable.
- Error Messages: Improve error messages to make them more informative and actionable.
- Documentation: Provide clear documentation for each function parameter and class attribute.
- Type Hints: Some type hints are missing; adding them improves readability and helps catch potential issues earlier.
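For the resource-management bullet, the context-manager form guarantees the archive is closed even when parsing raises; a minimal self-contained sketch:

```python
import io
import zipfile

def list_archive_entries(buffer: bytes) -> list:
    # The archive is closed automatically when the with-block exits,
    # even if namelist() or any later processing raises
    with zipfile.ZipFile(io.BytesIO(buffer), 'r') as zip_ref:
        return zip_ref.namelist()
```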
Here's an improved version with some suggested changes:
```python
# coding=utf-8
"""
@project: maxkb
@Author: 虎
@file: text_split_handle.py
@date: 2024/3/27 18:19
@desc:
"""
import io
import logging
import re
import uuid
import zipfile
from typing import List
from urllib.parse import urljoin
from uuid import uuid1

from django.db.models import QuerySet
from lxml import etree

from common.handle.base_split_handle import BaseSplitHandle
from common.handle.impl.csv_split_handle import CsvSplitHandle
from common.handle.impl.doc_split_handle import DocSplitHandle
from common.handle.impl.html_split_handle import HTMLSplitHandle
from common.handle.impl.pdf_split_handle import PdfSplitHandle
from common.handle.impl.text_split_handle import TextSplitHandle
from common.handle.impl.xls_split_handle import XlsSplitHandle
from common.handle.impl.xlsx_split_handle import XlsxSplitHandle
from common.util.common import parse_md_image
from dataset.models import Image

logger = logging.getLogger(__name__)


class FileBufferHandle:
    def __init__(self):
        self.buffer = None

    def get_buffer(self, file):
        # Cache the raw bytes so every handler can re-read the upload
        if self.buffer is None:
            self.buffer = file.read()
        return self.buffer


default_split_handle = TextSplitHandle()
split_handles = [HTMLSplitHandle(), DocSplitHandle(), PdfSplitHandle(), XlsxSplitHandle(), XlsSplitHandle(),
                 CsvSplitHandle(),
                 default_split_handle]


def save_inner_images(images):
    if images:
        QuerySet(Image).bulk_create(images)


def file_to_paragraph(file, pattern_list: List, with_filter: bool, limit: int):
    get_buffer = FileBufferHandle().get_buffer
    for split_handle in split_handles:
        if split_handle.support(file, get_buffer):
            return split_handle.handle(file, pattern_list, with_filter, limit, get_buffer, save_inner_images)
    raise Exception("Unsupported file format")


def is_valid_uuid(uuid_str: str) -> bool:
    try:
        uuid.UUID(uuid_str)
    except ValueError:
        return False
    return True


def get_image_list(result_list: list, zip_files: List[str]) -> list:
    image_file_list = []
    for result in result_list:
        content: str = result.get('content', '')
        for image in parse_md_image(content):
            search_result = re.search(r"\((.*?)\)", image)
            if not search_result:
                continue
            source_image_path = search_result.group(1)
            # Resolve the link relative to the document's location in the archive
            image_path = urljoin(result.get('name'),
                                 '.' + source_image_path if source_image_path.startswith('/') else source_image_path)
            if image_path not in zip_files:
                continue
            if image_path.startswith(("api/file/", "api/image/")):
                image_id = image_path.replace("api/file/", "").replace("api/image/", "")
                if is_valid_uuid(image_id):
                    # The document already references an uploaded image by id
                    image_file_list.append({"source_file": image_path, "image_id": image_id})
                    continue
            # Otherwise mint a fresh id and rewrite the markdown link to it
            new_image_id = str(uuid1())
            image_file_list.append({"source_file": image_path, "image_id": new_image_id})
            content = content.replace(source_image_path, f"/api/image/{new_image_id}")
            result['content'] = content
    return image_file_list


def filter_image_file(result_list: list, image_list: list) -> list:
    image_source_file_list = {img.get('source_file') for img in image_list}
    return [res for res in result_list if res.get('name', '') not in image_source_file_list]


class ZipSplitHandle(BaseSplitHandle):
    def handle(self, file, pattern_list: List, with_filter: bool, limit: int, get_buffer, save_image):
        buffer = get_buffer(file)
        bytes_io = io.BytesIO(buffer)
        # ZipFile only accepts 'r'/'w'/'a'/'x'; the context manager also
        # guarantees the archive is closed even if parsing raises
        with zipfile.ZipFile(bytes_io, 'r') as zip_ref:
            zip_files = zip_ref.namelist()
            result_list = []
            for name in zip_files:
                if name.endswith('/'):
                    continue  # skip directory entries
                try:
                    with zip_ref.open(name) as member:
                        value = file_to_paragraph(member, pattern_list, with_filter, limit)
                    result_list.extend(value if isinstance(value, list) else [value])
                except Exception:
                    # Not a parsable document; it may be an image referenced elsewhere
                    continue
            # Register referenced images and drop them from the document list
            image_file_list = get_image_list(result_list, zip_files)
            result_list = filter_image_file(result_list, image_file_list)
            images = []
            for image_file in image_file_list:
                with zip_ref.open(image_file['source_file']) as image_ref:
                    # Assumes the Image model stores the raw bytes directly
                    images.append(Image(id=image_file['image_id'],
                                        image_name=image_file['source_file'].split('/')[-1],
                                        image=image_ref.read()))
            save_image(images)
            return result_list

    def support(self, file, get_buffer) -> bool:
        # name is lower-cased first, so one suffix check also covers ".ZIP"
        return file.name.lower().endswith(".zip")

    def get_content(self, file, save_image=None) -> str:
        """A zip archive has no single text body to return."""
        return ""


# Helper functions
def extract_from_html(current_context: str, file_name: str, out_list: list):
    """Collect the text of each <div> in an HTML document (rough sketch)."""
    parser = etree.HTMLParser(recover=True)
    root = etree.fromstring(current_context, parser=parser)
    for div in root.xpath('//div'):
        text = (div.text or '').strip()
        if text:
            out_list.append({'type_': 'text', 'name': file_name, 'content': [text]})


def extract_text(current_context: str, file_name: str, out_list: list):
    """Split plain text into paragraphs on blank lines."""
    for block in re.split(r'\n\s*\n', current_context):
        block = block.strip()
        if block:
            out_list.append({'type_': 'text', 'name': file_name, 'content': [block]})


def extract_table(zip_ref, file_name: str, limit: int = 10) -> list:
    """Extract tables from a PDF entry. PyPDF2 has no table-extraction API,
    so this sketch assumes pdfplumber is available instead."""
    import pdfplumber
    table_data = []
    if not file_name.lower().endswith('.pdf'):
        return table_data
    try:
        with zip_ref.open(file_name) as file_obj, \
                pdfplumber.open(io.BytesIO(file_obj.read())) as pdf:
            for page in pdf.pages:
                for table in page.extract_tables():
                    rows = [' | '.join(cell or '' for cell in row) for row in table]
                    table_data.append({'type_': 'table', 'content': [{'content': '\n'.join(rows)}]})
                    if len(table_data) >= limit:
                        return table_data
    except Exception as e:
        logger.error(f'Failed while parsing table extraction for {file_name}: {e}')
    return table_data
```
Summary of Changes:
- Added the missing method bodies (`support`, `get_content`) in `ZipSplitHandle`, and gave `FileBufferHandle` a proper `__init__`.
- Used `io.BytesIO` instead of manually copying memory over twice, for faster I/O.
- Closed unnecessary file handles and replaced the temporary-file creation with reading bytes straight from the archive entry.
- Replaced the manual dictionary-map creation with direct usage within `extract_table`.
- Provided comments explaining the logic used in extracting tables and other sections of the document.
- Cleaned up redundant lines and removed unused references for clarity.

This should address the identified issues and provide a robust solution for handling zipped documents.
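As a quick sanity check, the handler can be exercised against an in-memory archive; `NamedBytesIO` below is an illustrative stand-in for Django's `UploadedFile`, not part of the PR:

```python
import io
import zipfile

class NamedBytesIO(io.BytesIO):
    """BytesIO plus the .name attribute the split handlers expect."""
    def __init__(self, data: bytes, name: str):
        super().__init__(data)
        self.name = name

# Build a tiny archive containing one text document
raw = io.BytesIO()
with zipfile.ZipFile(raw, 'w') as zf:
    zf.writestr('notes/demo.txt', 'first paragraph\n\nsecond paragraph')

upload = NamedBytesIO(raw.getvalue(), 'demo.zip')
handle = ZipSplitHandle()
assert handle.support(upload, FileBufferHandle().get_buffer)
result = handle.handle(upload, pattern_list=[], with_filter=False, limit=4096,
                       get_buffer=FileBufferHandle().get_buffer,
                       save_image=lambda images: None)
```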
```python
file_name: str = file.name.lower()
if file_name.endswith(".xlsx"):
    return True
return False
```
Your provided code seems to have several issues and could be optimized for better performance and functionality:

- Imports: There's an unnecessary import of `io` in combination with `openpyxl`, which is already imported at the beginning.
- Excel File Validation: The code checks the file extension correctly using `.endswith(".xlsx")`. This is good practice.
- Image Embedding: The `xlsx_embed_cells_images` function is called only once per workbook, but it returns a dictionary that might need processing again when handling individual sheets (if there are sheets besides "Sheet1").
- Handling Sheets: When dealing with multiple worksheets in a single Excel file, it is more efficient to loop through all worksheets instead of hard-coding a check for "Sheet1" (see the sketch after this list).
- Return Values: Some lines return dictionaries directly without checking that they contain valid data.
- Potential Errors: Handling exceptions properly across different parts of the code makes error messages cleaner and aids debugging.
- Code Duplication: Repeated code snippets, such as formatting table rows into Markdown text, should ideally be encapsulated into reusable functions.
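For the sheet-handling bullet above, iterating `workbook.worksheets` covers every tab without naming any of them; `example.xlsx` here is a hypothetical file:

```python
import openpyxl

workbook = openpyxl.load_workbook("example.xlsx")  # hypothetical workbook
for worksheet in workbook.worksheets:
    # Visits every sheet, not just the one named "Sheet1"
    print(worksheet.title, worksheet.max_row)
```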
Here’s an improved version of your script:
```python
# coding=utf-8
"""
@project: maxkb
@Author: 虎
@file: xlsx_parse_qa_handle.py
@date: 2024/5/21 14:59
@desc:
"""
import io
import logging
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List

import openpyxl

from common.handle.impl.tools import xlsx_embed_cells_images


def post_cell(value) -> str:
    """Escape a cell value so it can sit inside a Markdown table."""
    value = '' if value is None else str(value).strip()
    if '[图片]' in value:
        # "[图片]" marks an embedded-image placeholder; the id follows it
        # (assumed format "...[图片]_<id>_...", images served at /api/image/)
        image_id = value.split('[图片]')[1].split('_')[1]
        return f"![](/api/image/{image_id})"
    # Escape the characters that would break the Markdown table layout
    return (value.replace('<', '&lt;').replace('>', '&gt;')
            .replace('|', '&#124;').replace('\n', '<br>'))


def row_to_md(row) -> str:
    """Render one worksheet row as a Markdown table row."""
    return '| ' + ' | '.join(post_cell(cell.value) for cell in row) + ' |'


def sheet_to_paragraphs(worksheet, limit: int) -> List[dict]:
    """Turn a worksheet into paragraphs: the first row becomes the table
    header and the remaining rows are grouped into pages of at most
    `limit` characters each."""
    rows = list(worksheet.iter_rows())
    if not rows:
        return []
    header = row_to_md(rows[0])
    separator = '| ' + ' | '.join('----' for _ in rows[0]) + ' |'
    base_size = len(header) + len(separator)
    paragraphs, page, page_size = [], [], base_size
    for row in rows[1:]:
        row_text = row_to_md(row)
        if page and page_size + len(row_text) > limit:
            # Current page is full: flush it and start a new one
            paragraphs.append({'title': worksheet.title,
                               'content': '\n'.join([header, separator] + page)})
            page, page_size = [], base_size
        page.append(row_text)
        page_size += len(row_text)
    if page:
        paragraphs.append({'title': worksheet.title,
                           'content': '\n'.join([header, separator] + page)})
    return paragraphs


def load_worksheet(file_path: Path, limit: int, save_image):
    """Load one workbook from disk and convert every sheet to paragraphs."""
    result = {"path": file_path.stem, "results": [], "success": False}
    try:
        data = file_path.read_bytes()
        workbook = openpyxl.load_workbook(io.BytesIO(data))
        # Persist any images embedded in cells before the cells are rendered
        # (assumes xlsx_embed_cells_images accepts the raw workbook bytes)
        image_dict = xlsx_embed_cells_images(io.BytesIO(data))
        save_image(list(image_dict.values()))
        for worksheet in workbook.worksheets:
            # Walk every worksheet instead of hard-coding "Sheet1"
            result["results"].extend(sheet_to_paragraphs(worksheet, limit))
        result["success"] = True
    except Exception as e:
        logging.error(f"While reading Excel file '{file_path}': {e}")
    return result


def handle_files(file_paths: List[Path], limit: int, save_image):
    """Process several workbooks in parallel; each workbook is independent."""
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(load_worksheet, path, limit, save_image)
                   for path in file_paths]
        results = [future.result() for future in futures]
    return [r for r in results if r["success"]]


def main():
    # Example usage
    folder_path = Path('./path/to/excel/files')  # Replace with the actual directory containing Excel files
    limit = 3000
    processed = handle_files(sorted(folder_path.glob('*.xlsx')), limit,
                             save_image=lambda images: None)
    print(f"Parsed {len(processed)} workbook(s)")


if __name__ == '__main__':
    main()
```
Key Changes:
- Improved Error Handling: Added exception handling around each major operation.
- Parallel Processing: Utilized `ThreadPoolExecutor` to process multiple files simultaneously.
- Function Encapsulation: Separated responsibilities into smaller functions (`load_worksheet`, `sheet_to_paragraphs`).
- Size-Aware Table Splitting: Implemented logic to page table rows by the character `limit` during rendering.

This script attempts to address some recurring problems such as ensuring robustness, efficiency, and modularity when handling large datasets in Excel spreadsheets. Ensure you replace `'./path/to/excel/files'` with the correct directory path for your use case.
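As a closing example, the paging helper sketched above can be exercised on an in-memory workbook (`sheet_to_paragraphs` as defined in the revised script):

```python
import openpyxl

wb = openpyxl.Workbook()
ws = wb.active
ws.append(['question', 'answer'])
ws.append(['What formats can be imported?', 'zip, xls, xlsx and csv'])

# With a generous limit the whole sheet fits on one page
for paragraph in sheet_to_paragraphs(ws, limit=3000):
    print(paragraph['title'])
    print(paragraph['content'])
```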