Plugins: String parser errors on invalid UTF-8
This extends the example with an error-handling case.
AmmarAbouZor committed Nov 15, 2024
1 parent 49b4dae commit 961d6a9
Showing 3 changed files with 9 additions and 15 deletions.
2 changes: 1 addition & 1 deletion application/apps/indexer/plugins/string_parser/Cargo.toml
@@ -5,7 +5,7 @@ name = "string_parser"
 version = "0.1.0"
 edition = "2021"
 authors = ["Ammar Abou Zor <ammar.abou.zor@accenture.com>"]
-description = "An example of a Chipmunk parser that parses the bytes into UTF-8 strings line by line, including invalid characters"
+description = "An example of a Chipmunk parser that parses the bytes into valid UTF-8 strings line by line"

 [dependencies]
 # TODO AAZ: Use lib from crates.io once the crate is published
2 changes: 1 addition & 1 deletion application/apps/indexer/plugins/string_parser/README.md
@@ -1,5 +1,5 @@
 # String Tokenizer Parser

 This example demonstrates a simple plugin parser in Chipmunk.
-The parser converts the provided bytes into UTF-8 strings, processing them line by line, including any invalid characters.
+The parser converts the provided bytes into UTF-8 strings, processing them line by line, and it will error on encountering invalid UTF-8 characters.
 This example will guide you through creating and configuring a plugin using the provided configurations, as well as utilizing support functions from this library, such as logging.
20 changes: 7 additions & 13 deletions application/apps/indexer/plugins/string_parser/src/lib.rs
@@ -1,3 +1,4 @@
+use core::str;
 use std::{iter, path::PathBuf};

 use memchr::memchr;
@@ -7,27 +8,20 @@ use plugins_api::{
     parser_export,
 };

-/// Simple struct that converts the given bytes into UTF-8 Strings - including
-/// invalid characters, line by line.
+/// Simple struct that converts the given bytes into valid UTF-8 Strings line by line.
 pub struct StringTokenizer;

 impl StringTokenizer {
     /// Converts a slice from the given data to a UTF-8 String, stopping when it hits the first
     /// line-break character to return one line at a time.
     fn parse_line(&self, data: &[u8]) -> Result<ParseReturn, ParseError> {
-        let res = if let Some(line_brk_idx) = memchr(b'\n', data) {
-            let line = String::from_utf8_lossy(&data[..line_brk_idx]);
-            let yld = ParseYield::Message(line.into());
+        let end_idx = memchr(b'\n', data).unwrap_or_else(|| data.len() - 1);

-            ParseReturn::new((line_brk_idx + 1) as u64, Some(yld))
-        } else {
-            let content = String::from_utf8_lossy(data);
-            let yld = ParseYield::Message(content.into());
+        let line = str::from_utf8(&data[..end_idx])
+            .map_err(|err| ParseError::Parse(format!("Conversion to UTF-8 failed. Error {err}")))?;
+        let yld = ParseYield::Message(line.into());

-            ParseReturn::new(data.len() as u64, Some(yld))
-        };
-
-        Ok(res)
+        Ok(ParseReturn::new((end_idx + 1) as u64, Some(yld)))
     }
 }
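
Review note: in the new `parse_line`, when no newline is found `end_idx` falls back to `data.len() - 1`, so `&data[..end_idx]` leaves out the final byte of the buffer, and the subtraction would underflow on an empty slice. The following is a minimal standalone sketch of the same strict UTF-8 line splitting that avoids both cases. It is not the plugin's code: `LineError` and the `(consumed, line)` return shape are hypothetical stand-ins for `ParseError` and `ParseReturn`, and `iter().position` replaces `memchr` so the example needs no external crates.

```rust
/// Hypothetical stand-in for the plugin API's `ParseError`.
#[derive(Debug, PartialEq)]
enum LineError {
    /// No bytes were provided.
    Empty,
    /// The line bytes were not valid UTF-8.
    InvalidUtf8(String),
}

/// Returns `(bytes consumed, line)` for the first line in `data`.
/// The newline itself is consumed but is not part of the returned line.
fn parse_line(data: &[u8]) -> Result<(usize, &str), LineError> {
    if data.is_empty() {
        return Err(LineError::Empty);
    }

    // With a newline: the line ends before it and we consume it as well.
    // Without one: the line is the whole buffer, including its last byte.
    let (end_idx, consumed) = match data.iter().position(|&b| b == b'\n') {
        Some(idx) => (idx, idx + 1),
        None => (data.len(), data.len()),
    };

    // Strict conversion: invalid UTF-8 is reported instead of being replaced.
    let line = std::str::from_utf8(&data[..end_idx])
        .map_err(|err| LineError::InvalidUtf8(format!("Conversion to UTF-8 failed: {err}")))?;

    Ok((consumed, line))
}

fn main() {
    println!("{:?}", parse_line(b"first\nsecond"));
    println!("{:?}", parse_line(b"tail"));
    println!("{:?}", parse_line(&[0xff, b'\n']));
}
```

Whether an empty buffer should be an error or an "incomplete input" signal depends on the plugin API's contract, which this commit does not show; the sketch simply picks an error to keep the example small.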
