ParseJS

ParseJS is a simple library I made (and bug-fixed) in the span of two (2) days. Currently, both ParseJS and ParseTS remain *mostly* complete.

Example in action.

For an example of how you could use this library, check out this Brain**** interpreter I made on SoloLearn's Web-Development code playground.

Installation.

The suggested method of installation is to download the contents of the repository as a ZIP and extract them.
But, if you don't want to keep the file on your drive, you can use the CDN (Content Delivery Network) script import:

<script src="https://cdn.jsdelivr.net/gh/CalinZBaenen/ParseJS@main/src/parse_string.js"></script>
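Once the script above is loaded, the examples in this README call parse_string directly, so the function is assumed to be available as a global. A minimal, illustrative usage (not taken from the repository) might look like this:

// After including the script tag above, `parse_string` is assumed to be
// available as a global function.
const result = parse_string("Hello, world", ["Hello", "world"]);
// Expected, per the Description section below:
// [Symbol.for("Hello"), ',', ' ', Symbol.for("world")]
console.log(result);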

Description.

English.

The parse_string function of ParseJS takes in a block of text and a list of keywords (tokens*), scans the text letter by letter, and returns a list.
If there is a keyword that begins with the letter the function is currently looking at, it checks whether the sequence of letters ahead of the current letter (in combination with the current one) spells out a valid keyword. If it does, a symbol representing the token found is inserted into the list; otherwise the current letter is inserted instead.

Programmernese.

The parse_string function of ParseJS takes in a string (str) and an array of strings (toks), and returns an array (parsed_array) of strings or symbols (Array<string|symbol>). parse_string iterates over str, and if there is a string (keyword*) in toks that begins with the character currently being iterated over, it checks whether the following sequence of characters forms a valid keyword. If a valid keyword is found, a symbol (Symbol.for(tok)) is inserted into parsed_array; otherwise the current character is inserted instead.
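To make that concrete, here is a minimal sketch of the scanning logic described above. This is not the library's actual source (see src/parse_string.js for that); it simply assumes the longest matching keyword wins, as the second example below explains, and that the clutter flag covered at the end of this README is an optional third parameter.

// Illustrative sketch only - not the code shipped in src/parse_string.js.
function parse_string_sketch(str, toks, keep_clutter = true) {
  const parsed_array = [];
  let i = 0;
  while(i < str.length) {
    // Collect every keyword that matches the text starting at position `i`,
    // then prefer the longest one ("test12" beats "test1" and "test").
    const matches = toks.filter(tok => str.startsWith(tok, i));
    if(matches.length > 0) {
      const tok = matches.sort((a, b) => b.length - a.length)[0];
      parsed_array.push(Symbol.for(tok));
      i += tok.length;
    } else {
      // No keyword starts here; keep the raw character unless clutter is disabled.
      if(keep_clutter) parsed_array.push(str[i]);
      i++;
    }
  }
  return parsed_array;
}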

Examples

Sonic The Hedgehog.

[example.png: A parse_string test that uses Sonic characters as tokens.]

So, we have this code. - Let's try to explain what's going on, and why we get the output we do.

// We tell `parse_string` that we want it to read "Knuckles, Tails, Amy, Sonic", but only
// search for "Sonic", "Tails", and "Knuckles".
parse_string("Knuckles, Tails, Amy, Sonic", [
  "Sonic", // "Sonic" is a keyword because it is included in this list.
  "Tails", // Same for "Tails".
  "Knuckles" // Ditto.
]);

This produces the output:

[
  Symbol.for("Knuckles"), // "Knuckles" was a keyword - `parse_string` found "Knuckles".
  
  ',', // The character right after "Knuckles".
  ' ', // The character after that. -- This isn't a keyword, so it's left alone.
  
  Symbol.for("Tails"),
  
  ',',
  ' ',
  'A', // "Amy" isn't a keyword, so her name is left alone.
  'm',
  'y',
  ',',
  ' ',
  
  Symbol.for("Sonic") // Sonic's at the end, but he was still found, so his name is "tokenized".
]
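Since the keywords come back as registered symbols (created with Symbol.for), you can recover the original keyword text with the standard Symbol.keyFor. A small, illustrative consumer of the result above (not taken from the repository) might look like this:

const result = parse_string("Knuckles, Tails, Amy, Sonic", ["Sonic", "Tails", "Knuckles"]);
for(const item of result) {
  if(typeof item === "symbol") {
    // Registered symbols map back to their keyword text.
    console.log("keyword:", Symbol.keyFor(item)); // e.g. "Knuckles"
  } else {
    console.log("character:", item); // e.g. ',' or ' '
  }
}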

Testing. Testing! One (1). Two (2). Three (3).

[example2.png: A parse_string test that uses keywords that are variants of each other.]

Now, let's do some more fiddling around...

parse_string("test12 test1 test2 test", [
  "test",
  "test1",
  "test12",
  "test2"
]);

Ok, so, let's walk through this.

t

Well, we have t. That's a good start. Now we look to see which keywords start with t...
Oh... well, that's strange. It looks like we have four "candidates".
Let's remove the obvious loser: test2. The text ahead of our t reads est12, so test2 (which would need est2 to follow) can never match here.

Now, we still have three possible candidates: test, test1, and test12.
How do we know which one to pick? Simple: first we sort the candidates by length; test12, test1, test. Then we arrange the candidates in dictionary order, which in this case doesn't change anything.
So, now what? Well, let's scan ahead: if the next characters are est1, then we could use test1 - BUT if the next characters are est12, we could use test12.
Since test12 is longer than test1 or test, it takes precedence. In other words, parse_string prefers the longer token because the longer the match, the more confident it is that the prediction is correct.
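Applying that rule across the whole input, the call above should produce something like the following. (This is worked out by hand from the rules just described, not captured program output; the spaces stay because clutter removal, covered next, is off by default.)

[
  Symbol.for("test12"),
  ' ',
  Symbol.for("test1"),
  ' ',
  Symbol.for("test2"),
  ' ',
  Symbol.for("test")
]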

Cleaning up the clutter!

As of the latest patch, 0.051, you can remove the extra letters from the returned list.
It turns THIS output:
[example.png: Sonic's test has some leftover letters.]
into THIS output:
[example3.png: Sonic's test only leaves the symbols behind.]

So... how do we get this cleaner output?
Well, when you pass in your text and the keywords you want to find, you can also pass in a boolean (a yes-or-no value) that indicates whether you want to keep the clutter. For backwards compatibility, this option is true (yes*) by default.
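Assuming the flag is passed as a third argument to parse_string (the text above only says it can "also" be passed in, so the exact position is an assumption), the Sonic example with the clutter turned off would look like this:

parse_string("Knuckles, Tails, Amy, Sonic", [
  "Sonic",
  "Tails",
  "Knuckles"
], false); // false = don't keep the clutter.

// Expected, based on the screenshots above:
// [Symbol.for("Knuckles"), Symbol.for("Tails"), Symbol.for("Sonic")]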

Bugs
