Further improving scraper
polterguy committed Oct 7, 2023
1 parent 5f1b10d commit 2d93892
Showing 8 changed files with 624 additions and 347 deletions.
79 changes: 58 additions & 21 deletions backend/files/system/openai/create-bot.post.hl
@@ -133,15 +133,6 @@ try
system_message:x:@.arguments/*/flavor
greeting:Hi there, how can I help you?

// Creating our default training snippets.
data.create
table:ml_training_snippets
values
type:x:@.type
prompt:Who created this ChatGPT website chatbot?
completion:@"This chatbot is a custom ChatGPT chatbot allowing you to use natural language to ask questions related to the website you're currently visiting. It was created by [AINIRO.IO](https://ainiro.io). AINIRO have ChatGPT solutions allowing you to scrape your website, upload document, and create publicly available ChatGPT-based chatbots and AI Based search similar to Bing, in addition to AI Expert Systems providing cognitive assistance."
uri:"https://ainiro.io"

// Invoking slot doing the heavy lifting
request.host
add:x:./*/signal
@@ -162,18 +153,64 @@ try
.url:x:@.arguments/*/url
.host:x:@request.host

// Creating landing page.
unwrap:x:+/*
signal:magic.ai.create-landing-page
url:x:@.url
type:x:@.type
host:x:@.host

// Vectorising training data.
unwrap:x:+/*
signal:magic.ai.vectorise
type:x:@.type
host:x:@.host
// Connecting to database to sanity check model and apply some default training snippets.
data.connect:[generic|magic]

// Making sure we actually have training snippets in model.
data.read
table:ml_training_snippets
columns
count(*)
as:count
where
and
type.eq:x:@.type

// Verifying we've got training snippets for model in database.
if
eq
convert:x:@data.read/*/*/count
type:int
.:int:0
.lambda

// Informing client.
sockets.signal:magic.backend.chatbot
roles:root
args
message:We could not find any training data on your website
type:error

// Deleting model.
data.delete
table:ml_types
where
and
type.eq:x:@.type

else

// Adding "Who created this chatbot" snippet.
data.create
table:ml_training_snippets
values
type:x:@.type
prompt:Who created this ChatGPT website chatbot?
completion:@"This chatbot is a custom ChatGPT chatbot allowing you to use natural language to ask questions related to the website you're currently visiting. It was created by [AINIRO.IO](https://ainiro.io). AINIRO have ChatGPT solutions allowing you to scrape your website, upload document, and create publicly available ChatGPT-based chatbots and AI Based search similar to Bing, in addition to AI Expert Systems providing cognitive assistance."
uri:"https://ainiro.io"

// Vectorizing model.
unwrap:x:+/*
signal:magic.ai.vectorise
type:x:@.type
host:x:@.host

// Creating landing page for model.
unwrap:x:+/*
signal:magic.ai.create-landing-page
url:x:@.url
type:x:@.type
host:x:@.host

// Returning success to caller.
unwrap:x:+/*
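
Reviewer note: the reordered flow in this hunk (count the model's snippets, abort and delete the model when there are none, otherwise add the default snippet, vectorise, then build the landing page) can be paraphrased in Python. This is only a sketch of the control flow, not the Hyperlambda itself. `finalize_model`, the in-memory `snippets` list, the `models` set, and the returned step names are hypothetical stand-ins for the `ml_training_snippets` and `ml_types` tables and the `magic.ai.*` signals.

```python
def finalize_model(model_type, snippets, models):
    """Return the list of steps taken for a freshly crawled model.

    snippets: list of dicts with a "type" key (stand-in for ml_training_snippets)
    models:   set of model names (stand-in for ml_types)
    """
    count = sum(1 for s in snippets if s["type"] == model_type)
    if count == 0:
        # Nothing was scraped: report the error and drop the empty model.
        models.discard(model_type)
        return [
            "error: We could not find any training data on your website",
            "delete-model",
        ]

    # The default snippet is added after the count check,
    # so it is never counted as crawl output.
    snippets.append({
        "type": model_type,
        "prompt": "Who created this ChatGPT website chatbot?",
        "uri": "https://ainiro.io",
    })
    return ["add-default-snippet", "vectorise", "create-landing-page"]
```

In the pre-commit order the default snippet was inserted before crawling, so a count check like this one would always have seen at least one row.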
58 changes: 29 additions & 29 deletions backend/files/system/openai/magic.startup/magic.ai.crawl-site.hl
@@ -19,32 +19,6 @@ slots.create:magic.ai.crawl-site
signal:magic.ai.load-robots
url:x:@.arguments/*/url

// Checking if robots.txt contains a crawl-delay.
if
exists:x:@signal/*/crawl-delay
.lambda

// Updating delay to value from robots.txt.
remove-nodes:x:@.arguments/*/delay
unwrap:x:+/*
validators.default:x:@.arguments
delay:x:@signal/*/crawl-delay


// Signaling frontend to inform it that we found a crawl-delay value.
strings.concat
.:"Robots.txt file contains a Crawl-Delay value of "
get-value:x:@signal/*/crawl-delay
.:" milliseconds, hence making sure we don't crawl more often than every "
get-value:x:@signal/*/crawl-delay
.:" milliseconds to avoid exhausting web server"
unwrap:x:+/**
sockets.signal:magic.backend.chatbot
roles:root
args
message:x:@strings.concat
type:info

// Checking if site contains a robots.txt file.
if
eq:x:@signal/*/found
@@ -72,6 +46,31 @@ slots.create:magic.ai.crawl-site
type:info
sleep:100

// Checking if robots.txt contains a crawl-delay.
if
exists:x:@signal/*/crawl-delay
.lambda

// Updating delay to value from robots.txt.
remove-nodes:x:@.arguments/*/delay
unwrap:x:+/*
validators.default:x:@.arguments
delay:x:@signal/*/crawl-delay


// Signaling frontend to inform it that we found a crawl-delay value.
strings.concat
.:"Robots.txt file contains a Crawl-Delay value of "
math.divide:x:@signal/*/crawl-delay
.:int:1000
.:" seconds"
unwrap:x:+/**
sockets.signal:magic.backend.chatbot
roles:root
args
message:x:@strings.concat
type:info

else

// Site does not contain a robots.txt file, signaling that fact to frontend.
@@ -107,7 +106,7 @@ slots.create:magic.ai.crawl-site
strings.concat
.:"We found "
get-value:x:@signal/*/total
.:" URLs in sitemap"
.:" URLs in sitemap(s)"
unwrap:x:+/**
sockets.signal:magic.backend.chatbot
roles:root
@@ -158,8 +157,9 @@ slots.create:magic.ai.crawl-site
// Signaling frontend that we're waiting for n milliseconds.
strings.concat
.:"Waiting for "
get-value:x:@.arguments/*/delay
.:" milliseconds to avoid exhausting web server"
math.divide:x:@.arguments/*/delay
.:int:1000
.:" seconds to avoid exhausting web server"
unwrap:x:+/**
sockets.signal:magic.backend.chatbot
roles:root
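
Reviewer note: both message changes in this file divide a millisecond value by 1000 so the frontend reads seconds. A minimal Python paraphrase follows. The function names are hypothetical, and how the Hyperlambda `math.divide` rounds a delay that is not a whole number of seconds is not shown in the diff, so the formatting here is an assumption.

```python
def crawl_delay_message(delay_ms):
    """Build the frontend notice, converting milliseconds to seconds."""
    seconds = delay_ms / 1000
    # ":g" drops a trailing ".0", so 2000 ms reads as "2 seconds".
    return f"Robots.txt file contains a Crawl-Delay value of {seconds:g} seconds"


def waiting_message(delay_ms):
    """Build the per-request notice shown while the crawler sleeps."""
    return f"Waiting for {delay_ms / 1000:g} seconds to avoid exhausting web server"
```

The delay argument itself stays in milliseconds; only the text shown to the user changes unit.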
@@ -8,6 +8,22 @@ slots.create:magic.ai.create-landing-page
validators.mandatory:x:@.arguments/*/url
validators.mandatory:x:@.arguments/*/type

// Adding [headers] argument unless already specified.
if
not-exists:x:@.arguments/*/headers
.lambda

// Adding [headers] to [.arguments] such that we can create default HTTP headers further down.
add:x:@.arguments
.
headers

// Adding default headers unless they're already specified.
validators.default:x:@.arguments/*/headers
User-Agent:AINIRO-Crawler 2.0
Accept-Encoding:identity
Accept:text/xml

// Signaling frontend.
strings.concat
.:"Creating default landing page for "
@@ -25,9 +41,9 @@ slots.create:magic.ai.create-landing-page
type:info

// Retrieving HTML document from URL specified.
add:x:./*/http.get
get-nodes:x:@.arguments/*/headers
http.get:x:@.arguments/*/url
headers
User-Agent:AINIRO-MachineLearning-Spider

// Converting to lambda to verify we've got a "base" element.
html2lambda:x:@http.get/*/content
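
Reviewer note: the new `[headers]` handling in this file amounts to "use whatever the caller passed, and fill in a default for each header that is missing". A Python paraphrase of that merge is sketched below. `with_default_headers` is a hypothetical name, and the sketch compares header names case-sensitively, which may differ from how the slot treats them.

```python
DEFAULT_HEADERS = {
    "User-Agent": "AINIRO-Crawler 2.0",
    "Accept-Encoding": "identity",
    "Accept": "text/xml",
}


def with_default_headers(headers=None):
    """Return caller headers, filling in any default that is missing."""
    merged = dict(DEFAULT_HEADERS)
    # Caller-supplied values win over the defaults.
    merged.update(headers or {})
    return merged
```

For example, `with_default_headers({"Accept": "text/html"})` keeps the caller's `Accept` value and still sends the default `User-Agent` and `Accept-Encoding`.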
@@ -28,9 +28,14 @@ slots.create:magic.ai.html.extract-images
* data to OpenAI during completion invocations.
*/
if
not
strings.starts-with:x:@.dp/#/*/\@src
.:"data:"
and
exists:x:@.dp/#/*/\@src
not-null:x:@.dp/#/*/\@src
neq:x:@.dp/#/*/\@src
.:
not
strings.starts-with:x:@.dp/#/*/\@src
.:"data:"
.lambda

// Normalizing URL.
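
Reviewer note: the widened condition in this hunk keeps an image only when its `src` attribute exists, is not null, is not empty, and does not start with `data:`. A Python paraphrase of the predicate is sketched below. `should_keep_image` is a hypothetical name, and a missing attribute and a null value are both represented as `None` here.

```python
def should_keep_image(src):
    """Mirror the hunk's checks: present, not null, not empty, not inline data."""
    return src is not None and src != "" and not src.startswith("data:")
```

The earlier condition only rejected `data:` URLs, so an `img` element without a usable `src` would have passed through to URL normalization.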