Merge pull request #38 from apify/copy/jan

Copy/jan
apify · Aug 18, 2024 · 2cf38db
2 parents 664d6e1 + 411a244
commit 2cf38db
Showing 7 changed files with 143 additions and 101 deletions.
diff --git a/README.md b/README.md
@@ -3,7 +3,7 @@
 **The whitepaper describes a new concept for building serverless microapps called _Actors_,
 which are easy to develop, share, integrate, and build upon.
 Actors are a reincarnation of the UNIX philosophy
-for programs running in the cloud.**
+for programs or agents running in the cloud.**
 
 By [Jan Čurn](https://apify.com/jancurn),
 [Marek Trunkát](https://apify.com/mtrunkat),
@@ -16,7 +16,9 @@ By [Jan Čurn](https://apify.com/jancurn),
 <!-- toc -->
 
 - [Introduction](#introduction)
+  * [Background](#background)
   * [Overview](#overview)
+  * [Apify platform](#apify-platform)
 - [Basic concepts](#basic-concepts)
   * [Input](#input)
   * [Run environment](#run-environment)
@@ -30,7 +32,7 @@ By [Jan Čurn](https://apify.com/jancurn),
   * [Relation to the Actor model](#relation-to-the-actor-model)
   * [Why the name "Actor"](#why-the-name-actor)
 - [Installation and setup](#installation-and-setup)
-  * [Apify platform](#apify-platform)
+  * [Running on the Apify platform](#running-on-the-apify-platform)
   * [Node.js](#nodejs)
   * [Python](#python)
   * [Command-line interface (CLI)](#command-line-interface-cli)
@@ -72,23 +74,26 @@ By [Jan Čurn](https://apify.com/jancurn),
 
 ## Introduction
 
-This document introduces _Actors_,
-a new kind of serverless microapps (or agents, cloud programs, ...) for general-purpose
+This whitepaper introduces _Actors_,
+a new kind of serverless microapps (or agents, cloud programs, functions) for general-purpose
 language-agnostic computing and automation jobs.
-The main design goal for Actors is to make it easy for developers build and ship reusable
-cloud software tools, which are also easy to run
+The main goal for Actors is to make it easy for developers to build and ship reusable
+software automation tools, which are also easy to run
 and integrate by other users.
 
+
+### Background
+
 The Actors were first introduced by [Apify](https://apify.com/) in late 2017,
 as a way to easily build, package, and ship web scraping and web automation tools to customers.
 Over the years, Apify has kept developing the concept and has applied
 it successfully to thousands of real-world use cases in many business areas,
 well beyond the domain of web scraping.
 
-Drawing on this experience,
+Building on this experience,
 we're releasing this whitepaper to introduce the philosophy of Actors
-to the public and receive feedback on it.
-Our hope is that Actor programming model will eventually become an open standard,
+to the public and receive your feedback on it.
+Our hope is to make the Actor programming model an open standard,
 which will help the community to more effectively
 build and ship reusable software automation tools,
 as well as encourage new implementations of the model in other programming languages.
@@ -100,8 +105,8 @@ by the Apify platform, with SDKs for
 [Node.js](https://sdk.apify.com/) and
 [Python](https://pypi.org/project/apify/),
 and a [command-line interface (CLI)](https://docs.apify.com/cli).
-Beware that the frameworks might not yet implement all the features of Actor programming model.
-This is work in progress. 
+Beware that the frameworks might not yet implement all the features of the Actor programming model
+described in this whitepaper. This is a work in progress.
 
 
 ### Overview
@@ -114,24 +119,27 @@ or removing duplicates from a large dataset.
 Actors can run as short or as long as necessary, from seconds to hours, even infinitely.
 
 Basically, Actors are programs packaged as Docker images,
-which accept a well-defined input JSON object, perform
+which accept a well-defined JSON input, perform
 an action, and optionally produce a well-defined JSON output.
 
 Actors have the following elements:
 - **Dockerfile** which specifies where the Actor's source code is,
-  how to build it, and run it
-- **Documentation** in a form of README.md file
+  how to build it, and run it.
+- **Documentation** in the form of a README.md file.
 - **Input and output schemas** that describe what input the Actor requires,
-  and what results it produces
-- Access to an out-of-box **storage system** for Actor data, results, and files
-- **Metadata** such as the Actor name, description, author, and version
+  and what results it produces.
+- Access to an out-of-the-box **storage system** for Actor data, results, and files.
+- **Metadata** such as the Actor name, description, author, and version.
 
 The documentation and the input/output schemas make it possible for people to easily understand what the Actor does,
 enter the required inputs either in the user interface or via the API,
 and integrate the results of the Actor into their other workflows.
 Actors can easily call and interact with each other, which makes it possible to build more complex
 systems on top of simple ones.
 
+
+### Apify platform
+
 The Actors can be published
 on the [Apify platform](https://apify.com/store),
 which automatically generates a rich website with documentation
@@ -145,15 +153,10 @@ The Apify platform provides an open API, cron-style scheduler, webhooks
 and [integrations](https://apify.com/integrations)
 to services such as Zapier or Make, which make it easy for users
 to integrate Actors into their existing workflows. Additionally, the Actor developers
-can set a price tag for the usage of their Actors, and thus make
-[passive income](https://blog.apify.com/make-regular-passive-income-developing-web-automation-Actors-b0392278d085/)
-to have an incentive to keep developing and improving the Actor for the users.
-
-Currently, Actors can run locally or on the Apify platform. One of the goals of this whitepaper
-is to motivate the community to create new runtime environments outside of Apify.
+can set a price tag for the usage of their Actors, and thus earn income
+and have an incentive to keep developing and improving the Actor for the users.
+For details, see [Monetization](#monetization).
 
-The ultimate goal of the Actor programming model is to make it as simple as possible
-for people to develop, run, and integrate software automation tools.
 
 
 ## Basic concepts
@@ -385,12 +388,13 @@ Our primary focus was always on practical software engineering utility, not an
 implementation of a formal mathematical model.
 
 For example, our Actors
-do not provide any standard message passing mechanism. The Actors might communicate together
+do not provide any standard message passing mechanism, but they can communicate together
 directly via HTTP requests (see [live-view web server](#live-view-web-server)),
-manipulate each other's operation using the Apify platform API (e.g. abort another Actor),
+manipulate each other's operation via the Apify platform API (e.g. abort another Actor),
 or affect each other by sharing some internal state or storage.
-The Actors simply do not have any formal restrictions,
-and they can access whichever external systems they want.
+The Actors do not have any formal restrictions,
+and they can access whichever external systems they want,
+and thus go beyond the formal mathematical Actor model.
 
 
 ### Why the name "Actor"
@@ -403,11 +407,12 @@ And they are related to the Actor model known from the computer science.
 
 To make it clear Actors are not people, the letter "A" is capitalized.
 
+
 ## Installation and setup
 
 Below are steps to start building Actors in various languages and environments.
 
-### Apify platform
+### Running on the Apify platform
 
 You can develop and run Actors in [Apify Console](https://console.apify.com/actors) without
 installing any software locally. Just create a free account, and start building Actors
@@ -1174,7 +1179,7 @@ $ actor abort --run-id RUN_ID
 $ kill <PID>
 ```
 
-<!-- TODO: Include Actor.boot() or not? I'd say yes -->
+<!-- TODO: Include Actor.boot() or not? I'd say yes. See https://github.com/apify/actor-specs/issues/23 -->
 
 ### Live view web server
 
@@ -1225,6 +1230,7 @@ https://bob--screenshot-taker.apify.actor
 Currently, the specific Standby mode settings, authentication options, or OpenAPI schema are not part of this Actor specification,
 but they might be introduced in the future as new settings in the `actor.json` file.
 
+<!-- TODO: Consider unifying Standby mode with Live view web server, they are really two sides of the same thing -->
 
 ### Migration to another server
 
@@ -1639,6 +1645,7 @@ The monetization gives developers an incentive to further develop and maintain t
 Actors provide a new way for software developers like you to monetize their skills,
 bringing the creator economy model to SaaS.
 
+For more details, read our essay [Make passive income developing web automation Actors](https://blog.apify.com/make-regular-passive-income-developing-web-automation-Actors-b0392278d085/).
 
 ## Future work
 

diff --git a/pages/ACTOR_FILE.md b/pages/ACTOR_FILE.md
@@ -1,32 +1,46 @@
 # Actor file specification
 
-This JSON file must be present at `.actor/actor.json` and contains the main definition of the Actor.
+This JSON file must be present at `.actor/actor.json` and contains an object with the definition of the Actor.
 
-It looks as follows:
+The file looks as follows:
 
 ```jsonc
 {
-  "actorSpecification": 2, // required
+  // Required, indicates that this is an Actor definition file
+  "actorSpecification": 1,
+
+  // Metadata
   "name": "google-search-scraper",
   "title": "Google Search Scraper",
   "description": "The 200-char description",
-  "version": "0.0", // required
-  "buildTag": "latest", // if omitted, builds with "latest" tag
+  "version": "0.0", // Required
+  "buildTag": "latest", // If omitted, builds with "latest" tag
   "environmentVariables": {
     "MYSQL_USER": "my_username",
     "MYSQL_PASSWORD": "@mySecretPassword"
   },
-  "dockerfile": "./Dockerfile", // if omitted, it checks "./Dockerfile" and "../Dockerfile"
-  "readme": "./ACTOR.md", // if omitted, it checks "./ACTOR.md" and "../README.md"
+
+  // Optional min and max memory for running this Actor
+  "minMemoryMbytes": 128,
+  "maxMemoryMbytes": 4096,
+
+  // Links to other Actor definition files
+  "dockerfile": "./Dockerfile", // If omitted, the system looks for "./Dockerfile" and "../Dockerfile"
+  "readme": "./README.md", // If omitted, the system looks for "./ACTOR.md" and "../README.md"
+  "changelog": "../../../shared/CHANGELOG.md",
+
+  // Links to input/output schema files, or inlined schema objects.
   "input": "./input_schema.json",
   "output": "./output_schema.json",
-  "minMemoryMbytes": 128, // optional number, min memory in megabytes allowed for running this Actor
-  "maxMemoryMbytes": 4096, // optional number, max memory in megabytes allowed for running this Actor
+
+  // Links to storage schema files, or inlined schema objects.
   "storages": {
     "keyValueStore": "./key_value_store_schema.json",
-    "dataset": "../shared-schemas/dataset_schema.json",
+    "dataset": "../shared_schemas/generic_dataset_schema.json",
     "requestQueue": "./request_queue_schema.json"
   },
+
+  // Scripts that might be used by the CLI to ease the local Actor development.
   "scripts": {
     "post-create": "npm install",
     "run": "npm start"

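Note on the `actor.json` hunk above: the comments mark `actorSpecification` and `version` as required and describe `minMemoryMbytes` / `maxMemoryMbytes` as optional limits. The following is a minimal Python sketch of how a tool might sanity-check a parsed definition. The function name `check_actor_definition` is hypothetical, and the rule that the minimum must not exceed the maximum is an assumption, not something the specification states.

```python
from typing import Any


def check_actor_definition(definition: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems: list[str] = []

    # The hunk above marks these two fields as required.
    for field in ("actorSpecification", "version"):
        if field not in definition:
            problems.append(f"missing required field: {field}")

    # Assumption (not in the spec text): when both limits are given,
    # the minimum should not be greater than the maximum.
    min_mb = definition.get("minMemoryMbytes")
    max_mb = definition.get("maxMemoryMbytes")
    if min_mb is not None and max_mb is not None and min_mb > max_mb:
        problems.append("minMemoryMbytes is greater than maxMemoryMbytes")

    return problems


# Values taken from the example file in the hunk (comments removed).
example = {
    "actorSpecification": 1,
    "name": "google-search-scraper",
    "version": "0.0",
    "minMemoryMbytes": 128,
    "maxMemoryMbytes": 4096,
}
print(check_actor_definition(example))  # prints []
```

The sketch takes an already parsed dictionary, so it sidesteps the question of how comments in the `jsonc` example would be handled when reading the file.
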
diff --git a/pages/DATASET_SCHEMA.md b/pages/DATASET_SCHEMA.md
@@ -1,6 +1,10 @@
 # Dataset schema file specification
 
-Dataset storage enables you to sequentially save and retrieve data. Each Actor run is assigned its own dataset, which is created when the first item is stored to it. Datasets usually contain results from web scraping, crawling or data processing jobs. The data can be visualized as a table where each object is a row and its attributes are the columns. The data can be exported in JSON, CSV, XML, RSS, Excel or HTML formats.
+Dataset storage enables you to sequentially store and retrieve data records, in various formats.
+Each Actor run is assigned its own dataset, which is created when the first item is stored to it.
+Datasets usually contain results from web scraping, crawling or data processing jobs.
+The data can be visualized as a table where each object is a row and its attributes are the columns.
+The data can be exported in JSON, CSV, XML, RSS, Excel or HTML formats.
 
 Dataset can be assigned a schema which describes:
 
@@ -41,22 +45,35 @@ Uncaught Error: Dataset schema is not compatible with the provided schema
     "title": "Eshop products",
     "description": "Dataset containing the whole product catalog including prices and stock availability.",
 
-    // Not supported yet
+    // Define a JSON schema for the dataset fields, including their type, description, etc.
     "fields": {
-        "title": "string",  
-        "imageUrl": "string",  
-        "priceUsd": "number", 
-        "manufacturer": {
-            "title": "string", 
-            "url": "number",
+        "type": "object",
+        "properties": {
+            "title": {
+                "type": "string",
+                "description": "The name of the results",
+            },
+            "imageUrl": {
+                "type": "string",
+                "description": "URL of the product image",
+            },
+            "priceUsd": {
+                "type": "integer",
+                "description": "Price of the item",
+            },
+            "manufacturer": {
+                "type": "object",
+                "properties": {
+                    "title": { ... }, 
+                    "url": { ... },
+                }
+            },
+            ...
         },
-        "productVariants": [{
-            "color": "?string"
-        }],
-        
-        ...
+        "required": ["title"],
     },
   
+    // Define how the dataset is presented to users
     "views": {
         "overview": {
             "title": "Products overview",
@@ -116,7 +133,8 @@ Uncaught Error: Dataset schema is not compatible with the provided schema
 
 ### JSON schema
 
-Items of a dataset can be described by a JSON schema definition. Apify platform then ensures that each object accomplies with the provided schema. In the first version only the standard JSON schema will be supported, i.e.:
+Items of a dataset can be described by a JSON schema definition. The Apify platform then ensures that each object complies with the provided schema.
+In the first version, only the standard JSON schema will be supported, i.e.:
 
 
 ```jsonc
@@ -155,7 +173,7 @@ Items of a dataset can be described by a JSON schema definition. Apify platform
 }
 ```
 
-with simplifed version comming in the near future:
+And potentially with a simplified version coming in the future:
 
 ```jsonc
 {
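
Note on the `fields` change above: the hunk replaces the ad-hoc field notation with a JSON schema object (`type`, `properties`, `required`) and says the platform ensures each item complies with it. The Python sketch below illustrates what such a check could look like for that small subset of keywords only. The helper name `check_item` and the error message wording are hypothetical, this is not how the Apify platform validates items, and a real project would more likely rely on a full JSON Schema validator.

```python
from typing import Any

# JSON Schema type names mapped to Python types (only the subset used here).
_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_item(item: Any, schema: dict[str, Any], path: str = "item") -> list[str]:
    """Check `item` against the `type`, `properties` and `required` keywords only."""
    expected = schema.get("type")
    if expected in _TYPES:
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(item, bool) and expected in ("integer", "number")
        if bool_as_number or not isinstance(item, _TYPES[expected]):
            return [f"{path}: expected {expected}"]

    errors: list[str] = []
    if isinstance(item, dict):
        for name in schema.get("required", []):
            if name not in item:
                errors.append(f"{path}.{name}: required field is missing")
        for name, subschema in schema.get("properties", {}).items():
            if name in item:
                errors.extend(check_item(item[name], subschema, f"{path}.{name}"))
    return errors


# A reduced version of the `fields` object from the hunk.
fields = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "imageUrl": {"type": "string"},
        "priceUsd": {"type": "integer"},
    },
    "required": ["title"],
}
print(check_item({"title": "Red shoes", "priceUsd": 49}, fields))  # prints []
print(check_item({"priceUsd": "49"}, fields))
```

The second call reports both the missing `title` and the string-typed `priceUsd`, which shows why declaring `required` and per-field `type` in the schema is useful to consumers of the dataset.
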

diff --git a/pages/INPUT_SCHEMA.md b/pages/INPUT_SCHEMA.md
@@ -1,7 +1,7 @@
 # Actor input schema file specification
 
-NOTE: Currently the Apify platform only supports [input schema v1](https://docs.apify.com/Actors/development/input-schema),
-this document describes how the v2 should look like, but it's not implemented yet.
+NOTE: Currently, the Apify platform only supports [input schema version 1](https://docs.apify.com/Actors/development/input-schema).
+This document describes what version 2 should look like, but it's not implemented yet.
 
 ## Work in progress
 

diff --git a/pages/KEY_VALUE_STORE_SCHEMA.md b/pages/KEY_VALUE_STORE_SCHEMA.md
@@ -84,7 +84,6 @@ https://api.apify.com/v2/key-value-stores/storeId/keys?prefix=post-images-
 
 ## TODO(@jancurn)
 - Finalize this text, keep `collections` for now
-- xx
 - What if the kv-store schema is used by an Actor to define the structure of the key-value store it operates on,
   but the developer defines a non-compatible record group for the "INPUT" prefix?
   Maybe the default kv-stores should be created with a default record group to cover the "INPUT" prefixes