diff --git a/README.md b/README.md index 3f1de4e5..ae6173d7 100644 --- a/README.md +++ b/README.md @@ -3,7 +3,7 @@ **The whitepaper describes a new concept for building serverless microapps called _Actors_, which are easy to develop, share, integrate, and build upon. Actors are a reincarnation of the UNIX philosophy -for programs running in the cloud.** +for programs or agents running in the cloud.** By [Jan Čurn](https://apify.com/jancurn), [Marek Trunkát](https://apify.com/mtrunkat), @@ -16,7 +16,9 @@ By [Jan Čurn](https://apify.com/jancurn), - [Introduction](#introduction) + * [Background](#background) * [Overview](#overview) + * [Apify platform](#apify-platform) - [Basic concepts](#basic-concepts) * [Input](#input) * [Run environment](#run-environment) @@ -30,7 +32,7 @@ By [Jan Čurn](https://apify.com/jancurn), * [Relation to the Actor model](#relation-to-the-actor-model) * [Why the name "Actor"](#why-the-name-actor) - [Installation and setup](#installation-and-setup) - * [Apify platform](#apify-platform) + * [Running on the Apify platform](#running-on-the-apify-platform) * [Node.js](#nodejs) * [Python](#python) * [Command-line interface (CLI)](#command-line-interface-cli) @@ -72,23 +74,26 @@ By [Jan Čurn](https://apify.com/jancurn), ## Introduction -This document introduces _Actors_, -a new kind of serverless microapps (or agents, cloud programs, ...) for general-purpose +This whitepaper introduces _Actors_, +a new kind of serverless microapp (or agent, cloud program, function) for general-purpose language-agnostic computing and automation jobs. -The main design goal for Actors is to make it easy for developers build and ship reusable -cloud software tools, which are also easy to run +The main goal for Actors is to make it easy for developers to build and ship reusable +software automation tools, which are also easy to run and integrate by other users.
+ +### Background + The Actors were first introduced by [Apify](https://apify.com/) in late 2017, as a way to easily build, package, and ship web scraping and web automation tools to customers. Over the years, Apify has kept developing the concept and has applied it successfully to thousands of real-world use cases in many business areas, well beyond the domain of web scraping. -Drawing on this experience, +Building on this experience, we're releasing this whitepaper to introduce the philosophy of Actors -to the public and receive feedback on it. -Our hope is that Actor programming model will eventually become an open standard, +to the public and receive your feedback on it. +Our hope is to make the Actor programming model an open standard, which will help the community to more effectively build and ship reusable software automation tools, as well as encourage new implementations of the model in other programming languages. @@ -100,8 +105,8 @@ by the Apify platform, with SDKs for [Node.js](https://sdk.apify.com/) and [Python](https://pypi.org/project/apify/), and a [command-line interface (CLI)](https://docs.apify.com/cli). -Beware that the frameworks might not yet implement all the features of Actor programming model. -This is work in progress. +Beware that the frameworks might not yet implement all the features of the Actor programming model +described in this whitepaper. This is work in progress. ### Overview @@ -114,17 +119,17 @@ or removing duplicates from a large dataset. Actors can run as short or as long as necessary, from seconds to hours, even infinitely. Basically, Actors are programs packaged as Docker images, -which accept a well-defined input JSON object, perform +which accept a well-defined JSON input, perform an action, and optionally produce a well-defined JSON output.
Actors have the following elements: - **Dockerfile** which specifies where the Actor's source code is, - how to build it, and run it -- **Documentation** in a form of README.md file + how to build it, and run it. +- **Documentation** in the form of a README.md file. - **Input and output schemas** that describe what input the Actor requires, - and what results it produces -- Access to an out-of-box **storage system** for Actor data, results, and files -- **Metadata** such as the Actor name, description, author, and version + and what results it produces. +- Access to an out-of-the-box **storage system** for Actor data, results, and files. +- **Metadata** such as the Actor name, description, author, and version. The documentation and the input/output schemas make it possible for people to easily understand what the Actor does, enter the required inputs either in a user interface or via API, @@ -132,6 +137,9 @@ and integrate the results of the Actor into their other workflows. Actors can easily call and interact with each other, enabling building more complex systems on top of simple ones. + +### Apify platform + The Actors can be published on the [Apify platform](https://apify.com/store), which automatically generates a rich website with documentation @@ -145,15 +153,10 @@ The Apify platform provides an open API, cron-style scheduler, webhooks and [integrations](https://apify.com/integrations) to services such as Zapier or Make, which make it easy for users to integrate Actors into their existing workflows. Additionally, the Actor developers -can set a price tag for the usage of their Actors, and thus make -[passive income](https://blog.apify.com/make-regular-passive-income-developing-web-automation-Actors-b0392278d085/) -to have an incentive to keep developing and improving the Actor for the users. - -Currently, Actors can run locally or on the Apify platform. One of the goals of this whitepaper -is to motivate the community to create new runtime environments outside of Apify.
+can set a price tag for the usage of their Actors, and thus earn income +and have an incentive to keep developing and improving the Actor for the users. +For details, see [Monetization](#monetization). -The ultimate goal of the Actor programming model is to make it as simple as possible -for people to develop, run, and integrate software automation tools. ## Basic concepts @@ -385,12 +388,13 @@ Our primary focus was always on practical software engineering utility, not an implementation of a formal mathematical model. For example, our Actors -do not provide any standard message passing mechanism. The Actors might communicate together +do not provide any standard message passing mechanism, but they can communicate with each other directly via HTTP requests (see [live-view web server](#live-view-web-server)), -manipulate each other's operation using the Apify platform API (e.g. abort another Actor), +manipulate each other's operation via the Apify platform API (e.g. abort another Actor), or affect each other by sharing some internal state or storage. -The Actors simply do not have any formal restrictions, -and they can access whichever external systems they want. +The Actors do not have any formal restrictions, +and they can access whichever external systems they want, +thus going beyond the formal mathematical Actor model. ### Why the name "Actor" @@ -403,11 +407,12 @@ And they are related to the Actor model known from computer science. To make it clear Actors are not people, the letter "A" is capitalized. + ## Installation and setup Below are steps to start building Actors in various languages and environments. -### Apify platform +### Running on the Apify platform You can develop and run Actors in [Apify Console](https://console.apify.com/actors) without installing any software locally.
Just create a free account, and start building Actors @@ -1174,7 +1179,7 @@ $ actor abort --run-id RUN_ID $ kill ``` - + ### Live view web server @@ -1225,6 +1230,7 @@ https://bob--screenshot-taker.apify.actor Currently, the specific Standby mode settings, authentication options, or OpenAPI schema are not part of this Actor specification, but they might be introduced in the future as new settings in the `actor.json` file. + ### Migration to another server @@ -1639,6 +1645,7 @@ The monetization gives developers an incentive to further develop and maintain t Actors provide a new way for software developers like you to monetize their skills, bringing the creator economy model to SaaS. +For more details, read our essay [Make passive income developing web automation Actors](https://blog.apify.com/make-regular-passive-income-developing-web-automation-Actors-b0392278d085/). ## Future work diff --git a/pages/ACTOR_FILE.md b/pages/ACTOR_FILE.md index e01567cf..4a6e7441 100644 --- a/pages/ACTOR_FILE.md +++ b/pages/ACTOR_FILE.md @@ -1,32 +1,46 @@ # Actor file specification -This JSON file must be present at `.actor/actor.json` and contains the main definition of the Actor. +This JSON file must be present at `.actor/actor.json` and contains an object with the definition of the Actor.
-It looks as follows: +The file looks as follows: ```jsonc { - "actorSpecification": 2, // required + // Required, indicates that this is an Actor definition file + "actorSpecification": 1, + + // Metadata "name": "google-search-scraper", "title": "Google Search Scraper", "description": "The 200-char description", - "version": "0.0", // required - "buildTag": "latest", // if omitted, builds with "latest" tag + "version": "0.0", // Required + "buildTag": "latest", // If omitted, builds with "latest" tag "environmentVariables": { "MYSQL_USER": "my_username", "MYSQL_PASSWORD": "@mySecretPassword" }, - "dockerfile": "./Dockerfile", // if omitted, it checks "./Dockerfile" and "../Dockerfile" - "readme": "./ACTOR.md", // if omitted, it checks "./ACTOR.md" and "../README.md" + + // Optional min and max memory for running this Actor + "minMemoryMbytes": 128, + "maxMemoryMbytes": 4096, + + // Links to other Actor definition files + "dockerfile": "./Dockerfile", // If omitted, the system looks for "./Dockerfile" and "../Dockerfile" + "readme": "./README.md", // If omitted, the system looks for "./ACTOR.md" and "../README.md" + "changelog": "../../../shared/CHANGELOG.md", + + // Links to input/output schema files, or inlined schema objects. "input": "./input_schema.json", "output": "./output_schema.json", - "minMemoryMbytes": 128, // optional number, min memory in megabytes allowed for running this Actor - "maxMemoryMbytes": 4096, // optional number, max memory in megabytes allowed for running this Actor + + // Links to storage schema files, or inlined schema objects. "storages": { "keyValueStore": "./key_value_store_schema.json", - "dataset": "../shared-schemas/dataset_schema.json", + "dataset": "../shared_schemas/generic_dataset_schema.json", "requestQueue": "./request_queue_schema.json" }, + + // Scripts that might be used by the CLI to ease local Actor development.
"scripts": { "post-create": "npm install", "run": "npm start" diff --git a/pages/DATASET_SCHEMA.md b/pages/DATASET_SCHEMA.md index 855707d5..4973400d 100644 --- a/pages/DATASET_SCHEMA.md +++ b/pages/DATASET_SCHEMA.md @@ -1,6 +1,10 @@ # Dataset schema file specification -Dataset storage enables you to sequentially save and retrieve data. Each Actor run is assigned its own dataset, which is created when the first item is stored to it. Datasets usually contain results from web scraping, crawling or data processing jobs. The data can be visualized as a table where each object is a row and its attributes are the columns. The data can be exported in JSON, CSV, XML, RSS, Excel or HTML formats. +Dataset storage enables you to sequentially store and retrieve data records, in various formats. +Each Actor run is assigned its own dataset, which is created when the first item is stored to it. +Datasets usually contain results from web scraping, crawling or data processing jobs. +The data can be visualized as a table where each object is a row and its attributes are the columns. +The data can be exported in JSON, CSV, XML, RSS, Excel or HTML formats. A dataset can be assigned a schema which describes: @@ -41,22 +45,35 @@ Uncaught Error: Dataset schema is not compatible with the provided schema "title": "Eshop products", "description": "Dataset containing the whole product catalog including prices and stock availability.", - // Not supported yet + // Define a JSON schema for the dataset fields, including their type, description, etc.
"fields": { - "title": "string", - "imageUrl": "string", - "priceUsd": "number", - "manufacturer": { - "title": "string", - "url": "number", + "type": "object", + "properties": { + "title": { + "type": "string", + "description": "The name of the product", + }, + "imageUrl": { + "type": "string", + "description": "URL of the product image", + }, + "priceUsd": { + "type": "number", + "description": "Price of the item in USD", + }, + "manufacturer": { + "type": "object", + "properties": { + "title": { ... }, + "url": { ... }, + } + }, + ... }, - "productVariants": [{ - "color": "?string" - }], - - ... + "required": ["title"], }, + // Define ways to present the dataset to users "views": { "overview": { "title": "Products overview", @@ -116,7 +133,8 @@ Uncaught Error: Dataset schema is not compatible with the provided schema ### JSON schema -Items of a dataset can be described by a JSON schema definition. Apify platform then ensures that each object accomplies with the provided schema. In the first version only the standard JSON schema will be supported, i.e.: +Items of a dataset can be described by a JSON schema definition. The Apify platform then ensures that each object complies with the provided schema. +In the first version, only the standard JSON schema will be supported, i.e.: ```jsonc { @@ -155,7 +173,7 @@ Items of a dataset can be described by a JSON schema definition. Apify platform } ``` -with simplifed version comming in the near future: +A simplified version might come in the future: ```jsonc { diff --git a/pages/INPUT_SCHEMA.md b/pages/INPUT_SCHEMA.md index 96f3e5f2..29866a3d 100644 --- a/pages/INPUT_SCHEMA.md +++ b/pages/INPUT_SCHEMA.md @@ -1,7 +1,7 @@ # Actor input schema file specification -NOTE: Currently the Apify platform only supports [input schema v1](https://docs.apify.com/Actors/development/input-schema), -this document describes how the v2 should look like, but it's not implemented yet.
+NOTE: Currently, the Apify platform only supports [input schema version 1](https://docs.apify.com/Actors/development/input-schema), +this document describes what version 2 should look like, but it's not implemented yet. ## Work in progress diff --git a/pages/KEY_VALUE_STORE_SCHEMA.md b/pages/KEY_VALUE_STORE_SCHEMA.md index aba2d790..6589a0c7 100644 --- a/pages/KEY_VALUE_STORE_SCHEMA.md +++ b/pages/KEY_VALUE_STORE_SCHEMA.md @@ -84,7 +84,6 @@ https://api.apify.com/v2/key-value-stores/storeId/keys?prefix=post-images- ## TODO(@jancurn) - Finalize this text, keep `collections` for now -- xx - What if the kv-store schema is used by an Actor to define the structure of the key-value store it operates on, but the developer defines a non-compatible record group for the "INPUT" prefix? Maybe the default kv-stores should be created with a default record group to cover the "INPUT" prefixes diff --git a/pages/OUTPUT_SCHEMA.md b/pages/OUTPUT_SCHEMA.md index 0f83e849..df043d0d 100644 --- a/pages/OUTPUT_SCHEMA.md +++ b/pages/OUTPUT_SCHEMA.md @@ -1,46 +1,19 @@ # Actor output schema file specification [work in progress] -A JSON file that defines structure of the output generated by -Actor (see [Input and Output](../README.md#input-and-output) for details). -The file is referenced from the main [Actor file](ACTOR.md) using the `output` directive, +A JSON file that defines the structure of the [output](../README.md#output) produced by an +Actor. +The file is referenced from the main [Actor file](./ACTOR_FILE.md) using the `output` property, and it is typically stored in `.actor/output_schema.json`. -Note that the schema is not only used to generate UI, but also the output JSON object, -with fields corresponding to `properties`, whose values are URLs to the results, data or liveview. -Such output object needs to be generated by system right when the Actor starts, -and remain static over entire lifecycle of Actor.
+The output schema is used by the system to generate the +output JSON object, +whose fields correspond to `properties`, and whose values are URLs to the dataset results, key-value store files, or live view web server. +This output object needs to be generated by the system right when the Actor starts, +and remain static over the entire lifecycle of the Actor; only the linked content changes over time. +This is necessary to enable integration of results into other systems - you don't need to run an Actor +to see the format of its results, as it's predefined by the output schema. - -## Random notes - -- **NOTE:** The output schema should enable developers to define schema for the - default dataset and key-value store. But how? It should be declarative - so that the system can check that e.g. the overridden default dataset - has the right schema. But then, when it comes to kv-store, that's not purely - output object but INPUT, similarly for overridden dataset or request queue. - Perhaps the cleanest way would be to set these directly in `.actor/actor.json`. -- The Run Sync API could have an option to automatically return (or redirect to?) - a specific property (i.e. URL) of the output object. - This would supersede the `outputRecordKey=OUTPUT` API param as well as - the run-sync-get-dataset-items API endpoint. - Maybe we could have one of the output properties as the main one, - which would be used by default for this kind of API endpoint, and just return - data to user. -- Same as we show Output in UI, we need to autogenerate the OUTPUT in API e.g. JSON format. - There would be properties like in the output_schema.json file, with e.g. URL to dataset, - log file, kv-store, live view etc. So it would be an auto-generated field "output" - that we can add to JSON returned by the Run API enpoints - (e.g. 
https://docs.apify.com/api/v2#/reference/actor-tasks/run-collection/run-task) - Also see: https://github.com/apify/actor-specs/pull/5#discussion_r775641112 - `output` will be a property of run object generated from Output schema - -NOTE: We decided that output schema can reference other datasets/kv-stores/queues -but only those ones that are referenced in the input, or the default. Hence -there's no point to include storage schema here again, as it's done elsewhere. - - -TODO: Fix this +The output schema is also used by the system to generate the user interface, API examples, integrations, etc. ## Structure @@ -93,6 +66,37 @@ TODO: Fix this } ``` + +## Random notes + +- **NOTE:** The output schema should enable developers to define schema for the + default dataset and key-value store. But how? It should be declarative + so that the system can check that e.g. the overridden default dataset + has the right schema. But then, when it comes to kv-store, that's not purely + output object but INPUT, similarly for overridden dataset or request queue. + Perhaps the cleanest way would be to set these directly in `.actor/actor.json`. +- The Run Sync API could have an option to automatically return (or redirect to?) + a specific property (i.e. URL) of the output object. + This would supersede the `outputRecordKey=OUTPUT` API param as well as + the run-sync-get-dataset-items API endpoint. + Maybe we could have one of the output properties as the main one, + which would be used by default for this kind of API endpoint, and just return + data to user. +- Same as we show Output in UI, we need to autogenerate the OUTPUT in API e.g. JSON format. + There would be properties like in the output_schema.json file, with e.g. URL to dataset, + log file, kv-store, live view etc. So it would be an auto-generated field "output" + that we can add to JSON returned by the Run API endpoints + (e.g. 
https://docs.apify.com/api/v2#/reference/actor-tasks/run-collection/run-task) + - Also see: https://github.com/apify/actor-specs/pull/5#discussion_r775641112 + - `output` will be a property of run object generated from Output schema + +NOTE: We decided that output schema can reference other datasets/kv-stores/queues +but only those ones that are referenced in the input, or the default. Hence +there's no point to include storage schema here again, as it's done elsewhere. + +TODO: Fix this + + ## Examples of ideal Actor run UI - For the majority of Actors, we want to see the dataset with new records being added in realtime diff --git a/pages/REQUEST_QUEUE_SCHEMA.md b/pages/REQUEST_QUEUE_SCHEMA.md index 1f86ba2f..fbae3873 100644 --- a/pages/REQUEST_QUEUE_SCHEMA.md +++ b/pages/REQUEST_QUEUE_SCHEMA.md @@ -1,9 +1,9 @@ # Request queue schema file specification [work in progress] -TODO: This will be added later +Currently, this is neither specified nor implemented. +We think that a request queue schema might be useful for two things: -But in general I think that it might be useful for 2 things: - ensuring what kind of URLs might be enqueued (certain domains or subdomains, ...) -- ensure that for example each requets has `userData.label`, i.e. schema of `userData` the same way as we enforce it for the Datasets +- ensuring that, for example, each request has `userData.label`, i.e. a schema for `userData`, the same way as we enforce it for Datasets -- Consider renaming `RequestQueue` to just `Queue` and make it more generic, and then it makes sense to have request schema +We should consider renaming `RequestQueue` to just `Queue` to make it more generic; then it would make sense to have a request schema.
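The two potential uses of a request queue schema mentioned above can be sketched in code. The following Python snippet (standard library only) illustrates the kind of checks such a schema could enforce; the schema shape and its field names (`allowedDomains`, `requiredUserDataFields`) are hypothetical illustrations for this whitepaper, not part of any implemented specification:

```python
from urllib.parse import urlparse

# Hypothetical request queue schema: restrict enqueued URLs to certain
# domains (or their subdomains) and require certain userData fields.
SCHEMA = {
    "allowedDomains": ["example.com"],
    "requiredUserDataFields": ["label"],
}

def validate_request(request: dict, schema: dict = SCHEMA) -> list[str]:
    """Return a list of validation errors (an empty list means the request is valid)."""
    errors = []
    host = urlparse(request.get("url", "")).hostname or ""
    if not any(host == d or host.endswith("." + d) for d in schema["allowedDomains"]):
        errors.append(f"URL host {host!r} is not in the allowed domains")
    user_data = request.get("userData", {})
    for field in schema["requiredUserDataFields"]:
        if field not in user_data:
            errors.append(f"Missing required userData field {field!r}")
    return errors

# Valid request: allowed subdomain and userData.label present.
ok = validate_request({"url": "https://shop.example.com/p/1", "userData": {"label": "DETAIL"}})
# Invalid request: disallowed domain and no userData.label.
bad = validate_request({"url": "https://evil.org/", "userData": {}})
```

In a real implementation, the system would run checks like these whenever a request is enqueued, the same way dataset schemas are enforced when items are pushed.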