This is a basic information service that allows users to archive web sites, capturing each site on a recurring schedule. It is built with Flask, Flask-RESTful, and JSON-LD. A user can take the following actions:
- Add a <DomainArchive> resource to the <DomainList>
- Create an <ArchivePlan> for a given <DomainArchive>, and also update or delete it if needed.
- View records of any created <SnapShot> for a given <DomainArchive>
The service is designed with four resource classes (<DomainList>, <DomainArchive>, <ArchivePlan>, and <SnapShot>), as defined in webarch_vocab.ttl. A single vocabulary, the Portland Common Data Model, describes a Collection and an Object. The Collection class describes the <DomainList> and the <DomainArchive>, while the Object class describes the <ArchivePlan> and the <SnapShot>. See the full data model as a Graph Diagram to view the way these four resources are linked together through the rdfs:member property.
The <DomainList> is a list resource whose contained resources are expressed through the rdfs:member property. The other three resource classes each have properties as follows:
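As a rough illustration of the rdfs:member linkage described above, the sketch below builds a JSON-LD style document for a <DomainList> in plain Python. The function name `make_domain_list`, the `localhost:5000` IRIs, and the exact document shape are assumptions made for this example; the service's real output may be structured differently.

```python
import json

# Namespace IRI for the RDF Schema vocabulary, which defines rdfs:member.
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def make_domain_list(list_id, archive_ids):
    """Build a JSON-LD style dict for a DomainList (hypothetical shape).

    Each DomainArchive is referenced by its IRI under rdfs:member.
    """
    return {
        "@context": {"rdfs": RDFS},
        "@id": list_id,
        "@type": "DomainList",
        "rdfs:member": [{"@id": archive_id} for archive_id in archive_ids],
    }


# Hypothetical IRIs, for illustration only.
domain_list = make_domain_list(
    "http://localhost:5000/domains",
    [
        "http://localhost:5000/domains/1",
        "http://localhost:5000/domains/2",
    ],
)

print(json.dumps(domain_list, indent=2))
```

Referencing members by `@id` rather than embedding them keeps the list document small; a client can then fetch each <DomainArchive> separately.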
For a <DomainArchive>:
- URL
- createdate
- title
- description
- owner
- @type
- @id
For an <ArchivePlan>:
- frequency
- depth
- @type
- @id
For a <SnapShot>:
- size
- runtime
- date
- filename
- @type
- @id
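The property lists above can be captured as a small lookup table, which is handy for checking a document before submitting it. This is only a sketch: the names `EXPECTED_PROPERTIES` and `missing_properties` are hypothetical, the sample values are invented, and treating every listed property as mandatory is an assumption rather than something the service is known to enforce.

```python
# Property names per resource class, copied from the lists above.
EXPECTED_PROPERTIES = {
    "DomainArchive": {
        "URL", "createdate", "title", "description", "owner", "@type", "@id",
    },
    "ArchivePlan": {"frequency", "depth", "@type", "@id"},
    "SnapShot": {"size", "runtime", "date", "filename", "@type", "@id"},
}


def missing_properties(resource_type, doc):
    """Return the sorted property names that doc lacks for resource_type."""
    return sorted(EXPECTED_PROPERTIES[resource_type] - set(doc))


# Hypothetical ArchivePlan document with the depth property left out.
plan = {
    "@id": "http://localhost:5000/domains/1/plan",
    "@type": "ArchivePlan",
    "frequency": "weekly",
}

print(missing_properties("ArchivePlan", plan))  # ['depth']
```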
We use assorted schema.org, W3C, and dcterms vocabularies to express conceptually what the properties of each resource mean.
- any URL is expressed as @type: http://schema.org/url
- any createdate for a new resource is expressed as @type: http://schema.org/dateCreated
- a user that submits a domain to be archived is expressed as @type: http://schema.org/creator
- the title given by a user to a domain is expressed as @type: http://schema.org/name
- the description of a domain, provided by the user, is expressed as @type: http://schema.org/description
- the "runtime" for a snapshot capture is expressed as @type: http://schema.org/Duration
- the "depth" property for an archive plan is expressed as @type: http://schema.org/depth
- the "cycle" or "frequency" of snapshot capturing is expressed as @type: http://purl.org/dc/terms/accrualPeriodicity
- the "filename" for a .warc file is expressed as @type: http://schema.org/name
- the "size" for any associated .warc file created by a snapshot is expressed as @type: http://schema.org/fileSize
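One way to picture the mapping above is as a dictionary from short property names to vocabulary IRIs, similar in spirit to a JSON-LD context. The sketch below is an assumption about how such a mapping could look, not the service's actual context: `CONTEXT` and `expand_keys` are hypothetical names, the pairing of "owner" with schema.org/creator is inferred from the property lists, and `expand_keys` is a simple key substitution rather than full JSON-LD expansion.

```python
SCHEMA = "http://schema.org/"
DCTERMS = "http://purl.org/dc/terms/"

# Short property name -> vocabulary IRI, following the list above.
CONTEXT = {
    "URL": SCHEMA + "url",
    "createdate": SCHEMA + "dateCreated",
    "owner": SCHEMA + "creator",  # assumed: "owner" is the submitting user
    "title": SCHEMA + "name",
    "description": SCHEMA + "description",
    "runtime": SCHEMA + "Duration",
    "depth": SCHEMA + "depth",
    "frequency": DCTERMS + "accrualPeriodicity",
    "filename": SCHEMA + "name",
    "size": SCHEMA + "fileSize",
}


def expand_keys(doc, context=CONTEXT):
    """Replace known short property names with their vocabulary IRIs.

    Keys that are not in the context (such as @id and @type) are kept as-is.
    """
    return {context.get(key, key): value for key, value in doc.items()}


# Hypothetical ArchivePlan values, for illustration only.
plan = {"@type": "ArchivePlan", "frequency": "weekly", "depth": 2}
expanded = expand_keys(plan)
print(expanded)
```

Note that "title" and "filename" both map to http://schema.org/name; they never collide in practice because they belong to different resource classes.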
- Install required dependencies (python3.x, pip, virtualenv, Flask, and Flask-RESTful), then create a virtualenv with the command: $ virtualenv venv
- cd to the directory where you extracted the webarchivingservice project, and activate your virtualenv with the command: $ . venv/bin/activate
- Start the server: $ python server.py
That's it!