<!DOCTYPE html>
<html>
<head>
<title>Secure Data Store 0.1</title>
<meta http-equiv='Content-Type' content='text/html;charset=utf-8'/>
<!--
=== NOTA BENE ===
For the three scripts below, if your spec resides on dev.w3 you can check them
out in the same tree and use relative links so that they'll work offline,
-->
<script src='https://www.w3.org/Tools/respec/respec-w3c-common' class='remove'></script>
<script src="./common.js" class="remove"></script>
<script type="text/javascript" class="remove">
var respecConfig = {
// specification status (e.g., WD, LCWD, NOTE, etc.). If in doubt use ED.
specStatus: "unofficial",
// the specification's short name, as in http://www.w3.org/TR/short-name/
shortName: "secure-data-store",
// subtitle for the spec
subtitle: "Secure Data Store",
// if you wish the publication date to be other than today, set this
//publishDate: "2019-03-26",
//crEnd: "2019-04-23",
//implementationReportURI: "https://w3c.github.io/sdh-test-suite",
previousMaturity: "CG-DRAFT",
previousPublishDate: "2020-01-26",
prevVersion: "https://digitalbazaar.github.io/encrypted-data-vaults/",
// extend the bibliography entries
localBiblio: ccg.localBiblio,
doJsonLd: true,
github: "https://github.com/decentralized-identity/secure-data-store",
includePermalinks: false,
// if there is a publicly available Editor's Draft, this is the link
edDraftURI: "https://identity.foundation/secure-data-store/",
// Override respec autogenerated w3c URLs
thisVersion: "https://identity.foundation/secure-data-store/",
latestVersion: "https://identity.foundation/secure-data-store/",
// if this is a LCWD, uncomment and set the end of its review period
// lcEnd: "2009-08-05",
// editors, add as many as you like
// only "name" is required
editors: [
{ name: "Manu Sporny", url: "http://manu.sporny.org/",
company: "Digital Bazaar", companyURL: "http://digitalbazaar.com/"},
{ name: "Daniel Buchner", url: "https://www.linkedin.com/in/dbuchner/",
company: "Microsoft", companyURL: "https://microsoft.com/"},
{ name: "Orie Steele", url: "https://www.linkedin.com/in/or13b/",
company: "Transmute", companyURL: "https://transmute.industries" },
],
// authors, add as many as you like.
// This is optional, uncomment if you have authors as well as editors.
// only "name" is required. Same format as editors.
// authors:
// [
// ],
// name of the WG
// wg: "Secure Data Storage Working Group",
// URI of the public WG page
// wgURI: "https://www.w3.org/community/credentials/",
// name (with the @w3c.org) of the public mailing to which comments are due
// wgPublicList: "public-credentials",
// URI of the patent status for this WG, for Rec-track documents
// !!!! IMPORTANT !!!!
// This is important for Rec-track documents, do not copy a patent URI from a random
// document unless you know what you're doing. If in doubt ask your friendly neighborhood
// Team Contact.
// wgPatentURI: "https://www.w3.org/2004/01/pp-impl/98922/status",
maxTocLevel: 2,
inlineCSS: true
};
</script>
<style>
pre .highlight {
font-weight: bold;
color: green;
}
pre .comment {
font-weight: bold;
color: Gray;
}
.color-text {
font-weight: bold;
text-shadow: -1px 0 black, 0 1px black, 1px 0 black, 0 -1px black;
}
</style>
</head>
<body>
<section id='abstract'>
<p>
We store a significant amount of sensitive data online, such as personally
identifying information (PII), trade secrets, family pictures, and customer
information. The data that we store is often not protected in an appropriate
manner.
</p>
<p>
This specification describes a privacy-respecting mechanism for storing,
indexing, and retrieving encrypted data at a storage provider. It is often
useful when an individual or organization wants to protect data in a way that
the storage provider cannot view, analyze, aggregate, or resell the data.
This approach also ensures that application data is portable and protected
from storage provider data breaches.
</p>
</section>
<section id='sotd'>
<p>
This specification is a joint work item of the
<a href="https://www.w3.org/community/credentials/">W3C Credentials Community
Group</a> and the <a href="https://identity.foundation/">Decentralized
Identity Foundation</a>.
This specification is a combination of and iteration on work done by both of
these groups. Input documents, or parts thereof, which have not yet been
integrated into the specification may be found in the appendices.
</p>
</section>
<section class="informative">
<h2>
Introduction
</h2>
<p>
We store a significant amount of sensitive data online, such as personally
identifying information (PII), trade secrets, family pictures, and customer
information. The data that we store is often not protected in an appropriate
manner.
</p>
<p>
Legislation, such as the General Data Protection Regulation (GDPR), incentivizes
service providers to better preserve individuals' privacy, primarily through
making the providers liable in the event of a data breach. This liability
pressure has revealed a technological gap, whereby providers are often not
equipped with technology that can suitably protect their customers. Encrypted
Data Vaults fill this gap and provide a variety of other benefits.
</p>
<p>
This specification describes a privacy-respecting mechanism for storing,
indexing, and retrieving encrypted data at a storage provider. It is often
useful when an individual or organization wants to protect data in a way that
the storage provider cannot view, analyze, aggregate, or resell the data.
This approach also ensures that application data is portable and protected
from storage provider data breaches.
</p>
<section class="informative">
<h3>
Why Do We Need Encrypted Data Vaults?
</h3>
<p class="issue">
Explain why individuals and organizations that want to protect their privacy,
trade secrets, and ensure data portability will benefit from using this
technology. Explain how providing a standard API for the storage of user data
empowers users to "bring their own storage", giving them control of their
own information. Explain how applications that are written against a standard
API and assume that users will bring their own storage can separate concerns
and focus on the functionality of their application, removing the need to
deal with storage infrastructure (instead leaving it to a specialist service
provider that is chosen by the user).
</p>
<p>
Requiring client-side (edge) encryption for all data and metadata at the same
time as enabling the user to store data on multiple devices and to share data
with others, whilst also having searchable or queryable data, has been
historically very difficult to implement in one system. Trade-offs are often
made which sacrifice privacy in favor of usability, or vice versa.
</p>
<p>
Due to a number of maturing technologies and standards, we are hopeful that such
trade-offs are no longer necessary, and that it is possible to design a
privacy-preserving protocol for encrypted decentralized data storage that has
broad practical appeal.
</p>
</section>
<section class="informative">
<h3>
Ecosystem Overview
</h3>
<p>
The problem of decentralized data storage has been approached from various
different angles, and personal data stores (PDS), decentralized or otherwise,
have a long history in commercial and academic settings. Different approaches
have resulted in variations in terminology and architectures.
The diagram below shows the types of components that are emerging, and the roles
they play. Encrypted Data Vaults fulfill the low-level encrypted
<em>storage</em> role.
</p>
<figure>
<img style="margin: auto; display: block; width: 75%;"
src="diagrams/SDS_Layers.svg" alt="diagram showing
the roles of different technologies in the encrypted
data vaults and secure data store ecosystem and how they interact.">
<figcaption style="text-align: center;">
Secure Data Storage layers
</figcaption>
</figure>
<p>
This section describes the roles of the core actors and the relationships
between them in an ecosystem where this specification is expected
to be useful. A role is an abstraction that might be implemented in many
different ways. The separation of roles suggests likely interfaces and
protocols for standardization. The following roles are introduced in this
specification:
</p>
<dl>
<dt><dfn>data vault controller</dfn></dt>
<dd>
A role an <a>entity</a> might perform by creating, managing, and deleting
data vaults. This entity is also responsible for granting and revoking
authorization to <a>storage agents</a> to the data vaults that are under its
control.
</dd>
<dt><dfn data-lt="storage agents">storage agent</dfn></dt>
<dd>
A role an <a>entity</a> might perform by creating, updating, and deleting
data in a data vault. This entity is typically granted authorization to
access a data vault by a <a>data vault controller</a>.
</dd>
<dt><dfn>storage provider</dfn></dt>
<dd>
A role an <a>entity</a> might perform by providing a raw data storage
mechanism to a <a>data vault controller</a>. It is impossible for this entity
to see the data that it is storing due to all data being encrypted at rest
and in transit to and from the <a>storage provider</a>.
</dd>
</dl>
</section>
<section class="informative">
<h3>
Prior Art
</h3>
<h4>
Peergos
</h4>
<p>
<a href="https://github.com/peergos/peergos">Peergos</a> has many of the same requirements as SDS (in fact, stronger privacy requirements, especially against a quantum computer) and is built on top of IPFS.
In summary, its properties include:
</p>
<ul>
<li>
global peer-to-peer private filesystem
</li>
<li>
strong end-to-end encryption and fine-grained access control (read-only and writable), which is capability-based, so servers are trustless
</li>
<li>
data model is hash-linked data (IPLD) with updates signed by a keypair
</li>
<li>
hides metadata from the server, including file names, MIME types, file sizes, and directory topology
</li>
<li>
hides the social graph from the server (the server cannot see who has been granted access to what)
</li>
<li>
directories are indistinguishable from small files to the server
</li>
<li>
independent of DNS and TLS certificate authorities (though there is a web interface if you trust them)
</li>
<li>
handles arbitrarily large files, including streaming and O(1) seeking
</li>
<li>
data can be trivially mirrored, which provides live redundancy over the IPFS protocol
</li>
<li>
efficient modification of large files without having to re-encrypt the entire file
</li>
<li>
resistant to quantum-computer-based attacks: unshared files are already fully post-quantum, while shared files currently have a limited time window of vulnerability to a large quantum computer
</li>
<li>
access control is done with a version of <a href="https://github.com/Peergos/Peergos/raw/master/papers/wuala-cryptree.pdf">cryptree</a>, improved to fit in the IPFS data model. More details are available in the <a href="https://book.peergos.org/security/cryptree.html">Peergos documentation on cryptree</a>.
</li>
<li>
all data is in a Merkle-CHAMP (compressed hash-array mapped prefix-trie), a data structure well suited to content-addressed mutable data that also works well with CRDTs. The CHAMP structure is described in the <a href="https://michael.steindorfer.name/publications/oopsla15.pdf">original paper</a>. The useful properties of CHAMPs are insertion-order independence (giving a canonical root for a given set of mappings) and a balanced structure.
</li>
<li>
Unshared files are safe from exposure by a large quantum computer (this is because their privacy only relies on hashing and symmetric encryption, neither of which is significantly weakened by a quantum computer).
</li>
<li>
Peergos servers are trustless. The worst that a malicious server could do is delete your data or withhold valid updates, both of which are easily detected and mitigated by running a mirror.
</li>
</ul>
<p>
More technical descriptions are available in the <a href="https://book.peergos.org">Peergos book</a>.
</p>
</section>
<section class="informative">
<h3>
Use Cases
</h3>
<p class="issue">
Use cases have been moved to a distinct <a href="use_cases.md">markdown document</a>.
</p>
<section>
<h4>
Deployment topologies
</h4>
<p>
Based on the use cases, we consider the following deployment topologies:
</p>
<ul>
<li>
<strong>Mobile Device Only:</strong> The server and the client reside on the
same device. The vault is a library providing functionality via a binary API,
using local storage to provide an encrypted database.
</li>
<li>
<strong>Mobile Device Plus Cloud Storage:</strong> A mobile device plays the role
of a client, and the server is a remote cloud-based service provider that has
exposed the storage via a network-based API (e.g., REST over HTTPS). Data is not
stored on the mobile device.
</li>
<li>
<strong>Multiple Devices (Single User) Plus Cloud Storage:</strong> When adding
more devices managed by a single user, the vault can be used to synchronize data
across devices.
</li>
<li>
<strong>Multiple Devices (Multiple Users) Plus Cloud Storage:</strong> When
pairing multiple users with cloud storage, the vault can be used to synchronize
data between multiple users with the help of replication and merge strategies.
</li>
<li>
<p><strong>Multi-/Cross-cloud:</strong> Some use cases (IoT, machine-to-machine,
Skynet, guardianship) require a non-human or non-functioning actor to
delegate KMS/key control to a cloud vault for oversight or human intervention.
In some password manager architectures, in biometrically accessed or deployed
key material storage, and in some multi-cloud or hybrid-cloud
architectures, key material will need to be retrieved from at least one other
vault before accessing the vault being specified here.</p>
<p>Keys in control of such an entity might still need to securely store signed
credentials or data in a separate vault. Additional diagramming or
specifications will be needed to show how this 2-vault solution could be
constrained to be secure and feasible, even if non-normative.</p>
</li>
<li>
<strong>Self-Hosted and/or Home-based Server:</strong> Alice wants to host her
own SDS software instance, on her own server.
</li>
<li>
<strong>Support Low Power Devices/Non-private computing:</strong> To support
users without access to private computing resources, the following three
components need to be considered:
<ol>
<li>Secure Storage</li>
<li>Key vault - private key storage and recovery (Key management)</li>
<li>Trusted computing - computational resources which have access to private
keys and plain text private data</li>
</ol>
</li>
</ul>
</section>
</section>
<section class="informative">
<h3>
Requirements
</h3>
<p>
The following sections elaborate on the requirements that have been gathered
from the core use cases.
</p>
<section>
<h4>
Privacy and multi-party encryption
</h4>
<p>
One of the main goals of this system is ensuring the privacy of an entity's
data so that it cannot be accessed by unauthorized parties, including the
storage provider.
</p>
<p>
To accomplish this, the data must be encrypted both while it is in transit
(being sent over a network) and while it is at rest (on a storage system).
</p>
<p>
Since data could be shared with more than one entity, it is also necessary for
the encryption mechanism to support encrypting data to multiple parties.
</p>
</section>
<section>
<h4>
Sharing and authorization
</h4>
<p>
It is necessary to have a mechanism that enables authorized sharing
of encrypted information among one or more entities.
</p>
<p>
The system is expected to specify one mandatory authorization scheme,
but also allow other alternate authorization schemes. Examples of
authorization schemes include OAuth2, Web Access Control, and
[[ZCAP]]s (Authorization Capabilities).
</p>
</section>
<section>
<h4>
Identifiers
</h4>
<p>
The system should be identifier agnostic. In general, identifiers that are a
form of URN or URL are preferred. While it is presumed that [[DID-CORE]]
(Decentralized Identifiers, DIDs) will be used by the system in a few important ways, hard-coding the implementations to DIDs would be an anti-pattern.
</p>
</section>
<section>
<h4>
Versioning and replication
</h4>
<p>
It is expected that information can be backed up on a continuous basis. For this
reason, it is necessary for the system to support at least one mandatory
versioning strategy and one mandatory replication strategy, but also allow other
alternate versioning and replication strategies.
</p>
</section>
<section>
<h4>
Metadata and searching
</h4>
<p>
Large volumes of data are expected to be stored using this system, which then
need to be efficiently and selectively retrieved. To that end, an encrypted
search mechanism is a necessary feature of the system.
</p>
<p>
It is important for clients to be able to associate metadata with the data such
that it can be searched. At the same time, since privacy of both data <em>and</em>
metadata is a key requirement, the metadata must be stored in an encrypted
state, and service providers must be able to perform those searches in an opaque
and privacy-preserving way, without being able to see the metadata.
</p>
</section>
<section>
<h4>
Protocols
</h4>
<p>
Since this system can reside in a variety of operating environments, it is
important that at least one protocol is mandatory, but that other protocols are
also allowed by the design. Examples of protocols include HTTP, gRPC, Bluetooth,
and various binary on-the-wire protocols. An HTTPS API is defined in <a href="#data-vault-https-api"></a>.
</p>
</section>
</section>
<section class="informative">
<h3>
Design goals
</h3>
<p>
This section elaborates upon a number of guiding principles and design goals
that shape Encrypted Data Vaults.
</p>
<section>
<h4>
Layered and modular architecture
</h4>
<p>
A layered architectural approach is used to ensure that the foundation for the
system is easy to implement while allowing more complex functionality to be
layered on top of the lower foundations.
</p>
<p>
For example, Layer 1 might contain the mandatory features for the most basic
system, Layer 2 might contain useful features for most deployments, Layer 3
might contain advanced features needed by a small subset of the ecosystem, and
Layer 4 might contain extremely complex features that are needed by a very small
subset of the ecosystem.
</p>
</section>
<section>
<h4>
Prioritize privacy
</h4>
<p>
This system is intended to protect an entity's privacy. When exploring new
features, always ask "How would this impact privacy?". New features that
negatively impact privacy are expected to undergo extreme scrutiny to determine
if the trade-offs are worth the new functionality.
</p>
</section>
<section>
<h4>
Push implementation complexity to the client
</h4>
<p>
Servers in this system are expected to provide functionality strongly focused on
the storage and retrieval of encrypted data. The more a server knows, the
greater the risk to the privacy of the entity storing the data, and the more
liability the service provider might have for hosting data. In addition, pushing
complexity to the client enables service providers to provide stable server-side
implementations while innovation can be carried out by clients.
</p>
</section>
</section>
<section id="conformance" class="normative">
</section>
</section>
<section class="informative">
<h2>
Terminology
</h2>
<div data-include="./terms.html"
data-oninclude="restrictReferences">
</div>
</section>
<section class="informative">
<h2>
Core Concepts
</h2>
<p>
The following sections outline core concepts, such as encrypted storage,
which form the foundation of this specification.
</p>
<section class="normative">
<h2>
Encrypted Storage
</h2>
<p>
An important consideration of encrypted data stores is which components of the
architecture have access to the (unencrypted) data, or who controls the private
keys. There are roughly three approaches: storage-side encryption, client-side
(edge) encryption, and gateway-side encryption (which is a hybrid of the
previous two).
</p>
<p>
Any data storage system that lets the user store arbitrary data also supports
client-side encryption at the most basic level. That is, it lets the user
encrypt data themselves, and then store it. This doesn't mean such systems are
optimized for encrypted data, however. Querying and access control for encrypted
data may be difficult.
</p>
<p>
Storage-side encryption is usually implemented as
<a href="https://en.wikipedia.org/wiki/Disk_encryption">whole-disk encryption</a>
or filesystem-level encryption. This is widely supported and understood, and any
type of hosted cloud storage is likely to use storage-side encryption. In this
scenario the private keys are managed by the service provider or controller of
the storage server, which may be a different entity than the user who is storing
the data. Encrypting the data while it resides on disk is a useful security
measure should physical access to the storage hardware be compromised, but does
not guarantee that <em>only</em> the original user who stored the data has access.
</p>
<p>
Conversely, client-side encryption offers a high level of
security and privacy, especially if metadata can be encrypted as well. Encryption
is done at the individual data object level, usually aided by a keychain or wallet
client, so the user has direct access to the private keys. This comes at a cost,
however, since the significant responsibility of key management and recovery falls
squarely onto the end user. In addition, the question of key management becomes
more complex when data needs to be shared.
</p>
<p>
Gateway-side encryption systems take an approach that combines
techniques from storage-side and client-side encryption architectures. These
storage systems, typically encountered among multi-server clusters or some
"encryption as a platform" cloud service providers, recognize that client-side
key management may be too difficult for some users and use cases, and offer to
perform encryption and decryption themselves in a way that is transparent to
the client application. At the same time, they aim to minimize the number of
components (storage servers) that have access to the private decryption keys.
As a result, the keys usually reside on "gateway" servers, which encrypt the
data before passing it to the storage servers. The encryption/decryption is
transparent to the client, and the data is opaque to the storage servers, which
can be modular/pluggable as a result. Gateway-side encryption provides some
benefits over storage-side systems, but also shares their main drawback: the gateway
sysadmin controls the keys, not the user.
</p>
</section>
<section class="normative">
<h2>
Structured Documents
</h2>
<p>
The fundamental unit of storage in data vaults is the encrypted
structured document which, when decrypted, provides a data structure that
can be expressed in popular syntaxes such as JSON and CBOR. Documents can
store structured data and metadata about the structured data. Structured
document sizes are limited to 16MB.
</p>
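<p>
As a non-normative illustration, a decrypted structured document might look
like the following when expressed as JSON. The <code>id</code>,
<code>meta</code>, and <code>content</code> member names follow the Resource
structure described later in this specification; every value, and the members
nested inside <code>content</code>, are placeholders invented for this sketch.
</p>
<pre class="example nohighlight">
<span class="comment">// Illustrative values only; not defined by this specification.</span>
{
  "id": "urn:uuid:00000000-0000-4000-8000-000000000000",
  "meta": {
    "contentType": "application/json"
  },
  "content": {
    "title": "Example note",
    "body": "Hello, vault."
  }
}
</pre>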
</section>
<section class="normative">
<h2>
Streams
</h2>
<p>
For files larger than 16MB or for raw binary data formats such as audio,
video, and office productivity files, a streaming API is provided that
enables data to be streamed to/from a data vault. Streams are described using
structured documents, but the storage of the data is separated from the
structured document using a hashlink to the encrypted content.
</p>
</section>
<section class="normative">
<h2>
Indexing
</h2>
<p>
Data vaults are expected to store a very large number of documents
of varying kinds. This means that it is important to be able to search the
documents in a timely way, which creates a challenge for the storage provider
as the content is encrypted. Previously this has been worked around
by attaching a certain amount of unencrypted metadata to the data objects.
Another possibility is to keep unencrypted listings of pointers to filtered
subsets of data. Both approaches expose information to the storage provider.
</p>
<p>
In the case of data vaults, an encrypted search scheme is provided that
enables data vault clients to index metadata while
<em>not leaking</em> that metadata to the storage provider.
</p>
</section>
</section>
<section class="normative">
<h2>
Architecture
</h2>
<p class="issue">
Review this section for language that should be properly normative.
</p>
<p>
This section describes the architecture of the Encrypted Data Vault protocol, in
the form of a client-server relationship. The vault is regarded as the server and
the client acts as the interface used to interact with the vault.
</p>
<p>
This architecture is layered in nature, where the foundational layer consists of
an operational system with minimal features, and where more advanced features are
layered on top. Implementations can choose to implement only the foundational
layer, or optionally, additional layers consisting of a richer set of features
for more advanced use cases.
</p>
<section>
<h3>
Server and client responsibilities
</h3>
<p>
The server is assumed to be of low trust, and must have no visibility into the
data that it persists. However, even in this model, the server still has a set
of minimum responsibilities it must adhere to.
</p>
<p>
The client is responsible for providing an interface to the server, with
bindings for each relevant protocol (HTTP, RPC, or binary over-the-wire
protocols), as required by the implementation.
</p>
<p>
All encryption and decryption of data is done on the client side, at the edges.
The data (including metadata) MUST be opaque to the server, and the architecture
is designed to prevent the server from being able to decrypt it.
</p>
</section>
<section>
<h3>
Layer 1 (L1) responsibilities
</h3>
<p>
Layer 1 consists of a client-server system that is capable of encrypting data in
transit and at rest.
</p>
<section>
<h4>
Server: Validate requests (L1)
</h4>
<p>
When a vault client makes a request to store, query, modify, or delete data in
the vault, the server validates the request. Since the actual data and metadata
in any given request is encrypted, such validation is necessarily limited and
largely depends on the protocol and the semantics of the request.
</p>
</section>
<section>
<h4>
Server: Persist data (L1)
</h4>
<p>
The mechanism a server uses to persist data, such as storage on a local,
networked, or distributed file system, is determined by the implementation. The
persistence mechanism is expected to adhere to the common expectations of a data
storage provider, such as reliable storage and retrieval of data.
</p>
</section>
<section>
<h4>
Server: Persist global configuration (L1)
</h4>
<p>
A vault has a global configuration that defines the following properties:
</p>
<ul>
<li>
Stream chunk size
</li>
<li>
Other config metadata
</li>
</ul>
<p>
The configuration allows the client to perform capability discovery
regarding things like the authorization, protocol, and replication mechanisms that are used
by the server.
</p>
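<p>
As a non-normative sketch, a global configuration could be represented as a
small JSON object such as the one below. All member names
(<code>chunkSize</code>, <code>authorization</code>, <code>protocols</code>,
<code>replication</code>) and all values are hypothetical and chosen only for
illustration; this specification does not define them. The chunk size shown is
1 MiB expressed in bytes.
</p>
<pre class="example nohighlight">
<span class="comment">// Hypothetical member names and values, for illustration only.</span>
{
  "chunkSize": 1048576,
  "authorization": ["zcap"],
  "protocols": ["https"],
  "replication": ["none"]
}
</pre>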
</section>
<section>
<h4>
Server: Enforcement of authorization policies (L1)
</h4>
<p>
When a client makes a request to store, query, modify, or delete data in
the vault, the server enforces any authorization policy that is associated with
the request.
</p>
</section>
<section>
<h4>
Client: Encrypted data chunking (L1)
</h4>
<p>
An Encrypted Data Vault is capable of storing many different types of data,
including large unstructured binary data. This means that storing a file as a
single entry would be challenging for systems that have limits on single record
sizes. For example, some databases set the maximum size for a single record to
16MB. As a result, it is necessary that large data is chunked into sizes that
are easily managed by a server. It is the responsibility of the client to set
the chunk size of each resource and chunk large data into manageable chunks for
the server. It is the responsibility of the server to deny requests to store
chunks larger than it can handle.
</p>
<p>
Each chunk is encrypted individually using authenticated encryption. Doing so
protects against attacks where an attacking server replaces chunks in a large
file and requires the entire file to be downloaded and decrypted by the victim
before determining that the file is compromised. Encrypting each chunk with
authenticated encryption ensures that a client knows that it has a valid chunk
before proceeding to the next one. Note that another authorized client can still
perform this attack, because it is able to produce a validly authenticated
replacement chunk, but a server is not capable of launching the same attack.
</p>
</section>
<section>
<h4>
Client: Resource structure (L1)
</h4>
<p>
The process of storing encrypted data starts with the creation of a Resource by
the client, with the following structure.
</p>
<p>
Resource:
</p>
<ul>
<li>
<code>id</code> (required)
</li>
<li>
<code>meta</code>
<ul>
<li>
<code>meta.contentType</code> MIME type
</li>
</ul>
</li>
<li>
<code>content</code> - entire payload, or a manifest-like list of hashlinks to individual chunks
</li>
</ul>
<p>
If the data is less than the chunk size, it is embedded directly into the
<code>content</code>.
</p>
<p>
Otherwise, the data is sharded into chunks by the client (see previous section), and
each chunk is encrypted and sent to the server. In this case, <code>content</code>
contains a manifest-like listing of URIs to individual chunks (integrity-protected
by [[HASHLINK]]).
</p>
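<p>
The two cases above can be sketched as follows. This is an illustration under
stated assumptions, not the normative data model: the function name
<code>build_resource</code>, the <code>uri</code> and <code>sha256</code> keys,
and the chunk URI pattern are hypothetical, a plain SHA-256 hex digest stands in
for a real hashlink, and per-chunk encryption is omitted.
</p>

```python
import hashlib
from typing import Any, Dict, List


def build_resource(resource_id: str, content_type: str, data: bytes,
                   chunk_size: int) -> Dict[str, Any]:
    """Build a Resource: embed small data, otherwise list its chunks."""
    meta = {"contentType": content_type}
    if chunk_size > len(data):
        # Data is less than the chunk size: embed it directly in content.
        return {"id": resource_id, "meta": meta, "content": data}
    manifest: List[Dict[str, str]] = []
    for index, start in enumerate(range(0, len(data), chunk_size)):
        chunk = data[start:start + chunk_size]
        manifest.append({
            # Hypothetical chunk URI pattern, for illustration only.
            "uri": f"{resource_id}/chunks/{index}",
            # Plain SHA-256 hex digest as a stand-in for a real hashlink.
            "sha256": hashlib.sha256(chunk).hexdigest(),
        })
    return {"id": resource_id, "meta": meta, "content": manifest}
```

<p>
In a real client each chunk would be encrypted before it is written to the
server, and the integrity value would be computed over what is actually stored.
</p>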
</section>
<section>
<h4>
Client: Encrypted resource structure (L1)
</h4>
<p>
Next, the client creates the Encrypted Resource. If the data was sharded into
chunks, this is done after the individual chunks are written to the server.
The Encrypted Resource has the following structure:
</p>
<ul>
<li>
<code>id</code>
</li>
<li>
<code>index</code> - encrypted index tags prepared by the client (for use with
privacy-preserving querying over encrypted resources)
</li>
<li>
<em>Chunk size</em> (if different from the default in global config)
</li>
<li>
<em>Versioning metadata</em> - such as sequence numbers, Git-like hashes, or other mechanisms
</li>
<li>
<em>Encrypted resource payload</em> - encoded as a <code>jwe</code> [[RFC7516]], <code>cwe</code> [[RFC8152]] or other appropriate mechanism
</li>
</ul>
</section>
</section>
<section>
<h3>
Layer 2 (L2) responsibilities
</h3>
<p>
Layer 2 consists of a system that is capable of sharing data among multiple
entities, of versioning and replication, and of performing privacy-preserving searches
in an efficient manner.
</p>
<section>
<h4>
Client: Encrypted search indexes (L2)
</h4>
<p>
To enable privacy-preserving querying (where the search index is opaque to the
server), the client must prepare a list of encrypted index tags (which are stored
in the Encrypted Resource, alongside the encrypted data contents).
</p>
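<p>
One possible way to prepare such tags is sketched below. This document does not
specify the mechanism, so everything here is an assumption: the use of
HMAC-SHA-256, the function names <code>blind</code> and
<code>make_index_tags</code>, and the <code>name</code> and <code>value</code>
keys are all hypothetical.
</p>

```python
import hashlib
import hmac
from typing import Dict, List


def blind(key: bytes, value: str) -> str:
    """Return a keyed, deterministic tag that hides value from the server."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def make_index_tags(key: bytes,
                    attributes: Dict[str, str]) -> List[Dict[str, str]]:
    """Blind each attribute name and each name/value pair with the client key."""
    tags = []
    for name, value in sorted(attributes.items()):
        tags.append({
            "name": blind(key, name),
            # Length-prefix the name so that different name/value splits
            # of the same concatenated string cannot produce the same tag.
            "value": blind(key, f"{len(name)}:{name}{value}"),
        })
    return tags
```

<p>
Because the tags are deterministic for a given key, a client holding the same
key could compute the tag for a query term and ask the server for an exact
match without revealing the term itself.
</p>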
<p class="issue">
Need details about salting and encryption mechanism of index tags.
</p>
</section>
<section>
<h4>
Client: Versioning and replication (L2)
</h4>
<p>
A server must support <em>at least one</em> versioning/change control mechanism.
Replication is done by the client, not by the server (since the client controls
the keys, knows about which other servers to replicate to, etc.). If an
Encrypted Data Vault implementation aims to provide replication functionality,
it MUST also pick a versioning/change control strategy (since replication
necessarily involves conflict resolution). Some versioning strategies are
implicit ("last write wins", e.g., <code>rsync</code> or uploading a file to a file
hosting service), but keep in mind that a replication strategy <em>always</em> implies
some form of conflict resolution.
</p>
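<p>
As an illustration of the simplest strategy mentioned above, the following
sketch resolves a conflict with "last write wins" based on a sequence number.
The <code>Version</code> type and <code>resolve_last_write_wins</code> function
are hypothetical, and a per-resource sequence counter is assumed as the
versioning metadata.
</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    sequence: int   # assumed monotonically increasing per-resource counter
    payload: bytes  # opaque encrypted resource payload


def resolve_last_write_wins(local: Version, remote: Version) -> Version:
    """Keep the copy with the higher sequence number; prefer local on a tie."""
    return remote if remote.sequence > local.sequence else local
```

<p>
A tie with differing payloads is a genuine conflict that this strategy silently
discards, which is one reason an implementation might prefer a richer change
control mechanism.
</p>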
</section>
<section>
<h4>
Client: Sharing with other entities (L2)
</h4>
<p>
An individual vault's choice of authorization mechanism determines how a client
shares resources with other entities (for example, through an authorization
capability link or a similar mechanism).
</p>
</section>
</section>
<section>
<h3>
Layer 3 (L3) responsibilities
</h3>
<section>
<h4>
Server: Notifications (L3)
</h4>
<p>
It is helpful if data storage providers are able to notify clients when changes
to persisted data occur. A server may optionally implement a mechanism by which
clients can subscribe to changes in the vault.
</p>
</section>
<section>
<h4>
Client: Vault-wide integrity protection (L3)
</h4>
<p>
Vault-wide integrity protection is provided to prevent a variety of storage
provider attacks where data is modified in a way that is undetectable, such as
if documents are reverted to older versions or deleted. This protection
requires that a global catalog of all the resource identifiers that belong to a
user, along with the most recent version, is stored and kept up to date by the
client. Some clients may store a copy of this catalog locally (and
include an integrity protection mechanism such as [[HASHLINK]]) to guard against
interference or deletion by the server.
</p>
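<p>
The catalog check described above can be sketched as follows, assuming that
each resource carries a sequence number as its version and that the client can
ask the server for its current view of the vault. The function name
<code>check_vault</code> and the shape of both mappings are hypothetical.
</p>

```python
from typing import Dict, List, Tuple


def check_vault(catalog: Dict[str, int],
                server_view: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Compare the client's catalog with what the server reports.

    Both arguments map a resource id to its latest sequence number.
    Returns (deleted_ids, reverted_ids).
    """
    # Resources the client knows about that the server no longer returns.
    deleted = sorted(rid for rid in catalog if rid not in server_view)
    # Resources the server returns at an older version than the catalog records.
    reverted = sorted(
        rid for rid, seq in catalog.items()
        if rid in server_view and seq > server_view[rid]
    )
    return deleted, reverted
```

<p>
The check is only as trustworthy as the catalog itself, which is why the text
suggests keeping a local, integrity-protected copy.
</p>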
</section>
</section>
</section>
<section class="normative">
<h2>
Data Model
</h2>
<p>
The following sections outline the data model for data vaults.
</p>
<section>
<h3>
DataVaultConfiguration
</h3>
<p class="issue">
Data vault configuration isn't strictly necessary for using the other features
of data vaults. This should have its own conformance section/class or potentially
even be non-normative.
</p>
<p>
A data vault configuration specifies the properties a particular data vault
will have.
</p>
<table class="simple">
<thead>
<tr>
<th style="white-space: nowrap">Property</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>sequence</td>
<td>
A unique counter for the data vault in order to ensure that
clients are properly synchronized to the data vault. The value is required and
MUST be an unsigned 64-bit number.
</td>
</tr>
<tr>
<td>controller</td>
<td>
The entity or cryptographic key that is in control of the