Commit e9f33b1

author
Ayush Shukla
committed
add Readme doc for s3aTagging.md supporting addition of S3 tags through S3A
1 parent c21f9bd commit e9f33b1

1 file changed

+298
-0
lines changed
  • hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws

# S3 Object Tagging Support in Hadoop S3A Filesystem

## Overview

The Hadoop S3A filesystem connector now supports S3 object tagging, allowing users to automatically assign tags to S3 objects during creation and soft deletion operations. This feature enables better data organization, cost allocation, access control, and lifecycle management for S3-stored data.

**JIRA Issue**: [HADOOP-19536](https://issues.apache.org/jira/browse/HADOOP-19536#s3-tags)

## Table of Contents

- [Motivation](#motivation)
- [S3 Object Tagging Capabilities](#s3-object-tagging-capabilities)
- [Use Cases](#use-cases)
- [Configuration](#configuration)
- [Usage Examples](#usage-examples)
- [Soft Delete Feature](#soft-delete-feature)
- [Best Practices](#best-practices)
- [Limitations](#limitations)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)
- [References](#references)

## Motivation

Amazon S3 supports tagging objects with key-value pairs, providing several critical benefits:

1. **Cost Allocation**: Track and allocate S3 storage costs across departments, projects, or cost centers
2. **Access Control**: Use tags in IAM policies to control object access permissions
3. **Lifecycle Management**: Trigger automated lifecycle policies for object transitions and expiration
4. **Data Classification**: Organize and classify data for compliance, security, and business requirements
5. **Analytics and Reporting**: Enable detailed analytics and reporting based on object metadata

Previously, the Hadoop S3A connector lacked native support for object tagging, requiring users to implement custom solutions or use separate tools to tag objects post-creation.

## S3 Object Tagging Capabilities

### Tag Specifications
- **Maximum Tags**: Up to 10 tags per object
- **Structure**: Key-value pairs
- **Key Length**: Up to 128 Unicode characters
- **Value Length**: Up to 256 Unicode characters
- **Case Sensitivity**: Keys and values are case-sensitive
- **Uniqueness**: Tag keys must be unique per object (no duplicate keys)

### Allowed Characters
Tag keys and values can contain:
- Letters (a-z, A-Z)
- Numbers (0-9)
- Spaces
- Special symbols: `. : + - = _ / @`
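
The limits and character set above can be checked before a job is submitted. The following is a minimal Python sketch, not part of the S3A connector: `validate_tags` and its constants are hypothetical names, and the character class mirrors only the characters listed in this section.

```python
import re

# Limits and character set as listed in the sections above.
MAX_TAGS = 10
MAX_KEY_LEN = 128
MAX_VALUE_LEN = 256
ALLOWED = re.compile(r"[A-Za-z0-9 .:+\-=_/@]*")


def validate_tags(tags):
    """Return a list of problems found in a list of (key, value) pairs."""
    problems = []
    if len(tags) > MAX_TAGS:
        problems.append(f"too many tags: {len(tags)} > {MAX_TAGS}")
    seen = set()
    for key, value in tags:
        # Keys are case-sensitive, so "Department" and "department" differ.
        if key in seen:
            problems.append(f"duplicate key: {key!r}")
        seen.add(key)
        if not 1 <= len(key) <= MAX_KEY_LEN:
            problems.append(f"key length out of range: {key!r}")
        if len(value) > MAX_VALUE_LEN:
            problems.append(f"value too long for key {key!r}")
        if not ALLOWED.fullmatch(key) or not ALLOWED.fullmatch(value):
            problems.append(f"disallowed character in tag {key!r}")
    return problems
```

An empty list means the tag set satisfies every rule in this section; each string in a non-empty list names one violation.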

## Use Cases

### 1. Access Control with IAM Policies

Control object access based on tags:

```json
{
  "Effect": "Allow",
  "Action": "s3:GetObject",
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "s3:ExistingObjectTag/department": "finance"
    }
  }
}
```
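
The JSON above is a single policy statement. As a hedged illustration, the Python sketch below wraps such a statement in a complete policy document and scopes it to one bucket; `tag_read_policy` is a hypothetical helper and `my-bucket` is a placeholder name.

```python
import json


def tag_read_policy(bucket, tag_key, tag_value):
    """Build a policy document allowing reads of objects carrying one tag."""
    statement = {
        "Effect": "Allow",
        "Action": "s3:GetObject",
        # Scope the statement to objects in one bucket instead of "*".
        "Resource": f"arn:aws:s3:::{bucket}/*",
        "Condition": {
            "StringEquals": {f"s3:ExistingObjectTag/{tag_key}": tag_value}
        },
    }
    return {"Version": "2012-10-17", "Statement": [statement]}


print(json.dumps(tag_read_policy("my-bucket", "department", "finance"), indent=2))
```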

### 2. Lifecycle Management

Trigger lifecycle rules based on tags:

```json
{
  "Rules": [
    {
      "Status": "Enabled",
      "Filter": {
        "Tag": {
          "Key": "retention",
          "Value": "temporary"
        }
      },
      "Expiration": {
        "Days": 30
      }
    }
  ]
}
```

### 3. Cost Allocation and Tracking

- Use tags for cost tracking in AWS Cost Explorer
- Allocate costs across different business units or projects
- Generate detailed billing reports by tag dimensions

### 4. Data Analytics and Filtering

- Use S3 Analytics to filter and analyze data by tags
- Create custom reports based on tagged object metadata
- Enable data governance and compliance reporting

## Configuration

### Object Creation Tags

#### Method 1: Comma-Separated List
```properties
fs.s3a.object.tags=department=finance,project=alpha,owner=data-team
```

#### Method 2: Individual Tag Properties
```properties
fs.s3a.object.tag.department=finance
fs.s3a.object.tag.project=alpha
fs.s3a.object.tag.owner=data-team
fs.s3a.object.tag.environment=production
```
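
To make the relationship between the two methods concrete, here is a minimal Python sketch of how both forms could resolve to one tag map. It is an illustration only, not the connector's code: `resolve_object_tags` is a hypothetical name, and the rule that an individual property overrides the same key from the comma-separated list is an assumption this document does not confirm.

```python
TAGS_PROPERTY = "fs.s3a.object.tags"
TAG_PREFIX = "fs.s3a.object.tag."


def resolve_object_tags(conf):
    """Collect tags from a dict of configuration properties."""
    tags = {}
    # Method 1: one property holding comma-separated key=value pairs.
    for pair in conf.get(TAGS_PROPERTY, "").split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        tags[key.strip()] = value.strip()
    # Method 2: one property per tag; the suffix after the prefix is the key.
    for name, value in conf.items():
        if name.startswith(TAG_PREFIX):
            # Assumption: an individual property wins over Method 1.
            tags[name[len(TAG_PREFIX):]] = value
    return tags
```

Splitting each pair on the first `=` keeps any later `=` inside the value, which matters because `=` is itself an allowed tag character.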

### Soft Delete Tags
```properties
fs.s3a.soft.delete.enabled=true
fs.s3a.soft.delete.tag.key=archive
fs.s3a.soft.delete.tag.value=true
```

## Usage Examples

### Spark Applications

#### Using Comma-Separated Tags
```bash
spark-submit \
  --conf spark.hadoop.fs.s3a.object.tags=department=finance,project=alpha,environment=prod \
  --class MySparkApp \
  my-app.jar
```

#### Using Individual Tag Configurations
```bash
spark-submit \
  --conf spark.hadoop.fs.s3a.object.tag.department=finance \
  --conf spark.hadoop.fs.s3a.object.tag.project=alpha \
  --conf spark.hadoop.fs.s3a.object.tag.owner=data-team \
  --conf spark.hadoop.fs.s3a.object.tag.cost-center=engineering \
  --class MySparkApp \
  my-app.jar
```
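
When the tag set is held in a script rather than typed by hand, the `--conf` arguments can be generated. This is a small Python sketch under that assumption; `spark_tag_conf` is a hypothetical helper that uses only the `spark.hadoop.fs.s3a.object.tag.` prefix shown above.

```python
def spark_tag_conf(tags):
    """Build spark-submit arguments, one --conf per tag."""
    args = []
    for key, value in tags.items():
        args += ["--conf", f"spark.hadoop.fs.s3a.object.tag.{key}={value}"]
    return args


tags = {"department": "finance", "cost-center": "engineering"}
command = ["spark-submit", *spark_tag_conf(tags), "--class", "MySparkApp", "my-app.jar"]
print(" ".join(command))
```

Passing `command` as an argument list (for example to `subprocess.run`) avoids shell quoting problems when a tag value contains spaces.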

### Hadoop Commands

#### File Upload with Tags
```bash
hadoop fs \
  -Dfs.s3a.object.tag.department=finance \
  -Dfs.s3a.object.tag.project=quarterly-report \
  -put local-file.txt s3a://my-bucket/reports/
```

#### Directory Operations with Tags
```bash
hadoop fs \
  -Dfs.s3a.object.tags=team=analytics,retention=long-term \
  -put /local/data/ s3a://my-bucket/analytics/
```

### MapReduce Jobs

```bash
hadoop jar my-job.jar \
  -Dfs.s3a.object.tag.job-type=etl \
  -Dfs.s3a.object.tag.priority=high \
  input s3a://my-bucket/output/
```

## Soft Delete Feature

The soft delete feature allows you to tag objects instead of permanently deleting them, enabling data retention policies and recovery options.

### Important Behavior Notes

- **Default Tags**: If no soft delete tag key and value are specified, the default tag defined in the configuration is used
- **Tag Replacement**: When a soft delete is performed, **all existing tags on the object are removed** and replaced with only the soft delete tag
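
The tag replacement rule has a practical consequence: tags applied at creation are gone after a soft delete. The Python sketch below models that behavior on an in-memory dict standing in for a bucket; `apply_soft_delete` is a hypothetical name and the code does not call S3.

```python
def apply_soft_delete(object_tags, path, tag_key, tag_value):
    """Replace every tag on `path` with the single soft delete tag.

    `object_tags` maps object paths to their tag dicts. The previous tags
    are returned so a caller can record them before they are lost.
    """
    if path not in object_tags:
        raise KeyError(path)
    previous = object_tags[path]
    # The whole tag set is replaced, not merged.
    object_tags[path] = {tag_key: tag_value}
    return previous


bucket = {"reports/q1.csv": {"department": "finance", "project": "alpha"}}
saved = apply_soft_delete(bucket, "reports/q1.csv", "archive", "true")
print(bucket["reports/q1.csv"])
print(saved)
```

If the original tags matter for recovery or cost reporting, they need to be stored somewhere else before the soft delete runs.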

### Current Implementation

```bash
# Using custom soft delete tags
hadoop fs \
  -Dfs.s3a.soft.delete.enabled=true \
  -Dfs.s3a.soft.delete.tag.key=archive \
  -Dfs.s3a.soft.delete.tag.value=true \
  -rm s3a://my-bucket/file-to-archive.txt

# Using default soft delete tags (if configured)
hadoop fs \
  -Dfs.s3a.soft.delete.enabled=true \
  -rm s3a://my-bucket/file-to-archive.txt
```

### Future Capabilities (Planned)

```bash
# Mark file as soft-deleted with default tags
hadoop fs -rm -softDelete s3a://bucket/path/to/file.txt

# Mark file as soft-deleted with custom tags
hadoop fs -rm -softDelete custom_status deleted s3a://bucket/path/to/file.txt

# List files (soft-deleted files won't appear)
hadoop fs -ls s3a://bucket/path/

# Permanently delete soft-deleted files (requires separate process)
# This would typically be done with S3 lifecycle rules or scheduled jobs
```
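
Permanent removal through a lifecycle rule could look like the following Python sketch, which generates a rule expiring objects that carry the soft delete tag. The helper name `soft_delete_expiry_rule`, the rule `ID` format, and the 30-day window are illustrative assumptions; the tag `archive=true` matches the configuration example earlier in this document.

```python
import json


def soft_delete_expiry_rule(tag_key, tag_value, days):
    """Build a lifecycle configuration expiring soft-deleted objects."""
    return {
        "Rules": [
            {
                "ID": f"expire-{tag_key}-{tag_value}",
                "Status": "Enabled",
                # Only objects carrying the soft delete tag match the rule.
                "Filter": {"Tag": {"Key": tag_key, "Value": tag_value}},
                "Expiration": {"Days": days},
            }
        ]
    }


print(json.dumps(soft_delete_expiry_rule("archive", "true", 30), indent=2))
```

The printed JSON can be saved to a file and applied with `aws s3api put-bucket-lifecycle-configuration --bucket my-bucket --lifecycle-configuration file://lifecycle.json`. That call replaces the bucket's existing lifecycle configuration, so any current rules should be merged into the file first.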

## Best Practices

### 1. Tag Naming Conventions
- Use consistent naming conventions across your organization
- Consider using prefixes for different tag categories (e.g., `cost:department`, `security:classification`)
- Use lowercase with hyphens for readability: `cost-center`, `data-classification`

### 2. Tag Management
- Document your tagging strategy and enforce it across teams
- Regularly audit and clean up unused or inconsistent tags
- Use automation to ensure consistent tagging

### 3. Cost Optimization
- Use tags to identify and optimize storage costs
- Implement lifecycle policies based on tags to automatically transition or delete objects
- Monitor tag-based cost allocation reports regularly

### 4. Security Considerations
- Use tags in IAM policies for fine-grained access control
- Avoid including sensitive information in tag values
- Regularly review tag-based access policies

## Limitations

### S3 Service Limits
- Maximum 10 tags per object
- Tag key length: 128 Unicode characters maximum
- Tag value length: 256 Unicode characters maximum
- No nested or hierarchical tag structures

### Performance Considerations
- Tagging adds minimal overhead to object creation operations
- Large numbers of tags may slightly impact performance
- Consider batching operations when possible

### Compatibility
- Feature requires S3A connector version with tagging support
- Some older Hadoop versions may not support all tagging features
- Verify compatibility with your specific Hadoop distribution

## Troubleshooting

### Common Issues

1. **Tag Validation Errors**
   - Ensure tag keys and values meet S3 character requirements
   - Check for duplicate tag keys
   - Verify tag count doesn't exceed 10 per object

2. **Permission Issues**
   - Ensure IAM permissions include `s3:PutObjectTagging` and `s3:GetObjectTagging`
   - Verify bucket policies allow tagging operations

3. **Configuration Problems**
   - Check property syntax and formatting
   - Ensure configuration properties are properly set in Hadoop configuration files

### Debug Commands

```bash
# Verify object tags using AWS CLI
aws s3api get-object-tagging --bucket my-bucket --key path/to/file.txt

# list-objects-v2 does not return tags: fetch each key's tags and keep the matches
aws s3api list-objects-v2 --bucket my-bucket --query 'Contents[].Key' --output text | tr '\t' '\n' |
while read -r key; do
  aws s3api get-object-tagging --bucket my-bucket --key "$key" --output text \
    --query "length(TagSet[?Key=='department' && Value=='finance'])" | grep -qx 1 && echo "$key"
done
```
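
When the output of `get-object-tagging` is captured in a script, the check for a specific tag can be done on the JSON directly. A minimal Python sketch follows; `has_tag` is a hypothetical helper, and the sample string only imitates the `TagSet` structure the command returns.

```python
import json


def has_tag(response_json, key, value):
    """Return True if a get-object-tagging JSON response carries key=value."""
    tag_set = json.loads(response_json).get("TagSet", [])
    # Comparison is exact because tag keys and values are case-sensitive.
    return any(t["Key"] == key and t["Value"] == value for t in tag_set)


sample = '{"TagSet": [{"Key": "department", "Value": "finance"}]}'
print(has_tag(sample, "department", "finance"))
```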

## Contributing

To contribute to this feature or report issues:

1. Check the [JIRA issue](https://issues.apache.org/jira/browse/HADOOP-19536) for current status
2. Follow Hadoop contribution guidelines
3. Submit patches through the Apache Hadoop review process
4. Include comprehensive tests for any new functionality

## References

- [Amazon S3 Object Tagging Documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-tagging.html)
- [S3 Lifecycle Configuration](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html)
- [IAM Policies with S3 Tags](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-tagging-managing.html)
- [Hadoop S3A Documentation](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html)
