Commit 2ba31bd: Merge pull request #518 from gallettilance/docs-user-data-collection (LCORE-427: (docs) user data collection)

1 file changed: docs/user_data_collection.md (+224, -0)
# Lightspeed Stack user data collection

## Overview

This document describes how user interactions and system responses are captured in the Lightspeed Core Stack service. Understanding this process helps when tuning the system for better responses and outcomes.

## Components

### Lightspeed Core Stack

- Every user interaction results in its transcript being stored as a JSON file on the local disk.
- When a user provides feedback (whether the LLM answer was satisfactory or not), the data is posted to the `/feedback` endpoint. This action also results in the creation of a JSON file.
- Both transcripts and feedback are stored in configurable local directories under unique filenames.
### Data Export Integration

- The Lightspeed Core Stack integrates with the [lightspeed-to-dataverse-exporter](https://github.com/lightspeed-core/lightspeed-to-dataverse-exporter) service to automatically export user interaction data to Red Hat's Dataverse for analysis.
- The exporter service acts as a sidecar that periodically scans the configured data directories for new JSON files (transcripts and feedback).
- It packages this data into archives and uploads them to the appropriate ingress endpoints.

### Red Hat Dataverse Integration

- The exporter service uploads data to Red Hat's Dataverse for analysis and system improvement.
- Data flows through the same pipeline as other Red Hat services for consistent processing and analysis.
## Configuration

User data collection is configured in the `user_data_collection` section of the configuration file:

```yaml
user_data_collection:
  feedback_enabled: true
  feedback_storage: "/tmp/data/feedback"
  transcripts_enabled: true
  transcripts_storage: "/tmp/data/transcripts"
  data_collector:
    enabled: false
    ingress_server_url: null
    ingress_server_auth_token: null
    ingress_content_service_name: null
    collection_interval: 7200  # 2 hours in seconds
    cleanup_after_send: true
    connection_timeout_seconds: 30
```
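This section maps naturally onto a typed settings object. The following is a minimal sketch using standard-library dataclasses; the class names and loader shown here are illustrative, not part of the Lightspeed Stack codebase, and the defaults simply mirror the documented values:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical models mirroring the `user_data_collection` config section.
@dataclass
class DataCollectorConfig:
    enabled: bool = False
    ingress_server_url: Optional[str] = None
    ingress_server_auth_token: Optional[str] = None
    ingress_content_service_name: Optional[str] = None
    collection_interval: int = 7200  # seconds between collection cycles
    cleanup_after_send: bool = True
    connection_timeout_seconds: int = 30

@dataclass
class UserDataCollectionConfig:
    feedback_enabled: bool = True
    feedback_storage: str = "/tmp/data/feedback"
    transcripts_enabled: bool = True
    transcripts_storage: str = "/tmp/data/transcripts"
    data_collector: DataCollectorConfig = field(default_factory=DataCollectorConfig)

# Build the settings from a dict as it would be parsed out of the YAML file;
# keys that are omitted fall back to the defaults above.
raw = {"feedback_enabled": True, "data_collector": {"enabled": True, "collection_interval": 60}}
cfg = UserDataCollectionConfig(
    **{k: v for k, v in raw.items() if k != "data_collector"},
    data_collector=DataCollectorConfig(**raw.get("data_collector", {})),
)
print(cfg.data_collector.collection_interval)  # → 60
```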
### Configuration Options

#### Basic Data Collection

- `feedback_enabled`: Enable/disable collection of user feedback data
- `feedback_storage`: Directory path where feedback JSON files are stored
- `transcripts_enabled`: Enable/disable collection of conversation transcripts
- `transcripts_storage`: Directory path where transcript JSON files are stored

#### Data Collector Service (Advanced)

- `enabled`: Enable/disable the data collector service that uploads data to the ingress server
- `ingress_server_url`: URL of the ingress server for data upload
- `ingress_server_auth_token`: Authentication token for the ingress server
- `ingress_content_service_name`: Service name identifier for the ingress server
- `collection_interval`: Interval in seconds between data collection cycles (default: 7200 = 2 hours)
- `cleanup_after_send`: Whether to delete local files after a successful upload (default: true)
- `connection_timeout_seconds`: Timeout in seconds for connections to the ingress server (default: 30)
## Data Storage

### Feedback Data

Feedback data is stored as JSON files in the configured `feedback_storage` directory. Each file contains:

```json
{
  "user_id": "user-uuid",
  "timestamp": "2024-01-01T12:00:00Z",
  "conversation_id": "conversation-uuid",
  "user_question": "What is Kubernetes?",
  "llm_response": "Kubernetes is an open-source container orchestration system...",
  "sentiment": 1,
  "user_feedback": "This response was very helpful",
  "categories": ["helpful"]
}
```
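Persisting such a record under a collision-free filename can be sketched as follows. This is an illustration, not the service's actual implementation; the helper name is made up, and the directory is the documented example path:

```python
import json
import uuid
from pathlib import Path

def store_feedback(record: dict, storage_dir: str) -> Path:
    """Persist one feedback record as a uniquely named JSON file."""
    target = Path(storage_dir)
    target.mkdir(parents=True, exist_ok=True)
    # A UUID4 filename avoids collisions between concurrent requests.
    path = target / f"{uuid.uuid4()}.json"
    path.write_text(json.dumps(record, indent=2))
    return path

written = store_feedback(
    {"conversation_id": "conversation-uuid", "sentiment": 1},
    "/tmp/data/feedback",
)
print(written)
```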
### Transcript Data

Transcript data is stored as JSON files in the configured `transcripts_storage` directory, organized by user and conversation:

```
/transcripts_storage/
  /{user_id}/
    /{conversation_id}/
      /{unique_id}.json
```

Each transcript file contains:

```json
{
  "metadata": {
    "provider": "openai",
    "model": "gpt-4",
    "query_provider": "openai",
    "query_model": "gpt-4",
    "user_id": "user-uuid",
    "conversation_id": "conversation-uuid",
    "timestamp": "2024-01-01T12:00:00Z"
  },
  "redacted_query": "What is Kubernetes?",
  "query_is_valid": true,
  "llm_response": "Kubernetes is an open-source container orchestration system...",
  "rag_chunks": [],
  "truncated": false,
  "attachments": []
}
```
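Building a file path that follows this layout can be sketched in a few lines. The function name is hypothetical, not the project's API, and serves only to make the nesting concrete:

```python
import uuid
from pathlib import Path

def transcript_path(storage_dir: str, user_id: str, conversation_id: str) -> Path:
    """Return a unique transcript file path following the documented layout."""
    # Nest by user, then by conversation, creating directories as needed.
    directory = Path(storage_dir) / user_id / conversation_id
    directory.mkdir(parents=True, exist_ok=True)
    return directory / f"{uuid.uuid4()}.json"

p = transcript_path("/tmp/data/transcripts", "user-uuid", "conversation-uuid")
print(p.parent)  # → /tmp/data/transcripts/user-uuid/conversation-uuid
```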
## Data Flow

1. **User Interaction**: The user submits a query to the `/query` or `/streaming_query` endpoint
2. **Transcript Storage**: If transcripts are enabled, the interaction is stored as a JSON file
3. **Feedback Collection**: The user can submit feedback via the `/feedback` endpoint
4. **Feedback Storage**: If feedback is enabled, the feedback is stored as a JSON file
5. **Data Export**: The exporter service (if enabled) periodically scans for new files and uploads them to the ingress server
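The export step amounts to a periodic scan, package, upload, cleanup cycle. The sketch below illustrates one such cycle under stated assumptions: it is not the exporter's actual code, and `upload` stands in for the real HTTP call to the ingress server:

```python
import tarfile
import time
from pathlib import Path

def export_cycle(data_dir: str, archive_dir: str, upload, cleanup_after_send: bool = True) -> int:
    """Package all pending JSON files into one archive, upload it, optionally delete originals."""
    files = sorted(Path(data_dir).rglob("*.json"))
    if not files:
        return 0
    archive = Path(archive_dir) / f"export-{int(time.time())}.tar.gz"
    archive.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "w:gz") as tar:
        for f in files:
            # Keep the user/conversation structure inside the archive.
            tar.add(f, arcname=str(f.relative_to(data_dir)))
    upload(archive)  # stand-in for a POST to ingress_server_url with the auth token
    if cleanup_after_send:
        for f in files:
            f.unlink()
    return len(files)

# Demo on a throwaway directory.
import tempfile
data = tempfile.mkdtemp()
Path(data, "a.json").write_text("{}")
sent = []
n = export_cycle(data, tempfile.mkdtemp(), upload=sent.append)
print(n)  # → 1
```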
## How to Test Locally

### Basic Data Collection Testing

1. **Enable data collection** in your `lightspeed-stack.yaml`:

   ```yaml
   user_data_collection:
     feedback_enabled: true
     feedback_storage: "/tmp/data/feedback"
     transcripts_enabled: true
     transcripts_storage: "/tmp/data/transcripts"
   ```

2. **Start the Lightspeed Core Stack**:

   ```bash
   python -m src.app.main
   ```

3. **Submit a query** to generate transcript data:

   ```bash
   curl -X POST "http://localhost:8080/query" \
     -H "Content-Type: application/json" \
     -d '{
       "query": "What is Kubernetes?",
       "provider": "openai",
       "model": "gpt-4"
     }'
   ```

4. **Submit feedback** to generate feedback data:

   ```bash
   curl -X POST "http://localhost:8080/feedback" \
     -H "Content-Type: application/json" \
     -d '{
       "conversation_id": "your-conversation-id",
       "user_question": "What is Kubernetes?",
       "llm_response": "Kubernetes is...",
       "sentiment": 1,
       "user_feedback": "Very helpful response"
     }'
   ```

5. **Check stored data**:

   ```bash
   ls -la /tmp/data/feedback/
   ls -la /tmp/data/transcripts/
   ```
### Advanced Data Collector Testing

1. **Enable the data collector** in your configuration:

   ```yaml
   user_data_collection:
     feedback_enabled: true
     feedback_storage: "/tmp/data/feedback"
     transcripts_enabled: true
     transcripts_storage: "/tmp/data/transcripts"
     data_collector:
       enabled: true
       ingress_server_url: "https://your-ingress-server.com/upload"
       ingress_server_auth_token: "your-auth-token"
       ingress_content_service_name: "lightspeed-stack"
       collection_interval: 60  # 1 minute for testing
       cleanup_after_send: true
       connection_timeout_seconds: 30
   ```

2. **Deploy the exporter service**, pointing it at the same data directories

3. **Monitor the data collection** by checking the logs and verifying that files are being uploaded and cleaned up
## Security Considerations

- **Data Privacy**: All user data is stored locally and can be configured to be deleted after a successful upload
- **Authentication**: The data collector service authenticates its uploads to the ingress server with a token
- **Data Redaction**: Queries are redacted before storage (the `redacted_query` field) so that sensitive information is not captured
- **Access Control**: Data directories should be secured with appropriate file permissions
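Locking down the storage directories can be done ahead of time. A minimal sketch, assuming a POSIX filesystem and the documented example paths:

```python
import os
import stat
from pathlib import Path

# Create the storage directories readable and writable by the service user only.
for directory in ("/tmp/data/feedback", "/tmp/data/transcripts"):
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    path.chmod(0o700)  # rwx for owner, nothing for group/other
    mode = stat.S_IMODE(os.stat(path).st_mode)
    print(f"{directory}: {oct(mode)}")
```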
## Troubleshooting

### Common Issues

1. **Data not being stored**: Check that the storage directories exist and are writable
2. **Data collector not uploading**: Verify the ingress server URL and authentication token
3. **Permission errors**: Ensure the service has write permissions on the configured directories
4. **Connection timeouts**: Adjust the `connection_timeout_seconds` setting if needed
### Logging

Enable debug logging to troubleshoot data collection issues:

```yaml
service:
  log_level: debug
```

This will provide detailed information about data collection, storage, and upload processes.
## Integration with Red Hat Dataverse

For production deployments, the Lightspeed Core Stack integrates with Red Hat's Dataverse through the exporter service. This provides:

- Centralized data collection and analysis
- A consistent data processing pipeline
- Integration with other Red Hat services
- Automated data export and cleanup

For complete integration setup, deployment options, and configuration details, see the [exporter repository](https://github.com/lightspeed-core/lightspeed-to-dataverse-exporter).
