
This is a flexible and scalable framework designed for comprehensive topic modeling and analysis. It is ideal for tracking topic evolution over time, making it suitable for a variety of document types and research needs.


DataPulse: Navigating the Information Superhighway


Pulse Core

In the sprawling web of interconnected data, where digital constructs form the backbone of a vast, electrified network, DataPulse emerges as the ultimate conduit for topic profiling and analysis, weaving torrents of information into structured, discernible patterns. This isn’t just about parsing text—it’s about decoding the intricate lattice of the digital realm, orchestrating a symphony of topics across dimensions, timelines, and virtual domains.

DataPulse acts as a sentinel of insight, navigating the information superhighway to synchronize and illuminate the evolving threads of language and meaning embedded in the written word. It transforms the raw architecture of digital constructs into a dynamic grid of connections, revealing thematic shifts, persistent patterns, and emerging trends.

Designed for multi-dimensional topic analysis, DataPulse doesn’t merely process data—it pulses with the rhythm of an interconnected network, alive with the capacity to capture, track, and synchronize the threads that weave the fabric of digital information. Each session maps a spectrum of ideas, profiling topics and aligning them like streams of light in a neon-etched cityscape, unveiling the order within the chaos of the data-driven world.

The Foundry: Forging DataPulse

DataPulse didn’t emerge fully formed from the void—it was forged in the crucible of human ingenuity and tempered by the precision of an AI assistant. Yes, a synthetic partner contributed, acting as a co-pilot on the information superhighway, but its existence owes as much to the console cowboy’s own chrome, grit, and midnight oil as it does to any algorithm.

Countless hours were spent weaving through the intricate threads of the digital lattice, hunting rogue commas, taming chaotic logic, and restoring harmony to variables distorted by the AI’s “creative” impulses. Every line of code carries the mark of meticulous revision, the persistence of a creator who refused to settle for anything less than precision.

But this journey demanded more than machine assistance. The console cowboy dove deep into cryptic forums, unearthed obscure documentation, and pieced together fragments of knowledge buried in the recesses of the web. This is the lifeblood of DataPulse: a fusion of human tenacity and machine precision, navigating the vast network of data to transform raw chaos into structured, multi-dimensional insight.

So while an AI may have lent its synthetic hand, know that DataPulse pulses with the unmistakable mark of its creator—the sleepless nights, the grit, and the unwavering commitment that no machine could replicate.

Data Nexus Features

  • Adaptive Resource Management: DataPulse harnesses the formidable power of Dask for distributed parallelization. This ensures a seamless orchestration of resources across processors, dynamically adjusting to tackle vast data landscapes and high computational demands without skipping a beat. The system adapts, self-modulates, and optimizes, deploying cores and threads in perfect synchrony to handle even the heaviest data streams with precision.

  • Multi-Phase Topic Analysis: Far from the confines of linear processing, DataPulse performs a tri-phased exploration—train, validation, and test—that keeps models pristine and refined. By treating each phase as a unique dataset, it preserves the sanctity of unbiased learning, diving deep into intricate data patterns. Each model builds upon an evolving dictionary of terms, maintaining distinct corpora for each phase to deliver a thorough, multi-dimensional perspective.

  • Diachronic Topic Tracking (planned; queued in the network, awaiting greenlight): DataPulse traverses time itself, tracking the shifts in language and evolving terminologies. Users can trace topics across years, even decades, capturing emergent themes and the twilight of others. By mapping how concepts morph, persist, or disappear over time, it uncovers the narrative threads running through historical and modern text alike.

  • Precision Metrics: With coherence, convergence, and perplexity metrics in hand, DataPulse doesn’t leave quality to chance. Each metric is tuned with algorithmic precision, fine-tuned across myriad parameters to capture relevance, thematic clarity, and linguistic structure. A spectrum of scoring metrics ensures that every model reflects a refined, accurate portrayal of the data’s hidden dimensions.
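To ground the metrics above, here is a minimal sketch of phase-aware scoring with Gensim. The tiny corpus, topic count, and coherence measure are illustrative placeholders, not DataPulse's internal API.

```python
# Minimal sketch: train on one phase, score coherence and perplexity on a held-out phase.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

train_texts = [["grid", "signal", "topic"], ["pulse", "network", "data"], ["signal", "pulse", "grid"]]
test_texts = [["signal", "topic", "network"]]

dictionary = Dictionary(train_texts)                         # evolving dictionary of terms
train_corpus = [dictionary.doc2bow(t) for t in train_texts]  # distinct corpus per phase
test_corpus = [dictionary.doc2bow(t) for t in test_texts]

lda = LdaModel(corpus=train_corpus, id2word=dictionary, num_topics=3, passes=10)

coherence = CoherenceModel(model=lda, corpus=test_corpus, dictionary=dictionary,
                           coherence="u_mass").get_coherence()
perplexity_bound = lda.log_perplexity(test_corpus)           # per-word likelihood bound
print(f"coherence={coherence:.3f}, log_perplexity_bound={perplexity_bound:.3f}")
```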

Signal Mapping Console

Visualization in DataPulse is an immersive experience, pushing the boundaries of interaction in the digital realm. Each visualization is a portal into the unseen, rendering complex datasets into intuitively graspable maps. Bokeh and pyLDAvis power the platform’s visual dimensions, creating an environment where data doesn’t just speak—it resonates.

  • 2D and 3D Topic Mapping: DataPulse brings your data into vivid relief, visualizing topics in two or three dimensions, allowing you to explore the intricate networks of ideas that link one document to another. It’s not just about seeing data; it’s about inhabiting it.

  • Temporal Topic Flow: As topics shift and reform across timelines, DataPulse captures this dynamic evolution, letting you witness how language trends and persists. It becomes a chronicle of change, a digital archive of thought made manifest in visual form.

  • Interactive Model Visualization: With DataPulse, you don’t just view models—you engage with them. Each visualization offers an interactive portal, inviting you to dissect topics and understand the underlying themes, creating a space where exploration leads to revelation.
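As one concrete example of that interactive portal, a trained Gensim model can be rendered to an explorable HTML view with pyLDAvis along these lines; the toy corpus below is a placeholder.

```python
# Sketch: render an interactive topic visualization with pyLDAvis (toy data for illustration).
import pyLDAvis
import pyLDAvis.gensim_models
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["grid", "signal", "topic"], ["pulse", "network", "data"], ["signal", "network", "pulse"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=5)

panel = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(panel, "topic_map.html")  # open in a browser and dissect topics interactively
```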

Spectral Analysis Suite

DataPulse is more than just a machine learning engine; it’s a digital mind, configured to dissect, explore, and evolve through data. Its machine learning core thrives on advanced algorithms that go beyond simple clustering, instead capturing the full spectrum of thematic evolution. Using Gensim’s LDA (Latent Dirichlet Allocation) model, DataPulse delivers an analysis that is not only multi-layered but dynamically optimized.

  • Hyperparameter Tuning & Adaptive Model Selection: DataPulse applies a rigorous methodology to find the most resonant model configurations. Hyperparameters are fine-tuned in a ceaseless pursuit of coherence and perplexity optimization, ensuring models yield insights of the highest clarity and relevance. (A sketch of such a sweep follows this list.)

  • Dynamic Topic Allocation: The architecture of DataPulse allows it to shift and recalibrate in real time, making dynamic adjustments that tailor-fit each data structure. This adaptability enables DataPulse to capture even the most nuanced patterns, providing a level of analytical depth that traditional models simply cannot achieve.

  • High-Speed Convergence Tracking: Speed is of the essence. DataPulse's convergence tracking allows it to rapidly navigate through the topic space, minimizing computational delays while maximizing insight—a neural engine that never sleeps.
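The sweep referenced above can be pictured roughly as follows: iterate over candidate topic counts, score each model, and keep the most coherent. The range mirrors the CLI's --start_topics/--end_topics/--step_size arguments; the data and scoring choice are placeholders, not DataPulse's exact selection logic.

```python
# Illustrative hyperparameter sweep over topic counts; keeps the most coherent model.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["grid", "signal", "topic"], ["pulse", "network", "data"], ["signal", "network", "pulse"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

best_score, best_k = float("-inf"), None
for k in range(2, 8, 2):  # start_topics, end_topics, step_size
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   alpha="auto", eta="auto", passes=10)
    score = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                           coherence="u_mass").get_coherence()
    if score > best_score:
        best_score, best_k = score, k
print(f"best coherence {best_score:.3f} at num_topics={best_k}")
```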

Turbocharged GPU Performance: Boosted Compute Power

DataPulse leverages GPU acceleration for efficient processing of large datasets, using the following tools:

  • CUDA: CUDA is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows software developers to use NVIDIA GPUs for general-purpose processing (an approach known as GPGPU). CUDA is necessary for leveraging GPU hardware and enabling acceleration for both PyTorch and CuPy.

  • PyTorch: PyTorch is used for deep learning and tensor operations, enabling significant speed improvements in training and evaluation processes by utilizing GPU acceleration. The calculations related to coherence metrics, such as cosine similarity, are performed using PyTorch tensors, which can be processed much faster on a GPU compared to a CPU.

  • CuPy: CuPy provides an interface similar to NumPy, but all array computations are executed on the GPU, resulting in considerable speed improvements for numerical calculations. This project uses CuPy to accelerate matrix operations and other numerical tasks that are computationally intensive.

By using GPU-accelerated libraries like PyTorch and CuPy, this project achieves significant performance gains compared to CPU-only execution. Users should pay close attention to the compatibility between the versions of CUDA, PyTorch, and their GPU drivers to fully utilize the GPU's capabilities and avoid runtime errors.
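As a rough sketch of what that offloading looks like in practice, the snippet below computes row-wise cosine similarities with PyTorch tensors and a matrix product with CuPy, falling back to CPU where no CUDA device is present. It illustrates the approach, not DataPulse's exact internals.

```python
# Sketch: GPU-backed numerics with a CPU fallback (illustrative, not DataPulse's exact code).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Cosine similarity between rows of two tensors, as used in coherence-style metrics.
a = torch.rand(1000, 300, device=device)
b = torch.rand(1000, 300, device=device)
cos = torch.nn.functional.cosine_similarity(a, b, dim=1)
print(cos.mean().item())

# CuPy mirrors NumPy's API but executes on the GPU; it requires a CUDA-capable device.
try:
    import cupy as cp
    m = cp.random.rand(2048, 2048)
    print(float((m @ m.T).sum()))  # matrix multiply runs on the GPU
except ImportError:
    pass  # CPU-only installs simply skip this path
```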


Synced Network Grid

This project includes a custom distributed.yaml file for configuring Dask. The file is located in the config/ directory and contains recommended settings for Dask performance and resource management, tailored to DataPulse's processing requirements.

By default, the settings in distributed.yaml are optimized for high-performance processing on systems with significant CPU and memory resources. Adjust them as needed to suit your environment.

To ensure your Dask environment is correctly configured, follow these steps:

  1. Review the distributed.yaml File
    Examine the config/distributed.yaml file to understand its settings, especially if you need to adjust resource limits based on your system’s specifications.

  2. Customize if Necessary
    Depending on your hardware and workload, you may want to customize certain values (e.g., memory limits, CPU thresholds) in the distributed.yaml file.

  3. Refer to Setup Instructions
    For more detailed instructions on configuring the Dask dashboard and securing it for local access, see the Dask_Dashboard_Setup_Instructions.txt file in the config/ directory.
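For reference, the same settings can be inspected and temporarily overridden from Python without editing the file. The keys below are standard Dask worker-memory settings, shown purely as examples.

```python
# Sketch: reading and overriding Dask config values
# (assumes config/distributed.yaml has been placed on Dask's config search path).
import dask
import dask.distributed  # importing distributed registers the distributed.* config namespace

print(dask.config.get("distributed.worker.memory.target"))  # fraction at which workers start managing memory
print(dask.config.get("distributed.worker.memory.spill"))   # fraction at which workers spill to disk

# One-off override for a single run, leaving distributed.yaml untouched.
with dask.config.set({"distributed.worker.memory.target": 0.75}):
    pass  # build your cluster and client inside this context
```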

Parallel Loadout: Batch Processing Specs

Configuring futures_batches, base_batch_size, and max_batch_size is critical to balancing resource utilization and achieving efficient processing times, especially on high-performance systems. The script batch_estimation.py is provided for adaptive batch size estimation based on document complexity, memory, and CPU limits. This script is recommended for anyone running DataPulse on datasets with varying document sizes or on systems with constrained resources.

Guidelines for Setting Key Batch Size Parameters

  1. Pulse Width: Setting the Batch Frequency

    • Base Batch Size: The base batch size defines the initial slice of data processed, the foundational rhythm of DataPulse. Set this too low, and you’ll see latency spikes—an ocean of tasks endlessly queued up, choking the system’s flow. Crank it too high, and memory consumption becomes an insatiable beast, starving the rest of the network grid. When facing dense archives—like fifteen years of CDC chronicles—boost the base batch to handle the swell. For smaller echoes, dial it back to quicken the pace. The secret? Tuning the base size until each cycle pulses in synchrony with the system’s limits.

    • Max Batch Size: This is your safety rail. It’s the limiter that prevents DataPulse from diving too deep, overwhelming the capacitors, and triggering unwanted restarts. Adaptive batching recalculates these limits on-the-fly, sensing how heavy the load is and pushing just enough data through the pipes to maximize flow without throttling the system.

  2. Voltage Control: Balancing Load with System Capacity

    The future doesn’t wait, and neither should your data streams. Let batch_estimation.py tap into the complexity matrix of your document streams, weigh them against available system bandwidth, and precisely tune your batch size. System capacity varies—let the estimation script be your stabilizer, dynamically regulating futures_batches, base_batch_size, and max_batch_size to minimize system spikes.

    Example Usage of batch_estimation.py:

   ```python
   from batch_estimation import estimate_futures_batches

   optimal_batch_size = estimate_futures_batches(document="path/to/document.json")
   ```
  3. Futures Batches: Amp Up the Throughput

    To blitz through tasks, futures_batches is the lever to pull. More futures? More throughput. But it's a delicate dance—overloading here means excess memory load, risking system crashes. Start with a modest setting (3–10), then inch it up. Ride the edge but don’t fall off—push higher only if your Dask dashboard signals it can handle the extra juice. If the wires start to glow, dial back.

  4. Adaptive Oscillation: Tune in Real Time

    Adaptive batch sizing is the key to staying in tune with an ever-shifting data landscape. With variability inherent to the digital ocean, let DataPulse wield batch_estimation.py to flow with the tide—adjusting to current system capacity to prevent overload. Adaptive oscillation means smoother runs and less downtime.

  5. Dashboard Diagnostics: Iterate and Fine-Tune

    This isn’t a set-it-and-forget-it game. Run the dashboard, stay wired in. Watch how memory fluctuates and tasks get assigned. If your resource meters start peaking—pull back, adjust, recalibrate. This is about staying nimble in the face of a fluctuating workload, fine-tuning for harmony between the task queue and system pulse.

  6. Memory Overdrive: Managing RAM Burn

    DataPulse runs hot—especially on high-volume feeds. Memory is gold here. Set memory_limit parameters in Dask’s LocalCluster to ensure the pulse doesn't burn out.

    • Allocate worker memory mindfully. Stack it too high, and your system thrashes in the red; too low, and you bottleneck.
    • Adaptive scaling is your friend—let workers swell when the data floodgates open, but make sure they recede when tides go down.
  7. Cores, Threads, and the Network Pulse

    Know your hardware—pulse frequency is determined by how many processors you have in the grid. Use num_workers and max_workers to draw out maximum parallelism, but don’t overclock without heat shields. (A LocalCluster sketch follows this list.)

    • High-Core Systems: If you've got a beastly rig, set --num_workers=10, --max_workers=14, --num_threads=2 to maintain an optimal processing cadence.
    • Low-Core Systems: Adjust to --num_workers=4, --max_workers=6, --num_threads=1. Keep it smooth, keep it flowing.
  8. Monitoring the Pulse: Keep the Rhythm Steady

    The Dask dashboard isn't just a tool—it's your eyes on the wire, your ears in the digital hum. Use it, understand it. Let it guide you to the right tweaks. Task lag? Increase concurrency. Memory spill? Lower the batch frequency. This is all about harmonizing DataPulse to your system’s beat, finding that perfect rhythm between capacity and demand.

    See the Dask documentation for details: How to diagnose performance, Diagnostics (local), and Diagnostics (distributed).

    Monitoring Performance
    After configuring batch sizes, use the Dask dashboard to observe task distribution, resource utilization, and memory usage per worker. Adjust batch sizes further if tasks are not distributed evenly or if memory usage approaches system limits.
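Putting items 6 and 7 together, a minimal LocalCluster setup might look like the sketch below. The values mirror the high-core guideline above and should be tuned to your own grid; this is an illustrative configuration, not DataPulse's internal wiring.

```python
# Sketch: worker counts, threads, and per-worker memory caps on a Dask LocalCluster.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=10,           # cf. --num_workers
    threads_per_worker=2,   # cf. --num_threads
    memory_limit="10GB",    # per-worker RAM cap, so the pulse doesn't burn out
)
cluster.adapt(minimum=4, maximum=14)  # swell when the floodgates open, recede when the tide goes down
client = Client(cluster)
print(client.dashboard_link)          # your eyes on the wire
```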


Datafeed Processing: Translating Time-Worn Chronicles

DataPulse's data preprocessing capabilities extend to tracking shifts in language, terminology, and themes across vast timelines. Imagine feeding in The Time Machine’s chronicles—or any temporal data source—and watching DataPulse transform each entry into a structured digital artifact ready for analysis. Each document, whether a journal entry or an archived report, is processed as an isolated data stream, meticulously tokenized and formatted into a bag-of-words model for precision mapping.

By translating raw data into this structured form, DataPulse unlocks the ability to detect recurring patterns and track the life cycle of key themes like "technological singularity," "societal evolution," or "survival imperatives." This process sets the stage for DataPulse's diachronic tracking, allowing it to unearth how ideas emerge, evolve, and fade over time. In preparing each document through this structured approach, DataPulse primes the data for deep, multidimensional analysis, tracing concept arcs and dissecting the storylines embedded in the fabric of historical and futuristic texts alike.
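A minimal sketch of that translation step, using Gensim's Dictionary and doc2bow (the tokens are placeholders; DataPulse's own pipeline performs additional cleaning):

```python
# Sketch: turning isolated, pre-tokenized documents into a bag-of-words corpus.
from gensim.corpora import Dictionary

documents = [
    ["time", "machine", "traveller", "dimensions"],
    ["geometry", "mathematical", "line", "plane"],
]

dictionary = Dictionary(documents)                      # id <-> term mapping
bow_corpus = [dictionary.doc2bow(d) for d in documents]
print(bow_corpus[0])  # [(token_id, count), ...] ready for LDA
```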

Excerpt:

[
 ["date", "ost", "recently", "updated", "arch"],
 ["ime", "convenient", "speak", "epounding", "recondite", "matter", "pale", "grey", "eyes", "shone", "twinkled", "usually", "pale", "face", "flushed", "animated", "fire", "burnt", "brightly", "soft", "radiance", "incandescent", "lights", "lilies", "silver", "caught", "bubbles", "flashed", "passed", "glasses", "chairs", "patents", "embraced", "caressed", "rather", "submitted", "sat", "luurious", "dinner", "atmosphere", "thought", "runs", "gracefully", "free", "trammels", "precision", "put", "way", "marking", "points", "lean", "forefinger", "sat", "lazily", "admired", "earnestness", "new", "thought", "fecundity"],
 ["follow", "carefully", "controvert", "ideas", "almost", "universally", "accepted", "geometry", "instance", "taught", "school", "founded", "misconception"],
 ["rather", "large", "thing", "epect", "begin", "said", "argumentative", "person", "red", "hair"],
 ["mean", "ask", "accept", "reasonable", "ground", "soon", "admit", "much", "need", "know", "course", "mathematical", "line", "line", "thickness", "nil", "real", "eistence", "taught", "mathematical", "plane", "hese", "things", "mere", "abstractions"],
 ["length", "breadth", "thickness", "cube", "real", "eistence"],
 ["object", "said", "course", "solid", "body", "eist", "real", "things"],
 ["people", "think", "wait", "moment", "instantaneous", "cube", "eist"],
 ["cube", "last", "time", "real", "eistence"],
 ["became", "pensive", "learly", "ime", "raveller", "proceeded", "real", "body", "etension", "directions", "readth", "hickness", "uration", "natural", "infirmity", "flesh", "eplain", "moment", "incline", "overlook", "fact", "really", "dimensions", "call", "planes", "pace", "fourth", "however", "tendency", "draw", "unreal", "distinction", "former", "dimensions", "latter", "happens", "consciousness", "moves", "intermittently", "direction", "latter", "beginning", "end", "lives", "ime_raveller"]
]

Wells, H. G. The Time Machine. Release date: October 2, 2004. Most recently updated: March 30, 2021. Project Gutenberg eBook #35. www.gutenberg.org/ebooks/35.

Example CLI Run:

  ```bash
  python DataPulse.py \
     --username "postgres" \
     --password "admin" \
     --database "DataPulse" \
     --corpus_label "time" \
     --data_source "/path/to/your/data/preprocessed-documents/data.json" \
     --start_topics 20 \
     --end_topics 60 \
     --step_size 5 \
     --num_workers 10 \
     --max_workers 12 \
     --num_threads 2 \
     --max_memory 10 \
     --mem_threshold 9 \
     --max_cpu 110 \
     --futures_batches 30 \
     --base_batch_size 200 \
     --max_batch_size 300 \
     --log_dir "/path/to/your/log/" \
     2>"/path/to/your/log/terminal_output.txt"
  ```

DataPulse stands at the intersection of human ingenuity and advanced data analysis, ready to illuminate the spectral layers within the fabric of language. Step into a world where insights come to life, patterns converge, and knowledge flows like an electric current through the digital network. Welcome to the future of multi-dimensional topic analysis.

