Chapters · #12

Collecting Unstructured Data

Name: Chapter 12: Collecting Unstructured Data - The Book - DEML Platform
Author: Joe Alongi

Reading Progress32%

Chapter 12: Collecting Unstructured Data

Quantitative telemetry—the rigid, structured arrays of HTTP status codes, latency percentiles, and database transaction times—is incredibly efficient at identifying exactly what failed within a distributed system. However, this numerical data is inherently sterile. It lacks the critical context necessary to understand how the failure actually impacted the human beings relying on the platform. A microsecond latency spike might be a statistical anomaly to a server, but it could manifest as a devastating workflow interruption for an end-user. To achieve a truly holistic view of my operational reality, I must capture human experiences in bits and bytes. This requires me to transcend traditional data engineering and venture into the realm of unstructured data collection and natural language processing.

The challenge lies in the sheer entropy of human communication. Support tickets, user feedback forms, and public social media complaints are chaotic, unstructured, and notoriously difficult to parse programmatically. To bridge the gap between this qualitative feedback and my quantitative telemetry pipelines, I have architected an advanced AI enrichment pipeline utilizing native asynchronous API calls to Google Gemini.

Rather than forcing human operators to manually read, categorize, and correlate every user complaint against backend logs, I constructed an autonomous, stateful AI agent workflow built entirely from scratch using aiohttp. When a natural-language complaint is submitted via the frontend, the payload is immediately routed into this intelligent pipeline. The agent first invokes Google Gemini via its native REST API, leveraging the Large Language Model's (LLM) sophisticated reasoning capabilities to parse the unstructured text, identify the user's underlying intent, and extract critical technical entities (such as browser type, specific error messages, or the feature being accessed).

Crucially, the agent does not operate in isolation. Through strict programmatic tool-calling defined in the native REST payload, the agent is granted access to my historical telemetry APIs. Once the user's complaint is analyzed, the agent autonomously queries the ClickHouse analytical database, fetching the exact server metrics, error logs, and performance traces that occurred during the precise time window of the user's reported issue.

With both the human narrative and the raw machine telemetry loaded into its context window, Gemini executes a complex comparative analysis. It identifies correlations between the qualitative complaint (e.g., "The dashboard wouldn't load my data") and the quantitative reality (e.g., an underlying HTTP 504 Gateway Timeout resulting from a database lock). The agent synthesizes this correlation, determining a highly probable root cause.

Finally, in a masterful demonstration of data transformation, the agent converts this nuanced analysis into a rigidly structured JSON payload. This artifact—containing the synthesized root cause, the extracted entities, and the severity classification—is published directly back onto my Redpanda event bus. By utilizing LLMs not just as chatbots, but as intelligent data transformers, I seamlessly integrate the chaotic reality of human feedback directly into my deterministic data engineering pipelines, unlocking an unprecedented level of operational intelligence.