Telemetry


As applications grow in complexity, the amount of data generated by event and error handling code can be overwhelming. This is where telemetry comes into play. Telemetry is the process of gathering, from a remote system, the information collected by instrumentation, in order to understand how the application is behaving against its service level agreements (SLAs) and to guide future decisions on resource planning.

Telemetry provides the ability to collect and highlight operational events, reducing management costs and giving useful insights into application behavior. This data can be used to detect performance issues and errors quickly, classify an issue to understand its nature, recover from an incident and return the application to full operation, diagnose the root cause of the problem, and prevent it from recurring.
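A common way to make operational events easy to detect and classify is to emit them in a structured, machine-readable form rather than as free text. The following is a minimal sketch of that idea; the JSON-lines format and the field names (`severity`, `category`, and so on) are illustrative assumptions, not a fixed schema.

```python
import json
import time

def emit_event(name, severity, **fields):
    """Emit one structured telemetry event as a JSON line.

    The field names used here (severity, category, retryable) are
    illustrative; real systems define their own event schema.
    """
    event = {"timestamp": time.time(), "event": name,
             "severity": severity, **fields}
    print(json.dumps(event))  # in practice this goes to a telemetry sink
    return event

# Classifying an error at emission time lets the monitoring system
# filter and aggregate incidents by nature (transient vs. permanent).
e = emit_event("db_connection_failed", "error",
               category="transient", retryable=True)
```

Because each event carries its own classification fields, downstream tooling can count, filter, and alert on them without parsing log text.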

To comprehensively measure application performance, monitor availability, and isolate faults, you must identify the combination of information to collect from built-in system monitoring features and from instrumentation such as logs and performance counters. Collect only the information that will actually be used, to avoid unnecessary data gathering. Additionally, telemetry should be applied to test and staged versions of the application during development, both to measure and validate performance and to confirm that the instrumentation and telemetry systems are operating correctly.

Telemetry can be used to log all calls to external services, including the context, destination, method, timing, and result of each call. This data can also be useful in supporting reports of SLA violations from users of the application, or when challenging hosting providers regarding failures of their services.
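One way to capture the destination, method, timing, and result of every external call is to route the calls through a small wrapper. This is a minimal sketch; the `timed_call` helper, its parameters, and the log record layout are assumptions for illustration.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("telemetry")

def timed_call(destination, method, func, *args, **kwargs):
    """Invoke func and log destination, method, outcome, and duration.

    The record layout below is illustrative, not a fixed schema.
    """
    start = time.perf_counter()
    outcome = "failure"
    try:
        result = func(*args, **kwargs)
        outcome = "success"
        return result
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        log.info("call dest=%s method=%s outcome=%s elapsed_ms=%.1f",
                 destination, method, outcome, elapsed_ms)

# Example: wrap a (hypothetical) call to a hosted service.
status = timed_call("payments.example.com", "GET /status", lambda: 200)
```

Because the wrapper logs in the `finally` clause, timing and outcome are recorded even when the call raises, which is exactly the data needed to substantiate an SLA dispute.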

When deciding how to store telemetry data, determine whether to collect the data in each data center and combine the results in the monitoring system, or to centralize the data storage in one data center. Passing data between data centers incurs additional cost, though this may be balanced by the savings of downloading only one dataset.

To prevent the loss of data, it is essential to include code to retry connections that may encounter transient errors. The retry logic must be intelligent so that repeated failures are detected and the process is abandoned after a preset number of attempts. The number of retries should be logged to help detect inherent or developing issues. Variable retry intervals should be used to minimize the chance that retry logic overloads a target system that is just recovering from a transient error when many queued retry attempts are in the pipeline.
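The retry behavior described above — a preset attempt limit, a recorded attempt count, and variable intervals — can be sketched as follows. The helper name, its parameters, and the use of jittered exponential backoff as the "variable interval" strategy are assumptions for illustration.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1):
    """Retry an operation that may hit transient errors.

    Abandons the process after max_attempts and returns the attempt
    count so it can be logged. Delays grow exponentially with random
    jitter so that many queued retries do not arrive in lockstep at a
    target system that is still recovering.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise  # abandoned after the preset number of attempts
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# Simulate an operation that fails twice with a transient error,
# then succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result, attempts = retry_with_backoff(flaky, base_delay=0.01)
```

Returning the attempt count alongside the result makes it easy to feed into the telemetry pipeline, so that a gradually rising retry rate surfaces as an early warning of a developing issue.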

In conclusion, telemetry plays a crucial role in managing complex applications. It provides the necessary insights to detect issues, diagnose problems, and make informed decisions on resource planning. By understanding the data that needs to be collected and how to store it, organizations can maximize the benefits of telemetry to ensure the smooth running of their applications.