AWS recently announced support for a new Apache Flink connector for Prometheus. The new connector, contributed by AWS to the Flink open source project, adds Prometheus and Amazon Managed Service for Prometheus as a new destination for Flink.
In this post, we explain how the new connector works. We also show how you can manage your Prometheus metrics data cardinality by preprocessing raw data with Flink, to build real-time observability with Amazon Managed Service for Prometheus and Amazon Managed Grafana.
Amazon Managed Service for Prometheus is a secure, serverless, scalable, Prometheus-compatible monitoring service. You can use the same open source Prometheus data model and query language that you use today to monitor the performance of your workloads, without having to manage the underlying infrastructure. Flink connectors are software components that move data into and out of an Amazon Managed Service for Apache Flink application. You can use the new connector to send processed data to an Amazon Managed Service for Prometheus destination starting with Flink version 1.19. With Amazon Managed Service for Apache Flink, you can transform and analyze data in real time. There are no servers or clusters to manage, and there is no compute or storage infrastructure to set up.
Observability beyond compute
In an increasingly connected world, the boundary of systems extends beyond compute assets, IT infrastructure, and applications. Distributed assets such as Internet of Things (IoT) devices, connected cars, and end-user media streaming devices are an integral part of business operations in many sectors. The ability to observe every asset of your business is key to detecting potential issues early, improving the experience of your customers, and protecting the profitability of the business.
Metrics and time series
It's helpful to think of observability as three pillars: metrics, logs, and traces. The most relevant pillar for distributed devices, like IoT, is metrics. This is because metrics can capture measurements from sensors or counts of specific events emitted by the device.
Metrics are series of samples of a given measurement taken at specific times. For example, in the case of a connected car, they can be the readings from the electric motor RPM sensor. Metrics are usually represented as time series, or sequences of discrete data points in chronological order. Metric time series are usually associated with dimensions, also called labels or tags, to help with classifying and analyzing the data. In the case of a connected car, labels might be something like the following:
Metric name – For example, "Electric Motor RPM"
Vehicle ID – A unique identifier of the vehicle, like the Vehicle Identification Number (VIN)
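As a minimal sketch of this data model, a single sample of a hypothetical "electric motor RPM" metric for one vehicle could be represented like the following (the field names are illustrative, not a Prometheus API):

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    # One data point of a time series: a value observed at a given time
    timestamp_ms: int
    value: float

@dataclass
class TimeSeries:
    # Labels (metric name plus dimensions such as vehicle ID) identify the
    # series; samples are the chronologically ordered measurements
    labels: dict
    samples: list = field(default_factory=list)

series = TimeSeries(
    labels={"__name__": "electric_motor_rpm", "vehicle_id": "VIN12345"},
)
series.samples.append(Sample(timestamp_ms=1700000000000, value=3200.0))
```

Every distinct combination of labels identifies a separate time series, which is why the number of devices and dimensions drives cardinality.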
Prometheus as a specialized time series database
Prometheus is a popular solution for storing and analyzing metrics. Prometheus defines a standard interface for storing and querying time series. Commonly used in combination with visualization tools like Grafana, Prometheus is optimized for real-time dashboards and real-time alerting.
Often considered primarily for observing compute resources, like containers or applications, Prometheus is actually a specialized time series database that can effectively be used to observe different types of distributed assets, including IoT devices.
Amazon Managed Service for Prometheus is a serverless, Prometheus-compatible monitoring service. See What is Amazon Managed Service for Prometheus? to learn more about Amazon Managed Service for Prometheus.
Effectively processing observability events, at scale
Handling observability data at scale becomes more challenging because of the number of assets and unique metrics, especially when observing massively distributed devices, for the following reasons:
High cardinality – Each device emits multiple metrics or types of events, each to be tracked independently.
High frequency – Devices might emit events very frequently, multiple times per second. This can result in a large volume of raw data. This aspect in particular represents the main difference from observing compute resources, which are usually scraped at longer intervals.
Events arrive at irregular intervals and out of order – Unlike compute assets that are usually scraped at regular intervals, we often see transmission delays or temporarily disconnected devices, which cause events to arrive at irregular intervals. Concurrent events from different devices might follow different paths and arrive at different times.
Lack of contextual information – Devices often transmit over channels with limited bandwidth, such as GPRS or Bluetooth. To optimize communication, events seldom contain contextual information, such as device model or user details. However, this information is required for effective observability.
Deriving metrics from events – Devices often emit particular events when specific things happen. For example, when the car ignition is turned on or off, or when a warning is emitted by the onboard computer. These are not direct metrics. However, counting and measuring the rates of these events are useful metrics that can be inferred from them.
Effectively extracting value from raw events requires processing. Processing might happen on read, when you query the data, or upfront, before storing.
Storing and analyzing raw events
The common approach with observability events, and with metrics in particular, is "storing first." You can simply write the raw metrics into Prometheus. Processing, such as grouping, aggregating, and calculating derived metrics, happens "on query," when data is extracted from Prometheus.
This approach can become particularly inefficient when you're building real-time dashboards or alerting, and your data has very high cardinality or high frequency. As a time series database is continuously queried, a large volume of data is repeatedly extracted from storage and processed. The following diagram illustrates this workflow.
Preprocessing raw observability events
Preprocessing raw events before storing shifts the work left, as illustrated in the following diagram. This increases the efficiency of real-time dashboards and alerts, allowing the solution to scale.
Apache Flink for preprocessing observability events
Preprocessing raw observability events requires a processing engine that allows you to do the following:
Enrich events efficiently, looking up reference data and adding new dimensions to the raw events. For example, adding the car model based on the vehicle ID. Enrichment allows adding new dimensions to the time series, enabling analysis that would otherwise be impossible.
Aggregate raw events over time windows, to reduce frequency. For example, if a car emits an engine temperature measurement every second, you can emit a single sample with the average over 5 seconds. Prometheus can efficiently aggregate frequent samples on read. However, ingesting data with a frequency much higher than what is useful for dashboarding and real-time alerting isn't an efficient use of Prometheus ingestion throughput and storage.
Aggregate raw events over dimensions, to reduce cardinality. For example, aggregating some measurement per car model.
Calculate derived metrics by applying arbitrary logic. For example, counting the number of warning events emitted by each car. This also enables analysis that would otherwise be impossible using only Prometheus and Grafana.
Support event-time semantics, to aggregate over time events from different sources.
Such a preprocessing engine must also be able to scale to the large volume of input raw events, and to process data with low latency, usually subsecond or single-digit seconds, to enable real-time dashboards and alerting. To address these requirements, we see many customers using Flink.
Apache Flink meets these requirements. Flink is a framework and distributed stream processing engine, designed to perform computations at in-memory speed and at scale. Amazon Managed Service for Apache Flink offers a fully managed, serverless experience, allowing you to run your Flink applications without managing infrastructure or clusters.
Amazon Managed Service for Apache Flink can process the ingested raw events. The resulting metrics, with lower cardinality and frequency, and additional dimensions, can be written to Prometheus for more effective visualization and analysis. The following diagram illustrates this workflow.
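To make the first two requirements concrete, here is a minimal, non-Flink sketch of windowed aggregation (in the real solution this runs as Flink windowed operators; the field names are illustrative):

```python
from collections import defaultdict
from statistics import mean

def aggregate(events, window_ms=5000):
    """Group raw samples into (car_model, window) buckets and average them.
    This reduces frequency (one sample per window instead of many) and
    cardinality (one series per car model instead of per vehicle)."""
    buckets = defaultdict(list)
    for e in events:
        # Align each event to the start of its tumbling window
        window_start = e["timestamp_ms"] - e["timestamp_ms"] % window_ms
        buckets[(e["car_model"], window_start)].append(e["value"])
    return {key: mean(values) for key, values in buckets.items()}

raw = [
    {"timestamp_ms": 1000, "car_model": "ModelA", "value": 90.0},
    {"timestamp_ms": 2000, "car_model": "ModelA", "value": 110.0},
    {"timestamp_ms": 6000, "car_model": "ModelA", "value": 80.0},
]
agg = aggregate(raw)
# Three raw samples collapse into two windowed samples for ModelA
```

Flink adds what this sketch lacks: event-time semantics with watermarks to handle late and out-of-order events, and distributed state to run this at scale.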
Integrating Apache Flink and Prometheus
The new Flink Prometheus connector allows Flink applications to seamlessly write preprocessed time series data to Prometheus. No intermediate component is required, and there's no need to implement a custom integration. The connector is designed to scale, using the ability of Flink to scale horizontally, and optimizing the writes to a Prometheus backend using the Remote-Write interface.
Example use case
AnyCompany is a car rental company managing a fleet of hundreds of thousands of hybrid connected vehicles, in multiple regions. Each vehicle continuously transmits measurements from multiple sensors. Each sensor emits a sample every second or more frequently. Vehicles also communicate warning events when something wrong is detected by the onboard computer. The following diagram illustrates the workflow.
AnyCompany is planning to use Amazon Managed Service for Prometheus and Amazon Managed Grafana to visualize vehicle metrics and set up custom alerts.
However, building a real-time dashboard based on the raw data, as transmitted by the vehicles, would be complicated and inefficient. Each vehicle might have hundreds of sensors, each of them resulting in a separate time series to display. Additionally, AnyCompany wants to observe the behavior of different vehicle models. Unfortunately, the events transmitted by the vehicles only contain the VIN. The model can be inferred by looking up (joining) some reference data.
To overcome these challenges, AnyCompany has built a preprocessing stage based on Amazon Managed Service for Apache Flink. This stage has the following capabilities:
Enrich the raw data by adding the vehicle model, looking up reference data based on the vehicle identification.
Reduce the cardinality, aggregating the results per vehicle model, available after the enrichment step.
Reduce the frequency of the raw metrics to reduce write bandwidth, aggregating over time windows of a few seconds.
Calculate derived metrics based on multiple raw metrics. For example, determine whether a vehicle is in motion when either the internal combustion engine or the electric motor is rotating.
The result of preprocessing is more actionable metrics. A dashboard built on these metrics can, for example, help determine whether the last software update, released over-the-air to all vehicles of a particular model in specific regions, is causing issues.
Using the Flink Prometheus connector, the preprocessor application can write directly to Amazon Managed Service for Prometheus, without intermediate components.
Nothing prevents you from also writing raw metrics with full cardinality and frequency to Prometheus, allowing you to drill down to the single vehicle. The Flink Prometheus connector is designed to scale by batching and parallelizing writes.
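As a sketch, the "in motion" rule in the last capability could be expressed like this (field names are illustrative):

```python
def is_in_motion(ic_engine_rpm: float, electric_motor_rpm: float) -> bool:
    # A hybrid vehicle is considered in motion if either power source is rotating
    return ic_engine_rpm > 0 or electric_motor_rpm > 0

def count_in_motion(vehicles) -> int:
    # Derived metric: number of vehicles currently in motion
    return sum(
        1 for v in vehicles
        if is_in_motion(v["ic_engine_rpm"], v["electric_motor_rpm"])
    )

fleet = [
    {"ic_engine_rpm": 0.0, "electric_motor_rpm": 2500.0},  # electric drive
    {"ic_engine_rpm": 1800.0, "electric_motor_rpm": 0.0},  # IC drive
    {"ic_engine_rpm": 0.0, "electric_motor_rpm": 0.0},     # parked
]
```

A count like this cannot be computed with a simple Prometheus query over the raw per-vehicle series without paying the on-read processing cost; computed upfront, it becomes a single low-cardinality metric.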
Solution overview
The following GitHub repository contains a fictional end-to-end example covering this use case. The following diagram illustrates the architecture of this example.
The workflow consists of the following steps:
Vehicles, radio transmission, and ingestion of IoT events have been abstracted away, and replaced by a data generator that produces raw events for 100 thousand fictional vehicles. For simplicity, the data generator is itself an Amazon Managed Service for Apache Flink application.
Raw vehicle events are sent to a stream storage service. In this example, we use Amazon Managed Streaming for Apache Kafka (Amazon MSK).
The core of the system is the preprocessor application, running in Amazon Managed Service for Apache Flink. We dive deeper into the details of the preprocessor in the following sections.
Processed metrics are written directly to the Prometheus backend, in Amazon Managed Service for Prometheus.
Metrics are used to generate real-time dashboards on Amazon Managed Grafana.
The following screenshot shows a sample dashboard.
Raw vehicle events
Each vehicle transmits three metrics approximately every second:
Internal combustion (IC) engine RPM
Electric motor RPM
Number of reported warnings
The raw events are identified by the vehicle ID and the region where the vehicle is located.
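A raw event might look like the following (the exact schema used in the repository may differ; this is an illustrative sketch):

```python
import json

# Hypothetical raw event, as a vehicle might transmit it: the three metrics
# plus the vehicle ID and region that identify the event
raw_event = {
    "vehicle_id": "VIN-000042",
    "region": "eu-west-1",
    "timestamp_ms": 1700000001000,
    "ic_engine_rpm": 1500.0,
    "electric_motor_rpm": 0.0,
    "warnings": 0,
}
encoded = json.dumps(raw_event)   # as it would travel over Kafka
decoded = json.loads(encoded)
```

Note that the event carries no vehicle model; that dimension is added later by the enrichment step.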
Preprocessor application
The following diagram illustrates the logical flow of the preprocessing application running in Amazon Managed Service for Apache Flink.
The workflow consists of the following steps:
Raw events are ingested from Amazon MSK using the Flink Kafka source.
An enrichment operator adds the vehicle model, which is not contained in the raw events. This additional dimension is then used to aggregate the raw events. The resulting metrics have only two dimensions: vehicle model and region.
Raw events are then aggregated over time windows (5 seconds) to reduce frequency. In this example, the aggregation logic also generates a derived metric: the number of vehicles in motion. A new metric can be derived from raw metrics with arbitrary logic. For the sake of the example, a vehicle is considered "in motion" if either the IC engine or the electric motor RPM metric is not zero.
The processed metrics are mapped into the input data structure of the Flink Prometheus connector, which maps directly to the time series records expected by the Prometheus Remote-Write interface. Refer to the connector documentation for more details.
Finally, the metrics are sent to Prometheus using the Flink Prometheus connector. Write authentication, required by Amazon Managed Service for Prometheus, is seamlessly enabled using the Amazon Managed Service for Prometheus request signer provided with the connector. Credentials are automatically derived from the AWS Identity and Access Management (IAM) role of the Amazon Managed Service for Apache Flink application. No additional secret or credential is required.
In the GitHub repository, you can find step-by-step instructions to set up the working example and create the Grafana dashboard.
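Conceptually, each processed metric becomes one Remote-Write time series: a set of labels plus time-ordered samples. The following Python sketch shows that mapping (this is not the connector's Java API; the field names loosely follow the Remote-Write protobuf):

```python
def to_remote_write_series(metric_name, dimensions, samples):
    """Map a processed metric into the label/sample shape used by
    Prometheus Remote-Write. The metric name travels as the reserved
    __name__ label; samples must be in ascending time order."""
    labels = [{"name": "__name__", "value": metric_name}]
    labels += [{"name": k, "value": v} for k, v in sorted(dimensions.items())]
    return {
        "labels": labels,
        "samples": sorted(samples, key=lambda s: s["timestamp_ms"]),
    }

series = to_remote_write_series(
    "vehicles_in_motion",
    {"car_model": "ModelA", "region": "eu-west-1"},
    [{"timestamp_ms": 1700000005000, "value": 42.0}],
)
```

In the actual application, this mapping is a simple Flink map operator placed immediately before the Prometheus sink.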
Flink Prometheus connector key features
The Flink Prometheus connector allows Flink applications to write processed metrics to Prometheus, using the Remote-Write interface.
The connector is designed to scale write throughput by:
Parallelizing writes, using the Flink parallelism capability
Batching multiple samples in a single write request to the Prometheus endpoint
Error handling complies with the Prometheus Remote-Write 1.0 specification. The specification is particularly strict about malformed or out-of-order data, which Prometheus rejects.
When a malformed or out-of-order write is rejected, the connector discards the offending write request and continues, preferring data freshness over completeness. However, the connector makes data loss observable, emitting WARN log entries and exposing metrics that measure the amount of discarded data. In Amazon Managed Service for Apache Flink, these connector metrics can be automatically exported to Amazon CloudWatch.
Responsibilities of the user
The connector is optimized for efficiency, write throughput, and latency. Validating incoming data can be particularly expensive in terms of CPU utilization. Additionally, different Prometheus backend implementations enforce constraints differently. For these reasons, the connector doesn't validate incoming data before writing to Prometheus.
The user is responsible for making sure that the data sent to the Flink Prometheus connector follows the constraints enforced by the particular Prometheus implementation they're using.
Ordering
Ordering is particularly relevant. Prometheus expects that samples belonging to the same time series (samples with the same metric name and labels) are written in time order. The connector makes sure ordering isn't lost when data is partitioned to parallelize writes.
However, the user is responsible for retaining the ordering upstream in the pipeline. To achieve this, the user must carefully design data partitioning across the Flink application and the stream storage. Only partitioning by key must be used, and partitioning keys must compound the metric name and all labels that will be used in Prometheus.
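For example, a partition key that compounds the metric name and all labels could be built like this (a sketch; the actual key construction depends on your serialization format and stream storage):

```python
def partition_key(metric_name: str, labels: dict) -> str:
    """Build a deterministic key from the metric name and every label,
    so that all samples of one time series land in the same partition
    and retain their relative order end to end."""
    # Sort label names so the key doesn't depend on dict insertion order
    parts = [metric_name] + [f"{k}={labels[k]}" for k in sorted(labels)]
    return "|".join(parts)

key = partition_key(
    "engine_temperature", {"car_model": "ModelA", "region": "eu-west-1"}
)
```

Using this as both the Kafka message key and the Flink keyBy key keeps each time series on a single partition and a single subtask, so no two writers can reorder its samples.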
Conclusion
Prometheus is a specialized time series database, designed for building real-time dashboards and alerting. Amazon Managed Service for Prometheus is a fully managed, serverless backend compatible with the Prometheus open source standard. Amazon Managed Grafana allows you to build real-time dashboards, seamlessly interfacing with Amazon Managed Service for Prometheus.
You can use Prometheus for observability use cases beyond compute resources, to observe IoT devices, connected cars, media streaming devices, and other highly distributed assets providing telemetry data.
Directly visualizing and analyzing high-cardinality and high-frequency data can be inefficient. Preprocessing raw observability events with Amazon Managed Service for Apache Flink shifts the work left, greatly simplifying the dashboards and alerting you can build on top of Amazon Managed Service for Prometheus.
For more information about running Flink, Prometheus, and Grafana on AWS, see the resources for these services:
For more information about the Flink Prometheus integration, see the Apache Flink documentation.
About the authors
Lorenzo Nicora works as Senior Streaming Solutions Architect at AWS, helping customers across EMEA. He has been building cloud-centered, data-intensive systems for over 25 years, working across industries both through consultancies and product companies. He has used open source technologies extensively, has contributed to several projects including Apache Flink, and is the maintainer of the Flink Prometheus connector.
Francisco Morillo is a Senior Streaming Solutions Architect at AWS. Francisco works with AWS customers, helping them design real-time analytics architectures using AWS services, supporting Amazon MSK and Amazon Managed Service for Apache Flink. He is also a main contributor to the Flink Prometheus connector.