
Kaltura reduces observability operational costs by 60% with Amazon OpenSearch Service


This post is co-written with Ido Ziv from Kaltura.

As organizations grow, managing observability across multiple teams and applications becomes increasingly complex. Logs, metrics, and traces generate vast amounts of data, making it challenging to maintain performance, reliability, and cost-efficiency.

At Kaltura, an AI-infused video-first company serving millions of users across hundreds of applications, observability is mission-critical. Understanding system behavior at scale isn't just about troubleshooting; it's about providing seamless experiences for customers and employees alike. But achieving effective observability at this scale comes with challenges: managing spans; correlating logs, traces, and events across distributed systems; and maintaining visibility without overwhelming teams with noise. Balancing granularity, cost, and actionable insights requires constant tuning and thoughtful architecture.

In this post, we share how Kaltura transformed its observability strategy and technology stack by migrating from a software as a service (SaaS) logging solution to Amazon OpenSearch Service, achieving longer log retention, a 60% reduction in cost, and a centralized platform that empowers multiple teams with real-time insights.

Observability challenges at scale

Kaltura ingests over 8 TB of logs and traces every day, processing more than 20 billion events across 6 production AWS Regions and over 200 applications, with log spikes reaching up to 6 GB per second. This immense data volume, combined with a highly distributed architecture, created significant challenges in observability. Historically, Kaltura relied on a SaaS-based observability solution that met initial requirements but became increasingly difficult to scale. As the platform evolved, teams generated disparate log formats, applied retention policies that no longer reflected data value, and operated more than 10 organically grown observability sources. The lack of standardization and visibility required extensive manual effort to correlate data, maintain pipelines, and troubleshoot issues, leading to growing operational complexity and fixed costs that didn't scale efficiently with usage.

Kaltura's DevOps team recognized the need to reassess their observability solution and began exploring a variety of options, from self-managed platforms to fully managed SaaS offerings. After a comprehensive evaluation, they made the strategic decision to migrate to OpenSearch Service, using its advanced features such as Amazon OpenSearch Ingestion, the Observability plugin, UltraWarm storage, and Index State Management.

Solution overview

Kaltura created a new AWS account to serve as a dedicated observability account, where OpenSearch Service was deployed. Logs and traces were collected from different accounts and producers, such as microservices on Amazon Elastic Kubernetes Service (Amazon EKS) and services running on Amazon Elastic Compute Cloud (Amazon EC2).

By using AWS services such as AWS Identity and Access Management (IAM), AWS Key Management Service (AWS KMS), and Amazon CloudWatch, Kaltura was able to meet the standards of a production-grade system while keeping security and reliability in mind. The following figure shows a high-level design of the environment setup.

Ingestion

As seen in the following diagram, logs are shipped using log shippers, also known as collectors. In Kaltura's case, they used Fluent Bit. A log shipper is a tool designed to collect, process, and transport log data from various sources to a centralized location, such as log analytics platforms, management systems, or an aggregator system. Fluent Bit was used for all sources and also provided light processing capabilities. Fluent Bit was deployed as a DaemonSet in Kubernetes. The application development teams didn't change their code, because the Fluent Bit pods were reading the stdout of the application pods.

The following code is an example of the Fluent Bit configuration for Amazon EKS:

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Tag               kube.*
    Skip_Long_Lines   On
    multiline.parser  docker, cri

[FILTER]
    alias             k8s
    # kubernetes filter to parse all logs
    Name              kubernetes
    Match             kube.*
    Kube_Tag_Prefix   kube.var.log.containers.
    Annotations       On
    Labels            Off
    Merge_Log         On
    Keep_Log          Off
    Kube_URL          https://kubernetes.default.svc.cluster.local:443

[FILTER]
    alias             apps
    Name              rewrite_tag
    Match             kube.*
    Rule              $kubernetes['annotations']['kaltura.com/observability'] ^apps$ apps.$TAG false

[OUTPUT]
    Name              http
    Match             apps.*
    Alias             apps
    Host              xxxxx.us-east-1.osis.amazonaws.com
    Port              443
    URI               /log/apps
    Format            json
    aws_auth          true
    aws_region        us-east-1
    aws_service       osis
    aws_role_arn      arn:aws:iam::xxxxx:role/osis-ingestion-role
    Log_Level         trace
    tls               On

Spans and traces were collected directly from the application layer using a seamless integration approach. To facilitate this, Kaltura deployed an OpenTelemetry Collector (OTel) using the OpenTelemetry Operator for Kubernetes. Additionally, the team developed a custom OTel code library, which was incorporated into the application code to efficiently capture and log traces and spans, providing comprehensive observability across their system.
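A deployment like this can be declared through the Operator's OpenTelemetryCollector custom resource. The following is a minimal sketch only; the resource name and endpoint are illustrative, not Kaltura's actual configuration:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: observability-collector
spec:
  mode: deployment
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
    exporters:
      otlphttp:
        # placeholder OpenSearch Ingestion endpoint
        endpoint: "https://xxxxx.us-east-1.osis.amazonaws.com"
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlphttp]
```

Note that sending OTLP data to OpenSearch Ingestion also requires SigV4-signed requests (for example, via the collector's sigv4auth extension), which is omitted here for brevity.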

Data from Fluent Bit and the OpenTelemetry Collector was sent to OpenSearch Ingestion, a fully managed, serverless data collector that delivers real-time log, metric, and trace data to OpenSearch Service domains and Amazon OpenSearch Serverless collections. Each producer sent data to a specific pipeline, one for logs and one for traces, where data was transformed, aggregated, enriched, and normalized before being sent to OpenSearch Service. The trace pipeline used the otel_trace and service_map processors, following the OpenSearch Ingestion OpenTelemetry trace analytics blueprint.
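The trace pipeline itself isn't reproduced in this post. The following condensed sketch, modeled on the OpenTelemetry trace analytics blueprint, shows roughly how the otel_trace and service_map processors fit together; the path, hosts, and roles are placeholders:

```yaml
version: "2"
otel-trace-pipeline:
  source:
    otel_trace_source:
      path: "/trace/ingest"
  processor:
    - otel_trace:
    - service_map:
  sink:
    - opensearch:
        hosts: ["${opensearch_host}"]
        index_type: trace-analytics-raw
        aws:
          sts_role_arn: "${sts_role_arn}"
          region: "${region}"
```

In the actual blueprint, raw traces and the service map are written by separate sub-pipelines connected with pipeline connectors; this sketch collapses them for brevity.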

The following code is an example of the OpenSearch Ingestion pipeline for logs:

version: "2"
entry-pipeline:
  source:
    http:
      path: "/log/apps"

  processor:
    - add_entries:
        entries:
          - key: "log_type"
            value: "default"
          - key: "log_type"
            value: "api"
            add_when: 'contains(/filename, "api.log")'
            overwrite_if_key_exists: true
          - key: "log_type"
            value: "stats"
            add_when: 'contains(/filename, "stats.log")'
            overwrite_if_key_exists: true
          - key: "log_type"
            value: "event"
            add_when: 'contains(/filename, "event.log")'
            overwrite_if_key_exists: true
          - key: "log_type"
            value: "login"
            add_when: 'contains(/filename, "login.log")'
            overwrite_if_key_exists: true

    - grok:
        grok_when: '/log_type == "api"'
        match:
          log: ['^\[%{DATA:timestamp}\] \[%{DATA:logIp}\] \[%{DATA:host}\] \[%{WORD:id}\] %{WORD:priorityName}\[%{NUMBER:priority}\]: \[memory: %{DATA:memory} MB, real: %{DATA:real}MB\] %{GREEDYDATA:message}']

    - date:
        match:
          - key: timestamp
            patterns: ["dd-MMM-yyyy HH:mm:ss", "dd/MMM/yyyy:HH:mm:ss Z", "EEE MMM dd HH:mm:ss.SSSSSS yyyy"]
        destination: "@timestamp"
        output_format: "yyyy-MM-dd'T'HH:mm:ss"

    - rename_keys:
        entries:
          - from_key: "timestamp"
            to_key: "@timestamp"
            overwrite_if_to_key_exists: false
          - from_key: "date"
            to_key: "@timestamp"
            overwrite_if_to_key_exists: false

    - drop_events:
        drop_when: 'contains(/filename, "simplesamlphp.log")'

  sink:
    - opensearch:
        hosts: ["${opensearch_host}"]
        index: '$${/env}-api-$${/log_type}-app-logs'
        index_type: custom
        action: create
        bulk_size: 20
        aws:
          sts_role_arn: "${sts_role_arn}"
          region: "${region}"
        dlq:
          s3:
            bucket: "${bucket}"
            key_path_prefix: 'my-app-dlq-files'
            region: "${region}"
            sts_role_arn: "${sts_role_arn}"

The preceding example shows the use of processors such as grok, date, add_entries, rename_keys, and drop_events:

add_entries:

Adds a new field log_type based on the filename
Default: "default"
If the filename contains specific substrings (such as api.log or stats.log), it assigns a more specific type

grok:

Applies Grok parsing to logs of type "api"
Extracts fields like timestamp, logIp, host, priorityName, priority, memory, real, and message using a custom pattern

date:

Parses timestamp strings into a standard datetime format
Stores the result in a field called @timestamp in ISO 8601 format
Handles multiple timestamp patterns

rename_keys:

timestamp or date are renamed to @timestamp
Doesn't overwrite if @timestamp already exists

drop_events:

Drops logs where filename contains simplesamlphp.log
This is a filtering rule to ignore noisy or irrelevant logs

The following is an example of an input log line:

"log": "[25-Mar-2025 18:23:18] [127.0.0.1] [the-most-awesome-server-in-kaltura] [67e2f496cc321] INFO[6]: [memory: 4.51 MB, real: 6MB] [request: 1] [time: 0.0263s / total: 0.0263s]",

After processing, we get the following output:

"log_type": "api",
"priorityName": "INFO",
"memory": "4.51",
"host": "the-most-awesome-server-in-kaltura",
"real": "6",
"priority": "6",
"message": "[request: 1] [time: 0.0263s / total: 0.0263s]",
"logIp": "127.0.0.1",
"id": "67e2f496cc321",
"@timestamp": "2025-03-25T18:23:18"
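For intuition, the grok extraction above can be approximated in plain Python. The regex below is a hypothetical stand-in for the pipeline's pattern, not the production parser:

```python
import re

# Hypothetical re-implementation of the pipeline's grok step for "api" logs,
# shown only to illustrate which fields the pattern extracts.
API_LOG = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] \[(?P<logIp>[^\]]+)\] \[(?P<host>[^\]]+)\] "
    r"\[(?P<id>\w+)\] (?P<priorityName>\w+)\[(?P<priority>\d+)\]: "
    r"\[memory: (?P<memory>[^ ]+) MB, real: (?P<real>[^\]]+)MB\] (?P<message>.*)$"
)

log = ("[25-Mar-2025 18:23:18] [127.0.0.1] [the-most-awesome-server-in-kaltura] "
       "[67e2f496cc321] INFO[6]: [memory: 4.51 MB, real: 6MB] "
       "[request: 1] [time: 0.0263s / total: 0.0263s]")

# groupdict() yields the same field names the pipeline produces
fields = API_LOG.match(log).groupdict()
print(fields["priorityName"])  # INFO
print(fields["memory"])        # 4.51
```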

Kaltura followed some OpenSearch Ingestion best practices, such as:

Including a dead-letter queue (DLQ) in the pipeline configuration. This can significantly help troubleshoot pipeline issues.
Starting and stopping pipelines to optimize cost-efficiency, when possible.
During the proof of concept stage:

Installing Data Prepper locally for faster development iterations.
Disabling persistent buffering to expedite blue-green deployments.

Achieving operational excellence with efficient log and trace management

Logs and traces play a crucial role in identifying operational issues, but they come with unique challenges. First, they represent time series data, which inherently evolves over time. Second, their value typically diminishes as time passes, making efficient management essential. Third, they're append-only in nature. With OpenSearch, Kaltura faced distinct trade-offs between cost, data retention, and latency. The goal was to make sure valuable data remained accessible to engineering teams with minimal latency, but the solution also needed to be cost-effective. Balancing these factors required thoughtful planning and optimization.

Data was ingested into OpenSearch data streams, which simplify the process of ingesting append-only time series data. Multiple Index State Management (ISM) policies were applied to different data streams, depending on log retention requirements. ISM policies handled moving indexes from hot storage to UltraWarm, and eventually deleting the indexes. This allowed a customizable and cost-effective solution, with low latency for querying new data and reasonable latency for querying historical data.

The following example ISM policy makes sure indexes are managed efficiently, rolled over, and moved to different storage tiers based on their age and size, and eventually deleted after 60 days. If an action fails, it's retried with an exponential backoff strategy. In case of failures, notifications are sent to relevant teams to keep them informed.

{
  "id": "retention",
  "policy": {
    "description": "production ISM",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "retry": {
              "count": 5,
              "backoff": "exponential",
              "delay": "1h"
            },
            "rollover": {
              "min_primary_shard_size": "30gb",
              "copy_alias": false
            }
          }
        ],
        "transitions": [
          {
            "state_name": "warm",
            "conditions": {
              "min_index_age": "second"
            }
          }
        ]
      },
      {
        "name": "warm",
        "actions": [
          {
            "retry": {
              "count": 5,
              "backoff": "exponential",
              "delay": "1h"
            },
            "warm_migration": {}
          }
        ],
        "transitions": [
          {
            "state_name": "cold",
            "conditions": {
              "min_index_age": "14d"
            }
          }
        ]
      },
      {
        "name": "cold",
        "actions": [
          {
            "retry": {
              "count": 5,
              "backoff": "exponential",
              "delay": "1h"
            },
            "cold_migration": {
              "start_time": null,
              "end_time": null,
              "timestamp_field": "@timestamp",
              "ignore": "none"
            }
          }
        ],
        "transitions": [
          {
            "state_name": "delete",
            "conditions": {
              "min_index_age": "60d"
            }
          }
        ]
      },
      {
        "name": "delete",
        "actions": [
          {
            "retry": {
              "count": 3,
              "backoff": "exponential",
              "delay": "1m"
            },
            "cold_delete": {}
          }
        ],
        "transitions": []
      }
    ],
    "ism_template": [
      {
        "index_patterns": [
          "*-logs"
        ],
        "priority": 50
      }
    ]
  }
}

To create a data stream in OpenSearch, a definition of an index template is required, which configures how the data stream and its backing indexes will behave. In the following example, the index template specifies key index settings such as the number of shards, replication, and refresh interval, controlling how data is distributed, replicated, and refreshed across the cluster. It also defines the mappings, which describe the structure of the data: what fields exist, their types, and how they should be indexed. These mappings make sure the data stream knows how to interpret and store incoming log data efficiently. Finally, the template enables the @timestamp field as the time-based field required for a data stream.

{
  "index_patterns": [
    "*my-app-logs"
  ],
  "template": {
    "settings": {
      "index.number_of_shards": "32",
      "index.number_of_replicas": "0",
      "index.refresh_interval": "60s"
    },
    "mappings": {
      "properties": {
        "priorityName": {
          "type": "keyword"
        },
        "log_type": {
          "type": "keyword"
        },
        "@timestamp": {
          "type": "date"
        },
        "memory": {
          "type": "float"
        },
        "host": {
          "type": "keyword"
        },
        "pid": {
          "type": "keyword"
        },
        "real": {
          "type": "float"
        },
        "env": {
          "type": "keyword"
        },
        "message": {
          "type": "text"
        },
        "priority": {
          "type": "integer"
        },
        "logIp": {
          "type": "ip"
        }
      }
    }
  },
  "composed_of": [],
  "priority": "100",
  "_meta": {
    "flow": "simple"
  },
  "data_stream": {
    "timestamp_field": {
      "name": "@timestamp"
    }
  },
  "name": "my-app-logs"
}
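With such a template registered (for example, via PUT _index_template/my-app-logs with the body above), a matching data stream can be created explicitly before ingestion; the stream name below is hypothetical and simply matches the template's *my-app-logs pattern:

```
PUT _data_stream/prod-my-app-logs
```

Writes arriving through OpenSearch Ingestion then append to the stream's backing indexes, which the ISM policy rolls over and migrates between tiers.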

Implementing role-based access control and user access

The new observability platform is accessed by many types of users; internal users log in to OpenSearch Dashboards using SAML-based federation with Okta. The following diagram illustrates the user flow.

Each user accesses the dashboards to view observability objects relevant to their role. Fine-grained access control (FGAC) is enforced in OpenSearch using built-in IAM role and SAML group mappings to implement role-based access control (RBAC). When users log in to the OpenSearch domain, they're automatically routed to the appropriate tenant based on their assigned role. This setup makes sure developers can create dashboards tailored to debugging within development environments, and support teams can build dashboards focused on identifying and troubleshooting production issues. The SAML integration alleviates the need to manage internal OpenSearch users entirely.

For each role in Kaltura, a corresponding OpenSearch role was created with only the necessary permissions. For instance, support engineers are granted access to the monitoring plugin to create alerts based on logs, whereas QA engineers, who don't require this functionality, aren't granted that access.

The following screenshot shows the role of the DevOps engineers defined with cluster permissions.

These users are routed to their own dedicated DevOps tenant, to which only they have write access. This makes it possible for users from different roles in Kaltura to create the dashboard objects that focus on their priorities and needs. OpenSearch supports backend role mapping; Kaltura mapped each Okta group to a role, so when a user logs in from Okta, they're automatically assigned permissions based on their role.
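Backend role mapping of this kind is expressed through the security plugin's REST API. A hypothetical mapping of an Okta group to a DevOps role might look like the following (the role and group names are illustrative):

```
PUT _plugins/_security/api/rolesmapping/devops-role
{
  "backend_roles": ["okta-devops"],
  "hosts": [],
  "users": []
}
```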

This also works with IAM roles to facilitate automations in the cluster using external services, such as OpenSearch Ingestion pipelines, as can be seen in the following screenshot.

Using observability features and service mapping for enhanced trace and log correlation

After a user is logged in, they can use the Observability plugins, view surrounding events in logs, correlate logs and traces, and use the Trace Analytics plugin. Users can examine traces and spans, and group traces with latency information using built-in dashboards. Users can also drill down to a specific trace or span and correlate it back to log events. The service_map processor used in OpenSearch Ingestion sends OpenTelemetry data to create a distributed service map for visualization in OpenSearch Dashboards.

Using the combined signals of traces and spans, OpenSearch discovers the application connectivity and maps it to a service map.

After OpenSearch ingests the traces and spans from OTel, they're aggregated into groups according to paths and characteristics. Durations are also calculated and presented to the user over time.

With a trace ID, it's possible to filter all the related spans by service and see how long each took, identifying issues with external services such as MongoDB and Redis.

From the spans, users can discover the related logs.

Post-migration enhancements

After the migration, a strong developer community emerged within Kaltura that embraced the new observability solution. As adoption grew, so did requests for new features and enhancements aimed at improving the overall developer experience.

One key improvement was extending log retention. Kaltura achieved this by re-ingesting historical logs from Amazon Simple Storage Service (Amazon S3) using a dedicated OpenSearch Ingestion pipeline with Amazon S3 read permissions. With this enhancement, teams can access and analyze logs from up to a year ago using the same familiar dashboards and filters.
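A re-ingestion pipeline of this kind might use the S3 scan source in OpenSearch Ingestion. The following is a sketch under stated assumptions; the bucket, index, region, and role values are placeholders:

```yaml
version: "2"
s3-replay-pipeline:
  source:
    s3:
      codec:
        newline:
      scan:
        buckets:
          - bucket:
              name: "archived-app-logs"
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::xxxxx:role/osis-s3-replay-role"
  sink:
    - opensearch:
        hosts: ["${opensearch_host}"]
        index: "replayed-app-logs"
        aws:
          sts_role_arn: "${sts_role_arn}"
          region: "${region}"
```

Because the replayed documents carry their original @timestamp values, they land in the same dashboards and filters teams already use.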

In addition to monitoring EKS clusters and EC2 instances, Kaltura expanded its observability stack by integrating additional AWS services. Amazon API Gateway and AWS Lambda were introduced to support log ingestion from external vendors, allowing for seamless correlation with existing data and broader visibility across systems.

Finally, to empower teams and promote autonomy, data stream templates and ISM policies are managed directly by developers within their own repositories. By using infrastructure as code tools like Terraform, developers can define index mappings, alerts, and dashboards as code, versioned in Git and deployed consistently across environments.

Conclusion

Kaltura successfully implemented a practical log retention strategy, extending real-time retention from 5 days for all log types to 30 days for critical logs, while maintaining cost-efficiency through the use of UltraWarm nodes. This approach led to a 60% reduction in costs compared to their previous solution. Additionally, Kaltura consolidated their observability platform, streamlining operations by merging 10 separate systems into a unified, all-in-one solution. This consolidation not only improved operational efficiency but also sparked increased engagement from developer teams, driving feature requests, fostering internal design collaborations, and attracting early adopters for new enhancements. If Kaltura's journey has inspired you and you're interested in implementing a similar solution in your organization, consider these steps:

Start by understanding the requirements and setting expectations with the engineering teams in your organization
Start with a quick proof of concept to get hands-on experience
Refer to the following resources to help you get started:

About the authors

Ido Ziv is a DevOps team leader at Kaltura with over 6 years of experience. His hobbies include sailing and Kubernetes (but not at the same time).

Roi Gamliel is a Senior Solutions Architect helping startups build on AWS. He is passionate about the OpenSearch Project, helping customers fine-tune their workloads and maximize results.

Yonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. He is located in Israel and helps customers harness AWS analytical services to use data, gain insights, and derive value.


