Agapios Tsolakis
In the first blog of this series, we covered all the foundational concepts of detection engineering maintenance. Although that post leaned heavily on theory, it provided the solid groundwork needed for everything that follows.
This time, we’re shifting gears from theory to practice. As we know, detection engineering doesn’t begin with writing resilient logic; it begins with logs, which ultimately means data. Similarly, detection maintenance, a subset of detection engineering, doesn’t start with allowlisting or tuning. It starts with understanding and analyzing data and metadata.
While it would be valuable to explore topics such as ETL pipelines, data workflows, or the ingestion, normalization, and enrichment of logs, these fall outside the scope of this post.
We’ll also leave out validation processes, like log replay or attack simulation, as well as aspects related to data quality, coverage verification, and data source or sensor health. Additionally, alert quality assessments and detection validation (i.e., do they still catch what they’re meant to) are beyond the focus of this piece.

Instead, this second blog centers on what truly matters after a detection has been deployed in an environment. Specifically, how we measure, monitor, and maintain its overall performance.
A small intro to data science
In the previous blog, we emulated our red team friends by injecting detection engineering with software engineering principles, much like they inject malicious code into processes. As detection engineering heavily relies on log data, we’ll take it a step further in this blog and infuse our beloved detection engineering science with data science principles.
Now, I’m definitely not a data science expert, but a quick Perplexity search — yes, because Google is for boomers — shows that the high-level steps for a successful data science project are typically the following:
- Defining the problem and the goals
- Data collection
- Data cleaning & pre-processing
- Data analysis
- Evaluation / conclusion
Let’s have a look at each of these steps and see how they apply to detection maintenance.

Defining the problem and goals
When dealing with a data problem, or when we want to start collecting data for a specific purpose, the first step is to identify and define the problem we’re trying to solve. In our case, the core problem is figuring out how to extract meaningful insights about our detections so we can maintain them proactively and efficiently. To solve this, we first need to understand what data we should collect and from where.
But data collection for performance analysis alone isn’t enough. To maintain detections in a timely and effective way, we also need data that helps us:
- Analyze performance trends and identify detections eligible for tuning or retirement.
- Prioritize which detections require attention first (especially if you’re just starting your detection maintenance journey), and
- Report effectively, making future iterations smoother and more data-driven.
By clearly defining the problem from the start, you ensure that your efforts stay focused, avoiding the trap of collecting irrelevant or excessive data that doesn’t lead to actionable insights.
Now that we’ve defined the problem, let’s identify the goals of this project. When setting goals, each one should have a clear, actionable outcome. For example, if the goal is to identify noisy detections, the action item should be to progressively reduce that noise during upcoming maintenance cycles.
An even better approach is to ask yourself:
“What would make my company’s detection maintenance data collection project a successful one?”
That mindset naturally leads to the right kind of questions, and ultimately, the right kind of data.
Questions like:
- Do I have the data to determine whether a detection is noisy or not?
- Do I have the data to confirm whether a detection is broken?
- Do I have clear visibility into how detections behave in my environment?
- Can I identify meaningful trends with the data I currently have?
- Which, and how many, detections are eligible for tuning in the next iteration?
- Do I have redundant or overlapping allowlisting logic in my detections?
- Are there detections that should be downgraded to “indicator” (weak signals), or others that deserve to be promoted to “booster” (high-value)?
- Do I have a way to store, correlate, and re-query data for historical analysis?
- Am I utilizing my available data to its full capacity?
From experience, even when you have answers to all of these existential questions, detections, especially in big environments, will still find a way to punch your team in the face, to paraphrase the wise words of renowned philosopher Mike Tyson.

Data collection
Data collection can be tricky, and sometimes expensive. That’s why we need to be careful and intentional about what we collect. Collecting everything might sound tempting, but it’s neither practical nor sustainable.
Most of the time, the three categories of data described below are enough to give us a strong starting point:
- Data from the detections themselves.
- Metadata derived from the detections running on one or multiple environments.
- Post-deployment data gathered across one or multiple environments.
To collect and manage all this data effectively, you first need to adopt a Detection-as-Code (DaC) approach. Without a proper data structure or a consistent way to query and store your detections, meaningful collection and analysis become nearly impossible.
Note: The sources of data include anything we can query — there are no strict limits here. If it can be accessed, it can be leveraged.
A quick visual to make sure we’re all on the same page:

Detection data
In a previous FalconForce blog post, we shared some examples of how you can structure your core detection code. The file that holds this information is the “usecase.yml”, which acts as a template to populate detections across different environments; you’ll see how below.
Let’s have a look at an example of what detection data looks like in our repository. Keep in mind that the full documentation contains much more detail, but here we’ll focus only on the data that’s relevant for maintenance.
name: Detection 1
id: 0xFF-0001-detection-Win
tags:
- ActiveDirectory
- SuspiciousBehavior
- ProcessInjection
- Indicator
- Threshold
change_log:
- {version: '1.2', date: '2025-05-19', impact: minor, message: Updated entity mapping to remove deprecated FullName field.}
- {version: '1.1', date: '2024-12-18', impact: minor, message: Added Sentinel entity mapping.}
- {version: '1.0', date: '2024-11-27', impact: major, message: Initial version.}
The detection data derived from the core detection files can include the following (you can of course extend the detections in your repository to fit your specific needs; a small extraction sketch follows the note below):
- Detection tags, which provide context on:
- → Whether a detection is high-value or not (via “booster” or “indicator” tags)
- → Whether a detection is threshold-based
- → Whether a detection is CVE-based
- Creation date
- Versioning information
Note: We use the term “booster” for detections that we consider high value. Conversely, the term “indicator” refers to detections that represent weak signals; they don’t alert on their own, but can add valuable context or be correlated with other detections to trigger meaningful alerts.
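To make the extraction concrete, here is a minimal sketch in Python, using PyYAML and assuming a hypothetical repository layout where every detection folder contains a usecase.yml; the derived field names are our own and the tag/change_log structure follows the example above:
import glob
import yaml  # PyYAML

def load_detection_data(repo_root="detections"):
    """Collect maintenance-relevant fields from every usecase.yml under repo_root.
    The folder layout and derived field names are assumptions; adjust to your own repo."""
    records = []
    for path in glob.glob(f"{repo_root}/**/usecase.yml", recursive=True):
        with open(path) as fh:
            uc = yaml.safe_load(fh)
        tags = [t.lower() for t in uc.get("tags", [])]
        change_log = uc.get("change_log", [])
        records.append({
            "id": uc.get("id"),
            "name": uc.get("name"),
            "is_booster": "booster" in tags,
            "is_indicator": "indicator" in tags,
            "is_threshold": "threshold" in tags,
            "is_cve": any(t.startswith("cve") for t in tags),
            # change_log is ordered newest-first in the example above.
            "creation_date": change_log[-1]["date"] if change_log else None,
            "current_version": change_log[0]["version"] if change_log else None,
        })
    return records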
Detection metadata
For each environment, you can maintain separate YAML files that store metadata for each detection. At FalconForce, we call these files “env_usecase.yml”, and they follow the format shown below:
status: PROD
first_publish_date: '2024-12-22'
env_usecase_id: '0001'
deployed_version: '1.0'
# maintenance metadata
when_last_reviewed: '2025-03-26'
maintenance_cycles: 1
true_positives: false
trigger_ratio: 'Low'
tuned_after_creation: true
threshold_changed: 'Increased'
# query variables
query_variables:
  threshold_placeholder: 30
  post_filter_1: |-
    | where InitiatingProcessAccountName !startswith "adm" // PERM - Exclusion of admins.
Let’s begin with the maintenance metadata. To facilitate maintenance across environments, we introduced the following metadata elements:
- “when_last_reviewed” — The most recent date the detection was reviewed or deployed in an environment.
- “maintenance_cycles” — The number of maintenance iterations the detection has undergone within a given environment.
- “true_positives” — Indicates whether the detection has previously generated a confirmed True Positive (TP) in the environment, based only on confirmed incidents or red/purple team exercises in the past; this is not SOC classification data.
- “trigger_ratio” — Represents the observed trigger rate of the detection at the time of the latest review, or the simulated results at the time of deployment. Acceptable values are `Low`, `Medium` and `High`.
- “tuned_after_creation” — Indicates whether the detection required tuning during its initial deployment phase. Typically, during detection testing, prior to reaching production.
- “threshold_changed” — Captures whether the detection threshold was modified post-deployment. Allowed values include `No_change`, `Increased` and `Decreased`.
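As a minimal sketch, assuming a hypothetical layout of environments/<environment>/<detection>/env_usecase.yml and the field names from the example above, this metadata can be flattened into a single pandas DataFrame for later analysis:
import glob
import pandas as pd
import yaml  # PyYAML

def load_env_metadata(envs_root="environments"):
    """Flatten the per-environment env_usecase.yml files into one table.
    The path convention environments/<environment>/<detection>/env_usecase.yml is an assumption."""
    rows = []
    for path in glob.glob(f"{envs_root}/*/*/env_usecase.yml"):
        _, environment, detection, _ = path.rsplit("/", 3)
        with open(path) as fh:
            meta = yaml.safe_load(fh)
        rows.append({
            "environment": environment,
            "detection": detection,
            "status": meta.get("status"),
            "deployed_version": meta.get("deployed_version"),
            "when_last_reviewed": meta.get("when_last_reviewed"),
            "maintenance_cycles": meta.get("maintenance_cycles", 0),
            "true_positives": meta.get("true_positives", False),
            "trigger_ratio": meta.get("trigger_ratio"),
            "tuned_after_creation": meta.get("tuned_after_creation", False),
            "threshold_changed": meta.get("threshold_changed"),
        })
    return pd.DataFrame(rows)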
Now, moving on to the allowlisting part of the YAML file.
Yes, I know… this is probably the worst allowlisting code you’ve ever seen in your life. But stay with me, the important part here is actually the comment that follows the exclusion code.
It’s always good practice to document the reason behind the allowlisting (though we won’t be using the reason itself in our maintenance process directly). Moreover, we use the tags TEMP and PERM to indicate whether an allowlisting entry is temporary or permanent. Additionally, we want to keep track of the date and version of the allowlisting entry, as well as leave room to set a fixed expiry date.
Of course, this allowlisting data can be extended, but the above concept would look something like this:
# query variables
query_variables:
  threshold_placeholder: 30
  post_filter_1:
    code: |-
      | where InitiatingProcessAccountName !startswith "adm"
    description: "Exclusion of admins."
    tag: TEMP
    date_added: '2025-03-26'
    last_modification_date: '2025-04-11'
    version: '1.1'
    expiration_date: '2025-05-01'
From the allowlisting code in the above format, we can extract the following data points:
- Presence of allowlisting code (to differentiate between tuned and untuned detections).
- Number of lines in the allowlisting code (to see how heavily a detection was tuned).
- Tags in the comment section indicating whether it is a temporary (TEMP) or permanent (PERM) exclusion (have allowlists expired, or is it time to review allowlists set X months ago?).
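The data points above can be derived with a small sketch like the following, assuming the extended allowlisting format shown earlier is loaded with PyYAML (the field names are the ones from the example):
from datetime import date
import yaml  # PyYAML

def allowlisting_stats(env_usecase_path, today=None):
    """Summarize the exclusion logic of one env_usecase.yml: is the detection tuned,
    how many exclusion lines does it carry, and have any TEMP entries expired?"""
    today = today or date.today().isoformat()
    with open(env_usecase_path) as fh:
        meta = yaml.safe_load(fh)
    filters = [
        v for k, v in meta.get("query_variables", {}).items()
        if k.startswith("post_filter") and isinstance(v, dict)
    ]
    return {
        "is_tuned": len(filters) > 0,
        "exclusion_lines": sum(len(f.get("code", "").splitlines()) for f in filters),
        "temp_exclusions": sum(1 for f in filters if f.get("tag") == "TEMP"),
        "perm_exclusions": sum(1 for f in filters if f.get("tag") == "PERM"),
        # ISO 8601 dates compare correctly as plain strings.
        "expired_exclusions": sum(
            1 for f in filters
            if f.get("expiration_date") and str(f["expiration_date"]) < today
        ),
    }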
The big question, whether you’re building detections for multiple environments (for example, as an MSSP, or when working in a global organization) or just a single one (as an in-house detection engineer for an SMB), is what data you collect from those environments to generate meaningful insights and metrics.
For example, by running the following KQL query in an environment, you can gather data on how often your detections trigger alerts per day:
SecurityAlert
| where AlertName matches regex @"<placeholder for your regex>" // This line exists if we only want to include detections based on a naming pattern.
| summarize arg_min(TimeGenerated,*) by SystemAlertId
| make-series Trend = count() default = 0 on TimeGenerated from startofday(ago(30d)) to startofday(now()) step 1d by AlertName
| extend Total = array_sum(Trend)
| top 25 by Total desc
| project TimeGenerated, Trend, AlertName, Total
| project-away TimeGenerated
You should get something like the following (screenshot taken from our test environment):

Note: To analyze time-series data effectively, you’ll also need the “TimeGenerated” field. Without the TimeGenerated field, data just becomes unordered numbers. We removed it from the screenshot above to make the visual representation cleaner.
A subset of the data you can extract from an environment, or across multiple environments, includes the following:
- Alert trigger frequency of each detection per environment, represented as time-series data.
- Alert severity (you can easily extract the alert severity of a detection using the “SecurityAlert” telemetry table).
Data cleaning & pre-processing
Depending on your needs and the type of data you plan to collect, you must ensure that the data are in the correct format before you start analyzing them. The FalconForce blog post I mentioned earlier also provides some useful pointers on maintaining strict data formats in your repository, helping you prevent distorted or inconsistent data from polluting your analysis.
One important thing to keep in mind is that if you query data from multiple environments and perform cross-environment correlation, all environments must follow the same data structure. You might also need to normalize alert or detection names into a consistent format before you can correlate results effectively.
Additionally, watch out for missing data across environments. You’ll need a reliable strategy to handle those gaps without drawing misleading conclusions. Otherwise, you’ll end up comparing apples to oranges.
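As a small illustration of both points, here is a sketch that assumes the per-environment alert counts have already been exported into daily rows of environment, detection name, date and count (the column names and the prefix-stripping rule are assumptions):
import pandas as pd

def normalize_and_align(alerts: pd.DataFrame) -> pd.DataFrame:
    """alerts: one row per environment / detection / day with a 'count' column,
    exported per environment (for example from the make-series query shown earlier)."""
    df = alerts.copy()
    df["date"] = pd.to_datetime(df["date"])
    # Normalize detection names so the same rule correlates across environments,
    # e.g. strip a hypothetical "[ENV] " prefix and unify casing and whitespace.
    df["detection"] = (
        df["detection"].str.strip().str.lower()
        .str.replace(r"^\[[^\]]+\]\s*", "", regex=True)
    )
    # Re-index every environment/detection pair onto the full date range.
    # Days the export already filled with 0 stay 0 (no alerts); days with no exported
    # data at all become NaN (no data), so gaps are not mistaken for silence.
    full_range = pd.date_range(df["date"].min(), df["date"].max(), freq="D", name="date")
    return (
        df.set_index("date")
          .groupby(["environment", "detection"])["count"]
          .apply(lambda s: s.reindex(full_range))
          .rename("count")
          .reset_index()
    )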
Data analysis
We’ve got all the data we need, we can query it, explore it with Python (or your tool of choice), and now it’s time for the fun part: diving into the analysis.

Broken detections
Let’s start with the simplest check we can perform: seeing if a detection is broken.
We can use the alert time-series data, and if a detection hasn’t triggered for over six months, it becomes eligible for analysis.
Now, here’s an unpopular opinion, so bear with me. Every non-IoC detection (meaning threshold-based, behavior-based, and so on) should produce at least one result, regardless of how that result is classified, within a reasonable time frame after being deployed in an environment.
If the organization is large, that time frame should shrink even further. I find it hard to believe that a sizeable company with thousands of workstations could deploy a threshold-based detection and have it go an entire year without generating a single alert.
Of course, finding detections that haven’t triggered does not automatically mean they are broken, but it definitely makes them eligible for review. You can prioritize which ones to check by looking at their tags (for example, “booster”) and the alert severity.
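A minimal sketch of that check, assuming the aligned daily counts from the cleaning step and a detection table with an assumed numeric severity encoding:
import pandas as pd

def broken_candidates(daily_counts: pd.DataFrame, detections: pd.DataFrame,
                      silent_days: int = 180) -> pd.DataFrame:
    """Flag detections with zero alerts over the last `silent_days` per environment.
    daily_counts: environment, detection, date, count (NaN = no data for that day).
    detections:   detection, is_booster, severity (severity assumed numeric)."""
    cutoff = daily_counts["date"].max() - pd.Timedelta(days=silent_days)
    recent = daily_counts[daily_counts["date"] >= cutoff]
    totals = (
        recent.groupby(["environment", "detection"])["count"]
              .sum(min_count=1)  # stays NaN if there was no data at all, so data gaps are not flagged
              .reset_index(name="alerts_recent")
    )
    silent = totals[totals["alerts_recent"] == 0]
    # Join the detection context so the review can be prioritized by tag and severity.
    report = silent.merge(detections, on="detection", how="left")
    return report.sort_values(["is_booster", "severity"], ascending=False)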
Noisy detections
You might have noticed that we didn’t jump straight into noisy detections. That’s intentional. We wanted to ease into it, because dealing with noisy detections is one of the biggest challenges both SOC teams and detection engineers face.
Whether you call it detection efficiency, reducing alert fatigue, or tracking detection performance, these are all different ways to describe the same problem.
Here comes another unpopular opinion. Regardless of how an alert is classified (false positive, true positive, or benign positive), if a standalone custom detection (i.e., not an out-of-the-box one) consistently produces more than one result per day, it should be considered eligible for maintenance. The probability of a detection triggering every single day, or most days in a month (no random spikes), and still being a true positive, is essentially zero.
You might be wondering: if SOC classification data is not reliable enough for analysis, what should you use instead? The answer is in the time-series alert data and the patterns it reveals. The more trigger data you collect and the longer you retain it, the clearer your understanding of a detection’s true behavior becomes. In our next blog post, we will explore how to leverage time-series trends to make dependable, data-driven conclusions using an Azure Workbook dashboard.
As a general rule, a detection should be monitored for at least 90 days before deciding whether it requires maintenance, tuning, allowlisting, or a rework of its core logic. Of course, this depends on the environment. For historical analysis, we would argue that you need at least one year of alert trigger data if your organization is mature, or six months if you are just beginning your detection maintenance journey.
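A sketch of that heuristic over the same daily time-series data, using the one-result-per-day rule of thumb from above (the 90-day window and the ratios are illustrative; adjust them to your environment):
import pandas as pd

def noisy_candidates(daily_counts: pd.DataFrame,
                     window_days: int = 90,
                     min_active_ratio: float = 0.8) -> pd.DataFrame:
    """Flag detections that fire on most days rather than in isolated spikes.
    daily_counts: environment, detection, date, count."""
    cutoff = daily_counts["date"].max() - pd.Timedelta(days=window_days)
    recent = daily_counts[daily_counts["date"] >= cutoff].dropna(subset=["count"])
    stats = recent.groupby(["environment", "detection"]).agg(
        total_alerts=("count", "sum"),
        days_observed=("count", "size"),
        days_with_alerts=("count", lambda s: (s > 0).sum()),
    ).reset_index()
    stats["avg_per_day"] = stats["total_alerts"] / stats["days_observed"]
    stats["active_ratio"] = stats["days_with_alerts"] / stats["days_observed"]
    # "Consistently more than one result per day" = a high average AND firing on most
    # observed days, which filters out detections that only had a one-off spike.
    noisy = stats[(stats["avg_per_day"] > 1) & (stats["active_ratio"] >= min_active_ratio)]
    return noisy.sort_values("avg_per_day", ascending=False)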

Tuning
Another use of our data is to support tuning as a whole. Tuning is an umbrella term that covers far more than simply suppressing noisy alerts, and our data gives us several entry points to approach it systematically.
Let’s start with allowlisting and exclusions. With the metadata we already have, we can begin to re-evaluate exclusion logic in a more structured way.
For instance, some exclusions were intentionally “tagged” as TEMP to remind us that they should be reviewed or deprecated within a reasonable time frame. In reality, temporary exclusions tend to linger far longer than planned. When that happens, they slowly turn into blind spots or suppress meaningful signals, which is exactly why periodic re-evaluation is non-negotiable.
Another valuable angle is comparing the volume and complexity of exclusions for the same detection across multiple environments. Naturally, environments differ in maturity, configuration, and analyst behavior, so some tenants will always look “quieter” or “cleaner” than others. People also apply exclusions differently, which adds more variability. Even with these differences, the comparison still helps highlight outliers. If one environment has a detection buried under excessive allowlisting, while others run it with minimal or no exclusions, that is a strong signal that the rule may be over-tuned.
The goal is simple: understand which exclusions are justified, which are outdated, and which were added as short-term workarounds that never received the follow-up they needed. Tightening this process helps preserve the long-term integrity and effectiveness of your detections.
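Both checks lend themselves to a simple sketch, reusing the per-environment exclusion statistics gathered earlier (the table layout and the outlier cut-off are assumptions):
import pandas as pd

def exclusion_review(excl: pd.DataFrame) -> pd.DataFrame:
    """excl: one row per environment/detection with exclusion_lines, temp_exclusions
    and expired_exclusions, as produced by the allowlisting parsing step."""
    # 1. TEMP entries that have outlived their expiration date.
    overdue = excl[excl["expired_exclusions"] > 0].assign(reason="expired TEMP exclusion")
    # 2. Environments where a detection carries far more exclusion lines than the
    #    median for the same detection elsewhere (a rough over-tuning signal).
    median_lines = excl.groupby("detection")["exclusion_lines"].transform("median")
    outliers = excl[excl["exclusion_lines"] > 2 * median_lines].assign(
        reason="exclusion volume outlier vs. other environments"
    )
    return pd.concat([overdue, outliers], ignore_index=True)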


Severity re-alignment ensures that the SOC is prioritizing the detections that truly matter. As detection logic matures, some rules naturally become more impactful, while others lose relevance. The “booster” tag, the TP metadata, and even the behavior of the same detection in other environments all provide valuable signals. When combined with time-series alert data, they help identify detections that rarely trigger, but still carry high security value. Raising the severity of these detections directs analyst attention toward events with meaningful risk, rather than those that simply generate noise. This continuous refinement keeps your detection ecosystem aligned with both the evolving threat landscape and your operational priorities.
Threshold alignment is another critical area of analysis. User behavior shifts, new applications enter the environment, and infrastructure evolves, which means thresholds that made sense a year ago may no longer be accurate. By examining the “Threshold” tag in your core detection data and correlating it with historical alert volume, with at least 90 days of data, you can identify whether a threshold has drifted out of alignment. This helps prevent under-alerting caused by thresholds that are too relaxed and over-alerting caused by thresholds that no longer reflect the current operational reality.
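The two re-alignment checks described above boil down to simple filters over the combined tables; a sketch, with assumed column names and an assumed numeric severity encoding:
import pandas as pd

def realignment_candidates(df: pd.DataFrame) -> dict:
    """df: one row per environment/detection, joined from the detection data, the env
    metadata and 90-day alert statistics (is_booster, true_positives, is_threshold,
    severity as 1=Informational .. 4=High, avg_per_day, alerts_last_90d)."""
    # Severity re-alignment: rarely-firing, proven high-value detections that are
    # still classified below 'High' deserve a closer look.
    raise_severity = df[
        (df["alerts_last_90d"] <= 5)
        & (df["is_booster"] | df["true_positives"])
        & (df["severity"] < 4)
    ]
    # Threshold alignment: threshold-based detections whose recent volume suggests
    # drift, either completely silent or far too chatty for a thresholded rule.
    threshold_drift = df[
        df["is_threshold"]
        & ((df["alerts_last_90d"] == 0) | (df["avg_per_day"] > 1))
    ]
    return {"raise_severity": raise_severity, "threshold_drift": threshold_drift}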

Deprecation
Deprecating or retiring a detection from your detection library is a challenging and often complicated task. There are, however, some cases where the decision is much more straightforward. For example, if a detection was originally created to cover a specific CVE that is now patched across all modern systems, it naturally becomes a candidate for removal. By using the “CVE” tag, you can quickly surface all detections in this category and re-evaluate their relevance.
Another group that falls into this “not-so-challenging” bucket covers cases where the underlying attack technique is no longer relevant, where the required telemetry or schema changes or disappears, where a log source is no longer ingested, or where the platform’s native out-of-the-box detections (for example, Microsoft’s) already cover the behavior, leaving your custom detection with no meaningful purpose.
Now, the real puzzle for strong solvers is deciding when and why to deprecate a perfectly working detection. After some thought, we concluded that the combination of the following factors makes a strong case for retirement (a small scoring sketch follows the list):
- High trigger rates over time (over at least 6 months of behavior data).
- Multiple maintenance cycles (looking at “maintenance_cycles” in the env_usecase.yml file).
- A large number of exclusion lines.
- All of the above, occurring across multiple environments.
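A sketch of how those factors can be combined into a simple retirement score (the weights and cut-offs are illustrative, not a recommendation):
import pandas as pd

def retirement_score(df: pd.DataFrame) -> pd.DataFrame:
    """df: one row per detection, aggregated across environments, with avg_per_day_6m,
    maintenance_cycles_max, exclusion_lines_total and n_environments."""
    score = (
        (df["avg_per_day_6m"] > 1).astype(int)             # persistently high trigger rate
        + (df["maintenance_cycles_max"] >= 3).astype(int)  # tuned over and over again
        + (df["exclusion_lines_total"] >= 10).astype(int)  # heavy allowlisting
        + (df["n_environments"] >= 3).astype(int)          # same story in several environments
    )
    # A detection ticking all four boxes is a strong candidate for a deprecation review.
    return df.assign(retirement_score=score).sort_values("retirement_score", ascending=False)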
Drifting from core data science principles
Warren Buffett once remarked, “Money is not everything in life.” The same principle applies here: analysis is not everything in data science. Even with a hundred detections waiting for review, the real value lies in what more we can uncover. Let’s explore the additional opportunities hidden within the data we’ve collected.
Prioritization techniques
Not all detections are equal, and you need a way to prioritize the maintenance of your high-value detections. But how do you identify which detections are truly high value? In our opinion, it comes from combining any observed true positives from the past, whether they occurred during red team or purple team exercises, real incidents, or, in the case of large organizations, across multiple environments. We would also add low trigger rate as an additional prerequisite, since high-fidelity detections rarely generate noise.
Note: If you also store and trust your SOC classification data (and you can revisit our thoughts on that in the Unreliable data section below), you can incorporate precision data into your evaluation process. Precision represents the percentage of true positives compared to all triggers, calculated as TP / (TP + FP), and it provides an additional signal when assessing the value of a detection. If a detection has high precision, it should be considered a strong candidate for the “booster” tag.
What if you also want to deprioritize broad or overly-generic detections? In those cases, you need a way to downgrade a detection to an “indicator”. You can make that decision by using time-series trigger data that shows consistently high alert volume over long periods of time, combined with a large quantity of tuning code and multiple maintenance cycles. When all these metrics line up, you should honestly ask yourself: ”Why does this detection keep making my life miserable?”. This is usually a sign that the detection provides weak standalone value and should be treated as supporting context, rather than a primary signal.
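A sketch of this promotion and demotion logic, including the precision calculation from the note above; the column names are assumptions, and SOC classification counts are used only as a supporting signal:
import pandas as pd

def classify_value(df: pd.DataFrame) -> pd.DataFrame:
    """df: one row per detection with tp_count, fp_count, true_positives (confirmed),
    avg_per_day, exclusion_lines_total and maintenance_cycles_max."""
    out = df.copy()
    # Precision = TP / (TP + FP); a supporting signal only, not ground truth.
    denom = out["tp_count"] + out["fp_count"]
    out["precision"] = (out["tp_count"] / denom).where(denom > 0)
    # Booster candidates: confirmed hits in the past, low noise, decent precision.
    out["booster_candidate"] = (
        out["true_positives"]
        & (out["avg_per_day"] < 0.2)
        & (out["precision"].fillna(0) > 0.5)
    )
    # Indicator candidates: chronically loud and heavily tuned despite repeated cycles.
    out["indicator_candidate"] = (
        (out["avg_per_day"] > 1)
        & (out["exclusion_lines_total"] >= 10)
        & (out["maintenance_cycles_max"] >= 3)
    )
    return out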

Other useful prioritization criteria include the age of the detection (by looking at the “creation_date”), since older detections are naturally more eligible for review, as well as detections that have not been reviewed recently (by looking at the “when_last_reviewed”) and are deployed across multiple environments. The more environments a detection is deployed in, the greater the impact of maintaining or improving it.
Note: the “booster” tag is not only for prioritization, but also a way for large organizations to deploy a set of prioritized, proven, custom detections in new environments, or environments with little to no custom content. For example, we recommend using “booster” tagged detections from our FalconForce repository at new clients for our Sentry Detect service. Learn more at https://falconforce.nl/services/blue-teaming/advanced-detection-content-services/.
Reporting
I would be damned if I didn’t mention reporting, as custom detections are mostly used in mature, large and regulated corporations. Jokes aside, the data we collect is, of course, extremely useful for reporting and high-level overviews as well.
We are not going to dive deep into reporting in this blog post, but most reporting ideas come from the need for cross-environment correlation of detection data. This can include:
- Tracking true positives across all environments.
- Comparing allowlisting code lines and identifying whether exclusions overlap between environments.
- Monitoring threshold increases or decreases for threshold-based detections across multiple environments.
For example, if you track true positives across multiple environments, you can recommend these proven detections to other teams or other organizations. You can also prioritize detections with a demonstrated record of catching real malicious behavior.
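As a sketch, a small pandas pivot over the combined per-environment table from the earlier steps produces a cross-environment view that is easy to drop into a report (column names are assumptions):
import pandas as pd

def cross_env_report(df: pd.DataFrame) -> pd.DataFrame:
    """df: one row per environment/detection with true_positives and exclusion_lines,
    as collected in the earlier sketches."""
    pivot = pd.pivot_table(
        df,
        index="detection",
        columns="environment",
        values=["true_positives", "exclusion_lines"],
        aggfunc="max",  # one row per detection/environment pair, so this is a pass-through
    )
    # Detections with confirmed TPs in the most environments float to the top.
    tp_envs = pivot["true_positives"].sum(axis=1).sort_values(ascending=False)
    return pivot.loc[tp_envs.index]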
Unreliable data
Before we reach the end of this blog post, let’s throw in one more unpopular opinion. Yes, I know, that was already a handful, but that’s what happens when you ask tired detection engineers about their experience investigating the root causes of “noisy” detections.

There is always a concern about unreliable or missing data. We all want clean, structured, fully consistent datasets, but the reality in detection engineering is far from that ideal. And when the data itself is shaky, the insights you extract become shaky as well.
Metrics like TP, FP, and even BP rates may look valuable, but they all rely heavily on consistent and accurate SOC classifications, which are often influenced by shift differences, analyst subjectivity, workload, and inconsistent tagging. This makes these rates too unreliable to serve as the foundation for any meaningful analysis.
This brings us to SOC classification data in general. It is a great addition to your tuning dashboard, and absolutely worth collecting, but it should not be treated as a source of absolute truth. At best, it is a supporting signal. At worst, it becomes noise that looks like signal.
This is why trigger data, time-series patterns, metadata, and deployment context often give you a clearer picture than TP or FP rates ever will.
Conclusion
In this blog, we walked through our approaches and insights in collecting and using data for detection maintenance. Even though there is always more data we could collect, the key is to focus on the data that actually fits (y)our needs, and to experiment as much as possible before deciding to collect anything new.
The way we see it is simple: collect a small subset of data, experiment with it, make sure it is consistent across multiple environments, start analyzing it, and see if you can draw useful conclusions. If you can, automate the collection process and build tooling around that data to streamline querying and analysis. If you cannot, drop it and collect something else. This is the essence of iterative processes.
The data presented in this blog can always be enriched, and the level of enrichment depends entirely on the maturity of each organization. We are sharing the data that has worked well for us and that we know brings value, while we continue testing additional data and iterating across more environments in parallel, in close collaboration with our clients. We have more data to show, and we plan to update this blog in the future. As always, our goal is to provide maintenance building blocks and make the community think, while helping make maintenance a bit more “mainstream.”
Data science is hard. Detection maintenance is hard. But that’s not a reason for all of us to stop working in cybersecurity and start herding sheep instead, right? Let us know your thoughts on this blog, give us feedback, and let’s maintain this discussion.
So, what’s next for all this data? Let’s bring it to life with visual analytics. More on that in the next blog.