Isolation Forest + eBPF events to create a Linux-based endpoint detection system [P]
Hey everyone. I’ve been working on a machine learning project called guardd and wanted to get some feedback on the ML side of it.
It’s basically a host-based anomaly detection system for Linux using Isolation Forest. I’m collecting exec and network events, grouping them into 60-second windows, and turning each window into a feature vector that gets scored by the model.
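To make the windowing step concrete, here is a rough sketch of bucketing events into fixed 60-second windows. It is a simplified illustration, not the actual guardd code: the event dicts and their fields (`ts`, `kind`, `comm`) are made up for the example, and it assumes non-overlapping windows.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # window length mentioned in the post


def group_into_windows(events, window_seconds=WINDOW_SECONDS):
    """Bucket events into fixed, non-overlapping windows keyed by window start."""
    windows = defaultdict(list)
    for event in events:
        # Floor the timestamp to the start of the window it falls in.
        start = int(event["ts"] // window_seconds) * window_seconds
        windows[start].append(event)
    # Return a plain dict ordered by window start time.
    return dict(sorted(windows.items()))


# Hypothetical events; "ts" is seconds since some reference point.
events = [
    {"ts": 5.0, "kind": "exec", "comm": "bash"},
    {"ts": 42.5, "kind": "net", "comm": "curl"},
    {"ts": 61.0, "kind": "exec", "comm": "ls"},
]
windows = group_into_windows(events)
for start, bucket in windows.items():
    print(start, [e["comm"] for e in bucket])
```

Fixed windows are the simplest choice. Sliding or overlapping windows would be an alternative if activity that straddles a window boundary turns out to matter.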
Right now the features are things like counts of exec and network events; the number of unique processes, files, IPs, and ports that show up in a window; some parent-child relationship patterns; a few simple ratios between features; and some “new vs. baseline” tracking, such as processes or relationships that weren’t seen during training.
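As an illustration of that kind of feature vector, here is a minimal sketch that turns one window of events into counts, unique-value counts, a ratio, and "new vs. baseline" counts. The field names (`kind`, `comm`, `parent`, `ip`, `port`), the feature names, and the `+ 1` smoothing in the ratio are assumptions for the example, not what guardd actually uses. File features are left out to keep it short.

```python
def window_features(events, known_comms, known_pairs):
    """Summarize one window of events as a flat dict of numeric features."""
    execs = [e for e in events if e["kind"] == "exec"]
    nets = [e for e in events if e["kind"] == "net"]

    comms = {e["comm"] for e in events}
    pairs = {(e["parent"], e["comm"]) for e in execs}  # parent-child relationships
    ips = {e["ip"] for e in nets}
    ports = {e["port"] for e in nets}

    return {
        "exec_count": len(execs),
        "net_count": len(nets),
        "unique_procs": len(comms),
        "unique_ips": len(ips),
        "unique_ports": len(ports),
        "unique_parent_child": len(pairs),
        # Ratio feature; the +1 avoids division by zero in exec-free windows.
        "net_per_exec": len(nets) / (len(execs) + 1),
        # "New vs. baseline": things not seen while collecting training data.
        "new_procs": len(comms - known_comms),
        "new_parent_child": len(pairs - known_pairs),
    }


# Hypothetical window and baseline sets.
events = [
    {"kind": "exec", "comm": "bash", "parent": "sshd"},
    {"kind": "exec", "comm": "curl", "parent": "bash"},
    {"kind": "net", "comm": "curl", "ip": "10.0.0.5", "port": 443},
    {"kind": "net", "comm": "curl", "ip": "10.0.0.5", "port": 80},
]
known_comms = {"bash", "sshd"}
known_pairs = {("sshd", "bash")}

features = window_features(events, known_comms, known_pairs)
print(features)
```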
Training is fully unsupervised. It collects baseline data, trains an Isolation Forest, then uses score_samples during detection. The threshold is just based on a percentile from the training score distribution.
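Roughly, the train-then-threshold flow looks like the sketch below, using scikit-learn's `IsolationForest`. The baseline data here is synthetic, and the 1st-percentile cutoff, the hyperparameters, and the `is_anomalous` helper are placeholder assumptions, not the values guardd uses. Note that `score_samples` returns lower values for more abnormal points, so the cutoff sits at the low end of the training distribution.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic stand-in for baseline feature vectors (one row per window).
baseline = rng.normal(loc=10.0, scale=2.0, size=(500, 4))

model = IsolationForest(n_estimators=100, random_state=0)
model.fit(baseline)

# Lower score = more anomalous, so the threshold is a low percentile.
train_scores = model.score_samples(baseline)
PERCENTILE = 1.0  # assumed cutoff: roughly the lowest 1% of baseline scores
threshold = np.percentile(train_scores, PERCENTILE)


def is_anomalous(window_vector):
    """Flag a window whose score falls below the training-derived threshold."""
    vector = np.asarray(window_vector, dtype=float).reshape(1, -1)
    score = model.score_samples(vector)[0]
    return bool(score < threshold)


normal_window = [10.0, 10.0, 10.0, 10.0]    # close to the baseline mean
odd_window = [100.0, 100.0, 100.0, 100.0]   # far outside anything in the baseline

print(is_anomalous(normal_window), is_anomalous(odd_window))
```

Because the threshold comes from the training scores themselves, about 1% of baseline-like windows will be flagged by construction with this cutoff, which puts a floor on the false positive rate.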
The main issue right now is false positives, especially from things like browsers. Anything with a lot of variance can look anomalous depending on what happened to be in the baseline, so the model is pretty sensitive to the training data.
Next, I’m looking at adding some time-based features like time of day or activity patterns, improving normalization a bit, and trying to handle bursty behavior better.
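For the time-based and bursty parts, one common option (a sketch of what I have in mind, not something already in guardd) is to encode time of day as a sine/cosine pair so 23:59 and 00:01 end up close together, and to log-scale raw counts so a burst of 1,000 events doesn't dominate the feature space. The function names are made up, and using UTC is an assumption; local time may track user activity better.

```python
import math
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def time_of_day_features(ts):
    """Encode a Unix timestamp's time of day as a point on the unit circle."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)  # assumption: UTC
    seconds = dt.hour * 3600 + dt.minute * 60 + dt.second
    angle = 2 * math.pi * seconds / SECONDS_PER_DAY
    return math.sin(angle), math.cos(angle)


def damp_count(count):
    """Compress bursty counts: log(1 + count) grows slowly for large values."""
    return math.log1p(count)


# Just before and just after midnight map to nearby points.
late = time_of_day_features(23 * 3600 + 59 * 60)
early = time_of_day_features(60)
print(math.dist(late, early))

# A 100x jump in raw count becomes a much smaller jump after damping.
print(damp_count(10), damp_count(1000))
```

Isolation Forest splits on raw thresholds, so a monotonic transform like `log1p` does not change the ordering within a single feature. It does change where random split points tend to land, and it affects any ratios computed from the transformed counts.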
Curious what people think about feature design for this kind of data, how to make Isolation Forest less sensitive to noisy but normal behavior, and whether staying fully unsupervised makes sense here or if moving toward something more hybrid would be better.
Would appreciate any thoughts on the approach.
Repo is here: https://github.com/benny-e/guardd.git