Graph Neural Networks for CMDB Intelligence

Program: Data Science Master's Degree
Location: Not Specified (remote)
Student: John C. Platais

Configuration Management Databases (CMDBs) are essential for modern IT operations. They document the systems, applications, and infrastructure that support enterprise environments. However, many CMDBs become outdated or incomplete over time. This can happen when discovery tools fail to identify all assets, when relationships change faster than updates are made, or when records are entered manually. The resulting gaps can undermine impact analysis, incident response, change management, and automated remediation.

The goal of this project was to investigate whether machine learning, specifically graph-based models such as Graph Neural Networks (GNNs), can improve the accuracy of CMDB data. The tasks considered were predicting missing relationships, identifying misclassified assets, and recognizing service-level structures. To allow experimentation at a realistic scale while addressing privacy concerns, a synthetic dataset of approximately 3.55 million nodes and 3.63 million edges was created. The dataset also included metadata fields and noise to reflect the complexities typical of real CMDBs.

The project compared feature-engineered baselines (XGBoost and Gradient Boosting) with graph-native models, including GraphSAGE, Graph Autoencoder (GAE), and Graph Isomorphism Network (GIN). On node classification, the GNNs performed slightly better than the baseline, reaching 60% accuracy and a ROC AUC of 0.76 compared to 59% accuracy and a ROC AUC of 0.74. On link prediction and graph classification, however, the traditional models outperformed the GNNs. The XGBoost link-prediction model achieved a ROC AUC of approximately 0.90, whereas the Graph Autoencoder produced near-random results, illustrating the challenges posed by sparse and incomplete neighborhoods. For graph classification, Gradient Boosting consistently produced strong results, while GNN-based models showed unstable performance due to limited structural diversity in the synthetic subgraphs. These findings suggest that hybrid approaches, which combine engineered features with learned graph embeddings, may be more effective than relying solely on either classical or graph-based methods.
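
To make the feature-engineered approach to link prediction more concrete, the following is a minimal sketch and not the project's actual pipeline. It assumes three common pairwise features (common neighbors, Jaccard similarity, and preferential attachment); the specific features used in the project are not stated here, so these are illustrative choices. The toy graph, the node names, and the helper functions `build_adjacency`, `pair_features`, and `roc_auc` are all hypothetical. In a fuller version, the feature dictionaries would be passed to a classifier such as XGBoost; here a single feature is used directly as the score so that the example needs only the standard library.

```python
def build_adjacency(edges):
    """Return an undirected adjacency map {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def pair_features(adj, u, v):
    """Hypothetical engineered features for a candidate edge (u, v)."""
    # Exclude the pair itself so an existing edge does not leak into the features.
    nu = adj.get(u, set()) - {v}
    nv = adj.get(v, set()) - {u}
    common = len(nu & nv)
    union = len(nu | nv)
    return {
        "common_neighbors": common,
        "jaccard": common / union if union else 0.0,
        "pref_attachment": len(nu) * len(nv),
    }


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy CMDB-like graph (assumed data): applications linked to servers and a database.
observed_edges = [
    ("app1", "srv1"), ("app1", "db1"),
    ("app2", "srv1"), ("app2", "db1"),
    ("app3", "srv2"), ("app3", "db1"),
]
adj = build_adjacency(observed_edges)

# Candidate relationships with assumed labels: 1 = missing true edge, 0 = no edge.
candidates = [
    (("app1", "app2"), 1),
    (("srv1", "db1"), 1),
    (("srv2", "srv1"), 0),
    (("app3", "srv1"), 0),
]
labels = [y for _, y in candidates]
scores = [pair_features(adj, u, v)["jaccard"] for (u, v), _ in candidates]

print("ROC AUC on toy candidates:", roc_auc(labels, scores))
```

The toy candidates are deliberately easy to separate, so the score on this example says nothing about the results reported above. The sketch only shows why such features depend on observed neighborhoods: a candidate pair whose endpoints share no neighbors receives a Jaccard score of zero regardless of whether the relationship actually exists.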