Infrastructure Monitoring

Infrastructure monitoring is the practice of observing and analyzing the performance and availability of the servers, virtual machines, containers, networks, and other components that deliver applications or services to end users.

Its benefits include:

Performance Optimization: Track key performance metrics to identify areas for improvement, such as optimizing CPU or memory usage to enhance performance.
Proactive Problem Detection: Real-time infrastructure monitoring detects problems before they affect end users or interrupt service. Alerting and notification features help IT teams find and resolve infrastructure issues before they escalate into serious incidents.
SLA Compliance: Help enterprises meet Service Level Agreement (SLA) requirements by tracking and reporting Key Performance Indicators (KPIs). Monitoring metrics such as uptime, response time, and availability provides necessary data to ensure SLA compliance and demonstrate IT service reliability.
Capacity Optimization and Cost Management: Monitoring infrastructure resources and usage enables organizations to optimize resource allocation, identify idle or underutilized resources, and make informed decisions about resource configuration.
Capacity Planning and Scalability: By monitoring infrastructure metrics over time, organizations can analyze usage patterns, predict future resource requirements, and plan capacity expansion.

The infrastructure monitoring module displays all collected host, container, process, and network data, helping users quickly understand resource usage and performance. The default view is the host list page.

Hosts

Host List

The host list page displays all collected host resource data.

At the top left, you can switch between the list and honeycomb views. The search bar at the top right filters the data by host name.

The quick filter box on the left narrows the host data using multiple filter options. By default, the filter options are operating system and host status.

The data list on the right shows, by default, each host's name, operating system, status, CPU usage, memory usage, and CPU load for the selected time period.
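The combination of name search and quick filters described above can be sketched as follows. This is only an illustration, not the product's implementation: the `Host` record, the `filter_hosts` function, and the choice of case-insensitive substring matching are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class Host:
    # Hypothetical record mirroring some of the list columns.
    name: str
    os: str
    status: str
    cpu_usage: float


def filter_hosts(
    hosts: Iterable[Host],
    name_query: str = "",
    os_in: Optional[Set[str]] = None,
    status_in: Optional[Set[str]] = None,
) -> List[Host]:
    """Keep hosts matching the search text and every active quick filter."""
    query = name_query.lower()
    return [
        h
        for h in hosts
        # Assumption: the search is a case-insensitive substring match.
        if query in h.name.lower()
        # A filter left as None is treated as "not selected".
        and (os_in is None or h.os in os_in)
        and (status_in is None or h.status in status_in)
    ]
```

For example, `filter_hosts(hosts, name_query="web", status_in={"online"})` would return only online hosts whose name contains "web".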

Data Timeliness Notes

List data is refreshed every 5 minutes.
How list status and metrics are determined:
- Status: A host is considered offline if it has reported no data within the last 5 minutes.
- Performance metrics: Each value is the average over the last 15 minutes, recalculated every 5 minutes; it is not real-time data.
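The two rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the `(timestamp, value)` sample format, and the exact boundary handling (a report exactly 5 minutes old still counts as online) are hypothetical and not taken from the product.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

OFFLINE_AFTER = timedelta(minutes=5)    # no report within this span -> offline
AVERAGE_WINDOW = timedelta(minutes=15)  # span covered by each displayed value


def host_status(last_report: datetime, now: datetime) -> str:
    """Return "offline" if the last report is older than 5 minutes."""
    return "offline" if now - last_report > OFFLINE_AFTER else "online"


def windowed_average(
    samples: List[Tuple[datetime, float]], now: datetime
) -> Optional[float]:
    """Average the samples reported during the last 15 minutes.

    Returns None when no sample falls inside the window.
    """
    recent = [value for ts, value in samples if now - AVERAGE_WINDOW < ts <= now]
    return sum(recent) / len(recent) if recent else None
```

Calling `windowed_average` once per 5-minute refresh would give a value that lags real-time readings, which is why a short CPU spike may not show up in the list.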

Host Honeycomb View

The host honeycomb page displays all collected host resource data in a graphical format.

Each hexagon represents a host, and its fill color reflects a metric: CPU usage by default, switchable to memory usage. Hovering over a hexagon displays that host's name, CPU usage, memory usage, and CPU load.

Host Details

Clicking a host in the list opens a details drawer that shows the host's system information, containers, processes, and logs.

System Information: Shows the host's attributes, processor, network, memory, and disk information.
Containers: Shows the containers that ran on the host over the past 15 minutes, including container name, status, CPU usage, and memory usage, sorted by container name by default.
Processes: Shows the processes that ran on the host over the past 15 minutes, including process name, status, CPU usage, and memory usage, sorted by process name by default.
Logs: Shows log entries from the past hour, including log time, level, and message, sorted by time in descending order. Clicking a log entry opens a new page positioned at the details of the selected log.
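The time windows and default sort orders used in the drawer can be illustrated as below. This is a sketch only: the dictionary keys (`time`, `name`) and both function names are hypothetical, chosen for the example.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def recent_logs(
    entries: List[Dict], now: datetime, window: timedelta = timedelta(hours=1)
) -> List[Dict]:
    """Keep log entries from the past hour, newest first."""
    recent = [e for e in entries if now - window <= e["time"] <= now]
    return sorted(recent, key=lambda e: e["time"], reverse=True)


def sort_by_name(rows: List[Dict]) -> List[Dict]:
    """Default ordering assumed for the container and process tabs."""
    return sorted(rows, key=lambda r: r["name"])
```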

Containers

Container List

The container list page displays all collected container resource data.

At the top left, you can switch between the list and honeycomb views. The search box at the top right filters the data by container name.

The quick filter box on the left narrows the container data using multiple filter options. By default, the filter options are host, container image, and container status.

The data list on the right shows, by default, each container's name, operating system, status, image, IP, host, CPU usage, and memory usage for the selected time period.

Data Timeliness Notes

List data is refreshed every 5 minutes.
How list status and metrics are determined:
- Status: A container is considered offline if it has reported no data within the last 5 minutes.
- Performance metrics: Each value is the average over the last 15 minutes, recalculated every 5 minutes; it is not real-time data.

Container Honeycomb View

The container honeycomb page displays all collected container resource data in a graphical format.

Each hexagon represents a container, and its fill color reflects a metric: CPU usage by default, switchable to memory usage. Hovering over a hexagon displays that container's name, CPU usage, and memory usage.

Processes

The process list page displays all collected process data.

The search box at the top filters the data by tag and tag value, such as process name or host.

The quick filter box on the left narrows the process data using multiple filter options. By default, the filter options are host, status, and username.

The data list on the right shows, by default, each process's name, username, host, status, CPU usage, memory usage, and start time for the selected time period.

Data Timeliness Notes

List data is refreshed every 5 minutes.
How list status and metrics are determined:
- Status: A process is considered offline if it has reported no data within the last 5 minutes.
- Performance metrics: Each value is the average over the last 15 minutes, recalculated every 5 minutes; it is not real-time data.

Network

List

The network list page displays all service data collected through eBPF.

The search bar at the top filters the data by service name.

The data list below it shows, by default, each service's name, type, error rate, latency, maximum network time, and instance count (online count/total reported count) for the selected time period.
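As a rough illustration of two of these columns, the sketch below formats the instance count as "online/total" and computes an error rate. Both functions and the `online` key are hypothetical, and the error-rate formula (failed calls divided by total calls) is an assumption, since the page does not define how the product calculates it.

```python
from typing import Dict, List


def instance_count_label(instances: List[Dict]) -> str:
    """Format the instance count as "online/total reported"."""
    online = sum(1 for inst in instances if inst["online"])
    return f"{online}/{len(instances)}"


def error_rate(error_calls: int, total_calls: int) -> float:
    """Assumed definition: share of calls that failed; 0.0 when there were no calls."""
    return error_calls / total_calls if total_calls else 0.0
```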

Service Topology

The service topology page visually displays all collected service-to-service call relationships, along with service names and types.

Hovering over a service icon displays that service's name, error rate, latency, network time, and instance count. Services that have a direct call relationship with the hovered service are highlighted, and all others are grayed out. A green service icon indicates an error rate of zero; a red service icon indicates an error rate greater than zero.
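The icon color rule and the hover highlighting can be sketched as follows. The representation of call relationships as `(caller, callee)` pairs and both function names are assumptions made for this example.

```python
from typing import Iterable, Set, Tuple


def icon_color(error_rate: float) -> str:
    """Green for a zero error rate, red for anything above zero."""
    return "green" if error_rate == 0 else "red"


def direct_neighbors(edges: Iterable[Tuple[str, str]], service: str) -> Set[str]:
    """Services that call, or are called by, the given service.

    These would stay highlighted on hover; all others would be grayed out.
    """
    neighbors = set()
    for caller, callee in edges:
        if caller == service:
            neighbors.add(callee)
        if callee == service:
            neighbors.add(caller)
    neighbors.discard(service)  # ignore self-calls
    return neighbors
```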

Clicking a service icon reveals a "View Upstream/Downstream" button. Clicking the button shows that service's upstream and downstream topology; when a service has multiple instances, the call relationships of each instance are shown separately.
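One way to picture the upstream/downstream view is as a split of instance-level call records. The sketch below is hypothetical throughout: the `Call` record and its fields are invented for the example, and it assumes that "upstream" means the callers of a service and "downstream" means the services it calls, considering direct calls only.

```python
from typing import List, NamedTuple, Tuple


class Call(NamedTuple):
    # Hypothetical instance-level call record.
    caller: str
    caller_instance: str
    callee: str
    callee_instance: str


def upstream_downstream(
    calls: List[Call], service: str
) -> Tuple[List[Call], List[Call]]:
    """Split direct calls into those entering and those leaving a service.

    Records stay at instance level, so a service with several instances
    yields one entry per distinct instance-to-instance relationship.
    """
    upstream = [c for c in calls if c.callee == service]
    downstream = [c for c in calls if c.caller == service]
    return upstream, downstream
```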

Host Groups

The host groups page shows which services are running on different hosts. A green service icon indicates zero error rate; a red service icon indicates error rate greater than zero.
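Conceptually, this view groups reported service instances by the host they run on. A minimal sketch, assuming each instance record carries hypothetical `host` and `service` keys:

```python
from collections import defaultdict
from typing import Dict, List


def services_by_host(instances: List[Dict]) -> Dict[str, List[str]]:
    """Map each host name to the sorted, de-duplicated services running on it."""
    groups = defaultdict(set)
    for inst in instances:
        groups[inst["host"]].add(inst["service"])
    return {host: sorted(services) for host, services in sorted(groups.items())}
```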