Services
Modern distributed systems are typically composed of multiple services, each responsible for specific functionality. Observing performance from a service perspective means evaluating the overall system by examining individual services or groups of related services. This makes each service's contribution to, and impact on, overall performance visible, so bottlenecks can be identified more precisely.
A performance problem in a single service can degrade the user experience of the entire system. For example, consider an e-commerce system with product display, shopping cart, order processing, and payment services. If the payment service responds slowly, users wait a long time during payment, and their satisfaction with the whole system drops even when the product display and shopping cart services run smoothly.
Service List
The service list page lists all collected services with their key performance indicators, giving a quick overview of every current service.
The search box at the top allows you to enter a service name to quickly filter target services.
The quick filter box on the left narrows the list through multiple filter options. The default filter options for the service page are service type, environment, version, and service name.
The data list on the right displays the service name, health score, average response time, P99 response time, error rate, and requests per second within the selected time period. By default, the list is sorted by requests per second in descending order; click any other performance column header to sort by that column in ascending or descending order.
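The list columns and the default sort order can be illustrated with a small sketch. The document does not describe how the product computes these values, so everything here is an assumption for illustration: the `ServiceRow` type, the `summarize` and `sort_rows` helpers, and the nearest-rank percentile method are all hypothetical, and the health score is left out because its formula is not given.

```python
import math
from dataclasses import dataclass


@dataclass
class ServiceRow:
    """One row of a service list (hypothetical shape, not the product's schema)."""
    name: str
    avg_response_ms: float
    p99_response_ms: float
    error_rate: float
    requests_per_second: float


def percentile(values, pct):
    """Nearest-rank percentile; the product may use a different method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(name, durations_ms, error_count, window_seconds):
    """Aggregate raw request durations for one service over a time window."""
    total = len(durations_ms)
    return ServiceRow(
        name=name,
        avg_response_ms=sum(durations_ms) / total,
        p99_response_ms=percentile(durations_ms, 99),
        error_rate=error_count / total,
        requests_per_second=total / window_seconds,
    )


def sort_rows(rows, column="requests_per_second", descending=True):
    """Default mirrors the list: requests per second, descending."""
    return sorted(rows, key=lambda row: getattr(row, column), reverse=descending)


rows = [
    summarize("cart", [5, 15], error_count=0, window_seconds=2),
    summarize("payment", [10, 20, 30, 40], error_count=1, window_seconds=2),
]
by_default = sort_rows(rows)
by_avg_ascending = sort_rows(rows, column="avg_response_ms", descending=False)
```

Passing a different `column` and `descending` value corresponds to clicking another column header in the list.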
Service Details
Click on a service name in the service list to open a drawer page on the right showing the service details.
The top shows basic performance data of the selected service, such as health score, error rate, and average response time.
Below that, tabs show detailed information (the timeline in the top left switches the data range):
- Performance: Shows trend graphs of the current service's request count, response time distribution, error request count, health score, database call performance, and downstream call performance.
- Upstream/Downstream Topology: Shows the direct upstream and downstream call topology of the current service.
- Resources: Shows the current service's resource information, including resource name, average response time, P95 response time, P99 response time, error rate, and call count. Additionally, clicking on a resource will show buttons for "View Call Topology" and "View Traces".
- Traces: Shows trace information generated by the current service, including trace start time, service name, resource, duration, method, and HTTP status code. Additionally, clicking on a trace opens that trace's details page in a new page.
- Logs: Shows log information of the current service, including log time, log level, and log message. Additionally, clicking on a log opens that log's details page in a new page.
Topology Graph
The topology graph page visually displays all collected service call relationships, as well as service names and types.
Service icons, outer rings, and colors in the topology graph have the following meanings:
- The center icon indicates the service type; the graph currently distinguishes Web, Database, MySQL, Redis, and other types.
- A solid outer ring indicates the service is an APM service collected through light-agent; a dashed outer ring indicates that no detailed data was collected for that service, which is treated as a third-party call.
- The outer ring color indicates the value range of the current fill metric. Values are divided, from low to high, into four levels: green, yellow, orange, and red. The legend at the bottom right of the page shows the range each color represents. Third-party calls are always displayed in gray.
- The default fill metric is "error rate"; it can be switched to "requests per second", "average response time", or "P99 response time".
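The four-level coloring can be sketched as a simple threshold lookup. The real ranges are whatever the page legend shows; the boundaries in `HYPOTHETICAL_THRESHOLDS`, the metric keys, and the `ring_color` function are invented here purely for illustration.

```python
from bisect import bisect_right

# Levels from low to high values, as described for the outer ring.
LEVEL_COLORS = ("green", "yellow", "orange", "red")

# Hypothetical boundaries between the four levels for each fill metric.
# The actual ranges are shown in the legend of the topology page.
HYPOTHETICAL_THRESHOLDS = {
    "error_rate": (0.01, 0.05, 0.10),
    "avg_response_ms": (100, 500, 1000),
}


def ring_color(value, thresholds, third_party=False):
    """Map a metric value to a ring color; third-party calls are always gray."""
    if third_party:
        return "gray"
    # bisect_right counts how many boundaries the value has reached or passed.
    return LEVEL_COLORS[bisect_right(thresholds, value)]


error_thresholds = HYPOTHETICAL_THRESHOLDS["error_rate"]
low = ring_color(0.0, error_thresholds)
elevated = ring_color(0.07, error_thresholds)
critical = ring_color(0.5, error_thresholds)
external = ring_color(0.5, error_thresholds, third_party=True)
```

Switching the fill metric amounts to choosing a different threshold tuple for the same lookup.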
Hovering over a service icon will display the service's name, error rate, average response time, and requests per second; third-party calls will only show the name. At the same time, direct upstream and downstream call relationships for that service will be highlighted, while others are grayed out. Hovering over the connection line between two services will display performance metrics of their calls, including error rate, P99 response time, and requests per second.
Clicking on a service icon opens a menu for viewing related detail data: upstream/downstream, service details, related logs, or related traces.
- View Upstream/Downstream: Shows only the topology graph of services that are directly upstream or downstream of the selected service. Click "Return to Global Topology" in the top left to return to the global topology graph.
- View Service Details: Jumps to the service details page, carrying over the service name and the current timeline as filters.
- View Related Logs: Jumps to the logs module, carrying over the service name and the current timeline as filters.
- View Related Traces: Jumps to the traces module, carrying over the service name and the current timeline as filters.
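The "View Upstream/Downstream" view can be thought of as keeping only the edges that touch the selected service. The following is a minimal sketch under that assumption; the `(caller, callee)` edge representation, the service names, and both helper functions are hypothetical and do not reflect the product's internal data model.

```python
def direct_neighbors(edges, service):
    """Return (upstream, downstream) services with a direct call relationship."""
    upstream = sorted({caller for caller, callee in edges if callee == service})
    downstream = sorted({callee for caller, callee in edges if caller == service})
    return upstream, downstream


def local_topology(edges, service):
    """Keep only the call edges that start or end at the selected service."""
    return [(caller, callee) for caller, callee in edges if service in (caller, callee)]


# Hypothetical global topology: each pair means "caller calls callee".
edges = [
    ("product-display", "cart"),
    ("cart", "order"),
    ("order", "payment"),
    ("payment", "mysql"),
]

upstream, downstream = direct_neighbors(edges, "order")
order_view = local_topology(edges, "order")
```

In this sketch, selecting `order` keeps `cart` and `payment` and drops `product-display` and `mysql`, since they have no direct call relationship with it.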
Resource Analysis
The resource analysis module displays all collected resources together with the services they belong to, their performance data, and their call relationships.
The search box at the top allows you to enter a resource name to quickly filter target resources.
The quick filter box on the left narrows the list through multiple filter options. The default filter options for the resource analysis page are service name, average response time range, P95 response time range, and P99 response time range.
The data list on the right displays the name, average response time, P95 response time, P99 response time, call count, and error rate for all resources of each service within the selected time period.
Click on a resource name in the list to view that resource's call relationship topology.
- Each card represents a resource and shows the name and type of the service it belongs to.
- The percentage below a card is that resource's share of the calls made by its direct upstream resource; the percentages across a resource's direct downstream resources sum to 100%.
- By default, the downstream call topology of the selected resource is shown. Clicking the plus button to the left of the selected resource card expands that resource's direct upstream calls, without showing their call percentages.
- Resources whose call percentage is below 1% are collapsed by default; click "View More" to expand them. The collapse threshold can be customized in the top right.
- Clicking a resource card shows two buttons: "View Call Topology" switches the call topology to the selected resource's perspective, and "View Traces" opens the traces module with the traces related to that resource.
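The call percentages and the collapse threshold described above can be sketched as follows. This assumes the percentage is derived from call counts, which the document does not state explicitly; the `downstream_shares` function, its parameter names, and the sample resource names are hypothetical.

```python
def downstream_shares(call_counts, collapse_below=1.0):
    """Split a resource's direct downstream calls into visible and collapsed shares.

    call_counts maps each direct downstream resource to its number of calls.
    Shares are percentages of the total, so together they sum to 100.
    """
    total = sum(call_counts.values())
    if total == 0:
        return {}, {}
    shares = {name: 100.0 * count / total for name, count in call_counts.items()}
    visible = {name: pct for name, pct in shares.items() if pct >= collapse_below}
    collapsed = {name: pct for name, pct in shares.items() if pct < collapse_below}
    return visible, collapsed


# Hypothetical downstream calls made by one selected resource.
counts = {"GET /orders": 900, "SELECT orders": 95, "GET /health": 5}

visible, collapsed = downstream_shares(counts)
visible_all, collapsed_none = downstream_shares(counts, collapse_below=0.1)
```

With the default 1% threshold, `GET /health` (0.5% of calls) ends up in the collapsed group, which corresponds to what "View More" would reveal; lowering `collapse_below` mirrors customizing the threshold in the top right.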