Query Pelican Server Metrics via Prometheus
For Pelican >= 7.16.0. Older versions of Pelican may not include all of the metrics listed.
Pelican servers have Prometheus embedded by default and provide a handful of Prometheus metrics to monitor server status. You can access the metrics endpoint at https://<pelican-server-host>:<server-web-port>/metrics to see all the available metrics and their current values. By default, /metrics is a protected endpoint and you must log in to view the page. You can set Monitoring.MetricAuthorization to false in the config to turn off authentication.
Pelican also exposes the Prometheus PromQL query engine at https://<pelican-server-host>:<server-web-port>/api/v1.0/prometheus, where you can query the metrics using Prometheus's powerful query language.
Example: https://<pelican-server-host>:<server-web-port>/api/v1.0/prometheus/query?query=pelican_component_health_status[10m] queries the pelican_component_health_status metric and shows data collected in the past 10 minutes.
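For programmatic access, the same instant query can be issued with any HTTP client; the PromQL expression just needs to be URL-encoded. A minimal sketch in Python (the host and port below are placeholders):

```python
from urllib.parse import urlencode

def build_query_url(host, port, promql):
    """Build a Pelican Prometheus instant-query URL for a PromQL expression."""
    base = f"https://{host}:{port}/api/v1.0/prometheus/query"
    # urlencode percent-escapes the brackets and quotes in the expression
    return f"{base}?{urlencode({'query': promql})}"

url = build_query_url("pelican.example.org", 8444,
                      "pelican_component_health_status[10m]")
print(url)
```

Remember that the metrics endpoints are protected by default, so the actual HTTP request must carry a valid login session or token unless Monitoring.MetricAuthorization is disabled.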
However, Pelican does not support Prometheus's native /graph endpoint or any other native Prometheus web services beyond the two above. For custom data visualizations, Grafana is a popular choice.
Pelican includes metrics from the built-in Gin web server, as well as the Go runtime. For all available metrics, visit https://<pelican-server-host>:<server-web-port>/api/v1.0/prometheus/label/__name__/values.
Pelican also has a set of built-in metrics to monitor Pelican server’s status, listed below.
Counter Metrics
Many metrics in this documentation are counters (typically identified by names ending in _total or _count). Counter metrics are monotonically increasing values that accumulate over time. Important notes about counters:
- Counters reset on restart: When a Pelican server restarts, all counter values reset to 0. The counter then begins accumulating from 0 again.
- Time ranges are required: To get meaningful data from counters, you must use PromQL functions that calculate changes over time, such as:
  - `rate()` calculates the per-second average rate of increase
  - `irate()` calculates the per-second instant rate of increase
  - `increase()` calculates the increase over a time range
- Example usage: Instead of querying `xrootd_server_bytes_total` directly, use `rate(xrootd_server_bytes_total[5m])` to get bytes per second, or `increase(xrootd_server_bytes_total[1h])` to get total bytes transferred in the last hour.
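To see why time-range functions matter, here is a simplified Python sketch of what `increase()` computes over a window of counter samples, including the reset handling (real PromQL additionally extrapolates to the window boundaries):

```python
def counter_increase(samples):
    """Total increase over ordered counter samples. A decrease means the
    counter reset on restart, so the new value counts from 0
    (a simplified model of PromQL's increase())."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur  # reset: restarted from 0
    return total

# 100 -> 150 (+50), reset to 30 (+30), 30 -> 80 (+50)
print(counter_increase([100, 150, 30, 80]))  # 130.0
```

Without the reset check, a server restart would make the naive difference (last minus first) go negative and lose the accumulated count.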
Gauge Metrics
Many metrics in this documentation are gauges. Gauge metrics represent a single numerical value that can go up or down over time. Important notes about gauges:
- Gauges represent current state: Unlike counters, gauges show the current value of something at a point in time (e.g., number of active connections, current CPU usage, available disk space).
- Gauges can be queried directly: You can query gauge metrics directly, without time-range functions like `rate()` or `increase()`. The value represents the current state.
- Gauges reset on restart: When a server restarts, gauge values reset to their initial state (often 0), but they don't accumulate like counters. The gauge simply reflects the new current state after restart.
- Example usage: Query `xrootd_server_io_active` directly to see the current number of active IO operations, or use `avg_over_time(xrootd_server_io_active[5m])` to see the average over the last 5 minutes.
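The difference between a direct gauge query and `avg_over_time()` can be sketched in Python; the sample values are made up:

```python
def instant_value(samples):
    """A direct gauge query returns the most recent sample."""
    return samples[-1]

def avg_over_time(samples):
    """avg_over_time() averages all samples in the window."""
    return sum(samples) / len(samples)

# Hypothetical xrootd_server_io_active samples scraped over a 5m window
io_active = [3, 7, 2, 8]
print(instant_value(io_active))   # 8
print(avg_over_time(io_active))   # 5.0
```

The averaged form smooths out scrape-to-scrape spikes, which is usually what you want for dashboards and alerts on bursty gauges.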
All Servers
All of the Pelican servers have the following metrics:
process_start_time_seconds
The UNIX epoch time in seconds when the Pelican process started.
To get the duration of the Pelican server running time, use the following PromQL:
`time() - process_start_time_seconds`
This yields the duration in seconds.
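The same arithmetic can be done client-side from a scraped sample value; a small sketch (the timestamps are illustrative):

```python
import time

def uptime_seconds(process_start_time, now=None):
    """Equivalent of the PromQL expression time() - process_start_time_seconds."""
    return (now if now is not None else time.time()) - process_start_time

# A server that started at t=1700000000 and is queried at t=1700003600
# has been up for one hour.
print(uptime_seconds(1700000000, now=1700003600))  # 3600
```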
pelican_component_health_status
The health status of Pelican server components. The metric value maps to the following statuses:
Note: This is a gauge metric representing the current health status. Query directly to see the current state of each component.
1: Critical
2: Warning
3: OK
4: Unknown
Label: component
| Label Value | Description | Availability |
|---|---|---|
web-ui | Admin website | All servers |
xrootd | XRootD process | Origin and cache servers |
cmsd | CMSD process | Origin and cache servers |
federation | Advertisement to the Director | Origin and cache servers |
registry | Namespace registration at the Registry | Origin and cache servers |
director | Object transfer tests from the Director | Origin and cache servers |
topology | Data fetch from the OSDF topology server | All servers (OSDF mode only) |
IO-concurrency | Health status indicating whether the average concurrent IO operations exceed the configured concurrency limit, used by the Director to determine if redirects should be reduced | Origin and cache servers |
prometheus | Health status of the embedded Prometheus server. Critical indicates Prometheus failed to start or the server is not ready to receive web requests (metrics unavailable). OK indicates Prometheus started successfully and is ready | All servers |
config-updates | Health status of XRootD configuration file updates (authfile and scitokens.cfg). Critical indicates files are stale beyond the configured timeout and may trigger auto-shutdown if enabled. Warning indicates update failures observed but within timeout. OK indicates both files updated successfully | Origin and cache servers |
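As an illustration of consuming this metric, here is a hypothetical helper (not part of Pelican) that maps the numeric values above back to status names and picks the most severe component:

```python
# Value-to-status mapping for pelican_component_health_status, per the list above.
HEALTH_STATUS = {1: "Critical", 2: "Warning", 3: "OK", 4: "Unknown"}

def worst_component(values):
    """Given {component: metric_value}, return the most severe component.
    Lower value = more severe; this treats Unknown (4) as least severe,
    which is a choice you may want to revisit for alerting."""
    name = min(values, key=values.get)
    return name, HEALTH_STATUS[values[name]]

print(worst_component({"web-ui": 3, "xrootd": 1, "cmsd": 2}))
# ('xrootd', 'Critical')
```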
pelican_component_health_status_last_update
The timestamp of the last update of the health status of Pelican server components. The value is UNIX time in seconds. It shares the same labels as pelican_component_health_status.
Note: This is a gauge metric representing the last update timestamp. Query directly to see when each component’s health status was last updated.
pelican_server_xrootd_last_crash
The timestamp (seconds) of the last crash of the XRootD server.
Note: This is a gauge metric representing the timestamp of the last crash. Query directly to see when XRootD last crashed. A value of 0 indicates no crashes have occurred since the server started.
Registry
pelican_registry_federation_namespaces
The number of namespace registrations in the registry.
Note: This is a gauge metric representing the current number of namespace registrations. Query directly to see the current count.
Label: status
| Label Values | Description |
|---|---|
ok | The number of namespaces that have a valid registration. |
error | The number of namespaces that have an error in their registration. |
pelican_osdf_institution_count
Total number of contributing institutions. This is only available when running in OSDF mode.
Note: This is a gauge metric representing the current number of institutions. Query directly to see the current count.
Storage Servers (Origin and Cache)
xrootd_monitoring_packets_received_total
The total number of XRootD monitoring UDP packets received.
Note: This is a counter metric. Use `rate(xrootd_monitoring_packets_received_total[5m])` to get packets per second, or `increase(xrootd_monitoring_packets_received_total[1h])` to get total packets in the last hour.
xrootd_sched_thread_count
The number of XRootD scheduler threads. Ref: https://xrootd.web.cern.ch/doc/dev6/xrd_monitoring.htm#_Toc204013493
Note: This is a gauge metric representing the current number of threads. Query directly or use `avg_over_time(xrootd_sched_thread_count[5m])` to see the average over a time range.
Label: state
| Label Value | Description |
|---|---|
idle | Scheduler threads waiting for work |
running | Scheduler threads running |
xrootd_sched_thread_creations
Number of scheduler thread creations.
Note: This is a counter metric. Use `rate(xrootd_sched_thread_creations[5m])` to get thread creations per second, or `increase(xrootd_sched_thread_creations[1h])` to get total thread creations in the last hour.
xrootd_sched_thread_destructions
Number of scheduler thread destructions.
Note: This is a counter metric. Use `rate(xrootd_sched_thread_destructions[5m])` to get thread destructions per second, or `increase(xrootd_sched_thread_destructions[1h])` to get total thread destructions in the last hour.
xrootd_sched_thread_limit_reached
Number of times the scheduler thread limit has been reached.
Note: This is a counter metric. Use `rate(xrootd_sched_thread_limit_reached[5m])` to get limit hits per second, or `increase(xrootd_sched_thread_limit_reached[1h])` to get total limit hits in the last hour.
xrootd_sched_jobs
Number of scheduler jobs requiring a thread.
Note: This is a gauge metric representing the current number of jobs. Query directly or use `avg_over_time(xrootd_sched_jobs[5m])` to see the average over a time range.
xrootd_sched_queue_longest_length
Length of the longest run-queue.
Note: This is a gauge metric representing the current longest queue length. Query directly or use `avg_over_time(xrootd_sched_queue_longest_length[5m])` to see the average over a time range.
xrootd_sched_queued
Number of jobs queued.
Note: This is a gauge metric representing the current number of queued jobs. Query directly or use `avg_over_time(xrootd_sched_queued[5m])` to see the average over a time range.
xrootd_server_bytes_total
The total number of bytes XRootD sent/received. Ref: https://xrootd.web.cern.ch/doc/dev6/xrd_monitoring.htm#_Toc204013487 (See link.in and link.out)
Note: This is a counter metric. Use `rate(xrootd_server_bytes_total[5m])` to get bytes per second, or `increase(xrootd_server_bytes_total[1h])` to get total bytes in the last hour.
Label: direction
| Label Values | Description |
|---|---|
tx | Bytes sent |
rx | Bytes received |
xrootd_server_connections_total
The total number of server connections to XRootD.
Note: This is a counter metric. Use `rate(xrootd_server_connections_total[5m])` to get connections per second, or `increase(xrootd_server_connections_total[1h])` to get total connections in the last hour.
xrootd_storage_volume_bytes
The storage volume usage on the storage server.
Note: This is a gauge metric representing the current storage volume. Query directly or use `avg_over_time(xrootd_storage_volume_bytes[5m])` to see the average over a time range.
Label: type
| Label Values | Description |
|---|---|
total | Total bytes visible on the storage server |
free | Available bytes to use |
Label: server_type
| Label Values | Description |
|---|---|
Origin | Origin server |
Cache | Cache server |
Label: ns
The top-level namespace that XRootD is serving. Example: /foo
xrootd_transfer_bytes
The bytes transferred for an individual object. Ref: https://xrootd.web.cern.ch/doc/dev6/xrd_monitoring.htm#_Toc204013508 (See XrdXrootdMonStatXFR)
Note: This is a counter metric. Use `rate(xrootd_transfer_bytes[5m])` to get bytes per second, or `increase(xrootd_transfer_bytes[1h])` to get total bytes in the last hour.
Label: path
The path to the object (filename).
Label: ap
Authentication protocol name used to authenticate the client. Default is https
Label: dn
Client’s distinguished name as reported by ap. If no name is present, the variable data is null.
Label: role
Client’s role name as reported by prot. If no role name is present, the variable data is null.
Label: org
Client’s group names in a space-separated list. If no groups are present, the tag variable data is null.
Label: proj
Client’s User-Agent header when requesting the file. This is used to label the project name that accesses the file.
Label: type
| Label Values | Description |
|---|---|
read | Bytes read from file using read() |
readv | Bytes read from file using readv() |
write | Bytes written to file |
xrootd_transfer_operations_total
The number of transfer operations performed for an individual object. The labels for this metric are the same as those of xrootd_transfer_bytes.
Note: This is a counter metric. Use `rate(xrootd_transfer_operations_total[5m])` to get operations per second, or `increase(xrootd_transfer_operations_total[1h])` to get total operations in the last hour.
xrootd_transfer_readv_segments_total
The number of segments in readv operations for an individual object. The labels for this metric are the same as those of xrootd_transfer_bytes, except that the type label isn't available in this metric.
Note: This is a counter metric. Use `rate(xrootd_transfer_readv_segments_total[5m])` to get segments per second, or `increase(xrootd_transfer_readv_segments_total[1h])` to get total segments in the last hour.
xrootd_cache_access_bytes
The number of bytes of requested data, broken down by whether the data was in the cache or not.
Note: This is a gauge metric representing the current cache access state. Query directly or use `avg_over_time(xrootd_cache_access_bytes[5m])` to see the average over a time range.
Label: path
The path to the object (filename).
Label: type
| Label Values | Description |
|---|---|
hit | Bytes served from cache. |
miss | Bytes missed in cache. |
bypass | Bytes that bypassed the cache. |
xrootd_server_io_total
Total storage operations in origin/cache server.
Note: This is a counter metric. Use `rate(xrootd_server_io_total[5m])` to get operations per second, or `increase(xrootd_server_io_total[1h])` to get total operations in the last hour.
xrootd_server_io_active
Number of ongoing storage operations in origin/cache server.
Note: This is a gauge metric representing the current number of active IO operations. Query directly or use `avg_over_time(xrootd_server_io_active[5m])` to see the average over a time range.
xrootd_server_io_wait_seconds_total
The aggregate time spent in storage operations in origin/cache server.
Note: This is a counter metric. Use `rate(xrootd_server_io_wait_seconds_total[5m])` to get average wait time per second, or `increase(xrootd_server_io_wait_seconds_total[1h])` to get total wait time in the last hour.
xrootd_cpu_utilization
CPU utilization of the XRootD server, represented as the average number of CPU cores utilized (e.g., 1.0 = one full core, 2.5 = two and a half cores).
Note: This is a gauge metric representing the current CPU utilization. Query directly to see the current utilization, or use `avg_over_time(xrootd_cpu_utilization[5m])` to see the average over a time range.
OSS Layer Metrics
The following metrics are available from the XRootD OSS layer.
Note: All metrics ending in `_total` are counter metrics. Use `rate()` or `increase()` with a time range to query them. All metrics ending in `_time_seconds` are histogram metrics that track operation duration distributions.
| Metric Name | Description |
|---|---|
xrootd_oss_reads_total | The total number of read operations on the OSS. |
xrootd_oss_writes_total | The total number of write operations on the OSS. |
xrootd_oss_stats_total | The total number of stat operations on the OSS. |
xrootd_oss_pgreads_total | The total number of page read operations on the OSS. |
xrootd_oss_pgwrites_total | The total number of page write operations on the OSS. |
xrootd_oss_readv_total | The total number of readv operations on the OSS. |
xrootd_oss_readv_segments_total | The total number of segments in readv operations on the OSS. |
xrootd_oss_dirlists_total | The total number of directory list operations on the OSS. |
xrootd_oss_dirlist_entries_total | The total number of directory list entries on the OSS. |
xrootd_oss_truncates_total | The total number of truncate operations on the OSS. |
xrootd_oss_unlinks_total | The total number of unlink operations on the OSS. |
xrootd_oss_chmods_total | The total number of chmod operations on the OSS. |
xrootd_oss_opens_total | The total number of open operations on the OSS. |
xrootd_oss_renames_total | The total number of rename operations on the OSS. |
xrootd_oss_slow_reads_total | The total number of slow read operations on the OSS. |
xrootd_oss_slow_writes_total | The total number of slow write operations on the OSS. |
xrootd_oss_slow_stats_total | The total number of slow stat operations on the OSS. |
xrootd_oss_slow_pgreads_total | The total number of slow page read operations on the OSS. |
xrootd_oss_slow_pgwrites_total | The total number of slow page write operations on the OSS. |
xrootd_oss_slow_readv_total | The total number of slow readv operations on the OSS. |
xrootd_oss_slow_readv_segments_total | The total number of segments in slow readv operations on the OSS. |
xrootd_oss_slow_dirlists_total | The total number of slow directory list operations on the OSS. |
xrootd_oss_slow_dirlist_entries_total | The total number of slow directory list entries on the OSS. |
xrootd_oss_slow_truncates_total | The total number of slow truncate operations on the OSS. |
xrootd_oss_slow_unlinks_total | The total number of slow unlink operations on the OSS. |
xrootd_oss_slow_chmods_total | The total number of slow chmod operations on the OSS. |
xrootd_oss_slow_opens_total | The total number of slow open operations on the OSS. |
xrootd_oss_slow_renames_total | The total number of slow rename operations on the OSS. |
xrootd_oss_open_time_seconds | The time taken for open operations on the OSS. |
xrootd_oss_read_time_seconds | The time taken for read operations on the OSS. |
xrootd_oss_readv_time_seconds | The time taken for readv operations on the OSS. |
xrootd_oss_pgread_time_seconds | The time taken for page read operations on the OSS. |
xrootd_oss_write_time_seconds | The time taken for write operations on the OSS. |
xrootd_oss_pgwrite_time_seconds | The time taken for page write operations on the OSS. |
xrootd_oss_dirlist_time_seconds | The time taken for directory list operations on the OSS. |
xrootd_oss_stat_time_seconds | The time taken for stat operations on the OSS. |
xrootd_oss_truncate_time_seconds | The time taken for truncate operations on the OSS. |
xrootd_oss_unlink_time_seconds | The time taken for unlink operations on the OSS. |
xrootd_oss_rename_time_seconds | The time taken for rename operations on the OSS. |
xrootd_oss_chmod_time_seconds | The time taken for chmod operations on the OSS. |
xrootd_oss_slow_open_time_seconds | The time taken for slow open operations on the OSS. |
xrootd_oss_slow_read_time_seconds | The time taken for slow read operations on the OSS. |
xrootd_oss_slow_readv_time_seconds | The time taken for slow readv operations on the OSS. |
xrootd_oss_slow_pgread_time_seconds | The time taken for slow page read operations on the OSS. |
xrootd_oss_slow_write_time_seconds | The time taken for slow write operations on the OSS. |
xrootd_oss_slow_pgwrite_time_seconds | The time taken for slow page write operations on the OSS. |
xrootd_oss_slow_dirlist_time_seconds | The time taken for slow directory list operations on the OSS. |
xrootd_oss_slow_stat_time_seconds | The time taken for slow stat operations on the OSS. |
xrootd_oss_slow_truncate_time_seconds | The time taken for slow truncate operations on the OSS. |
xrootd_oss_slow_unlink_time_seconds | The time taken for slow unlink operations on the OSS. |
xrootd_oss_slow_rename_time_seconds | The time taken for slow rename operations on the OSS. |
xrootd_oss_slow_chmod_time_seconds | The time taken for slow chmod operations on the OSS. |
S3 Cache Plugin Metrics
The following metrics are available from the XRootD S3 cache plugin.
Note: All metrics ending in `_total` are counter metrics. Use `rate()` or `increase()` with a time range to query them.
| Metric Name | Description | Labels |
|---|---|---|
xrootd_s3_cache_bytes_total | Bytes transferred by the S3 cache plugin. | type: hit, miss, bypass, fetch, unused, prefetch |
xrootd_s3_cache_hits_total | Number of cache hits, partial hits, or misses. | type: full, partial, miss |
xrootd_s3_cache_requests_total | Number of cache requests. | type: bypass, fetch, prefetch |
xrootd_s3_cache_errors_total | Number of errors encountered by the S3 cache plugin. | |
xrootd_s3_cache_request_seconds_total | Total time spent in S3 requests. | type: bypass, fetch |
XrdCl Client Metrics
The following metrics are available from the XrdCl client.
Note: All metrics ending in `_total` are counter metrics. Use `rate()` or `increase()` with a time range to query them. Metrics ending in `_timestamp_seconds` are gauge metrics representing timestamps and can be queried directly. `xrootd_xrdcl_queue_pending` is a gauge metric representing the current queue size.
| Metric Name | Description | Labels |
|---|---|---|
xrootd_xrdcl_prefetch_count_total | Total number of prefetches started. | |
xrootd_xrdcl_prefetch_expired_total | Total number of prefetches that expired. | |
xrootd_xrdcl_prefetch_failed_total | Total number of prefetches that failed. | |
xrootd_xrdcl_prefetch_reads_hit_total | Total number of successful reads from prefetch buffer. | |
xrootd_xrdcl_prefetch_reads_miss_total | Total number of reads that missed the prefetch buffer. | |
xrootd_xrdcl_prefetch_bytes_used_total | Total number of bytes served from prefetch. | |
xrootd_xrdcl_queue_produced_total | Total number of HTTP requests placed into the queue. | |
xrootd_xrdcl_queue_consumed_total | Total number of HTTP requests read from the queue. | |
xrootd_xrdcl_queue_pending | Number of pending HTTP requests in the queue. | |
xrootd_xrdcl_queue_rejected_total | Total number of HTTP requests rejected due to overload. | |
xrootd_xrdcl_worker_oldest_op_timestamp_seconds | Timestamp of the oldest operation in any of the worker threads. | |
xrootd_xrdcl_worker_oldest_cycle_timestamp_seconds | Timestamp of the oldest event loop completion in any of the worker threads. | |
xrootd_xrdcl_http_requests_total | Statistics about HTTP requests. | verb, status, type |
xrootd_xrdcl_http_request_duration_seconds_total | Total duration of HTTP requests. | verb, status, type |
xrootd_xrdcl_http_bytes_total | Bytes transferred for HTTP requests. | verb, status |
xrootd_xrdcl_conncall_total | Statistics about connection calls. | type |
Cache Eviction Metrics
The following metrics are available from the XRootD cache eviction process.
Note: All cache eviction metrics are gauge metrics representing current state. Query them directly or use `avg_over_time()` to see averages over a time range.
| Metric Name | Description | Labels |
|---|---|---|
xrootd_cache_eviction_last_update_time_seconds | The last time xrootd cache eviction metrics were updated. | |
xrootd_cache_eviction_disk_usage_bytes | The disk usage of the xrootd cache. | |
xrootd_cache_eviction_snapshot_stats_reset_time_seconds | The time when the snapshot statistics were last reset. | |
xrootd_cache_eviction_disk_total_bytes | The total disk space available for the cache. | |
xrootd_cache_eviction_file_usage_bytes | The file usage of the xrootd cache. | |
xrootd_cache_eviction_meta_total_bytes | The total metadata storage available for the cache. | |
xrootd_cache_eviction_meta_used_bytes | The used metadata storage for the cache. | |
xrootd_cache_eviction_dir_num_ios | Number of I/Os per directory. | dir_name |
xrootd_cache_eviction_dir_duration | Duration of I/Os per directory. | dir_name |
xrootd_cache_eviction_dir_bytes | Bytes transferred per directory. | dir_name, type: hit, missed, bypassed, written |
xrootd_cache_eviction_dir_st_block_bytes | Bytes from storage blocks per directory. | dir_name, type: added, removed |
xrootd_cache_eviction_dir_n_cksum_errors | Number of checksum errors per directory. | dir_name |
xrootd_cache_eviction_dir_files_count | File operations per directory (opened, closed, created, removed). | dir_name, type: opened, closed, created, removed |
xrootd_cache_eviction_dir_directories_count | Directory operations (created, removed) per directory. | dir_name, type: created, removed |
xrootd_cache_eviction_dir_last_access_time_seconds | Last access time per directory. | dir_name, type: open, close |
xrootd_cache_eviction_dir_st_blocks_usage_count | Storage blocks usage per directory. | dir_name |
xrootd_cache_eviction_dir_n_files_open_count | Number of open files per directory. | dir_name |
xrootd_cache_eviction_dir_n_files_count | Number of files per directory. | dir_name |
xrootd_cache_eviction_dir_n_directories_count | Number of directories per directory. | dir_name |
Director
up
The Pelican director scrapes Prometheus metrics from all origins and cache servers that successfully advertise to the director. This metric reflects the Pelican origin or cache servers that are scraped by the director.
Label: server_name
The name of the storage server. By default it’s the hostname.
Label: server_type
| Label Values | Description |
|---|---|
Origin | Origin server |
Cache | Cache server |
Label: server_url
The storage server XRootD URL.
Label: server_web_url
The storage server web URL.
Label: server_auth_url
The storage server authentication URL.
Label: server_lat
The storage server latitude.
Label: server_long
The storage server longitude.
Number of Active Origins and Caches
With the up metric, it is possible to count the number of active origin and cache servers in the federation with a simple Prometheus query: count(up{server_type="Origin"}) for counting origin servers, or count(up{server_type="Cache"}) for counting cache servers.
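The response to such a count query follows the standard Prometheus HTTP API shape. A sketch of extracting the number in Python (the sample payload below is illustrative):

```python
import json

def scalar_from_instant_query(payload):
    """Extract the value of a single-series instant query, e.g.
    count(up{server_type="Origin"}), from a Prometheus API response."""
    result = json.loads(payload)["data"]["result"]
    # Each series value is a [timestamp, "value"] pair; the value is a string.
    return float(result[0]["value"][1]) if result else 0.0

sample = ('{"status":"success","data":{"resultType":"vector",'
          '"result":[{"metric":{},"value":[1700000000,"12"]}]}}')
print(scalar_from_instant_query(sample))  # 12.0
```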
pelican_director_advertisements_received_total
The accumulated number of origin/cache advertisements to the director. This metric shows whether an origin/cache server successfully joined the federation. For origin servers, it also shows whether each federation namespace prefix they export passed director verification.
Note: This is a counter metric. Use `rate(pelican_director_advertisements_received_total[5m])` to get advertisements per second, or `increase(pelican_director_advertisements_received_total[1h])` to get total advertisements in the last hour.
Label: server_name
The name of the storage server. By default it’s the hostname.
Label: server_type
| Label Values | Description |
|---|---|
Origin | Origin server |
Cache | Cache server |
Label: server_web_url
The storage server web URL.
Label: namespace_prefix
The federation namespace prefix the storage server exported.
Label: status_code
The status code of the director's response. The most useful value is 403, which means the server advertisement didn't pass the director's verification.
| Label Values | Description |
|---|---|
200 | Advertisement succeeded |
403 | Advertisement verification failed |
500 | The director encountered errors while verifying or saving the advertisement |
pelican_director_stat_total
The accumulated number of stat queries the director made to origin/cache servers to check for object availability. Only available when Director.EnableStat is set to true. This metric is a good indicator of object availability and origin/cache service quality.
Note: Although this metric accumulates over time, it is registered as a gauge (not a counter), so it resets on restart. Query it directly to see the current total.
Label: server_name
The name of the storage server. By default it’s the hostname.
Label: server_type
| Label Values | Description |
|---|---|
Origin | Origin server |
Cache | Cache server |
Label: server_url
The storage server XRootD URL.
Label: result
The stat query result.
| Label Values | Description |
|---|---|
Succeeded | The object requested is on the server |
NotFound | The requested object could not be found on the server |
Timeout | The query exceeded the allotted time and was not completed. |
Cancelled | The query was cancelled because the maximum number of responses was reached |
Forbidden | The object request was denied due to lack of permissions or missing token |
UnknownErr | An unexpected error occurred. Typically when the server refused to connect |
Label: cached_result
Whether the result was cached.
pelican_director_stat_active
The ongoing stat queries at the server. Note that Prometheus samples the metric value every 15 s, while each stat request only takes ~10-100 ms to finish, so this metric cannot capture transient, sub-second bursts of requests.
Note: This is a gauge metric representing the current number of active stat queries. Query directly to see the current count.
Label: server_name
The name of the storage server. By default it’s the hostname.
Label: server_type
| Label Values | Description |
|---|---|
Origin | Origin server |
Cache | Cache server |
Label: server_url
The storage server XRootD URL.
pelican_director_total_ftx_test_suite
The number of file transfer test suites the director has issued. In Pelican, the director creates a test file and sends it to origin servers as a health test. It issues such a test suite when it receives the registration from the origin server. Within a test suite, a timer runs a cycle of uploading, getting, and deleting the test file every 15 seconds; each cycle is called a "test run". In theory, the director should issue only one test suite per origin server; however, the registration information is stored in a TTL cache in the director, and when an entry expires the corresponding test suite is cancelled. A new test suite is then issued with the new registration, so the director can issue multiple test suites to an origin server over time.
Note: This is a counter metric. Use `rate(pelican_director_total_ftx_test_suite[5m])` to get test suites per second, or `increase(pelican_director_total_ftx_test_suite[1h])` to get total test suites in the last hour.
Label: server_name
The name of the storage server. By default it’s the hostname.
Label: server_type
| Label Values | Description |
|---|---|
Origin | Origin server |
Cache | Cache server |
Label: server_web_url
The storage server web URL.
pelican_director_active_ftx_test_suite
The number of active director file transfer test suites. As mentioned for the previous metric, test suites are individual tasks running independently of the main program logic. This can cause a race condition where an expired test suite has not yet been cleared but a new test suite is issued for the same origin. This metric records such conditions for debugging and monitoring. The value of this metric should be 1 at all times.
This metric shares the same labels as pelican_director_total_ftx_test_suite.
Note: This is a gauge metric representing the current number of active test suites. Query directly to see the current count.
pelican_director_total_ftx_test_runs
The number of file transfer test runs the director issued. A "test run" is one set of upload/get/delete operations on a test file against an origin. It executes in a cycle of 15 s (by default).
Note: This is a counter metric. Use `rate(pelican_director_total_ftx_test_runs[5m])` to get test runs per second, or `increase(pelican_director_total_ftx_test_runs[1h])` to get total test runs in the last hour.
This metric shares the same label as pelican_director_total_ftx_test_suite, with two additions:
Label: status
| Label Values | Description |
|---|---|
Success | The test run succeeded |
Failed | The test run failed |
Label: report_status
| Label Values | Description |
|---|---|
Success | The reporting to the origin of test run status succeeded |
Failed | The reporting to the origin of test run status failed |
pelican_director_map_items_total
The total number of map items in the director, by the name of the map.
Note: This is a gauge metric representing the current number of map items. Query directly to see the current count.
Label: name
The name of the map. One of healthTestUtils, filteredServers, serverStatUtils, serverStatEntries.
pelican_director_ttl_cache
The statistics of various TTL caches.
Note: This is a gauge metric representing the current TTL cache statistics. Query directly to see the current values.
Label: name
The name of the cache. One of serverAds, jwks.
Label: type
The type of the statistic. One of evictions, insertions, hits, misses, total.
pelican_director_server_count
The number of servers currently recognized by the Director, delineated by pelican/non-pelican and origin/cache.
Note: This is a gauge metric representing the current number of servers. Query directly to see the current count.
Label: server_name
The name of the server.
Label: server_type
| Label Values | Description |
|---|---|
Origin | Origin server |
Cache | Cache server |
Label: from_topology
Whether the server was discovered from the topology.
pelican_director_client_requests_total
The total number of requests from clients.
Note: This is a counter metric. Use `rate(pelican_director_client_requests_total[5m])` to get requests per second, or `increase(pelican_director_client_requests_total[1h])` to get total requests in the last hour.
Label: version
The client version.
Label: service
The service that received the request.
pelican_director_redirects_total
The total number of redirects the director issued.
Note: This is a counter metric. Use `rate(pelican_director_redirects_total[5m])` to get redirects per second, or `increase(pelican_director_redirects_total[1h])` to get total redirects in the last hour.
Label: destination
The destination of the redirect.
Label: status_code
The status code of the redirect.
Label: version
The client version.
Label: network
The network of the client.
pelican_director_maxmind_server_errors_total
The total number of errors encountered trying to resolve server coordinates using the GeoIP MaxMind database.
Note: This is a counter metric. Use `rate(pelican_director_maxmind_server_errors_total[5m])` to get errors per second, or `increase(pelican_director_maxmind_server_errors_total[1h])` to get total errors in the last hour.
Label: network
The network address that was being resolved.
Label: server_name
The name of the server that was being resolved.
pelican_director_maxmind_client_errors_total
The total number of errors encountered trying to resolve client coordinates using the GeoIP MaxMind database.
Note: This is a counter metric. Use `rate(pelican_director_maxmind_client_errors_total[5m])` to get errors per second, or `increase(pelican_director_maxmind_client_errors_total[1h])` to get total errors in the last hour.
Label: network
The network address that was being resolved.
Label: project
The project of the client that was being resolved.
pelican_director_rejected_advertisements
The total number of advertisements rejected by the director.
Note: This is a counter metric. Use `rate(pelican_director_rejected_advertisements[5m])` to get rejections per second, or `increase(pelican_director_rejected_advertisements[1h])` to get total rejections in the last hour.
Label: hostname
The hostname of the server that sent the advertisement.
pelican_director_server_statusweight
The EWMA-smoothed status weight generated by the Director for each server.
Note: This is a gauge metric representing the current status weight. Query directly to see the current weight value.
Label: server_name
The name of the server.
Label: server_url
The URL of the server.
Label: server_type
| Label Values | Description |
|---|---|
Origin | Origin server |
Cache | Cache server |
Deprecated Metrics
The following metrics are deprecated and will be removed in a future release.
- `pelican_director_geoip_errors`: Deprecated; split into separate client/server metrics (`pelican_director_maxmind_{server,client}_errors_total`). The total number of errors encountered trying to resolve coordinates using the GeoIP MaxMind database.
- `xrootd_monitoring_packets_received`: Renamed to `xrootd_monitoring_packets_received_total`.
- `xrootd_transfer_readv_segments_count`: Renamed to `xrootd_transfer_readv_segments_total`.
- `xrootd_transfer_operations_count`: Renamed to `xrootd_transfer_operations_total`.
- `xrootd_server_connection_count`: Renamed to `xrootd_server_connections_total`.
- `xrootd_server_bytes`: Renamed to `xrootd_server_bytes_total`.
- `xrootd_server_io_wait_time`: Renamed to `xrootd_server_io_wait_seconds_total`.