
Query Pelican Server Metrics via Prometheus

For Pelican >= 7.16.0. Older versions of Pelican may not include all of the metrics listed.

Pelican servers embed Prometheus by default and expose a set of Prometheus metrics for monitoring server status. You can access the metrics endpoint at https://<pelican-server-host>:<server-web-port>/metrics to see all available metrics and their current values. By default, /metrics is a protected endpoint, so you must log in before you can view the page. To turn off authentication, set Monitoring.MetricAuthorization to false in the configuration.

Pelican also exposes the Prometheus PromQL query engine at https://<pelican-server-host>:<server-web-port>/api/v1.0/prometheus, where you can query the metrics using Prometheus's query language.

Example: https://<pelican-server-host>:<server-web-port>/api/v1.0/prometheus/query?query=pelican_component_health_status[10m] queries the pelican_component_health_status metric and returns the data collected in the past 10 minutes.
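When building such a URL in a script, the PromQL expression should be percent-encoded, since characters like [ and ] are not safe to place in a URL verbatim. The following is a minimal sketch in Python. The host name and port are placeholders, build_query_url is a hypothetical helper rather than part of Pelican, and any authentication the endpoint may require is not covered.

```python
from urllib.parse import urlencode


def build_query_url(host: str, port: int, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression on a Pelican server."""
    base = f"https://{host}:{port}/api/v1.0/prometheus/query"
    # urlencode percent-encodes characters such as '[' and ']' in the expression.
    return f"{base}?{urlencode({'query': promql})}"


# Placeholder host and port; substitute your own server's values.
url = build_query_url(
    "pelican.example.org", 8444, "pelican_component_health_status[10m]"
)
print(url)
```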

However, Pelican does not support the native Prometheus /graph endpoint or any native Prometheus web service other than the two above. For custom data visualizations, Grafana is a popular option.

Pelican includes metrics from the built-in gin web server, as well as the Go runtime. For the names of all available metrics, visit https://<pelican-server-host>:<server-web-port>/api/v1.0/prometheus/label/__name__/values.

Pelican also has a set of built-in metrics to monitor Pelican server’s status, listed below.

Counter Metrics

Many metrics in this documentation are counters (typically identified by names ending in _total or _count). Counter metrics are monotonically increasing values that accumulate over time. Important notes about counters:

  • Counters reset on restart: When a Pelican server restarts, all counter values reset to 0. The counter then begins accumulating from 0 again.
  • Time ranges are required: To get meaningful data from counters, you must use PromQL functions that calculate changes over time, such as:
    • rate() - calculates the per-second average rate of increase
    • irate() - calculates the per-second instant rate of increase
    • increase() - calculates the increase over a time range
  • Example usage: Instead of querying xrootd_server_bytes_total directly, use rate(xrootd_server_bytes_total[5m]) to get bytes per second, or increase(xrootd_server_bytes_total[1h]) to get total bytes transferred in the last hour.
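The reset behaviour described above is the reason to prefer rate() and increase() over raw counter values. The sketch below illustrates the idea in Python under simplifying assumptions: it works on a plain list of samples and omits the extrapolation that Prometheus applies at the edges of the time window. The function name and sample values are illustrative only.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase across samples, treating any drop as a counter reset."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The counter went backwards, so the process restarted and began
            # counting from 0 again: the whole current value is new increase.
            total += curr
    return total


# Hypothetical samples of a byte counter; the server restarts after 200.
samples = [100.0, 150.0, 200.0, 30.0, 80.0]
naive = samples[-1] - samples[0]          # negative, which is misleading
reset_aware = counter_increase(samples)   # 50 + 50 + 30 + 50
print(naive, reset_aware)
```

Dividing the reset-aware increase by the length of the window in seconds gives a per-second rate, which is roughly what rate() reports.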

Gauge Metrics

Many metrics in this documentation are gauges. Gauge metrics represent a single numerical value that can go up or down over time. Important notes about gauges:

  • Gauges represent current state: Unlike counters, gauges show the current value of something at a point in time (e.g., number of active connections, current CPU usage, available disk space).
  • Gauges can be queried directly: You can query gauge metrics directly without time-range functions like rate() or increase(). The value represents the current state.
  • Gauges do not accumulate across restarts: When a server restarts, gauge values reset to their initial state (often 0). Unlike counters, they don’t accumulate, so the gauge simply reflects the new current state after the restart.
  • Example usage: Query xrootd_server_io_active directly to see the current number of active IO operations, or use avg_over_time(xrootd_server_io_active[5m]) to see the average over the last 5 minutes.

All Servers

All of the Pelican servers have the following metrics:

process_start_time_seconds

The UNIX epoch time in seconds when the Pelican process started.

To get the duration of the Pelican server running time, use the following PromQL:

PromQL
time() - process_start_time_seconds

This yields the duration in seconds.

pelican_component_health_status

The health status of Pelican server components. The metric value maps to one of the following statuses:

  • 1: Critical
  • 2: Warning
  • 3: OK
  • 4: Unknown

Note: This is a gauge metric representing the current health status. Query directly to see the current state of each component.

Label: component

| Label Value | Description | Availability |
| --- | --- | --- |
| web-ui | Admin website | All servers |
| xrootd | XRootD process | Origin and cache servers |
| cmsd | CMSD process | Origin and cache servers |
| federation | Advertisement to the Director | Origin and cache servers |
| registry | Namespace registration at the Registry | Origin and cache servers |
| director | Object transfer tests from the Director | Origin and cache servers |
| topology | Data fetch from the OSDF topology server | All servers (OSDF mode only) |
| IO-concurrency | Health status indicating whether the average concurrent IO operations exceed the configured concurrency limit, used by the Director to determine if redirects should be reduced | Origin and cache servers |
| prometheus | Health status of the embedded Prometheus server. Critical indicates Prometheus failed to start or the server is not ready to receive web requests (metrics unavailable). OK indicates Prometheus started successfully and is ready | All servers |
| config-updates | Health status of XRootD configuration file updates (authfile and scitokens.cfg). Critical indicates files are stale beyond the configured timeout and may trigger auto-shutdown if enabled. Warning indicates update failures observed but within timeout. OK indicates both files updated successfully | Origin and cache servers |
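To turn the numeric values of pelican_component_health_status into readable statuses, a script can decode the JSON returned by the query endpoint. The sketch below assumes the response follows the standard Prometheus HTTP API shape for an instant vector. The sample payload is made up for illustration, and component_statuses is a hypothetical helper.

```python
import json

# Value-to-status mapping for pelican_component_health_status.
STATUS_NAMES = {1: "Critical", 2: "Warning", 3: "OK", 4: "Unknown"}


def component_statuses(response_text: str) -> dict:
    """Map each component label to its status name."""
    payload = json.loads(response_text)
    statuses = {}
    for series in payload["data"]["result"]:
        component = series["metric"].get("component", "")
        # Instant-vector samples are [timestamp, "value"]; the value is a string.
        code = int(float(series["value"][1]))
        statuses[component] = STATUS_NAMES.get(code, f"unrecognized ({code})")
    return statuses


# Made-up response body, shaped like a Prometheus instant-query result.
sample = """{"status": "success", "data": {"resultType": "vector", "result": [
  {"metric": {"__name__": "pelican_component_health_status", "component": "web-ui"},
   "value": [1700000000, "3"]},
  {"metric": {"__name__": "pelican_component_health_status", "component": "xrootd"},
   "value": [1700000000, "1"]}
]}}"""

print(component_statuses(sample))
```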

pelican_component_health_status_last_update

The timestamp of the last health status update for each Pelican server component. The value is UNIX time in seconds. It shares the same label as pelican_component_health_status.

Note: This is a gauge metric representing the last update timestamp. Query directly to see when each component’s health status was last updated.

pelican_server_xrootd_last_crash

The timestamp (seconds) of the last crash of the XRootD server.

Note: This is a gauge metric representing the timestamp of the last crash. Query directly to see when XRootD last crashed. A value of 0 indicates no crashes have occurred since the server started.

Registry

pelican_registry_federation_namespaces

The number of namespace registrations in the registry.

Note: This is a gauge metric representing the current number of namespace registrations. Query directly to see the current count.

Label: status

| Label Values | Description |
| --- | --- |
| ok | The number of namespaces that have a valid registration. |
| error | The number of namespaces that have an error in their registration. |

pelican_osdf_institution_count

Total number of contributing institutions. This is only available when running in OSDF mode.

Note: This is a gauge metric representing the current number of institutions. Query directly to see the current count.

Storage Servers (Origin and Cache)

xrootd_monitoring_packets_received_total

The total number of XRootD monitoring UDP packets received.

Note: This is a counter metric. Use rate(xrootd_monitoring_packets_received_total[5m]) to get packets per second, or increase(xrootd_monitoring_packets_received_total[1h]) to get total packets in the last hour.

xrootd_sched_thread_count

The number of XRootD scheduler threads. Ref: https://xrootd.web.cern.ch/doc/dev6/xrd_monitoring.htm#_Toc204013493 

Note: This is a gauge metric representing the current number of threads. Query directly or use avg_over_time(xrootd_sched_thread_count[5m]) to see the average over a time range.

Label: state

| Label Value | Description |
| --- | --- |
| idle | Scheduler threads waiting for work |
| running | Scheduler threads running |

xrootd_sched_thread_creations

Number of scheduler thread creations.

Note: This is a counter metric. Use rate(xrootd_sched_thread_creations[5m]) to get thread creations per second, or increase(xrootd_sched_thread_creations[1h]) to get total thread creations in the last hour.

xrootd_sched_thread_destructions

Number of scheduler thread destructions.

Note: This is a counter metric. Use rate(xrootd_sched_thread_destructions[5m]) to get thread destructions per second, or increase(xrootd_sched_thread_destructions[1h]) to get total thread destructions in the last hour.

xrootd_sched_thread_limit_reached

Number of times the scheduler thread limit has been reached.

Note: This is a counter metric. Use rate(xrootd_sched_thread_limit_reached[5m]) to get limit hits per second, or increase(xrootd_sched_thread_limit_reached[1h]) to get total limit hits in the last hour.

xrootd_sched_jobs

Number of scheduler jobs requiring a thread.

Note: This is a gauge metric representing the current number of jobs. Query directly or use avg_over_time(xrootd_sched_jobs[5m]) to see the average over a time range.

xrootd_sched_queue_longest_length

Length of the longest run-queue.

Note: This is a gauge metric representing the current longest queue length. Query directly or use avg_over_time(xrootd_sched_queue_longest_length[5m]) to see the average over a time range.

xrootd_sched_queued

Number of jobs queued.

Note: This is a gauge metric representing the current number of queued jobs. Query directly or use avg_over_time(xrootd_sched_queued[5m]) to see the average over a time range.

xrootd_server_bytes_total

The total number of bytes XRootD sent/received. Ref: https://xrootd.web.cern.ch/doc/dev6/xrd_monitoring.htm#_Toc204013487 (See link.in and link.out)

Note: This is a counter metric. Use rate(xrootd_server_bytes_total[5m]) to get bytes per second, or increase(xrootd_server_bytes_total[1h]) to get total bytes in the last hour.

Label: direction

| Label Values | Description |
| --- | --- |
| tx | Bytes sent |
| rx | Bytes received |

xrootd_server_connections_total

The total number of server connections to XRootD.

Note: This is a counter metric. Use rate(xrootd_server_connections_total[5m]) to get connections per second, or increase(xrootd_server_connections_total[1h]) to get total connections in the last hour.

xrootd_storage_volume_bytes

The storage volume usage on the storage server.

Note: This is a gauge metric representing the current storage volume. Query directly or use avg_over_time(xrootd_storage_volume_bytes[5m]) to see the average over a time range.

Label: type

| Label Values | Description |
| --- | --- |
| total | Total bytes visible on the storage server |
| free | Available bytes to use |

Label: server_type

| Label Values | Description |
| --- | --- |
| Origin | Origin server |
| Cache | Cache server |

Label: ns

The top-level namespace that XRootD is serving. Example: /foo
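Because the metric reports total and free bytes as separate series, the used share of a volume has to be derived. In PromQL this could plausibly be written as 1 - xrootd_storage_volume_bytes{type="free"} / ignoring(type) xrootd_storage_volume_bytes{type="total"}, assuming the two series differ only in the type label. The same arithmetic as a minimal Python sketch, with a hypothetical helper name and made-up byte counts:

```python
def used_fraction(total_bytes: float, free_bytes: float) -> float:
    """Fraction of the volume in use, from the 'total' and 'free' series."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (total_bytes - free_bytes) / total_bytes


# Made-up values: a 1000-byte volume with 250 bytes free.
print(f"{used_fraction(1000.0, 250.0):.0%} used")
```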

xrootd_transfer_bytes

The number of bytes transferred for each individual object. Ref: https://xrootd.web.cern.ch/doc/dev6/xrd_monitoring.htm#_Toc204013508 (See XrdXrootdMonStatXFR)

Note: This is a counter metric. Use rate(xrootd_transfer_bytes[5m]) to get bytes per second, or increase(xrootd_transfer_bytes[1h]) to get total bytes in the last hour.

Label: path

The path to the object (filename).

Label: ap

Authentication protocol name used to authenticate the client. The default is https.

Label: dn

Client’s distinguished name as reported by ap. If no name is present, the variable data is null.

Label: role

Client’s role name as reported by prot. If no role name is present, the variable data is null.

Label: org

Client’s group names in a space-separated list. If no groups are present, the tag variable data is null.

Label: proj

Client’s User-Agent header when requesting the file. This is used to label the project name that accesses the file.

Label: type

| Label Values | Description |
| --- | --- |
| read | Bytes read from file using read() |
| readv | Bytes read from file using readv() |
| write | Bytes written to file |

xrootd_transfer_operations_total

The number of transfer operations performed for each individual object. The labels for this metric are the same as those of xrootd_transfer_bytes.

Note: This is a counter metric. Use rate(xrootd_transfer_operations_total[5m]) to get operations per second, or increase(xrootd_transfer_operations_total[1h]) to get total operations in the last hour.

xrootd_transfer_readv_segments_total

The number of segments in readv operations for each individual object. The labels for this metric are the same as those of xrootd_transfer_bytes, except that the type label isn’t available.

Note: This is a counter metric. Use rate(xrootd_transfer_readv_segments_total[5m]) to get segments per second, or increase(xrootd_transfer_readv_segments_total[1h]) to get total segments in the last hour.

xrootd_cache_access_bytes

The number of bytes requested, broken down by whether the data was in the cache.

Note: This is a gauge metric representing the current cache access state. Query directly or use avg_over_time(xrootd_cache_access_bytes[5m]) to see the average over a time range.

Label: path

The path to the object (filename).

Label: type

| Label Values | Description |
| --- | --- |
| hit | Bytes served from cache. |
| miss | Bytes missed in cache. |
| bypass | Bytes that bypassed the cache. |
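One way to summarize the three type values is a byte-level hit ratio. The sketch below uses one possible definition, hit bytes divided by all requested bytes (hit + miss + bypass). Other definitions, for example excluding bypassed bytes, are equally defensible. The helper name and the input values are hypothetical.

```python
def cache_hit_ratio(bytes_by_type: dict) -> float:
    """Share of requested bytes that were served from the cache."""
    hit = bytes_by_type.get("hit", 0.0)
    miss = bytes_by_type.get("miss", 0.0)
    bypass = bytes_by_type.get("bypass", 0.0)
    total = hit + miss + bypass
    # With no traffic at all there is no meaningful ratio; report 0.0.
    return hit / total if total else 0.0


# Made-up byte counts keyed by the value of the type label.
print(cache_hit_ratio({"hit": 600.0, "miss": 300.0, "bypass": 100.0}))
```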

xrootd_server_io_total

Total storage operations in origin/cache server.

Note: This is a counter metric. Use rate(xrootd_server_io_total[5m]) to get operations per second, or increase(xrootd_server_io_total[1h]) to get total operations in the last hour.

xrootd_server_io_active

Number of ongoing storage operations in origin/cache server.

Note: This is a gauge metric representing the current number of active IO operations. Query directly or use avg_over_time(xrootd_server_io_active[5m]) to see the average over a time range.

xrootd_server_io_wait_seconds_total

The aggregate time spent in storage operations in origin/cache server.

Note: This is a counter metric. Use rate(xrootd_server_io_wait_seconds_total[5m]) to get the wait time accumulated per second, or increase(xrootd_server_io_wait_seconds_total[1h]) to get total wait time in the last hour.
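Combining this counter with xrootd_server_io_total gives an estimate of the mean time per storage operation: the increase in wait seconds divided by the increase in operations over the same window. In PromQL this would be roughly rate(xrootd_server_io_wait_seconds_total[5m]) / rate(xrootd_server_io_total[5m]), assuming both metrics carry matching labels. A small Python sketch of the arithmetic, with a hypothetical helper name and made-up counter readings:

```python
from typing import Optional


def mean_wait_per_op(
    wait_start: float, wait_end: float, ops_start: float, ops_end: float
) -> Optional[float]:
    """Mean seconds per operation between two readings of both counters."""
    ops = ops_end - ops_start
    if ops <= 0:
        # No operations completed in the window (or a counter reset occurred).
        return None
    return (wait_end - wait_start) / ops


# Made-up readings: 3 seconds of wait accumulated over 600 operations.
print(mean_wait_per_op(10.0, 13.0, 1000.0, 1600.0))
```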

xrootd_cpu_utilization

CPU utilization of the XRootD server, represented as the average number of CPU cores utilized (e.g., 1.0 = one full core, 2.5 = two and a half cores).

Note: This is a gauge metric representing the current CPU utilization. Query directly to see the current utilization, or use avg_over_time(xrootd_cpu_utilization[5m]) to see the average over a time range.

OSS Layer Metrics

The following metrics are available from the XRootD OSS layer.

Note: All metrics ending in _total are counter metrics. Use rate() or increase() with a time range to query them. All metrics ending in _time_seconds are histogram metrics that track operation duration distributions.

| Metric Name | Description |
| --- | --- |
| xrootd_oss_reads_total | The total number of read operations on the OSS. |
| xrootd_oss_writes_total | The total number of write operations on the OSS. |
| xrootd_oss_stats_total | The total number of stat operations on the OSS. |
| xrootd_oss_pgreads_total | The total number of page read operations on the OSS. |
| xrootd_oss_pgwrites_total | The total number of page write operations on the OSS. |
| xrootd_oss_readv_total | The total number of readv operations on the OSS. |
| xrootd_oss_readv_segments_total | The total number of segments in readv operations on the OSS. |
| xrootd_oss_dirlists_total | The total number of directory list operations on the OSS. |
| xrootd_oss_dirlist_entries_total | The total number of directory list entries on the OSS. |
| xrootd_oss_truncates_total | The total number of truncate operations on the OSS. |
| xrootd_oss_unlinks_total | The total number of unlink operations on the OSS. |
| xrootd_oss_chmods_total | The total number of chmod operations on the OSS. |
| xrootd_oss_opens_total | The total number of open operations on the OSS. |
| xrootd_oss_renames_total | The total number of rename operations on the OSS. |
| xrootd_oss_slow_reads_total | The total number of slow read operations on the OSS. |
| xrootd_oss_slow_writes_total | The total number of slow write operations on the OSS. |
| xrootd_oss_slow_stats_total | The total number of slow stat operations on the OSS. |
| xrootd_oss_slow_pgreads_total | The total number of slow page read operations on the OSS. |
| xrootd_oss_slow_pgwrites_total | The total number of slow page write operations on the OSS. |
| xrootd_oss_slow_readv_total | The total number of slow readv operations on the OSS. |
| xrootd_oss_slow_readv_segments_total | The total number of segments in slow readv operations on the OSS. |
| xrootd_oss_slow_dirlists_total | The total number of slow directory list operations on the OSS. |
| xrootd_oss_slow_dirlist_entries_total | The total number of slow directory list entries on the OSS. |
| xrootd_oss_slow_truncates_total | The total number of slow truncate operations on the OSS. |
| xrootd_oss_slow_unlinks_total | The total number of slow unlink operations on the OSS. |
| xrootd_oss_slow_chmods_total | The total number of slow chmod operations on the OSS. |
| xrootd_oss_slow_opens_total | The total number of slow open operations on the OSS. |
| xrootd_oss_slow_renames_total | The total number of slow rename operations on the OSS. |
| xrootd_oss_open_time_seconds | The time taken for open operations on the OSS. |
| xrootd_oss_read_time_seconds | The time taken for read operations on the OSS. |
| xrootd_oss_readv_time_seconds | The time taken for readv operations on the OSS. |
| xrootd_oss_pgread_time_seconds | The time taken for page read operations on the OSS. |
| xrootd_oss_write_time_seconds | The time taken for write operations on the OSS. |
| xrootd_oss_pgwrite_time_seconds | The time taken for page write operations on the OSS. |
| xrootd_oss_dirlist_time_seconds | The time taken for directory list operations on the OSS. |
| xrootd_oss_stat_time_seconds | The time taken for stat operations on the OSS. |
| xrootd_oss_truncate_time_seconds | The time taken for truncate operations on the OSS. |
| xrootd_oss_unlink_time_seconds | The time taken for unlink operations on the OSS. |
| xrootd_oss_rename_time_seconds | The time taken for rename operations on the OSS. |
| xrootd_oss_chmod_time_seconds | The time taken for chmod operations on the OSS. |
| xrootd_oss_slow_open_time_seconds | The time taken for slow open operations on the OSS. |
| xrootd_oss_slow_read_time_seconds | The time taken for slow read operations on the OSS. |
| xrootd_oss_slow_readv_time_seconds | The time taken for slow readv operations on the OSS. |
| xrootd_oss_slow_pgread_time_seconds | The time taken for slow page read operations on the OSS. |
| xrootd_oss_slow_write_time_seconds | The time taken for slow write operations on the OSS. |
| xrootd_oss_slow_pgwrite_time_seconds | The time taken for slow page write operations on the OSS. |
| xrootd_oss_slow_dirlist_time_seconds | The time taken for slow directory list operations on the OSS. |
| xrootd_oss_slow_stat_time_seconds | The time taken for slow stat operations on the OSS. |
| xrootd_oss_slow_truncate_time_seconds | The time taken for slow truncate operations on the OSS. |
| xrootd_oss_slow_unlink_time_seconds | The time taken for slow unlink operations on the OSS. |
| xrootd_oss_slow_rename_time_seconds | The time taken for slow rename operations on the OSS. |
| xrootd_oss_slow_chmod_time_seconds | The time taken for slow chmod operations on the OSS. |

S3 Cache Plugin Metrics

The following metrics are available from the XRootD S3 cache plugin.

Note: All metrics ending in _total are counter metrics. Use rate() or increase() with a time range to query them.

| Metric Name | Description | Labels |
| --- | --- | --- |
| xrootd_s3_cache_bytes_total | Bytes transferred by the S3 cache plugin. | type: hit, miss, bypass, fetch, unused, prefetch |
| xrootd_s3_cache_hits_total | Number of cache hits, partial hits, or misses. | type: full, partial, miss |
| xrootd_s3_cache_requests_total | Number of cache requests. | type: bypass, fetch, prefetch |
| xrootd_s3_cache_errors_total | Number of errors encountered by the S3 cache plugin. | |
| xrootd_s3_cache_request_seconds_total | Total time spent in S3 requests. | type: bypass, fetch |

XrdCl Client Metrics

The following metrics are available from the XrdCl client.

Note: All metrics ending in _total are counter metrics. Use rate() or increase() with a time range to query them. Metrics ending in _timestamp_seconds are gauge metrics representing timestamps and can be queried directly. xrootd_xrdcl_queue_pending is a gauge metric representing the current queue size.

| Metric Name | Description | Labels |
| --- | --- | --- |
| xrootd_xrdcl_prefetch_count_total | Total number of prefetches started. | |
| xrootd_xrdcl_prefetch_expired_total | Total number of prefetches that expired. | |
| xrootd_xrdcl_prefetch_failed_total | Total number of prefetches that failed. | |
| xrootd_xrdcl_prefetch_reads_hit_total | Total number of successful reads from prefetch buffer. | |
| xrootd_xrdcl_prefetch_reads_miss_total | Total number of reads that missed the prefetch buffer. | |
| xrootd_xrdcl_prefetch_bytes_used_total | Total number of bytes served from prefetch. | |
| xrootd_xrdcl_queue_produced_total | Total number of HTTP requests placed into the queue. | |
| xrootd_xrdcl_queue_consumed_total | Total number of HTTP requests read from the queue. | |
| xrootd_xrdcl_queue_pending | Number of pending HTTP requests in the queue. | |
| xrootd_xrdcl_queue_rejected_total | Total number of HTTP requests rejected due to overload. | |
| xrootd_xrdcl_worker_oldest_op_timestamp_seconds | Timestamp of the oldest operation in any of the worker threads. | |
| xrootd_xrdcl_worker_oldest_cycle_timestamp_seconds | Timestamp of the oldest event loop completion in any of the worker threads. | |
| xrootd_xrdcl_http_requests_total | Statistics about HTTP requests. | verb, status, type |
| xrootd_xrdcl_http_request_duration_seconds_total | Total duration of HTTP requests. | verb, status, type |
| xrootd_xrdcl_http_bytes_total | Bytes transferred for HTTP requests. | verb, status |
| xrootd_xrdcl_conncall_total | Statistics about connection calls. | type |

Cache Eviction Metrics

The following metrics are available from the XRootD cache eviction process.

Note: All cache eviction metrics are gauge metrics representing current state. Query them directly or use avg_over_time() to see averages over a time range.

| Metric Name | Description | Labels |
| --- | --- | --- |
| xrootd_cache_eviction_last_update_time_seconds | The last time xrootd cache eviction metrics were updated. | |
| xrootd_cache_eviction_disk_usage_bytes | The disk usage of the xrootd cache. | |
| xrootd_cache_eviction_snapshot_stats_reset_time_seconds | The time when the snapshot statistics were last reset. | |
| xrootd_cache_eviction_disk_total_bytes | The total disk space available for the cache. | |
| xrootd_cache_eviction_file_usage_bytes | The file usage of the xrootd cache. | |
| xrootd_cache_eviction_meta_total_bytes | The total metadata storage available for the cache. | |
| xrootd_cache_eviction_meta_used_bytes | The used metadata storage for the cache. | |
| xrootd_cache_eviction_dir_num_ios | Number of I/Os per directory. | dir_name |
| xrootd_cache_eviction_dir_duration | Duration of I/Os per directory. | dir_name |
| xrootd_cache_eviction_dir_bytes | Bytes transferred per directory. | dir_name, type: hit, missed, bypassed, written |
| xrootd_cache_eviction_dir_st_block_bytes | Bytes from storage blocks per directory. | dir_name, type: added, removed |
| xrootd_cache_eviction_dir_n_cksum_errors | Number of checksum errors per directory. | dir_name |
| xrootd_cache_eviction_dir_files_count | File operations per directory (opened, closed, created, removed). | dir_name, type: opened, closed, created, removed |
| xrootd_cache_eviction_dir_directories_count | Directory operations (created, removed) per directory. | dir_name, type: created, removed |
| xrootd_cache_eviction_dir_last_access_time_seconds | Last access time per directory. | dir_name, type: open, close |
| xrootd_cache_eviction_dir_st_blocks_usage_count | Storage blocks usage per directory. | dir_name |
| xrootd_cache_eviction_dir_n_files_open_count | Number of open files per directory. | dir_name |
| xrootd_cache_eviction_dir_n_files_count | Number of files per directory. | dir_name |
| xrootd_cache_eviction_dir_n_directories_count | Number of directories per directory. | dir_name |

Director

up

The Pelican director scrapes Prometheus metrics from all origin and cache servers that successfully advertise to the director. This metric reflects the Pelican origin or cache servers that are scraped by the director.

Label: server_name

The name of the storage server. By default it’s the hostname.

Label: server_type

| Label Values | Description |
| --- | --- |
| Origin | Origin server |
| Cache | Cache server |

Label: server_url

The storage server XRootD url.

Label: server_web_url

The storage server web url.

Label: server_auth_url

The storage server authentication url.

Label: server_lat

The storage server latitude.

Label: server_long

The storage server longitude.

# of Active Origins and Caches

With the up metric, you can count the number of active origin and cache servers in the federation with a simple Prometheus query: count(up{server_type="Origin"}) for origin servers, or count(up{server_type="Cache"}) for cache servers.
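The same count can be derived client-side from the JSON returned by a query for up. By Prometheus convention, up is 1 when the most recent scrape of a target succeeded and 0 when it failed, so the sketch below counts only series with the value 1. It assumes the standard Prometheus instant-vector response shape. The sample payload and server names are made up, and count_up_by_type is a hypothetical helper.

```python
import json
from collections import Counter


def count_up_by_type(response_text: str) -> Counter:
    """Count series whose last scrape succeeded, grouped by server_type."""
    payload = json.loads(response_text)
    counts = Counter()
    for series in payload["data"]["result"]:
        # Each sample is [timestamp, "value"]; "1" means the scrape succeeded.
        if float(series["value"][1]) == 1.0:
            counts[series["metric"].get("server_type", "unknown")] += 1
    return counts


# Made-up response: two origins (one failing its scrape) and one cache.
sample = """{"status": "success", "data": {"resultType": "vector", "result": [
  {"metric": {"server_name": "origin-a", "server_type": "Origin"},
   "value": [1700000000, "1"]},
  {"metric": {"server_name": "origin-b", "server_type": "Origin"},
   "value": [1700000000, "0"]},
  {"metric": {"server_name": "cache-a", "server_type": "Cache"},
   "value": [1700000000, "1"]}
]}}"""

counts = count_up_by_type(sample)
print(dict(counts))
```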

pelican_director_advertisements_received_total

The accumulated number of origin/cache advertisements to the director. This metric shows if an origin/cache server successfully joins the federation or not. For origin servers, it also shows if each federation namespace prefix it exports passed director verification.

Note: This is a counter metric. Use rate(pelican_director_advertisements_received_total[5m]) to get advertisements per second, or increase(pelican_director_advertisements_received_total[1h]) to get total advertisements in the last hour.

Label: server_name

The name of the storage server. By default it’s the hostname.

Label: server_type

| Label Values | Description |
| --- | --- |
| Origin | Origin server |
| Cache | Cache server |

Label: server_web_url

The storage server web url.

Label: namespace_prefix

The federation namespace prefix the storage server exported.

Label: status_code

The status code of the director’s response. The most useful value is 403, which means the server advertisement didn’t pass the director’s verification.

| Label Values | Description |
| --- | --- |
| 200 | Advertisement succeeded |
| 403 | Advertisement verification failed |
| 500 | The director encountered an error when verifying or saving the advertisement |

pelican_director_stat_total

The accumulated number of stat queries the director made to origin/cache servers to check for object availability. Only available when Director.EnableStat is set to true. This metric is a good indicator of object availability and origin/cache service quality.

Note: Although this metric accumulates over time, it is exposed as a gauge (not a counter). Query directly to see the current accumulated total. The value may reset when the server restarts.

Label: server_name

The name of the storage server. By default it’s the hostname.

Label: server_type

| Label Values | Description |
| --- | --- |
| Origin | Origin server |
| Cache | Cache server |

Label: server_url

The storage server XRootD url.

Label: result

The stat query result.

| Label Values | Description |
| --- | --- |
| Succeeded | The object requested is on the server |
| NotFound | The requested object could not be found on the server |
| Timeout | The query exceeded the allotted time and was not completed |
| Cancelled | The query was cancelled because the maximum number of responses had been reached |
| Forbidden | The object request was denied due to lack of permissions or a missing token |
| UnknownErr | An unexpected error occurred, typically when the server refused to connect |

Label: cached_result

Whether the result was cached.

pelican_director_stat_active

The number of ongoing stat queries at the server. Note that Prometheus samples the metric value every 15s, while each stat request takes only ~10-100ms to finish, so this metric can’t capture short-lived, sub-second bursts of requests.

Note: This is a gauge metric representing the current number of active stat queries. Query directly to see the current count.

Label: server_name

The name of the storage server. By default it’s the hostname.

Label: server_type

| Label Values | Description |
| --- | --- |
| Origin | Origin server |
| Cache | Cache server |

Label: server_url

The storage server XRootD url.

pelican_director_total_ftx_test_suite

The number of file transfer test suites the director has issued. In Pelican, the director creates a test file and sends it to origin servers as a health test. It issues a test suite when it receives the registration from an origin server. Within a test suite, a timer runs a cycle of uploading, getting, and deleting the test file every 15 seconds. Each such cycle is called a “test run”. In theory, the director should issue only one test suite for each origin server. However, the registration information is stored in a TTL cache in the director and expires after a certain period of time, at which point the issued test suite is cancelled. A new test suite is then issued with the new registration. Thus, the director can issue multiple test suites to an origin server.

Note: This is a counter metric. Use rate(pelican_director_total_ftx_test_suite[5m]) to get test suites per second, or increase(pelican_director_total_ftx_test_suite[1h]) to get total test suites in the last hour.

Label: server_name

The name of the storage server. By default it’s the hostname.

Label: server_type

| Label Values | Description |
| --- | --- |
| Origin | Origin server |
| Cache | Cache server |

Label: server_web_url

The storage server web url.

pelican_director_active_ftx_test_suite

The number of active director file transfer test suites. As mentioned for the previous metric, test suites are individual tasks that run independently of the main program logic. This can cause a race condition in which an expired test suite has not been cleared but a new test suite is issued for the same origin. This metric records such conditions for debugging and monitoring. The value of the metric should be 1 at all times.

This metric shares the same labels as pelican_director_total_ftx_test_suite.

Note: This is a gauge metric representing the current number of active test suites. Query directly to see the current count.

pelican_director_total_ftx_test_runs

The number of file transfer test runs the director has issued. A “test run” is one upload/get/delete cycle of a test file against an origin. It executes every 15s (by default).

Note: This is a counter metric. Use rate(pelican_director_total_ftx_test_runs[5m]) to get test runs per second, or increase(pelican_director_total_ftx_test_runs[1h]) to get total test runs in the last hour.

This metric shares the same labels as pelican_director_total_ftx_test_suite, with two additions:

Label: status

| Label Values | Description |
| --- | --- |
| Success | The test run succeeded |
| Failed | The test run failed |

Label: report_status

| Label Values | Description |
| --- | --- |
| Success | The reporting to the origin of test run status succeeded |
| Failed | The reporting to the origin of test run status failed |

pelican_director_map_items_total

The total number of map items in the director, by the name of the map.

Note: This is a gauge metric representing the current number of map items. Query directly to see the current count.

Label: name

The name of the map. One of healthTestUtils, filteredServers, serverStatUtils, serverStatEntries.

pelican_director_ttl_cache

The statistics of various TTL caches.

Note: This is a gauge metric representing the current TTL cache statistics. Query directly to see the current values.

Label: name

The name of the cache. One of serverAds, jwks.

Label: type

The type of the statistic. One of evictions, insertions, hits, misses, total.

pelican_director_server_count

The number of servers currently recognized by the Director, delineated by pelican/non-pelican and origin/cache.

Note: This is a gauge metric representing the current number of servers. Query directly to see the current count.

Label: server_name

The name of the server.

Label: server_type

| Label Values | Description |
| --- | --- |
| Origin | Origin server |
| Cache | Cache server |

Label: from_topology

Whether the server was discovered from the topology.

pelican_director_client_requests_total

The total number of requests from clients.

Note: This is a counter metric. Use rate(pelican_director_client_requests_total[5m]) to get requests per second, or increase(pelican_director_client_requests_total[1h]) to get total requests in the last hour.

Label: version

The client version.

Label: service

The service that received the request.

pelican_director_redirects_total

The total number of redirects the director issued.

Note: This is a counter metric. Use rate(pelican_director_redirects_total[5m]) to get redirects per second, or increase(pelican_director_redirects_total[1h]) to get total redirects in the last hour.

Label: destination

The destination of the redirect.

Label: status_code

The status code of the redirect.

Label: version

The client version.

Label: network

The network of the client.

pelican_director_maxmind_server_errors_total

The total number of errors encountered trying to resolve server coordinates using the GeoIP MaxMind database.

Note: This is a counter metric. Use rate(pelican_director_maxmind_server_errors_total[5m]) to get errors per second, or increase(pelican_director_maxmind_server_errors_total[1h]) to get total errors in the last hour.

Label: network

The network address that was being resolved.

Label: server_name

The name of the server that was being resolved.

pelican_director_maxmind_client_errors_total

The total number of errors encountered trying to resolve client coordinates using the GeoIP MaxMind database.

Note: This is a counter metric. Use rate(pelican_director_maxmind_client_errors_total[5m]) to get errors per second, or increase(pelican_director_maxmind_client_errors_total[1h]) to get total errors in the last hour.

Label: network

The network address that was being resolved.

Label: project

The project of the client that was being resolved.

pelican_director_rejected_advertisements

The total number of advertisements rejected by the director.

Note: This is a counter metric. Use rate(pelican_director_rejected_advertisements[5m]) to get rejections per second, or increase(pelican_director_rejected_advertisements[1h]) to get total rejections in the last hour.

Label: hostname

The hostname of the server that sent the advertisement.

pelican_director_server_statusweight

The EWMA-smoothed status weight generated by the Director for each server.

Note: This is a gauge metric representing the current status weight. Query directly to see the current weight value.

Label: server_name

The name of the server.

Label: server_url

The URL of the server.

Label: server_type

| Label Values | Description |
| --- | --- |
| Origin | Origin server |
| Cache | Cache server |

Deprecated Metrics

The following metrics are deprecated and will be removed in a future release.

  • pelican_director_geoip_errors: [Deprecated — split into separate client/server metrics (pelican_director_maxmind_{server,client}_errors_total)] The total number of errors encountered trying to resolve coordinates using the GeoIP MaxMind database.
  • xrootd_monitoring_packets_received: Renamed to xrootd_monitoring_packets_received_total.
  • xrootd_transfer_readv_segments_count: Renamed to xrootd_transfer_readv_segments_total.
  • xrootd_transfer_operations_count: Renamed to xrootd_transfer_operations_total.
  • xrootd_server_connection_count: Renamed to xrootd_server_connections_total.
  • xrootd_server_bytes: Renamed to xrootd_server_bytes_total.
  • xrootd_server_io_wait_time: Renamed to xrootd_server_io_wait_seconds_total.
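Saved queries and dashboards that still use the renamed metrics can be updated mechanically. The sketch below rewrites the old names listed above to their replacements by matching whole identifiers, so names that already carry the new suffix are left alone. migrate_query is a hypothetical helper. The sketch does not parse PromQL, so it would also rewrite a quoted label value that happened to equal an old metric name. pelican_director_geoip_errors is omitted because it was split into two metrics rather than renamed.

```python
import re

# Old name -> new name, taken from the list of renamed metrics.
RENAMED = {
    "xrootd_monitoring_packets_received": "xrootd_monitoring_packets_received_total",
    "xrootd_transfer_readv_segments_count": "xrootd_transfer_readv_segments_total",
    "xrootd_transfer_operations_count": "xrootd_transfer_operations_total",
    "xrootd_server_connection_count": "xrootd_server_connections_total",
    "xrootd_server_bytes": "xrootd_server_bytes_total",
    "xrootd_server_io_wait_time": "xrootd_server_io_wait_seconds_total",
}

# Match complete identifiers so that, for example, xrootd_server_bytes_total
# is seen as one token and not as xrootd_server_bytes plus a suffix.
_IDENTIFIER = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def migrate_query(promql: str) -> str:
    """Replace deprecated metric names in a PromQL expression."""
    return _IDENTIFIER.sub(lambda m: RENAMED.get(m.group(0), m.group(0)), promql)


print(migrate_query("rate(xrootd_server_bytes[5m])"))
```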