FAQs and Troubleshooting

Frequently Asked Questions & Troubleshooting

Why do osdf:/// URLs use 3 slashes instead of two slashes like regular URLs?

TL;DR: osdf:/// URLs use three slashes because the osdf scheme inherently specifies the networked system, making the federation hostname redundant. This is similar to how file:/// URLs work, where the triple slash indicates the resource is on the local machine, eliminating the need for a hostname. Most Pelican clients should detect and handle osdf://-schemed URLs if they're missing the third slash, but this is technically an abuse of the well-defined URL structure.

URLs, or Uniform Resource Locators, play a crucial role in how computers discover, locate, and access digital resources. Since their broad adoption in the mid-1990s, their structure has become a well-defined internet standard [1]. To quote their definition:

"Uniform Resource Locators" (URLs), in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location"). [2]

Whether you're trying to access a remote PDF or watch your favorite playlist of YouTube cats, the URL you give your browser conveys important information about what it's supposed to find, where it can look, and how it should be accessed.

Understanding a URL's basic components will help answer why osdf:/// URLs typically require triple slashes.

For our intents and purposes, URLs contain three main parts -- a scheme, a hostname and a path, where these pieces can be loosely defined as follows [3]:

  1. scheme: A URL's scheme tells the computer how something should be accessed. In most cases, this specifies a protocol like https, ftp, or in our case pelican and osdf. Essentially, this tells your computer what "language" it needs to speak to interact with the resource.
  2. hostname: The URL's hostname gives the computer information about who/what remote resource might be able to fulfill your request.
  3. path: The path component of a URL specifies the name of a requested resource from the requested hostname. Typically this is something like a specific web page or file.

These components are stitched together in a predictable fashion:

<scheme>://<hostname>/<path>

For example, when you visit https://docs.pelicanplatform.org/parameters, you've defined the URL scheme as https, the hostname as docs.pelicanplatform.org and the path as parameters. Together, these pieces tell your computer to use HTTPS to access the parameters page from Pelican's docs.pelicanplatform.org documentation website.
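
As a quick illustration, Python's standard urllib.parse module splits a URL into these same three pieces. This is only a sketch for building intuition; it says nothing about how Pelican clients parse URLs internally.

```python
from urllib.parse import urlsplit

# Split the documentation URL from the example above into its components.
parts = urlsplit("https://docs.pelicanplatform.org/parameters")

print(parts.scheme)    # https
print(parts.hostname)  # docs.pelicanplatform.org
print(parts.path)      # /parameters
```

Note that the parser reports the path with its leading slash, since that slash is what separates the path from the hostname.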

The pelican-schemed URLs you use to access objects from Pelican federations follow the same setup, leading to URLs like:

pelican://osg-htc.org/some/object

Here, you've indicated you want to use the pelican protocol to interact with some/object from the osg-htc.org federation.

However, some URL schemes are inherent to a specific location and don't need a hostname. If you've ever used a browser to open a PDF on your personal computer, you've likely seen a URL like file:///some/path/to/file.pdf. The triple slash after the file scheme appears because file already presupposes that the browser needs to get a file from the local machine, so the hostname information isn't needed. It is equally valid to use the URL file://localhost/some/path/to/file.pdf, but that's more to type! Instead, we wind up cutting out the redundant information to yield:

file://localhost/some/path/to/file.pdf --> file:///some/path/to/file.pdf

Similarly, the osdf URL scheme already encodes two pieces of information -- that you're speaking Pelican and you're talking to the OSDF, whose hostname is osg-htc.org. By using osdf URLs, you've indicated the object you're interacting with is part of a specific networked system that should already be understood.

The hostname in the previous pelican-schemed URL matches the OSDF's hostname, so it can be rewritten using an osdf URL:

pelican://osg-htc.org/some/object --> osdf://osg-htc.org/some/object = osdf:///some/object

On the other hand, a URL like osdf://some/object has the potential to confuse clients that aren't aware of the osdf scheme. Part of the object's name ("some") now sits where such clients expect the federation's hostname.
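
The sketch below shows how a generic URL parser (Python's urllib.parse) sees the two forms, and how an osdf URL could be rewritten as its pelican equivalent. The helper to_pelican_url and the constant OSDF_FEDERATION are hypothetical names invented for this illustration, not part of any Pelican client. Unlike the more lenient clients mentioned above, this helper simply rejects the two-slash form.

```python
from urllib.parse import urlsplit

# Federation hostname implied by the osdf scheme (taken from the text above).
OSDF_FEDERATION = "osg-htc.org"

def to_pelican_url(url):
    """Hypothetical helper: rewrite an osdf URL as the equivalent pelican URL."""
    parts = urlsplit(url)
    if parts.scheme != "osdf":
        raise ValueError(f"expected an osdf URL, got {url!r}")
    if parts.netloc not in ("", OSDF_FEDERATION):
        # e.g. osdf://some/object, where "some" landed in the hostname slot
        raise ValueError(f"unexpected hostname {parts.netloc!r} in {url!r}")
    return f"pelican://{OSDF_FEDERATION}{parts.path}"

# With three slashes the hostname slot is empty and the path is intact.
good = urlsplit("osdf:///some/object")
# With two slashes the first path segment is read as the hostname.
bad = urlsplit("osdf://some/object")

print(repr(good.netloc), good.path)  # '' /some/object
print(repr(bad.netloc), bad.path)    # 'some' /object
print(to_pelican_url("osdf:///some/object"))  # pelican://osg-htc.org/some/object
```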

More information about pelican and osdf URLs can be found in our client usage docs.

1: For more information about the structure of URLs, see RFC 1738.
2: For more information on the difference between URIs, URLs and URNs, see RFC 3986.
3: URLs can also contain things like ports, query parameters and "fragments," and while Pelican makes use of these, they aren't as crucial to understanding the question at hand.

Why isn't Pelican using the closest cache(s) when I download objects?

Whenever a Pelican client tries to download an object, one of its first steps is to talk to the appropriate federation's Director, where the Director's job is to match the client's request to some service(s) that can best fulfill the request. This usually means giving the client an ordered list of caches that the Director thinks either have the object or that are capable of delivering the object quickly.

By default, Directors order this list by estimating the physical distance between the client and each cache in the federation, with closer caches being assigned higher priority [1]. This troubleshooting guide assumes the Director is configured for distance-based sorting. There are several ways this process can break.
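
To make distance-based sorting concrete, here is a minimal sketch that ranks caches by great-circle (haversine) distance from a client. It is not the Director's actual implementation. The cache names and coordinates are made up for illustration, and the choice of the haversine formula is an assumption.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/long points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Hypothetical client location (roughly Chicago) and cache locations.
client = (41.88, -87.63)
caches = {
    "cache-atlanta.example": (33.75, -84.39),
    "cache-madison.example": (43.07, -89.40),
    "cache-kansas-city.example": (39.10, -94.58),
}

# Closest cache first, mirroring "closer caches get higher priority".
ranked = sorted(caches, key=lambda name: haversine_km(*client, *caches[name]))
print(ranked)  # Madison first, then Kansas City, then Atlanta
```

If the client's coordinates are wrong, every distance in this ranking is wrong too, which is why the failure modes below matter.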

Client Resolution

First, the Director uses the IP address of the incoming client request to generate a lat/long pair and confidence range for the client. It does this by running the client's IP address through a local database [2]. Issues that can occur at this stage include:

  • The IP address reported by the client is invalid, or in a private range (e.g. 192.168.0.12 for IPv4)
  • The IP address is valid, but the database doesn't have an entry for it
  • The IP address is valid and has an entry, but the database reports a confidence range greater than 900 km.

In any of these cases, the Director decides it can't reasonably determine where the client is and assigns a temporary lat/long pair by picking a coordinate somewhere in the continental US. This coordinate is cached for a short time (~20 minutes), so subsequent requests from the same client resolve to the same spot.
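
The first failure mode is easy to check yourself. The sketch below uses Python's standard ipaddress module to classify an address as invalid, private, or not private. The helper name describe_ip is hypothetical, and the check covers only the first bullet above; it cannot tell you whether the Director's database has a usable entry.

```python
import ipaddress

def describe_ip(text):
    """Hypothetical helper: classify an address string for geo-location purposes."""
    try:
        addr = ipaddress.ip_address(text)
    except ValueError:
        return "invalid"
    if addr.is_private:
        # Private ranges (e.g. 192.168.0.0/16) carry no geographic meaning.
        return "private"
    return "not private"

print(describe_ip("192.168.0.12"))  # private
print(describe_ip("999.1.1.1"))     # invalid
```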

If the list of caches you see being tried looks like it has a geographic center, but not the correct one, try determining the IP address the Director sees when the client contacts it. This can be done by running:

curl ifconfig.me

and running the resulting IP address through MaxMind's GeoLite City demo. If the location it determines is incorrect, has a large accuracy radius, or appears to be otherwise invalid, that's likely causing a problem.

NOTE: The database used by this demo is not exactly the same database used by the Director. If you see a problem here, there's definitely an issue, but if this step yields the expected results, there may still be issues with the Director's database. If the list of caches tried by your client(s) appears to have a geographic center that's incorrect, contact your federation administrators to ask if they can create a manual override for your IP range.

Cache/Origin Resolution

Alternatively, the client's location may be known by the Director, but locations for some caches (or even origins) in the list can't be determined. While the Director can use a client's IP address directly for geo-location, it uses a DNS lookup against cache/origin hostnames to determine IP addresses. Failure to produce an IP address in this step means something more fundamental is wrong with the cache/origin, and that it should be fixed before receiving any requests. However, it's still possible that the resolved IP address is incorrect, or has the same types of issues client IPs might have with the MaxMind database. When this happens, the server should be sorted to the end of the potential list of servers.
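
The "sorted to the end" behavior can be pictured with a small sketch. This is an illustration under assumptions, not the Director's code: the server names and distances are invented, and a missing location is represented here by a distance of None.

```python
# Hypothetical servers; distance_km is None when no location could be determined.
servers = [
    {"name": "cache-a.example", "distance_km": 950.0},
    {"name": "cache-b.example", "distance_km": None},
    {"name": "cache-c.example", "distance_km": 200.0},
]

# Sort known distances ascending; servers with unknown distance go last.
ranked = sorted(
    servers,
    key=lambda s: (s["distance_km"] is None, s["distance_km"] or 0.0),
)
print([s["name"] for s in ranked])
# ['cache-c.example', 'cache-a.example', 'cache-b.example']
```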

Errors can compound if both of these issues (client and cache/origin geo-location failures) occur. If the cache list you see from the Director has no discernible geographic center, you should contact your federation administrators for help debugging.

Finally, some Pelican clients may allow you to select a cache manually if you have a strong preference for which cache to use. For example, the Pelican CLI lets you do this by specifying the cache's hostname/port with the -c flag:

pelican object get -c https://my-chosen-cache:8443 pelican://osg-htc.org/some/object

NOTE: Specifying caches in this manner should only be done if you're positive the Director is having sorting troubles -- the Director's sorting algorithms are no longer tied explicitly to geo-location, and it may be trying to make better decisions based on object availability and the detected load of various caches.

1: Directors may implement more intelligent cache selection schemes. For a full list of options, see the documentation for the Director's Director.CacheSortMethod config parameter.
2: In particular, the Director uses the MaxMind GeoLite City database, which it updates twice weekly on Wednesdays and Fridays (shortly after the databases are updated upstream by MaxMind).

How can I tell what services my Pelican client will talk to before I try to get/put objects?

Pelican is built on HTTP, so any client that speaks HTTP can determine this. For more information about Pelican's use of HTTP verbs, see Getting Data With Pelican. The instructions below use curl on the command line. Broadly speaking, they require:

  1. Finding the hostname of your federation's Director
  2. Determining whether you want the Director to provide a list of caches or origins
  3. Creating the correct curl command
  4. Interpreting the Director's response headers

If you don't have a working understanding of how Pelican finds and uses Directors in a federation, see About Pelican/A First Look Under The Hood for a quick recap.

Discovering your Director

If you have a pelican-schemed URL and a terminal with curl, you have everything you need to get started. First, take the federation hostname from your pelican URL and use it as the discovery URL. For example, pelican://osg-htc.org/some/object yields the discovery URL https://osg-htc.org.

Next, curl the discovery URL at the /.well-known/pelican-configuration path:

$ curl https://osg-htc.org/.well-known/pelican-configuration
{
  "director_endpoint": "https://osdf-director.osg-htc.org",
  "namespace_registration_endpoint": "https://osdf-registry.osg-htc.org",
  "jwks_uri": "https://osg-htc.org/osdf/public_signing_key.jwks"
}

If successful, this returns a JSON document. Your federation's Director is the URL given by the director_endpoint key, e.g. https://osdf-director.osg-htc.org.
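
The same two steps can be scripted. The sketch below is a minimal illustration rather than anything shipped with Pelican: the helper names discovery_url and director_from_config are hypothetical, and the sample JSON is copied from the curl output above instead of being fetched over the network.

```python
import json
from urllib.parse import urlsplit

def discovery_url(pelican_url):
    """Hypothetical helper: build the federation discovery URL for a pelican URL."""
    parts = urlsplit(pelican_url)
    if parts.scheme != "pelican" or not parts.netloc:
        raise ValueError(f"expected pelican://<federation>/..., got {pelican_url!r}")
    return f"https://{parts.netloc}/.well-known/pelican-configuration"

def director_from_config(config_text):
    """Hypothetical helper: pull the Director URL out of the discovery JSON."""
    return json.loads(config_text)["director_endpoint"]

# Sample response copied from the curl output above.
SAMPLE_CONFIG = """
{
  "director_endpoint": "https://osdf-director.osg-htc.org",
  "namespace_registration_endpoint": "https://osdf-registry.osg-htc.org",
  "jwks_uri": "https://osg-htc.org/osdf/public_signing_key.jwks"
}
"""

print(discovery_url("pelican://osg-htc.org/some/object"))
# https://osg-htc.org/.well-known/pelican-configuration
print(director_from_config(SAMPLE_CONFIG))
# https://osdf-director.osg-htc.org
```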

Deciding Whether to Ask About Caches or Origins

Next, determine whether you expect your client to talk to a cache or an origin. The answer is usually a cache, unless you are trying to run:
  a) a GET operation using the "direct read" flags/URL query parameters, or
  b) a PUT operation to write an object via the origin.

The following steps are very similar in both cases, but require minor adjustments in the way you interact with the Director.

Discovering Caches

If you're expecting your client to manipulate objects through a cache, you'll point curl at the Director's "object" discovery API, which uses paths of the form /api/v1.0/director/object/<your namespace & object>.

For the previous pelican URL of pelican://osg-htc.org/some/object, you'd then construct the following curl command pointed at the Director with the correct path:

$ curl -v https://osdf-director.osg-htc.org/api/v1.0/director/object/some/object

If successful, the Director's response will be a series of HTTP headers with information you can use to answer a variety of questions. In particular, the list of caches you can expect your client to try is included in the link header, which will look something like:

link:
<https://osdf-uw-cache.svc.osg-htc.org:8443/some/object>; rel="duplicate"; pri=1; depth=2,
<https://sc-cache.chtc.wisc.edu:8443/some/object>; rel="duplicate"; pri=2; depth=2,
<https://dtn-pas.cinc.nrp.internet2.edu:8443/some/object>; rel="duplicate"; pri=3; depth=2,
<https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/some/object>; rel="duplicate"; pri=4; depth=2,
<https://its-condor-xrootd1.syr.edu:8443/some/object>; rel="duplicate"; pri=5; depth=2,
<https://osg-gftp2.pace.gatech.edu:8443/some/object>; rel="duplicate"; pri=6; depth=2

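If you'd rather not read the link header by eye, a few lines of Python can turn it into an ordered list. This sketch rests on several assumptions: parse_link_header is a hypothetical helper, it handles only the simple <url>; key=value shape shown above, and it assumes a lower pri value means the server is tried earlier, as the ordering in the example suggests.

```python
import re
from urllib.parse import urlsplit

def parse_link_header(value):
    """Hypothetical helper: parse a link header value into dicts sorted by pri."""
    entries = []
    # Each entry looks like: <url>; key=value; key="value"
    for match in re.finditer(r"<([^>]+)>([^<]*)", value):
        url, raw_params = match.group(1), match.group(2)
        params = {}
        for piece in raw_params.split(";"):
            piece = piece.strip().rstrip(",").strip()
            if "=" in piece:
                key, _, val = piece.partition("=")
                params[key.strip()] = val.strip().strip('"')
        entries.append({"url": url, **params})
    # Assumption: lower pri means the client tries that server earlier.
    return sorted(entries, key=lambda e: int(e.get("pri", 0)))

# Two entries taken from the example header above, deliberately out of order.
header = (
    '<https://sc-cache.chtc.wisc.edu:8443/some/object>; rel="duplicate"; pri=2; depth=2, '
    '<https://osdf-uw-cache.svc.osg-htc.org:8443/some/object>; rel="duplicate"; pri=1; depth=2'
)

hosts = [urlsplit(e["url"]).hostname for e in parse_link_header(header)]
print(hosts)
# ['osdf-uw-cache.svc.osg-htc.org', 'sc-cache.chtc.wisc.edu']
```
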
Discovering Origins

If you're expecting your client to manipulate objects through an origin, you'll point curl at the Director's "origin" discovery API, which uses paths of the form /api/v1.0/director/origin/<your namespace & object>.

Using the previous pelican-schemed URL, you'd then construct the following curl command:

$ curl -v https://osdf-director.osg-htc.org/api/v1.0/director/origin/some/object

Once again, the link header will include an ordered list of Origins that your client can be expected to try when looking for /some/object:

< link: <https://chtc-osdf-origin.chtc.wisc.edu:8443/some/object>; rel="duplicate"; pri=1; depth=1

This list is usually much shorter and likely contains a single Origin.
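
The cache and origin queries differ only in one path segment, so both URLs can be built by the same small function. The sketch below only constructs the URL to hand to curl; director_query_url is a hypothetical helper name, and the endpoint paths are the ones given in the sections above.

```python
from urllib.parse import urlsplit

def director_query_url(director_endpoint, pelican_url, kind="object"):
    """Hypothetical helper: build the Director URL for cache or origin discovery.

    kind="object" asks about caches, kind="origin" asks about origins.
    """
    if kind not in ("object", "origin"):
        raise ValueError(f"kind must be 'object' or 'origin', got {kind!r}")
    path = urlsplit(pelican_url).path  # e.g. "/some/object"
    return f"{director_endpoint.rstrip('/')}/api/v1.0/director/{kind}{path}"

director = "https://osdf-director.osg-htc.org"
url = "pelican://osg-htc.org/some/object"

print(director_query_url(director, url, "object"))
# https://osdf-director.osg-htc.org/api/v1.0/director/object/some/object
print(director_query_url(director, url, "origin"))
# https://osdf-director.osg-htc.org/api/v1.0/director/origin/some/object
```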