Frequently Asked Questions & Troubleshooting

Data Access Model

Can I use Pelican Platform to search for the data I need?

When a namespace is configured to allow listings, Pelican clients let users search for objects by prefix, similar to the way you might check for files on a computer with the terminal command ls /foo/bar/. However, this kind of searching is limited in that it can only tell you the names of objects under a prefix as defined by the prefix owner – it does not necessarily tell you anything about the contents of objects, and it does not guarantee the data owner has chosen descriptive, reasonable names. This is why Pelican encourages leaving breadcrumbs of data provenance in namespace prefixes, because it gives users information about who to contact for more information. See Choosing a Namespace Prefix for more information.

Can Pelican be used to satisfy “FAIR” data requirements?

Supporting FAIR data – that is, making data “Findable”, “Accessible”, “Interoperable” and “Reusable” – is one of Pelican’s main goals, because we believe that FAIR data is the bedrock of robust, open science.

However, at this time Pelican’s primary targets are data accessibility and interoperability, which Pelican supports through its federated approach to data management (accessibility) along with its tight integration with HTCSS and caching mechanisms (interoperability).

While Pelican lets users discover object names by prefix/namespace, Pelican does not meet the full set of requirements for “Findability” because it has no native integration with data cataloging technologies, and it provides no way for the generic researcher to answer “where can I get data related to ABC that looks like XYZ”. However, Pelican has partnered with the National Data Platform to explore these concepts further.

Finally, Pelican does not address data “Reusability” because it has no archival features, and its “object immutability” rules are to prevent undefined behavior, not to guarantee that object contents never change.

All this being said, researchers who use Pelican can still help address “Findability” and “Reusability” through practicing good data hygiene, such as by choosing a good namespace prefix and structuring object/file names according to best practices. See Choosing a Namespace Prefix for more information.

Using Pelican Platform

I am using _ computer with _ operating system - can I still use Pelican to download objects?

Yes!

The Pelican Platform provides several ways of accessing objects via a Pelican Federation using the Pelican Client. The most commonly used is the Pelican Command-Line Interface (aka Pelican CLI), which is a standalone program that should be compatible with most modern computers, and it does not require admin permissions to install or use. As long as you have an internet connection and know how to access the object via a Pelican Federation, you just need to download the Pelican CLI!

Visit Getting Started - Accessing Data to get started with the Pelican CLI. To learn more about the Pelican Clients, visit Getting Data With Pelican.

Why do `osdf:///` URLs use 3 slashes instead of two slashes like regular URLs?

TL;DR: osdf:/// URLs use three slashes because the osdf scheme inherently specifies the networked system, making the federation hostname redundant. This is similar to how file:/// URLs work, where the triple slash indicates the resource is on the local machine, eliminating the need for a hostname. Most Pelican clients should detect and handle osdf://-schemed URLs if they’re missing the third slash, but this is technically an abuse of the well-defined URL structure.

URLs, or Uniform Resource Locators, play a crucial role in the way computers are able to discover, locate and access digital resources. Since their broad adoption in the mid 1990s, their structure has become a well-defined internet standard¹. To quote their definition:

“Uniform Resource Locators” (URLs), in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network “location”).²

Whether you’re trying to access a remote PDF or watch your favorite playlist of YouTube cats, the URL you give your browser conveys important information about what it’s supposed to find, where it can look, and how it should be accessed.

Understanding a URLs basic components will help answer why osdf:/// URLs typically require triple slashes.

For our intents and purposes, URLs contain three main parts — a scheme, a hostname and a path, where these pieces can be loosely defined as follows ³:

scheme: A URL’s scheme tells the computer how something should be accessed. In most cases, this specifies a protocol like https, ftp, or in our case pelican and osdf. Essentially, this tells your computer what “language” it needs to speak to interact with the resource.
hostname: The URL’s hostname gives the computer information about who/what remote resource might be able to fulfill your request.
path: The path component of a URL specifies the name of a requested resource from the requested hostname. Typically this is something like a specific web page or file.

These components are stitched together in a predictable fashion:


<scheme>://<hostname>/<path>

For example, when you visit https://docs.pelicanplatform.org/parameters, you’ve defined the URL scheme as https, the hostname as docs.pelicanplatform.org and the path as parameters. Together, these pieces tell your computer to use HTTPS to access the parameters page from Pelican’s docs.pelicanplatform.org documentation website.

The pelican-schemed URLs you use to access objects from Pelican federations follow the same setup, leading to URLs like:


pelican://osg-htc.org/some/object

Here, you’ve indicated you want to use the pelican protocol to interact with some/object from the osg-htc.org federation.

However, some URL schemes are inherent to a specific location and don’t need a hostname. If you’ve ever used a browser to open a PDF on your personal computer, you’ve likely seen a URL like file:///some/path/to/file.pdf. The triple slash after the file scheme happens because file already pre-supposes that the browser needs to get a file from the local machine, so the hostname information isn’t needed. It is equally valid to use the URL file://localhost/some/path/to/file.pdf, but that’s more to type! Instead, we wind up cutting out the redundant information to yield

file://~~localhost~~/some/path/to/file.pdf —> file:///some/path/to/file.pdf

Similarly, the osdf URL scheme already encodes two pieces of information — that you’re speaking Pelican and you’re talking to the OSDF, whose hostname is osg-htc.org. By using osdf URLs, you’ve indicated the object you’re interacting with is part of a specific networked system that should already be understood.

The hostname in the previous pelican-schemed URL matches the OSDF’s hostname, so it can be rewritten using an osdf url:

pelican://osg-htc.org/some/object —> osdf://~~osg-htc.org~~/some/object = osdf:///some/object

On the other hand, construction of a URL like osdf://some/object has the potential to confuse many clients that aren’t aware of the osdf:// protocol. That is because now there is part of the object’s name (“some”) where such clients might be expecting the federation’s hostname.

More information about pelican and osdf URLs can be found in our client usage docs.

1: For more information about the structure of URLs, see RFC 1738 .
2: For more information on the difference between URIs, URLs and URNs, see RFC 3986 .
3: URLs can also contain things like ports, query parameters and “fragments,” and while Pelican makes use of these, they aren’t as crucial to understanding the question at hand.

Can I run a Pelican Origin from my laptop?

You can, but it is not recommended.

While it is technically possible to run a Pelican Origin for data on your laptop, there are several reasons why you should not.

Whenever your laptop is closed, runs out of battery, or simply isn’t connected to the internet, the Origin would no longer be available to any researcher who wishes to access that data.
You would need to restart the Origin each time your laptop shut down.
There is the potential for high network usage on your laptop if a large amount of the data is requested at once.

Why isn’t Pelican using the closest cache(s) when I download objects?

Whenever a Pelican client tries to download an object, one of its first steps is to talk to the appropriate federation’s Director, where the Directors job is to match the client’s request to some service(s) that can best fulfill the request. This usually means giving the client an ordered list of caches that the Director thinks either have the object or that are capable of delivering the object quickly.

By default, Directors order this list by trying to determine the physical distance between the client and any caches in the federation with closer caches being assigned higher priority¹. This troubleshooting guide assumes the Director is configured for distance-based sorting. There are several ways this process can break.

Client Resolution

First, the Director uses the IP address of the incoming client request to generate a lat/long pair and confidence range for the client. It does this by running the client’s IP address through a local database². Issues that can occur at this stage include:

The IP address reported by the client is invalid, or in a private range (e.g. 192.168.0.12 for IPv4)
The IP address is valid, but the database doesn’t have an entry for it
The IP address is valid and has an entry, but the database reports a confidence range greater than 900km. In any of these cases, the Director will decide it can’t reasonably determine where the client is, and it will assign a temporary lat/long pair by picking a coordinate somewhere in the continental US. This coordinate is cached for a short time (~20 minutes), so subsequent requests from the same client will resolve to the same spot.

If the list of caches you see being tried look like they have a geographic center, but not the correct geographic center, you might try determining the IP address the Director sees when the client contacts it. This can be done by running:


curl ifconfig.me

and running the resulting IP address through MaxMind’s GeoLite City demo . If the location it determines is incorrect, has a large accuracy radius, or appears to be otherwise invalid, that’s likely causing a problem.

NOTE: The database used by this demo is not exactly the same database used by the Director. If you see a problem here, there’s definitely an issue, but if this step yields the expected results, there may still be issues with the Director’s database. If the list of caches tried by your client(s) appear to have a geographic center that’s incorrect, contact your federation administrators to ask if they can create a manual override for your IP range.

Cache/Origin Resolution

Alternatively, the client’s location may be known by the Director, but locations for some caches (or even origins) in the list can’t be determined. While the Director can use a client’s IP address directly for geo-location, it uses a DNS lookup against cache/origin hostnames to determine IP addresses. Failure to produce an IP address in this step means something more fundamental is wrong with the cache/origin, and that it should be fixed before receiving any requests. However, it’s still possible that the resolved IP address is incorrect, or has the same types of issues client IPs might have with the MaxMind database. When this happens, the server should be sorted to the end of the potential list of servers.

Errors can compound if both of these issues (client and cache/origin geo-location failures) occur. If the cache list you see from the Director has no discernible geographic center, you should contact your federation administrators for help debugging.

Finally, some Pelican clients may allow you to select a cache manually if you have a strong preference for which cache to use. For example, the Pelican CLI lets you do this by specifying the cache’s hostname/port with the -c flag:


pelican object get -c https://my-chosen-cache:8443 pelican://osg-htc.org/some/object

NOTE: Specifying caches in this manner should only be done if you’re positive the Director is having sorting troubles — the Director’s sorting algorithms are no longer tied explicitly to geo-location, and it may be trying to make better decisions based on object availability and the detected load of various caches.

1: Directors may implement more intelligent cache selection schemes. For a full list of options, see the documentation for the Director’s Director.CacheSortMethod config parameter.
2: In particular, the Director uses the MaxMind GeoLite City database , which it updates twice weekly on Wednesdays and Fridays (shortly after the databases are updated upstream by MaxMind).

How can I tell what services my Pelican client will talk to before I try to get/put objects?

Pelican is built on HTTP, so any client that speaks HTTP can determine this. For more information about Pelican’s use of HTTP verbs, see Getting Data With Pelican These instructions are for using curl on the command line. Broadly speaking, they require:

Finding the hostname of your federation’s Director
Determining whether you want the Director to provide a list of caches or origins
Creating the correct curl command
Interpreting the Director’s response headers

If you don’t have a working understanding of how Pelican finds and uses Directors in a federation, see About Pelican/A First Look Under The Hood for a quick recap.

Discovering your Director

If you have a pelican-schemed URL and a terminal with curl, you have everything you need to get started. First, get the federation discovery URL from your pelican url, e.g. pelican://osg-htc.org/some/object results in the discovery URL of osg-htc.org

Next, curl the discovery URL at the /.well-known/pelican-configuration path:


$ curl https://osg-htc.org/.well-known/pelican-configuration
{
  "director_endpoint": "https://osdf-director.osg-htc.org",
  "namespace_registration_endpoint": "https://osdf-registry.osg-htc.org",
  "jwks_uri": "https://osg-htc.org/osdf/public_signing_key.jwks"
}

If successful, this should return a JSON. Your federation’s Director is the URL provided by the director_endpoint key, e.g. https://osdf-director.osg-htc.org

Deciding Whether to Ask About Caches or Origins

Next, you should determine whether you expect your client to talk to a cache or an origin. The answer is usually a cache unless you are trying to run
a) GET operation using the “direct read” flags/URL query parameters, or
b) PUT operation to write an object via the origin.

The following steps are very similar in both cases, but require minor adjustments in the way you interact with the Director.

Discovering Caches

If you’re expecting your client to manipulate objects through a cache, you’ll point curl at the Director’s “object” discovery API, which uses the path /api/v1.0/director/object/<your namespace & object> path prefix.

For the previous pelican URL of pelican://osg-htc.org/some/object, you’d then construct the following curl command pointed at the Director with the correct path:


$ curl -v https://osdf-director.osg-htc.org/api/v1.0/director/object/some/object

If successful, the Director’s response will be a series of HTTP headers with information you can use to answer a variety of questions. In particular, the list of caches you can expect your client to try are included in the link header, which will look something like:


link:
<https://osdf-uw-cache.svc.osg-htc.org:8443/some/object>; rel="duplicate"; pri=1; depth=2,
<https://sc-cache.chtc.wisc.edu:8443/some/object>; rel="duplicate"; pri=2; depth=2,
<https://dtn-pas.cinc.nrp.internet2.edu:8443/some/object>; rel="duplicate"; pri=3; depth=2,
<https://osg-kansas-city-stashcache.nrp.internet2.edu:8443/some/object>; rel="duplicate"; pri=4; depth=2,
<https://its-condor-xrootd1.syr.edu:8443/some/object>; rel="duplicate"; pri=5; depth=2,
<https://osg-gftp2.pace.gatech.edu:8443/some/object>; rel="duplicate"; pri=6; depth=2

Discovering Origins

If you’re expecting your client to manipulate objects through an origin, you’ll point curl at the Director’s “origin” discovery API, which uses the path /api/v1.0/director/origin/<your namespace & object> path prefix.

Using the previous pelican-schemed URL, you’d then construct the following curl command:


$ curl -v https://osdf-director.osg-htc.org/api/v1.0/director/origin/some/object

Once again, the link header will include an ordered list of Origins that your client expects to try when looking for /some/object:


< link: <https://chtc-osdf-origin.chtc.wisc.edu:8443/some/object>; rel="duplicate"; pri=1; depth=1

This list is usually much shorter and likely contains a single Origin.

Frequently Asked Questions & Troubleshooting

Meta

Is it “Pelican” or “Pelican Platform”?

Data Access Model

Can I use Pelican Platform to search for the data I need?

Can Pelican be used to satisfy “FAIR” data requirements?

Using Pelican Platform

I am using _ computer with _ operating system - can I still use Pelican to download objects?

Why do `osdf:///` URLs use 3 slashes instead of two slashes like regular URLs?

Can I run a Pelican Origin from my laptop?

Why isn’t Pelican using the closest cache(s) when I download objects?

Client Resolution

Cache/Origin Resolution

How can I tell what services my Pelican client will talk to before I try to get/put objects?

Discovering your Director

Deciding Whether to Ask About Caches or Origins

Discovering Caches

Discovering Origins

Frequently Asked Questions & Troubleshooting

Meta

Is it “Pelican” or “Pelican Platform”?

How is the Pelican Platform related to the HTCondor Software Suite (HTCSS)?

Data Access Model

Can I use Pelican Platform to search for the data I need?

Can Pelican be used to satisfy “FAIR” data requirements?

Using Pelican Platform

I am using _____ computer with _____ operating system - can I still use Pelican to download objects?

Why do osdf:/// URLs use 3 slashes instead of two slashes like regular URLs?

Can I run a Pelican Origin from my laptop?

Why isn’t Pelican using the closest cache(s) when I download objects?

Client Resolution

Cache/Origin Resolution

How can I tell what services my Pelican client will talk to before I try to get/put objects?

Discovering your Director

Deciding Whether to Ask About Caches or Origins

Discovering Caches

Discovering Origins

I am using _ computer with _ operating system - can I still use Pelican to download objects?

Why do `osdf:///` URLs use 3 slashes instead of two slashes like regular URLs?