Federating Your Data

Federation Your Data via Pelican Origin

Pelican users who want to share data within a Pelican federation do so via an Origin. Origins are a crucial component of Pelican's architecture for two reasons: they act as an adapter between various storage backends and Pelican federations, and they provide fine-grained access controls for that data. That is, they figure out how to take data from wherever it lives (such as a POSIX filesystem, S3 buckets, HTTPS servers, etc.) and transform it into a format that the federation can utilize while respecting your data access requirements.

NOTE: An important distinction between origins and data backends is that, generally speaking, origins do NOT store any data themselves; their primary function is to facilitate data accessibility.

This document contains instructions on how to serve a Pelican origin on top of a variety of storage backend types.

Before Starting

Install Pelican

If you haven't installed Pelican, follow the instructions to install pelican.

For Linux users, it is recommended to install Pelican using one of the package managers (RPM, APK, Deb, etc.) so that Pelican dependencies are automatically handled. You may also run a Pelican Docker image to serve a Pelican origin. If you prefer to install Pelican as a standalone binary, you need to follow additional instructions (opens in a new tab) to install dependencies for the Pelican origin server.

NOTE: Serving origins with a standalone Pelican binary is possible, but not recommended.

MacOS and Windows users who want to serve a Pelican origin should use the Pelican Docker image.

Open Firewall Port for Pelican Origin

At their core, Pelican origins are web servers that listen to two TCP ports for file transfers and Web UI. By default, the Web UI and API interface for your origin will be at port 8444, and the port for object transfers will be at 8443. You may change these port numbers through the configuration file with parameters Server.WebPort and Origin.Port, respectively.

In order for Pelican origins to work properly, these ports need to be accessible by the federation, which in most cases means they need to be open to the internet. If your server host has a firewall policy in place, please open these two ports for both incoming the outgoing TCP requests.

NOTE: If it is not possible for you to expose any ports through the firewall (e.g. you're on a local network or behind a NAT), Pelican has a special feature called a Connection Broker that allows you to serve origins without publicly-accessible ports and TLS credentials. However, this is an experimental feature and requires the Pelican federation you are joining to be compatible. If you are interested in learning more about the Connection Broker, please contact help@pelicanplatform.org for further instructions.

Prepare TLS Credentials

Data transfers in Pelican rely on HTTPS, the encryption scheme used by everyone from banks to instagram that's responsible for securely transmitting data between internet-connected computers. To configure the origin with HTTPS, you'll first need to acquire three things:

  • A valid Transport Layer Security (TLS) certificate
  • The private key associated with the certificate
  • The Intermediate Certificate or the chain file, that establishes the trust chain to a root certificate

NOTE: For local development and testing, you may skip setting up TLS credentials by setting configuration parameter TLSSkipVerify to true. You should NOT set this for production, as it makes all data, including your passwords, available to anyone who can monitor your network.

You need to contact a Certificate Authority (CA) who owns the root certificate for getting these credentials. One popular CA that provides free TLS certificates is Let's Encrypt (opens in a new tab). You may follow their guide (opens in a new tab) to obtain the credentials listed above. Note that you need to have a valid domain before proceeding.

Once you go through the process, locate your credential files and set the following parameters to the file locations or copy the credential files to the default locations of the parameters:

  • Server.TLSCertificate

    • The certificate file from Let's Encrypt (or another CA), usually named as example.com.crt or example.com.pem
    • Default location: ~/.config/pelican/certificates/tls.crt for non-root users or /etc/pelican/certificates/tls.crt for root users.
  • Server.TLSKey

    • The private key corresponding to the TLS certificate, usually named as example.com.key.
    • Default location: ~/.config/pelican/certificates/tls.key for non-root users or /etc/pelican/certificates/tls.key for root users.
  • Server.TLSCACertificateFile

    • The intermediate certificate from Let's Encrypt to establish the trust chain to a root certificate, usually named as letsencrypt-intermediate.pem or chain.pem
    • Default location: ~/.config/pelican/certificates/tlsca.pem for non-root users or /etc/pelican/certificates/tlsca.pem for root users.

Since your TLS certificate is associated with your domain name, you will need to change the default hostname of Pelican server to be consistent. Set Server.Hostname to your domain name (e.g. example.com).

Launch the Origin with POSIX backend

When you've completed the aforementioned steps, you're ready to start configuring the origin that will add your data to a federation. Serving an origin is the process of taking some underlying storage repository and making its data accessible via a namespace prefix in your federation. For example, you might make files in the directory /my/directory available at the federation path /my/namespace so that anyone with access to the federation can get objects from the directory

By default, Pelican origins serve files from a POSIX backend, the filesystem used by Linux computers. However, Pelican aims to support a variety of backends and we currently also support serving objects from S3. Configuration for S3 is mostly similar to configuration for POSIX file systems, but with a few important differences. For information about S3 backends, refer to the S3 Backend page.

If you are running Pelican docker image to serve an origin, please refer to Pelican Docker image documentation.

Find a Federation to join

Before serving an origin, you need to decide which federation your data will be accessed through. For example, the Open Science Data Federation (OSDF) is Pelican's flagship federation, and if you are interested in serving an OSDF origin, you can refer to the OSDF website (opens in a new tab) for details about how to join.

Federations are identified their URL, which is used to host information that origins need for discovering other federation services. For example, the OSDF's federation URL is https://osg-htc.org, and an origin that joins the OSDF will visit https://osg-htc.org/.well-known/pelican-configuration to get important metadata about the federation's central services (the Director and Registry).

To point your origin at a specific federation, you can either pass the -f <federation URL> flag if running from the command line, or configure Federation.DiscoveryUrl: <federation URL> in your config yaml.

Starting the Origin

Origins can be configured via the command line, a config file named pelican.yaml, environment variables, or through a combinations of the three. While simple origins can be run entirely from command line arguments, more complex origins will require configuration your your pelican.yaml.

To start a simple pelican origin from the command line that serves POSIX data, run:

pelican origin serve -f <federation URL> -v </path/to/data>:</your/federation/prefix>

Where:

  • <federation URL> is the federation URL discussed above
  • </path/to/data> is the absolute path to the directory containing files you want to export as Pelican objects in your federation
  • </your/federation/prefix> is the federation prefix at which files in /path/to/data will be accessed from in the federation. Note that federation prefixes follow POSIX path conventions, and they must begin with / to denote an absolute path.

NOTE: By default, origins require authorization tokens for object access. Pelican currently does not support serving a public origin using only the command line, but various access controls can be configured through your configuration file. For more information, see origin capabilities below.

To run the same origin using a pelican.yaml configuration file, save your configuration to /etc/pelican/pelican.yaml if you're running Pelican as root, or at ~/.config/pelican/pelican.yaml if you're running as a non-root user. The command line origin from above could be configured accordingly:

pelican.yaml
# Tell Pelican which federation you're joining
Federation:
  DiscoveryUrl: <federation URL>
 
# Configure your Origin
Origin:
  # POSIX is the default storage type for Pelican origins
  # and can be omitted
  StorageType: "posix"
 
  Exports:
    - StoragePrefix: "/path/to/data"
      FederationPrefix: "/your/federation/prefix"
      # Explicitly state what capabilities you want this prefix to have
      Capabilities: ["Reads", "Writes"]
 

and then simply run

pelican origin serve

Pelican will read the config file and apply it to your origin.

Finally, origins can be configured to a limited extent with environment variables. In Pelican's environment variable model, configuration options are taken from pelican.yaml, flattened, and prepended with PELICAN prefix.

For example, you might configure the origin's storage type by setting the environment variable PELICAN_ORIGIN_STORAGETYPE=posix.

NOTE: Environment variable configuration does not support complex structures that can be built with yaml configuration, such as object-type parameters.

The first time the origin is started, you will see something that looks like the following:

$ pelican origin serve -f https://osg-htc.org -v $PWD:/demo
 
Pelican admin interface is not initialized
To initialize, login at https://localhost:8444/view/initialization/code/ with the following code:
551220

See the admin website configuration documentation section for more information about initializing your origin's admin website.

Origin and Namespace Capabilities

Origins and namespaces can be configured with a set of capabilities, which are the configuration options used to define data access controls. In the previous yaml configuration, we configured the origin capabilities using the Capabilities list of the Origin.Exports block by specifying that the federation prefix /your/federation/prefix supports "Reads" and "Writes". This list of capabilities can be used for further control of what types of operations the namespace is willing to support. Available capabilities include:

  • Reads: When included, objects from the namespace can be read with a valid authorization token.
  • PublicReads: When set, objects from the namespace become public and require no authorization to read.
  • Writes: When included, objects can be written back to the storage backend by Pelican. Write operations always require a valid authorization token.
  • DirectReads: When included, a namespace indicates that it is willing to serve clients directly and does not require data to be pulled through a cache. Disabling this feature may be useful in cases where the origin isn't very performant or has to pay egress costs when data moves through it. Note that this is respected by federation central services, but may not be respected by all clients.
  • Listings: When included, the namespace indicates it will allow object discovery. Be careful when setting this for authorized namespaces, as this will allow anyone to discover the names of objects exported by this namespace.

NOTE: Most origins should have either Reads or PublicReads enabled. If neither is set, the origin won't export any data.

There is an important distinction between origin capabilities and namespace capabilities. While it's sometimes easy to treat origins and namespaces as the same thing, Pelican must distinguish between them because two separate origins may export portions of the same namespace, and a single origin may export two disparate prefixes. The only exception to this rule is when a single origin serves a single namespace, or the origin exports multiple prefixes that should all have the same capabilities.

To configure origin capabilities, you can set top-level options for the origin:

  • Origin.EnableReads: When true, the origin supports reads that are accompanied by a valid authorization token.
  • Origin.EnablePublicReads: When true, the origin supports reads by anyone without an authorization token.
  • Origin.EnableWrites: When true, objects can be written back to the storage backend through the origin. Writes always require a valid authorization token.
  • Origin.EnableDirectReads: When true, the origin indicates it's willing to serve clients directly, potentially without caching data. Note that this is respected by federation central services, but may not be respected by all clients.
  • Origin.EnableListings: When true, the origin will allow object discovery.

If no Origin.Exports block is provided to Pelican, these values will also be applied to your federation prefix.

NOTE: Pelican tries to resolve differences between origin and namespace configurations by respecting the more restrictive of the two. If you serve an origin that enables public reads, but the underlying prefix it exports disables all reads, you won't be able to read from that namespace.

Multi-Export Origins

The previous examples have shown how one might export a single namespace, but Pelican origins can export multiple paths from the same storage backend under different namespaces. For example, assume you have have two POSIX directories called /my/data/public and /my/data/private. If you want to make your public data available under the namespace /my/prefix/public and your private data available under /my/prefix/private, you'll need to configure a multi-export origin, which is accomplished through the origin's Exports block. Below is an example of what that looks like, along with how you could configure access control for the two namespaces:

pelican.yaml
Federation:
  DiscoveryUrl: https://my-federation.com
 
Origin:
  StorageType: posix
 
  # The actual namespaces we export
  Exports:
    - StoragePrefix: /my/data/public
      FederationPrefix: /my/prefix/public
      # Don't set Reads -- it should be toggled true by setting PublicReads
      Capabilities: ["PublicReads", "Listings", "DirectReads"]
    - StoragePrefix: /my/data/private
      FederationPrefix: /my/prefix/private
      # We set "Reads" but not "PublicReads" indicating we want authorization
      Capabilities: ["Reads", "DirectReads"]

NOTE: While multiple namespaces can be exported by the same origin, they must all have the same underlying storage type. That is, if the origin serves files from POSIX, it must only serve files from POSIX and not S3.

Additional Command Line Arguments for Origins

This section documents additional arguments you can pass via the command line when serving origins.

  • -h or --help: Output documentation on the serve command and its arguments.

  • -m or --mode: Set the mode for the origin service ('posix'|'s3, default to 'posix').

  • -p or --port: Set the port at which the Pelican admin website should be accessible.

  • --writeable: A boolean value to allow or disable writing to the origin (default is true).

  • --config: Set the location of the configuration file.

  • -d or --debug: Enable debugging mode, which greatly increases the Pelican's logging verbosity

  • -l or --log: Set the location of a file that will capture Pelican logs. Setting this will prevent logging output from printing to your terminal.

For more information about available yaml configuration options, refer to the Parameters page.

Login to Admin Website

After your origin is running, the next step is to initialize its admin website, which can be used by administrators for monitoring and further configuration. To initialize this interface, go to the URL specified in the terminal. By default, it should point to https://localhost:8444/view/initialization/code/ (opens in a new tab)

You will be directed to the page to activate the website with a one-time passcode. Copy the passcode from the terminal where you launch Pelican origin and paste to the website to finish activation.

Screenshot of Pelican website activation page

In our case, it's 551220 from the example terminal above.

NOTE: that your one-time passcode will be different from the example.

NOTE: These one-time passcodes will be refreshed every minute. Find the latest passcode in the terminal before proceeding.

Set up password for the admin

After activating the website, you will be redirected to set up the password for the admin account. Type your password and re-type again to confirm. Then store this password in a safe location.

Screenshot of Pelican website reset password page

Visit the Origin's Dashboard Page

Once the password is confirmed, you will be redirected to the origin's dashboard page.

Screenshot of Pelican origin website dashboard page

The origin dashboard page consists of the following panels:

  • Status
  • Data Exports
  • Federation Overview
  • Transfer Rate

The Status panel shows information about the health status of the origin, delineated by the following components:

  • CMSD indicates whether the Cluster Management Service Distributed (CMSD) (opens in a new tab) daemon is running.
  • Director indicates whether the origin can advertise itself to its federation director, which is required for other members in the federation to discover your origin's existence and how to access objects from it.
  • Federation indicates whether the origin can fetch from the federation discovery URL, which contains locations of central services (Director and Registry).
  • Registry indicates whether the origin successfully registered the federation prefixes it exports at the Registry.
  • Web UI indicates whether the admin website is configured and running.
  • XRootD indicates whether Pelican's underlying file transfer software is functioning as expected.
Screenshot of the Status panel on Pelican origin website dashboard page

The Data Exports panel lists information about the federation prefixes that are currently being exported by the origin. It also shows each prefix's advertised capabilities, like "reads", "writes", or "listings".

If any of the namespace prefix registrations is missing required information, there will be a warning message and a button to "Complete Registration". Note that each namespace prefix is registered separately, and the Origin server itself also registers at the registry. It is advised that the Origin admin completes the registration as soon as they can.

Screenshot of the Data Exports panel on Pelican origin website dashboard page

The Federation Overview panel lists links to various federation services (director, registry, etc.). Note that the link to the Discovery item is the federation discovery URL where the metadata of a federation is located.

Screenshot of the Data Exports panel on Pelican origin website dashboard page

The Transfer Rate graph visualizes object transfer metrics like transfer speed for both receiving (rx) and transmitting (tx) data. You may change the time range of the graph by changing the Reporting Period and Graph Settings.

Screenshot of the graph panel on Pelican origin website dashboard page

NOTE: This graph may be empty when the origin first starts, as it takes several minutes to collect enough data for the display. Try refreshing the page after the origin has been running for ~5 minutes and you you should see data being aggregated.

For local deployment

When you hit the URL at https://localhost:8444/view/initialization/code/ (opens in a new tab), You may see a warning that looks like the following (with some differences depending on the browser you use):

Image of a safari browser invalid certificate warning

The warning is due to the fact that Pelican servers by default use https for network requests, which requires a set of TLS certificates to secure the connection between the server and the browser. If you don't have TLS certificates properly configured and you turned on the TLSSkipVerify configuration parameter, then the origin will generate a set of self-signed certificates that are not trusted by the browser.

For local testing, it's OK to proceed with the warning for local deployment.

Test Origin Functionality

Once you have your origin set up, follow the steps below to test if your origin can serve a file through a Pelican federation. It's best to test your origin while it's serving public data to minimize the risk that any test tokens you generate may be malformed and the reason objects can't be pulled through the origin.

  1. Create a test file under the directory on your host machine that binds to a Pelican namespace. This the <local_directory> in -v <local_directory>:<namespace_prefix> argument when you run the Pelican origin. Assuming your directory is /tmp/demo, run the following command to create a test file named testfile.txt under /tmp/demo

    echo "This is a test file.\n" > /tmp/demo/testfile.txt
  2. In a separate terminal, run the following command to get the data from your origin through the Pelican federation

    $ cd ~
    $ pelican object get -f <federation> <namespace_prefix>/testfile.txt .

    Where:

    • cd ~ switches the working directory to your home directory
    • <federation> is the same URL you pass to -f argument when running the origin
    • <namespace_prefix> is the -v <local_directory>:<namespace_prefix> argument when running the origin.

    You should see the output like the following:

    $ pelican object get -f <federation> /demo/testfile.txt .
    testfile.txt 36.00 b / 36.00 b [=============================================================================================] Done!
  3. Confirm the file content by running:

    $ cat testfile.txt
    This is a test file.

Congratulations! Your have finished setting up and running your origin.

Next Steps

Obtain the Public Key of the Origin

Pelican origins follow OpenID Connect (OIDC) (opens in a new tab) protocol for authenticating users for object access. In the OIDC setup, a user's identity is encoded in a token that is signed by the origin, using its private key. The private key only lives on the origin server and is never exposed to the public. To verify the integrity the token and make sure it's not tampered by malicious attackers, a public key generated from the private key is available for others to use to check the signature of the token.

The public keys are used in a Pelican federation to facilitate user authentication and server identity verification. They are also used to register a namespace at the Pelican registry. The public key used to register a namespace at the Pelican registry should be the public key of the Pelican origin that exports and hosts the namespace. This is because the Pelican origin manages user identity verification and authorization for the namespaces it exports.

Obtain the Public Key via the Admin Website

To get the public key of via the admin website, follow the steps below.

  • Setup and login to Pelican Origin admin website
  • On the home page, navigate to the Data Exports panel on the top right
  • Click the icon to download the public key

Obtain the Public Key via Direct Download

To get the public key of via direct download, go to https://<origin-hostname>:<origin-web-port>/.well-known/issuer.jwks. The public key will be automatically downloaded as a JSON file. An example file is as following:

issuer.jwks
{
   "keys":[
      {
         "kty":"EC",
         "use":"sig",
         "crv":"P-256",
         "kid":"sig-1712605941",
         "x":"c3BhLEMGqmO98-43pD3R5DJtUGIQf4McsNBWuVTorVI",
         "y":"eWssaV6WrG4kVla1Ygdr_qUZeqgN7I2fDeb3K1qL0qI",
         "alg":"ES384"
      }
   ]
}

Note that you need to run your Pelican origin before downloading the public key