Data Staging and Transfer to Jobs¶
Overview¶
The PATh Facility is a distributed system: jobs can run in different physical locations, and the computers that execute them don't have direct access to files placed on the Access Point (e.g. in a /home directory on ap1.facility.path-cc.io). To run on this kind of system, jobs need to "bring along" the data, code, packages, and other files from the Access Point (where the job is submitted) to the PATh Facility execute points (where the job will run). HTCondor's file transfer tools and plugins make this possible: input and output files are specified as part of the job submission and then moved to and from the execution location.
This guide describes where to place files on PATh Facility Access Points, and how to use these files within jobs.
Data Spaces on the PATh Facility¶
There are two spaces for placing files on the PATh Facility Access Point, and each has a corresponding transfer method for referencing files in the submit file.
Location | File Sizes | Transfer Method | Initial Quota |
---|---|---|---|
/home/$USER | Input: less than 1GB per job; Output: less than 1GB per job | file paths in transfer_input_files | 50GB |
/path-facility/data/$USER | Input: greater than 1GB per job OR shared files used by many jobs; Output: greater than 1GB per job | osdf:/// links in transfer_input_files | 500GB / 250k items |
Space for Project or Public Data¶
The examples above are both specific to a single user. If you need to share files with other members of your group or project, or need to make files publicly available, please contact the facilitation team to arrange a project folder: support@path-cc.io
Transferring Data To/From HTCondor Jobs¶
Regardless of where data is placed, jobs should only be submitted with condor_submit from /home.
Transfer Smaller Job Input and Output Files to/from /home¶
You should use your /home directory to stage job files where:
* individual input files per job are less than 1GB per file, and if there are multiple files, they total less than 1GB
* output files per job are less than 1GB per file
Input Files from /home¶
To transfer input files from /home, list the files by name in the transfer_input_files submit file option. You can use either absolute or relative paths to your input files. Multiple files can be specified using a comma-separated list.
Some examples:
- Transferring multiple files from the submission directory:
    transfer_input_files = my_data.csv, my_software.tar.gz, my_script.py
- Transferring a file using an absolute path, which is useful if a file is not in the same directory tree as your submit file:
    transfer_input_files = /home/username/path/to/my_software.tar.gz
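To show where transfer_input_files sits in a complete submission, here is a minimal sketch of a submit file that stages small inputs from /home. The executable name (my_script.sh), its argument, the log file names, and the resource requests are placeholders chosen for illustration and are not values prescribed by the PATh Facility.

```
# Hypothetical job: the executable, arguments, and resource requests are placeholders
executable = my_script.sh
arguments  = my_data.csv

# Small input files, given relative to the directory containing this submit file
transfer_input_files = my_data.csv, my_software.tar.gz, my_script.py

log    = job.log
output = job.out
error  = job.err

request_cpus   = 1
request_memory = 2GB
request_disk   = 2GB

queue
```

Such a file would be submitted from within /home with condor_submit, as described above.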
Output Files to /home¶
By default, files created by your job are automatically returned to your /home directory, in the directory from which the job was submitted. If you would like a file to be returned to a different subfolder within your /home directory, use HTCondor's transfer_output_remaps option.
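As an illustration of transfer_output_remaps for /home, the fragment below sends one output file into a subfolder. The file name results.csv and the results folder are hypothetical, and the sketch assumes the folder already exists alongside the submit file; creating it before submitting is the safest approach.

```
# results.csv is a hypothetical file the job writes in its top-level working directory.
# The "results" folder is assumed to exist already, relative to the submit directory.
transfer_output_remaps = "results.csv = results/results_$(Cluster)_$(Process).csv"
```

Including $(Cluster) and $(Process) in the destination name is one way to keep jobs from overwriting each other's output when many jobs are queued from the same submit file.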
Transfer Larger Job Input and Output Files to/from /path-facility/data¶
You should use your /path-facility/data directory to stage job files where:
* individual input files per job are greater than 1GB per file
* an input file (of any size) is used by many jobs
* output files per job are greater than 1GB per file
Important Note: Large files stored in /path-facility/data are cached. If you change a file's contents but keep the same name, jobs may receive an older cached copy. Use a descriptive file name (possibly including a version name or date), or a directory structure with unique names, so that you know which version of the file your job is using.
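One possible naming scheme, shown here only as a sketch, puts a date in the file name so that each revision of a large input has its own unique path. The file name reference_db_2024-05-01.tar.gz is hypothetical.

```
# Hypothetical file name: the date in the name identifies the version.
# When the data changes, upload it under a new name and update this line.
transfer_input_files = osdf:///path-facility/data/<username>/reference_db_2024-05-01.tar.gz
```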
Behind the scenes, the files in /path-facility/data are distributed using a network called the Open Science Data Federation (or OSDF), which is why you'll see that acronym in the commands and variables below.
Input Files from /path-facility/data¶
To transfer input files from /path-facility/data, use the osdf:/// plugin syntax as part of the transfer_input_files submit file option.
Some examples:
- Transferring one file from /path-facility/data:
    transfer_input_files = osdf:///path-facility/data/<username>/InFile.txt
- When using multiple files from /path-facility/data, it can be useful to use HTCondor submit file variables to make your list of files more readable:
    # Define a variable (example: OSDF_LOCATION) equal to the
    # path you would like files transferred from, and call this
    # variable using $(variable)
    OSDF_LOCATION = osdf:///path-facility/data/<username>
    transfer_input_files = $(OSDF_LOCATION)/InputFile.txt, $(OSDF_LOCATION)/database.sql
Output Files to /path-facility/data¶
If you would like a job to transfer a large file back to your /path-facility/data directory, use the same osdf:/// plugin syntax as for input files, but with the HTCondor transfer_output_remaps submit file option. When transferring multiple files back to /path-facility/data in this way, separate the different files/remaps with a semicolon.
Some examples:
- Transferring one output file (OutFile.txt) back to /path-facility/data:
    transfer_output_remaps = "OutFile.txt=osdf:///path-facility/data/<username>/OutFile.txt"
- When transferring multiple files back to /path-facility/data, it can be useful to use HTCondor submit file variables to make your list of files more readable. Also note the semicolon separator in the list of output files.
    # Define a variable (example: OSDF_LOCATION) equal to the
    # path you would like files transferred to, and call this
    # variable using $(variable)
    OSDF_LOCATION = osdf:///path-facility/data/<username>
    transfer_output_remaps = "file1.txt = $(OSDF_LOCATION)/file1.txt; file2.txt = $(OSDF_LOCATION)/file2.txt; file3.txt = $(OSDF_LOCATION)/file3.txt"
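Putting the pieces together, the following sketch shows how a single submit file might combine small inputs from /home, a large input from /path-facility/data, and a large output sent back to /path-facility/data. The executable name, all file names, and the resource requests are hypothetical placeholders.

```
# A sketch only: the executable, file names, and resource requests are placeholders
executable = run_analysis.sh

# Large or shared files are staged under /path-facility/data
OSDF_LOCATION = osdf:///path-facility/data/<username>

# Small files come from /home by path; large files use the osdf:/// syntax
transfer_input_files = my_script.py, $(OSDF_LOCATION)/large_input.tar.gz

# The large output goes back to /path-facility/data under a per-job name;
# any other new files are returned to /home by default
transfer_output_remaps = "large_output.tar.gz = $(OSDF_LOCATION)/large_output_$(Cluster)_$(Process).tar.gz"

log    = job_$(Cluster)_$(Process).log
output = job_$(Cluster)_$(Process).out
error  = job_$(Cluster)_$(Process).err

request_cpus   = 1
request_memory = 4GB
request_disk   = 10GB

queue
```

The request_disk value here is a guess; in practice it should be large enough to hold the unpacked inputs plus everything the job writes.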
Moving Data to/from PATh Facility Access Points¶
In general, common file transfer tools such as rsync, scp, PuTTY, WinSCP, gFTP, etc. can be used to upload data from your computer or another server to your PATh Facility Access Point, or to download files from it. Files should be uploaded or created and staged in /home or /path-facility/data in preparation for use in jobs (as described above).
Check Your Quota and Available Space¶
Check your /home quota¶
To check your home quota and usage, run:
$ quota -vs
Check your /path-facility/data quota¶
For now, contact the facilitation team if you are unsure what your /path-facility/data quota is.
Request Quota Increase¶
Contact us at support@path-cc.io if you think you need a quota increase. We have space for substantial workloads when the need is communicated to us in advance.
Data Policies¶
In general, users are responsible for managing data and for using appropriate mechanisms for delivering data to/from jobs. Each space for data is controlled with a quota and should be treated as temporary storage for active job execution. The PATh Facility has no routine backup of data in these locations, and users should remove old data after jobs complete.
Data stored within /home and /path-facility/data is available only to your jobs, but highly sensitive data (e.g. HIPAA) should never be uploaded to PATh services.
PATh staff reserve the right to monitor and/or remove data without notice to the user if doing so is necessary for ensuring proper use or to quickly fix a performance or security issue. Additionally, users should not use PATh resources or services for long-term data storage (see above).