LHCO module
Main script for LHC Olympics data processing.
This script applies jet clustering and feature engineering to the LHC Olympics datasets, which are available as collections of 4-vectors associated with each event’s constituent particles. Because all events are independent, the processing can be heavily parallelized, which significantly reduces execution time.
Example
For script usage details use:
$ ./LHCO --help
Alternatively you can import this module for use within your custom pipeline:
from LHCO import clustering_mpi, params, merge
At the moment the full scope of this project is yet to be realized, but the remaining features are soon to be implemented.
- LHCO.clustering_mpi(path, j, max_events, chunk_size, tmp_dir, out_dir, out_prefix='results', quiet=False, scalars=True, images=False, **kwargs)
Applies clustering to LHC Olympics data using multiprocessing.
Main function performing the clustering. It spreads the input events across j logical cores and prints a progress bar showing each process’ progress towards completing a chunk of events. Each process stores its result in a temporary file; at the end of the clustering all files are merged.
- Parameters
path (Path) – path of .hdf input file containing LHCO data.
j (int) – Number of logical cores to distribute the load. If 0, then all available cores will be used.
max_events (int) – maximum number of events to be used. If 0, all events in the file will be used.
chunk_size (int) – number of events to be distributed to each job. If 0, then the chunk size will be adjusted so that the number of jobs is equal to the number of logical cores.
tmp_dir (Path) – path of the directory where temporary result files will be stored. All contents of the directory will be erased. If the directory does not exist, it will be created.
out_dir (Path) – path of the directory where the merged result will be saved.
out_prefix (str) – prefix used for the results’ filenames.
quiet (bool) – suppresses the output of tqdm progress bars
scalars (bool) – compute the scalar features after clustering
images (bool) – generate jet images after clustering
**kwargs – keyword arguments for specifying clustering parameters. Default values can be found in the LHCO.params dict.
- Returns
None
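The way j, max_events and chunk_size interact, as described in the parameter list above, can be sketched as follows. This is an illustration only, not the module's code: plan_chunks is a hypothetical helper, and both os.cpu_count() and the ceiling division are assumptions about how the zero defaults might be resolved.

```python
import math
import os


def plan_chunks(n_events_in_file, j=0, max_events=0, chunk_size=0):
    """Return (n_workers, [(start, stop), ...]) for a clustering run.

    Hypothetical helper: it mirrors the documented meaning of the
    parameters but is not part of LHCO.
    """
    # j == 0 means "use all available logical cores".
    n_workers = j if j > 0 else (os.cpu_count() or 1)

    # max_events == 0 means "use every event in the file".
    if max_events == 0:
        n_events = n_events_in_file
    else:
        n_events = min(max_events, n_events_in_file)

    # chunk_size == 0 means "aim for one job per logical core"
    # (assumed here to be a ceiling division of the event count).
    if chunk_size == 0:
        chunk_size = max(1, math.ceil(n_events / n_workers))

    # Half-open event ranges, one per job; the last one may be shorter.
    chunks = [(start, min(start + chunk_size, n_events))
              for start in range(0, n_events, chunk_size)]
    return n_workers, chunks
```

For example, plan_chunks(1000, j=4) yields four jobs of 250 events each, while plan_chunks(1000, j=2, max_events=100, chunk_size=30) yields four jobs, the last covering only events 90 to 99.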
- LHCO.data_urls = {'BBOX1': 'https://zenodo.org/record/4536624/files/events_LHCO2020_BlackBox1.h5?download=1', 'BBOX2': 'https://zenodo.org/record/4536624/files/events_LHCO2020_BlackBox2.h5?download=1', 'BBOX3': 'https://zenodo.org/record/4536624/files/events_LHCO2020_BlackBox3.h5?download=1', 'BBOXMC': 'https://zenodo.org/record/4536624/files/events_LHCO2020_backgroundMC_Pythia.h5?download=1', 'RnD': 'https://zenodo.org/record/4536377/files/events_anomalydetection.h5?download=1', 'RnD_3prong': 'https://zenodo.org/record/4536377/files/events_anomalydetection_Z_XY_qqq.h5?download=1'}
Zenodo download URLs for all LHC Olympics datasets.
Key-value pairs of Dataset Identifiers and URLs.
- Available datasets are:
RnD, RnD_3prong, BBOXMC, BBOX1, BBOX2, BBOX3
- Type
dict
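Since the values are plain URL strings, a local filename can be derived from a dataset identifier with the standard library alone. The sketch below copies two entries from the listing above so that it runs without the module; filename_for is a hypothetical helper, not part of LHCO.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Two entries copied from the LHCO.data_urls listing above.
data_urls = {
    "RnD": "https://zenodo.org/record/4536377/files/events_anomalydetection.h5?download=1",
    "BBOX1": "https://zenodo.org/record/4536624/files/events_LHCO2020_BlackBox1.h5?download=1",
}


def filename_for(dataset_id, urls=data_urls):
    """Hypothetical helper: derive a local filename from a dataset URL."""
    if dataset_id not in urls:
        raise KeyError(f"unknown dataset {dataset_id!r}; choose from {sorted(urls)}")
    # urlparse separates the path from the '?download=1' query string.
    return PurePosixPath(urlparse(urls[dataset_id]).path).name


print(filename_for("RnD"))  # events_anomalydetection.h5
```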
- LHCO.download_file(url, path, descriptor=None, chunk_size=1048576, timeout=None)
Downloads a file to the specified path
- Parameters
url (str) – URL of the file
path (Path) – The location where the file will be saved
descriptor (str) – Progress bar annotation
chunk_size (int) – Number of bytes per chunk
timeout (float or tuple) – Seconds to wait for the response
- Returns
None
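The chunked transfer implied by the chunk_size parameter can be illustrated with the standard library. This is a sketch under stated assumptions, not the implementation of LHCO.download_file: download_in_chunks is a hypothetical stand-in built on urllib.request, it omits the progress bar, and it accepts only a single number for timeout rather than the documented tuple form.

```python
from pathlib import Path
from urllib.request import urlopen


def download_in_chunks(url, path, chunk_size=1048576, timeout=None):
    """Stream `url` to `path` in fixed-size chunks; return bytes written.

    Hypothetical stand-in for illustration; not LHCO.download_file itself.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    # timeout=None waits indefinitely for the response.
    with urlopen(url, timeout=timeout) as response, open(path, "wb") as out:
        while True:
            # Read at most chunk_size bytes so memory use stays bounded.
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written
```

Reading in chunks keeps memory use independent of the file size, which matters for the multi-gigabyte event files; a progress bar could be advanced by len(chunk) inside the loop.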
- LHCO.masterkey_urls = {'BBOX1': 'https://zenodo.org/record/4536624/files/events_LHCO2020_BlackBox1.masterkey?download=1', 'BBOX3': 'https://zenodo.org/record/4536624/files/events_LHCO2020_BlackBox3.masterkey?download=1'}
URLs for LHC Olympics datasets’ masterkeys.
Key-value pairs of Dataset Identifiers and URLs. Only the BBOX1 and BBOX3 datasets have masterkeys.
- Type
dict
- LHCO.merge(path, feature)
Merge all .hdf files in given directory.
This function is called once at the end of the clustering run to combine all the partial results produced by the parallel clustering execution.
- Returns
pd.DataFrame from merged .hdf files
- LHCO.merge_all(tmp_dir, out_prefix, out_dir)
Merges the image and/or scalar files resulting from clustering.
- Parameters
tmp_dir (Path) – path of the directory where temporary result files will be stored. All contents of the directory will be erased. If the directory does not exist, it will be created.
out_dir (Path) – path of the directory where the merged result will be saved.
out_prefix (str) – prefix used for the results’ filenames.
- Returns
None
- LHCO.params = {'R': 1.0, 'R2': 0.2, 'cluster_algo': 'antikt', 'dcut': 0.1, 'masterkey': None, 'njets': 2, 'ptmin': 0, 'ptmin2': 0}
Default values for the data clustering parameters.
- The parameters in question are:
R: Radius used in primary clustering.
njets: Number of jets expected per event.
cluster_algo: Algorithm used for primary clustering (see pyjet documentation).
masterkey: Path to masterkey file containing truth information.
R2: Radius for secondary clustering, used in the calculation of several substructure features which are dependent on sub-jets.
ptmin: Minimum pT cutoff of expected primary jets.
ptmin2: Minimum pT cutoff applied to sub-jets.
dcut: Minimum distance between exclusive sub-jets (also used for calculating substructure features).
- Type
dict
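Keyword arguments passed to the clustering function take the place of these defaults. One plausible way to combine the two is sketched below; resolve_params is a hypothetical helper, and rejecting unknown keys is an assumption rather than documented behaviour.

```python
# Defaults copied from the LHCO.params listing above.
params = {
    "R": 1.0, "R2": 0.2, "cluster_algo": "antikt", "dcut": 0.1,
    "masterkey": None, "njets": 2, "ptmin": 0, "ptmin2": 0,
}


def resolve_params(defaults, **kwargs):
    """Hypothetical helper: overlay keyword overrides on the defaults."""
    unknown = set(kwargs) - set(defaults)
    if unknown:
        # Assumption: a misspelled parameter should fail loudly.
        raise TypeError(f"unknown clustering parameter(s): {sorted(unknown)}")
    # Later keys win, so kwargs replace defaults without mutating them.
    return {**defaults, **kwargs}


settings = resolve_params(params, R=0.8, ptmin=20)
print(settings["R"], settings["ptmin"], settings["njets"])  # 0.8 20 2
```

Building a new dict leaves the module-level defaults untouched, so successive runs with different overrides do not interfere with each other.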
- LHCO.run_procs(procs, n_workers, bar=None)
Parallel execution of a collection of processes across n_workers.
Called by the clustering_mpi function after setting up the jobs, this function manages the execution of the parallel processes. It can optionally display a progress bar for completed jobs.
- Parameters
procs (list) – Collection of mpi.Process objects to be executed
n_workers (int) – Number of physical cores available for parallel execution of jobs
bar (obj) – Instance of tqdm.tqdm used for displaying progress
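Running a list of prepared jobs with a cap on concurrency can be sketched as a simple polling loop. This is a minimal illustration, not the module's scheduler: run_limited and its poll_interval argument are hypothetical, and the function relies only on the start(), is_alive() and join() methods that multiprocessing.Process provides.

```python
import time


def run_limited(procs, n_workers, poll_interval=0.01):
    """Run `procs` with at most `n_workers` alive at once.

    Hypothetical sketch: works with any objects exposing start(),
    is_alive() and join(), such as multiprocessing.Process instances.
    Returns the number of completed jobs.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")

    pending = list(procs)
    running = []
    completed = 0
    while pending or running:
        # Reap finished jobs to free worker slots.
        still_running = []
        for proc in running:
            if proc.is_alive():
                still_running.append(proc)
            else:
                proc.join()
                completed += 1
        running = still_running

        # Fill the free slots with pending jobs, in submission order.
        while pending and len(running) < n_workers:
            proc = pending.pop(0)
            proc.start()
            running.append(proc)

        # Avoid a busy loop while jobs are still working.
        if running:
            time.sleep(poll_interval)
    return completed
```

A tqdm bar could be advanced with bar.update(1) at the point where completed is incremented. When the jobs are multiprocessing.Process objects on a platform that spawns rather than forks, the calling code should sit under an if __name__ == "__main__": guard.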