LHCO module

Main script for LHC Olympics data processing.

This script applies jet clustering and feature engineering to the LHC Olympics datasets, which are provided as collections of 4-vectors for each event’s constituent particles. Because all events are independent, the processing can be heavily parallelized, which significantly reduces execution time.

Example

For script usage details use:

$ ./LHCO --help

Alternatively you can import this module for use within your custom pipeline:

from LHCO import clustering_mpi, params, merge

The project is not yet feature-complete; the remaining features will be implemented soon.

LHCO.clustering_mpi(path, j, max_events, chunk_size, tmp_dir, out_dir, out_prefix='results', quiet=False, scalars=True, images=False, **kwargs)

Applies clustering to LHC Olympics data using multiprocessing.

Main function performing the clustering. It spreads the input events across j logical cores and prints one progress bar per process, showing how far that process is through its chunk of events. Each process stores its result in a temporary file; at the end of the clustering all files are merged.

Parameters
  • path (Path) – path of .hdf input file containing LHCO data.

  • j (int) – Number of logical cores to distribute the load. If 0, then all available cores will be used.

  • max_events (int) – maximum number of events to be used. If 0, all events in the file will be used.

  • chunk_size (int) – number of events to be distributed to each job. If 0, then the chunk size will be adjusted so that the number of jobs is equal to the number of logical cores.

  • tmp_dir (Path) – path of the directory where temporary result files will be stored. All contents of the directory will be erased. If the directory does not exist, it will be created.

  • out_dir (Path) – path of the directory where the merged result will be saved.

  • out_prefix (str) – prefix used for the results’ filenames.

  • quiet (bool) – suppresses the output of tqdm progress bars

  • scalars (bool) – compute the scalar features after clustering

  • images (bool) – generate jet images after clustering

  • **kwargs – keyword arguments for specifying clustering parameters. Default values can be found in the LHCO.params dict.

Returns

None
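
The rules described above for j, max_events and chunk_size can be illustrated with a short sketch. It is not the module's implementation: plan_chunks is a hypothetical helper, and the use of ceiling division for the automatic chunk size is an assumption, so the real function may split the events differently.

```python
import os


def plan_chunks(n_events, j=0, max_events=0, chunk_size=0):
    """Hypothetical helper: return one (start, stop) index pair per job."""
    # j == 0 means "use all available logical cores"
    n_cores = j if j > 0 else (os.cpu_count() or 1)
    # max_events == 0 means "use every event in the file"
    total = n_events if max_events == 0 else min(max_events, n_events)
    if total == 0:
        return []
    # chunk_size == 0: pick a size that yields at most one job per core
    # (assumption: ceiling division)
    if chunk_size == 0:
        chunk_size = -(-total // n_cores)
    return [(start, min(start + chunk_size, total))
            for start in range(0, total, chunk_size)]


chunks = plan_chunks(n_events=1_000_000, j=4)
print(len(chunks), chunks[0], chunks[-1])
```

With an explicit chunk_size smaller than the automatic value, the number of jobs exceeds the number of cores and the jobs are processed in several rounds.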

LHCO.data_urls = {'BBOX1': 'https://zenodo.org/record/4536624/files/events_LHCO2020_BlackBox1.h5?download=1', 'BBOX2': 'https://zenodo.org/record/4536624/files/events_LHCO2020_BlackBox2.h5?download=1', 'BBOX3': 'https://zenodo.org/record/4536624/files/events_LHCO2020_BlackBox3.h5?download=1', 'BBOXMC': 'https://zenodo.org/record/4536624/files/events_LHCO2020_backgroundMC_Pythia.h5?download=1', 'RnD': 'https://zenodo.org/record/4536377/files/events_anomalydetection.h5?download=1', 'RnD_3prong': 'https://zenodo.org/record/4536377/files/events_anomalydetection_Z_XY_qqq.h5?download=1'}

Zenodo download URLs for all LHC Olympics datasets.

Key-value pairs of Dataset Identifiers and URLs.

Available datasets are:

RnD, RnD_3prong, BBOXMC, BBOX1, BBOX2, BBOX3

Type

dict
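
As an illustration of how the dict might be used, the sketch below looks up a dataset identifier and derives a local file name from the URL path. The helper local_filename is hypothetical and not part of LHCO; two entries are copied from data_urls so that the example runs on its own.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Two entries copied from LHCO.data_urls so the sketch is self-contained
data_urls = {
    'RnD': 'https://zenodo.org/record/4536377/files/events_anomalydetection.h5?download=1',
    'BBOX1': 'https://zenodo.org/record/4536624/files/events_LHCO2020_BlackBox1.h5?download=1',
}


def local_filename(dataset):
    """Hypothetical helper: file name from the URL path, query string dropped."""
    if dataset not in data_urls:
        raise KeyError(f"unknown dataset {dataset!r}; choose from {sorted(data_urls)}")
    return PurePosixPath(urlparse(data_urls[dataset]).path).name


print(local_filename('RnD'))
```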

LHCO.download_file(url, path, descriptor=None, chunk_size=1048576, timeout=None)

Downloads a file to the specified path

Parameters
  • url (str) – URL of the file

  • path (Path) – The location where the file will be saved

  • descriptor (str) – Progress bar annotation

  • chunk_size (int) – Number of bytes per chunk

  • timeout (float or tuple) – Seconds to wait for the response

Returns

None
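
A chunked download of this kind can be sketched with the standard library alone. This is an assumption-laden illustration, not the function's actual code: the "float or tuple" timeout suggests the real implementation may use a different HTTP client, the hypothetical download_in_chunks below accepts only a single float, and the progress bar is omitted.

```python
import urllib.request
from pathlib import Path


def download_in_chunks(url, path, chunk_size=1048576, timeout=None):
    """Hypothetical sketch: stream url to path and return the bytes written."""
    path = Path(path)
    # Create the target directory if it does not exist yet
    path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(path, 'wb') as out:
        while True:
            # Read at most chunk_size bytes so the file never sits fully in memory
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written

# Example call (not executed here because it needs network access):
# download_in_chunks(data_urls['RnD'], 'data/events_anomalydetection.h5')
```

Reading in fixed-size chunks keeps memory use bounded regardless of the file size, which matters for multi-gigabyte event files.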

LHCO.masterkey_urls = {'BBOX1': 'https://zenodo.org/record/4536624/files/events_LHCO2020_BlackBox1.masterkey?download=1', 'BBOX3': 'https://zenodo.org/record/4536624/files/events_LHCO2020_BlackBox3.masterkey?download=1'}

URLs for LHC Olympics datasets’ masterkeys.

Key-value pairs of Dataset Identifiers and URLs. Only BBOX1 and BBOX3 datasets have masterkeys.

Type

dict

LHCO.merge(path, feature)

Merge all .hdf files in given directory.

This function is called once at the end of the clustering run to combine all the partial results obtained from the parallel clustering execution.

Returns

pd.DataFrame from merged .hdf files

LHCO.merge_all(tmp_dir, out_prefix, out_dir)

Merges the image and/or scalar files produced by clustering.

Parameters
  • tmp_dir (Path) – path of the directory where temporary result files will be stored. All contents of the directory will be erased. If the directory does not exist, it will be created.

  • out_dir (Path) – path of the directory where the merged result will be saved.

  • out_prefix (str) – prefix used for the results’ filenames.

Returns

None

LHCO.params = {'R': 1.0, 'R2': 0.2, 'cluster_algo': 'antikt', 'dcut': 0.1, 'masterkey': None, 'njets': 2, 'ptmin': 0, 'ptmin2': 0}

Default values for the data clustering parameters.

The parameters in question are:
  • R: Radius used in primary clustering.

  • njets: Number of jets expected per event.

  • cluster_algo: Algorithm used for primary clustering (see pyjet documentation).

  • masterkey: Path to masterkey file containing truth information.

  • R2: Radius for secondary clustering, used in the calculation of several substructure features which are dependent on sub-jets.

  • ptmin: Minimum pT cutoff of expected primary jets.

  • ptmin2: Minimum pT cutoff applied to subjets.

  • dcut: Minimum distance between exclusive sub-jets (also used for calculating substructure features).

Type

dict
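
One plausible way for keyword arguments to override these defaults is a dict overlay, sketched below. The helper resolve_params is hypothetical, and rejecting unknown keys is an assumption; the module may handle **kwargs differently.

```python
# Copy of the LHCO.params defaults listed above
params = {'R': 1.0, 'R2': 0.2, 'cluster_algo': 'antikt', 'dcut': 0.1,
          'masterkey': None, 'njets': 2, 'ptmin': 0, 'ptmin2': 0}


def resolve_params(**kwargs):
    """Hypothetical helper: defaults overlaid with user-supplied overrides."""
    unknown = set(kwargs) - set(params)
    if unknown:
        # Assumption: a misspelled parameter should fail loudly
        raise TypeError(f"unknown clustering parameter(s): {sorted(unknown)}")
    # Later keys win, so kwargs replace the defaults; params itself is untouched
    return {**params, **kwargs}


settings = resolve_params(R=0.8, ptmin=20)
print(settings['R'], settings['ptmin'], settings['njets'])
```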

LHCO.run_procs(procs, n_workers, bar=None)

Parallel execution of a collection of processes across n_workers.

Called by the clustering_mpi function after setting up the jobs, this function manages the execution of the parallel processes. It can optionally display a progress bar for completed jobs.

Parameters
  • procs (list) – Collection of mpi.Process objects to be executed

  • n_workers (int) – Number of physical cores available for parallel execution of jobs

  • bar (obj) – Instance of tqdm.tqdm used for displaying progress
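
The scheduling described above, with at most n_workers jobs alive at any time, can be sketched as a simple polling loop. This is an illustration under assumptions rather than the module's code: run_limited and its poll argument are hypothetical, the progress bar is left out, and the demo uses threading.Thread objects, which expose the same start, is_alive and join methods as multiprocessing.Process, so that the example runs unchanged on any platform.

```python
import threading
import time


def run_limited(procs, n_workers, poll=0.01):
    """Hypothetical sketch: start every job, keeping at most n_workers alive."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    pending = list(procs)
    running = []
    while pending or running:
        # Reap finished jobs to free their slots
        for p in [p for p in running if not p.is_alive()]:
            p.join()
            running.remove(p)
        # Fill the free slots from the pending queue
        while pending and len(running) < n_workers:
            p = pending.pop(0)
            p.start()
            running.append(p)
        if running:
            time.sleep(poll)  # avoid a busy loop while waiting


# Demo with threads standing in for processes
done = []
jobs = [threading.Thread(target=done.append, args=(i,)) for i in range(5)]
run_limited(jobs, n_workers=2)
print(sorted(done))
```

A progress bar could be advanced at the point where a finished job is reaped, since that is where each completion is detected.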