
Cellular Content Exploration

Cellular content exploration module

For the exploration step, MiLoPYP learns an embedding space for 3D macromolecules such that similar structures are grouped together while dissimilar ones are separated. The embedding space is then projected into 2D and 3D, which allows easy identification of the distribution of macromolecular structures across an entire dataset.

Input preparation

There are two different modes for this module:

2d3d: Inputs to this module include aligned 2D tilt series (.mrc, or .ali as in sample_data), the corresponding tilt angles (.tlt), and 3D tomograms reconstructed from the tilt series (.rec).

3d: Input to this module is 3D reconstructed tomograms (.rec) alone.

For 2d3d mode, the training file should be a .txt file with the tomogram name, path to the 2D tilt series, path to the corresponding angles, and path to the 3D reconstructed tomogram, in the following format:

image_name   rec_path    tilt_path   angle_path
tomo1   path_to_rec_1   path_to_tilt_1   path_to_angle_1
tomo2   path_to_rec_2   path_to_tilt_2   path_to_angle_2
...

For example, suppose the tilt series, corresponding angles, and reconstructed tomograms are stored in a directory named sample_data; the training txt file will then look like this:

image_name   rec_path    tilt_path   angle_path
tomo1   sample_data/tomo1.rec   sample_data/tomo1.mrc   sample_data/tomo1.tlt
tomo2   sample_data/tomo2.rec   sample_data/tomo2.mrc   sample_data/tomo2.tlt
...

For 3d mode, the training txt file only needs the tomogram name and rec_path; tilt_path and angle_path are not needed. The file therefore has the following format:

image_name   rec_path
tomo1   path_to_rec_1
tomo2   path_to_rec_2
...
It is also fine to use the same file formatted for 2d3d mode; only rec_path will be used.
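As a concrete illustration, here is a minimal sketch, assuming tomograms named tomo1 and tomo2 stored under sample_data/, that writes training files in both layouts. The delimiter, tomogram names, and output file names are assumptions for the example, not MiLoPYP code.

```python
# Minimal sketch (not part of MiLoPYP): write the two training-file layouts
# described above. Paths, names, and delimiter are placeholders.
from pathlib import Path

data_dir = Path("sample_data")      # directory holding the .rec/.mrc/.tlt files
tomo_names = ["tomo1", "tomo2"]     # hypothetical tomogram names

# 2d3d mode: image_name, rec_path, tilt_path, angle_path
with open("sample_train_explore_img.txt", "w") as f:
    f.write("image_name\trec_path\ttilt_path\tangle_path\n")
    for name in tomo_names:
        f.write(f"{name}\t{data_dir / (name + '.rec')}\t"
                f"{data_dir / (name + '.mrc')}\t{data_dir / (name + '.tlt')}\n")

# 3d mode: image_name and rec_path only
with open("sample_train_explore_img_3d.txt", "w") as f:
    f.write("image_name\trec_path\n")
    for name in tomo_names:
        f.write(f"{name}\t{data_dir / (name + '.rec')}\n")
```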

Warning

Make sure the 2D tilt series are aligned tilt series. They also need to have the same x-y dimensions as the 3D tomograms. We typically recommend using downsampled tilt series and 3D tomograms (with x-y dimensions below 2000 px).
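To check the condition in the warning above, a short sketch like the following could compare the x-y header dimensions of a tilt series and its tomogram. It assumes the files are MRC-format and that the mrcfile package is installed; the file paths are placeholders.

```python
# Hedged sanity check (not part of MiLoPYP): confirm that an aligned tilt
# series and its reconstructed tomogram share the same x-y dimensions.
import mrcfile

def xy_dims(path):
    # read only the MRC header and return the (nx, ny) dimensions
    with mrcfile.open(path, permissive=True, header_only=True) as m:
        return int(m.header.nx), int(m.header.ny)

rec_xy = xy_dims("sample_data/tomo1.rec")    # placeholder paths
tilt_xy = xy_dims("sample_data/tomo1.mrc")
print("tomogram x-y:", rec_xy, "tilt series x-y:", tilt_xy)
assert rec_xy == tilt_xy, "tilt series and tomogram must share x-y dimensions"
```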

Once the files are generated, move all training files to the data/ directory (create the data/ directory if it doesn't exist).

Training

To train the exploration module in 2d3d mode (with tilt series and tomograms), run:

python simsiam_main.py simsiam2d3d --num_epochs 300 --exp_id test_sample --bbox 36 --dataset simsiam2d3d --arch simsiam2d3d_18 --lr 1e-3 --train_img_txt sample_train_explore_img.txt --batch_size 256 --val_intervals 20 --save_all --gauss 0.8 --dog 3,5
For 2d3d mode, training-related files will be saved to exp/simsiam2d3d/test_sample, including a log file with the loss and all arguments used.

To train the exploration module in 3d mode (using tomograms only), run:

python simsiam_main.py simsiam3d --num_epochs 300 --exp_id test_sample --bbox 36 --dataset simsiam3d --arch simsiam2d_18 --lr 1e-3 --train_img_txt sample_train_explore_img.txt --batch_size 256 --val_intervals 20 --save_all --gauss 0.8 --dog 3,5
For 3d mode, training-related files will be saved to exp/simsiam3d/test_sample, including a log file with the loss and all arguments used.

| Arguments | Purpose |
| --- | --- |
| num_epochs | number of training epochs, 100 to 300 recommended |
| exp_id | experiment id to save the run as |
| bbox | bounding box size for cropped patches, should be bigger than the particles |
| dataset | sampling and dataloader mode |
| arch | model backbone architecture |
| lr | learning rate |
| train_img_txt | input txt file for training |
| batch_size | batch size for training |
| val_intervals | save the model every val_intervals epochs |
| save_all | whether to save a model at each val_interval |
| gauss | Gaussian filter used to denoise tilt series and tomograms during preprocessing |
| dog | kernel sizes for the difference-of-Gaussians pyramid, comma delimited |

For more information about the arguments, see opts.py.
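For intuition about the gauss and dog arguments, the sketch below shows a generic difference-of-Gaussians computed with scipy. It is only a conceptual illustration of the technique, not MiLoPYP's own preprocessing code, and the sigma values simply mirror the example arguments above.

```python
# Conceptual difference-of-Gaussians (DoG) illustration; not MiLoPYP code.
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_response(image, sigma_small=3.0, sigma_large=5.0, denoise_sigma=0.8):
    # light Gaussian denoising (cf. --gauss 0.8), then subtract two blurred
    # versions (cf. --dog 3,5) to emphasize blob-like densities between scales
    denoised = gaussian_filter(image, sigma=denoise_sigma)
    return gaussian_filter(denoised, sigma_small) - gaussian_filter(denoised, sigma_large)

slice_2d = np.random.rand(256, 256).astype(np.float32)   # stand-in for a tilt image
print(dog_response(slice_2d).shape)
```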

Inference

After training, use the trained model to map tomograms/tilt series into embeddings.

For 2d3d mode, run:

python simsiam_test_hm_2d3d.py simsiam2d3d --exp_id test_sample --bbox 36 --dataset simsiam2d3d --arch simsiam2d3d_18 --test_img_txt sample_train_explore_img.txt --load_model exp/simsiam2d3d/test_sample/model_300.pth --gauss 0.8 --dog 3,5

For 3d mode, run:

python simsiam_test_hm_3d.py simsiam3d --exp_id test_sample --bbox 36 --dataset simsiam3d --arch simsiam2d_18 --test_img_txt sample_train_explore_img.txt --load_model exp/simsiam3d/test_sample/model_300.pth --gauss 0.8 --dog 3,5
Here we are using the trained model from the 300th epoch.

Note: Please make sure to use the same architecture, bounding box size, gauss, and dog arguments for both training and inference, and select a properly trained model.

For example, if --bbox 36 --gauss 0.8 --dog 3,5 were used during training, make sure the same arguments are used for inference. To find the arguments used for training, go to the output folder and check opts.txt. For trained model selection, check the loss in log.txt and select a model with lower loss.

The inference output is an npz file that contains the embeddings, corresponding coordinates, original cropped patches from the tomograms, and the names of the corresponding tomograms. Output is saved to exp/simsiam2d3d/test_sample/all_output_info.npz for 2d3d mode and exp/simsiam3d/test_sample/all_output_info.npz for 3d mode.
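To see what the inference step produced, the npz file can be inspected directly. The exact array names stored inside all_output_info.npz are not listed here, so the sketch prints data.files rather than assuming them.

```python
# Quick, hedged inspection of the inference output; key names are whatever
# MiLoPYP stored, so list them instead of hard-coding.
import numpy as np

data = np.load("exp/simsiam2d3d/test_sample/all_output_info.npz", allow_pickle=True)
print(data.files)   # available arrays (embeddings, coordinates, patches, names, ...)
for key in data.files:
    arr = data[key]
    print(key, getattr(arr, "shape", None), getattr(arr, "dtype", None))
```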

2D visualization plot

(Example figures: 2D visualization plot; 2D visualization with labels)

To generate the 2D visualization plot, run:

python plot_2d.py --input exp/simsiam2d3d/test_sample/all_output_info.npz --n_cluster 48 --num_neighbor 40 --mode umap --path exp/simsiam2d3d/test_sample/ --min_dist_vis 1.3e-3 
The 2D visualization plot will be saved to exp/simsiam2d3d/test_sample/2d_visualization_out.png, along with an additional plot with labels generated by an over-clustering algorithm, exp/simsiam2d3d/test_sample/2d_visualization_labels.png. In the same directory, additional outputs include all_colors.npy, which is used as input for 3D tomogram visualization plotting, and interactive_info_parquet.gzip, which contains the over-clustering labels and is used as input for the interactive session. PNGs of all cropped patches will also be saved to the exp/simsiam2d3d/test_sample/imgs/ folder; these are later used for the interactive session. For 3d mode, replace simsiam2d3d with simsiam3d.
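If you want to look at the over-clustering labels outside the interactive session, interactive_info_parquet.gzip can be read with pandas. The column names are not documented here, so the sketch simply prints whatever is present.

```python
# Hedged peek at the parquet file written by plot_2d.py; not MiLoPYP code.
import pandas as pd

df = pd.read_parquet("exp/simsiam2d3d/test_sample/interactive_info_parquet.gzip")
print(df.columns.tolist())   # actual column names produced by plot_2d.py
print(df.head())
```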

| Arguments | Purpose |
| --- | --- |
| input | all_output_info.npz output from inference |
| n_cluster | number of clusters for over-clustering |
| num_neighbor | number of neighbors for both t-SNE and UMAP |
| mode | whether to use tsne or umap for dimensionality reduction |
| path | path of the directory to save all outputs and images |
| host | local host port for the images, default is 7000 |
| min_dist_umap | minimum distance for UMAP |
| min_dist_vis | minimum distance for patch display in the 2D visualization |
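If you want to experiment with the dimensionality reduction outside of plot_2d.py, a rough equivalent using umap-learn might look like the sketch below. It assumes the embeddings are stored in all_output_info.npz under a key containing "emb", which is a guess, so adjust the key and settings to what the file actually contains; this is not plot_2d.py's exact behavior.

```python
# Hedged sketch of a UMAP projection of the learned embeddings; the key name
# and UMAP settings are assumptions.
import numpy as np
import umap

data = np.load("exp/simsiam2d3d/test_sample/all_output_info.npz", allow_pickle=True)
emb_key = next(k for k in data.files if "emb" in k.lower())   # guessed key name
embeddings = np.asarray(data[emb_key])

reducer = umap.UMAP(n_neighbors=40, min_dist=0.1)   # cf. --num_neighbor 40
coords_2d = reducer.fit_transform(embeddings)
print(coords_2d.shape)   # (num_patches, 2)
```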

3D Tomogram visualization

(Example figure: 3D visualization plot)

To generate the 3D tomogram visualization plot, run:

python visualize_3dhm.py --input exp/simsiam2d3d/test_sample/all_output_info.npz --color exp/simsiam2d3d/test_sample/all_colors.npy --dir_simsiam exp/simsiam2d3d/test_sample/ --rec_dir sample_data/
Outputs include two numpy arrays: *rec3d.npy is the array for the 3D tomogram and *hm3d_simsiam.npy is the array for the color heatmaps. To generate an example visualization plot, the two arrays can be:

  1. loaded in napari as two layers, with the transparency of each layer adjusted in the napari interface, or

  2. blended by weighted averaging, w x rec3d.npy + (1-w) x hm3d_simsiam.npy, where the resulting array is the colored 3D tomogram visualization (see the sketch after this list).
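Here is a minimal sketch of the two display options, assuming napari is installed. The file names below follow the *rec3d.npy / *hm3d_simsiam.npy pattern with a placeholder prefix, and the blend is only computed when the two arrays have matching shapes.

```python
# Hedged sketch of displaying the 3D visualization arrays; not MiLoPYP code.
import numpy as np
import napari

rec = np.load("exp/simsiam2d3d/test_sample/tomo1_rec3d.npy")          # placeholder name
hm = np.load("exp/simsiam2d3d/test_sample/tomo1_hm3d_simsiam.npy")    # placeholder name

# Option 1: two napari layers with adjustable opacity
viewer = napari.Viewer()
viewer.add_image(rec, name="tomogram", opacity=0.6)
viewer.add_image(hm, name="simsiam heatmap", opacity=0.4)

# Option 2: weighted average, w * rec3d + (1 - w) * hm3d_simsiam
w = 0.6
if rec.shape == hm.shape:
    blended = w * rec + (1 - w) * hm
    viewer.add_image(blended, name="blended")

napari.run()   # opens the interactive viewer window
```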

| Arguments | Purpose |
| --- | --- |
| input | all_output_info.npz output from inference |
| color | .npy color array generated by plot_2d.py |
| dir_simsiam | directory of the current run |
| rec_dir | path to the directory with the corresponding .rec files |

For 3d mode, replace simsiam2d3d with simsiam3d.

3D interactive session

The interactive session requires loading the local images already generated by plot_2d.py under exp/simsiam2d3d/test_sample/imgs/. To serve the images on a local host and keep the server running in the background, start a new screen session in the terminal with screen, go to exp/simsiam2d3d/test_sample/, and run python -m http.server 7000. The images will be hosted on port 7000. Then detach the screen session.

Warning

Make sure to use the same port number as the host argument in plot_2d.py; the default is 7000.

To initiate interactive session, run:

python phoenix_visualization.py --input exp/simsiam2d3d/test_sample/interactive_info_parquet.gzip
The terminal should show the headers of the parquet file, along with the local host address for the interactive session. In this example, the local host address is http://localhost:33203/.

You should now be able to access the interactive session at http://localhost:33203/.

Warning

If you are running everything on a remote cluster and want to visualize through a local browser, you first need to forward the remote ports to your local machine. This needs to be done for both the images and the interactive session. To forward the image server on port 7000, run ssh -N -f -L localhost:7000:localhost:7000 your_remote_login_address in a local terminal. To forward the interactive session on port 33203, run ssh -N -f -L localhost:33203:localhost:33203 your_remote_login_address.

In the interactive session, you can visualize clusters of 3D embeddings, adjust the number of points displayed in the cluster, color each embedding based on its label, select subclusters based on labels, and export the selected subclusters.

(Screenshots: interactive session homepage; adjusting the number of displayed points; coloring embeddings by label; selecting and exporting subclusters by label)

Convert exported parquet files to a training coordinates file for the refinement module

Exported coordinates can be downloaded through the interactive session. In the example below, the downloaded parquet file is named 2023-10-08_19-44-41.parquet.


Convert the parquet files to a training coordinates .txt file by running:

python interactive_to_training_coords.py --input path_to_dir_of_parquets --output training_coordinates.txt 

| Arguments | Purpose |
| --- | --- |
| input | full path to the directory that contains all downloaded parquet files |
| output | full path to the output training coordinates file |
| if_double | whether the z coordinates obtained from DoGs in the exploration module were downscaled by 2; if so, they need to be rescaled back |
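Before running the conversion, it can help to check how many points were actually exported. This hedged sketch concatenates every downloaded parquet file in the input directory and counts rows; the directory path is a placeholder.

```python
# Count exported points across all downloaded parquet files; not MiLoPYP code.
from pathlib import Path
import pandas as pd

parquet_dir = Path("path_to_dir_of_parquets")   # same directory passed to --input
frames = []
for p in sorted(parquet_dir.glob("*.parquet")):
    df = pd.read_parquet(p)
    print(p.name, len(df), "rows")
    frames.append(df)

if frames:
    merged = pd.concat(frames, ignore_index=True)
    print("total selected points:", len(merged))
```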

Now, we can use the same training image file sample_train_explore_img.txt and the generated training coordinates file training_coordinates.txt for the subsequent refinement module.