	---
	tags:
	- speech-recognition
	- ASR
	- k2
	- sherpa
	- PyTorch
	license: cc-by-4.0
	library_name: icefall
	datasets:
	- librispeech
	inference: false
	---



	-1. Create your own virtualenv

	# Install CUDA and cuDNN

	0. Check which CUDA version your driver supports:
	```nvidia-smi | head -n 4```

	Install a CUDA toolkit whose version is less than or equal to the `CUDA Version` shown in that output.
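The version check in step 0 can also be scripted. This is a minimal sketch, assuming the `nvidia-smi` header contains a field of the form `CUDA Version: X.Y`; the function names and the sample header line are illustrative and not part of any NVIDIA tool.

```python
import re


def parse_driver_cuda_version(smi_output):
    """Return the (major, minor) CUDA version reported in nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version' field found")
    return int(match.group(1)), int(match.group(2))


def toolkit_is_supported(toolkit_version, smi_output):
    """True if the toolkit version (e.g. '12.1') is <= the driver's CUDA version."""
    major, minor = (int(part) for part in toolkit_version.split(".")[:2])
    return (major, minor) <= parse_driver_cuda_version(smi_output)


# Hypothetical header line, used only to exercise the parser.
SAMPLE_HEADER = (
    "| NVIDIA-SMI 530.30.02    Driver Version: 530.30.02    CUDA Version: 12.1 |"
)

if __name__ == "__main__":
    print(toolkit_is_supported("12.1", SAMPLE_HEADER))
```

On a real machine, the text would come from something like `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` instead of `SAMPLE_HEADER`.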

	1. Install CUDA (I am installing CUDA 12.1)
	```
	wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda_12.1.0_530.30.02_linux.run
	```
	```
	chmod +x cuda_12.1.0_530.30.02_linux.run
	```
	(Change `--installpath` to a directory you can write to.)
	```
	./cuda_12.1.0_530.30.02_linux.run \
	--silent \
	--toolkit \
	--installpath=/speech/hasan/software/cuda-12.1.0 \
	--no-opengl-libs \
	--no-drm \
	--no-man-page
	```

	## Install cuDNN for CUDA 12.1
	```
	wget https://huggingface.co/csukuangfj/cudnn/resolve/main/cudnn-linux-x86_64-8.9.5.29_cuda12-archive.tar.xz
	```
	```
	tar xvf cudnn-linux-x86_64-8.9.5.29_cuda12-archive.tar.xz --strip-components=1 -C /speech/hasan/software/cuda-12.1.0
	```

	Create a file `activate-cuda-12.1.sh` with the following content (adjust `CUDA_HOME` to your install path), then run `source activate-cuda-12.1.sh`:
	```
	export CUDA_HOME=/speech/hasan/software/cuda-12.1.0
	export PATH=$CUDA_HOME/bin:$PATH
	export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
	export LD_LIBRARY_PATH=$CUDA_HOME/lib:$LD_LIBRARY_PATH
	export LD_LIBRARY_PATH=$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH
	export CUDAToolkit_ROOT_DIR=$CUDA_HOME
	export CUDAToolkit_ROOT=$CUDA_HOME

	export CUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME
	export CUDA_TOOLKIT_ROOT=$CUDA_HOME
	export CUDA_BIN_PATH=$CUDA_HOME
	export CUDA_PATH=$CUDA_HOME
	export CUDA_INC_PATH=$CUDA_HOME/targets/x86_64-linux
	export CFLAGS="-I$CUDA_HOME/targets/x86_64-linux/include $CFLAGS"
	export CUDAToolkit_TARGET_DIR=$CUDA_HOME/targets/x86_64-linux
	```

	Check your installation by running:
	```
	which nvcc
	```
	Desired output:
	```
	/speech/hasan/software/cuda-12.1.0/bin/nvcc
	```
	```
	nvcc --version
	```
	Desired output:
	```
	nvcc: NVIDIA (R) Cuda compiler driver
	Copyright (c) 2005-2023 NVIDIA Corporation
	Built on Tue_Feb__7_19:32:13_PST_2023
	Cuda compilation tools, release 12.1, V12.1.66
	Build cuda_12.1.r12.1/compiler.32415258_0
	```
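The two checks above can be combined into a small script. This is a sketch only: the helper names are made up for illustration, and the regular expression assumes `nvcc --version` prints a `release X.Y` field as in the desired output shown here.

```python
import os
import re
import shutil

# Output copied from the "Desired output" block in this README.
NVCC_OUTPUT = """\
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
"""


def nvcc_release(version_output):
    """Extract the 'release X.Y' number from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", version_output)
    return match.group(1) if match else None


def expected_nvcc_path(cuda_home):
    """Path at which `which nvcc` should resolve after sourcing the activate script."""
    return cuda_home.rstrip("/") + "/bin/nvcc"


if __name__ == "__main__":
    cuda_home = os.environ.get("CUDA_HOME", "")
    print("nvcc on PATH:", shutil.which("nvcc"))
    print("expected    :", expected_nvcc_path(cuda_home) if cuda_home else "CUDA_HOME not set")
    print("release     :", nvcc_release(NVCC_OUTPUT))
```

If the path printed for `nvcc on PATH` differs from the expected one, the activate script was probably not sourced in the current shell.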

	[Reference](https://k2-fsa.github.io/k2/installation/cuda-cudnn.html)

	# Install Torch and TorchAudio

	torch==2.2.1 and torchaudio==2.2.1 are compatible ([reference](https://pytorch.org/get-started/previous-versions/#linux-and-windows-1)), so I'll install those versions:

	```
	pip install torch==2.2.1+cu121 torchaudio==2.2.1+cu121 -f https://download.pytorch.org/whl/torch_stable.html
	```

	Verify Installation
	```
	python3 -c "import torch; print(torch.__version__)"
	python3 -c "import torchaudio; print(torchaudio.__version__)"
	```
	Desired output (one line per command; both should match the versions installed above):
	```
	2.2.1+cu121
	2.2.1+cu121
	```
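A version string such as `2.2.1+cu121` carries both the release number and the CUDA build tag. The sketch below splits the two parts and checks that torch and torchaudio agree; the function names are illustrative, and the strings are passed in by hand so the snippet runs without either package installed.

```python
def split_version(version):
    """Split '2.2.1+cu121' into ('2.2.1', 'cu121'); the tag is '' if absent."""
    release, _, local = version.partition("+")
    return release, local


def versions_agree(torch_version, torchaudio_version):
    """True if both packages report the same release and the same build tag."""
    return split_version(torch_version) == split_version(torchaudio_version)


if __name__ == "__main__":
    # The pinned versions from the pip command in this section.
    print(split_version("2.2.1+cu121"))
    print(versions_agree("2.2.1+cu121", "2.2.1+cu121"))
```

With both packages installed, the same check could be run as `versions_agree(torch.__version__, torchaudio.__version__)`.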

	## Install k2
	```
	pip install k2==1.24.4.dev20240425+cuda12.1.torch2.2.1 -f https://k2-fsa.github.io/k2/cuda.html
	```

	Verify Installation
	```
	python3 -m k2.version
	```
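The k2 wheel version encodes the CUDA and torch versions it was built against (`+cuda12.1.torch2.2.1`), and the torch part has to match the installed torch. Below is a minimal sketch of that check. The parsing pattern is an assumption based on the version strings that appear in this README, and the helper names are hypothetical.

```python
import re

# Pattern inferred from strings such as
# '1.24.4.dev20240425+cuda12.1.torch2.2.1' and '1.3.dev20240227+cpu.torch2.2.1'.
WHEEL_PATTERN = re.compile(
    r"(?P<base>[^+]+)\+(?P<device>cpu|cuda[\d.]+)\.torch(?P<torch>[\d.]+)"
)


def parse_wheel_tag(wheel_version):
    """Return the base version, device tag and torch version of a wheel."""
    match = WHEEL_PATTERN.fullmatch(wheel_version)
    if match is None:
        raise ValueError(f"unrecognised wheel version: {wheel_version}")
    return match.groupdict()


def wheel_matches_torch(wheel_version, torch_version):
    """True if the wheel was built for the given torch release (build tag ignored)."""
    torch_release = torch_version.split("+")[0]
    return parse_wheel_tag(wheel_version)["torch"] == torch_release


if __name__ == "__main__":
    print(parse_wheel_tag("1.24.4.dev20240425+cuda12.1.torch2.2.1"))
    print(wheel_matches_torch("1.24.4.dev20240425+cuda12.1.torch2.2.1", "2.2.1+cu121"))
```

The same check applies to any other wheel whose version carries a `torch<version>` suffix.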

	## Install lhotse
	```
	pip install git+https://github.com/lhotse-speech/lhotse
	```
	Verify Installation:
	```
	python3 -c "import lhotse; print(lhotse.__version__)"
	```
	Desired output:
	```
	1.24.0.dev+git.4d57d53.clean
	```

	## Install icefall
	```
	git clone https://github.com/k2-fsa/icefall
	cd icefall/
	pip install -r ./requirements.txt
	```
	Add the directory where you cloned icefall to `PYTHONPATH` (adjust the path below), then change to the yesno recipe:
	```
	export PYTHONPATH=/speech/hasan/icefall_install/icefall:$PYTHONPATH
	cd egs/yesno/ASR/
	```
	Test your Installation
	```
	./prepare.sh
	```
	```
	export CUDA_VISIBLE_DEVICES=""
	./tdnn/train.py
	```
	```
	./tdnn/decode.py
	```

	## Congrats!
	[Reference](https://icefall.readthedocs.io/en/latest/installation/index.html)

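	## Install kaldifeat
	```
	pip install kaldifeat==1.25.4.dev20240425+cpu.torch2.3.0 -f https://csukuangfj.github.io/kaldifeat/cpu.html
	```

	## Install sherpa
	```
	pip install k2_sherpa==1.3.dev20240227+cpu.torch2.2.1 -f https://k2-fsa.github.io/sherpa/cpu.html
	```

	Both commands install CPU wheels. The `torch<version>` suffix of a wheel must match the torch version you installed, so the kaldifeat wheel above (built for torch 2.3.0) does not match the torch 2.2.1 installed earlier. Pick the matching wheel from the linked index page.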

	## Training
	```
	python3 egs/<dataset_name>/ASR/zipformer/train.py \
	--world-size <number_of_gpus> \
	--num-epochs <number_of_epochs> \
	--start-epoch <starting_epoch> \
	--exp-dir <experiment_directory> \
	--max-duration <max_duration_per_batch> \
	--num-workers <number_of_data_workers> \
	--on-the-fly-feats <True_or_False> \
	--manifest-dir <manifest_directory> \
	--num-buckets <number_of_buckets> \
	--bpe-model <path_to_bpe_model> \
	--train-cuts <path_to_training_cuts> \
	--valid-cuts <path_to_validation_cuts> \
	--causal <1_or_0> \
	--master-port <port_number>
	```

	Parameter Reference:

	- `--world-size`: Number of GPUs (processes) to use for distributed training.
	- `--num-epochs`: Total number of epochs to train.
	- `--start-epoch`: Epoch to start training from (useful when resuming).
	- `--exp-dir`: Directory where experiment logs and model checkpoints are saved.
	- `--max-duration`: Maximum total duration of audio in one batch, in seconds.
	- `--num-workers`: Number of worker processes for data loading.
	- `--on-the-fly-feats`: Whether to compute features on the fly during training (True or False) instead of using precomputed features.
	- `--manifest-dir`: Directory containing the manifest files for the training and validation data.
	- `--num-buckets`: Number of buckets used to group utterances of similar duration.
	- `--bpe-model`: Path to the Byte-Pair Encoding model used for text tokenization.
	- `--train-cuts`: Path to the JSONL file containing the training cuts.
	- `--valid-cuts`: Path to the JSONL file containing the validation cuts.
	- `--causal`: Set to 1 to train a causal Zipformer (needed for streaming decoding), 0 for a non-causal one.
	- `--master-port`: Port number used for distributed training communication.
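When the same training run is launched repeatedly, it can be convenient to assemble the command from a dictionary. This is a minimal sketch: `build_train_command` is a hypothetical helper, and every value in `EXAMPLE_OPTIONS` is a placeholder, not a recommended setting. Only flags listed in the parameter reference are used.

```python
import shlex


def build_train_command(dataset, options):
    """Build the argument list for the zipformer train.py of one dataset."""
    command = ["python3", f"egs/{dataset}/ASR/zipformer/train.py"]
    for flag, value in options.items():
        command += [f"--{flag}", str(value)]
    return command


# Placeholder values for illustration only.
EXAMPLE_OPTIONS = {
    "world-size": 1,
    "num-epochs": 30,
    "start-epoch": 1,
    "exp-dir": "zipformer/exp",
    "max-duration": 300,
    "causal": 1,
    "master-port": 12345,
}

if __name__ == "__main__":
    # Print the command so it can be inspected or pasted into a shell.
    print(shlex.join(build_train_command("librispeech", EXAMPLE_OPTIONS)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with paths that contain spaces.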

	# Sample decode file
	## Streaming ASR Decoding with Zipformer

	`streaming_decode.py` performs streaming (chunk-by-chunk) decoding of a Zipformer ASR model in icefall. The command below lists the options for the decoding method and for chunked streaming.
	```
	./streaming_decode.py --epoch <EPOCH_NUMBER> \
	--avg <AVERAGE_NUMBER> \
	--exp-dir <EXPERIMENT_DIR> \
	--decoding-method <DECODING_METHOD> \
	--manifest-dir <MANIFEST_DIR> \
	--cut-set-name <CUT_SET_NAME> \
	--bpe-model <BPE_MODEL_PATH> \
	--causal <CAUSAL_FLAG> \
	--chunk-size <CHUNK_SIZE> \
	--left-context-frames <LEFT_CONTEXT_FRAMES> \
	--on-the-fly-feats <ON_THE_FLY_FEATS_FLAG> \
	--use-averaged-model <AVERAGED_MODEL_FLAG> \
	--num-workers <NUM_WORKERS> \
	--max-duration <MAX_DURATION> \
	--num-decode-streams <NUM_DECODE_STREAMS> \
	--context-size <CONTEXT_SIZE>
	```
	Parameters

	- `--epoch`: Training epoch whose checkpoint is used for decoding.
	- `--avg`: Number of checkpoints to average. For example, `--avg 4` averages the last 4 checkpoints up to `--epoch`.
	- `--exp-dir`: Directory where the experiment's checkpoints and logs are stored.
	- `--decoding-method`: Decoding strategy to use, e.g. greedy_search or beam_search.
	- `--manifest-dir`: Directory containing the manifest files of the datasets to decode.
	- `--cut-set-name`: Name of the cut set to decode, typically a data subset such as test_1 or test_2.
	- `--bpe-model`: Path to the BPE model used for tokenization during decoding.
	- `--causal`: Whether the model is causal. Set 1 for causal (required for streaming decoding) and 0 for non-causal.
	- `--chunk-size`: Size of each chunk, in frames, processed per streaming step.
	- `--left-context-frames`: Number of left-context frames available to each chunk during chunked decoding.
	- `--on-the-fly-feats`: If True, features are extracted on the fly instead of being precomputed.
	- `--use-averaged-model`: If True, decode with the model whose parameters are averaged over the checkpoint range selected by `--epoch` and `--avg`.
	- `--num-workers`: Number of worker processes for data loading during decoding.
	- `--max-duration`: Maximum total duration (in seconds) of audio decoded in one batch.
	- `--num-decode-streams`: Number of streams decoded in parallel.
	- `--context-size`: Context size of the transducer decoder, i.e. how many previous tokens it conditions on (1 means bigram, 2 means trigram). It must match the value used for training.
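To make the interaction between `--epoch` and `--avg` concrete, here is a minimal sketch. It assumes checkpoints are saved as `epoch-<N>.pt` inside `--exp-dir` and that plain averaging covers the last `--avg` epochs ending at `--epoch`. Both points are assumptions about the recipe, so check them against your own `streaming_decode.py`; the helper name is hypothetical.

```python
def checkpoints_to_average(exp_dir, epoch, avg):
    """List the checkpoint files selected by --epoch and --avg.

    Assumes files are named 'epoch-<N>.pt' and that the range ends at `epoch`.
    """
    if avg < 1 or avg > epoch:
        raise ValueError("avg must be between 1 and epoch")
    first = epoch - avg + 1
    return [f"{exp_dir}/epoch-{i}.pt" for i in range(first, epoch + 1)]


if __name__ == "__main__":
    for path in checkpoints_to_average("zipformer/exp", epoch=30, avg=4):
        print(path)
```

Listing the files before decoding is a quick way to confirm that every checkpoint in the range actually exists in the experiment directory.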

	## Sherpa Online WebSocket Server

	`sherpa-online-websocket-server` starts a WebSocket server for real-time (streaming) ASR decoding with sherpa. It can run on CPU or GPU and takes the token table, the model, and the decoding method as options.
	```
	sherpa-online-websocket-server --use-gpu=<USE_GPU_FLAG> \
	--tokens=<TOKENS_FILE_PATH> \
	--port=<PORT_NUMBER> \
	--doc-root=<DOCUMENT_ROOT> \
	--nn-model=<MODEL_PATH> \
	--decoding-method=<DECODING_METHOD>
	```
	Parameters

	- `--use-gpu`: Set to True for GPU-based decoding, or False for CPU-based decoding.
	- `--tokens`: Path to the file containing the token list (e.g., BPE tokens) required for decoding.
	- `--port`: Port number for the WebSocket server. Ensure this port is free and not blocked by a firewall.
	- `--doc-root`: Root directory of the web resources. The server serves the files in this directory when it is accessed from a browser.
	- `--nn-model`: Path to the neural network model used for decoding, usually a TorchScript (jit_script) file exported from a trained speech recognition model.
	- `--decoding-method`: Decoding strategy to use, e.g. greedy_search or beam_search. Choose based on your model and application needs.

	Example
	```
	sherpa-online-websocket-server --use-gpu=True \
	--tokens=/path/to/tokens.txt \
	--port=8003 \
	--doc-root=/path/to/web/document/root \
	--nn-model=/path/to/jit_script_model.pt \
	--decoding-method=greedy_search
	```

	Notes

	- GPU support: If using GPU, ensure that CUDA is properly set up on the system.
	- Token file: The token file should correspond to the language and tokenization scheme used when training the neural network model.
	- Neural network model: The model should be compatible with the decoding setup (e.g., a streaming model exported for chunk-based decoding).
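A quick sanity check on the token file can catch a mismatched `tokens.txt` before the server is started. The sketch below assumes the file has one `<symbol> <id>` pair per line; that layout is an assumption, so verify it against your own file. The helper name and the sample lines are hypothetical.

```python
def parse_tokens(lines):
    """Map token id -> symbol from lines of the form '<symbol> <id>'."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        symbol, index = line.rsplit(maxsplit=1)
        token_id = int(index)
        if token_id in table:
            raise ValueError(f"duplicate token id {token_id}")
        table[token_id] = symbol
    return table


# Hypothetical file content, for illustration only.
SAMPLE_LINES = ["<blk> 0", "<sos/eos> 1", "<unk> 2", "S 3", "T 4"]

if __name__ == "__main__":
    tokens = parse_tokens(SAMPLE_LINES)
    print(len(tokens), "tokens; id 0 is", tokens[0])
```

For a real file, pass an open file handle, for example `parse_tokens(open("/path/to/tokens.txt", encoding="utf-8"))`, and compare the number of entries with the vocabulary size the model was trained with.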