Commit 4e62cf9

URL fixes and URL check fixes (#2692)
* Fix URLs to `unstable-v0.6`, use relative link for `recipes/`
* Relative link in readme for PERFORMANCE.md
* URL to Spectral Clustering in README dead, point to web archive copy
* Replace dead URL to `wham_noise.zip`
  This AWS bucket link is used in `recipes/ESC50` and is linked to in http://wham.whisper.ai/, which seems to be treated as the official source for that dataset elsewhere in SB. Not going to bother moving it to `fetch`.
* check_url.yaml rework: regex support, parallel, expanded scope
* Tutorial URL fix
* Fix links to inference code that was changed in 1.0
* Web archive BPE_Gage.pdf
* Remove broken link to PyTorch doc in tutorial
  I am not sure what this is actually intended to point to, as the URL seemingly referred to `torchvision`. Removing for now...
* Fix link to doc in quaternion tutorial
* Fix link to papers in tutorial
* Fix Colab/GitHub URL for asr-metrics.ipynb
* Fix more tutorial dead links to the web archive
* Update ESC-50 dataset link and ignore dead URL false positive
* Ignore dead URL false positive in DNS
* Be more verbose about URL check errors
* Ignore URL check true positive for urbansounddataset
  This is the actual README; I'll avoid tampering with it beyond adding a notice. Better indicate to the user that the link might be dead.
* Fix format string typo
* Fix formatting
* Add the web archive to ignored URLs for URL checks
* Add arXiv to ignored URLs for URL checks
* Disable TLS verification in URL checks
  This fixes the URL check for `https://sail.usc.edu`. We don't really care about the MITM risk here, as we do nothing with the data.
* Add kaggle to URL exclusion regex
* VoxLingua107 pre-compiled shards are dead, add warning + ignore check
* Undo broken ')' handling for URL check, just ignore one URL for now
* Fix URL and typo in speech-classification-from-scratch
* Formatting
* Ignore TLS verify=False warning in URL check
1 parent adf6010 commit 4e62cf9

17 files changed: 155 additions & 89 deletions
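The reworked URL check described in the commit message (parallel requests, an exclusion regex covering the web archive, arXiv and kaggle, TLS verification disabled, and a per-line ignore marker) could be sketched roughly as below. This is an illustrative Python sketch only; the actual workflow lives in `check_url.yaml` and its helpers, and the names `extract_urls`, `check_url`, `check_all` and the exact patterns are ours:

```python
# Hypothetical sketch of a parallel URL checker with regex-based exclusions,
# a per-line opt-out marker, and TLS verification disabled (the commit notes
# the response body is discarded, so the MITM risk is accepted).
import concurrent.futures
import re
import ssl
import urllib.request

# Hosts the commit adds to the exclusion list: web archive, arXiv, kaggle.
EXCLUDE_RE = re.compile(r'https?://(web\.archive\.org|arxiv\.org|www\.kaggle\.com)/')
# Simple URL matcher; real-world extraction needs more care (e.g. ')' handling).
URL_RE = re.compile(r'https?://[^\s<>")\]]+')
IGNORE_MARKER = "<!-- ignore-url-check -->"

def extract_urls(text: str) -> list[str]:
    """Collect URLs from text, skipping marked lines and excluded hosts."""
    urls = []
    for line in text.splitlines():
        if IGNORE_MARKER in line:
            continue  # line opted out of the check
        urls.extend(URL_RE.findall(line))
    return [u for u in urls if not EXCLUDE_RE.match(u)]

def check_url(url: str, timeout: float = 10.0) -> tuple[str, bool]:
    """Probe one URL; TLS verification is skipped since we ignore the body."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return url, resp.status < 400
    except Exception:
        return url, False

def check_all(urls: list[str], workers: int = 8) -> dict[str, bool]:
    """Check many URLs concurrently with a thread pool."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(check_url, urls))
```

A checker shaped like this reports dead links without hammering any single host too hard, and the exclusion regex keeps known rate-limited hosts out of the scan entirely.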

README.md

Lines changed: 24 additions & 24 deletions
Large diffs are not rendered by default.

docs/tutorials/advanced/data-loading-for-big-datasets-and-shared-filesystems.ipynb

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@
 "\n",
 "Do you have a large dataset stored in a shared filesystem, and you want to use it for training a neural network? Is this dataset so large that it doesn't even fit into the local SSD of your computation nodes? If so, this tutorial will walk you through all the needed steps to manage reading large files from a shared filesystem.\n",
 "\n",
-"In many compute clusters, the main data storage is a network filesystem (NFS), for example [Lustre](https://en.wikipedia.org/wiki/Lustre_(file_system)). The NFS can serve many users concurrently and provide high data throughput from a single file. However, opening or listing many different files is slow - and doing so may slow the whole system down for everyone, not just the offending user. Speech datasets usually consist of very many small recordings. Reading every file again and again is exactly the kind of data IO that can slow down an NFS.\n",
+"In many compute clusters, the main data storage is a network filesystem (NFS), for example [Lustre](https://en.wikipedia.org/wiki/Lustre_(file_system)). <!-- ignore-url-check --> The NFS can serve many users concurrently and provide high data throughput from a single file. However, opening or listing many different files is slow - and doing so may slow the whole system down for everyone, not just the offending user. Speech datasets usually consist of very many small recordings. Reading every file again and again is exactly the kind of data IO that can slow down an NFS.\n",
 "\n",
 "One solution is to copy the dataset into the **local SSD** of the computing node. This can be done relatively efficiently by compressing the dataset into a single file (e.g. `dataset.tar.gz`), copying it into the local node, and finally, uncompressing (untarring) the file. Reading files from the local SSD is very efficient and does not harm the performance of the shared filesystem.\n",
 "The standard SpeechBrain data IO works well in this case, see [this tutorial](https://speechbrain.readthedocs.io/en/latest/tutorials/basics/data-loading-pipeline.html).\n",

docs/tutorials/advanced/inferring-on-your-own-speechbrain-models.ipynb

Lines changed: 4 additions & 4 deletions
@@ -140,7 +140,7 @@
 "\n",
 "### 2. Using the `EndoderDecoderASR` interface\n",
 "\n",
-"The [EncoderDecoderASR class](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py#L353). interface allows you to decouple your trained model from the training recipe and to infer (or encode) on any new audio file in few lines of code. If you are not interested in ASR, you'll find many other interfaces to fit your purpose in the `interfaces.py` file. This solution must be preferred if you intend to deploy your model in a production fashion i.e. if you plan to use your model a lot and in a stable way. Of course, this will require you to slightly rework the yaml.\n",
+"The [EncoderDecoderASR class](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/inference/ASR.py). interface allows you to decouple your trained model from the training recipe and to infer (or encode) on any new audio file in few lines of code. If you are not interested in ASR, you'll find many other interfaces to fit your purpose in the `interfaces.py` file. This solution must be preferred if you intend to deploy your model in a production fashion i.e. if you plan to use your model a lot and in a stable way. Of course, this will require you to slightly rework the yaml.\n",
 "\n",
 "The class has the following methods:\n",
 "\n",
@@ -441,7 +441,7 @@
 "\n",
 "While the `EncoderDecoderASR` class has been designed to be as generic as possible, your might require a more complex inference scheme that better fits your needs. In this case, you have to develop your own interface. To do so, follow these steps:\n",
 "\n",
-"1. Create your custom interface inheriting from `Pretrained` (code [here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py)):\n",
+"1. Create your custom interface inheriting from `Pretrained` (code [in this file](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/inference/interfaces.py)):\n",
 "\n",
 "\n",
 "```python\n",
@@ -499,11 +499,11 @@
 "\n",
 "As you can see, this formalism is extremely flexible and enables you to create a holistic interface that can be used to do anything you want with your pretrained model.\n",
 "\n",
-"We provide different generic interfaces for E2E ASR, speaker recognition, source separation, speech enhancement, etc. Please have a look [here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py) if interested!\n",
+"We provide different generic interfaces for E2E ASR, speaker recognition, source separation, speech enhancement, etc. Please have a look [here](https://github.com/speechbrain/speechbrain/tree/develop/speechbrain/inference) if interested!\n",
 "\n",
 "\n",
 "## General Pretraining Inference\n",
-"In some cases, users might want to develop their inference interface in an external file. This can be done using the [foreign class](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py#L28).\n",
+"In some cases, users might want to develop their inference interface in an external file. This can be done using the [foreign class](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/inference/interfaces.py).\n",
 "You can take a look at the example reported [here](https://huggingface.co/speechbrain/emotion-recognition-wav2vec2-IEMOCAP):\n",
 "\n",
 "\n",

docs/tutorials/advanced/text-tokenizer.ipynb

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@
 "\n",
 "\n",
 "SpeechBrain currently relies on a custom integration of the [*SentencePiece tokenizer*](https://github.com/google/sentencepiece) which treats the input as a raw input stream. The following tokenizer algorithms are supported:\n",
-"1. [BPE](https://www.derczynski.com/papers/archive/BPE_Gage.pdf).\n",
+"1. [BPE](https://web.archive.org/web/20230319172720/https://www.derczynski.com/papers/archive/BPE_Gage.pdf).\n",
 "2. [Unigram](https://arxiv.org/pdf/1804.10959.pdf) (Subword Regularization).\n",
 "\n",
 "\n",

docs/tutorials/basics/data-loading-pipeline.ipynb

Lines changed: 1 addition & 1 deletion
@@ -118,7 +118,7 @@
 },
 "source": [
 "### Dataset\n",
-"The role of the Dataset is to produce single data points. Typically they are loaded off the disk, but they could also come from some more complex source or in some cases just from RAM. You can write your own Dataset subclass or sometimes you can use a standardized class, such as [this](https://pytorch.org/docs/stable/torchvision/datasets.html#datasetfolder). The training, validation, and test subsets get their own Dataset instances.\n",
+"The role of the Dataset is to produce single data points. Typically they are loaded off the disk, but they could also come from some more complex source or in some cases just from RAM. You can write your own Dataset subclass or sometimes you can use a standardized class. The training, validation, and test subsets get their own Dataset instances.\n",
 "\n",
 "The Dataset interface is simple; it implements\n",
 "`__getitem__` and usually also `__len__`. Usually, \"map-style\" Datasets are used, but it's worth noting that PyTorch also has a notion of [IterableDataset](https://pytorch.org/docs/stable/data.html#iterable-style-datasets)s.\n",

docs/tutorials/nn/complex-and-quaternion-neural-networks.ipynb

Lines changed: 1 addition & 1 deletion
@@ -560,7 +560,7 @@
 "1. Compose a real-valued matrix from the different weight components\n",
 "2. Apply a matrix product between the input and this rotation matrix!\n",
 "\n",
-"[Check the code!](http://www.darnault-parcollet.fr/Parcollet/hiddennoshare/speechbrain.github.io/documentation/speechbrain.nnet.quaternion_networks.q_ops.html#speechbrain.nnet.quaternion_networks.q_ops.quaternion_linear_rotation_op)\n",
+"[Check the code!](https://speechbrain.readthedocs.io/en/latest/API/speechbrain.nnet.quaternion_networks.q_ops.html#speechbrain.nnet.quaternion_networks.q_ops.quaternion_linear_rotation_op)\n",
 "\n",
 "### Turning a quaternion layer into a spinor layer\n",
 "\n",

docs/tutorials/preprocessing/speech-features.ipynb

Lines changed: 1 addition & 1 deletion
@@ -392,7 +392,7 @@
 },
 "source": [
 "## References\n",
-"[1] P. Mermelstein (1976), \"Distance measures for speech recognition, psychological and instrumental,\" in Pattern Recognition and Artificial Intelligence. [ArXiv](http://www.haskins.yale.edu/sr/SR047/SR047_07.pdf)\n",
+"[1] P. Mermelstein (1976), \"Distance measures for speech recognition, psychological and instrumental,\" in Pattern Recognition and Artificial Intelligence. [pdf (Web Archive)](https://web.archive.org/web/20200714014004/http://www.haskins.yale.edu/sr/SR047/SR047_07.pdf)\n",
 "\n",
 "[2] X. Huang, A. Acero (Author), H.-W. Hon, \"Spoken Language Processing: A Guide to Theory, Algorithm and System Development Paperback – 2001\n",
 "\n",
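Several of the diffs above replace dead links with Wayback Machine snapshots (the BPE paper, the Mermelstein reference, the CTC blog post). The snapshot URL scheme is simple enough to capture in a tiny helper; the function name is ours and the timestamp must come from an actual archived capture:

```python
def to_web_archive(url: str, timestamp: str) -> str:
    """Build a Wayback Machine snapshot URL for a dead link.

    timestamp is a YYYYMMDDhhmmss capture time; the original URL is
    appended verbatim, scheme included.
    """
    return f"https://web.archive.org/web/{timestamp}/{url}"
```

For example, the Mermelstein reference above is the 2020-07-14 capture of the original `haskins.yale.edu` PDF produced by exactly this concatenation.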

docs/tutorials/tasks/asr-metrics.ipynb

Lines changed: 2 additions & 2 deletions
@@ -12,9 +12,9 @@
 "<!-- This cell is automatically updated by tools/tutorial-cell-updater.py -->\n",
 "<!-- The contents are initialized from tutorials/notebook-header.md -->\n",
 "\n",
-"[<img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>](https://colab.research.google.com/github/speechbrain/speechbrain/blob/develop/docs/tutorials/tasks/pr2451-new-metrics.ipynb)\n",
+"[<img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>](https://colab.research.google.com/github/speechbrain/speechbrain/blob/develop/docs/tutorials/tasks/asr-metrics.ipynb)\n",
 "to execute or view/download this notebook on\n",
-"[GitHub](https://github.com/speechbrain/speechbrain/tree/develop/docs/tutorials/tasks/pr2451-new-metrics.ipynb)"
+"[GitHub](https://github.com/speechbrain/speechbrain/tree/develop/docs/tutorials/tasks/asr-metrics.ipynb)"
 ]
 },
 {

docs/tutorials/tasks/speech-classification-from-scratch.ipynb

Lines changed: 2 additions & 2 deletions
@@ -871,7 +871,7 @@
 "source": [
 "## Step 3: Inference\n",
 "\n",
-"At this point, we can use the trained classifier to perform **predictions on new data**. Speechbrain made available some classes ([take a look here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py)) such as the `EncoderClassifier` one that can make inference easier. The class can also be used to extract some embeddings at the output of the encoder.\n",
+"At this point, we can use the trained classifier to perform **predictions on new data**. Speechbrain made available some classes ([take a look here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/inference/classifiers.py)) such as the `EncoderClassifier` one that can make inference easier. The class can also be used to extract some embeddings at the output of the encoder.\n",
 "\n",
 "Let's see first how can we used it to load our best xvector model (trained on Voxceleb and stored on HuggingFace) to compute some embeddings and perform a speaker classification:\n"
 ]
@@ -1256,7 +1256,7 @@
 "\n",
 "### Use the EncoderClassifier interface on your model\n",
 "\n",
-"The [EncoderClassidier class](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py#L591) takes a pre-trained model and performs inference on it with the following methods:\n",
+"The [EncoderClassifier class](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/inference/classifiers.py) takes a pre-trained model and performs inference on it with the following methods:\n",
 "\n",
 "- **encode_batch**: applies the encoder to an input batch and returns some encoded embeddings.\n",
 "- **classify_batch**: performs a full classification step and returns the output probabilities of the classifier, the best score, the index of the best class, and its label in text format (see example above).\n",

docs/tutorials/tasks/speech-recognition-from-scratch.ipynb

Lines changed: 3 additions & 3 deletions
@@ -158,7 +158,7 @@
 "source": [
 "We encourage the readers not familiar enough with speech recognition to gain more familiarity with this technology before moving on. Beyond scientific papers, online you can find amazing tutorials and blog posts, such as:\n",
 "- [An Intuitive Explanation of Connectionist Temporal Classification](https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c)\n",
-"- [Connectionist Temporal Classification](https://machinelearning-blog.com/2018/09/05/753/)\n",
+"- [Connectionist Temporal Classification](https://web.archive.org/web/20211017041333/https://machinelearning-blog.com/2018/09/05/753/)\n",
 "- [Sequence-to-sequence learning with Transducers](https://lorenlugosch.github.io/posts/2020/11/transducer/)\n",
 "- [Understanding Encoder-Decoder Sequence to Sequence Model](https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346)\n",
 "- [What is a Transformer?](https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04)\n",
@@ -1845,7 +1845,7 @@
 "source": [
 "## Step 5: Inference\n",
 "\n",
-"At this point, we can use the trained speech recognizer. For this type of ASR model, speechbrain made available some classes ([take a look here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py)) such as the `EncoderDecoderASR` one that can make inference easier. For instance, we can transcribe an audio file with a pre-trained model hosted in our [HuggingFace repository](https://huggingface.co/speechbrain) in solely 4 lines of code:\n"
+"At this point, we can use the trained speech recognizer. For this type of ASR model, speechbrain made available some classes ([take a look here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/inference/ASR.py)) such as the `EncoderDecoderASR` one that can make inference easier. For instance, we can transcribe an audio file with a pre-trained model hosted in our [HuggingFace repository](https://huggingface.co/speechbrain) in solely 4 lines of code:\n"
 ]
 },
 {
@@ -2174,7 +2174,7 @@
 "\n",
 "While the `EncoderDecoderASR` class has been designed to be as generic as possible, your might require a more complex inference scheme that better fits your needs. In this case, you have to develop your own interface. To do so, follow these steps:\n",
 "\n",
-"1. Create your custom interface inheriting from `Pretrained` (code [here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py)):\n",
+"1. Create your custom interface inheriting from `Pretrained` (code [here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/inference/interfaces.py)):\n",
 "\n",
 "\n",
 "```python\n",
