Skip to content

Additional use cases

As you get familiar with ML-Git, you might feel the necessity of use advanced ML-Git features to solve your problems. Thus, this section aims to provide advanced scenarios and additional use cases.

Keeping track of a dataset

Often, users can share the same dataset. As the dataset improve, you will need to keep track of the changes. It is very simple to keep check what is new in a shared repository. You just need to navigate to the root of your project. Then, you can execute the command update, it will update the metadata repository, allowing visibility of what has been changed since the last update. For example, new ML entity and/or new versions.

ml-git repository update

In case something new exists in this repository, you will see a output like:

INFO - Metadata Manager: Pull [/home/Documents/my-mlgit-project-config/.ml-git/datasets/metadata]
INFO - Metadata Manager: Pull [/home/Documents/my-mlgit-project-config/.ml-git/labels/metadata]

Then, you can checkout the new available data.

Linking labels to a dataset

ML-Git provides support for users link an entitity to another. In this example, we show how to link labels to a dataset. To accomplish this use case, you will need to have a dataset versioned by ML-Git in your repository.

First, you need to configure your remote repository. Then, you can configure your storage. It is a similarly process as you did to configure your repository and storage for your dataset.

ml-git repository remote labels add git@github.com:example/your-mlgit-labels.git
ml-git repository storage add mlgit-labels --endpoint-url=<minio-endpoint-url>
ml-git labels init

Even, we are using a different bucket to store the labels data. It would be possible to store both datasets and labels into the same bucket.

If you look at your config file using the command:

ml-git repository config

You should see something similar to the following config file:

config:
{'batch_size': 20,
 'cache_path': '',
 'datasets': {'git': 'git@github.com:example/your-mlgit-datasets.git'},
 'index_path': '',
 'labels': {'git': 'git@github.com:example/your-mlgit-labels.git'},
 'metadata_path': '',
 'mlgit_conf': 'config.yaml',
 'mlgit_path': '.ml-git',
 'models': {'git': ''},
 'object_path': '',
 'refs_path': '',
 'storages': {'s3h': {'mlgit-datasets': {'aws-credentials': {'profile': 'default'},
                                      'endpoint-url': <minio-endpoint-url>,
                                      'region': 'us-east-1'}}},
              's3h': {'mlgit-labels': {'aws-credentials': {'profile': 'default'},
                                      'endpoint-url': <minio-endpoint-url>,
                                      'region': 'us-east-1'}}},
 'verbose': 'info'}

Then, you can create your first set of labels. As an example, we will use your-labels. ML-Git expects any set of labels to be specified under the labels/ directory of your project. Also, it expects a specification file with the name of the labels.

ml-git labels create your-labels --categories="computer-vision, labels" --mutability=mutable --storage-type=s3h --bucket-name=mlgit-labels --version=1

After create the entity, you can create the README.md describing your set of labels. Below, we show an example of caption labels for the your-labels directory and file structure:

your-labels/
├── README.md
├── annotations
│   ├── captions_train2014.json
│   └── captions_val2014.json
└── your-labels.spec

Now, you are ready to version the new set of labels. For this, do:

ml-git labels add your-labels
ml-git labels commit your-labels --dataset=your-datasets
ml-git labels push your-labels

The commands are very similar to dataset operations. However, you can note one particular change in the commit command. There is an option "--dataset" which is used to tell ML-Git that the labels should be linked to the specified dataset. With the following command, it is possible to see what datasets are associated with this labels

ml-git labels show your-labels

The output will looks like:

-- labels : your-labels --
categories:
- computer-vision
- captions
dataset:
  sha: 607fa818da716c3313a6855eb3bbd4587e412816
  tag: computer-vision__images__mscoco__1
manifest:
  files: MANIFEST.yaml
  storage: s3h://mlgit-datasets
name: your-captions
version: 1

As you can see, there is a new section "dataset" that has been added by ML-Git with the sha & tag fields. It can be used to checkout the exact version of the dataset for that set of labels.

Uploading labels related to a dataset:

asciicast

Adding special credentials AWS

Depending the project you are working on, you might need to use special credentials to restrict access to your entities (e.g., datases) stored inside a S3/MinIO bucket. The easiest way to configure and use a different credentials for the AWS storage is installing the AWS command-line interface (awscli). First, install the awscli. Then, run the following command:

aws configure --profile=mlgit

You will need to inform the fields listed below:

AWS Access Key ID [None]: your-access-key
AWS Secret Access Key [None]: your-secret-access-key
Default region name [None]: bucket-region
Default output format [None]: json

These commands will create the files ~/.aws/credentials and ~/.aws/config.

Below, you can see a short video on how to configure the AWS profile:

asciicast

After you have created your special credentials (e.g., mlgit profile)

You can use this profile as parameter to add your storages. Following, you can see an exaple of how to attach the profile to the storage mlgit-datasets.

ml-git repository storage add mlgit-datasets --credentials=mlgit --endpoint-url=<minio-endpoint-url>

Resources Inicialization using script

You can find the script following the step below. It remotely creates the configurations, and during the execution it will generate a config repository containing the configurations pointing to the metadata repository (GitHub) and storage (AWS S3 or Azure Blob).

If you are using Linux, execute on the terminal:

cd ml-git
./scripts/resources_initialization/resources_initialization.sh

If you are using Windows, execute on the CMD or Powershell:

cd ml-git
.\scripts\resources_initialization\resources_initialization.bat

At the end of executing this script, you will be able to directly execute a clone command to download your ML-Git project.

Checking Data Integrity

If at some point you want to check the integrity of the metadata repository (e.g. computer shutdown during a process), simply type the following command:

ml-git datasets fsck

That command will walk through the internal ML-Git directories (index & local repository) and will check the integrity of all blobs under its management. It will return the list of blobs that are corrupted.

Checking Data Integrity:

asciicast


Last update: September 29, 2023