[gpu] Install NVIDIA Container Toolkit #1025 (PR #1067)

Open · wants to merge 1 commit into master
Conversation

@cjac (Contributor) commented Jun 30, 2023

Continuation of #1025

@cjac (Contributor, Author) commented Jun 30, 2023

TO DO:

  • ensure that the DOCKER optional component configures everything so that it can be used with YARN containers launched via nvidia-docker from the NVIDIA Container Toolkit (see the job-configuration sketch after this list)
  • Develop a working example of creating a cluster using the DOCKER optional component that is able to launch PySpark jobs which complete successfully:
    Create the cluster
  time gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --optional-components DOCKER \
    --scopes 'https://www.googleapis.com/auth/cloud-platform'

Launch the job

gsutil cp test.py gs://${BUCKET}/
gcloud dataproc jobs submit pyspark \
  --properties="spark.executor.resource.gpu.amount=1,spark.task.resource.gpu.amount=1,spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YARN_DOCKER_IMAGE}" \
  --cluster=${CLUSTER_NAME} \
  --region=${REGION} \
  gs://${BUCKET}/test.py

test.py:

# Copyright 2022,2023 Google LLC and contributors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("torch - tensorflow").getOrCreate()

# Report CUDA availability as seen by PyTorch.
import torch
print("get CUDA details : == : ")
use_cuda = torch.cuda.is_available()
if use_cuda:
    print('__CUDNN VERSION:', torch.backends.cudnn.version())
    print('__Number CUDA Devices:', torch.cuda.device_count())
    print('__CUDA Device Name:', torch.cuda.get_device_name(0))
    print('__CUDA Device Total Memory [GB]:', torch.cuda.get_device_properties(0).total_memory / 1e9)

# Report GPU availability as seen by TensorFlow.
import tensorflow as tf
print("Get GPU Details : ")
print(tf.test.is_gpu_available())

if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("Please install GPU version of TF")

gpu_available = tf.test.is_gpu_available()
print("gpu_available : " + str(gpu_available))

is_cuda_gpu_available = tf.test.is_gpu_available(cuda_only=True)
print("is_cuda_gpu_available : " + str(is_cuda_gpu_available))

# Require a CUDA GPU with compute capability of at least 3.0.
is_cuda_gpu_min_3 = tf.test.is_gpu_available(True, (3, 0))
print("is_cuda_gpu_min_3 : " + str(is_cuda_gpu_min_3))

from tensorflow.python.client import device_lib

def get_available_gpus():
    # List the devices visible to TensorFlow and keep only the GPUs.
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']


print("Run GPU Functions Below : ")
print(get_available_gpus())
  • patch install_gpu_driver.sh to check whether the DOCKER optional component has been enabled and, if so, trigger the installation and testing of the NVIDIA Container Toolkit (see the installer sketch after this list)
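
For the first item, the job-level wiring I have in mind looks roughly like the following. This is a minimal sketch, not what this PR implements: it assumes the cluster was created with GPU accelerators attached (e.g. --worker-accelerator type=nvidia-tesla-t4,count=1) and that the NodeManagers already permit the docker runtime. YARN_CONTAINER_RUNTIME_TYPE and YARN_CONTAINER_RUNTIME_DOCKER_IMAGE are the standard Hadoop YARN environment variables, and ${YARN_DOCKER_IMAGE} is a placeholder for a CUDA-enabled image.

# Sketch only: route both the application master and the executors into
# Docker containers on YARN; assumes NodeManager docker support is enabled.
gcloud dataproc jobs submit pyspark \
  --cluster=${CLUSTER_NAME} \
  --region=${REGION} \
  --properties="\
spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker,\
spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YARN_DOCKER_IMAGE},\
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker,\
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${YARN_DOCKER_IMAGE}" \
  gs://${BUCKET}/test.py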
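
For the last item, the shape of the install_gpu_driver.sh patch could be something like the sketch below. It is not this PR's actual diff: it treats a running Docker daemon as the signal that the DOCKER optional component is enabled, follows NVIDIA's documented apt repository setup for Debian-based images, and reuses ${YARN_DOCKER_IMAGE} as a stand-in test image.

# Sketch: install and smoke-test the NVIDIA Container Toolkit when the
# DOCKER optional component (i.e. a running dockerd) is present.
if systemctl is-active --quiet docker ; then
  curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
    | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
    | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
    | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  apt-get update -qq
  apt-get install -y -qq nvidia-container-toolkit
  nvidia-ctk runtime configure --runtime=docker   # register the nvidia runtime with dockerd
  systemctl restart docker
  # The toolkit works if nvidia-smi succeeds inside a container.
  docker run --rm --runtime=nvidia ${YARN_DOCKER_IMAGE} nvidia-smi
fi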

@cjac (Contributor, Author) commented Jun 30, 2023

hmmm... maybe we should patch in a metadata argument to the installer: driver-only.
If set to true, do not install CUDA or any of the other analytics infrastructure on the worker itself; these will be assumed to be installed in the container in which the workload executes.

Clusters created with this argument will not be able to perform hardware-assisted workloads directly on the host. Jobs which expect hardware-assisted workloads will need to install the libraries manually themselves or, better yet, execute in a container.
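
A sketch of what honoring that metadata argument could look like in the installer; /usr/share/google/get_metadata_value is the metadata helper shipped on Dataproc images, and the two install_* functions are hypothetical stand-ins for the existing driver and CUDA installation steps:

# Sketch: skip the CUDA/analytics stack on the host when driver-only=true.
DRIVER_ONLY="$(/usr/share/google/get_metadata_value attributes/driver-only || echo 'false')"
install_nvidia_driver                    # hypothetical: kernel driver only
if [[ "${DRIVER_ONLY}" != "true" ]]; then
  install_cuda_toolkit                   # hypothetical: CUDA + analytics libraries
fi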

@jayadeep-jayaraman (Collaborator) commented

/gcbrun
