Dataset performance #547

Open
KeremTurgutlu opened this issue Apr 20, 2023 · 0 comments
KeremTurgutlu commented Apr 20, 2023

I am having a difficult time getting my data pipeline to the throughput levels that I would like before starting training with the t5x library.

Initially I planned to use a mixture of ~40 tasks (1-2 TB of text) for training and started doing some benchmarking, following the general TPU and dataset performance tips and guides.

All of my datasets/tasks are JSON Lines files (output from earlier Dataflow jobs), varying from 200 to 1000 files per task.

I used Colab notebooks or an E2 32-vCPU instance for my benchmarking experiments, mounting the bucket that holds all ~40 datasets I plan to use. I sampled 16 different files as training files for each task source, because it is recommended not to read too many files from GCS.
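For reference, this is a minimal sketch of how that per-task file sampling might look; the bucket path and glob patterns below are placeholders, not my actual paths:

import random

import tensorflow as tf

# Hypothetical glob over one task's shards in the GCS bucket.
all_train_files = tf.io.gfile.glob("gs://my-bucket/task_name/train-*.jsonl")

# Keep only 16 shards per task so each source doesn't read too many GCS files.
random.seed(42)
train_files = sorted(random.sample(all_train_files, 16))
validation_files = tf.io.gfile.glob("gs://my-bucket/task_name/validation-*.jsonl")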

FileDataSource

I switched from FunctionDataSource to FileDataSource, mainly so that sharding can work over individual files without needing to read all the data, which I assume would be slower, especially for larger datasets.

import json

import seqio
import tensorflow as tf

@tf.autograph.experimental.do_not_convert
def read_file_fn(file):
  """Yields the 'text' field from each line of a JSON Lines file."""
  def _read_json(file):
    # Values passed through `args` arrive as NumPy bytes, so decode to str first.
    if isinstance(file, bytes):
      file = file.decode()
    with tf.io.gfile.GFile(file) as f:
      for line in f:
        yield json.loads(line)['text']

  return tf.data.Dataset.from_generator(
      _read_json, args=(file,),
      output_signature=tf.TensorSpec(shape=(), dtype=tf.string))

source = seqio.FileDataSource(
    read_file_fn=read_file_fn,
    split_to_filepattern=dict(train=train_files, validation=validation_files))

Here we can see the reading and deserialization performance of a single task source.

dataset = source.get_dataset("train", shard_info=seqio.ShardInfo(0,16))
tfds.benchmark(dataset, num_iter=10000)
Examples/sec (First included) 1622.67 ex/sec (total: 10001 ex, 6.16 sec)
Examples/sec (First only) 0.95 ex/sec (total: 1 ex, 1.05 sec)
Examples/sec (First excluded) 1954.66 ex/sec (total: 10000 ex, 5.12 sec)

Single Task

Then I register my seqio tasks with the full pipeline (including preprocessors) and test the performance of a single task.
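Roughly, the registration looks like the sketch below; the vocabulary path and the exact preprocessor list are placeholders rather than my real configuration:

import functools

import seqio

# Placeholder SentencePiece vocabulary; the real model file differs in my setup.
vocab = seqio.SentencePieceVocabulary("gs://my-bucket/spm/sentencepiece.model")

seqio.TaskRegistry.add(
    "task",
    source=source,  # the FileDataSource defined above
    preprocessors=[
        functools.partial(
            seqio.preprocessors.rekey,
            key_map={"inputs": None, "targets": "text"}),
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features={
        "inputs": seqio.Feature(vocabulary=vocab, add_eos=True),
        "targets": seqio.Feature(vocabulary=vocab, add_eos=True),
    },
)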

dataset = seqio.get_mixture_or_task('task').get_dataset(
                    sequence_length={"inputs": 512, "targets": 512},
                    split="train",
                    shuffle=False,
                    num_epochs=1,
                    shard_info=seqio.ShardInfo(index=0, num_shards=16),
                    use_cached=False,
                    seed=42)
tfds.benchmark(dataset, num_iter=10000)
Examples/sec (First included) 485.21 ex/sec (total: 10001 ex, 20.61 sec)
Examples/sec (First only) 0.47 ex/sec (total: 1 ex, 2.11 sec)
Examples/sec (First excluded) 540.50 ex/sec (total: 10000 ex, 18.50 sec)

Mixture

When I benchmark the performance of the mixture, it drops significantly (about 10x relative to the raw file source).
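For context, the mixture is registered roughly as in this sketch; the task names and rates here are placeholders, not my actual ~40-task configuration:

import seqio

# Hypothetical: mix the registered tasks with equal rates.
seqio.MixtureRegistry.add(
    "maana_version1.0_mixture",
    [("task", 1.0), ("another_task", 1.0)],  # placeholder (task name, rate) pairs
)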

dataset = seqio.get_mixture_or_task("maana_version1.0_mixture").get_dataset(
                    sequence_length={"inputs": 512, "targets": 512},
                    split="train",
                    shuffle=False,
                    num_epochs=1,
                    shard_info=seqio.ShardInfo(index=0, num_shards=16),
                    use_cached=False,
                    seed=42)
tfds.benchmark(dataset, num_iter=10000)
Examples/sec (First included) 140.55 ex/sec (total: 10001 ex, 71.16 sec)
Examples/sec (First only) 0.09 ex/sec (total: 1 ex, 11.49 sec)
Examples/sec (First excluded) 167.60 ex/sec (total: 10000 ex, 59.67 sec)

Follow-Up Thoughts

Please let me know if you have any feedback regarding the following comments and questions:

  1. In my experiments, reading from GCS vs. local files didn't differ much, so streaming directly from GCS is probably the better option (no need to download TBs of data), as long as the bucket is in the same zone as the TPU and the number of files is not too large. The docs recommend 10s to 100s of MB per file and 10s to 100s of files; in my case I have datasets with 200-1000 files in the 100 MB-1 GB range. Should I reduce the number of files, e.g. by making each file ~1 GB, and would this help pipeline performance? (A consolidation sketch is included after the codelab note below.)

  2. I also experimented with TFExampleDataSource vs FileDataSource and didn't see any performance gain from TFExample compared to JSON (see the sketch after this list). Is there a single best way to store data for seqio pipeline performance, e.g. would registering a tfds dataset be better, as explained here? In my experience, Dataflow jobs output a number of files equal to the number of workers, so it can be much higher than 100s. Is this OK, or should we keep the number of files in the 128-256 range?

  3. This is more of a T5X question, but it still might be related. My understanding is that when we get a dataset from a mixture, each task is iterated and, if shard info is specified, that shard is returned as the data; the same sample_fn is then used to sample from these task datasets with the given rates. I don't fully know how data parallelism plays together with model parallelism in t5x, and it may depend on the model size and the number of TPU cores we have. Is it correct to assume each TPU core is a worker and that data gets distributed to them when sharding? If so, would it make sense to have the number of files be a multiple of the core count (e.g. 8x for v3-8, 32x for v3-32)? I also read that the batch is automatically distributed across TPU cores during computation, which I guess is why 8 x 128 is emphasized; does that mean we don't necessarily need to care about the number of files / sharding and can still use a single source file?
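For reference, the TFExampleDataSource variant mentioned in point 2 looked roughly like this sketch; the file patterns and feature name are placeholders:

import seqio
import tensorflow as tf

# Hypothetical TFRecord shards with a single serialized 'text' feature per example.
tfexample_source = seqio.TFExampleDataSource(
    split_to_filepattern={
        "train": "gs://my-bucket/task_name/train-*.tfrecord",
        "validation": "gs://my-bucket/task_name/validation-*.tfrecord",
    },
    feature_description={"text": tf.io.FixedLenFeature([], tf.string)},
)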

Notes from codelab:

The rule of thumb is to split your data across several (10s to 100s) larg-ish files (10s to 100s of MB). If you have too many files, thousands of files for example, the time to access each file might start getting in the way. If you have too few files, like one or two, then you are not getting the benefits of streaming from multiple files in parallel.
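To make point 1 concrete, here is a hedged sketch of how the existing JSONL shards could be consolidated into ~1 GB files; the paths and target size are placeholders, and for the TB-scale tasks a Dataflow/Beam job would be a more scalable way to do the same thing:

import tensorflow as tf

TARGET_BYTES = 1 << 30  # aim for roughly 1 GB per output shard (placeholder)

def consolidate(input_pattern, output_prefix):
  """Rewrites many small JSON Lines shards into fewer, larger ones."""
  shard_idx, written = 0, 0
  out = tf.io.gfile.GFile(f"{output_prefix}-{shard_idx:05d}.jsonl", "w")
  for path in tf.io.gfile.glob(input_pattern):
    with tf.io.gfile.GFile(path) as f:
      for line in f:
        if written >= TARGET_BYTES:
          out.close()
          shard_idx, written = shard_idx + 1, 0
          out = tf.io.gfile.GFile(f"{output_prefix}-{shard_idx:05d}.jsonl", "w")
        out.write(line)
        written += len(line)  # character count as a rough proxy for bytes
  out.close()

consolidate("gs://my-bucket/task_name/train-*.jsonl",
            "gs://my-bucket/task_name_1gb/train")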
