TensorBoard logs: "Unexpected error: corrupted record at 0" #924

Closed · karlschriek opened this issue Nov 21, 2019 · 11 comments

@karlschriek

/kind bug

What steps did you take and what happened:

I am trying to construct a simple MNIST example that uses Keras with TFJob and logs metrics via the TensorBoard callback. However, the TensorFlowEvent collector is unable to pick up the logs.

The workers reach the Completed stage. Calling kubectl -n kubeflow logs pod/tfjob-example-tf-events-xxxxxxx-worker-0 metrics-collector then yields the following:

/tensorboard/train/events.out.tfevents.1574336795.tfjob-example-tf-events-xxxxxxx-worker-0.8.140821.v2 will be parsed.
/tensorboard/train/events.out.tfevents.1574336799.tfjob-example-tf-events-xxxxxxx-worker-0.profile-empty will be parsed.
/tensorboard/train/plugins/profile/2019-11-21_11-46-39/local.trace will be parsed.
Unexpected error: corrupted record at 0
In tfjob-example-tf-events-xxxxxxx 0 metrics will be reported.

The full code and details of what I am doing follow below.

1. model.py

This sits within a Docker image (let's just call it my_images/keras_mnist for simplicity's sake) running TensorFlow 2.0 (it is based on tensorflow/tensorflow:latest-gpu-py3).

It is based on the official tutorial for running Keras in a distributed manner, as found here:
https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras
https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/distribute/multi_worker_with_keras.ipynb#scrollTo=xIY9vKnUU82o

import tensorflow_datasets as tfds
tfds.disable_progress_bar()
import tensorflow as tf
import argparse


FLAGS = None

BUFFER_SIZE = 10000

class StdOutCallback(tf.keras.callbacks.ProgbarLogger):
    # A simple callback that piggybacks on the progress bar callback; it prints metrics to stdout.
    def on_batch_end(self, batch, logs=None):
        logs = logs or {}
        for k in self.params['metrics']:
            if k in logs:
                print("{}={}".format(k,logs[k]))

def build_and_compile_cnn_model(dropout_rate, lr):
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dropout(rate=dropout_rate),
      tf.keras.layers.Dense(10, activation='softmax')
  ])

  model.compile(
      loss=tf.keras.losses.sparse_categorical_crossentropy,
      optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
      metrics=['accuracy']
  )
  return model

def make_datasets_unbatched(dataset_name):
  # Scaling MNIST data from (0, 255] to (0., 1.]
  def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255
    return image, label

  datasets, info = tfds.load(name='mnist',
                            with_info=True,
                            as_supervised=True)

  return datasets[dataset_name].map(scale).cache().shuffle(BUFFER_SIZE)


def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_size_x', type=int, default=10)
    parser.add_argument('--input_size_y', type=int, default=10)
    parser.add_argument('--lr', type=float, default=0.01)
    parser.add_argument('--dropout_rate', type=float, default=0.4)
    parser.add_argument('--log_dir', type=str, default="./tensorboard/metrics")
    parser.add_argument('--number_of_workers', type=int, default=1)
    args, unparsed = parser.parse_known_args()

    return args, unparsed

def main():
    FLAGS, unparsed = parse_arguments()
    GLOBAL_BATCH_SIZE = 512 * FLAGS.number_of_workers
    
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    tensorboard = tf.keras.callbacks.TensorBoard(log_dir=FLAGS.log_dir, update_freq="batch")      
    std_out = StdOutCallback()

    with strategy.scope():
        train_datasets = make_datasets_unbatched("train").batch(GLOBAL_BATCH_SIZE)
        multi_worker_model = build_and_compile_cnn_model(FLAGS.dropout_rate, FLAGS.lr)
        
    multi_worker_model.fit(x=train_datasets,
                           epochs=3,
                           callbacks=[tensorboard, std_out],
                           verbose=0  # to disable progress bar
                           )

if __name__ == '__main__':
    main()

The above code has two callbacks. The first is the vanilla TensorBoard callback, which writes to FLAGS.log_dir and is meant to be picked up by a collector of kind TensorFlowEvent. The second is a small custom StdOutCallback, used for testing purposes only, which writes metrics to standard out in the format acc=0.71; this one is meant to be picked up by a collector of kind StdOut.
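
For reference, a more generic variant of the StdOut callback that does not rely on ProgbarLogger internals (just a sketch; it simply prints every entry of the logs dict in the name=value format the StdOut collector picks up) could look like this:

import tensorflow as tf

class SimpleStdOutCallback(tf.keras.callbacks.Callback):
    # Print every logged metric as "name=value" at the end of each batch,
    # which is the format the StdOut metrics collector parses.
    def on_batch_end(self, batch, logs=None):
        for name, value in (logs or {}).items():
            print("{}={}".format(name, value))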

2. YAML files

I use two different YAML files. The first uses a StdOut collector and runs through without any problems. It looks like this:

std_out.yaml:

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example-std-out
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.1
    objectiveMetricName: loss
    additionalMetricNames:
      - acc
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    collector:
      kind: StdOut
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: --dropout_rate
      parameterType: double
      feasibleSpace:
        min: "0.3"
        max: "0.7"
  trialTemplate:
    goTemplate:
        rawTemplate: |-
          apiVersion: "kubeflow.org/v1"
          kind: TFJob
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
           tfReplicaSpecs:
            Worker:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  containers:
                    - name: tensorflow
                      image: my_images/keras_mnist:latest
                      imagePullPolicy: Always
                      command:
                        - "python"
                        - "model.py"
                        - "--input_size_x=10"
                        - "--input_size_y=10"
                        - "--log_dir=/tensorboard"
                        - "--number_of_workers=1"
                        {{- with .HyperParameters}}
                        {{- range .}}
                        - "{{.Name}}={{.Value}}"
                        {{- end}}
                        {{- end}}

The above is really only used for testing (and out of curiosity about how the different kinds of collectors work). What I would actually prefer is to collect the TensorBoard logs. For that I've set up the following YAML:

tf_events.yaml:

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example-tf-event
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.01
    objectiveMetricName: loss
    additionalMetricNames:
      - acc
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /tensorboard
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: --dropout_rate
      parameterType: double
      feasibleSpace:
        min: "0.3"
        max: "0.7"
  trialTemplate:
    goTemplate:
        rawTemplate: |-
          apiVersion: "kubeflow.org/v1"
          kind: TFJob
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
           tfReplicaSpecs:
            Worker:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  containers:
                    - name: tensorflow
                      image: my_images/keras_mnist:latest
                      imagePullPolicy: Always
                      command:
                        - "python"
                        - "model.py"
                        - "--input_size_x=10"
                        - "--input_size_y=10"
                        - "--log_dir=/tensorboard"
                        - "--number_of_workers=1"
                        {{- with .HyperParameters}}
                        {{- range .}}
                        - "{{.Name}}={{.Value}}"
                        {{- end}}
                        {{- end}}

Running kubectl apply -f tf_events.yaml results in the error shown at the top of this post being logged in the metrics-collector sidecar.

What did you expect to happen:

The metrics-collector's logs suggest that it found the TensorBoard logs and that it would attempt to parse them. I would expect the parsing to work (or, at the very least, to get a message explaining why it doesn't).

Environment:

  • TensorFlow version: 2.0
  • Kubeflow version: 0.7.0
  • Kubernetes version (kubectl version): 1.12
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.9", GitCommit:"16236ce91790d4c75b79f6ce96841db1c843e7d2", GitTreeState:"clean", BuildDate:"2019-03-27T14:42:18Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-afa464", GitCommit:"afa464ce9760bd08a53a207f505b133b93366ea3", GitTreeState:"clean", BuildDate:"2019-10-22T21:42:57Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
@hougangliu (Member)

#920 fixed this issue (unfortunately the fix is not included in 0.7.0). You can kubectl edit configmap katib-config -n kubeflow to use an image that includes the fix, as below:

"TensorFlowEvent": {
        "image": "gcr.io/kubeflow-images-public/katib/v1alpha3/tfevent-metrics-collector"
      }

@karlschriek (Author) commented Dec 2, 2019

Great, that's good to know, thanks! I'll give the custom fix a try; otherwise I'll come back to this with the next release.

@eriklincoln commented Dec 3, 2019

@karlschriek were you able to test this out to see if it fixed your problem? I think I'm having a similar problem with the TensorFlowEvent collector. I'm also using TF 2.0 (Docker image tensorflow/tensorflow:nightly-py3) but with a custom training loop.

I'm writing out my metric scalars via a writer created with tf.summary.create_file_writer and writing the scalars with tf.summary.scalar. The writer gets flushed at the end of each epoch, and I confirmed that the events.out.tfevents... file is written in the trial worker container when the writer is flushed.
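
Roughly, the writing side looks like this (a simplified sketch; the log directory and helper function here are placeholders for my actual code):

import tensorflow as tf

# Writer pointed at the directory the metrics collector watches.
writer = tf.summary.create_file_writer("/job/logs")

def log_test_loss(epoch, test_loss):
    # The tag passed to tf.summary.scalar matches
    # spec.objective.objectiveMetricName in the Experiment.
    with writer.as_default():
        tf.summary.scalar("test_loss", float(test_loss), step=epoch)
    writer.flush()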

However, the metrics-collector container does not work with either the tfevent-metrics-collector image for 0.7.0 or the latest image suggested by @hougangliu. The error in the metrics-collector container with tfevent-metrics-collector:latest is:

Traceback (most recent call last): 
  File "main.py", line 39, in <module>
    WaitOtherMainProcesses(completed_marked_dir=opt.dir_path)
  File "/usr/src/app/github.com/kubeflow/katib/pkg/metricscollector/v1alpha3/common/pns.py", line 45, in WaitOtherMainProcesses 
    return WaitPIDs(GetOtherMainProcesses(), poll_interval_seconds, timeout_seconds, is_wait_all, completed_marked_dir)
  File "/usr/src/app/github.com/kubeflow/katib/pkg/metricscollector/v1alpha3/common/pns.py", line 31, in WaitPIDs
    with open(mark_file) as file_obj:
IOError: [Errno 2] No such file or directory: '/job/7.pid'
rpc error: code = Unknown desc = Error: No such container: 53a4eb8f10203672b3bd1f83ee6d0275064ff84d770767b6d34f1bfec9360fb1(base) 

When I use tfevent-metrics-collector:v0.7.0 and have the job save only a single tfevents file (no checkpoints, variables, or the model itself), the container returns:

/job/logs/events.out.tfevents.1575411038.gcp-demo1-tune-l92bc8xm-worker-0.6.384.v2 will be parsed. 
In gcp-demo1-tune-l92bc8xm 1 metrics will be reported.

It appears to parse the events file correctly and to find the scalar named in spec.objective.objectiveMetricName, because when I set objectiveMetricName to something random, it returns ... 0 metrics will be reported.

However, the metrics do not appear to be saved correctly in the experiment, as they are not appearing in the UI (the values are always 0):

[Screenshot: Katib UI showing the trial metric values as 0]

My Experiment config is:

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: gcp-demo1-tune
spec:
  parallelTrialCount: 5
  maxTrialCount: 20
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.00
    objectiveMetricName: test_loss
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /job
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: --learning-rate
      parameterType: discrete
      feasibleSpace: 
        list: ["0.0001", "0.0005", "0.001", "0.005", "0.01", "0.05", "0.1", "0.5", "1.0"]
  trialTemplate:
    goTemplate:
        rawTemplate: |-
          apiVersion: "kubeflow.org/v1"
          kind: "TFJob"
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
            cleanPodPolicy: None 
            tfReplicaSpecs:
              Worker:
                replicas: 1
                restartPolicy: OnFailure
                template:
                  spec:
                    containers:
                      - name: tensorflow
                        image: gcr.io/PROJECTID/gcp-demo1:training
                        imagePullPolicy: Always
                        command:
                          - "python"
                          - "-m"
                          - "trainer.task"
                          - "task"
                          - "--batch-size=128"
                          - "--epochs=1"
                          - "--chunk-size=5000000"
                          - "--cycle-length=8"
                          - "--job-dir=/job"
                          - "--table-id=finaltaxi_encoded_sampled_small"
                          {{- with .HyperParameters}}
                          {{- range .}}
                          - "{{.Name}}={{.Value}}"
                          {{- end}}
                          {{- end}}
                        resources:
                          limits:
                            cpu: '4'
                            memory: '40G'
                        env:
                        - name: GOOGLE_APPLICATION_CREDENTIALS
                          value: "/etc/secrets/user-gcp-sa.json"
                        volumeMounts:
                        - name: sa
                          mountPath: "/etc/secrets"
                          readOnly: true
                    volumes:
                    - name: sa
                      secret:
                        secretName: user-gcp-sa

I can provide my TF code as well if helpful.

@karlschriek (Author)

@eriklincoln I haven't really had a chance to try it yet. It is also unlikely that I'll get to it before January.

@karlschriek (Author) commented Jan 14, 2020

So I've finally had a chance to come back to this today. @hougangliu, I edited the katib-config ConfigMap to use gcr.io/kubeflow-images-public/katib/v1alpha3/tfevent-metrics-collector:latest, but this still results in the exact same error message:

/tensorboard/train/events.out.tfevents.1578998178.tfjob-example-tf-event-wdlgctxj-worker-0.8.520.v2 will be parsed.
/tensorboard/train/events.out.tfevents.1578998181.tfjob-example-tf-event-wdlgctxj-worker-0.profile-empty will be parsed.
/tensorboard/train/plugins/profile/2020-01-14_10-36-21/local.trace will be parsed.
Unexpected error: corrupted record at 0
In tfjob-example-tf-event-wdlgctxj 0 metrics will be reported.

@swarajoturkar

@karlschriek Hi Karl, I had the exact same issue, which got resolved after using the tfevent-metrics-collector:latest image and setting the metric name (in tf_events.yaml) to batch_accuracy instead of accuracy.
TFEvent summaries apparently contain a prefix based on the update_freq parameter of tf.keras.callbacks.TensorBoard.

@karlschriek (Author)

@swarajoturkar thanks for the tip, I'll try it out! Do you know if this is documented somewhere? I take it epoch_{metric_name} would equally be relevant when using a per-epoch update frequency?

@swarajoturkar commented Jan 16, 2020

@karlschriek
I don't think so, since I found no mention of it while looking for a solution. I manually read the events file generated locally, which is where I found the available keys for the scalar metrics.

Yes, in fact epoch_{metric_name} was logged even though I specified "batch" in the TensorBoard callback. Since I am pretty new to TensorBoard, I am not sure if this is the expected behaviour.

Here is some code I used to read the event files:

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# log_file is the path to a locally generated events.out.tfevents.* file
event_acc = EventAccumulator(log_file)
event_acc.Reload()
print(event_acc.Tags())

{'images': [], 'audio': [], 'histograms': [], 'scalars': ['batch_loss', 'batch_accuracy', 'epoch_loss', 'epoch_accuracy'], 'distributions': [], 'tensors': [], 'graph': True, 'meta_graph': True, 'run_metadata': []}
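
To pull the actual values for one of these tags, something like this also works (each returned event has step and value fields):

for event in event_acc.Scalars('batch_loss'):
    print(event.step, event.value)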

@karlschriek (Author)

I tried using batch_loss this morning. It looks like I am one step closer, but still not quite there:

/tensorboard/train/events.out.tfevents.1579515747.tfjob-example-tf-event-tlcljq2x-worker-0.8.140488.v2 will be parsed.
/tensorboard/train/events.out.tfevents.1579515749.tfjob-example-tf-event-tlcljq2x-worker-0.profile-empty will be parsed.
/tensorboard/train/plugins/profile/2020-01-20_10-22-29/local.trace will be parsed.
Unexpected error: corrupted record at 0
In tfjob-example-tf-event-tlcljq2x 30000 metrics will be reported.

Traceback (most recent call last):
  File "main.py", line 54, in <module>
    ), timeout=timeout_in_seconds)
  File "/usr/local/lib/python2.7/dist-packages/grpc/beta/_client_adaptations.py", line 309, in __call__
    self._request_serializer, self._response_deserializer)
  File "/usr/local/lib/python2.7/dist-packages/grpc/beta/_client_adaptations.py", line 195, in _blocking_unary_unary
    raise _abortion_error(rpc_error_call)
grpc.framework.interfaces.face.face.ExpirationError: ExpirationError(code=StatusCode.DEADLINE_EXCEEDED, details="Deadline Exceeded")

If I set this to epoch_loss instead, it at least runs through, so it looks like transmitting all 30,000 batch metrics takes too long.
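
For now I'll probably just switch the TensorBoard callback to per-epoch logging, which (if I understand the prefix behaviour correctly) should mean only the epoch_* tags get written and far fewer metrics need to be transmitted. Something like:

tensorboard = tf.keras.callbacks.TensorBoard(log_dir=FLAGS.log_dir,
                                             update_freq="epoch")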

stale bot commented Nov 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot commented Dec 19, 2020

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

stale bot closed this as completed on Dec 19, 2020