Lifecycle callbacks

Pass callbacks under createTrainer({ callbacks: { ... } }). They run inside trainer.wait(), dispatched from the backend’s SSE event stream.

Signature

interface TrainerCallbacks {
  onStarted:    (ctx: { job: TrainingJob }) => unknown | Promise<unknown>;
  onLog:        (ctx: TrainingLogContext) => unknown | Promise<unknown>;
  onCheckpoint: (ctx: CheckpointContext) => unknown | Promise<unknown>;
  onCompleted:  (ctx: { job: TrainingJob; artifacts: unknown[] }) => unknown | Promise<unknown>;
  onFailed:     (ctx: { job: TrainingJob; error: string }) => unknown | Promise<unknown>;
}

The interface itself declares all five properties as required. The callbacks field on createTrainer is typed Partial<TrainerCallbacks>, so you only specify the events you actually care about. Return values from a callback are discarded; the unknown | Promise<unknown> shape just means TypeScript will not complain if you return something.

When each callback fires

trainer.start()    submits the job and returns { jobId }. No callbacks yet.
   │
   ▼
trainer.wait()     opens the SSE stream. Callbacks dispatch from here.
   │
   ▼
onStarted          once, on the `training.started` event
onLog              many times, one per metrics frame
onCheckpoint       several times, one per checkpoint upload
onCompleted        once, on `training.completed`
        ── or ──
onFailed           once, on `training.failed` (backend-reported failure)

If you call start() without wait(), no callbacks ever run. arkor start calls both for you; programmatic callers must do the same.

Parameters

`onStarted({ job })`

Fires when the SSE stream reports training.started. Use it for log lines or a “training started” notification.

onStarted: ({ job }) => {
  // job: TrainingJob (id, name, status, config, ...)
}

`onLog({ step, loss, evalLoss, learningRate, epoch, samplesPerSecond, job })`

Fires repeatedly as training progresses. Each numeric field is number | null: backends only fill in fields they have on a given step (so evalLoss is null on non-eval steps, learningRate may be null between LR-scheduler updates, etc.).

onLog: ({ step, loss, evalLoss }) => {
  if (loss !== null) {
    forwardToMetrics({ step, loss, evalLoss });
  }
}

Common uses: forwarding metrics to your own pipeline (PostHog, Datadog), detecting divergence early, and implementing custom early stopping (see the Early stopping recipe). For early stopping, remember that aborting the abortSignal only stops your local wait(); call trainer.cancel() afterwards to actually stop the GPU on the backend.
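As a rough illustration of divergence detection, the sketch below keeps a rolling mean of recent losses and flags the run when that mean rises well above the best mean seen so far. createDivergenceDetector, windowSize, and factor are hypothetical names, not part of the SDK, and the heuristic assumes positive loss values.

```typescript
// Hypothetical helper: not part of the arkor SDK.
interface LossSample {
  step: number;
  loss: number | null;
}

function createDivergenceDetector(windowSize: number, factor: number) {
  const recent: number[] = [];
  let best = Number.POSITIVE_INFINITY;

  // Returns true when the run looks like it has diverged.
  return function check({ loss }: LossSample): boolean {
    if (loss === null) return false;         // no loss reported on this step
    if (!Number.isFinite(loss)) return true; // NaN or Infinity counts as diverged

    recent.push(loss);
    if (recent.length > windowSize) recent.shift();
    if (recent.length < windowSize) return false; // not enough data yet

    const mean = recent.reduce((a, b) => a + b, 0) / recent.length;
    best = Math.min(best, mean);
    // Assumes positive losses: flag when the rolling mean exceeds
    // the best rolling mean by the given factor.
    return mean > best * factor;
  };
}
```

You could call the returned function from onLog and, when it returns true, abort your signal and then call trainer.cancel() as described above.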

`onCheckpoint({ step, adapter, job, infer, artifacts })`

Fires when an adapter checkpoint is saved on the backend, while the run is still going. adapter is { kind: "checkpoint", jobId, step }. infer is described in detail on the infer page; in short it takes a chat-style request and returns a raw Response.

onCheckpoint: async ({ step, infer }) => {
  const res = await infer({
    messages: [{ role: "user", content: "Can't log in" }],
  });
  const sample = await res.text();
  // Decide whether the model is on track
}

This is where most of the value of doing fine-tuning in TypeScript lives: you can run the half-trained model against a held-out prompt before the full run finishes.
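What "on track" means is up to you. As a deliberately crude sketch, the helper below checks whether a sample mentions enough expected keywords. sampleLooksOnTrack and its parameters are hypothetical, and a real evaluation would likely be more careful than substring matching.

```typescript
// Hypothetical helper: a crude keyword check, not an SDK feature.
function sampleLooksOnTrack(
  sample: string,
  expectedKeywords: string[],
  minHits = 1,
): boolean {
  const text = sample.toLowerCase();
  // Count how many expected keywords appear anywhere in the sample.
  const hits = expectedKeywords.filter((kw) =>
    text.includes(kw.toLowerCase()),
  ).length;
  return hits >= minHits;
}
```

Inside onCheckpoint you would pass it the text read from the Response, for example sampleLooksOnTrack(await res.text(), ["password", "reset"]).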

`onCompleted({ job, artifacts })`

Fires once on success. artifacts is unknown[]: the raw artifact list the backend sent. Schemas evolve, so the SDK does not narrow it.

onCompleted: ({ job, artifacts }) => {
  saveAdapterId({ jobId: job.id, count: artifacts.length });
}
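Because artifacts arrives as unknown[], you need to narrow it yourself before reading fields. The sketch below uses a type guard for an assumed artifact shape; NamedArtifact and its id and name fields are illustrative only, so check what your backend actually sends.

```typescript
// Assumed artifact shape for illustration only; the real schema may differ.
interface NamedArtifact {
  id: string;
  name: string;
}

// Type guard: true only for non-null objects with string id and name.
function isNamedArtifact(value: unknown): value is NamedArtifact {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record.id === "string" && typeof record.name === "string";
}

// Keeps the entries that match the assumed shape and drops the rest.
function pickNamedArtifacts(artifacts: unknown[]): NamedArtifact[] {
  return artifacts.filter(isNamedArtifact);
}
```

Entries that do not match are dropped silently here; you may prefer to log them so that schema changes do not go unnoticed.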

`onFailed({ job, error })`

Fires once on a backend-reported failure. error is a string (the message the backend sent), not an Error instance.

onFailed: ({ job, error }) => {
  // error: string
}

onFailed is only for backend-side failures. Exceptions thrown inside your other callbacks do not reach onFailed; see Exception handling below for what does happen to them.

Behavior

Sequencing

Each callback is awaited before the next event is dispatched. You can return a promise (writing to a database, posting to Slack, calling infer) and the SDK will wait for it before processing the next frame. There are no concurrent callback invocations for the same trainer.
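To make the ordering guarantee concrete, here is a minimal standalone sketch of sequential dispatch. It is not the SDK's event loop; dispatchSequentially and demo are hypothetical names used only to show that a slow handler delays the next event instead of overlapping with it.

```typescript
type Handler<E> = (event: E) => unknown | Promise<unknown>;

// Awaits each handler call before moving on to the next event.
async function dispatchSequentially<E>(
  events: Iterable<E>,
  handler: Handler<E>,
): Promise<void> {
  for (const event of events) {
    await handler(event);
  }
}

// Each "event" is a delay in milliseconds; the handler records when it
// starts and ends so the resulting order shows there is no overlap.
async function demo(): Promise<string[]> {
  const order: string[] = [];
  await dispatchSequentially([30, 10, 20], async (delayMs) => {
    order.push(`start ${delayMs}`);
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    order.push(`end ${delayMs}`);
  });
  return order;
}
```

Even though the first handler takes the longest, every start entry is immediately followed by its matching end entry.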

Exception handling

Throwing inside a callback does not behave like a normal Promise rejection. The SDK’s event loop wraps dispatch in a try/catch and routes any throw to the SSE reconnect handler (packages/arkor/src/core/trainer.ts:335-364, then handleFailure at :307-320):

If abortSignal.aborted is set, the error re-throws and wait() rejects.
Otherwise, if maxReconnectAttempts was configured and the counter is exceeded, wait() rejects with a wrapping error.
Otherwise, the SDK delays and reopens the SSE stream.

maxReconnectAttempts defaults to undefined (unlimited). It is not configurable through TrainerInput; the only way to set it is the second context argument to createTrainer, which is annotated @internal and may change without notice. In practice, with default settings, an exception thrown from a callback is caught and the stream is reopened, possibly indefinitely. If Last-Event-ID advances across the retry, the originally failing event is also skipped. For deterministic error handling, catch inside the callback (see the second example below).
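The three rules above can be restated as a small decision function. This is a paraphrase of the documented behavior, not the SDK source: decideOnFailure, FailureState, and the action names are hypothetical, and reading "exceeded" as attempts > maxReconnectAttempts is an assumption.

```typescript
type FailureAction = "rethrow" | "reject-wrapped" | "reconnect";

interface FailureState {
  aborted: boolean;              // abortSignal.aborted at the time of the failure
  attempts: number;              // reconnect attempts counted so far
  maxReconnectAttempts?: number; // undefined means unlimited
}

function decideOnFailure(state: FailureState): FailureAction {
  // Rule 1: an aborted signal re-throws, so wait() rejects.
  if (state.aborted) return "rethrow";
  // Rule 2: a configured limit that has been exceeded rejects with a wrapping error.
  // Assumption: "exceeded" means strictly greater than the limit.
  if (
    state.maxReconnectAttempts !== undefined &&
    state.attempts > state.maxReconnectAttempts
  ) {
    return "reject-wrapped";
  }
  // Rule 3: otherwise delay and reopen the SSE stream.
  return "reconnect";
}
```

With the default of undefined, the second branch is never taken, which is why a throwing callback can keep the stream reconnecting indefinitely.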

Examples

Minimal: log every event.

const trainer = createTrainer({
  name: "support-bot-v1",
  model: "unsloth/gemma-4-E4B-it",
  dataset: { type: "huggingface", name: "arkorlab/triage-demo" },
  callbacks: {
    onStarted: ({ job }) => console.log(`run ${job.id} started`),
    onLog: ({ step, loss }) => {
      if (loss !== null) console.log(`step=${step} loss=${loss.toFixed(4)}`);
    },
    onCheckpoint: async ({ step, infer }) => {
      const res = await infer({
        messages: [{ role: "user", content: "Hello" }],
      });
      console.log(`ckpt @ ${step}:`, await res.text());
    },
    onCompleted: ({ job }) => console.log(`run ${job.id} done`),
    onFailed: ({ error }) => console.error(`failed: ${error}`),
  },
});

await trainer.start();
await trainer.wait();

Catch inside a callback to keep failures local instead of letting them trigger an SSE reconnect:

onCheckpoint: async ({ step, infer }) => {
  try {
    await sendToReview({ step, sample: await (await infer({ ... })).text() });
  } catch (err) {
    // log / metric / decide whether to fail the run yourself by calling
    // trainer.cancel() from outside the callback
  }
}
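If several callbacks need the same protection, the try/catch above can be factored into a wrapper. The sketch below is one way to do that; safeCallback and its onError parameter are hypothetical names, not SDK exports.

```typescript
type Callback<C> = (ctx: C) => unknown | Promise<unknown>;

// Wraps a callback so that anything it throws (synchronously or via a
// rejected promise) is handed to onError instead of escaping to the caller.
function safeCallback<C>(
  name: string,
  callback: Callback<C>,
  onError: (name: string, err: unknown) => void = (n, e) =>
    console.error(`[${n}] callback failed:`, e),
): Callback<C> {
  return async (ctx: C) => {
    try {
      await callback(ctx);
    } catch (err) {
      // If onError itself throws, that error still escapes the wrapper.
      onError(name, err);
    }
  };
}
```

You would then write, for example, onCheckpoint: safeCallback("onCheckpoint", async ({ step, infer }) => { ... }). Keep onError simple, since nothing guards it.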

Type definitions

interface TrainingLogContext {
  step: number;
  loss: number | null;
  evalLoss: number | null;
  learningRate: number | null;
  epoch: number | null;
  samplesPerSecond: number | null;
  job: TrainingJob;
}

interface CheckpointContext {
  step: number;
  adapter: { kind: "checkpoint"; jobId: string; step: number };
  job: TrainingJob;
  infer: (args: InferArgs) => Promise<Response>;
  artifacts?: unknown[];
}

TrainingLogContext and CheckpointContext are not exported by name from arkor; mirror the shapes inline if you want typed callback parameters in your own code.
