Lifecycle callbacks
Pass callbacks under createTrainer({ callbacks: { ... } }). They run inside trainer.wait(), dispatched from the backend’s SSE event stream.
Signature
The callbacks field on createTrainer is typed Partial<TrainerCallbacks>, so you only specify the events you actually care about. Return values from a callback are discarded; the unknown | Promise<unknown> return type just means TypeScript will not complain if you return something.
When each callback fires
If you call start() without wait(), no callbacks ever run. arkor start calls both for you; programmatic callers must do the same.
Parameters
onStarted({ job })
Fires when the SSE stream reports training.started. Use it for log lines or a “training started” notification.
onLog({ step, loss, evalLoss, learningRate, epoch, samplesPerSecond, job })
Fires repeatedly as training progresses. Each numeric field is number | null: backends only fill in fields they have on a given step (so evalLoss is null on non-eval steps, learningRate may be null between LR-scheduler updates, etc.).
abortSignal only stops your local wait(); call trainer.cancel() afterwards to actually stop the GPU on the backend.
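Because any numeric field on a log event can be null, a handler should check each field before using it. The following is a minimal sketch of one way to do that, assuming a hand-written LogFields type that mirrors the numeric fields listed above (job is omitted). LogFields and formatLogLine are hypothetical names for illustration, not part of the SDK.

```typescript
// Hypothetical local mirror of the numeric onLog fields described above.
// Each field is number | null, following the description on this page.
type LogFields = {
  step: number | null;
  loss: number | null;
  evalLoss: number | null;
  learningRate: number | null;
  epoch: number | null;
  samplesPerSecond: number | null;
};

// Fixed output order, so log lines stay comparable across steps.
const FIELD_ORDER = [
  "step",
  "loss",
  "evalLoss",
  "learningRate",
  "epoch",
  "samplesPerSecond",
] as const;

// Builds a log line from only the fields the backend filled in on this step.
function formatLogLine(fields: LogFields): string {
  const parts: string[] = [];
  for (const name of FIELD_ORDER) {
    const value = fields[name];
    if (value !== null) {
      parts.push(`${name}=${value}`);
    }
  }
  return parts.join(" ");
}
```

Inside onLog, something like console.log(formatLogLine(ctx)) would then skip evalLoss on non-eval steps instead of printing "null".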
onCheckpoint({ step, adapter, job, infer, artifacts })
Fires when an adapter checkpoint is saved on the backend, while the run is still going. adapter is { kind: "checkpoint", jobId, step }. infer is described in detail on the infer page; in short it takes a chat-style request and returns a raw Response.
onCompleted({ job, artifacts })
Fires once on success. artifacts is unknown[]: the raw artifact list the backend sent. Schemas evolve, so the SDK does not narrow it.
onFailed({ job, error })
Fires once on a backend-reported failure. error is a string (the message the backend sent), not an Error instance.
onFailed is only for backend-side failures. Exceptions thrown inside your other callbacks do not reach onFailed; see Exception handling below for what does happen to them.
Behavior
Sequencing
Each callback is awaited before the next event is dispatched. You can return a promise (writing to a database, posting to Slack, calling infer) and the SDK will wait for it before processing the next frame. There are no concurrent callback invocations for the same trainer.
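The ordering guarantee can be illustrated with a small stand-in for the dispatch loop. This is a sketch of the behavior described above, not the SDK's actual code: dispatchSequentially and the string events are hypothetical, and the asynchronous work is simulated with a resolved promise.

```typescript
type Handler = (event: string) => unknown | Promise<unknown>;

// Hypothetical stand-in for the SDK's dispatch loop (not the real implementation).
async function dispatchSequentially(
  events: string[],
  handler: Handler,
): Promise<void> {
  for (const event of events) {
    // The next event is not dispatched until this handler has settled.
    await handler(event);
  }
}

const order: string[] = [];

const done = dispatchSequentially(["a", "b"], async (event) => {
  order.push(`start:${event}`);
  await Promise.resolve(); // simulated async work (database write, Slack post, ...)
  order.push(`end:${event}`);
});
```

Because each handler is awaited, the handler for "b" does not start until the handler for "a" has finished. The practical consequence is that a slow callback delays every later event for that trainer.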
Exception handling
Throwing inside a callback does not behave like a normal Promise rejection. The SDK’s event loop wraps dispatch in a try/catch and routes any throw to the SSE reconnect handler (packages/arkor/src/core/trainer.ts:335-364, then handleFailure at :307-320):
- If abortSignal.aborted is set, the error re-throws and wait() rejects.
- Otherwise, if maxReconnectAttempts was configured and the counter is exceeded, wait() rejects with a wrapping error.
- Otherwise, the SDK delays and reopens the SSE stream.
maxReconnectAttempts defaults to undefined (unlimited). It is not configurable through TrainerInput; the only way to set it is the second context argument to createTrainer, which is annotated @internal and may change without notice. In practice, with default settings, a thrown callback is caught and retried, possibly indefinitely. If Last-Event-ID advances across the retry, the originally failing event is also skipped.
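The three branches above can be restated as a pure decision function. This is only a sketch of the documented behavior: FailureState, FailureAction, and decideOnFailure are hypothetical names, and treating "exceeded" as attempts > maxReconnectAttempts is an assumption, since the exact comparison is not stated here.

```typescript
// Hypothetical model of the state the failure handler consults.
type FailureState = {
  aborted: boolean; // mirrors abortSignal.aborted
  attempts: number; // reconnect attempts made so far
  maxReconnectAttempts?: number; // undefined means unlimited
};

type FailureAction = "rethrow" | "reject-wrapped" | "reconnect";

// Restates the documented branches; not the SDK's handleFailure itself.
function decideOnFailure(state: FailureState): FailureAction {
  if (state.aborted) {
    return "rethrow"; // wait() rejects with the original error
  }
  if (
    state.maxReconnectAttempts !== undefined &&
    state.attempts > state.maxReconnectAttempts // assumed comparison
  ) {
    return "reject-wrapped"; // wait() rejects with a wrapping error
  }
  return "reconnect"; // delay, then reopen the SSE stream
}
```

With the default of undefined, the second branch is never taken, which is why a callback that always throws can be retried without limit.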
For deterministic error handling, catch inside the callback (see the second example below).
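One way to catch inside every callback without repeating try/catch is a small wrapper. The sketch below assumes a hypothetical helper named guarded, which is not part of the SDK. It runs the wrapped callback, passes any thrown value to an error sink of your choice, and never lets the throw escape to the dispatch loop.

```typescript
type Callback<T> = (ctx: T) => unknown | Promise<unknown>;

// Hypothetical helper (not part of the SDK): wraps a callback so that a
// throw or rejection is reported to onError instead of propagating.
function guarded<T>(
  name: string,
  callback: Callback<T>,
  onError: (name: string, err: unknown) => void,
): (ctx: T) => Promise<void> {
  return async (ctx) => {
    try {
      await callback(ctx);
    } catch (err) {
      onError(name, err);
    }
  };
}

// Usage sketch: collect errors in memory instead of triggering a reconnect.
const errors: string[] = [];

const safeOnLog = guarded(
  "onLog",
  () => {
    throw new Error("boom");
  },
  (name, err) => {
    errors.push(`${name}: ${String(err)}`);
  },
);
```

A wrapped function such as safeOnLog could then be passed in the callbacks object in place of the raw handler. Whether a swallowed error should also stop the run is a separate decision; this wrapper only keeps the throw away from the reconnect path.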
Examples
Minimal: log every event.
Type definitions
TrainingLogContext and CheckpointContext are not exported by name from arkor; mirror the shapes inline if you want typed callback parameters in your own code.
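Mirroring a shape inline might look like the following sketch for the checkpoint context; the log context can be mirrored the same way. The field names come from the parameter list above, but the field types marked "assumed" are guesses and should be checked against the SDK's own declarations.

```typescript
// Hand-written mirror of the checkpoint shapes described on this page.
type CheckpointAdapter = {
  kind: "checkpoint";
  jobId: string; // assumed type
  step: number; // assumed type
};

type CheckpointContext = {
  step: number; // assumed type
  adapter: CheckpointAdapter;
  job: unknown; // job shape is not described on this page
  infer: (request: unknown) => Promise<unknown>; // assumed signature; see the infer page
  artifacts: unknown[]; // assumed to match onCompleted's artifacts
};

// A typed callback that records which checkpoints were seen.
const seenCheckpoints: string[] = [];

const onCheckpoint = (ctx: CheckpointContext): void => {
  seenCheckpoints.push(`${ctx.adapter.jobId}@${ctx.step}`);
};
```

Keeping job and the infer request as unknown avoids claiming more about those shapes than this page states; narrow them once you have confirmed the real types.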
See also
- createTrainer for where these callbacks attach
- Run lifecycle for the conceptual flow
- infer for the function exposed to onCheckpoint
- Trainer control for abortSignal and cancel()
- Early stopping recipe