<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Jobset</title><link>/</link><description>Recent content on Jobset</description><generator>Hugo</generator><language>en</language><atom:link href="/index.xml" rel="self" type="application/rss+xml"/><item><title>Example Workloads</title><link>/docs/tasks/workload_examples/</link><pubDate>Wed, 23 Jul 2025 00:00:00 +0000</pubDate><guid>/docs/tasks/workload_examples/</guid><description>&lt;h2 id="pytorch-example">PyTorch Example&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://github.com/kubernetes-sigs/jobset/tree/main/site/static/examples/pytorch/cnn-mnist/mnist.yaml">Distributed Training of a CNN on the MNIST dataset using PyTorch and JobSet&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Note&lt;/strong>: Machine learning container images can be quite large, so pulling them may take some time.&lt;/p>
&lt;h2 id="tensorflow-example">TensorFlow Example&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://github.com/kubernetes-sigs/jobset/tree/main/site/static/examples/tensorflow/mnist.yaml">Distributed Training of a Handwritten Digit Classifier on the MNIST dataset using TensorFlow and JobSet&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This example runs a job for a single epoch.
You can view the progress of your jobs via &lt;code>kubectl logs jobs/tensorflow-tensorflow-0&lt;/code>.&lt;/p></description></item><item><title>Simple Examples</title><link>/docs/tasks/simple_examples/</link><pubDate>Wed, 23 Jul 2025 00:00:00 +0000</pubDate><guid>/docs/tasks/simple_examples/</guid><description>&lt;p>Here we have some simple examples demonstrating core JobSet features.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://github.com/kubernetes-sigs/jobset/tree/main/site/static/examples/simple/success-policy.yaml">Success Policy&lt;/a> demonstrates how to use &lt;code>successPolicy&lt;/code>.
A success policy specifies when a JobSet is marked as successfully completed.
This example uses a success policy to mark the JobSet as successful if the worker replicated job completes.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/kubernetes-sigs/jobset/blob/main/site/static/examples/simple/exclusive-placement.yaml">Exclusive Job Placement&lt;/a>
demonstrates how to configure a JobSet to have a 1:1 mapping between each child Job and a particular topology domain, such as a datacenter rack or zone. This means that all the pods belonging to a child job will be colocated in the same topology domain, while pods from other jobs will not be allowed to run within this domain. This gives the child job exclusive access to compute resources in this domain.&lt;/p></description></item><item><title>Development Environment Setup</title><link>/docs/contribution_guidelines/development/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/contribution_guidelines/development/</guid><description>&lt;h2 id="dependencies">Dependencies&lt;/h2>
&lt;ul>
&lt;li>&lt;code>go&amp;gt;=1.24.0&lt;/code>&lt;/li>
&lt;li>&lt;code>make&lt;/code>&lt;/li>
&lt;li>&lt;code>kubectl&lt;/code>&lt;/li>
&lt;li>&lt;code>git&lt;/code>&lt;/li>
&lt;li>&lt;code>docker&lt;/code>&lt;/li>
&lt;li>A Kubernetes cluster running one of the three most recent Kubernetes minor versions
&lt;ul>
&lt;li>&lt;a href="https://kind.sigs.k8s.io/">kind&lt;/a> allows you to run a local Kubernetes cluster using Docker containers&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="building-and-deploying-jobset-from-source">Building and deploying JobSet from source&lt;/h2>
&lt;h3 id="building-the-image">Building the image&lt;/h3>
&lt;p>See &lt;a href="https://github.com/kubernetes-sigs/jobset/blob/main/Makefile">&lt;code>Makefile&lt;/code>&lt;/a> targets for more information.
In particular:&lt;/p>
&lt;ul>
&lt;li>&lt;code>make image-build&lt;/code>: Builds a JobSet image locally&lt;/li>
&lt;li>&lt;code>make image-push&lt;/code>: Builds a JobSet image locally AND pushes it to the registry&lt;/li>
&lt;/ul>
&lt;p>The &lt;code>make image-push&lt;/code> target attempts to push the built image to the public &lt;code>us-central1-docker.pkg.dev/k8s-staging-images/jobset&lt;/code>
registry with an image tag determined by &lt;code>git describe&lt;/code>. Set your own &lt;code>GIT_TAG&lt;/code> and &lt;code>IMAGE_REGISTRY&lt;/code> environment
variables to ensure that your latest changes are pushed to an image registry your cluster can access.&lt;/p></description></item><item><title>Failure Policy</title><link>/docs/tasks/failure_policy/</link><pubDate>Wed, 23 Jul 2025 00:00:00 +0000</pubDate><guid>/docs/tasks/failure_policy/</guid><description>&lt;p>JobSet provides a failure policy API to control how your workload behaves in response to child Job failures.&lt;/p>
&lt;p>The &lt;code>failurePolicy&lt;/code> is defined by a set of &lt;code>rules&lt;/code>. For any job failure, the rules are evaluated in order, and the first matching rule&amp;rsquo;s action is executed. If no rule matches, the default action is &lt;code>RestartJobSet&lt;/code>, which counts towards the &lt;code>maxRestarts&lt;/code> limit.&lt;/p>
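&lt;p>The first-match evaluation described above can be sketched in a few lines of Python. This is a simplified illustration, not the JobSet controller code: the &lt;code>Rule&lt;/code> class, its field names, the &lt;code>select_action&lt;/code> helper, and the failure reason strings are all hypothetical and chosen only for this example.&lt;/p>

```python
from dataclasses import dataclass, field

# Hypothetical, simplified model of a failure policy rule. The field names
# are illustrative only and are not the JobSet API schema.
@dataclass
class Rule:
    name: str
    action: str
    # Failure reasons this rule matches; an empty list matches any reason.
    on_reasons: list = field(default_factory=list)

# Action used when no rule matches, as described in the text above.
DEFAULT_ACTION = "RestartJobSet"

def select_action(rules, failure_reason):
    """Return the action of the first matching rule, else the default action."""
    for rule in rules:  # rules are evaluated in the order they are listed
        if not rule.on_reasons or failure_reason in rule.on_reasons:
            return rule.action
    return DEFAULT_ACTION

# Example rule list; the reason strings are assumptions for illustration.
rules = [
    Rule(name="fail-fast", action="FailJobSet", on_reasons=["PodFailurePolicy"]),
    Rule(name="retry-deadline", action="RestartJobSet", on_reasons=["DeadlineExceeded"]),
]

print(select_action(rules, "PodFailurePolicy"))
print(select_action(rules, "BackoffLimitExceeded"))
```

&lt;p>With these rules, a failure with the reason &lt;code>PodFailurePolicy&lt;/code> selects &lt;code>FailJobSet&lt;/code> from the first rule, while a reason that no rule lists falls through to the default &lt;code>RestartJobSet&lt;/code>.&lt;/p>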
&lt;h2 id="failure-policy-actions">Failure Policy Actions&lt;/h2>
&lt;h3 id="failjobset">&lt;code>FailJobSet&lt;/code>&lt;/h3>
&lt;p>This action immediately marks the entire JobSet as failed.&lt;/p></description></item><item><title>Volume Claim Policies</title><link>/docs/tasks/volume_claim_policies/</link><pubDate>Tue, 13 Jan 2026 00:00:00 +0000</pubDate><guid>/docs/tasks/volume_claim_policies/</guid><description>&lt;p>JobSet provides the VolumeClaimPolicies API to automatically create and manage shared
PersistentVolumeClaims (PVCs) across multiple ReplicatedJobs within a JobSet.
This enables stateful JobSets that require persistent storage for datasets, models, checkpoints, or
intermediate results.&lt;/p>
&lt;h2 id="basic-usage">Basic Usage&lt;/h2>
&lt;p>To use VolumeClaimPolicies, define them in the &lt;code>volumeClaimPolicies&lt;/code> field of your JobSet spec.
Each policy can contain one or more PVC templates.&lt;/p>
&lt;p>&lt;a href="https://github.com/kubernetes-sigs/jobset/blob/main/site/static/examples/volume-claim-policy/single-pvc.yaml">This example&lt;/a>
demonstrates creating shared PVCs with different retention policies:&lt;/p>
&lt;p>In this example:&lt;/p></description></item><item><title>Prometheus Metrics</title><link>/docs/reference/metrics/</link><pubDate>Mon, 14 Feb 2022 00:00:00 +0000</pubDate><guid>/docs/reference/metrics/</guid><description>&lt;h2 id="prometheus-metrics">Prometheus Metrics&lt;/h2>
&lt;p>JobSet exposes &lt;a href="https://prometheus.io">Prometheus&lt;/a> metrics to monitor the health
of the controller.&lt;/p>
&lt;h2 id="installation-examples">Installation Examples&lt;/h2>
&lt;p>The following &lt;a href="https://github.com/kubernetes-sigs/jobset/tree/main/site/static/examples/prometheus-operator">example&lt;/a> shows how to install the Prometheus Operator for the JobSet system.&lt;/p>
&lt;h2 id="jobset-controller-health">JobSet controller health&lt;/h2>
&lt;p>Use the following metrics to monitor the health of the jobset controller:&lt;/p>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Metric name&lt;/th>
 &lt;th>Type&lt;/th>
 &lt;th>Description&lt;/th>
 &lt;th>Labels&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;code>controller_runtime_reconcile_errors_total&lt;/code>&lt;/td>
 &lt;td>Counter&lt;/td>
 &lt;td>The total number of reconciliation errors encountered by each controller.&lt;/td>
 &lt;td>&lt;code>controller&lt;/code>: name of the controller (use the value &lt;code>jobset&lt;/code> to obtain metrics for the JobSet controller)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;code>controller_runtime_reconcile_time_seconds&lt;/code>&lt;/td>
 &lt;td>Histogram&lt;/td>
 &lt;td>The latency of a reconciliation attempt in seconds.&lt;/td>
 &lt;td>&lt;code>controller&lt;/code>: name of the controller (use the value &lt;code>jobset&lt;/code> to obtain metrics for the JobSet controller)&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;h2 id="jobset-metrics">JobSet metrics&lt;/h2>
&lt;p>Use the following metrics to monitor the health of the jobsets created by the jobset controller:&lt;/p></description></item><item><title>JobSet API</title><link>/docs/reference/jobset.v1alpha2/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/docs/reference/jobset.v1alpha2/</guid><description>&lt;h2 id="resource-types">Resource Types&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="/docs/reference/jobset.v1alpha2/#jobset-x-k8s-io-v1alpha2-JobSet">JobSet&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="jobset-x-k8s-io-v1alpha2-JobSet">&lt;code>JobSet&lt;/code> &lt;/h2>
&lt;p>JobSet is the Schema for the jobsets API&lt;/p>
&lt;table class="table">
&lt;thead>&lt;tr>&lt;th width="30%">Field&lt;/th>&lt;th>Description&lt;/th>&lt;/tr>&lt;/thead>
&lt;tbody>
&lt;tr>&lt;td>&lt;code>apiVersion&lt;/code>&lt;br/>string&lt;/td>&lt;td>&lt;code>jobset.x-k8s.io/v1alpha2&lt;/code>&lt;/td>&lt;/tr>
&lt;tr>&lt;td>&lt;code>kind&lt;/code>&lt;br/>string&lt;/td>&lt;td>&lt;code>JobSet&lt;/code>&lt;/td>&lt;/tr>
&lt;tr>&lt;td>&lt;code>spec&lt;/code> &lt;B>[Required]&lt;/B>&lt;br/>
&lt;a href="#jobset-x-k8s-io-v1alpha2-JobSetSpec">&lt;code>JobSetSpec&lt;/code>&lt;/a>
&lt;/td>
&lt;td>
 &lt;p>spec is the specification for jobset&lt;/p>
&lt;/td>
&lt;/tr>
&lt;tr>&lt;td>&lt;code>status&lt;/code> &lt;B>[Required]&lt;/B>&lt;br/>
&lt;a href="#jobset-x-k8s-io-v1alpha2-JobSetStatus">&lt;code>JobSetStatus&lt;/code>&lt;/a>
&lt;/td>
&lt;td>
 &lt;p>status is the status of the jobset&lt;/p>
&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="jobset-x-k8s-io-v1alpha2-Coordinator">&lt;code>Coordinator&lt;/code> &lt;/h2>
&lt;p>&lt;strong>Appears in:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;a href="/docs/reference/jobset.v1alpha2/#jobset-x-k8s-io-v1alpha2-JobSetSpec">JobSetSpec&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Coordinator defines which pod can be marked as the coordinator for the JobSet workload.&lt;/p>
&lt;table class="table">
&lt;thead>&lt;tr>&lt;th width="30%">Field&lt;/th>&lt;th>Description&lt;/th>&lt;/tr>&lt;/thead>
&lt;tbody>
&lt;tr>&lt;td>&lt;code>replicatedJob&lt;/code> &lt;B>[Required]&lt;/B>&lt;br/>
&lt;code>string&lt;/code>
&lt;/td>
&lt;td>
 &lt;p>replicatedJob is the name of the ReplicatedJob which contains
the coordinator pod.&lt;/p></description></item><item><title>Search Results</title><link>/search/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/search/</guid><description/></item></channel></rss>