<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
<!-- generated by https://github.com/cabo/kramdown-rfc version 1.7.31 (Ruby 3.2.3) -->
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" ipr="trust200902" docName="draft-li-cats-intellinode-network-scheduling-00" category="info" consensus="true" submissionType="IETF" tocInclude="true" sortRefs="true" symRefs="true" version="3">
  <!-- xml2rfc v2v3 conversion 3.31.0 -->
  <front>
    <title abbrev="IntelliNode">IntelliNode: In-Network Intelligent Scheduling Extensions for CATS</title>
    <seriesInfo name="Internet-Draft" value="draft-li-cats-intellinode-network-scheduling-00"/>
    <author fullname="Qing Li">
      <organization>Pengcheng Laboratory</organization>
      <address>
        <email>liq@pcl.ac.cn</email>
      </address>
    </author>
    <author fullname="Teng Gao">
      <organization>Pengcheng Laboratory</organization>
      <address>
        <email>gaot@pcl.ac.cn</email>
      </address>
    </author>
    <author fullname="Yong Jiang">
      <organization>Tsinghua Shenzhen International Graduate School &amp; Pengcheng Laboratory</organization>
      <address>
        <email>jiangy@sz.tsinghua.edu.cn</email>
      </address>
    </author>
    <date year="2026" month="March" day="01"/>
    <area>Routing</area>
    <workgroup>Computing-Aware Traffic Steering</workgroup>
    <keyword>AI Networks</keyword>
    <keyword>In-Network Scheduling</keyword>
    <keyword>Tensor</keyword>
    <keyword>RoCEv2</keyword>
    <abstract>
      <?line 46?>

<t>This document introduces IntelliNode, an in-network intelligent scheduling mechanism built on the Computing-Aware Traffic Steering (CATS) framework. Modern large-scale AI training and inference rely heavily on distributed heterogeneous clusters (GPU/CPU/FPGA). However, existing networks are unaware of tensor semantics, training phases, and heterogeneous computing capabilities, which leads to high communication latency, low resource utilization, and pipeline stalls.</t>
      <t>IntelliNode moves away from traditional passive scheduling, which relies on probes and controllers. By bypassing the traditional probe-and-controller path and integrating FPGAs alongside programmable switch ASICs, it builds a fast closed loop of "Perception-Inference-Decision-Execution" in the data plane. The architecture extracts features at line rate, uses lightweight prediction models to infer short-term network behavior, and drives real-time heuristic scheduling decisions (e.g., path selection, tensor slicing, and compute matching). This document defines the four core functional layers and the extension signaling that support this architecture, laying the foundation for AI-native, scalable distributed computing networks.</t>
    </abstract>
  </front>
  <middle>
    <?line 52?>

<section anchor="introduction">
      <name>Introduction</name>
      <t>The CATS framework primarily addresses service instance selection and computing-aware traffic steering in general distributed systems. However, in large-scale AI training, distributed inference, and heterogeneous computing clusters, AI workloads exhibit traffic dynamics on the order of microseconds to milliseconds and carry highly diverse tensor types (e.g., gradients, activations, parameters).</t>
      <t>Traditional CATS models, which combine control-plane decisions with service-level steering, are inadequate for these workloads. IntelliNode proposes an extended architecture embedded in the data plane. It natively processes RoCEv2 protocol semantics and transforms the network from a "passive data pipe" into an "active, computing-aware collaborative engine".</t>
    </section>
    <section anchor="problem-statement">
      <name>Problem Statement</name>
      <t>Applying existing network scheduling mechanisms to AI training and heterogeneous computing networks reveals the following fundamental limitations:</t>
      <ul spacing="normal">
        <li>
          <t>Tensor Semantic Blind Spot: Existing mechanisms cannot distinguish specific semantics within data streams, such as gradients, activations, or parameter updates.</t>
        </li>
        <li>
          <t>Lag in End-to-End Feedback: Mechanisms such as ECN provide only passive, after-the-fact feedback rather than active scheduling, and they assume that rate reduction is always the correct response. This ignores the computing semantics of AI inference, where certain flows cannot slow down and must instead wait or degrade precision.</t>
        </li>
        <li>
          <t>Excessive Control-Plane Decision Latency: Control-plane routing updates take hundreds of milliseconds to seconds, so they cannot react to transient congestion or iterative bursts that occur within a 1-5 ms window.</t>
        </li>
        <li>
          <t>Conflict between Homogeneity Assumptions and Heterogeneous Reality: In cross-domain computing networks, node capabilities are highly uneven (e.g., GPU/FPGA hybrids). The network needs global state awareness to match computing power to communication workloads accurately, rather than assuming that all compute nodes are homogeneous.</t>
        </li>
      </ul>
    </section>
    <section anchor="architecture">
      <name>Architecture</name>
      <t>The IntelliNode architecture consists of four tightly coordinated functional layers that map onto the CATS abstractions for information collection, decision engine, and steering plane. The architecture combines the capabilities of programmable switches, FPGAs, and CPUs at the local node, enabling a microsecond-level closed loop without interrupting the packet forwarding path.</t>
      <section anchor="feature-extraction-layer-switch-asic">
        <name>Feature Extraction Layer (Switch ASIC)</name>
        <t>Deployed on Tofino-class programmable switch ASICs, this layer actively participates in RoCEv2 traffic management. It maintains a high-performance Queue Pair (QP) flow state machine. The switch collects and parses the following real-time features at line rate:</t>
        <ul spacing="normal">
          <li>
            <t>Basic Network Features: Ingress port, transmission rate, flow size, queue depth, and link utilization.</t>
          </li>
          <li>
            <t>AI Semantic Features: Tensor type (gradient / activation / normal traffic), tensor position within a batch/iteration, the stage of the model-parallel pipeline, and whether it is cross-node gradient-sync traffic.</t>
          </li>
          <li>
            <t>Flow State Classification: The hardware identifies the flow's current state as UNALLOCATED, SMALL_FLOW (delay-sensitive/control), LARGE_FLOW (high-bandwidth parameter synchronization), or DRAINING (tail-end flushing).</t>
          </li>
        </ul>
        <t>These features are extracted, normalized, and encoded at line rate, then written into a high-speed feature FIFO that feeds the onboard FPGA directly. In parallel, the pipeline updates and validates checksums in real time for mutable fields in RoCEv2 packets (such as ECN and TTL markings) so that modified packets remain protocol-conformant.</t>
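        <t>As a non-normative illustration, the flow state classification and feature encoding described in this section could be sketched in Python as follows. The byte threshold, the record layout, and the names classify, encode_record, and LARGE_FLOW_BYTES are assumptions made for this sketch only; this document does not define them, and a real implementation would run in the switch pipeline rather than in software.</t>
        <sourcecode type="python"><![CDATA[
import struct
from enum import IntEnum


class FlowState(IntEnum):
    UNALLOCATED = 0
    SMALL_FLOW = 1
    LARGE_FLOW = 2
    DRAINING = 3


# Hypothetical threshold; this document does not define one.
LARGE_FLOW_BYTES = 1 << 20


def classify(bytes_seen, last_packet):
    """Return the flow state for one Queue Pair (QP)."""
    if bytes_seen == 0:
        return FlowState.UNALLOCATED
    if bytes_seen < LARGE_FLOW_BYTES:
        return FlowState.SMALL_FLOW
    # A large flow that has sent its final packet is draining.
    if last_packet:
        return FlowState.DRAINING
    return FlowState.LARGE_FLOW


# Hypothetical record layout, network byte order, 10 bytes:
# ingress port (u16), flow state (u8), tensor type (u8),
# queue depth (u32), link utilization scaled to 0..65535 (u16).
RECORD = struct.Struct("!HBBIH")


def encode_record(port, state, tensor_type, depth, link_util):
    """Clamp utilization to [0, 1], scale it, and pack a record."""
    util = round(max(0.0, min(1.0, link_util)) * 0xFFFF)
    return RECORD.pack(port, state, tensor_type, depth, util)
]]></sourcecode>
        <t>Under these assumptions, each record written to the feature FIFO has a fixed size, which keeps the FPGA-side parser simple; the actual field set and widths are left to the implementation.</t>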
      </section>
      <section anchor="state-prediction-layer-fpga">
        <name>State Prediction Layer (FPGA)</name>
        <t>The FPGA reads features from the feature FIFO and runs an ultra-low-latency, lightweight prediction model (e.g., State-GNN based on Graph Neural Networks, linear regression, or heuristic models). This layer predicts short-term network and load states for the next 1-5 milliseconds (ms):</t>
        <ul spacing="normal">
          <li>
            <t>Network State Prediction: Imminent congestion risks on switch ports and the available bandwidth of candidate routing paths in the next window.</t>
          </li>
          <li>
            <t>Computing Load Prediction: The arrival time of the next batch of periodic tensor traffic, and the probability of queuing backlogs or pipeline stalls at the downstream GPU.</t>
          </li>
        </ul>
        <t>These forward-looking prediction fields serve as the core input for the subsequent scheduling engine.</t>
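        <t>As a non-normative illustration of the simplest model family named above (linear regression), the following Python sketch extrapolates queue depth over a short horizon. The function names, the sample format, and the use of queue depth as the predicted quantity are assumptions made for this sketch; the document does not prescribe a particular model or input.</t>
        <sourcecode type="python"><![CDATA[
def predict_queue_depth(samples, horizon_us):
    """Extrapolate queue depth horizon_us past the last sample.

    samples is a list of (time_us, depth) pairs with at least
    two distinct timestamps. An ordinary least-squares line is
    fitted and evaluated at the prediction horizon.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_d = sum(d for _, d in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    cov = sum((t - mean_t) * (d - mean_d) for t, d in samples)
    slope = cov / var_t
    target_t = samples[-1][0] + horizon_us
    predicted = mean_d + slope * (target_t - mean_t)
    return max(0.0, predicted)  # queue depth cannot be negative


def congestion_risk(samples, horizon_us, capacity):
    """Return True if the predicted depth reaches capacity."""
    return predict_queue_depth(samples, horizon_us) >= capacity
]]></sourcecode>
        <t>For example, with samples taken every 100 microseconds and a horizon of 1000 microseconds, the sketch projects the current trend one millisecond ahead, which falls inside the 1-5 ms window targeted by this layer.</t>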
      </section>
      <section anchor="heuristic-scheduling-layer">
        <name>Heuristic Scheduling Layer</name>
        <t>The scheduling engine combines the currently extracted AI semantics with the predicted states output by the FPGA, approximating Pareto optimality across conflicting objectives (e.g., computing latency vs. communication overhead). The decision logic is based on:</t>
        <ul spacing="normal">
          <li>
            <t>Tensor type and structural priority.</t>
          </li>
          <li>
            <t>Operator dependency.</t>
          </li>
          <li>
            <t>Heterogeneous computing capabilities of target nodes.</t>
          </li>
          <li>
            <t>Network and congestion predictions for the next 1-5 ms.</t>
          </li>
        </ul>
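        <t>One possible reading of "approximating Pareto optimality" is a filter that discards dominated candidate actions before a final heuristic chooses among the remainder. The following non-normative Python sketch shows such a filter for two objectives. The function names, the action names, and the choice of objectives are assumptions made for illustration; the document does not specify how the trade-off is computed.</t>
        <sourcecode type="python"><![CDATA[
def dominates(a, b):
    """True if cost vector a is no worse than b in every
    objective and strictly better in at least one (lower wins)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates.

    candidates maps a hypothetical action name to a tuple
    (compute_latency_us, communication_overhead_us).
    """
    return {
        name: cost
        for name, cost in candidates.items()
        if not any(dominates(other, cost)
                   for other in candidates.values())
    }
]]></sourcecode>
        <t>The surviving candidates are mutually non-dominated; a further rule, such as tensor priority or operator dependency, would still be needed to select one of them.</t>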
        <t>The decision outputs (execution actions) include:</t>
        <ul spacing="normal">
          <li>
            <t>Path and Priority: Outputs the optimal path set. SMALL_FLOWs are prioritized based on arrival rate, while LARGE_FLOWs are dynamically allocated bandwidth based on target computing power using a Weighted Deficit Round Robin (WDRR) policy.</t>
          </li>
          <li>
            <t>Tensor Slicing: Determines if tensor slicing is necessary, defining the number of slices and the independent routing path for each.</t>
          </li>
          <li>
            <t>Multipath Aggregation: Decides whether to enable data-plane multipath aggregation.</t>
          </li>
          <li>
            <t>In-Network Offloading: Decides whether to offload specific operators (e.g., Sum/Reduce) to in-network FPGAs or edge nodes.</t>
          </li>
        </ul>
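        <t>The WDRR policy mentioned for LARGE_FLOWs can be illustrated with the following non-normative Python sketch of weighted deficit round robin. The function signature, the quantum, and the idea of passing weights as a plain mapping are assumptions made for this sketch; how weights are derived from target computing power is not specified by this document.</t>
        <sourcecode type="python"><![CDATA[
from collections import deque


def wdrr(queues, weights, quantum, rounds):
    """Serve per-flow queues with weighted deficit round robin.

    queues maps flow id -> deque of packet sizes in bytes;
    weights maps flow id -> positive integer weight.
    Returns the (flow id, packet size) pairs in service order.
    """
    deficit = {flow: 0 for flow in queues}
    served = []
    for _ in range(rounds):
        for flow, packets in queues.items():
            if not packets:
                deficit[flow] = 0  # idle flows keep no credit
                continue
            # Each visit adds credit in proportion to the weight.
            deficit[flow] += quantum * weights[flow]
            while packets and packets[0] <= deficit[flow]:
                size = packets.popleft()
                deficit[flow] -= size
                served.append((flow, size))
            if not packets:
                # Queue emptied: leftover credit is discarded.
                deficit[flow] = 0
    return served
]]></sourcecode>
        <t>With a quantum of 1000 bytes and weights of 2 and 1, a flow bound for the more capable target is served two 1000-byte packets for every one packet of the other flow.</t>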
      </section>
      <section anchor="steering-plane">
        <name>Steering Plane</name>
        <t>The output of the heuristic scheduling layer must be applied across the network data plane via a lightweight signaling mechanism (potentially as an extension to CATS-SR or CATS-Overlay):</t>
        <ul spacing="normal">
          <li>
            <t>Control Plane Interface: Triggers the local CPU to update the routing/forwarding tables, applying the latest policies to the next batch of traffic automatically.</t>
          </li>
          <li>
            <t>Data Plane Labels / TLVs: Pushes extended Metadata TLVs carrying tensor types, training phases, and compute resource requests into the packet header.</t>
          </li>
          <li>
            <t>Fragment Routing Encapsulation: Provides the encapsulation information required for traffic that is subject to tensor slicing.</t>
          </li>
        </ul>
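        <t>The metadata TLVs described above could be serialized as in the following non-normative Python sketch. The one-byte type and one-byte length header, the numeric type codes, and the function names are all assumptions made for illustration; this document does not define a wire format, and no code points have been assigned.</t>
        <sourcecode type="python"><![CDATA[
import struct

# Hypothetical type codes; no values have been assigned by IANA.
TENSOR_TYPE = 1
TRAINING_PHASE = 2
COMPUTE_CAPABILITY = 3

# Assumed header: 1-byte type followed by 1-byte length.
HEADER = struct.Struct("!BB")


def encode_tlv(tlv_type, value):
    """Serialize one TLV; value is at most 255 bytes."""
    if len(value) > 255:
        raise ValueError("value too long for 1-byte length")
    return HEADER.pack(tlv_type, len(value)) + value


def decode_tlvs(buf):
    """Parse concatenated TLVs into a list of (type, value)."""
    out, pos = [], 0
    while pos < len(buf):
        if pos + HEADER.size > len(buf):
            raise ValueError("truncated TLV header")
        tlv_type, length = HEADER.unpack_from(buf, pos)
        pos += HEADER.size
        if pos + length > len(buf):
            raise ValueError("truncated TLV value")
        out.append((tlv_type, buf[pos:pos + length]))
        pos += length
    return out
]]></sourcecode>
        <t>The decoder rejects truncated input instead of guessing, which matters because these fields influence scheduling decisions.</t>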
      </section>
    </section>
    <section anchor="security-considerations">
      <name>Security Considerations</name>
      <t>Given that IntelliNode introduces granular TLV fields for tensor semantics and active data-plane scheduling, the system MUST:</t>
      <ul spacing="normal">
        <li>
          <t>Provide integrity protection for TLV fields to prevent malicious nodes from tampering with tensor type values in order to preempt high-priority queues.</t>
        </li>
        <li>
          <t>Introduce encrypted control-plane channels for telemetry and configuration.</t>
        </li>
        <li>
          <t>Implement authentication to prevent unauthorized nodes from falsely broadcasting their Compute-Capability within the computing network.</t>
        </li>
      </ul>
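      <t>As a non-normative illustration of the integrity requirement, the following Python sketch appends a keyed tag to serialized TLV bytes and checks it on receipt. The choice of HMAC-SHA-256, the 16-byte truncation, and the function names are assumptions made for this sketch; this document does not select an algorithm, and key distribution is out of scope here.</t>
      <sourcecode type="python"><![CDATA[
import hashlib
import hmac

TAG_LEN = 16  # assumed truncation length in bytes


def protect(key, tlv_bytes):
    """Append a truncated HMAC-SHA-256 tag to the TLV bytes."""
    tag = hmac.new(key, tlv_bytes, hashlib.sha256).digest()
    return tlv_bytes + tag[:TAG_LEN]


def verify(key, protected):
    """Return the TLV bytes if the tag is valid, else None."""
    if len(protected) < TAG_LEN:
        return None
    tlv_bytes = protected[:-TAG_LEN]
    tag = protected[-TAG_LEN:]
    expected = hmac.new(key, tlv_bytes, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking tag prefixes.
    if hmac.compare_digest(tag, expected[:TAG_LEN]):
        return tlv_bytes
    return None
]]></sourcecode>
      <t>A node that rewrites a tensor type value without holding the key produces a tag mismatch, so the receiver can discard the metadata rather than act on it.</t>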
    </section>
    <section anchor="iana-considerations">
      <name>IANA Considerations</name>
      <t>This document requests that IANA allocate new TLV types for AI-native CATS deployments, including but not limited to:</t>
      <ul spacing="normal">
        <li>
          <t>TENSOR_TYPE TLV</t>
        </li>
        <li>
          <t>TRAINING_PHASE TLV</t>
        </li>
        <li>
          <t>COMPUTE_CAPABILITY TLV</t>
        </li>
        <li>
          <t>PATH_PREDICTION TLV</t>
        </li>
      </ul>
    </section>
  </middle>
  <back>








  </back>

</rfc>
