Condividi tramite


Esempi di applicazioni stateful

Questo articolo contiene esempi di codice per applicazioni con stato personalizzate. Databricks consiglia di usare metodi predefiniti con stato per operazioni comuni, ad esempio aggregazioni e join.

I modelli in questo articolo usano l'operatore transformWithState e le classi associate disponibili in anteprima pubblica in Databricks Runtime 16.2 e versioni successive. Vedi Crea un'applicazione personalizzata con stato.

Nota

Python usa l'operatore transformWithStateInPandas per fornire la stessa funzionalità. Gli esempi seguenti forniscono codice in Python e Scala.

Requisiti

L'operatore transformWithState e le API e le classi correlate hanno i requisiti seguenti:

  • Disponibile in Databricks Runtime 16.2 e versioni successive.
  • Il calcolo deve usare la modalità di accesso dedicata o la modalità di accesso senza isolamento.
  • È necessario usare il provider dell'archivio di stato RocksDB. Databricks consiglia di abilitare RocksDB come parte della configurazione di calcolo.

Nota

Per abilitare il provider dell'archivio stati RocksDB per la sessione corrente, eseguire quanto segue:

spark.conf.set("spark.sql.streaming.stateStore.providerClass", "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")

Dimensione a cambiamento lento (SCD) tipo 1

Il codice seguente è un esempio di implementazione del tipo SCD 1 usando transformWithState. Il tipo SCD 1 tiene traccia solo del valore più recente per un determinato campo.

Nota

È possibile usare tabelle di streaming e APPLY CHANGES INTO per implementare il tipo SCD 1 o tipo SCD 2 usando le tabelle basate su Delta Lake. Questo esempio implementa il tipo SCD 1 nell'archivio stati, che offre una latenza inferiore per le applicazioni quasi in tempo reale.

Pitone

# Import necessary libraries
import pandas as pd
from pyspark.sql.streaming import StatefulProcessor, StatefulProcessorHandle
from pyspark.sql.types import StructType, StructField, LongType, StringType
from typing import Iterator

# Set the state store provider to RocksDB
spark.conf.set("spark.sql.streaming.stateStore.providerClass", "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")

# Define the output schema for the streaming query
output_schema = StructType([
    StructField("user", StringType(), True),
    StructField("time", LongType(), True),
    StructField("location", StringType(), True)
])

# Define a custom StatefulProcessor for slowly changing dimension type 1 (SCD1) operations
class SCDType1StatefulProcessor(StatefulProcessor):
    def init(self, handle: StatefulProcessorHandle) -> None:
        # Define the schema for the state value
        value_state_schema = StructType([
            StructField("user", StringType(), True),
            StructField("time", LongType(), True),
            StructField("location", StringType(), True)
        ])
        # Initialize the state to store the latest location for each user
        self.latest_location = handle.getValueState("latestLocation", value_state_schema)

    def handleInputRows(self, key, rows, timer_values) -> Iterator[pd.DataFrame]:
        # Find the row with the maximum time value
        max_row = None
        max_time = float('-inf')
        for pdf in rows:
            for _, pd_row in pdf.iterrows():
                time_value = pd_row["time"]
                if time_value > max_time:
                    max_time = time_value
                    max_row = tuple(pd_row)

        # Check if state exists and update if necessary
        exists = self.latest_location.exists()
        if not exists or max_row[1] > self.latest_location.get()[1]:
            # Update the state with the new max row
            self.latest_location.update(max_row)
            # Yield the updated row
            yield pd.DataFrame(
                {"user": (max_row[0],), "time": (max_row[1],), "location": (max_row[2],)}
            )
        # Yield an empty DataFrame if no update is needed
        yield pd.DataFrame()

    def close(self) -> None:
        # No cleanup needed
        pass

# Apply the stateful transformation to the input DataFrame
(df.groupBy("user")
  .transformWithStateInPandas(
      statefulProcessor=SCDType1StatefulProcessor(),
      outputStructType=output_schema,
      outputMode="Update",
      timeMode="None",
  )
  .writeStream...  # Continue with stream writing configuration
)

Scala

// Define a case class to represent user location data
case class UserLocation(
    user: String,
    time: Long,
    location: String)

// Define a stateful processor for slowly changing dimension type 1 (SCD1) operations
class SCDType1StatefulProcessor extends StatefulProcessor[String, UserLocation, UserLocation] {
  import org.apache.spark.sql.{Encoders}

  // Transient value state to store the latest location for each user
  @transient private var _latestLocation: ValueState[UserLocation] = _

  // Initialize the state store
  override def init(
      outputMode: OutputMode,
      timeMode: TimeMode): Unit = {
    // Create a value state named "locationState" using UserLocation encoder
    // TTLConfig.NONE means the state has no expiration
    _latestLocation = getHandle.getValueState[UserLocation]("locationState",
      Encoders.product[UserLocation], TTLConfig.NONE)
  }

  // Process input rows and update state
  override def handleInputRows(
      key: String,
      inputRows: Iterator[UserLocation],
      timerValues: TimerValues): Iterator[UserLocation] = {
    // Find the location with the maximum timestamp from input rows
    val maxNewLocation = inputRows.maxBy(_.time)

    // Update state and emit output if:
    // 1. No previous state exists, or
    // 2. New location has a more recent timestamp than the stored one
    if (_latestLocation.getOption().isEmpty || maxNewLocation.time > _latestLocation.get().time) {
      _latestLocation.update(maxNewLocation)
      Iterator.single(maxNewLocation)  // Emit the updated location
    } else {
      Iterator.empty  // No update needed, emit nothing
    }
  }
}
}

Dimensioni lentamente variabili (SCD) tipo 2

I notebook seguenti contengono un esempio di implementazione del tipo SCD 2 usando transformWithState in Python o Scala.

SCD Type 2 Python

Prendi notebook

Scala di tipo 2 SCD

Prendi notebook

Rilevatore di tempi di inattività

transformWithState implementa timer per consentire di intervenire in base al tempo trascorso, anche se non vengono elaborati record per una determinata chiave in un microbatch.

Nell'esempio seguente viene implementato un modello per un rilevatore di tempo di inattività. Ogni volta che viene visualizzato un nuovo valore per una determinata chiave, aggiorna il valore dello stato lastSeen, cancella eventuali timer esistenti e reimposta un timer per il futuro.

Quando un timer scade, l'applicazione genera il tempo trascorso dall'ultimo evento osservato per la chiave. Imposta quindi un nuovo timer per generare un aggiornamento di 10 secondi dopo.

Pitone

import datetime
import time

class DownTimeDetectorStatefulProcessor(StatefulProcessor):
    def init(self, handle: StatefulProcessorHandle) -> None:
        # Define schema for the state value (timestamp)
        state_schema = StructType([StructField("value", TimestampType(), True)])
        self.handle = handle
        # Initialize state to store the last seen timestamp for each key
        self.last_seen = handle.getValueState("last_seen", state_schema)

    def handleExpiredTimer(self, key, timerValues, expiredTimerInfo) -> Iterator[pd.DataFrame]:
        latest_from_existing = self.last_seen.get()
        # Calculate downtime duration
        downtime_duration = timerValues.getCurrentProcessingTimeInMs() - int(time.time() * 1000)
        # Register a new timer for 10 seconds in the future
        self.handle.registerTimer(timerValues.getCurrentProcessingTimeInMs() + 10000)
        # Yield a DataFrame with the key and downtime duration
        yield pd.DataFrame(
            {
                "id": key,
                "timeValues": str(downtime_duration),
            }
        )

    def handleInputRows(self, key, rows, timerValues) -> Iterator[pd.DataFrame]:
        # Find the row with the maximum timestamp
        max_row = max((tuple(pdf.iloc[0]) for pdf in rows), key=lambda row: row[1])

        # Get the latest timestamp from existing state or use epoch start if not exists
        if self.last_seen.exists():
            latest_from_existing = self.last_seen.get()
        else:
            latest_from_existing = datetime.fromtimestamp(0)

        # If new data is more recent than existing state
        if latest_from_existing < max_row[1]:
            # Delete all existing timers
            for timer in self.handle.listTimers():
                self.handle.deleteTimer(timer)
            # Update the last seen timestamp
            self.last_seen.update((max_row[1],))

        # Register a new timer for 5 seconds in the future
        self.handle.registerTimer(timerValues.getCurrentProcessingTimeInMs() + 5000)

        # Get current processing time in milliseconds
        timestamp_in_millis = str(timerValues.getCurrentProcessingTimeInMs())

        # Yield a DataFrame with the key and current timestamp
        yield pd.DataFrame({"id": key, "timeValues": timestamp_in_millis})

    def close(self) -> None:
        # No cleanup needed
        pass

Scala

import java.sql.Timestamp
import org.apache.spark.sql.Encoders

// The (String, Timestamp) schema represents an (id, time). We want to do downtime
// detection on every single unique sensor, where each sensor has a sensor ID.
class DowntimeDetector(duration: Duration) extends
  StatefulProcessor[String, (String, Timestamp), (String, Duration)] {

  @transient private var _lastSeen: ValueState[Timestamp] = _

  override def init(outputMode: OutputMode, timeMode: TimeMode): Unit = {
    _lastSeen = getHandle.getValueState[Timestamp]("lastSeen", Encoders.TIMESTAMP, TTLConfig.NONE)
  }

  // The logic here is as follows: find the largest timestamp seen so far. Set a timer for
  // the duration later.
  override def handleInputRows(
      key: String,
      inputRows: Iterator[(String, Timestamp)],
      timerValues: TimerValues): Iterator[(String, Duration)] = {
    val latestRecordFromNewRows = inputRows.maxBy(_._2.getTime)

    // Use getOrElse to initiate state variable if it doesn't exist
    val latestTimestampFromExistingRows = _lastSeen.getOption().getOrElse(new Timestamp(0))
    val latestTimestampFromNewRows = latestRecordFromNewRows._2

    if (latestTimestampFromNewRows.after(latestTimestampFromExistingRows)) {
      // Cancel the one existing timer, since we have a new latest timestamp.
      // We call "listTimers()" just because we don't know ahead of time what
      // the timestamp of the existing timer is.
      getHandle.listTimers().foreach(timer => getHandle.deleteTimer(timer))

      _lastSeen.update(latestTimestampFromNewRows)
      // Use timerValues to schedule a timer using processing time.
      getHandle.registerTimer(timerValues.getCurrentProcessingTimeInMs() + duration.toMillis)
    } else {
      // No new latest timestamp, so no need to update state or set a timer.
    }

    Iterator.empty
  }

  override def handleExpiredTimer(
    key: String,
    timerValues: TimerValues,
    expiredTimerInfo: ExpiredTimerInfo): Iterator[(String, Duration)] = {
      val latestTimestamp = _lastSeen.get()
      val downtimeDuration = new Duration(
        timerValues.getCurrentProcessingTimeInMs() - latestTimestamp.getTime)

      // Register another timer that will fire in 10 seconds.
      // Timers can be registered anywhere but init()
      getHandle.registerTimer(timerValues.getCurrentProcessingTimeInMs() + 10000)

      Iterator((key, downtimeDuration))
  }
}

Eseguire la migrazione delle informazioni sullo stato esistenti

Nell'esempio seguente viene illustrato come implementare un'applicazione con stato che accetta uno stato iniziale. È possibile aggiungere la gestione dello stato iniziale a qualsiasi applicazione con stato, ma lo stato iniziale può essere impostato solo quando si inizializza per la prima volta l'applicazione.

In questo esempio viene usato il lettore statestore per caricare le informazioni sullo stato esistenti da un percorso del checkpoint. Un caso d'uso di esempio per questo modello è la migrazione da applicazioni con stato legacy a transformWithState.

Pitone

# Import necessary libraries
import pandas as pd
from pyspark.sql.streaming import StatefulProcessor, StatefulProcessorHandle
from pyspark.sql.types import StructType, StructField, LongType, StringType, IntegerType
from typing import Iterator

# Set RocksDB as the state store provider for better performance
spark.conf.set("spark.sql.streaming.stateStore.providerClass", "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")

"""
Input schema is as below

input_schema = StructType(
    [StructField("id", StringType(), True)],
    [StructField("value", StringType(), True)]
)
"""

# Define the output schema for the streaming query
output_schema = StructType([
    StructField("id", StringType(), True),
    StructField("accumulated", StringType(), True)
])

class AccumulatedCounterStatefulProcessorWithInitialState(StatefulProcessor):

    def init(self, handle: StatefulProcessorHandle) -> None:
        # Define schema for the state value (integer)
        state_schema = StructType([StructField("value", IntegerType(), True)])
        # Initialize state to store the accumulated counter for each id
        self.counter_state = handle.getValueState("counter_state", state_schema)
        self.handle = handle

    def handleInputRows(self, key, rows, timerValues) -> Iterator[pd.DataFrame]:
        # Check if state exists for the current key
        exists = self.counter_state.exists()
        if exists:
            value_row = self.counter_state.get()
            existing_value = value_row[0]
        else:
            existing_value = 0

        accumulated_value = existing_value

        # Process input rows and accumulate values
        for pdf in rows:
            value = pdf["value"].astype(int).sum()
            accumulated_value += value

        # Update the state with the new accumulated value
        self.counter_state.update((accumulated_value,))

        # Yield a DataFrame with the key and accumulated value
        yield pd.DataFrame({"id": key, "accumulated": str(accumulated_value)})

    def handleInitialState(self, key, initialState, timerValues) -> None:
        # Initialize the state with the provided initial value
        init_val = initialState.at[0, "initVal"]
        self.counter_state.update((init_val,))

    def close(self) -> None:
        # No cleanup needed
        pass

# Load initial state from a checkpoint directory
initial_state = spark.read.format("statestore")
  .option("path", "$checkpointsDir")
  .load()

# Apply the stateful transformation to the input DataFrame
df.groupBy("id")
  .transformWithStateInPandas(
      statefulProcessor=AccumulatedCounterStatefulProcessorWithInitialState(),
      outputStructType=output_schema,
      outputMode="Update",
      timeMode="None",
      initialState=initial_state,
  )
  .writeStream...  # Continue with stream writing configuration

Scala

// Import necessary libraries
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.{Dataset, Encoder, Encoders , DataFrame}
import org.apache.spark.sql.types._

// Define a stateful processor that can handle initial state
class InitialStateStatefulProcessor extends StatefulProcessorWithInitialState[String, (String, String, String), (String, String), (String, Int)] {
  // Transient value state to store the accumulated value
  @transient protected var valueState: ValueState[Int] = _

  // Initialize the state store
  override def init(
      outputMode: OutputMode,
      timeMode: TimeMode): Unit = {
    // Create a value state named "valueState" using Int encoder
    // TTLConfig.NONE means the state has no automatic expiration
    valueState = getHandle.getValueState[Int]("valueState",
      Encoders.scalaInt, TTLConfig.NONE)
  }

  // Process input rows and update state
  override def handleInputRows(
      key: String,
      inputRows: Iterator[(String, String, String)],
      timerValues: TimerValues): Iterator[(String, String)] = {
    var existingValue = 0
    // Retrieve existing value from state if it exists
    if (valueState.exists()) {
      existingValue += valueState.get()
    }
    var accumulatedValue = existingValue
    // Accumulate values from input rows
    for (row <- inputRows) {
      accumulatedValue += row._2.toInt
    }
    // Update the state with the new accumulated value
    valueState.update(accumulatedValue)
    // Return the key and accumulated value as a string
    Iterator((key, accumulatedValue.toString))
  }

  // Handle initial state when provided
  override def handleInitialState(
      key: String, initialState: (String, Int), timerValues: TimerValues): Unit = {
    // Update the state with the initial value
    valueState.update(initialState._2)
  }
}

Eseguire la migrazione della tabella Delta nell'archivio di stato per l'inizializzazione

I seguenti notebook contengono un esempio di inizializzazione dei valori dell'archivio di stato da una tabella Delta tramite l'uso di transformWithState in Python o Scala.

Inizializzare lo stato con Delta Python

Prendi notebook

Inizializzare lo stato da Delta Scala

Ottenere notebook

Rilevamento delle sessioni

I notebook seguenti contengono un esempio di rilevamento delle sessioni usando transformWithState in Python o Scala.

Rilevamento delle sessioni python

Prendi quaderno

Monitoraggio delle sessioni Scala

Prendi notebook

Join di flusso personalizzato con transformWithState

Il codice seguente dimostra un join tra flussi personalizzato tra più stream utilizzando transformWithState. È possibile usare questo approccio anziché un operatore join predefinito per i motivi seguenti:

  • È necessario utilizzare la modalità di output di aggiornamento che non supporta i join tra flussi. Ciò è particolarmente utile per le applicazioni a bassa latenza.
  • Dovresti continuare a eseguire operazioni di join per le righe in arrivo in ritardo (dopo la scadenza della filigrana).
  • È necessario eseguire unioni tra flussi molti-a-molti.

Questo esempio fornisce all'utente il controllo completo sulla logica di scadenza dello stato, consentendo all'estensione del periodo di conservazione dinamico di gestire gli eventi non ordinati anche dopo la filigrana.

Pitone

# Import necessary libraries
import pandas as pd
from pyspark.sql.streaming import StatefulProcessor, StatefulProcessorHandle
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from typing import Iterator

# Define output schema for the joined data
output_schema = StructType([
    StructField("user_id", StringType(), True),
    StructField("event_type", StringType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("profile_name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("preferred_category", StringType(), True)
])

class CustomStreamJoinProcessor(StatefulProcessor):
    # Initialize stateful storage for user profiles, preferences, and event tracking.
    def init(self, handle: StatefulProcessorHandle) -> None:

        # Define schemas for different types of state data
        profile_schema = StructType([
            StructField("name", StringType(), True),
            StructField("email", StringType(), True),
            StructField("updated_at", TimestampType(), True)
        ])
        preferences_schema = StructType([
            StructField("preferred_category", StringType(), True),
            StructField("updated_at", TimestampType(), True)
        ])
        activity_schema = StructType([
            StructField("event_type", StringType(), True),
            StructField("timestamp", TimestampType(), True)
        ])

        # Initialize state storage for user profiles, preferences, and activity
        self.profile_state = handle.getMapState("user_profiles", "string", profile_schema)
        self.preferences_state = handle.getMapState("user_preferences", "string", preferences_schema)
        self.activity_state = handle.getMapState("user_activity", "string", activity_schema)

    # Process incoming events and update state
    def handleInputRows(self, key, rows: Iterator[pd.DataFrame], timer_values) -> Iterator[pd.DataFrame]:
        df = pd.concat(rows, ignore_index=True)
        output_rows = []

        for _, row in df.iterrows():
            user_id = row["user_id"]

            if "event_type" in row:  # User activity event
                self.activity_state.update_value(user_id, row.to_dict())
                # Set a timer to process this event after a 10-second delay
                self.getHandle().registerTimer(timer_values.get_current_processing_time_in_ms() + (10 * 1000))

            elif "name" in row:  # Profile update
                self.profile_state.update_value(user_id, row.to_dict())

            elif "preferred_category" in row:  # Preference update
                self.preferences_state.update_value(user_id, row.to_dict())

        # No immediate output; processing will happen when timer expires
        return iter([])

    # Perform lookup after delay, handling out-of-order and late-arriving events.
    def handleExpiredTimer(self, key, timer_values, expired_timer_info) -> Iterator[pd.DataFrame]:

        # Retrieve stored state for the user
        user_activity = self.activity_state.get_value(key)
        user_profile = self.profile_state.get_value(key)
        user_preferences = self.preferences_state.get_value(key)

        if user_activity:
            # Combine data from different states into a single output row
            output_row = {
                "user_id": key,
                "event_type": user_activity["event_type"],
                "timestamp": user_activity["timestamp"],
                "profile_name": user_profile.get("name") if user_profile else None,
                "email": user_profile.get("email") if user_profile else None,
                "preferred_category": user_preferences.get("preferred_category") if user_preferences else None
            }
            return iter([pd.DataFrame([output_row])])

        return iter([])

    def close(self) -> None:
        # No cleanup needed
        pass

# Apply transformWithState to the input DataFrame
(df.groupBy("user_id")
  .transformWithStateInPandas(
      statefulProcessor=CustomStreamJoinProcessor(),
      outputStructType=output_schema,
      outputMode="Append",
      timeMode="ProcessingTime"
  )
  .writeStream...  # Continue with stream writing configuration
)

Scala

// Import necessary libraries
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.types.TimestampType
import java.sql.Timestamp

// Define a case class for enriched user events, combining user activity with profile and preference data
case class EnrichedUserEvent(
    user_id: String,
    event_type: String,
    timestamp: Timestamp,
    profile_name: Option[String],
    email: Option[String],
    preferred_category: Option[String]
)

// Custom stateful processor for stream-stream join
class CustomStreamJoinProcessor extends StatefulProcessor[String, UserEvent, EnrichedUserEvent] {
  // Transient state variables to store user profiles, preferences, and activities
  @transient private var _profileState: MapState[String, UserProfile] = _
  @transient private var _preferencesState: MapState[String, UserPreferences] = _
  @transient private var _activityState: MapState[String, UserEvent] = _

  // Initialize state stores
  override def init(outputMode: OutputMode, timeMode: TimeMode): Unit = {
    _profileState = getHandle.getMapState[String, UserProfile]("profileState", Encoders.product[UserProfile], TTLConfig.NONE)
    _preferencesState = getHandle.getMapState[String, UserPreferences]("preferencesState", Encoders.product[UserPreferences], TTLConfig.NONE)
    _activityState = getHandle.getMapState[String, UserEvent]("activityState", Encoders.product[UserEvent], TTLConfig.NONE)
  }

  // Handle incoming user events
  override def handleInputRows(
      key: String,
      inputRows: Iterator[UserEvent],
      timerValues: TimerValues): Iterator[EnrichedUserEvent] = {

    inputRows.foreach { event =>
      if (event.event_type.nonEmpty) {
        // Update activity state and set a timer for 10 seconds in the future
        _activityState.update(key, event)
        getHandle.registerTimer(timerValues.getCurrentProcessingTimeInMs() + 10000)
      }
    }
    Iterator.empty
  }

  // Handle expired timers to produce enriched events
  override def handleExpiredTimer(
      key: String,
      timerValues: TimerValues,
      expiredTimerInfo: ExpiredTimerInfo): Iterator[EnrichedUserEvent] = {

    // Retrieve user data from state stores
    val userEvent = _activityState.getOption(key)
    val userProfile = _profileState.getOption(key)
    val userPreferences = _preferencesState.getOption(key)

    if (userEvent.isDefined) {
      // Create and return an enriched event if user activity exists
      val enrichedEvent = EnrichedUserEvent(
        user_id = key,
        event_type = userEvent.get.event_type,
        timestamp = userEvent.get.timestamp,
        profile_name = userProfile.map(_.name),
        email = userProfile.map(_.email),
        preferred_category = userPreferences.map(_.preferred_category)
      )
      Iterator.single(enrichedEvent)
    } else {
      Iterator.empty
    }
  }
}

// Apply the custom stateful processor to the input DataFrame
val enrichedStream = df
  .groupByKey(_.user_id)
  .transformWithState(
    new CustomStreamJoinProcessor(),
    TimeMode.ProcessingTime(),
    OutputMode.Append()
  )

// Write the enriched stream to Delta Lake
enrichedStream.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/mnt/delta/checkpoints")
  .start("/mnt/delta/enriched_events")

Top-K calcolo

Nell'esempio seguente viene usata una ListState con una coda di priorità per gestire e aggiornare gli elementi K principali in un flusso per ogni chiave di gruppo quasi in tempo reale.

Top-K Python

Prendi il notebook

Top-K Scala

Prendi notebook