System Design: Distributed Task Scheduler
What is a task scheduler?
First, what is a task?
A task is a piece of computational work that requires resources (CPU, RAM, storage, bandwidth).
In a system, many tasks compete for limited resources. A task scheduler is the component that mediates between tasks and resources, allocating resources to tasks so that both task-level and system-level goals are met.
Why do we need a task scheduler?
It lets us complete a large number of tasks with limited resources and gives users an uninterrupted execution experience.
Design
Requirements
- task-wise
- submit task
- remove task
- monitor task
- allocate resources
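The four requirements above can be read as a small API. What follows is only a minimal single-process sketch to make them concrete. The class and method names (`Scheduler`, `submit`, `remove`, `status`, `allocate`) and the CPU/RAM units are assumptions made for illustration, not part of the design itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class TaskState(Enum):
    PENDING = "pending"
    RUNNING = "running"


@dataclass
class Task:
    task_id: int
    cpu: int       # CPU cores requested (hypothetical unit)
    ram_mb: int    # memory requested in MB (hypothetical unit)
    state: TaskState = TaskState.PENDING


class Scheduler:
    """Toy in-memory scheduler; a real one would be a distributed service."""

    def __init__(self, free_cpu: int, free_ram_mb: int) -> None:
        self.free_cpu = free_cpu
        self.free_ram_mb = free_ram_mb
        self.tasks: Dict[int, Task] = {}
        self._next_id = 1

    def submit(self, cpu: int, ram_mb: int) -> int:
        """Submit a task and return its ID."""
        task = Task(self._next_id, cpu, ram_mb)
        self.tasks[task.task_id] = task
        self._next_id += 1
        return task.task_id

    def remove(self, task_id: int) -> bool:
        """Remove a task; a running task gives its resources back."""
        task = self.tasks.pop(task_id, None)
        if task is None:
            return False
        if task.state is TaskState.RUNNING:
            self.free_cpu += task.cpu
            self.free_ram_mb += task.ram_mb
        return True

    def status(self, task_id: int) -> TaskState:
        """Monitor a task by reading its current state."""
        return self.tasks[task_id].state

    def allocate(self, task_id: int) -> bool:
        """Give a pending task its resources if enough are free."""
        task = self.tasks[task_id]
        if task.state is not TaskState.PENDING:
            return False
        if task.cpu > self.free_cpu or task.ram_mb > self.free_ram_mb:
            return False
        self.free_cpu -= task.cpu
        self.free_ram_mb -= task.ram_mb
        task.state = TaskState.RUNNING
        return True
```

In this sketch a task that does not fit simply stays pending until another task is removed and frees capacity.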
Components
- clients: every system needs a client; in our case, the client initiates tasks.
- rate limiter: limits the number of tasks admitted, which is important for the reliability of our service.
- task submitter: admits a task once it has passed the rate limiter
  - there is a cluster manager to which each node reports its status
  - each node also updates the manager about the tasks it admits
- sequencer: assigns a unique ID to each newly created task
- DB: tasks and their metadata need to be stored in a distributed database.
  - metadata can be stored in a relational database
  - dependency relationships can be stored in a graph database
- batching and prioritization: once task information is stored in the database, tasks are prioritized based on their attributes.
  - the top K priority tasks are pushed into the distributed queue
- distributed queue: consists of a queue manager and a queue
  - the queue manager adds, updates, or deletes tasks in the queue. It keeps a task in the queue until the task succeeds; if the task fails, it becomes visible again.
- resource manager: knows which resources are free, pulls tasks from the distributed queue, and assigns resources to them.
  - it also keeps track of the execution status of each task and reports that status to the queue manager.
- monitoring service: checks the health of the resource manager and the resources.
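The queue manager behavior described above (a task stays in the queue until it succeeds, and becomes visible again if it fails) could be sketched roughly as follows. This is an in-memory illustration, not a distributed implementation. The names `VisibilityQueue`, `receive`, `ack`, and `nack` are hypothetical, and the visibility timeout that re-exposes a task when a worker never reports back is an added assumption.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class VisibilityQueue:
    """Tasks stay stored until acknowledged; failed tasks reappear."""

    def __init__(self, visibility_timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = visibility_timeout
        self._clock = clock
        # task_id -> (payload, invisible_until); insertion order is queue order
        self._items: "OrderedDict[int, Tuple[str, float]]" = OrderedDict()

    def add(self, task_id: int, payload: str) -> None:
        self._items[task_id] = (payload, 0.0)  # visible right away

    def receive(self) -> Optional[Tuple[int, str]]:
        """Hand out the first visible task and hide it for the timeout."""
        now = self._clock()
        for task_id, (payload, invisible_until) in self._items.items():
            if invisible_until <= now:
                self._items[task_id] = (payload, now + self._timeout)
                return task_id, payload
        return None

    def ack(self, task_id: int) -> bool:
        """Task succeeded: only now is it deleted from the queue."""
        return self._items.pop(task_id, None) is not None

    def nack(self, task_id: int) -> bool:
        """Task failed: make it visible again immediately."""
        if task_id not in self._items:
            return False
        payload, _ = self._items[task_id]
        self._items[task_id] = (payload, 0.0)
        return True
```

In this sketch the resource manager would call `receive` to pull work, then `ack` on success or `nack` on failure. If it crashes without doing either, the task reappears once the timeout passes.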
Considerations
Let's discuss the scheduling strategy, which determines which tasks are pushed into the queue.
We cannot rely on a single FIFO queue because it lacks flexibility; instead, we should categorize tasks and assign them to appropriate queues:
- tasks that cannot be delayed
- tasks that can be delayed
- tasks that need to be executed periodically
By using a delay tolerance parameter, we can postpone tasks with a longer delay tolerance in favor of urgent tasks during peak times.
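One way to picture the delay tolerance idea is to rank tasks by how much slack they have left and push only the K tasks with the least slack. This is a sketch under assumptions: the field names `submitted_at` and `delay_tolerance`, the helper `pick_top_k`, and the slack formula are illustrative choices, and periodic tasks are not modeled here.

```python
import heapq
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    task_id: int
    submitted_at: float     # submission time in seconds
    delay_tolerance: float  # seconds the task may wait; 0 means it cannot be delayed


def pick_top_k(tasks: List[Task], now: float, k: int) -> List[Task]:
    """Return the k tasks with the least remaining slack, most urgent first."""

    def slack(task: Task) -> float:
        # Time left before the task exceeds its tolerance (negative = overdue).
        return (task.submitted_at + task.delay_tolerance) - now

    return heapq.nsmallest(k, tasks, key=slack)


tasks = [
    Task(1, submitted_at=0.0, delay_tolerance=600.0),  # can wait a long time
    Task(2, submitted_at=50.0, delay_tolerance=0.0),   # cannot be delayed
    Task(3, submitted_at=10.0, delay_tolerance=120.0),
]
picked = pick_top_k(tasks, now=60.0, k=2)
print([t.task_id for t in picked])  # [2, 3]
```

During peak time only two slots are available in this example, so task 1, which tolerates a ten-minute delay, is postponed.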
To guard against malicious tasks, we can:
- use authentication and resource authorization
- consider code sandboxing technology
Evaluation
The most important property is availability; without it, the system cannot provide any service.
Availability
- the rate limiter protects the service from being overwhelmed by task submissions
- the task submitter is a distributed service
- the queue used to rank tasks is also distributed
- the monitoring service helps ensure the system stays available
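The design does not say how the rate limiter works. One common option is a token bucket, sketched below. The class name `TokenBucket` and the parameters `rate` and `capacity` are assumptions for illustration; a production limiter would also need to be shared across task submitter nodes.

```python
import time
from typing import Callable


class TokenBucket:
    """Admit a task only if a token is available; tokens refill over time."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self._clock = clock
        self._tokens = capacity   # start with a full bucket
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

The task submitter would call `allow()` for each incoming task and reject or defer the task when it returns `False`.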
Scalability
- the system is properly modularized, so each component can be scaled independently