Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      We've been developing a generic REST job server here at Ooyala and would like to outline its goals, architecture, and API so that we can get feedback from the community and hopefully contribute it back.

        Activity

        Evan Chan added a comment -

        Document explaining our current overall architecture / API / features for the job server.

        Henry Saputra added a comment - edited

        This is a good start, thanks for driving this.

        Some immediate questions:
        How would the API server communicate with the driver?
        Will it share the same context as the driver?

        Evan Chan added a comment -

        @Henry:

        > API server communicate with driver?

        I assume by "driver" you mean the SparkContext within which each job runs, right? This is created by the job server itself. You can think of the workflow like this (we can post one we've been working on to make things clearer; a rough code sketch follows the list):

        • User does a POST /jobs to initiate a job. Either it is an ad-hoc job (temporary context) or it runs in a pre-created context.
        • Job server finds or creates the context.
        • Job server loads the class for the job, which must implement a trait, and invokes a method, passing in the SparkContext instance.
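
        To make this concrete, here is a minimal sketch of that dispatch step in Scala, assuming a hypothetical SparkJob trait (matching the runJob signature quoted below) and an in-memory context registry; the names are illustrative, not the actual job server code:

        import com.typesafe.config.Config
        import org.apache.spark.SparkContext
        import scala.collection.mutable

        // Hypothetical trait; matches the runJob signature quoted below
        trait SparkJob {
          def runJob(sc: SparkContext, config: Config): Any
        }

        object JobDispatch {
          // Named, pre-created contexts kept alive by the job server
          private val contexts = mutable.Map.empty[String, SparkContext]

          // Sketch of handling POST /jobs: find or create the context,
          // load the job class, and invoke it with the SparkContext
          def handleJobPost(classPath: String, contextName: Option[String],
                            config: Config): Any = {
            val sc = contextName match {
              case Some(name) => contexts.getOrElseUpdate(name, new SparkContext("local[4]", name))
              case None       => new SparkContext("local[4]", "adhoc-" + System.currentTimeMillis)  // temporary, ad-hoc context
            }
            val job = Class.forName(classPath).newInstance().asInstanceOf[SparkJob]
            job.runJob(sc, config)
          }
        }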

        > Will it share the same context as the driver?

        Yes. So, all jobs passed to the job server should implement a trait, and the trait has a method like this:

        /**
         * This is the entry point for a Spark Job Server to execute Spark jobs.
         * This function should create or reuse RDDs and return the result at the end, which the
         * Job Server will cache or display.
         * @param sc a SparkContext for the job. May be reused across jobs.
         * @param config the Typesafe Config object passed into the job request
         * @return the job result
         */
        def runJob(sc: SparkContext, config: Config): Any

        The user can submit multiple jobs to the same context – for example, the first job can create a cached RDD, and the second one can query it.
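
        For illustration, a pair of jobs sharing a cached RDD through the same context might look like this, reusing the hypothetical SparkJob trait sketched above. The RddRegistry object is made up for this sketch; sharing works only because all jobs in one context run in the same JVM as the job server:

        import com.typesafe.config.Config
        import org.apache.spark.SparkContext
        import org.apache.spark.SparkContext._  // brings in pair-RDD operations
        import org.apache.spark.rdd.RDD

        // Hypothetical shared registry, visible to all jobs in the same JVM
        object RddRegistry {
          @volatile var wordCounts: Option[RDD[(String, Int)]] = None
        }

        // First job: build and cache the RDD
        class BuildCacheJob extends SparkJob {
          def runJob(sc: SparkContext, config: Config): Any = {
            val counts = sc.textFile(config.getString("input.path"))
              .flatMap(_.split("\\s+"))
              .map(word => (word, 1))
              .reduceByKey(_ + _)
              .cache()
            RddRegistry.wordCounts = Some(counts)
            counts.count()  // materialize the cache
          }
        }

        // Second job, same context: query the cached RDD without recomputing it
        class QueryCacheJob extends SparkJob {
          def runJob(sc: SparkContext, config: Config): Any =
            RddRegistry.wordCounts
              .map(_.lookup(config.getString("word")))
              .getOrElse(Seq.empty)
        }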

        Hope that answers your questions, and looking forward to more feedback.

        Nick Pentreath added a comment -

        This sounds great and is something that I quite urgently need. I'm currently designing something similar for my fairly simple use case, and am running into issues with Jetty versioning (I think) between the job server and Spark.

        When do you think this might be ready for release / use?

        Henry Saputra added a comment - - edited

        Hi Evan, by "driver" I mean the application that constructs the SparkContext and submits the request to the Spark master where the job will run.

        I assume we will not build a full RESTful API server but rather HTTP-based APIs. I am asking because true REST(ful) APIs require details such as links and media-type representations.

        What is the proposed backend implementation for routing requests? Akka remote actors (directly) or Spray.io routing?

        Thanks again for the effort.

        Evan Chan added a comment -

        @Nick: Thanks for the interest! We are hoping to submit this soon, in bits and pieces, with the first piece coming in about a week from now.
        It is usable now, but it depends on some internal libraries which we need to factor out (or open source).
        We use Spray and put it first in the classpath, and haven't had any issues.

        @Henry:
        Yeah, this server is RESTful in the sense that the routes refer to resources (contexts, jobs, jars) and we use the HTTP verbs (GET, POST, DELETE) as the actions.

        "What is the backend proposed implementation to route the request? Use Akka remote Actor (directly) or Spray.io routing?"

        Well, Spray handles the HTTP requests, and the job server internally routes them to actors that start and stop contexts and jobs.
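
        As a rough illustration of that layering (not the actual job server code; the routes and actor messages here are made up), a Spray route that delegates to a supervising actor could look like:

        import akka.actor.{Actor, ActorSystem, Props}
        import spray.routing.SimpleRoutingApp

        // Hypothetical messages for the internal actors
        case class StartJob(classPath: String)
        case class StopContext(name: String)

        class JobManager extends Actor {
          def receive = {
            case StartJob(cp)   => println("would find/create a context and run " + cp)
            case StopContext(n) => println("would stop context " + n)
          }
        }

        // Resource-oriented routes with HTTP verbs; Spray handles the HTTP
        // layer while the actual work is delegated to actors
        object JobServerSketch extends App with SimpleRoutingApp {
          implicit val system = ActorSystem("job-server-sketch")
          val manager = system.actorOf(Props[JobManager], "job-manager")

          startServer(interface = "localhost", port = 8090) {
            path("jobs") {
              post {
                parameter('classPath) { cp =>
                  manager ! StartJob(cp)
                  complete("job submitted\n")
                }
              }
            } ~
            path("contexts" / Segment) { name =>
              delete {
                manager ! StopContext(name)
                complete("context " + name + " stopped\n")
              }
            }
          }
        }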

        Josh Rosen added a comment -

        There's now an open pull request for this: https://github.com/apache/incubator-spark/pull/222

        Evan Chan added a comment - edited

        An update: we have put up the final job server here:
        https://github.com/ooyala/spark-jobserver

        The plan is to have a spark-contrib repo/github account and this would be one of the first projects.

        See SPARK-1283 for the ticket to track spark-contrib.


          People

          • Assignee: Unassigned
          • Reporter: Evan Chan
          • Votes: 2
          • Watchers: 15
