Apache Spark / SPARK-1100

saveAsTextFile shouldn't clobber by default

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.0.0
    • Component/s: Input/Output
    • Labels:
      None

      Description

      If I call rdd.saveAsTextFile with an existing directory, it will cheerfully and silently overwrite the files in there. This is bad enough if it means I've accidentally blown away the results of a job that might have taken minutes or hours to run. But it's worse if the second job happens to have fewer partitions than the first; in that case, my output directory now contains some "part" files from the earlier job and some "part" files from the later job. The only way to tell them apart is by timestamp.

      I wonder if Spark's saveAsTextFile shouldn't work more like Hadoop MapReduce, which insists that the output directory not exist before the job starts. Similarly, HDFS won't overwrite files by default. Perhaps there could be an optional argument for saveAsTextFile that indicates whether it should delete the existing directory before starting. (I can't see any time I'd want to allow writing to an existing directory with data already in it. Would the mix of output from different tasks ever be desirable?)

            Activity

            CodingCat Nan Zhu added a comment -

            Indeed, overwriting silently is risky.

            But some users have come to rely on this, e.g. jobs that run periodically to update the data in a directory...

            I think we'd better make it an option: if spark.overwrite = true (the default, matching current behaviour), overwrite the directory directly; if spark.overwrite = false, reject the write.

            I would like to work on it; can an admin assign it to me?

            rxin Reynold Xin added a comment -

            I added you, Nan Zhu, to the developer list so you can assign tickets to yourself in the future.

            CodingCat Nan Zhu added a comment -

            many thanks @Reynold Xin

            dcarroll@cloudera.com Diana Carroll added a comment -

            How about making it an option on the function call? That way it can be set by the developer for each individual write, rather than for a whole application or site. Also, maybe a warning or at least an info message indicating that files are being overwritten?

            I still think it's questionable to overwrite individual files in a directory rather than clear out the directory, even if, as you say, I want a job to "periodically update the data". If my first job's data has three partitions, it will save three files: part-00000, part-00001 and part-00002. If my second job has two partitions, it will overwrite part-00000 and part-00001 from the first job, but leave the third file, part-00002, hanging around. If I look in the directory, it's not easy to tell that the output data from the second job does NOT have three partitions/files; it only has two. If I build a report from all three files, or have a script that assembles them into a single file, two thirds of the data in the report would correctly be from the most recent job, and one third would incorrectly be from the older job. I'm having trouble imagining a use case where this would be desired behavior...

            Thanks for looking at this!
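The scenario above can be reproduced with plain file operations (a hypothetical simulation, no Spark involved; `fake_job` just mimics saveAsTextFile's current overwrite-in-place behaviour): a three-partition write followed by a two-partition write into the same directory leaves the old part-00002 behind.

```python
import os
import tempfile

def fake_job(out_dir, partitions):
    """Simulate saveAsTextFile's current behaviour: overwrite matching
    part files in place, never clear the directory first."""
    os.makedirs(out_dir, exist_ok=True)
    for i, data in enumerate(partitions):
        with open(os.path.join(out_dir, f"part-{i:05d}"), "w") as f:
            f.write(data)

out = os.path.join(tempfile.mkdtemp(), "out")
fake_job(out, ["first-a", "first-b", "first-c"])  # job 1: three partitions
fake_job(out, ["second-a", "second-b"])           # job 2: two partitions

print(sorted(os.listdir(out)))
# -> ['part-00000', 'part-00001', 'part-00002']
# part-00002 still holds data from the first job.
```

Nothing in the directory listing distinguishes the stale third file from the fresh output of the second job.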

            CodingCat Nan Zhu added a comment -

            Good point. So if the option is set to true, we clear out the directory first; if false, we reject the write.

            But I'm still concerned that, since Spark has been running for years, someone may occasionally be relying on this "feature"....

            Anyway, I would like to make a PR first (maybe this week) and then revise it according to feedback from others.

            CodingCat Nan Zhu added a comment -

            made a PR: https://github.com/apache/incubator-spark/pull/626
            CodingCat Nan Zhu added a comment -

            Hi Diana Carroll,

            Can you confirm the left-over old file problem happens on HDFS? mridulm said this should not happen.

            I reproduced it on the local file system and on S3, but I don't have an HDFS environment.

            See our discussion here: https://github.com/apache/incubator-spark/pull/626

            Best,

            Nan

            patrick Patrick Wendell added a comment -

            Yeah, I don't think we want Spark ever deleting HDFS directories. I'd propose doing something like Hadoop MR does, except maybe checking whether the directory already has output files in it rather than whether it exists. This would avoid regressing behavior for users that, e.g., output data into a directory that contains other stuff.


              People

              • Assignee:
                patrick Patrick Wendell
                Reporter:
                dcarroll@cloudera.com Diana Carroll
              • Votes:
                0
                Watchers:
                4

                Dates

                • Created:
                  Updated:
                  Resolved: