I have been trying to set it up for hours now. Nothing works.

  • The latest version does not seem to have winutils support, and using it causes errors with some important methods. (EDIT: this is likely wrong, and the winutils setup I have should probably be fine; the usual wiring is sketched just after this list.)
  • Older versions need to be built with Maven, but that just gives me a PluginExecutionException.
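
For anyone else stuck on this, the usual winutils wiring looks roughly like the sketch below. C:\hadoop is a placeholder for wherever the folder containing bin\winutils.exe actually lives, and the variables have to be set before the SparkSession is created:

    import os

    # Placeholder path: HADOOP_HOME must point at the folder whose bin\ holds winutils.exe
    os.environ["HADOOP_HOME"] = r"C:\hadoop"
    os.environ["PATH"] += os.pathsep + r"C:\hadoop\bin"

    # Create the session only after the environment is set
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").getOrCreate()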

I need to do this ASAP, preferably within the next 3 hours.

I have nowhere else to ask for help, it seems, especially considering that Reddit suspended an account I set up specifically for asking questions after I edited a relevant post.

Highly doubt that anybody will be able to help me.

EDIT2: the issue has, thankfully, been resolved. I was using Python 3.12 and switched to 3.11.8, which made the problem go away.
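
For anyone hitting the same crash: the fix amounts to making sure PySpark launches a compatible interpreter for both the driver and the worker processes. A minimal sketch, assuming 3.11.8 lives at the placeholder path C:\Python311:

    import os

    # Placeholder path: wherever the 3.11.8 interpreter is actually installed
    py311 = r"C:\Python311\python.exe"
    os.environ["PYSPARK_PYTHON"] = py311         # interpreter used by the worker processes
    os.environ["PYSPARK_DRIVER_PYTHON"] = py311  # interpreter used by the driver

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").getOrCreate()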

  • Tomorrow_Farewell [any, they/them]@hexbear.net (OP) · 4 months ago

    Just in case: if I install the library the first way, the logs for the same piece of code start with this (a sketch of the code itself follows the log):

    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    root
     |-- a: long (nullable = true)
     |-- b: double (nullable = true)
     |-- c: string (nullable = true)
     |-- d: date (nullable = true)
     |-- e: timestamp (nullable = true)
    
    24/07/22 19:04:46 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
    org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
    	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:612)
    	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:594)
    	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
    	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:789)
    	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)
    	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
    	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
    	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
    	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
    	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
    	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
    	at org.apache.spark.scheduler.Task.run(Task.scala:141)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
    	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
    	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    	at java.base/java.lang.Thread.run(Thread.java:842)
    Caused by: java.io.EOFException
    	at java.base/java.io.DataInputStream.readInt(DataInputStream.java:398)
    	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:774)
    	... 26 more
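
    For context, something along these lines reproduces the schema printout above; it matches the DataFrame quick-start example from the PySpark docs, so the values are illustrative rather than my actual data:

    from datetime import datetime, date
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Columns a-e with the long/double/string/date/timestamp types seen in the log
    df = spark.createDataFrame(
        [(1, 2.0, "string1", date(2000, 1, 1), datetime(2000, 1, 1, 12, 0))],
        schema="a long, b double, c string, d date, e timestamp",
    )
    df.printSchema()  # prints the root / |-- lines above
    df.show()         # the action whose task dies with "Python worker exited unexpectedly"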