PermissionError: [Errno 13] Permission denied: '/home/pythainlp-data' #475
Hi, thanks a lot for reporting the issue. As far as I know, we haven't tested much on distributed systems such as Spark. Would you mind elaborating a bit more on your system, e.g. the setup of your cluster?
@COLDkl have you tried changing PYTHAINLP_DATA_DIR?
Hello everyone. We use the company's machines to submit the Spark tasks that translate Thai data, so we cannot have root privileges. I changed the pythainlp-data directory to the current directory to avoid program errors due to permission issues.
In order to speed up the translation, we changed the code to run distributed. Currently, it is not possible to change the PYTHAINLP_DATA_DIR path of each machine from within the Spark task. The job is submitted with PySpark, so we can only modify PYTHAINLP_DATA_DIR on the driver node via os.environ['PYTHAINLP_DATA_DIR'] = "./" in the PySpark script, but we cannot explicitly set PYTHAINLP_DATA_DIR on the executor nodes.
Spark 2.4.5
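For reference, Spark (including the 2.4.x line) documents a spark.executorEnv.[EnvironmentVariableName] config that adds an environment variable to the executor processes; a sketch of setting it from the driver script, untested on this setup:

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Sketch: spark.executorEnv.* sets the variable in every executor process,
# unlike os.environ in the driver script, which affects only the driver.
conf = SparkConf().set("spark.executorEnv.PYTHAINLP_DATA_DIR", "./")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```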
Would that be OK for this case, as it is the home directory and not root? Or would you like to have a unique directory for each machine?
We use the Spark framework for distributed computing and translation. A Spark application consists of a driver node and executor nodes. Currently, Spark tasks are started in the home directory by default.
Why not consider migrating the default directory from the home directory to the current working directory?
When the Spark program is running, a temporary directory is generated on the driver node. This directory stores some data and dependencies of the program, and it is automatically deleted after the program ends. I don't know if it is possible to set the default path like this: PYTHAINLP_DEFAULT_DATA_DIR = "pythainlp-data"
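A minimal sketch of what such a relative default could look like — an illustration only, not PythaiNLP's actual code; get_data_dir is a hypothetical helper:

```python
import os

# Hypothetical sketch: fall back to a path relative to the current working
# directory (which Spark can clean up) instead of the user's home directory.
PYTHAINLP_DEFAULT_DATA_DIR = "pythainlp-data"

def get_data_dir():
    path = os.getenv("PYTHAINLP_DATA_DIR", PYTHAINLP_DEFAULT_DATA_DIR)
    os.makedirs(path, exist_ok=True)  # create on first use if missing
    return os.path.abspath(path)
```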
Do we also have to make sure executor nodes have this directory? |
For distributed tasks, you must ensure that each running node has the target directory.
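One way to do that, assuming the working directory is writable on every node, is to create the directory from the code that runs on each executor — a sketch:

```python
import os

# Create the data directory on whichever node this runs on;
# exist_ok=True makes the call safe to repeat on every executor.
os.makedirs("pythainlp-data", exist_ok=True)
```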
@COLDkl Did you encounter the same problem with nltk? (nltk-data)
We have not used nltk in a distributed environment.
Any updates on this issue?
I've encountered the same problem when using pythainlp on Spark.
@COLDkl I've found the solution. Because Spark distributes the function passed to rdd.map() to each executor, the problem is solved if you set the environment variable inside your distributed function, like below.
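(The code block from this comment did not survive the page scrape; below is a minimal reconstruction of the idea as described — the function name process_row, the word_tokenize call, and the rdd variable are illustrative, not from the original comment.)

```python
import os

def process_row(text):
    # Runs on each executor: point PythaiNLP at a writable directory
    # before anything from pythainlp is imported on that node.
    os.environ["PYTHAINLP_DATA_DIR"] = "./"
    from pythainlp import word_tokenize  # import only after the variable is set
    return word_tokenize(text)

# rdd is assumed to be an existing RDD of Thai strings; Spark ships
# process_row, environment assignment included, to every executor.
result = rdd.map(process_row)
```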
This will distribute the os.environ setting to all executors.
It worked for us in our PySpark code :)
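(The snippet was cut off after import os; a likely completion, assuming it follows the same pattern described in the comments above:)

```python
import os

# Set a writable data directory before any pythainlp import, inside
# the function that Spark distributes to the executors.
os.environ["PYTHAINLP_DATA_DIR"] = "./"
```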
For the NLTK error in PySpark (PermissionError: [Errno 13] Permission denied: '/home/nltk_data'), add this in code:
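(This snippet was also truncated after import os; a plausible completion using NLTK's NLTK_DATA environment variable, which NLTK consults when locating its data. The ./nltk_data path is an assumption — any writable directory should do:)

```python
import os

# Point NLTK at a writable data directory before importing nltk;
# NLTK reads the NLTK_DATA environment variable when searching for data.
os.environ["NLTK_DATA"] = "./nltk_data"
```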
Environment: PySpark
In the .py file we set os.environ['PYTHAINLP_DATA_DIR'] = "./"
Distributed operations on the executor nodes still report permission errors.
How can the environment variables of each executor node be set in a distributed environment?