PermissionError: [Errno 13] Permission denied: '/home/pythainlp-data' #475

Closed
COLDkl opened this issue Sep 4, 2020 · 18 comments
Labels
bug bugs in the library

Comments

@COLDkl

COLDkl commented Sep 4, 2020

Environment: PySpark.

In the driver script I set os.environ['PYTHAINLP_DATA_DIR'] = "./", but when the work is distributed to the executor nodes, they still report a permission error.

How can I set this environment variable on each executor node in a distributed environment?

@p16i
Contributor

p16i commented Sep 4, 2020

Hi,

Thanks a lot for reporting the issue. As far as I know, we haven't tested much on such a distributed system, e.g. Spark.

Would you mind elaborating a bit more about your system, e.g. what is the setup of your cluster?

@bact
Member

bact commented Sep 7, 2020

@COLDkl have you tried changing ./ to something else and seeing if it works?

@COLDkl
Author

COLDkl commented Sep 9, 2020

Hello everyone, we use the company's machines to submit Spark tasks that translate Thai data, so we do not have root privileges.

I changed the pythainlp-data directory to the current directory to avoid program errors caused by permission issues.

@COLDkl
Author

COLDkl commented Sep 9, 2020

To speed up the translation, we changed the code to run distributed. Currently, it is not possible to change the PYTHAINLP_DATA_DIR path on each machine in the Spark task.

We submit with PySpark, so we can only modify PYTHAINLP_DATA_DIR on the driver node via os.environ['PYTHAINLP_DATA_DIR'] = "./" in the script; we cannot explicitly set PYTHAINLP_DATA_DIR on the executor nodes.
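For reference, Spark's documented mechanism for this is the spark.executorEnv.* family of configuration keys, which set environment variables on every executor (the data-dir path below is just an example value, and your_script.py is a placeholder):

```shell
# Set PYTHAINLP_DATA_DIR on every executor at submit time
spark-submit \
  --conf spark.executorEnv.PYTHAINLP_DATA_DIR=./pythainlp-data \
  your_script.py
```

The same can be done programmatically with SparkConf().setExecutorEnv("PYTHAINLP_DATA_DIR", "./pythainlp-data") before the SparkContext is created.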

@COLDkl
Author

COLDkl commented Sep 9, 2020

Spark 2.4.5
python 3.6
pythainlp 2.2.3

@bact
Member

bact commented Sep 9, 2020

If PYTHAINLP_DATA_DIR is not defined, PyThaiNLP will use ~/pythainlp-data,
see https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/tools/path.py

Is that OK for this case? It is the home directory, not root.

Or would you like to have a unique directory for each computer?
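The resolution logic in that file boils down to roughly the following (a simplified sketch of the behavior, not the library's exact code):

```python
import os

PYTHAINLP_DEFAULT_DATA_DIR = "pythainlp-data"

def get_data_dir() -> str:
    # Honor PYTHAINLP_DATA_DIR when set; otherwise fall back to ~/pythainlp-data.
    data_dir = os.getenv(
        "PYTHAINLP_DATA_DIR",
        os.path.join("~", PYTHAINLP_DEFAULT_DATA_DIR),
    )
    path = os.path.expanduser(data_dir)  # expand "~" to the user's home
    os.makedirs(path, exist_ok=True)     # create the directory if missing
    return path
```

So an unprivileged account only hits the PermissionError when its HOME (or the override path) points somewhere it cannot write.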

@COLDkl
Author

COLDkl commented Sep 10, 2020

We use the Spark framework for distributed computing and translation.

The Spark computing framework consists of a Driver node and Executor nodes.

Currently, Spark tasks start with the home directory as the default, which gives:
PermissionError: [Errno 13] Permission denied: '/home/pythainlp-data'

@COLDkl
Author

COLDkl commented Sep 10, 2020

Why not consider migrating the default directory from ~ (home) to . (current directory)?

pythainlp_data_dir = os.getenv("PYTHAINLP_DATA_DIR", os.path.join("~", PYTHAINLP_DEFAULT_DATA_DIR))
pythainlp_data_dir = os.getenv("PYTHAINLP_DATA_DIR", os.path.join(".", PYTHAINLP_DEFAULT_DATA_DIR))

@COLDkl
Author

COLDkl commented Sep 10, 2020

When a Spark program runs, a temporary directory is generated on the Driver node. This directory stores some of the program's data and dependencies, and it is automatically deleted after the program ends.

I don't know if it is possible to set the default path like this:

PYTHAINLP_DEFAULT_DATA_DIR = "pythainlp-data"
pythainlp_data_dir = os.getenv("PYTHAINLP_DATA_DIR", os.path.join(".", PYTHAINLP_DEFAULT_DATA_DIR))
path = os.path.expanduser(pythainlp_data_dir)
os.makedirs(path, exist_ok=True)

@p16i
Contributor

p16i commented Sep 10, 2020

Do we also have to make sure executor nodes have this directory?

@COLDkl
Author

COLDkl commented Sep 10, 2020

For distributed tasks, you must ensure that each running node has the target directory.

@wannaphong
Member

wannaphong commented Sep 10, 2020

@COLDkl Did you encounter the same problem with nltk? (nltk-data)

@COLDkl
Author

COLDkl commented Sep 10, 2020

We have not used nltk in a distributed environment.

@ldong87

ldong87 commented Feb 8, 2021

Any updates on this issue?

@zegzag

zegzag commented Oct 11, 2021

I've encountered the same problem for using pythainlp on Spark

@zegzag

zegzag commented Oct 12, 2021

@COLDkl I've found a solution. Spark distributes the function passed to rdd.map() to each executor, so the problem is solved if you set the environment variable inside the distributed function, before pythainlp is imported:

def your_func(record):
    import os
    os.environ['PYTHAINLP_DATA_DIR'] = './pythainlp-data'
    import pythainlp
    # your logic here
    return record

rdd.map(your_func)

This propagates the environment variable to all executors.
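The mechanism can be illustrated without Spark: like Spark executors, pool workers run the mapped function in separate processes, so a variable set inside the function (before any data-dependent import would run) is visible in every worker. A minimal stand-in using Python's multiprocessing, with pythainlp replaced by a plain environment check since this is only an illustration:

```python
import multiprocessing as mp
import os

def mapped_func(text):
    # Set the data dir inside the worker, before any data-dependent import,
    # mirroring the Spark fix above.
    os.environ["PYTHAINLP_DATA_DIR"] = "./pythainlp-data"
    return os.environ["PYTHAINLP_DATA_DIR"], text

if __name__ == "__main__":
    # Each worker process runs mapped_func and therefore sets its own copy
    # of the variable, just as each Spark executor would.
    with mp.Pool(processes=2) as pool:
        results = pool.map(mapped_func, ["first", "second"])
    for data_dir, text in results:
        print(text, "->", data_dir)
```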

@Rajiv140689

It worked for our PySpark code :)

import os
os.environ['PYTHAINLP_DATA_DIR'] = './pythainlp-data'
import pythainlp
from pythainlp.tokenize import word_tokenize as thai_tokenizer

@Rajiv140689

Rajiv140689 commented Jun 26, 2024

For the NLTK error in PySpark (PermissionError: [Errno 13] Permission denied: '/home/nltk_data', raised as LookupError(resource_not_found)):

Add this to the code:

import os
import nltk
os.environ['NLTK_DATA'] = './nltk_data'
nltk.download("punkt", download_dir='./nltk_data')
nltk.data.path.append('./nltk_data')
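Both fixes follow the same pattern, which could be wrapped in a single worker-side helper (setup_worker_data_dirs is a hypothetical name, not part of either library):

```python
import os

def setup_worker_data_dirs(base="."):
    # Hypothetical helper: point both PyThaiNLP and NLTK at writable,
    # node-local directories before those libraries are imported.
    pythainlp_dir = os.path.join(base, "pythainlp-data")
    nltk_dir = os.path.join(base, "nltk_data")
    os.environ["PYTHAINLP_DATA_DIR"] = pythainlp_dir
    os.environ["NLTK_DATA"] = nltk_dir
    for d in (pythainlp_dir, nltk_dir):
        os.makedirs(d, exist_ok=True)  # ensure each node has the directory
    return pythainlp_dir, nltk_dir
```

Calling it at the top of the mapped function keeps every executor writing to its own node-local directories.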

Development

No branches or pull requests

7 participants