PermissionError: [Errno 13] Permission denied: '/home/pythainlp-data' #475

Closed
COLDkl opened this issue Sep 4, 2020 · 18 comments
Labels
bug bugs in the library

Comments

@COLDkl

COLDkl commented Sep 4, 2020

Environment: PySpark.

In the driver script I set os.environ['PYTHAINLP_DATA_DIR'] = "./", but when the work is distributed to the executor nodes, they still report a permission error.

How can I set this environment variable on each executor node in a distributed environment?

@p16i
Contributor

p16i commented Sep 4, 2020

Hi,

Thanks a lot for reporting the issue. As far as I know, we haven't tested much on such a distributed system, e.g. Spark.

Would you mind elaborating a bit more about your system, e.g. what is the setup of your cluster?

@bact
Member

bact commented Sep 7, 2020

@COLDkl have you tried changing ./ to something else and seeing if it works?

@COLDkl
Author

COLDkl commented Sep 9, 2020

Hello everyone, we use the company's machines to submit Spark tasks that translate Thai data, so we do not have root privileges.

I changed the pythainlp-data directory to the current directory to avoid program errors caused by permission issues.

@COLDkl
Author

COLDkl commented Sep 9, 2020

To speed up the translation, we changed the code to run distributed. Currently, it is not possible to change the PYTHAINLP_DATA_DIR path on each machine in the Spark task.

We submit with PySpark, so we can only modify PYTHAINLP_DATA_DIR on the driver node via os.environ['PYTHAINLP_DATA_DIR'] = "./" in the script; we cannot explicitly set PYTHAINLP_DATA_DIR on the executor nodes.
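For reference, Spark's documented mechanism for this is the spark.executorEnv.* family of configuration keys, which set environment variables on every executor (the data-dir path below is just an example value, and your_script.py is a placeholder):

```shell
# Set PYTHAINLP_DATA_DIR on every executor at submit time
spark-submit \
  --conf spark.executorEnv.PYTHAINLP_DATA_DIR=./pythainlp-data \
  your_script.py
```

The same can be done programmatically with SparkConf().setExecutorEnv("PYTHAINLP_DATA_DIR", "./pythainlp-data") before the SparkContext is created.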

@COLDkl
Author

COLDkl commented Sep 9, 2020

Spark 2.4.5
python 3.6
pythainlp 2.2.3

@bact
Member

bact commented Sep 9, 2020

If PYTHAINLP_DATA_DIR is not defined, PyThaiNLP will use ~/pythainlp-data,
see https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/tools/path.py

Is that OK for this case? It is the home directory, not root.

Or would you like to have a unique directory for each computer?
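The resolution logic in that file boils down to roughly the following (a simplified sketch of the behavior, not the library's exact code):

```python
import os

PYTHAINLP_DEFAULT_DATA_DIR = "pythainlp-data"

def get_data_dir() -> str:
    # Honor PYTHAINLP_DATA_DIR when set; otherwise fall back to ~/pythainlp-data.
    data_dir = os.getenv(
        "PYTHAINLP_DATA_DIR",
        os.path.join("~", PYTHAINLP_DEFAULT_DATA_DIR),
    )
    path = os.path.expanduser(data_dir)  # expand "~" to the user's home
    os.makedirs(path, exist_ok=True)     # create the directory if missing
    return path
```

So an unprivileged account only hits the PermissionError when its HOME (or the override path) points somewhere it cannot write.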

@COLDkl
Author

COLDkl commented Sep 10, 2020

We use the Spark framework for distributed computing and translation.

The Spark computing framework consists of a Driver node and Executor nodes.

Currently, Spark tasks start with the home directory as the default, which gives:
PermissionError: [Errno 13] Permission denied: '/home/pythainlp-data'

@COLDkl
Author

COLDkl commented Sep 10, 2020

Why not consider migrating the default directory from ~ (home) to . (current directory)?

pythainlp_data_dir = os.getenv("PYTHAINLP_DATA_DIR", os.path.join("~", PYTHAINLP_DEFAULT_DATA_DIR))
pythainlp_data_dir = os.getenv("PYTHAINLP_DATA_DIR", os.path.join(".", PYTHAINLP_DEFAULT_DATA_DIR))

@COLDkl
Author

COLDkl commented Sep 10, 2020

When a Spark program runs, a temporary directory is generated on the Driver node. This directory stores some of the program's data and dependencies, and it is automatically deleted after the program ends.

I don't know if it is possible to set the default path like this:

PYTHAINLP_DEFAULT_DATA_DIR = "pythainlp-data"
pythainlp_data_dir = os.getenv("PYTHAINLP_DATA_DIR", os.path.join(".", PYTHAINLP_DEFAULT_DATA_DIR))
path = os.path.expanduser(pythainlp_data_dir)
os.makedirs(path, exist_ok=True)

@p16i
Contributor

p16i commented Sep 10, 2020

Do we also have to make sure executor nodes have this directory?

@COLDkl
Author

COLDkl commented Sep 10, 2020

For distributed tasks, you must ensure that each running node has the target directory.

@wannaphong
Member

wannaphong commented Sep 10, 2020

@COLDkl Did you encounter the same problem with nltk? (nltk-data)

@COLDkl
Author

COLDkl commented Sep 10, 2020

We have not used nltk in a distributed environment.

@ldong87

ldong87 commented Feb 8, 2021

Any updates on this issue?

@zegzag

zegzag commented Oct 11, 2021

I've encountered the same problem for using pythainlp on Spark

@zegzag

zegzag commented Oct 12, 2021

@COLDkl I've found a solution. Spark distributes the function passed to rdd.map() to each executor, so the problem is solved if you set the environment variable inside the distributed function, before pythainlp is imported:

def your_func(record):
    import os
    os.environ['PYTHAINLP_DATA_DIR'] = './pythainlp-data'
    import pythainlp
    # your logic here
    return record

rdd.map(your_func)

This propagates the environment variable to all executors.
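The mechanism can be illustrated without Spark: like Spark executors, pool workers run the mapped function in separate processes, so a variable set inside the function (before any data-dependent import would run) is visible in every worker. A minimal stand-in using Python's multiprocessing, with pythainlp replaced by a plain environment check since this is only an illustration:

```python
import multiprocessing as mp
import os

def mapped_func(text):
    # Set the data dir inside the worker, before any data-dependent import,
    # mirroring the Spark fix above.
    os.environ["PYTHAINLP_DATA_DIR"] = "./pythainlp-data"
    return os.environ["PYTHAINLP_DATA_DIR"], text

if __name__ == "__main__":
    # Each worker process runs mapped_func and therefore sets its own copy
    # of the variable, just as each Spark executor would.
    with mp.Pool(processes=2) as pool:
        results = pool.map(mapped_func, ["first", "second"])
    for data_dir, text in results:
        print(text, "->", data_dir)
```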

@Rajiv140689

It worked for our PySpark code :)

import os
os.environ['PYTHAINLP_DATA_DIR'] = './pythainlp-data'
import pythainlp
from pythainlp.tokenize import word_tokenize as thai_tokenizer

@Rajiv140689

Rajiv140689 commented Jun 26, 2024

For the NLTK error in PySpark (PermissionError: [Errno 13] Permission denied: '/home/nltk_data', raised as LookupError(resource_not_found)):

Add this to the code:

import os
import nltk
os.environ['NLTK_DATA'] = './nltk_data'
nltk.download("punkt", download_dir='./nltk_data')
nltk.data.path.append('./nltk_data')
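Both fixes follow the same pattern, which could be wrapped in a single worker-side helper (setup_worker_data_dirs is a hypothetical name, not part of either library):

```python
import os

def setup_worker_data_dirs(base="."):
    # Hypothetical helper: point both PyThaiNLP and NLTK at writable,
    # node-local directories before those libraries are imported.
    pythainlp_dir = os.path.join(base, "pythainlp-data")
    nltk_dir = os.path.join(base, "nltk_data")
    os.environ["PYTHAINLP_DATA_DIR"] = pythainlp_dir
    os.environ["NLTK_DATA"] = nltk_dir
    for d in (pythainlp_dir, nltk_dir):
        os.makedirs(d, exist_ok=True)  # ensure each node has the directory
    return pythainlp_dir, nltk_dir
```

Calling it at the top of the mapped function keeps every executor writing to its own node-local directories.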

Development

No branches or pull requests

7 participants